Data Mining II. Prof. Dr. Karsten Borgwardt, Department of Biosystems (D-BSSE), ETH Zürich. Basel, Spring Semester 2016


Our course - The team: Dr. Damian Roqueiro, Dr. Dean Bodenham, Dr. Dominik Grimm, Dr. Xiao He

Our course - Background information
Schedule: Lecture: Wednesdays 14:15-15:50; Tutorial: Wednesdays 16:00-16:45; Room: Manser. Written exam to get the certificate in summer 2016.
Structure: Key topics: dimensionality reduction, text mining, graph mining, association rules, and others. Biweekly homework to apply the algorithms in practice; weekly tutorials, rotating between example tutorials and homework solution tutorials.
Moodle link: https://moodle-app2.let.ethz.ch/course/view.php?id=2072

Our course - Background information
More information: Homework: Avoid plagiarism and be aware of the consequences. Next week, March 2, we will host the exam inspection of the Data Mining 1 exam, from 15:30-16:30 in this room.

1. Dimensionality Reduction

Why Dimensionality Reduction?

Why Dimensionality Reduction?
Definition: Dimensionality reduction is the task of finding an r-dimensional representation of a d-dimensional dataset, with $r \ll d$, such that the d-dimensional information is maximally preserved.
Motivation: uncovering the intrinsic dimensionality of the data, visualization, reduction of redundancy, correlation and noise, computational or memory savings.

Why Dimensionality Reduction? [Figure: 3D scatter plot of an example dataset with five classes (axes x, y, z).]

Why Dimensionality Reduction? [Figure: the same data after PCA, plotted in the transformed X/Y coordinates. Variance explained - PC1: 0.55, PC2: 0.32, PC3: 0.13.]

Reminder: Feature Selection
Feature Selection versus Dimensionality Reduction: Feature selection tries to select a subset of relevant variables, for reduced computational, experimental and storage costs, and for better data understanding and visualization. Univariate feature selection ignores correlations or even redundancies between the features: the same underlying signal can be represented by several features. Dimensionality reduction tries to find these underlying signals and to represent the original data in terms of them. Unlike in feature selection, these underlying signals are not identical to the original features, but typically combinations of them.

1.1 Principal Component Analysis
Based on: Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 23.1; Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.2

Dimensionality Reduction: Principal Component Analysis
Goals: To understand why principal component analysis can be interpreted as a compression-recovery scheme. To understand the link between principal component analysis and the eigenvectors and eigenvalues of the covariance matrix. To understand what a principal component is. To understand how to compute principal component analysis efficiently, both for large n and large d. To learn an objective way to determine the number of principal components. To understand in which sense principal component analysis captures the major axes of variation in the data.

Dimensionality Reduction: Principal Component Analysis
Concept: Let $x_1, \dots, x_n$ be n vectors in $\mathbb{R}^d$. We assume that they are mean-centered. The goal of PCA is to reduce the dimensionality of these vectors using a linear transformation. A matrix $W \in \mathbb{R}^{r \times d}$, where $r < d$, induces a mapping $x \mapsto Wx$, where $Wx$ is the lower-dimensional representation of $x$. A second matrix $U \in \mathbb{R}^{d \times r}$ can be used to approximately recover each original vector $x$ from its compressed form. That is, for a compressed vector $y = Wx$, where $y$ lives in the low-dimensional space $\mathbb{R}^r$, we can construct $\tilde{x} = Uy$, so that $\tilde{x}$ is the recovered version of $x$ in the original space $\mathbb{R}^d$.

Principal Component Analysis
Objective: In PCA, the corresponding objective of finding the compression matrix $W$ and the recovery matrix $U$ is phrased as
$$\arg\min_{W \in \mathbb{R}^{r \times d},\, U \in \mathbb{R}^{d \times r}} \sum_{i=1}^n \|x_i - UWx_i\|_2^2 \qquad (1)$$
That is, PCA tries to minimize the total squared distance between the original vectors and the recovered vectors.

Principal Component Analysis
How to find U and W?
Lemma: Let $(U, W)$ be a solution to Objective (1). Then the columns of $U$ are orthonormal (that is, $U^\top U$ is the $r \times r$ identity matrix) and $W = U^\top$.

Principal Component Analysis
Proof: Choose any $U$ and $W$ and consider the mapping $x \mapsto UWx$. The range of this mapping, $R = \{UWx : x \in \mathbb{R}^d\}$, is an r-dimensional linear subspace of $\mathbb{R}^d$. Let $V \in \mathbb{R}^{d \times r}$ be a matrix whose columns form an orthonormal basis of this subspace (i.e. $V^\top V = I$). Each vector in $R$ can be written as $Vy$, where $y \in \mathbb{R}^r$. For every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^r$ we have
$$\|x - Vy\|_2^2 = \|x\|^2 + \|y\|^2 - 2y^\top (V^\top x). \qquad (2)$$
This difference is minimized for $y = V^\top x$.

Principal Component Analysis
Proof continued: Therefore, for each $x$ we have that
$$VV^\top x = \arg\min_{\bar{x} \in R} \|x - \bar{x}\|_2^2. \qquad (3)$$
This holds for all $x_i$, so we can replace $U, W$ by $V, V^\top$ without increasing the objective:
$$\sum_{i=1}^n \|x_i - UWx_i\|_2^2 \geq \sum_{i=1}^n \|x_i - VV^\top x_i\|_2^2. \qquad (4)$$
As this holds for any $U, W$, the proof is complete.

Principal Component Analysis
How to find U and W? Due to the previous lemma, we can rewrite Objective (1) as
$$\arg\min_{U \in \mathbb{R}^{d \times r}:\, U^\top U = I} \sum_{i=1}^n \|x_i - UU^\top x_i\|_2^2. \qquad (5)$$

Principal Component Analysis
How to find U and W? We can further simplify the objective by the following transformation:
$$\|x - UU^\top x\|^2 = \|x\|^2 - 2x^\top UU^\top x + x^\top UU^\top UU^\top x \qquad (6)$$
$$= \|x\|^2 - x^\top UU^\top x \qquad (7)$$
$$= \|x\|^2 - \mathrm{trace}(U^\top x x^\top U). \qquad (8)$$
Note: The trace of a matrix is the sum of its diagonal entries.

Principal Component Analysis
How to find U and W? This allows us to turn Objective (5) into a trace maximization problem:
$$\arg\max_{U \in \mathbb{R}^{d \times r}:\, U^\top U = I} \mathrm{trace}\left(U^\top \sum_{i=1}^n x_i x_i^\top\, U\right). \qquad (9)$$
If we define $\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$, this is equivalent (up to a constant factor) to
$$\arg\max_{U \in \mathbb{R}^{d \times r}:\, U^\top U = I} \mathrm{trace}\left(U^\top \Sigma U\right). \qquad (10)$$

Principal Component Analysis
Theorem: Let $\Sigma = VDV^\top$ be the spectral decomposition of $\Sigma$. $D$ is a diagonal matrix such that $D_{i,i}$ is the i-th largest eigenvalue of $\Sigma$. The columns of $V$ are the corresponding eigenvectors, and $V^\top V = VV^\top = I$. Then the solution to Objective (10) is the matrix $U$ whose columns are the $r$ first eigenvectors of $\Sigma$.

Principal Component Analysis
Proof: Choose a matrix $U \in \mathbb{R}^{d \times r}$ with orthonormal columns and let $B = V^\top U$. Then $VB = VV^\top U = U$. It follows that
$$U^\top \Sigma U = B^\top V^\top V D V^\top V B = B^\top D B \qquad (11)$$
and therefore
$$\mathrm{trace}(U^\top \Sigma U) = \sum_{j=1}^d D_{j,j} \sum_{i=1}^r B_{j,i}^2. \qquad (12)$$
Note that $B^\top B = U^\top V V^\top U = U^\top U = I$. Hence the columns of $B$ are orthonormal and $\sum_{j=1}^d \sum_{i=1}^r B_{j,i}^2 = r$.

Principal Component Analysis
Proof: Define $\tilde{B} \in \mathbb{R}^{d \times d}$ to be a matrix such that its first $r$ columns are the columns of $B$ and, in addition, $\tilde{B}^\top \tilde{B} = I$. Then for every $j$: $\sum_{i=1}^d \tilde{B}_{j,i}^2 = 1$, which implies that $\sum_{i=1}^r B_{j,i}^2 \leq 1$. It follows that
$$\mathrm{trace}(U^\top \Sigma U) \leq \max_{\beta \in [0,1]^d:\, \|\beta\|_1 \leq r} \sum_{j=1}^d D_{j,j}\,\beta_j = \sum_{j=1}^r D_{j,j}.$$
Therefore, for every matrix $U \in \mathbb{R}^{d \times r}$ with orthonormal columns, the inequality $\mathrm{trace}(U^\top \Sigma U) \leq \sum_{j=1}^r D_{j,j}$ holds. But if we set $U$ to the matrix with the $r$ leading eigenvectors of $\Sigma$ as its columns, we obtain $\mathrm{trace}(U^\top \Sigma U) = \sum_{j=1}^r D_{j,j}$ and thereby the optimal solution.

Principal Component Analysis
Runtime properties: The overall runtime of PCA is $O(d^3 + d^2 n)$: $O(d^3)$ for calculating the eigenvalues of $\Sigma$, and $O(d^2 n)$ for constructing the matrix $\Sigma$.

Principal Component Analysis
Speed-up for $d \gg n$: Often the number of features $d$ greatly exceeds the number of samples $n$. The standard runtime of $O(d^3 + d^2 n)$ is very expensive for large $d$. However, there is a workaround to perform the same calculations in $O(n^3 + n^2 d)$.

Principal Component Analysis
Speed-up for $d \gg n$ - Workaround: Let $X \in \mathbb{R}^{d \times n}$ and $\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top = \frac{1}{n}XX^\top$. Consider $K = X^\top X$, such that $K_{ij} = \langle x_i, x_j \rangle$. Suppose $v$ is an eigenvector of $K$: $Kv = \lambda v$ for some $\lambda \in \mathbb{R}$. Multiplying the equation by $\frac{1}{n}X$ and using the definition of $K$, we obtain
$$\frac{1}{n}XX^\top X v = \frac{1}{n}\lambda X v. \qquad (13)$$
Using the definition of $\Sigma$, we get that $\Sigma(Xv) = \frac{\lambda}{n}(Xv)$. Hence $\frac{Xv}{\|Xv\|}$ is a (unit-norm) eigenvector of $\Sigma$ with eigenvalue $\frac{\lambda}{n}$. Therefore, one can calculate the PCA solution by calculating the eigendecomposition of $K$ rather than of $\Sigma$.

Principal Component Analysis
Pseudocode
Require: A matrix $X \in \mathbb{R}^{d \times n}$ of $n$ examples, number of components $r$
1: for $i = 1, \dots, n$ set $x_i \leftarrow x_i - \frac{1}{n}\sum_{j=1}^n x_j$
2: if $n > d$ then
3:   $\Sigma = \frac{1}{n}XX^\top$
4:   Compute the $r$ leading eigenvectors $v_1, \dots, v_r$ of $\Sigma$.
5: else
6:   $K = X^\top X$
7:   Compute the $r$ leading eigenvectors $v_1, \dots, v_r$ of $K$.
8:   for $i = 1, \dots, r$ set $v_i := \frac{1}{\|Xv_i\|} X v_i$.
9: end if
10: return Compression matrix $W = [v_1, \dots, v_r]^\top$ or compressed points $WX$
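A minimal NumPy sketch of this pseudocode (the function name, the use of numpy.linalg.eigh and the return convention are my own choices, not part of the lecture):

```python
import numpy as np

def pca(X, r):
    """PCA following the pseudocode above.
    X: d x n data matrix (columns are samples), r: number of components.
    Returns the d x r matrix V whose columns are the principal directions."""
    d, n = X.shape
    X = X - X.mean(axis=1, keepdims=True)      # step 1: mean-center the samples
    if n > d:
        S = (X @ X.T) / n                      # covariance matrix Sigma (d x d)
        w, vecs = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
        V = vecs[:, ::-1][:, :r]               # r leading eigenvectors of Sigma
    else:
        K = X.T @ X                            # Gram matrix (n x n)
        w, vecs = np.linalg.eigh(K)
        A = vecs[:, ::-1][:, :r]               # r leading eigenvectors of K
        V = X @ A                              # map back: v_i = X alpha_i
        V /= np.linalg.norm(V, axis=0)         # normalize to unit length
    return V                                   # compressed points: V.T @ X_centered
```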

Principal Component Analysis
How to choose the number of principal components? The principal components should capture $\alpha$ per cent of the total variance in the data ($\alpha$ is often set to 90%). Total variance in the data: $\sum_{i=1}^d \lambda_i$. Total variance captured by the first $r$ eigenvectors: $\sum_{i=1}^r \lambda_i$. The variance explained $\alpha$ is the ratio between the two: $\alpha = \frac{\sum_{i=1}^r \lambda_i}{\sum_{i=1}^d \lambda_i}$.
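A small helper illustrating this criterion (a sketch; the function name is my own choice):

```python
import numpy as np

def n_components_for_variance(eigenvalues, alpha=0.9):
    """Smallest r such that the first r eigenvalues capture at least a
    fraction alpha of the total variance (eigenvalues of the covariance matrix)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # sort descending
    explained = np.cumsum(lam) / lam.sum()                      # cumulative variance explained
    return int(np.searchsorted(explained, alpha) + 1)

# Example: eigenvalues 5.5, 3.2, 1.3 give cumulative ratios 0.55, 0.87, 1.0,
# so alpha = 0.9 requires r = 3 components.
```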

Principal Component Analysis
Theorem: The variance captured by the first $r$ eigenvectors of $\Sigma$ is the sum of its $r$ largest eigenvalues, $\sum_{i=1}^r \lambda_i$.

Principal Component Analysis
Proof: The variance in a (mean-centered) dataset $X$ is defined as
$$\mathrm{var}(X) = \frac{1}{n}\sum_{i=1}^n \|x_i - 0\|^2 = \frac{1}{n}\sum_{i=1}^n \|x_i\|^2 = \frac{1}{n}\sum_{i=1}^n \langle x_i, x_i \rangle \qquad (14\text{-}15)$$
$$= \mathrm{tr}(\Sigma) = \mathrm{tr}(VDV^\top) = \mathrm{tr}(VV^\top D) = \mathrm{tr}(D) \qquad (16)$$
$$= \sum_{i=1}^d D_{i,i} = \sum_{i=1}^d \lambda_i. \qquad (17)$$

Principal Component Analysis
Proof - Continued: The variance in the projected dataset $WX$, with $W = [v_1, \dots, v_r]^\top$, is
$$\mathrm{var}(WX) = \frac{1}{n}\sum_{j=1}^n \|Wx_j\|^2 = \frac{1}{n}\sum_{j=1}^n \langle Wx_j, Wx_j \rangle = \frac{1}{n}\sum_{j=1}^n x_j^\top W^\top W x_j \qquad (18)$$
$$= \frac{1}{n}\sum_{j=1}^n x_j^\top (v_1 v_1^\top + \dots + v_r v_r^\top) x_j = \sum_{i=1}^r v_i^\top \left(\frac{1}{n}\sum_{j=1}^n x_j x_j^\top\right) v_i \qquad (19)$$
$$= v_1^\top \Sigma v_1 + \dots + v_r^\top \Sigma v_r = \sum_{i=1}^r v_i^\top \lambda_i v_i = \sum_{i=1}^r \lambda_i. \qquad (20)$$
Therefore the variance explained can be written as a ratio of sums over eigenvalues of the covariance matrix $\Sigma$.

Principal Component Analysis
Geometric interpretation: An alternative interpretation of PCA is that it finds the major axes of variation in the data. This means that the 1st principal component defines the direction in the data with the greatest variance. The 2nd principal component defines a direction that i) is orthogonal to the 1st principal component and ii) captures the major direction of the remaining variance in the data. In general, the i-th principal component is orthogonal to all previous $i-1$ principal components and represents the direction of maximum variance remaining in the data.

Principal Component Analysis
Proof: Variance maximization along the 1st principal component. We start by trying to find one principal component $v_1$ that maximizes the variance of $X$:
$$\arg\max_{v_1} \mathrm{Var}(v_1^\top X) = \arg\max_{v_1} v_1^\top \Sigma v_1.$$
To avoid picking a $v_1$ with arbitrarily large entries, we enforce $v_1^\top v_1 = 1$. We form the Lagrangian of this problem,
$$v_1^\top \Sigma v_1 - \lambda(v_1^\top v_1 - 1), \qquad (21)$$
take the derivative with respect to $v_1$ and set it to zero:
$$\frac{\partial}{\partial v_1}\left(v_1^\top \Sigma v_1 - \lambda(v_1^\top v_1 - 1)\right) = 0. \qquad (22)$$

Principal Component Analysis
Proof: Variance maximization along the 1st principal component.
$$\frac{\partial}{\partial v_1}\left(v_1^\top \Sigma v_1 - \lambda(v_1^\top v_1 - 1)\right) = 0 \qquad (23)$$
$$2\Sigma v_1 - 2\lambda v_1 = 0 \qquad (24)$$
$$\Sigma v_1 = \lambda v_1. \qquad (25)$$
The solution $v_1$ is an eigenvector of $\Sigma$. As $v_1^\top \Sigma v_1 = v_1^\top \lambda v_1 = \lambda v_1^\top v_1 = \lambda$, the variance is maximized by picking the eigenvector corresponding to the largest eigenvalue $\lambda_1$. Hence the 1st eigenvector $v_1$ is the principal component/direction that maximizes the variance of $v_1^\top X$.

Principal Component Analysis
Proof: Variance maximization along the 2nd principal component. We now show the solution for the 2nd principal component. The second direction of projection should be independent from the first one: $\mathrm{cov}(v_2^\top X, v_1^\top X) = 0$. This can be written as $\mathrm{cov}(v_2^\top X, v_1^\top X) = v_2^\top XX^\top v_1 \propto v_2^\top \Sigma v_1 = v_2^\top \lambda v_1 = \lambda v_2^\top v_1 = 0$. We form the Lagrangian $v_2^\top \Sigma v_2 - \lambda(v_2^\top v_2 - 1) - \rho(v_2^\top v_1)$ and set the derivative with respect to $v_2$ to zero: $2\Sigma v_2 - 2\lambda v_2 - \rho v_1 = 0$. If we multiply this from the left by $v_1^\top$, the first two terms vanish and we obtain $\rho = 0$. As $\rho = 0$, we are left with $\Sigma v_2 = \lambda v_2$, showing that $v_2$ is again an eigenvector of $\Sigma$, and we again pick the eigenvector of the - now second largest - eigenvalue to maximize the variance along the second principal component. The proofs for the other principal components $k > 2$ follow the same scheme.

Principal Component Analysis [Figure: 2D scatter plot of an example dataset (x and y coordinates).]

Principal Component Analysis [Figure: the same data with the directions of PC1 and PC2 overlaid.]

Principal Component Analysis [Figure: the data after projection onto the principal components (transformed X and Y values).]

Principal Component Analysis
Summary: Principal Component Analysis (Pearson, 1901) is the optimal way to perform dimensionality reduction to a lower-dimensional space with subsequent reconstruction, in terms of minimal squared reconstruction error. The principal components are the eigenvectors of the covariance matrix of the data. The eigenvalues capture the variance that is described by the corresponding eigenvector. The number of principal components can be chosen such that they capture $\alpha$ per cent of the total variance. The principal components capture the orthogonal directions of maximum variance in the data. PCA can be computed in $O(\min(n^3 + n^2 d,\ d^3 + d^2 n))$.

1.2 Singular Value Decomposition
Based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.3.2; Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.4; Dr. R. Costin, Lecture Notes of MATHEMATICS 5101: Linear Mathematics in Finite Dimensions, OSU 2013.

Singular Value Decomposition
Goals: Get to know a fundamental matrix decomposition that is widely used throughout data mining. Learn how SVD can be used to obtain a low-rank approximation to a given matrix. Get to know how to compute Principal Component Analysis via SVD. Learn how SVD can speed up fundamental matrix operations in data mining.

Singular Value Decomposition
Definition: A Singular Value Decomposition is a factorization of a given matrix $D \in \mathbb{R}^{n \times d}$ into three matrices:
$$D = L \Delta R^\top, \qquad (26)$$
where $L$ is an $n \times n$ matrix with orthonormal columns (the left singular vectors), $\Delta$ is an $n \times d$ diagonal matrix containing the singular values, and $R$ is a $d \times d$ matrix with orthonormal columns (the right singular vectors). The singular values are non-negative and by convention arranged in decreasing order. Note that $\Delta$ is not (necessarily) a square matrix.

Singular Value Decomposition
[Schematic: the $n \times d$ matrix $D$, written entrywise, equals $L$ ($n \times n$) times the diagonal matrix $\Delta$ ($n \times d$, diagonal entries $\delta_1, \delta_2, \dots$) times $R^\top$ ($d \times d$).]

Singular Value Decomposition
Reduced SVD: One can discard the singular vectors that correspond to zero singular values to obtain the reduced SVD: $D = L_r \Delta_r R_r^\top$, where $L_r$ is the $n \times r$ matrix of the left singular vectors, $R_r$ is the $d \times r$ matrix of the right singular vectors, and $\Delta_r$ is the $r \times r$ diagonal matrix containing the positive singular values.

Singular Value Decomposition
Reduced SVD: The reduced SVD gives rise to the spectral decomposition of $D$:
$$D = L_r \Delta_r R_r^\top = [l_1, l_2, \dots, l_r] \begin{pmatrix} \delta_1 & 0 & \dots & 0 \\ 0 & \delta_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \delta_r \end{pmatrix} \begin{pmatrix} r_1^\top \\ r_2^\top \\ \vdots \\ r_r^\top \end{pmatrix} \qquad (27)$$
$$= \delta_1 l_1 r_1^\top + \delta_2 l_2 r_2^\top + \dots + \delta_r l_r r_r^\top = \sum_{i=1}^r \delta_i l_i r_i^\top. \qquad (29)$$
Hence $D$ is represented as a sum of rank-one matrices of the form $\delta_i l_i r_i^\top$.

Singular Value Decomposition
Eckart-Young Theorem: By selecting the $r$ largest singular values $\delta_1, \delta_2, \dots, \delta_r$ and the corresponding left and right singular vectors, we obtain the best rank-$r$ approximation to the original matrix $D$.
Theorem (Eckart-Young Theorem, 1936): If $D_r$ is the matrix defined as $\sum_{i=1}^r \delta_i l_i r_i^\top$, then $D_r$ is the rank-$r$ matrix that minimizes the objective $\|D - D_r\|_F$.
The Frobenius norm of a matrix $A$, $\|A\|_F$, is defined as $\|A\|_F = \sqrt{\sum_{i}\sum_{j} A_{i,j}^2}$.
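A small NumPy illustration of the reduced SVD and the rank-r approximation (the random matrix is only for demonstration; the claim about the Frobenius error follows from the proof on the next slides):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))

# Reduced SVD: D = L @ diag(delta) @ Rt, singular values in decreasing order.
L, delta, Rt = np.linalg.svd(D, full_matrices=False)

r = 2
D_r = L[:, :r] @ np.diag(delta[:r]) @ Rt[:r, :]        # best rank-r approximation

# The Frobenius error equals the square root of the sum of the discarded delta_i^2.
print(np.linalg.norm(D - D_r, "fro"))
print(np.sqrt(np.sum(delta[r:] ** 2)))
```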

Singular Value Decomposition
Proof of Eckart-Young Theorem (Part 1/2): Assume $D$ is of rank $k$ ($k > r$). Since $L$ and $R$ are orthogonal, $\|D - D_r\|_F = \|L\Delta R^\top - D_r\|_F = \|\Delta - L^\top D_r R\|_F$. Denoting $N = L^\top D_r R$, we can compute the Frobenius norm as
$$\|\Delta - N\|_F^2 = \sum_{i=1}^k |\delta_i - N_{i,i}|^2 + \sum_{i > k} N_{i,i}^2 + \sum_{i \neq j} N_{i,j}^2.$$
This is minimized if all off-diagonal terms of $N$ and all $N_{i,i}$ for $i > k$ are zero. Since $N$ has rank $r$, the minimum of $\sum_i |\delta_i - N_{i,i}|^2$ is obtained for $N_{i,i} = \delta_i$ for $i = 1, \dots, r$, with all other $N_{i,i}$ equal to zero.

Singular Value Decomposition
Proof of Eckart-Young Theorem (Part 2/2): We do not need the full $L$ and $R$ for computing $D_r$, only their first $r$ columns. This can be seen by splitting $L$ and $R$ into blocks, $L = [L_r, L_0]$ and $R = [R_r, R_0]$:
$$D_r = L \begin{pmatrix} \Delta_r & 0 \\ 0 & 0 \end{pmatrix} R^\top = [L_r, L_0] \begin{pmatrix} \Delta_r & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} R_r^\top \\ R_0^\top \end{pmatrix} \qquad (30)$$
$$= [L_r, L_0] \begin{pmatrix} \Delta_r R_r^\top \\ 0 \end{pmatrix} = L_r \Delta_r R_r^\top. \qquad (31)$$

Singular Value Decomposition
Geometric interpretation: Column and row space. Any $n \times d$ matrix $D$ represents a linear transformation $D : \mathbb{R}^d \to \mathbb{R}^n$ from the space of $d$-dimensional vectors to the space of $n$-dimensional vectors, because for any $x \in \mathbb{R}^d$ there exists a $y \in \mathbb{R}^n$ such that $Dx = y$. The column space of $D$ is defined as the set of all vectors $y \in \mathbb{R}^n$ such that $Dx = y$ over all possible $x \in \mathbb{R}^d$. The row space of $D$ is defined as the set of all vectors $x \in \mathbb{R}^d$ such that $D^\top y = x$ over all possible $y \in \mathbb{R}^n$. The row space of $D$ is the column space of $D^\top$.

Singular Value Decomposition
Geometric interpretation: Null spaces. The set of all vectors $x \in \mathbb{R}^d$ such that $Dx = 0$ is called the (right) null space of $D$. The set of all vectors $y \in \mathbb{R}^n$ such that $D^\top y = 0$ is called the left null space of $D$.

Singular Value Decomposition
Geometric interpretation: SVD as a basis generator. SVD gives a basis for each of the four spaces associated with a matrix $D$. If $D$ has rank $r$, then it has only $r$ independent columns and only $r$ independent rows. The $r$ left singular vectors $l_1, l_2, \dots, l_r$ corresponding to the $r$ nonzero singular values of $D$ represent a basis for the column space of $D$. The remaining $n - r$ left singular vectors $l_{r+1}, \dots, l_n$ represent a basis for the left null space of $D$. The $r$ right singular vectors $r_1, r_2, \dots, r_r$ corresponding to the $r$ nonzero singular values represent a basis for the row space of $D$. The remaining $d - r$ right singular vectors $r_{r+1}, \dots, r_d$ represent a basis for the (right) null space of $D$.

Singular Value Decomposition
Geometric interpretation: SVD as a basis generator. Proof for the right null space:
$$Dx = 0 \qquad (32)$$
$$L^\top L \Delta R^\top x = \Delta R^\top x = 0 \qquad (33)$$
$$\tilde{x} := R^\top x \qquad (34)$$
$$\Delta \tilde{x} = (\delta_1 \tilde{x}_1, \delta_2 \tilde{x}_2, \dots, \delta_r \tilde{x}_r, 0, \dots, 0)^\top = 0 \qquad (35)$$
$$x = R \tilde{x} \qquad (36)$$
The weights for the first $r$ right singular vectors $r_1, \dots, r_r$ are therefore zero. Hence $x$ is a linear combination of the $d - r$ right singular vectors $r_{r+1}, \dots, r_d$.

Singular Value Decomposition
Geometric interpretation: SVD as a basis generator. Consider the reduced SVD expression from Equation (27). Right-multiplying both sides by $R_r$ and noting that $R_r^\top R_r = I$, we obtain:
$$DR_r = L_r \Delta_r R_r^\top R_r = L_r \Delta_r \qquad (37\text{-}38)$$
$$D[r_1, r_2, \dots, r_r] = [\delta_1 l_1, \delta_2 l_2, \dots, \delta_r l_r]. \qquad (39\text{-}40)$$

Singular Value Decomposition
Geometric interpretation: SVD as a basis generator. Hence $Dr_i = \delta_i l_i$ for all $i = 1, \dots, r$. That means SVD is a special factorization of the matrix $D$ such that any basis vector $r_i$ of the row space is mapped to the corresponding basis vector $l_i$ of the column space, scaled by $\delta_i$. SVD can be thought of as a mapping from an orthonormal basis $(r_1, r_2, \dots, r_r)$ in $\mathbb{R}^d$, the row space, to an orthonormal basis $(l_1, l_2, \dots, l_r)$ in $\mathbb{R}^n$, the column space, with the corresponding axes scaled according to the singular values $\delta_1, \delta_2, \dots, \delta_r$.

Singular Value Decomposition
Link to PCA: There is a direct link between PCA and SVD, which we will elucidate next. This link allows the computation of PCA via SVD. In the following we assume that $D$ is a $d \times n$ data matrix (columns are samples).

Singular Value Decomposition
Link to PCA: The covariance matrix via SVD. The columns of the matrix $L$, which are the left singular vectors, are the orthonormal eigenvectors of $DD^\top$. Proof: $DD^\top = L\Delta (R^\top R) \Delta^\top L^\top = L \Delta\Delta^\top L^\top$, where $R^\top R = I$. The diagonal entries of $\Delta\Delta^\top$, that is, the squared nonzero singular values, are therefore the nonzero eigenvalues of $DD^\top$.

Singular Value Decomposition
Link to PCA: The covariance matrix via SVD. The covariance matrix of mean-centered data is $\frac{1}{n}DD^\top$, and the left singular vectors of SVD are eigenvectors of $DD^\top$. It follows that the eigenvectors of PCA are the same as the left singular vectors of SVD for mean-centered data. Furthermore, the squared singular values of SVD are $n$ times the eigenvalues of PCA.

Singular Value Decomposition
Link to PCA: The kernel matrix via SVD. The columns of the matrix $R$, which are the right singular vectors, are the orthonormal eigenvectors of $D^\top D$. Proof: $D^\top D = R\Delta^\top (L^\top L)\Delta R^\top = R\Delta^\top\Delta R^\top$, where $L^\top L = I$. The diagonal entries of $\Delta^\top\Delta$, that is, the squared nonzero singular values, are therefore the nonzero eigenvalues of $D^\top D$.

Singular Value Decomposition
Link to PCA: The kernel matrix via SVD. The kernel matrix on $D$ is $D^\top D$, and the right singular vectors of SVD are eigenvectors of $D^\top D$. It follows that each eigenvector $v$ of PCA can be expressed in terms of a right singular vector $r$ of SVD: $v = \frac{Dr}{\|Dr\|}$. Again, the squared singular values of SVD are $n$ times the eigenvalues of PCA ($\lambda_\Sigma = \frac{\lambda_K}{n} = \frac{\delta^2}{n}$).
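A short NumPy check of this PCA-SVD link (random data only for demonstration; D is d x n and mean-centered, as assumed above):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 200))
D = D - D.mean(axis=1, keepdims=True)          # mean-center the columns (samples)
n = D.shape[1]

L, delta, Rt = np.linalg.svd(D, full_matrices=False)

Sigma = (D @ D.T) / n                          # PCA covariance matrix
eigvals = np.linalg.eigvalsh(Sigma)[::-1]      # eigenvalues, descending

# Squared singular values divided by n equal the PCA eigenvalues.
print(np.allclose(eigvals, delta ** 2 / n))
# The columns of L are the PCA eigenvectors (up to sign); each can also be
# recovered from a right singular vector r_i as D @ r_i / ||D @ r_i||.
```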

Singular Value Decomposition
Applications of SVD and PCA beyond dimensionality reduction:
Noise reduction: Removing smaller eigenvectors or singular vectors often improves data quality.
Data imputation: The reduced SVD matrices $L_r$, $\Delta_r$ and $R_r$ can be computed even for incomplete matrices and used to impute missing values (→ link prediction).
Linear equations: Obtain a basis of the solutions of $Dx = 0$.
Matrix inversion: Assume $D$ is square and invertible. Then $D = L\Delta R^\top$ implies $D^{-1} = R\Delta^{-1}L^\top$.
Powers of a matrix: Assume $D$ is square and positive semidefinite. Then $D = L\Delta L^\top$ and $D^k = L\Delta^k L^\top$.
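A tiny sketch of the matrix-inversion bullet with NumPy (the example matrix is my own; for singular matrices one would keep only the nonzero singular values, giving the pseudoinverse):

```python
import numpy as np

D = np.array([[2.0, 1.0], [1.0, 3.0]])          # square, invertible example
L, delta, Rt = np.linalg.svd(D)                  # D = L @ diag(delta) @ Rt
D_inv = Rt.T @ np.diag(1.0 / delta) @ L.T        # D^{-1} = R diag(1/delta) L^T
print(np.allclose(D_inv, np.linalg.inv(D)))      # True
```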

Singular Value Decomposition
Summary: Singular Value Decomposition (SVD) is a decomposition of a given matrix $D$ into three submatrices. SVD can be used to obtain an optimal low-rank approximation of a matrix in terms of the Frobenius norm. SVD generates bases for the four spaces associated with a matrix. $D$ can be thought of as a mapping between the basis vectors of its row space and the scaled basis vectors of its column space. SVD can be used to implement PCA. SVD is used throughout data mining for efficient matrix computations.

1.3 Kernel Principal Component Analysis
Based on: B. Schölkopf, A. Smola, K.R. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem, Max Planck Institute for Biological Cybernetics, Technical Report No. 44, 1996; Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.3

Kernel Principal Component Analysis [Figure: a nonlinear 2D dataset (x and y coordinates).]

Kernel Principal Component Analysis [Figure: the nonlinear 2D dataset, second plot.]

Kernel Principal Component Analysis
Goals: Learn how to perform non-linear dimensionality reduction. Learn how to perform PCA in feature space solely in terms of kernel functions. Learn how to compute the projections of points onto principal components in feature space.

Kernel Principal Component Analysis
Kernel Principal Component Analysis (Schölkopf, Smola, Müller, 1996): PCA is restrictive in the sense that it only allows for linear dimensionality reduction. What about non-linear dimensionality reduction? Idea: Move the computation of principal components to a feature space. This approach exists and is called Kernel PCA (Schölkopf, Smola and Müller, 1996). Define a mapping $\varphi : \mathcal{X} \to \mathcal{H},\ x \mapsto \varphi(x)$.

Kernel Principal Component Analysis
We assume that we are dealing with centered data, $\sum_{i=1}^n \varphi(x_i) = 0$. The covariance matrix in feature space then takes the form $C = \frac{1}{n}\sum_{i=1}^n \varphi(x_i)\varphi(x_i)^\top$. We then have to find eigenvalues $\lambda \geq 0$ and nonzero eigenvectors $v \in \mathcal{H} \setminus \{0\}$ satisfying $\lambda v = Cv$. All solutions $v$ with $\lambda \neq 0$ lie in the span of $\varphi(x_1), \dots, \varphi(x_n)$, due to the fact that $\lambda v = Cv = \frac{1}{n}\sum_{i=1}^n (\varphi(x_i)^\top v)\,\varphi(x_i)$.

Kernel Principal Component Analysis
The first useful consequence is that we can consider the following equations instead: $\lambda \langle \varphi(x_j), v \rangle = \langle \varphi(x_j), Cv \rangle$ for all $j = 1, \dots, n$; and the second is that there exist coefficients $\alpha_i$ ($i = 1, \dots, n$) such that $v = \sum_{i=1}^n \alpha_i \varphi(x_i)$. Combining these two consequences, we get for all $j = 1, \dots, n$:
$$\lambda \sum_{i=1}^n \alpha_i \langle \varphi(x_j), \varphi(x_i) \rangle = \frac{1}{n} \sum_{i=1}^n \alpha_i \sum_{k=1}^n \langle \varphi(x_j), \varphi(x_k) \rangle \langle \varphi(x_k), \varphi(x_i) \rangle.$$

Kernel Principal Component Analysis
Combining these two consequences, we get for all $j = 1, \dots, n$:
$$\lambda \sum_{i=1}^n \alpha_i \langle \varphi(x_j), \varphi(x_i) \rangle = \frac{1}{n} \sum_{i=1}^n \alpha_i \sum_{k=1}^n \langle \varphi(x_j), \varphi(x_k) \rangle \langle \varphi(x_k), \varphi(x_i) \rangle.$$

Kernel Principal Component Analysis
Kernel PCA as an eigenvector problem: In terms of the $n \times n$ Gram matrix $K_{j,i} := \langle \varphi(x_j), \varphi(x_i) \rangle$, this can be rewritten as
$$n\lambda K\alpha = K^2 \alpha, \qquad (41)$$
where $\alpha$ denotes the column vector with entries $\alpha_1, \dots, \alpha_n$. To find solutions of Equation (41), we solve the problem
$$n\lambda \alpha = K\alpha, \qquad (42)$$
which we obtain by multiplying (41) by $K^{-1}$ from the left.

Kernel Principal Component Analysis
Normalizing the coefficients: We require the eigenvectors $v^k$ to have unit length, that is, $\langle v^k, v^k \rangle = 1$ for all $k = 1, \dots, r$. That means that
$$1 = \langle v^k, v^k \rangle = \sum_{i,j=1}^n \alpha_i^k \alpha_j^k \langle \varphi(x_i), \varphi(x_j) \rangle = \sum_{i,j=1}^n \alpha_i^k \alpha_j^k K_{i,j} = \langle \alpha^k, K\alpha^k \rangle = \lambda_k \langle \alpha^k, \alpha^k \rangle. \qquad (43\text{-}46)$$

Kernel Principal Component Analysis
Normalizing the coefficients: As eigenvectors of $K$, the $\alpha^k$ have unit norm. Therefore we have to rescale them by $\frac{1}{\sqrt{\lambda_k}}$ to enforce that their norm is $\frac{1}{\sqrt{\lambda_k}}$, so that $\lambda_k \langle \alpha^k, \alpha^k \rangle = 1$.

Kernel Principal Component Analysis
Projecting points onto principal components: A point $x$ can be projected onto the principal component $v^k$ (for $k = 1, \dots, r$) via
$$\langle v^k, \varphi(x) \rangle = \sum_{i=1}^n \alpha_i^k \langle \varphi(x_i), \varphi(x) \rangle,$$
resulting in an $r$-dimensional representation of $x$ based on Kernel PCA in $\mathcal{H}$.

Kernel Principal Component Analysis
How to center the kernel matrix in $\mathcal{H}$: The kernel matrix $K$ can be centered via
$$\tilde{K} = K - \frac{1}{n}\mathbf{1}_{n \times n}K - \frac{1}{n}K\mathbf{1}_{n \times n} + \frac{1}{n^2}\mathbf{1}_{n \times n}K\mathbf{1}_{n \times n} \qquad (47)$$
$$= \left(I - \frac{1}{n}\mathbf{1}_{n \times n}\right) K \left(I - \frac{1}{n}\mathbf{1}_{n \times n}\right), \qquad (48)$$
using the notation $(\mathbf{1}_{n \times n})_{ij} := 1$ for all $i, j$.

Kernel Principal Component Analysis
Pseudocode: KernelPCA(X, r)
Require: A matrix $X \in \mathbb{R}^{d \times n}$ of $n$ examples, number of components $r$
1: $K_{i,j} = k(x_i, x_j)$ for $i, j = 1, \dots, n$
2: $K := (I - \frac{1}{n}\mathbf{1}_{n \times n})\, K\, (I - \frac{1}{n}\mathbf{1}_{n \times n})$
3: $(\lambda_1, \dots, \lambda_r) = \text{eigenvalues}(K)$
4: $(\alpha_1, \dots, \alpha_r) = \text{eigenvectors}(K)$
5: $\alpha_i := \frac{1}{\sqrt{\lambda_i}} \alpha_i$ for all $i = 1, \dots, r$
6: $A = (\alpha_1, \dots, \alpha_r)$
7: return Set of projected points: $Z = \{z_i \mid z_i = A^\top K(:, i),\ \text{for } i = 1, \dots, n\}$
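A compact NumPy sketch of this pseudocode; the RBF kernel and the parameter gamma are example choices of mine, not prescribed by the slides, and the sketch assumes the r leading eigenvalues of the centered kernel matrix are positive:

```python
import numpy as np

def kernel_pca(X, r, gamma=1.0):
    """Kernel PCA following the pseudocode above, with an RBF kernel
    k(x, x') = exp(-gamma * ||x - x'||^2) as an illustrative kernel choice.
    X: d x n data matrix (columns are samples). Returns the r x n projections Z."""
    d, n = X.shape
    sq = np.sum(X ** 2, axis=0)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))   # n x n kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ K @ H                                   # center the kernel matrix in feature space
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals, eigvecs = eigvals[::-1][:r], eigvecs[:, ::-1][:, :r]
    A = eigvecs / np.sqrt(eigvals)                  # rescale alpha^k so ||alpha^k|| = 1/sqrt(lambda_k)
    return A.T @ K                                  # z_i = A^T K(:, i) for all points
```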

Kernel Principal Component Analysis
Summary: Non-linear dimensionality reduction can be performed in feature space with Kernel PCA. At the heart of Kernel PCA is finding the eigenvectors of the kernel matrix. One can compute the projections of the data onto the principal components explicitly, but not the principal components themselves (unless the feature map $\varphi$ is explicitly known).

1.4 Multidimensional Scaling
Based on: Wolfgang Karl Härdle, Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17; Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapters 14.8 and 14.9

Multidimensional Scaling
Example: a symmetric matrix of pairwise distances between six objects:
0 214 279 610 596 237
214 0 492 533 496 444
279 492 0 520 772 140
610 533 520 0 521 687
596 496 772 521 0 771
237 444 140 687 771 0
Source of table and subsequent figure: Wolfgang Karl Härdle, Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17

Multidimensional Scaling [Figure from Härdle & Simar, Chapter 17, accompanying the distance table above.]

Multidimensional Scaling [Figure]

Multidimensional Scaling
Goals: Find a low-dimensional representation of data for which only distances, similarities or dissimilarities are given. Understand the link between multidimensional scaling and PCA.
Applications: Visualize similarities or dissimilarities between high-dimensional objects, e.g. DNA sequences or protein structures.

Multidimensional Scaling
Setting: We assume that we are given the pairwise distances, similarities or dissimilarities $d_{i,j}$ between all pairs of points in a dataset. Note: We do not need to know the actual coordinates of the points, as in PCA. The goal in multidimensional scaling (MDS) is to (1) find a low-dimensional representation of the data points (2) which maximally preserves the pairwise distances. The solution is only determined up to rotation, reflection and shift.

Multidimensional Scaling
Optimization problem: We assume that the original data objects are $x_1, \dots, x_n \in \mathbb{R}^d$, and that the distance between $x_i$ and $x_j$ is $d_{ij}$. The goal is to find a lower-dimensional representation of these $n$ objects: $z_1, \dots, z_n \in \mathbb{R}^r$. Hence the objective in metric MDS is to minimize the so-called stress function $S_M$:
$$\arg\min_{z_1, \dots, z_n} S_M(z_1, \dots, z_n) = \sum_{i \neq j} (d_{ij} - \|z_i - z_j\|)^2. \qquad (49)$$

Multidimensional Scaling
Classic scaling. Definition: An $n \times n$ distance matrix $D_{ij} = d_{ij}$ is Euclidean if for some points $x_1, \dots, x_n \in \mathbb{R}^d$: $d_{ij}^2 = (x_i - x_j)^\top (x_i - x_j)$.
Theorem: Define an $n \times n$ matrix $A$ with $A_{ij} = -\frac{1}{2}d_{ij}^2$ and $B = HAH$, where $H$ is the centering matrix. $D$ is Euclidean if and only if $B$ is positive semidefinite. If $D$ is the distance matrix of a data matrix $X$, then $B = HXX^\top H$. $B$ is called the inner product matrix.

Multidimensional Scaling
Classical scaling - From distances to inner products: Overview. If the distances are Euclidean, we can convert them to centered inner products. Let $d_{ij}^2 = \|x_i - x_j\|^2$ be the matrix of pairwise squared Euclidean distances. Then we can write
$$d_{ij}^2 = \|x_i - \bar{x}\|^2 + \|x_j - \bar{x}\|^2 - 2\langle x_i - \bar{x}, x_j - \bar{x} \rangle. \qquad (50)$$
Defining $A_{ij} = -\frac{1}{2}d_{ij}^2$, we double-center $A$ to obtain $B$:
$$B = HAH, \qquad (51)$$
where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. $B$ is then the matrix of centered inner products.

Multidimensional Scaling
Classical scaling - From distances to inner products: Step by step. The task of MDS is to find the original Euclidean coordinates from a given distance matrix. The Euclidean distance between the $i$-th and $j$-th points is given by
$$d_{ij}^2 = \sum_{k=1}^d (x_{ik} - x_{jk})^2.$$
The general term $b_{ij}$ of $B$ is given by
$$b_{ij} = \sum_{k=1}^d x_{ik}x_{jk} = x_i^\top x_j.$$

Multidimensional Scaling
Classical scaling - From distances to inner products: Step by step. It is possible to derive $B$ from the known squared distances $d_{ij}$, and then the unknown coordinates from $B$.
$$d_{ij}^2 = x_i^\top x_i + x_j^\top x_j - 2x_i^\top x_j = b_{ii} + b_{jj} - 2b_{ij} \qquad (52)$$
Centering of the coordinate matrix $X$ implies that $\sum_{i=1}^n b_{ij} = 0$.

Multidimensional Scaling
Classical scaling - From distances to inner products: Step by step. Summing over $i$ and $j$, we obtain
$$\frac{1}{n}\sum_{i=1}^n d_{ij}^2 = \frac{1}{n}\sum_{i=1}^n b_{ii} + b_{jj} \qquad (53)$$
$$\frac{1}{n}\sum_{j=1}^n d_{ij}^2 = b_{ii} + \frac{1}{n}\sum_{j=1}^n b_{jj} \qquad (54)$$
$$\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n d_{ij}^2 = \frac{2}{n}\sum_{i=1}^n b_{ii} \qquad (55)$$
$$b_{ij} = -\frac{1}{2}\left(d_{ij}^2 - d_{i\cdot}^2 - d_{\cdot j}^2 + d_{\cdot\cdot}^2\right), \qquad (56)$$
where $d_{i\cdot}^2$, $d_{\cdot j}^2$ and $d_{\cdot\cdot}^2$ denote the row, column and overall means of the squared distances.

Multidimensional Scaling
Classical scaling - From distances to inner products: Step by step. With $a_{ij} = -\frac{1}{2}d_{ij}^2$ and
$$a_{i\cdot} = \frac{1}{n}\sum_{j=1}^n a_{ij} \qquad (57)$$
$$a_{\cdot j} = \frac{1}{n}\sum_{i=1}^n a_{ij} \qquad (58)$$
$$a_{\cdot\cdot} = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n a_{ij}, \qquad (59)$$
we get $b_{ij} = a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}$.

Multidimensional Scaling
Classical scaling - From distances to inner products: Step by step. Define the matrix $A$ as $(a_{ij})$ and observe that
$$B = HAH. \qquad (60)$$
The inner product matrix $B$ can be expressed as
$$B = XX^\top, \qquad (61)$$
where $X$ is the $n \times p$ matrix of coordinates. The rank of $B$ is then
$$\mathrm{rank}(B) = \mathrm{rank}(XX^\top) = \mathrm{rank}(X) = p. \qquad (62)$$

Multidimensional Scaling
Classical scaling - From distances to inner products: Step by step. $B$ is symmetric, positive semidefinite and of rank $p$, and thus has $p$ non-zero eigenvalues. $B$ can now be written as
$$B = V_p \Lambda_p V_p^\top, \qquad (63)$$
where $\Lambda_p = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ is the diagonal matrix of the non-zero eigenvalues of $B$, and $V_p = (v_1, \dots, v_p)$ is the matrix of corresponding eigenvectors. Hence the coordinate matrix $X$ containing the point configuration in $\mathbb{R}^p$ is given by
$$X = V_p \Lambda_p^{\frac{1}{2}}. \qquad (64)$$

Multidimensional Scaling
Lower-dimensional approximation: What if we want the representation of $X$ to be lower-dimensional than $p$ ($r$-dimensional, where $r < p$)? Minimize
$$\arg\min_{Z :\, ZZ^\top = B_r} \|B - B_r\|_F.$$
This is achieved if $B_r$ is the rank-$r$ approximation of $B$ based on SVD. Then
$$Z = V_r \Lambda_r^{\frac{1}{2}}. \qquad (65)$$

Multidimensional Scaling
Classical scaling - Pseudocode. The algorithm for recovering coordinates from the distances $D_{ij} = d_{ij}$ between pairs of points is as follows:
1. Form the matrix $A_{ij} = -\frac{1}{2}d_{ij}^2$.
2. Form the matrix $B = HAH$, where $H$ is the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$.
3. Find the spectral decomposition of $B$, $B = V\Lambda V^\top$, where $\Lambda$ is the diagonal matrix formed from the eigenvalues of $B$, and $V$ is the matrix of corresponding eigenvectors.
4. If the points were originally in a $p$-dimensional space, the first $p$ eigenvalues of $B$ are nonzero and the remaining $n - p$ are zero. Discard these from $\Lambda$ (rename it $\Lambda_p$), and discard the corresponding eigenvectors from $V$ (rename it $V_p$).
5. Find $X = V_p \Lambda_p^{\frac{1}{2}}$; the coordinates of the points are then given by the rows of $X$.
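A minimal NumPy sketch of these five steps (the function name and the clamping of tiny negative eigenvalues are my own choices):

```python
import numpy as np

def classical_mds(Dmat, p=2):
    """Classical scaling following the algorithm above.
    Dmat: n x n matrix of pairwise distances; p: target dimension.
    Returns an n x p coordinate matrix (determined up to rotation, reflection, shift)."""
    n = Dmat.shape[0]
    A = -0.5 * Dmat ** 2                          # step 1: A_ij = -1/2 d_ij^2
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = H @ A @ H                                 # step 2: double-centered inner products
    eigvals, eigvecs = np.linalg.eigh(B)          # step 3: spectral decomposition of B
    idx = np.argsort(eigvals)[::-1][:p]           # step 4: keep the p largest eigenvalues
    lam = np.maximum(eigvals[idx], 0.0)           # clamp numerical negatives
    return eigvecs[:, idx] * np.sqrt(lam)         # step 5: X = V_p Lambda_p^{1/2}

# Applied to the 6 x 6 distance table above, classical_mds(..., 2) recovers a
# two-dimensional configuration up to rotation, reflection and shift.
```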

Multidimensional Scaling
Non-metric multidimensional scaling: Classic scaling, and more generally metric scaling, approximate the actual dissimilarities or similarities between the data points. Non-metric scaling effectively uses a ranking of the dissimilarities to obtain a low-dimensional approximation of the data.

Multidimensional Scaling
Methods for nonlinear dimensionality reduction: The underlying idea of these methods is that the data lie close to an intrinsically low-dimensional nonlinear manifold embedded in a high-dimensional space. The methods flatten the manifold and thereby create a low-dimensional representation of the data and their relative location on the manifold. They tend to be useful for systems with high signal-to-noise ratios.

Multidimensional Scaling
ISOMAP (Tenenbaum et al., 2000): Isometric feature mapping (ISOMAP) constructs a graph to approximate the geodesic distance between points along the manifold: Find all points in the ε-neighborhood of a point. Connect the point to its neighbors by an edge. The distance between all non-adjacent points is approximated by the shortest-path (geodesic) distance along the neighborhood graph. Finally, classical scaling is applied to the graph distances. Hence ISOMAP can be thought of as multidimensional scaling on an ε-neighborhood graph.

Multidimensional Scaling
ISOMAP (Tenenbaum et al., 2000):
1. Find the neighbours of each point.
2. Connect each point to its neighbours by an edge.
3. Compute the shortest path between non-adjacent points.
4. Apply MDS to the graph distances.
A compact code sketch of these steps is given below.
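This sketch of the four ISOMAP steps uses SciPy's shortest-path routine; the function name and the choice of eps are my own, and it assumes the ε-neighborhood graph is connected:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, r=2, eps=1.0):
    """ISOMAP sketch: eps-neighborhood graph -> shortest paths -> classical scaling.
    X: n x d data matrix (rows are samples). Returns n x r embedding coordinates."""
    n = X.shape[0]
    D = squareform(pdist(X))                  # Euclidean distances
    G = np.where(D <= eps, D, np.inf)         # keep only edges inside the eps-neighborhood
    geo = shortest_path(G, method="D")        # approximate geodesic (graph) distances
    # classical scaling on the graph distances (as in the pseudocode above)
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ (-0.5 * geo ** 2) @ H
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:r]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```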

Multidimensional Scaling
Local linear embedding (Roweis and Saul, 2000): Key idea: Local linear embedding (LLE) approximates each point as a linear combination of its neighbors. Then a lower-dimensional representation is constructed that best preserves these local approximations.

Multidimensional Scaling
Local linear embedding (Roweis and Saul, 2000):
1. Find the k-nearest neighbours of each point.
2. Compute weights that reconstruct each point as a linear combination of its neighbours.
3. Compute embedding coordinates for the fixed weights.
[Figure: a point $x_0$ reconstructed from three neighbours, $x_0 = w_1 x_1 + w_2 x_2 + w_3 x_3$.]

Multidimensional Scaling
Local linear embedding (Roweis and Saul, 2000): For each data point $x_i$ in $d$ dimensions, we find its $k$-nearest neighbors $\mathcal{N}(i)$ in Euclidean distance. We approximate each point by an affine mixture of the points in its neighborhood:
$$\min_{w_{il}} \left\| x_i - \sum_{l \in \mathcal{N}(i)} w_{il}\, x_l \right\|^2 \qquad (66)$$
over weights $w_{il}$ satisfying $w_{il} = 0$ for $l \notin \mathcal{N}(i)$ and $\sum_{l=1}^N w_{il} = 1$. $w_{il}$ is the contribution of point $l$ to the reconstruction of point $i$. Note that, to have hope of a unique solution, we must have $k < d$.

Multidimensional Scaling
Local linear embedding (Roweis and Saul, 2000): In the final step, we find points $y_i$ in a space of dimension $r < d$ that minimize
$$\sum_{i=1}^N \left\| y_i - \sum_{l=1}^N w_{il}\, y_l \right\|^2 \qquad (67)$$
with the $w_{il}$ fixed.

Multidimensional Scaling
Local linear embedding (Roweis and Saul, 2000): In step 3, the following expression is minimized:
$$\mathrm{tr}\left( (Y - WY)^\top (Y - WY) \right) = \mathrm{tr}\left( Y^\top (I - W)^\top (I - W)\, Y \right), \qquad (68\text{-}69)$$
where $W$ is $N \times N$ and $Y$ is $N \times r$, for some small $r < d$. The solutions are the trailing eigenvectors of $M = (I - W)^\top (I - W)$. The vector $\mathbf{1}$ is a trivial eigenvector with eigenvalue 0, which is discarded; the solution is given by the next $r$ eigenvectors. As a side effect, $\mathbf{1}^\top Y = 0$, that is, the embedding coordinates are centered.
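A compact LLE sketch of steps 1-3 in NumPy; the regularization of the local Gram matrix (parameter reg) is a standard numerical safeguard of my own choosing, not part of the slides:

```python
import numpy as np

def lle(X, r=2, k=10, reg=1e-3):
    """Local linear embedding sketch. X: n x d data (rows are samples).
    Returns n x r embedding coordinates."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]            # k nearest neighbors (excluding the point itself)
        Z = X[nbrs] - X[i]                           # local differences
        C = Z @ Z.T                                  # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)           # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                     # reconstruction weights summing to one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)                   # ascending eigenvalues
    return vecs[:, 1:r + 1]                          # drop the trivial constant eigenvector
```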

Multidimensional Scaling
Local MDS (Chen and Buja, 2008): Local MDS defines $\mathcal{N}$ to be the symmetric set of nearby pairs of points: an edge exists between two points $(i, i')$ if $i$ is among the $k$ nearest neighbors of $i'$ or vice versa. Then the stress function to be minimized is
$$S_L(z_1, \dots, z_N) = \sum_{(i,i') \in \mathcal{N}} (d_{ii'} - \|z_i - z_{i'}\|)^2 + \sum_{(i,i') \notin \mathcal{N}} w\,(D - \|z_i - z_{i'}\|)^2. \qquad (70)$$
$D$ is a large constant and $w$ is a weight. A large choice of $D$ means that non-neighbors should be far apart. A small choice of $w$ ensures that non-neighbors do not dominate the overall objective function.

Multidimensional Scaling
Local MDS (Chen and Buja, 2008): Setting $w \sim \frac{1}{D}$ and letting $D \to \infty$, we obtain
$$S_L(z_1, \dots, z_N) = \sum_{(i,i') \in \mathcal{N}} (d_{ii'} - \|z_i - z_{i'}\|)^2 - \tau \sum_{(i,i') \notin \mathcal{N}} \|z_i - z_{i'}\| + w \sum_{(i,i') \notin \mathcal{N}} \|z_i - z_{i'}\|^2, \qquad (71\text{-}72)$$
where $\tau = 2wD$. The last term vanishes for $D \to \infty$. The first term tries to preserve local structure, and the second term encourages the representations $z_i, z_{i'}$ of non-neighbor pairs $(i, i')$ to be further apart. Local MDS optimizes $(z_1, \dots, z_N)$ for fixed values of $k$ and $\tau$.

Multidimensional Scaling
Summary: Multidimensional scaling learns a low-dimensional representation of the data given dissimilarity or similarity scores only. The solution of classic scaling is given by the scaled eigenvectors of the centered inner-product matrix, which correspond to the principal components in PCA. ISOMAP, local linear embedding and local MDS try to approximate local distances rather than all pairwise distances, to better capture the underlying manifold via nonlinear dimensionality reduction.

1.5 Self-Organizing Maps
Based on: Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapter 14.4

Self-Organizing Maps
Goals: Understand the key idea behind self-organizing maps. Understand the computation of self-organizing maps. Understand the link to k-means.

Self-Organizing Maps
Key idea (Kohonen, 1982): The idea is to learn a map of the data, a low-dimensional embedding. Self-organizing maps can be thought of as a constrained version of k-means clustering, in which the prototypes are encouraged to lie in a one- or two-dimensional manifold in the feature space. This manifold is also referred to as a constrained topological map, since the original high-dimensional observations can be mapped down onto the two-dimensional coordinate system. The original SOM algorithm was online, but batch versions have since been proposed.

Self-Organizing Maps
Key idea (Kohonen, 1982): We consider a SOM with a two-dimensional rectangular grid of $k$ prototypes $m_j \in \mathbb{R}^d$. The choice of a rectangular grid is arbitrary; other choices, like hexagonal grids, are also possible. Each of the $k$ prototypes is parameterized with respect to an integer coordinate pair $\ell_j \in Q_1 \times Q_2$, where $Q_1 = \{1, 2, \dots, q_1\}$, $Q_2 = \{1, 2, \dots, q_2\}$, and $k = q_1 \cdot q_2$. One can think of the prototypes as buttons in a regular pattern. Intuitively, the SOM tries to bend the plane so that the buttons approximate the data points as well as possible. Once the model is fit, the observations can be mapped onto the two-dimensional grid.

Self-Organizing Maps
[FIGURE 14.15: Simulated data in three classes, near the surface of a half-sphere. Source: Hastie, Tibshirani, Friedman, 2008]

Self-Organizing Maps
[FIGURE 14.16: Self-organizing map applied to the half-sphere data example. The left panel is the initial configuration, the right panel the final one. The 5x5 grid of prototypes is indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle. Source: Hastie, Tibshirani, Friedman, 2008]

Self-Organizing Maps
[FIGURE 14.17: Wiremesh representation of the fitted SOM model in $\mathbb{R}^3$. The lines represent the horizontal and vertical edges of the topological lattice. The double lines indicate that the surface was folded diagonally back on itself in order to model the red points. The cluster members have been jittered to indicate their color, and the purple points are the node centers. Source: Hastie, Tibshirani, Friedman, 2008]

Self-Organizing Maps
Key idea (Kohonen, 1982): The observations $x_i$ are processed one at a time. We find the closest prototype $m_j$ to $x_i$ in Euclidean distance in $\mathbb{R}^d$, and then, for all neighbors $m_k$ of $m_j$, we move $m_k$ toward $x_i$ via the update
$$m_k \leftarrow m_k + \alpha (x_i - m_k).$$
The neighbors of $m_j$ are defined to be all $m_k$ such that the distance between $\ell_j$ and $\ell_k$ is small. $\alpha \in \mathbb{R}$ is the learning rate and determines the scale of the step. The simplest approach uses Euclidean distance, and "small" is determined by a threshold $\epsilon$. The neighborhood always includes the closest prototype itself. Notice that this distance is defined in the space $Q_1 \times Q_2$ of integer topological coordinates of the prototypes, rather than in the feature space $\mathbb{R}^d$. The effect of the update is to move the prototypes closer to the data, but also to maintain a smooth two-dimensional spatial relationship between the prototypes.
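A minimal online SOM sketch of this update rule; the initialization, the fixed values of alpha, eps and n_iter, and the function name are illustrative choices of mine (in practice alpha and eps are typically decreased over the iterations):

```python
import numpy as np

def som(X, q1=5, q2=5, alpha=0.05, eps=1.5, n_iter=10000, seed=0):
    """Online SOM on a q1 x q2 rectangular grid.
    X: n x d data (rows are samples). Returns the prototypes and their grid coordinates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    grid = np.array([(i, j) for i in range(q1) for j in range(q2)], dtype=float)  # coordinates l_j
    M = X[rng.integers(0, n, q1 * q2)].astype(float)     # initialize prototypes m_j from random samples
    for _ in range(n_iter):
        x = X[rng.integers(0, n)]                        # process one observation at a time
        j = np.argmin(np.linalg.norm(M - x, axis=1))     # closest prototype in R^d
        nbrs = np.linalg.norm(grid - grid[j], axis=1) <= eps   # neighbors in grid coordinates
        M[nbrs] += alpha * (x - M[nbrs])                 # move neighboring prototypes toward x
    return M, grid
```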