Data Mining II. Prof. Dr. Karsten Borgwardt, Department of Biosystems (D-BSSE), ETH Zürich. Basel, Spring Semester 2016.


1 Data Mining II. Prof. Dr. Karsten Borgwardt, Department of Biosystems (D-BSSE), ETH Zürich. Basel, Spring Semester 2016.

2 Our course - The team: Dr. Damian Roqueiro, Dr. Dean Bodenham, Dr. Dominik Grimm, Dr. Xiao He.

3 Our course - Background information: schedule, structure, Moodle link.
Lecture: Wednesdays 14:15-15:50. Tutorial: Wednesdays 16:00-16:45. Room: Manser.
Written exam to obtain the certificate in summer 2016.
Key topics: dimensionality reduction, text mining, graph mining, association rules, and others.
Biweekly homework to apply the algorithms in practice; weekly tutorials, rotating between example tutorials and homework-solution tutorials.

4 Our course - Background information: more information.
Homework: avoid plagiarism and be aware of the consequences.
Next week, March 2, we will host the exam inspection of the Data Mining I exam, from 15:30-16:30 in this room.

5 1. Dimensionality Reduction

6 Why Dimensionality Reduction?

7 Why Dimensionality Reduction?
Definition: Dimensionality reduction is the task of finding an $r$-dimensional representation of a $d$-dimensional dataset, with $r \ll d$, such that the $d$-dimensional information is maximally preserved.
Motivation: uncovering the intrinsic dimensionality of the data; visualization; reduction of redundancy, correlation and noise; computational or memory savings.

8 Why Dimensionality Reduction?
[Figure: 3D scatter plot (axes x, y, z) of a dataset with several labeled classes.]

9 Why Dimensionality Reduction?
[Figure: the same data projected onto the principal components (transformed X vs. transformed Y values). Variance explained: PC1 0.55, PC2 0.32.]

10 Reminder: Feature Selection
Feature selection versus dimensionality reduction: Feature selection tries to select a subset of relevant variables, for reduced computational, experimental and storage costs, and for better data understanding and visualization. Univariate feature selection ignores correlations or even redundancies between the features. The same underlying signal can be represented by several features. Dimensionality reduction tries to find these underlying signals and to represent the original data in terms of these signals. Unlike in feature selection, these underlying signals are not identical to the original features, but typically combinations of them.

11 1.1 Principal Component Analysis
Based on: Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 23.1; Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.2.

12 Dimensionality Reduction: Principal Component Analysis - Goals
To understand why principal component analysis can be interpreted as a compression-recovery scheme.
To understand the link between principal component analysis and the eigenvectors and eigenvalues of the covariance matrix.
To understand what a principal component is.
To understand how to compute principal component analysis efficiently, both for large n and for large d.
To learn an objective way to determine the number of principal components.
To understand in which sense principal component analysis captures the major axes of variation in the data.

13 Dimensionality Reduction: Principal Component Analysis - Concept
Let $x_1, \ldots, x_n$ be $n$ vectors in $\mathbb{R}^d$. We assume that they are mean-centered. The goal of PCA is to reduce the dimensionality of these vectors using a linear transformation. A matrix $W \in \mathbb{R}^{r \times d}$, where $r < d$, induces a mapping $x \mapsto Wx$, where $Wx$ is the lower-dimensional representation of $x$. A second matrix $U \in \mathbb{R}^{d \times r}$ can be used to approximately recover each original vector $x$ from its compressed form. That is, for a compressed vector $y = Wx$, where $y$ lies in the low-dimensional space $\mathbb{R}^r$, we can construct $\tilde{x} = Uy$, so that $\tilde{x}$ is the recovered version of $x$ in the original space $\mathbb{R}^d$.

14 Principal Component Analysis - Objective
In PCA, the objective of finding the compression matrix $W$ and the recovery matrix $U$ is phrased as
$$\arg\min_{W \in \mathbb{R}^{r \times d},\, U \in \mathbb{R}^{d \times r}} \sum_{i=1}^{n} \| x_i - U W x_i \|_2^2. \qquad (1)$$
That is, PCA tries to minimize the total squared distance between the original vectors and the recovered vectors.

15 Principal Component Analysis - How to find U and W?
Lemma: Let $(U, W)$ be a solution to Objective (1). Then the columns of $U$ are orthonormal (that is, $U^\top U$ is the identity matrix of $\mathbb{R}^r$) and $W = U^\top$.

16 Principal Component Analysis - Proof
Fix any $U$ and $W$ and consider the mapping $x \mapsto UWx$. The range of this mapping, $R = \{UWx : x \in \mathbb{R}^d\}$, is an $r$-dimensional linear subspace of $\mathbb{R}^d$. Let $V \in \mathbb{R}^{d \times r}$ be a matrix whose columns form an orthonormal basis of this subspace (i.e. $V^\top V = I$). Each vector in $R$ can be written as $Vy$, where $y \in \mathbb{R}^r$. For every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^r$ we have
$$\|x - Vy\|_2^2 = \|x\|^2 + \|y\|^2 - 2\,y^\top (V^\top x). \qquad (2)$$
This difference is minimized for $y = V^\top x$.

17 Principal Component Analysis - Proof continued
Therefore, for each $x$ we have
$$VV^\top x = \arg\min_{\bar{x} \in R} \|x - \bar{x}\|_2^2. \qquad (3)$$
This holds for all $x_i$, and we can replace $U, W$ by $V, V^\top$ without increasing the objective:
$$\sum_{i=1}^{n} \|x_i - UWx_i\|_2^2 \;\geq\; \sum_{i=1}^{n} \|x_i - VV^\top x_i\|_2^2. \qquad (4)$$
As this holds for any $U, W$, the proof is complete.

18 Principal Component Analysis - How to find U and W?
Due to the previous lemma, we can rewrite Objective (1) as
$$\arg\min_{U \in \mathbb{R}^{d \times r}:\, U^\top U = I} \sum_{i=1}^{n} \|x_i - UU^\top x_i\|_2^2. \qquad (5)$$

19 Principal Component Analysis - How to find U and W?
We can further simplify the objective by the following transformation:
$$\|x - UU^\top x\|^2 = \|x\|^2 - 2x^\top UU^\top x + x^\top UU^\top UU^\top x \qquad (6)$$
$$= \|x\|^2 - x^\top UU^\top x \qquad (7)$$
$$= \|x\|^2 - \mathrm{trace}(U^\top x x^\top U). \qquad (8)$$
Note: The trace of a matrix is the sum of its diagonal entries.

20 Principal Component Analysis - How to find U and W?
This allows us to turn Objective (5) into a trace maximization problem:
$$\arg\max_{U \in \mathbb{R}^{d \times r}:\, U^\top U = I} \mathrm{trace}\Big(U^\top \sum_{i=1}^{n} x_i x_i^\top\, U\Big). \qquad (9)$$
If we define $\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top$, this is equivalent (up to a constant factor) to
$$\arg\max_{U \in \mathbb{R}^{d \times r}:\, U^\top U = I} \mathrm{trace}(U^\top \Sigma U). \qquad (10)$$

21 Principal Component Analysis - Theorem
Let $\Sigma = VDV^\top$ be the spectral decomposition of $\Sigma$: $D$ is a diagonal matrix such that $D_{i,i}$ is the $i$-th largest eigenvalue of $\Sigma$, the columns of $V$ are the corresponding eigenvectors, and $V^\top V = VV^\top = I$. Then the solution to Objective (10) is the matrix $U$ whose columns are the first $r$ eigenvectors of $\Sigma$.

22 Principal Component Analysis - Proof
Choose a matrix $U \in \mathbb{R}^{d \times r}$ with orthonormal columns and let $B = V^\top U$. Then $VB = VV^\top U = U$. It follows that
$$U^\top \Sigma U = B^\top V^\top V D V^\top V B = B^\top D B \qquad (11)$$
and therefore
$$\mathrm{trace}(U^\top \Sigma U) = \sum_{j=1}^{d} D_{j,j} \sum_{i=1}^{r} B_{j,i}^2. \qquad (12)$$
Note that $B^\top B = U^\top V V^\top U = U^\top U = I$. Hence the columns of $B$ are orthonormal and $\sum_{j=1}^{d}\sum_{i=1}^{r} B_{j,i}^2 = r$.

23 Principal Component Analysis - Proof continued
Define $\tilde{B} \in \mathbb{R}^{d \times d}$ to be a matrix whose first $r$ columns are the columns of $B$ and which satisfies $\tilde{B}^\top \tilde{B} = I$. Then for every $j$, $\sum_{i=1}^{d} \tilde{B}_{j,i}^2 = 1$, which implies that $\sum_{i=1}^{r} B_{j,i}^2 \leq 1$. It follows that
$$\mathrm{trace}(U^\top \Sigma U) \;\leq\; \max_{\beta \in [0,1]^d:\ \|\beta\|_1 \leq r}\; \sum_{j=1}^{d} D_{j,j}\,\beta_j \;=\; \sum_{j=1}^{r} D_{j,j}.$$
Therefore, for every matrix $U \in \mathbb{R}^{d \times r}$ with orthonormal columns, the inequality $\mathrm{trace}(U^\top \Sigma U) \leq \sum_{j=1}^{r} D_{j,j}$ holds. But if we set $U$ to be the matrix whose columns are the $r$ leading eigenvectors of $\Sigma$, we obtain $\mathrm{trace}(U^\top \Sigma U) = \sum_{j=1}^{r} D_{j,j}$ and thereby the optimal solution.

24 Principal Component Analysis - Runtime properties
The overall runtime of PCA is $O(d^3 + d^2 n)$: $O(d^3)$ for calculating the eigenvalues of $\Sigma$, and $O(d^2 n)$ for constructing the matrix $\Sigma$.

25 Principal Component Analysis - Speed-up for $d \gg n$
Often the number of features $d$ greatly exceeds the number of samples $n$. The standard runtime of $O(d^3 + d^2 n)$ is very expensive for large $d$. However, there is a workaround that performs the same calculations in $O(n^3 + n^2 d)$.

26 Principal Component Analysis - Speed-up for $d \gg n$
Workaround: Let $X \in \mathbb{R}^{d \times n}$ hold the examples as columns, so that $\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = \frac{1}{n} XX^\top$. Consider $K = X^\top X$, such that $K_{ij} = \langle x_i, x_j \rangle$. Suppose $v$ is an eigenvector of $K$: $Kv = \lambda v$ for some $\lambda \in \mathbb{R}$. Multiplying the equation by $\frac{1}{n}X$ and using the definition of $K$, we obtain
$$\frac{1}{n} XX^\top X v = \frac{1}{n}\lambda X v. \qquad (13)$$
Using the definition of $\Sigma$, we get $\Sigma (Xv) = \frac{\lambda}{n}(Xv)$. Hence $\frac{Xv}{\|Xv\|}$ is an eigenvector of $\Sigma$ with eigenvalue $\frac{\lambda}{n}$. Therefore, one can obtain the PCA solution by computing the eigenvectors of $K$ rather than $\Sigma$.

27 Principal Component Analysis - Pseudocode
Require: a matrix $X \in \mathbb{R}^{d \times n}$ of $n$ examples, number of components $r$
1: for $i = 1, \ldots, n$ set $x_i \leftarrow x_i - \frac{1}{n}\sum_{j=1}^{n} x_j$
2: if $n > d$ then
3:   $\Sigma = \frac{1}{n} XX^\top$
4:   compute the $r$ leading eigenvectors $v_1, \ldots, v_r$ of $\Sigma$
5: else
6:   $K = X^\top X$
7:   compute the $r$ leading eigenvectors $v_1, \ldots, v_r$ of $K$
8:   for $i = 1, \ldots, r$ set $v_i := \frac{1}{\|X v_i\|} X v_i$
9: end if
10: return compression matrix $W = [v_1, \ldots, v_r]^\top$ or compressed points $WX$
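
To make the pseudocode concrete, here is a minimal NumPy sketch (not the course's official implementation); the function name pca and the one-example-per-column layout are assumptions of this sketch.

```python
import numpy as np

def pca(X, r):
    """Minimal sketch of the PCA pseudocode above.

    Assumes X is a (d x n) array with one example per column; returns the
    (r x d) compression matrix W.
    """
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)          # step 1: mean-center the examples
    if n > d:
        Sigma = (Xc @ Xc.T) / n                     # d x d covariance matrix
        _, eigvecs = np.linalg.eigh(Sigma)          # eigh returns ascending eigenvalues
        V = eigvecs[:, ::-1][:, :r]                 # r leading eigenvectors of Sigma
    else:
        K = Xc.T @ Xc                               # n x n Gram matrix
        _, eigvecs = np.linalg.eigh(K)
        A = eigvecs[:, ::-1][:, :r]                 # r leading eigenvectors of K
        V = Xc @ A
        V = V / np.linalg.norm(V, axis=0)           # v_i := X v_i / ||X v_i||
    return V.T                                      # compression matrix W (r x d)

# usage sketch: W = pca(X, r); Z = W @ (X - X.mean(axis=1, keepdims=True))
```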

28 Principal Component Analysis - How to choose the number of principal components?
The principal components should capture $\alpha$ per cent of the total variance in the data ($\alpha$ is often set to 90%). Total variance in the data: $\sum_{i=1}^{d}\lambda_i$. Total variance captured by the first $r$ eigenvectors: $\sum_{i=1}^{r}\lambda_i$. The variance explained $\alpha$ is the ratio between the two:
$$\alpha = \frac{\sum_{i=1}^{r}\lambda_i}{\sum_{i=1}^{d}\lambda_i}.$$
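
As a small illustration of this criterion, the following sketch picks the smallest $r$ whose leading eigenvalues explain at least a fraction alpha of the total variance; the function name choose_r is made up for this example.

```python
import numpy as np

def choose_r(eigenvalues, alpha=0.9):
    """Smallest r whose r largest eigenvalues explain at least alpha of the variance.
    Assumes `eigenvalues` is sorted in decreasing order."""
    lam = np.asarray(eigenvalues, dtype=float)
    explained = np.cumsum(lam) / lam.sum()            # variance explained by the first r components
    return int(np.searchsorted(explained, alpha) + 1)

# example: choose_r([5.5, 3.2, 0.8, 0.5], alpha=0.9) returns 3
```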

29 Principal Component Analysis - Theorem
The variance captured by the first $r$ eigenvectors of $\Sigma$ is the sum of its $r$ largest eigenvalues, $\sum_{i=1}^{r}\lambda_i$.

30 Principal Component Analysis - Proof
The variance of a dataset $X$ is defined as
$$\mathrm{var}(X) = \frac{1}{n}\sum_{i=1}^{n}\|x_i - 0\|^2 = \frac{1}{n}\sum_{i=1}^{n}\|x_i\|^2 = \frac{1}{n}\sum_{i=1}^{n}\langle x_i, x_i\rangle \qquad (14)\text{-}(15)$$
$$= \mathrm{tr}(\Sigma) = \mathrm{tr}(VDV^\top) = \mathrm{tr}(VV^\top D) = \mathrm{tr}(D) = \sum_{i=1}^{d} D_{i,i} = \sum_{i=1}^{d}\lambda_i. \qquad (16)\text{-}(17)$$

31 Principal Component Analysis - Proof continued
The variance of the projected dataset $WX$, with $W = [v_1, \ldots, v_r]^\top$, is
$$\mathrm{var}(WX) = \frac{1}{n}\sum_{j=1}^{n}\|W x_j\|^2 = \frac{1}{n}\sum_{j=1}^{n}\langle W x_j, W x_j\rangle = \frac{1}{n}\sum_{j=1}^{n} x_j^\top W^\top W x_j \qquad (18)$$
$$= \frac{1}{n}\sum_{j=1}^{n} x_j^\top (v_1 v_1^\top + \cdots + v_r v_r^\top)\, x_j = \sum_{i=1}^{r} v_i^\top \Big(\frac{1}{n}\sum_{j=1}^{n} x_j x_j^\top\Big) v_i \qquad (19)$$
$$= v_1^\top \Sigma v_1 + \cdots + v_r^\top \Sigma v_r = \sum_{i=1}^{r} v_i^\top \lambda_i v_i = \sum_{i=1}^{r}\lambda_i. \qquad (20)$$
Therefore the variance explained can be written as a ratio of sums over eigenvalues of the covariance matrix $\Sigma$.

32 Principal Component Analysis - Geometric interpretation
An alternative interpretation of PCA is that it finds the major axes of variation in the data. This means that the 1st principal component defines the direction in the data with the greatest variance. The 2nd principal component defines a direction that (i) is orthogonal to the 1st principal component and (ii) captures the major direction of the remaining variance in the data. In general, the $i$-th principal component is orthogonal to all previous $i-1$ principal components and represents the direction of maximum variance remaining in the data.

33 Principal Component Analysis - Proof: variance maximization along the 1st principal component
We start by trying to find one principal component $v_1$ that maximizes the variance of $X$:
$$\arg\max_{v_1} \mathrm{Var}(v_1^\top X) = \arg\max_{v_1} v_1^\top \Sigma v_1.$$
To avoid picking a $v_1$ with arbitrarily large entries, we enforce $v_1^\top v_1 = 1$. We form the Lagrangian
$$v_1^\top \Sigma v_1 - \lambda(v_1^\top v_1 - 1), \qquad (21)$$
take the derivative with respect to $v_1$, and set it to zero:
$$\frac{\partial}{\partial v_1}\Big(v_1^\top \Sigma v_1 - \lambda(v_1^\top v_1 - 1)\Big) = 0. \qquad (22)$$

34 Principal Component Analysis - Proof: variance maximization along the 1st principal component
$$\frac{\partial}{\partial v_1}\Big(v_1^\top \Sigma v_1 - \lambda(v_1^\top v_1 - 1)\Big) = 0 \qquad (23)$$
$$2\Sigma v_1 - 2\lambda v_1 = 0 \qquad (24)$$
$$\Sigma v_1 = \lambda v_1. \qquad (25)$$
The solution $v_1$ is an eigenvector of $\Sigma$. As $v_1^\top \Sigma v_1 = v_1^\top \lambda v_1 = \lambda\, v_1^\top v_1 = \lambda$, the variance is maximized by picking the eigenvector corresponding to the largest eigenvalue, $\lambda_1$. Hence the 1st eigenvector $v_1$ is the principal component/direction that maximizes the variance of $v_1^\top X$.

35 Principal Component Analysis - Proof: variance maximization along the 2nd principal component
We now show the solution for the 2nd principal component. The second direction of projection should be uncorrelated with the first one: $\mathrm{cov}(v_2^\top X, v_1^\top X) = 0$. This can be written as
$$\mathrm{cov}(v_2^\top X, v_1^\top X) = v_2^\top XX^\top v_1 \;\propto\; v_2^\top \Sigma v_1 = v_2^\top \lambda_1 v_1 = \lambda_1\, v_2^\top v_1 = 0.$$
We form the Lagrangian $v_2^\top \Sigma v_2 - \lambda(v_2^\top v_2 - 1) - \rho(v_2^\top v_1)$ and set the derivative with respect to $v_2$ to zero: $2\Sigma v_2 - 2\lambda v_2 - \rho v_1 = 0$. If we multiply this from the left by $v_1^\top$, the first two terms vanish, so $\rho = 0$. As $\rho = 0$, we are left with $\Sigma v_2 = \lambda v_2$, showing that $v_2$ is again an eigenvector of $\Sigma$, and we again pick the eigenvector of the (now second largest) eigenvalue to maximize the variance along the second principal component. The proofs for the other principal components $k > 2$ follow the same scheme.

36 Principal Component Analysis
[Figure: example 2D dataset, axes x and y.]

37 Principal Component Analysis
[Figure: the same dataset with the directions of PC1 and PC2 overlaid.]

38 Principal Component Analysis
[Figure: the dataset after projection onto the principal components (transformed X and Y values).]

39 Principal Component Analysis - Summary
Principal Component Analysis (Pearson, 1901) is the optimal way to perform dimensionality reduction to a lower-dimensional space with subsequent reconstruction, in terms of minimal squared reconstruction error.
The principal components are the eigenvectors of the covariance matrix of the data.
The eigenvalues capture the variance that is described by the corresponding eigenvectors.
The number of principal components can be chosen such that they capture $\alpha$ per cent of the total variance.
The principal components capture the orthogonal directions of maximum variance in the data.
PCA can be computed in $O(\min(n^3 + n^2 d,\ d^3 + d^2 n))$.

40 1.2 Singular Value Decomposition
Based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015; Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.4; Dr. R. Costin, Lecture Notes of MATHEMATICS 5101: Linear Mathematics in Finite Dimensions, OSU 2013.

41 Singular Value Decomposition - Goals
Get to know a fundamental matrix decomposition that is widely used throughout data mining.
Learn how SVD can be used to obtain a low-rank approximation to a given matrix.
Get to know how to compute Principal Component Analysis via SVD.
Learn how SVD can speed up fundamental matrix operations in data mining.

42 Singular Value Decomposition - Definition
A Singular Value Decomposition is a factorization of a given matrix $D \in \mathbb{R}^{n \times d}$ into three matrices:
$$D = L\,\Delta\,R^\top, \qquad (26)$$
where $L$ is an $n \times n$ matrix with orthonormal columns (the left singular vectors), $\Delta$ is an $n \times d$ diagonal matrix containing the singular values, and $R$ is a $d \times d$ matrix with orthonormal columns (the right singular vectors). The singular values are non-negative and by convention arranged in decreasing order. Note that $\Delta$ is not (necessarily) a square matrix.

43 Singular Value Decomposition
[Schematic of $D = L\,\Delta\,R^\top$ with dimensions: $D$ is $n \times d$, $L$ is $n \times n$, $\Delta$ is $n \times d$ with diagonal entries $\delta_1, \delta_2, \ldots$, and $R^\top$ is $d \times d$.]

44 Singular Value Decomposition - Reduced SVD
One can discard the singular vectors that correspond to zero singular values to obtain the reduced SVD: $D = L_r\,\Delta_r\,R_r^\top$, where $L_r$ is the $n \times r$ matrix of the left singular vectors, $R_r$ is the $d \times r$ matrix of the right singular vectors, and $\Delta_r$ is the $r \times r$ diagonal matrix containing the positive singular values.

45 Singular Value Decomposition - Reduced SVD
The reduced SVD gives rise to the spectral decomposition of $D$:
$$D = L_r\,\Delta_r\,R_r^\top = [l_1, l_2, \ldots, l_r]\;\mathrm{diag}(\delta_1, \ldots, \delta_r)\;[r_1, r_2, \ldots, r_r]^\top \qquad (27)\text{-}(28)$$
$$= \delta_1 l_1 r_1^\top + \delta_2 l_2 r_2^\top + \cdots + \delta_r l_r r_r^\top = \sum_{i=1}^{r}\delta_i\, l_i r_i^\top. \qquad (29)$$
Hence $D$ is represented as a sum of rank-one matrices of the form $\delta_i\, l_i r_i^\top$.

46 Singular Value Decomposition - Eckart-Young Theorem
By selecting the $r$ largest singular values $\delta_1, \delta_2, \ldots, \delta_r$ and the corresponding left and right singular vectors, we obtain the best rank-$r$ approximation to the original matrix $D$.
Theorem (Eckart-Young, 1936): If $D_r$ is the matrix defined as $\sum_{i=1}^{r}\delta_i\, l_i r_i^\top$, then $D_r$ is the rank-$r$ matrix that minimizes the objective $\|D - D_r\|_F$.
The Frobenius norm of a matrix $A$ is defined as $\|A\|_F = \sqrt{\sum_{i}\sum_{j} A_{i,j}^2}$.
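
For illustration, a minimal NumPy sketch of this best rank-r approximation via the truncated SVD (the function name best_rank_r is an assumption of this sketch):

```python
import numpy as np

def best_rank_r(D, r):
    """Best rank-r approximation of D in Frobenius norm (Eckart-Young)."""
    L, delta, Rt = np.linalg.svd(D, full_matrices=False)   # D = L @ diag(delta) @ Rt
    return L[:, :r] @ np.diag(delta[:r]) @ Rt[:r, :]        # sum of the r leading rank-one terms

# np.linalg.norm(D - best_rank_r(D, r), 'fro') is then minimal over all rank-r matrices
```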

47 Singular Value Decomposition - Proof of Eckart-Young Theorem (Part 1/2)
Assume $D$ is of rank $k$ ($k > r$). Since $L$ and $R$ are orthogonal,
$$\|D - D_r\|_F = \|L\Delta R^\top - D_r\|_F = \|\Delta - L^\top D_r R\|_F.$$
Denoting $N = L^\top D_r R$, we can compute the Frobenius norm as
$$\|\Delta - N\|_F^2 = \sum_{i,j}|\Delta_{i,j} - N_{i,j}|^2 = \sum_{i=1}^{k}|\delta_i - N_{i,i}|^2 + \sum_{i>k}|N_{i,i}|^2 + \sum_{i\neq j}|N_{i,j}|^2.$$
This is minimized if all off-diagonal entries of $N$ and all $N_{i,i}$ with $i > k$ are zero. Under the constraint that $D_r$ (and hence $N$) has rank $r$, the minimum of $\sum_{i=1}^{k}|\delta_i - N_{i,i}|^2$ is obtained for $N_{i,i} = \delta_i$ for $i = 1, \ldots, r$, with all other $N_{i,i}$ equal to zero.

48 Singular Value Decomposition - Proof of Eckart-Young Theorem (Part 2/2)
We do not need the full $L$ and $R$ for computing $D_r$, only their first $r$ columns. This can be seen by splitting $L$ and $R$ into blocks, $L = [L_r, L_0]$ and $R = [R_r, R_0]$:
$$D_r = [L_r, L_0]\begin{bmatrix}\Delta_r & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}R_r^\top\\ R_0^\top\end{bmatrix} = [L_r, L_0]\begin{bmatrix}\Delta_r R_r^\top\\ 0\end{bmatrix} = L_r\,\Delta_r\,R_r^\top. \qquad (30)\text{-}(31)$$

49 Singular Value Decomposition - Geometric interpretation: column and row space
Any $n \times d$ matrix $D$ represents a linear transformation $D: \mathbb{R}^d \to \mathbb{R}^n$ from the space of $d$-dimensional vectors to the space of $n$-dimensional vectors, because for any $x \in \mathbb{R}^d$ there exists a $y \in \mathbb{R}^n$ such that $Dx = y$.
The column space of $D$ is defined as the set of all vectors $y \in \mathbb{R}^n$ such that $Dx = y$ over all possible $x \in \mathbb{R}^d$.
The row space of $D$ is defined as the set of all vectors $x \in \mathbb{R}^d$ such that $D^\top y = x$ over all possible $y \in \mathbb{R}^n$.
The row space of $D$ is the column space of $D^\top$.

50 Singular Value Decomposition - Geometric interpretation: null spaces
The set of all vectors $x \in \mathbb{R}^d$ such that $Dx = 0$ is called the (right) null space of $D$.
The set of all vectors $y \in \mathbb{R}^n$ such that $D^\top y = 0$ is called the left null space of $D$.

51 Singular Value Decomposition - Geometric interpretation: SVD as a basis generator
SVD gives a basis for each of the four spaces associated with a matrix $D$. If $D$ has rank $r$, then it has only $r$ independent columns and only $r$ independent rows.
The $r$ left singular vectors $l_1, l_2, \ldots, l_r$ corresponding to the $r$ nonzero singular values of $D$ form a basis for the column space of $D$. The remaining $n - r$ left singular vectors $l_{r+1}, \ldots, l_n$ form a basis for the left null space of $D$.
The $r$ right singular vectors $r_1, r_2, \ldots, r_r$ corresponding to the $r$ nonzero singular values form a basis for the row space of $D$. The remaining $d - r$ right singular vectors $r_{r+1}, \ldots, r_d$ form a basis for the (right) null space of $D$.

52 Singular Value Decomposition - Geometric interpretation: SVD as a basis generator
Proof for the right null space:
$$Dx = 0 \qquad (32)$$
$$L^\top L\,\Delta\,R^\top x = \Delta\,R^\top x = 0 \qquad (33)$$
Define $\bar{x} := R^\top x$ (34). Then
$$\Delta\bar{x} = (\delta_1\bar{x}_1, \delta_2\bar{x}_2, \ldots, \delta_r\bar{x}_r, 0, \ldots, 0)^\top = 0, \qquad (35)$$
so $\bar{x}_1 = \cdots = \bar{x}_r = 0$, and $x = R\bar{x}$ (36). The weights of the first $r$ right singular vectors $r_1, \ldots, r_r$ are zero. Hence $x$ is a linear combination of the $d - r$ right singular vectors $r_{r+1}, \ldots, r_d$.

53 Singular Value Decomposition - Geometric interpretation: SVD as a basis generator
Consider the reduced SVD expression from Equation (27). Right-multiplying both sides by $R_r$ and noting that $R_r^\top R_r = I$, we obtain:
$$D R_r = L_r\,\Delta_r\,R_r^\top R_r \qquad (37)$$
$$D R_r = L_r\,\Delta_r \qquad (38)$$
$$D R_r = L_r\,\mathrm{diag}(\delta_1, \ldots, \delta_r) \qquad (39)$$
$$D\,[r_1, r_2, \ldots, r_r] = [\delta_1 l_1, \delta_2 l_2, \ldots, \delta_r l_r]. \qquad (40)$$

54 Singular Value Decomposition - Geometric interpretation: SVD as a basis generator
Hence $D r_i = \delta_i l_i$ for all $i = 1, \ldots, r$. That means SVD is a special factorization of the matrix $D$ such that any basis vector $r_i$ of the row space is mapped to the corresponding basis vector $l_i$ of the column space, scaled by $\delta_i$. SVD can be thought of as a mapping from an orthonormal basis $(r_1, r_2, \ldots, r_r)$ in $\mathbb{R}^d$ (the row space) to an orthonormal basis $(l_1, l_2, \ldots, l_r)$ in $\mathbb{R}^n$ (the column space), with the corresponding axes scaled according to the singular values $\delta_1, \delta_2, \ldots, \delta_r$.

55 Singular Value Decomposition - Link to PCA
There is a direct link between PCA and SVD, which we elucidate next. This link allows the computation of PCA via SVD. We assume that $D$ is a $d \times n$ matrix.

56 Singular Value Decomposition - Link to PCA: the covariance matrix via SVD
The columns of the matrix $L$, i.e. the left singular vectors, are the orthonormal eigenvectors of $DD^\top$.
Proof: $DD^\top = L\Delta(R^\top R)\Delta^\top L^\top = L\,\Delta\Delta^\top L^\top$, where we used $R^\top R = I$. The diagonal entries of $\Delta\Delta^\top$, that is, the squared nonzero singular values, are therefore the nonzero eigenvalues of $DD^\top$.

57 Singular Value Decomposition - Link to PCA: the covariance matrix via SVD
The covariance matrix of mean-centered data is $\frac{1}{n}DD^\top$, and the left singular vectors of the SVD are eigenvectors of $DD^\top$. It follows that the eigenvectors of PCA are the same as the left singular vectors of the SVD for mean-centered data. Furthermore, the squared singular values of the SVD are $n$ times the eigenvalues of PCA.
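
A short sketch of this PCA-via-SVD link in NumPy (assuming, as above, a d x n data matrix D with one example per column; the function name pca_via_svd is hypothetical):

```python
import numpy as np

def pca_via_svd(D, r):
    """First r principal components and PCA eigenvalues of the columns of D via SVD."""
    d, n = D.shape
    Dc = D - D.mean(axis=1, keepdims=True)            # PCA assumes mean-centered data
    L, delta, Rt = np.linalg.svd(Dc, full_matrices=False)
    components = L[:, :r]                             # left singular vectors = PCA eigenvectors
    eigenvalues = delta[:r] ** 2 / n                  # squared singular values / n = PCA eigenvalues
    return components, eigenvalues
```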

58 Singular Value Decomposition - Link to PCA: the kernel matrix via SVD
The columns of the matrix $R$, i.e. the right singular vectors, are the orthonormal eigenvectors of $D^\top D$.
Proof: $D^\top D = R\Delta^\top(L^\top L)\Delta R^\top = R\,\Delta^\top\Delta\, R^\top$, where we used $L^\top L = I$. The diagonal entries of $\Delta^\top\Delta$, that is, the squared nonzero singular values, are therefore the nonzero eigenvalues of $D^\top D$.

59 Singular Value Decomposition - Link to PCA: the kernel matrix via SVD
The kernel matrix on $D$ is $D^\top D$, and the right singular vectors of the SVD are eigenvectors of $D^\top D$. It follows that each eigenvector $v$ of PCA can be expressed in terms of a right singular vector $r$ of the SVD:
$$v = \frac{Dr}{\|Dr\|}.$$
Again, the squared singular values of the SVD are $n$ times the eigenvalues of PCA ($\lambda_\Sigma = \frac{\lambda_K}{n} = \frac{\delta^2}{n}$).

60 Singular Value Decomposition - Applications of SVD and PCA beyond dimensionality reduction
Noise reduction: removing smaller eigenvectors or singular vectors often improves data quality.
Data imputation: the reduced SVD matrices $L_r$, $\Delta_r$ and $R_r$ can be computed even for incomplete matrices and used to impute missing values (cf. link prediction).
Linear equations: obtain a basis of the solutions of $Dx = 0$.
Matrix inversion: assume $D$ is square and invertible. $D = L\Delta R^\top \Rightarrow D^{-1} = R\,\Delta^{-1} L^\top$.
Powers of a matrix: assume $D$ is square, symmetric and positive semidefinite. Then $D = L\Delta L^\top$ and $D^k = L\,\Delta^k L^\top$.
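
As a tiny example of the matrix-inversion use case, a hedged NumPy sketch (it assumes D is square and nonsingular; the function name inverse_via_svd is made up for this illustration):

```python
import numpy as np

def inverse_via_svd(D):
    """Invert a square, nonsingular matrix via its SVD: D^{-1} = R Delta^{-1} L^T."""
    L, delta, Rt = np.linalg.svd(D)
    return Rt.T @ np.diag(1.0 / delta) @ L.T
```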

61 Singular Value Decomposition - Summary
Singular Value Decomposition (SVD) is a decomposition of a given matrix $D$ into three submatrices.
SVD can be used to obtain an optimal low-rank approximation of a matrix in terms of the Frobenius norm.
SVD generates bases for the four spaces associated with a matrix: $D$ can be thought of as a mapping between the basis vectors of its row space and the scaled basis vectors of its column space.
SVD can be used to implement PCA.
SVD is used throughout data mining for efficient matrix computations.

62 1.3 Kernel Principal Component Analysis
Based on: B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem, Max Planck Institute for Biological Cybernetics, Technical Report No. 44, 1996; Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press 2014, Chapter 7.3.

63 Kernel Principal Component Analysis
[Figure: nonlinear 2D dataset, axes x and y.]

64 Kernel Principal Component Analysis
[Figure: the same nonlinear 2D dataset.]

65 Kernel Principal Component Analysis - Goals
Learn how to perform non-linear dimensionality reduction.
Learn how to perform PCA in feature space solely in terms of kernel functions.
Learn how to compute the projections of points onto principal components in feature space.

66 Kernel Principal Component Analysis (Schölkopf, Smola, Müller, 1996)
PCA is restrictive in the sense that it only allows for linear dimensionality reduction. What about non-linear dimensionality reduction? Idea: move the computation of principal components to feature space. This approach exists and is called Kernel PCA (Schölkopf, Smola and Müller, 1996). Define a mapping $\phi: \mathcal{X} \to \mathcal{H},\ x \mapsto \phi(x)$.

67 Kernel Principal Component Analysis
We assume that we are dealing with centered data, $\sum_{i=1}^{n}\phi(x_i) = 0$. The covariance matrix in feature space then takes the form $C = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\phi(x_i)^\top$. We have to find eigenvalues $\lambda \geq 0$ and nonzero eigenvectors $v \in \mathcal{H}\setminus\{0\}$ satisfying $\lambda v = Cv$. All solutions $v$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_n)$, due to the fact that $\lambda v = Cv = \frac{1}{n}\sum_{i=1}^{n}(\phi(x_i)^\top v)\,\phi(x_i)$.

68 Kernel Principal Component Analysis
The first useful consequence is that we can consider the following equations instead:
$$\lambda\,\langle\phi(x_j), v\rangle = \langle\phi(x_j), Cv\rangle \quad \text{for all } j = 1, \ldots, n,$$
and the second is that there exist coefficients $\alpha_1, \ldots, \alpha_n$ such that
$$v = \sum_{i=1}^{n}\alpha_i\,\phi(x_i).$$
Combining these two consequences, we get for all $j = 1, \ldots, n$:
$$\lambda\sum_{i=1}^{n}\alpha_i\langle\phi(x_j), \phi(x_i)\rangle = \frac{1}{n}\sum_{i=1}^{n}\alpha_i\sum_{k=1}^{n}\langle\phi(x_j), \phi(x_k)\rangle\,\langle\phi(x_k), \phi(x_i)\rangle.$$

69 Kernel Principal Component Analysis
Combining these two consequences, we get for all $j = 1, \ldots, n$:
$$\lambda\sum_{i=1}^{n}\alpha_i\langle\phi(x_j), \phi(x_i)\rangle = \frac{1}{n}\sum_{i=1}^{n}\alpha_i\sum_{k=1}^{n}\langle\phi(x_j), \phi(x_k)\rangle\,\langle\phi(x_k), \phi(x_i)\rangle.$$

70 Kernel Principal Component Analysis - Kernel PCA as an eigenvector problem
In terms of the $n \times n$ Gram matrix $K_{j,i} := \langle\phi(x_j), \phi(x_i)\rangle$, this can be rewritten as
$$n\lambda\,K\alpha = K^2\alpha, \qquad (41)$$
where $\alpha$ denotes the column vector with entries $\alpha_1, \ldots, \alpha_n$. To find solutions of Equation (41), we solve the problem
$$n\lambda\,\alpha = K\alpha, \qquad (42)$$
which we obtain by multiplying (41) by $K^{-1}$ from the left.

71 Kernel Principal Component Analysis - Normalizing the coefficients
We require the eigenvectors $v^k$ to have unit length, that is $\langle v^k, v^k\rangle = 1$ for all $k = 1, \ldots, r$. That means that
$$1 = \langle v^k, v^k\rangle = \sum_{i,j=1}^{n}\alpha_i^k\alpha_j^k\langle\phi(x_i), \phi(x_j)\rangle = \sum_{i,j=1}^{n}\alpha_i^k\alpha_j^k K_{i,j} = \langle\alpha^k, K\alpha^k\rangle = \lambda_k\,\langle\alpha^k, \alpha^k\rangle. \qquad (43)\text{-}(46)$$

72 Kernel Principal Component Analysis - Normalizing the coefficients
As eigenvectors of $K$, the $\alpha^k$ are returned with unit norm, whereas Equation (46) requires $\langle\alpha^k, \alpha^k\rangle = \frac{1}{\lambda_k}$. Therefore we have to rescale them by $\frac{1}{\sqrt{\lambda_k}}$ to enforce that their norm is $\frac{1}{\sqrt{\lambda_k}}$.

73 Kernel Principal Component Analysis - Projecting points onto principal components
A point $x$ can be projected onto the principal component $v^k$ (for $k = 1, \ldots, r$) via
$$\langle v^k, \phi(x)\rangle = \sum_{i=1}^{n}\alpha_i^k\,\langle\phi(x_i), \phi(x)\rangle,$$
resulting in an $r$-dimensional representation of $x$ based on Kernel PCA in $\mathcal{H}$.

74 Kernel Principal Component Analysis - How to center the kernel matrix in $\mathcal{H}$?
The kernel matrix $K$ can be centered via
$$\tilde{K} = K - \frac{1}{n}\mathbf{1}_{n\times n}K - \frac{1}{n}K\,\mathbf{1}_{n\times n} + \frac{1}{n^2}\mathbf{1}_{n\times n}K\,\mathbf{1}_{n\times n} \qquad (47)$$
$$= \Big(I - \frac{1}{n}\mathbf{1}_{n\times n}\Big)\,K\,\Big(I - \frac{1}{n}\mathbf{1}_{n\times n}\Big), \qquad (48)$$
using the notation $(\mathbf{1}_{n\times n})_{ij} := 1$ for all $i, j$.

75 Kernel Principal Component Analysis - Pseudocode: KernelPCA(X, r)
Require: a matrix $X \in \mathbb{R}^{d \times n}$ of $n$ examples, number of components $r$
1: $K_{i,j} = k(x_i, x_j)$ for $i, j = 1, \ldots, n$
2: $K := (I - \frac{1}{n}\mathbf{1}_{n\times n})\,K\,(I - \frac{1}{n}\mathbf{1}_{n\times n})$
3: $(\lambda_1, \ldots, \lambda_r) = \text{eigenvalues}(K)$
4: $(\alpha^1, \ldots, \alpha^r) = \text{eigenvectors}(K)$
5: $\alpha^i := \frac{1}{\sqrt{\lambda_i}}\,\alpha^i$ for all $i = 1, \ldots, r$
6: $A = (\alpha^1, \ldots, \alpha^r)$
7: return set of projected points $Z = \{z_i \mid z_i = A^\top K(:, i),\ \text{for } i = 1, \ldots, n\}$
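
A minimal NumPy sketch of this pseudocode; the RBF kernel and its bandwidth gamma are assumptions made purely for illustration, and numerical issues (e.g. near-zero eigenvalues) are ignored:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix for the columns of the d x n matrix X (an assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X       # squared Euclidean distances
    return np.exp(-gamma * d2)

def kernel_pca(X, r, gamma=1.0):
    """KernelPCA(X, r) following the pseudocode above; returns the n x r projected points."""
    n = X.shape[1]
    K = rbf_kernel(X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n                  # I - (1/n) 1_{n x n}
    K = H @ K @ H                                        # center the kernel matrix in feature space
    eigvals, eigvecs = np.linalg.eigh(K)                 # ascending order
    lam = eigvals[::-1][:r]                              # r leading eigenvalues
    A = eigvecs[:, ::-1][:, :r] / np.sqrt(lam)           # alpha^k rescaled by 1/sqrt(lambda_k)
    return K @ A                                         # rows are z_i = A^T K(:, i)
```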

76 Kernel Principal Component Analysis - Summary
Non-linear dimensionality reduction can be performed in feature space via Kernel PCA.
At the heart of Kernel PCA is finding eigenvectors of the kernel matrix.
One can compute the projections of the data onto the principal components explicitly, but not the principal components themselves (unless the function $\phi$ is explicitly known).

77 1.4 Multidimensional Scaling
Based on: Wolfgang Karl Härdle and Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17; Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapters 14.8 and 14.9.

78 Multidimensional Scaling
Source of the table and the subsequent figure: Wolfgang Karl Härdle and Leopold Simar, Applied Multivariate Statistical Analysis, Springer 2015, Chapter 17.

79 Multidimensional Scaling
[Table from Härdle and Simar, Chapter 17, as referenced on the previous slide.]

80 Multidimensional Scaling
[Figure from Härdle and Simar, Chapter 17, as referenced on the previous slide.]

81 Multidimensional Scaling - Goals
Find a low-dimensional representation of data for which only distances, similarities or dissimilarities are given.
Understand the link between multidimensional scaling and PCA.
Applications: visualize similarities or dissimilarities between high-dimensional objects, e.g. DNA sequences or protein structures.

82 Multidimensional Scaling - Setting
We assume that we are given the pairwise distances, similarities or dissimilarities $d_{i,j}$ between all pairs of points in a dataset. Note: we do not need to know the actual coordinates of the points, as in PCA. The goal in multidimensional scaling (MDS) is to (1) find a low-dimensional representation of the data points (2) which maximally preserves the pairwise distances. The solution is only determined up to rotation, reflection and shift.

83 Multidimensional Scaling - Optimization problem
We assume that the original data objects are $x_1, \ldots, x_n \in \mathbb{R}^d$, and the distance between $x_i$ and $x_j$ is $d_{i,j}$. The goal is to find a lower-dimensional representation of these $n$ objects: $z_1, \ldots, z_n \in \mathbb{R}^r$. Hence the objective in metric MDS is to minimize the so-called stress function $S_M$:
$$\arg\min_{z_1, \ldots, z_n}\ S_M(z_1, \ldots, z_n) = \sum_{i \neq j}\big(d_{ij} - \|z_i - z_j\|\big)^2. \qquad (49)$$

84 Multidimensional Scaling - Classic scaling
Definition: An $n \times n$ distance matrix $D_{ij} = d_{ij}$ is Euclidean if for some points $x_1, \ldots, x_n \in \mathbb{R}^d$,
$$d_{ij}^2 = (x_i - x_j)^\top(x_i - x_j).$$
Theorem: Define the $n \times n$ matrix $A$ with $A_{ij} = -\frac{1}{2}d_{ij}^2$ and $B = HAH$, where $H$ is the centering matrix. $D$ is Euclidean if and only if $B$ is positive semidefinite. If $D$ is the distance matrix of a data matrix $X$, then $B = HXX^\top H$. $B$ is called the inner product matrix.

85 Multidimensional Scaling - Classical scaling, from distances to inner products: overview
If distances are Euclidean, we can convert them to centered inner products. Let $d_{ij}^2 = \|x_i - x_j\|^2$ be the pairwise squared Euclidean distances. Then we write
$$d_{ij}^2 = \|x_i - \bar{x}\|^2 + \|x_j - \bar{x}\|^2 - 2\,\langle x_i - \bar{x},\ x_j - \bar{x}\rangle. \qquad (50)$$
Defining $A_{ij} = -\frac{1}{2}d_{ij}^2$, we double-center $A$ to obtain
$$B = HAH, \qquad (51)$$
where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. $B$ is then the matrix of centered inner products.

86 Multidimensional Scaling - Classical scaling, from distances to inner products: step by step
The task of MDS is to find the original Euclidean coordinates from a given distance matrix. The Euclidean distance between the $i$-th and $j$-th points is given by
$$d_{ij}^2 = \sum_{k=1}^{d}(x_{ik} - x_{jk})^2.$$
The general term $b_{ij}$ of $B$ is given by
$$b_{ij} = \sum_{k=1}^{d} x_{ik}x_{jk} = x_i^\top x_j.$$

87 Multidimensional Scaling - Classical scaling, from distances to inner products: step by step
It is possible to derive $B$ from the known squared distances $d_{ij}$, and then from $B$ the unknown coordinates:
$$d_{ij}^2 = x_i^\top x_i + x_j^\top x_j - 2x_i^\top x_j = b_{ii} + b_{jj} - 2b_{ij}. \qquad (52)$$
Centering of the coordinate matrix $X$ implies that $\sum_{i=1}^{n} b_{ij} = 0$.

88 Multidimensional Scaling - Classical scaling, from distances to inner products: step by step
Summing over $i$, over $j$, and over both, we obtain
$$\frac{1}{n}\sum_{i=1}^{n} d_{ij}^2 = \frac{1}{n}\sum_{i=1}^{n} b_{ii} + b_{jj} \qquad (53)$$
$$\frac{1}{n}\sum_{j=1}^{n} d_{ij}^2 = b_{ii} + \frac{1}{n}\sum_{j=1}^{n} b_{jj} \qquad (54)$$
$$\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 = \frac{2}{n}\sum_{i=1}^{n} b_{ii} \qquad (55)$$
$$b_{ij} = -\frac{1}{2}\big(d_{ij}^2 - d_{i\bullet}^2 - d_{\bullet j}^2 + d_{\bullet\bullet}^2\big), \qquad (56)$$
where $d_{i\bullet}^2$, $d_{\bullet j}^2$ and $d_{\bullet\bullet}^2$ denote the row mean, column mean and overall mean of the squared distances.

89 Multidimensional Scaling - Classical scaling, from distances to inner products: step by step
With $a_{ij} = -\frac{1}{2}d_{ij}^2$ and
$$a_{i\bullet} = \frac{1}{n}\sum_{j=1}^{n} a_{ij}, \qquad (57)$$
$$a_{\bullet j} = \frac{1}{n}\sum_{i=1}^{n} a_{ij}, \qquad (58)$$
$$a_{\bullet\bullet} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}, \qquad (59)$$
we get $b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}$.

90 Multidimensional Scaling - Classical scaling, from distances to inner products: step by step
Define the matrix $A$ as $(a_{ij})$ and observe that
$$B = HAH. \qquad (60)$$
The inner product matrix $B$ can be expressed as
$$B = XX^\top, \qquad (61)$$
where $X = (x_1, \ldots, x_n)^\top$ is the $n \times p$ matrix of coordinates (one point per row). The rank of $B$ is then
$$\mathrm{rank}(B) = \mathrm{rank}(XX^\top) = \mathrm{rank}(X) = p. \qquad (62)$$

91 Multidimensional Scaling - Classical scaling, from distances to inner products: step by step
$B$ is symmetric, positive semidefinite and of rank $p$, and thus has $p$ non-zero eigenvalues. $B$ can now be written as
$$B = V_p\,\Lambda_p\,V_p^\top, \qquad (63)$$
where $\Lambda_p = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix of the non-zero eigenvalues of $B$, and $V_p = (v_1, \ldots, v_p)$ is the matrix of corresponding eigenvectors. Hence the coordinate matrix $X$ containing the point configuration in $\mathbb{R}^p$ is given by
$$X = V_p\,\Lambda_p^{1/2}. \qquad (64)$$

92 Multidimensional Scaling - Lower-dimensional approximation
What if we want the representation of $X$ to be lower-dimensional than $p$ ($r$-dimensional, where $r < p$)? Minimize
$$\arg\min_{Z:\ ZZ^\top = B_r}\ \|B - B_r\|_F,$$
which is achieved if $B_r$ is the rank-$r$ approximation of $B$ based on SVD. Then
$$Z = V_r\,\Lambda_r^{1/2}. \qquad (65)$$

93 Multidimensional Scaling - Classical scaling: pseudocode
The algorithm for recovering coordinates from the distances $D_{ij} = d_{ij}$ between pairs of points is as follows:
1. Form the matrix $A_{ij} = -\frac{1}{2}d_{ij}^2$.
2. Form the matrix $B = HAH$, where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix.
3. Find the spectral decomposition of $B$, $B = V\Lambda V^\top$, where $\Lambda$ is the diagonal matrix formed from the eigenvalues of $B$, and $V$ is the matrix of corresponding eigenvectors.
4. If the points were originally in a $p$-dimensional space, the first $p$ eigenvalues of $B$ are nonzero and the remaining $n - p$ are zero. Discard these from $\Lambda$ (rename it $\Lambda_p$) and discard the corresponding eigenvectors from $V$ (rename it $V_p$).
5. Find $X = V_p\,\Lambda_p^{1/2}$; the coordinates of the points are then given by the rows of $X$.
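
The steps translate directly into a short NumPy sketch (the function name classical_mds is hypothetical; small negative eigenvalues arising from noisy or non-Euclidean distances are simply clipped here):

```python
import numpy as np

def classical_mds(D, r):
    """Classical scaling: recover r-dimensional coordinates from an n x n distance matrix D."""
    n = D.shape[0]
    A = -0.5 * D ** 2                              # A_ij = -1/2 d_ij^2
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - (1/n) 1 1^T
    B = H @ A @ H                                  # centered inner-product matrix
    eigvals, eigvecs = np.linalg.eigh(B)           # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    Lam = np.diag(np.sqrt(np.maximum(eigvals[:r], 0.0)))
    return eigvecs[:, :r] @ Lam                    # rows of X = V_p Lambda_p^{1/2}
```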

94 Multidimensional Scaling - Non-metric multidimensional scaling
Classical scaling, and more generally metric scaling, approximates the actual dissimilarities or similarities between the data points. Non-metric scaling effectively uses only a ranking of the dissimilarities to obtain a low-dimensional approximation of the data.

95 Multidimensional Scaling - Methods for nonlinear dimensionality reduction
The underlying idea of these methods is that the data lie close to an intrinsically low-dimensional nonlinear manifold embedded in a high-dimensional space. The methods flatten the manifold and thereby create a low-dimensional representation of the data and their relative locations on the manifold. They tend to be useful for systems with high signal-to-noise ratios.

96 Multidimensional Scaling - ISOMAP (Tenenbaum et al., 2000)
Isometric feature mapping (ISOMAP) constructs a graph to approximate the geodesic distance between points along the manifold: find all points in the $\epsilon$-neighborhood of a point and connect the point to these neighbors by an edge. The distance between non-adjacent points is approximated by the shortest-path (geodesic) distance along the neighborhood graph. Finally, classical scaling is applied to the graph distances. Hence ISOMAP can be thought of as multidimensional scaling on an $\epsilon$-neighborhood graph.

97 Multidimensional Scaling - ISOMAP (Tenenbaum et al., 2000)
1. Find the neighbors of each point.
2. Connect each point to its neighbors by an edge.
3. Compute the shortest-path distance between non-adjacent points.
4. Apply MDS to the graph distances.
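
A rough sketch of these four steps, reusing the classical_mds function from the earlier sketch; the SciPy shortest-path call and the assumption of a connected eps-neighborhood graph are simplifications:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, r, eps):
    """ISOMAP sketch: eps-neighborhood graph, geodesic distances, then classical scaling.
    X is d x n; assumes the neighborhood graph is connected (otherwise distances stay inf)."""
    sq = np.sum(X ** 2, axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0))  # Euclidean distances
    W = np.where(D <= eps, D, np.inf)                 # keep only edges within the eps-neighborhood
    np.fill_diagonal(W, 0.0)
    G = shortest_path(W, method='D', directed=False)  # shortest-path (geodesic) distances
    return classical_mds(G, r)                        # MDS on the graph distances
```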

98 Multidimensional Scaling - Local linear embedding (Roweis and Saul, 2000)
Key idea: local linear embedding (LLE) approximates each point as a linear combination of its neighbors. Then a lower-dimensional representation is constructed that best preserves these local approximations.

99 Multidimensional Scaling - Local linear embedding (Roweis and Saul, 2000)
1. Find the k-nearest neighbors of each point.
2. Compute weights that reconstruct each point as a linear combination of its neighbors.
3. Compute embedding coordinates for the fixed weights.
[Figure: a point $x_0$ reconstructed from its three neighbors as $x_0 = x_1 w_1 + x_2 w_2 + x_3 w_3$.]

100 Multidimensional Scaling - Local linear embedding (Roweis and Saul, 2000)
For each data point $x_i$ in $d$ dimensions, we find its $k$ nearest neighbors $\mathcal{N}(i)$ in Euclidean distance. We approximate each point by an affine mixture of the points in its neighborhood:
$$\min_{w_{il}}\ \Big\|x_i - \sum_{l \in \mathcal{N}(i)} w_{il}\,x_l\Big\|^2 \qquad (66)$$
over weights $w_{il}$ satisfying $w_{il} = 0$ for $l \notin \mathcal{N}(i)$ and $\sum_{l=1}^{N} w_{il} = 1$. Here $w_{il}$ is the contribution of point $l$ to the reconstruction of point $i$. Note that for a hope of a unique solution, we must have $k < d$.

101 Multidimensional Scaling - Local linear embedding (Roweis and Saul, 2000)
In the final step, we find points $y_i$ in a space of dimension $r < d$ that minimize
$$\sum_{i=1}^{N}\Big\|y_i - \sum_{l=1}^{N} w_{il}\,y_l\Big\|^2 \qquad (67)$$
with the $w_{il}$ fixed.

102 Multidimensional Scaling - Local linear embedding (Roweis and Saul, 2000)
In step 3, the following expression is minimized:
$$\mathrm{tr}\big((Y - WY)^\top(Y - WY)\big) = \mathrm{tr}\big(Y^\top(I - W)^\top(I - W)Y\big), \qquad (68)\text{-}(69)$$
where $W$ is $N \times N$ and $Y$ is $N \times r$ for some small $r < d$. The solutions are the trailing eigenvectors of $M = (I - W)^\top(I - W)$. The constant vector $\mathbf{1}$ is a trivial eigenvector with eigenvalue 0, which is discarded; the solution consists of the next $r$ trailing eigenvectors. As a side effect, $\mathbf{1}^\top Y = 0$, that is, the embedding coordinates are centered.
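
A rough NumPy sketch of the three LLE steps (the function name lle and the regularization constant are assumptions; a production implementation would use a k-d tree for the neighbor search):

```python
import numpy as np

def lle(X, r, k):
    """Locally linear embedding sketch for the d x n matrix X; returns the n x r embedding Y."""
    d, n = X.shape
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X       # squared Euclidean distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]                # k nearest neighbors (skip the point itself)
        Z = X[:, nbrs] - X[:, [i]]                       # shift the neighborhood to the origin
        C = Z.T @ Z                                      # local Gram matrix
        C += 1e-3 * np.trace(C) * np.eye(k)              # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                         # affine weights, summing to one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, eigvecs = np.linalg.eigh(M)                       # ascending eigenvalues
    return eigvecs[:, 1:r + 1]                           # drop the trivial constant eigenvector
```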

103 Multidimensional Scaling - Local MDS (Chen and Buja, 2008)
Local MDS defines $\mathcal{N}$ to be the symmetric set of nearby pairs of points: an edge exists between two points $(i, i')$ if $i$ is among the $k$ nearest neighbors of $i'$, or vice versa. Then the stress function to be minimized is
$$S_L(z_1, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}}\big(d_{ii'} - \|z_i - z_{i'}\|\big)^2 + \sum_{(i,i') \notin \mathcal{N}} w\,\big(D - \|z_i - z_{i'}\|\big)^2, \qquad (70)$$
where $D$ is a large constant and $w$ is a weight. A large choice of $D$ means that non-neighbors should be far apart. A small choice of $w$ ensures that non-neighbors do not dominate the overall objective function.

104 Multidimensional Scaling - Local MDS (Chen and Buja, 2008)
Setting $w \sim \frac{1}{D}$ and letting $D \to \infty$, we obtain
$$S_L(z_1, \ldots, z_N) = \sum_{(i,i') \in \mathcal{N}}\big(d_{ii'} - \|z_i - z_{i'}\|\big)^2 - \tau\sum_{(i,i') \notin \mathcal{N}}\|z_i - z_{i'}\| + w\sum_{(i,i') \notin \mathcal{N}}\|z_i - z_{i'}\|^2, \qquad (71)\text{-}(72)$$
where $\tau = 2wD$. The last term vanishes for $D \to \infty$. The first term tries to preserve local structure, and the second term encourages the representations $z_i, z_{i'}$ of non-neighbor pairs $(i, i')$ to be farther apart. Local MDS optimizes $(z_1, \ldots, z_N)$ for fixed values of $k$ and $\tau$.

105 Multidimensional Scaling - Summary
Multidimensional scaling learns a low-dimensional representation of the data given only dissimilarity or similarity scores.
The solution of classical scaling consists of the scaled eigenvectors of the centered inner-product matrix, which coincide with the principal component projections in PCA.
ISOMAP, local linear embedding and local MDS try to approximate local distances rather than all pairwise distances, to better capture the underlying manifold by nonlinear dimensionality reduction.

106 1.5 Self-Organizing Maps
Based on: Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer 2008, Second Edition, Chapter 14.4.

107 Self-Organizing Maps - Goals
Understand the key idea behind self-organizing maps.
Understand the computation of self-organizing maps.
Understand the link to k-means.

108 Self-Organizing Maps - Key idea (Kohonen, 1982)
The idea is to learn a map of the data, i.e. a low-dimensional embedding. Self-organizing maps can be thought of as a constrained version of k-means clustering, in which the prototypes are encouraged to lie on a one- or two-dimensional manifold in the feature space. This manifold is also referred to as a constrained topological map, since the original high-dimensional observations can be mapped down onto the two-dimensional coordinate system. The original SOM algorithm was online, but batch versions have been proposed in the meantime.

109 Self-Organizing Maps - Key idea (Kohonen, 1982)
We consider a SOM with a two-dimensional rectangular grid of $k$ prototypes $m_j \in \mathbb{R}^d$. The choice of a rectangular grid is arbitrary; other choices such as hexagonal grids are also possible. Each of the $k$ prototypes is parameterized by an integer coordinate pair $\ell_j \in Q_1 \times Q_2$, where $Q_1 = \{1, 2, \ldots, q_1\}$, $Q_2 = \{1, 2, \ldots, q_2\}$, and $k = q_1 q_2$. One can think of the prototypes as buttons in a regular pattern. Intuitively, the SOM tries to bend the plane so that the buttons approximate the data points as well as possible. Once the model is fit, the observations can be mapped onto the two-dimensional grid.

110 Self-Organizing Maps
[Figure: simulated data in three classes, near the surface of a half-sphere. Source: Hastie, Tibshirani, Friedman, 2008.]

111 Self-Organizing Maps
[Figure: self-organizing map applied to the half-sphere data example. The left panel is the initial configuration, the right panel the final one. The $5 \times 5$ grid of prototypes is indicated by circles, and the points that project to each prototype are plotted randomly within the corresponding circle. Source: Hastie, Tibshirani, Friedman, 2008.]

112 Self-Organizing Maps
[Figure: wire-mesh representation of the fitted SOM model in $\mathbb{R}^3$. The lines represent the horizontal and vertical edges of the topological lattice. The double lines indicate that the surface was folded diagonally back on itself in order to model the red points. The cluster members have been jittered to indicate their color, and the purple points are the node centers. Source: Hastie, Tibshirani, Friedman, 2008.]

113 Self-Organizing Maps - Key idea (Kohonen, 1982)
The observations $x_i$ are processed one at a time. We find the prototype $m_j$ closest to $x_i$ in Euclidean distance in $\mathbb{R}^d$, and then for all neighbors $m_k$ of $m_j$ we move $m_k$ toward $x_i$ via the update
$$m_k \leftarrow m_k + \alpha\,(x_i - m_k).$$
The neighbors of $m_j$ are defined to be all $m_k$ such that the distance between $\ell_j$ and $\ell_k$ is small; $\alpha \in \mathbb{R}$ is the learning rate and determines the size of the step. The simplest approach uses Euclidean distance, and "small" is determined by a threshold $\epsilon$. The neighborhood always includes the closest prototype itself. Notice that this distance is defined in the space $Q_1 \times Q_2$ of integer topological coordinates of the prototypes, rather than in the feature space $\mathbb{R}^d$. The effect of the update is to move the prototypes closer to the data, but also to maintain a smooth two-dimensional spatial relationship between the prototypes.
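
A minimal sketch of this online update (the grid size, learning rate, threshold and number of passes are arbitrary illustration values, and the learning rate and neighborhood are kept fixed rather than decreased over time, as a practical SOM would do):

```python
import numpy as np

def som(X, q1=5, q2=5, alpha=0.05, eps=1.0, n_epochs=20, seed=0):
    """Online SOM sketch: X is d x n; returns the k = q1*q2 prototypes as a (k x d) array."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    grid = np.array([(a, b) for a in range(q1) for b in range(q2)], dtype=float)  # l_j coordinates
    M = rng.standard_normal((q1 * q2, d))                     # prototypes m_j, random initialization
    for _ in range(n_epochs):
        for i in rng.permutation(n):                          # process observations one at a time
            x = X[:, i]
            j = np.argmin(np.sum((M - x) ** 2, axis=1))       # closest prototype m_j in R^d
            near = np.sum((grid - grid[j]) ** 2, axis=1) <= eps ** 2   # small distance between l_j and l_k
            M[near] += alpha * (x - M[near])                  # m_k <- m_k + alpha (x_i - m_k)
    return M
```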

ISSN: (Online) Volume 3, Issue 5, May 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 5, May 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at:

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

Apprentissage non supervisée

Apprentissage non supervisée Apprentissage non supervisée Cours 3 Higher dimensions Jairo Cugliari Master ECD 2015-2016 From low to high dimension Density estimation Histograms and KDE Calibration can be done automacally But! Let

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA Tobias Scheffer Overview Principal Component Analysis (PCA) Kernel-PCA Fisher Linear Discriminant Analysis t-sne 2 PCA: Motivation

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

Linear Dimensionality Reduction

Linear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

PARAMETERIZATION OF NON-LINEAR MANIFOLDS

PARAMETERIZATION OF NON-LINEAR MANIFOLDS PARAMETERIZATION OF NON-LINEAR MANIFOLDS C. W. GEAR DEPARTMENT OF CHEMICAL AND BIOLOGICAL ENGINEERING PRINCETON UNIVERSITY, PRINCETON, NJ E-MAIL:WGEAR@PRINCETON.EDU Abstract. In this report we consider

More information

DIMENSION REDUCTION. min. j=1

DIMENSION REDUCTION. min. j=1 DIMENSION REDUCTION 1 Principal Component Analysis (PCA) Principal components analysis (PCA) finds low dimensional approximations to the data by projecting the data onto linear subspaces. Let X R d and

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Unsupervised Learning: Dimensionality Reduction

Unsupervised Learning: Dimensionality Reduction Unsupervised Learning: Dimensionality Reduction CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 3 Outline In this lecture, we set about to solve the problem posed in the previous lecture Given a dataset,

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

Kernel methods for comparing distributions, measuring dependence

Kernel methods for comparing distributions, measuring dependence Kernel methods for comparing distributions, measuring dependence Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Principal component analysis Given a set of M centered observations

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information

Manifold Learning and it s application

Manifold Learning and it s application Manifold Learning and it s application Nandan Dubey SE367 Outline 1 Introduction Manifold Examples image as vector Importance Dimension Reduction Techniques 2 Linear Methods PCA Example MDS Perception

More information

Preprocessing & dimensionality reduction

Preprocessing & dimensionality reduction Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016

More information

Principal Components Analysis. Sargur Srihari University at Buffalo

Principal Components Analysis. Sargur Srihari University at Buffalo Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms : Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA 2 Department of Computer

More information

Lecture 7: Positive Semidefinite Matrices

Lecture 7: Positive Semidefinite Matrices Lecture 7: Positive Semidefinite Matrices Rajat Mittal IIT Kanpur The main aim of this lecture note is to prepare your background for semidefinite programming. We have already seen some linear algebra.

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Data dependent operators for the spatial-spectral fusion problem

Data dependent operators for the spatial-spectral fusion problem Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Lecture: Some Practical Considerations (3 of 4)

Lecture: Some Practical Considerations (3 of 4) Stat260/CS294: Spectral Graph Methods Lecture 14-03/10/2015 Lecture: Some Practical Considerations (3 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough.

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

Lecture: Face Recognition and Feature Reduction

Lecture: Face Recognition and Feature Reduction Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab 1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed in the

More information

CSE 291. Assignment Spectral clustering versus k-means. Out: Wed May 23 Due: Wed Jun 13

CSE 291. Assignment Spectral clustering versus k-means. Out: Wed May 23 Due: Wed Jun 13 CSE 291. Assignment 3 Out: Wed May 23 Due: Wed Jun 13 3.1 Spectral clustering versus k-means Download the rings data set for this problem from the course web site. The data is stored in MATLAB format as

More information

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 23 1 / 27 Overview

More information

Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction

Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction A presentation by Evan Ettinger on a Paper by Vin de Silva and Joshua B. Tenenbaum May 12, 2005 Outline Introduction The

More information

Lecture: Face Recognition and Feature Reduction

Lecture: Face Recognition and Feature Reduction Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed

More information

Machine Learning - MT & 14. PCA and MDS

Machine Learning - MT & 14. PCA and MDS Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Kernel Principal Component Analysis

Kernel Principal Component Analysis Kernel Principal Component Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T. Notes on singular value decomposition for Math 54 Recall that if A is a symmetric n n matrix, then A has real eigenvalues λ 1,, λ n (possibly repeated), and R n has an orthonormal basis v 1,, v n, where

More information

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary

More information

Advanced data analysis

Advanced data analysis Advanced data analysis Akisato Kimura ( 木村昭悟 ) NTT Communication Science Laboratories E-mail: akisato@ieee.org Advanced data analysis 1. Introduction (Aug 20) 2. Dimensionality reduction (Aug 20,21) PCA,

More information

Lecture 6 Sept Data Visualization STAT 442 / 890, CM 462

Lecture 6 Sept Data Visualization STAT 442 / 890, CM 462 Lecture 6 Sept. 25-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Dual PCA It turns out that the singular value decomposition also allows us to formulate the principle components

More information

Computation. For QDA we need to calculate: Lets first consider the case that

Computation. For QDA we need to calculate: Lets first consider the case that Computation For QDA we need to calculate: δ (x) = 1 2 log( Σ ) 1 2 (x µ ) Σ 1 (x µ ) + log(π ) Lets first consider the case that Σ = I,. This is the case where each distribution is spherical, around the

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold.

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold. Nonlinear Methods Data often lies on or near a nonlinear low-dimensional curve aka manifold. 27 Laplacian Eigenmaps Linear methods Lower-dimensional linear projection that preserves distances between all

More information

Review of Linear Algebra

Review of Linear Algebra Review of Linear Algebra Definitions An m n (read "m by n") matrix, is a rectangular array of entries, where m is the number of rows and n the number of columns. 2 Definitions (Con t) A is square if m=

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016

Lecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016 Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU,

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU, Eric Xing Eric Xing @ CMU, 2006-2010 1 Machine Learning Data visualization and dimensionality reduction Eric Xing Lecture 7, August 13, 2010 Eric Xing Eric Xing @ CMU, 2006-2010 2 Text document retrieval/labelling

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

Dimensionality Reduction with Principal Component Analysis

Dimensionality Reduction with Principal Component Analysis 10 Dimensionality Reduction with Principal Component Analysis Working directly with high-dimensional data, such as images, comes with some difficulties: it is hard to analyze, interpretation is difficult,

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information