Mid-year Report
Linear and Non-linear Dimensionality Reduction applied to gene expression data of cancer tissue samples
Franck Olivier Ndjakou Njeunje
Applied Mathematics, Statistics, and Scientific Computation
Advisers: Wojtek Czaja, John J. Benedetto
Norbert Wiener Center for Harmonic Analysis and Applications
Department of Mathematics
University of Maryland - College Park
December 2, 2014
1 / 26
Outline 2 / 26
DNA Microarray [2]
Gene expression matrix
3 / 26
Project goal
In this project I would like to demonstrate the effectiveness of non-linear versus linear dimension reduction algorithms in capturing biologically relevant structure in cancer gene expression datasets.
I will use clustering analysis and the Rand index as tools to measure how well the structure of the original dataset is preserved in the reduced dataset.
4 / 26
I am considering the following methods:
1. Dimension reduction algorithms¹ ²
   - LDR: Principal Component Analysis (PCA) [1]
   - NDR: Laplacian Eigenmap (LE) [3]
2. Clustering algorithms
   - K-means (KM)
   - Hierarchical clustering (HC)
¹ Jinlong Shi, Zhigang Luo, Nonlinear dimensionality reduction.
² Mikhail Belkin, Partha Niyogi, Laplacian Eigenmaps.
5 / 26
Step 1: Compute the standardized matrix $X^*$ of the original matrix $X$,
$$X^* = (x_1^*, x_2^*, \ldots, x_M^*) \qquad (1)$$
$$\phantom{X^*} = \left( \frac{x_1 - \bar{x}_1}{\sqrt{\sigma_{11}}}, \frac{x_2 - \bar{x}_2}{\sqrt{\sigma_{22}}}, \ldots, \frac{x_M - \bar{x}_M}{\sqrt{\sigma_{MM}}} \right). \qquad (2)$$
Here, $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M$ and $\sigma_{11}, \sigma_{22}, \ldots, \sigma_{MM}$ are respectively the mean values and the variances of the corresponding variable vectors.
6 / 26
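A minimal NumPy sketch of this standardization step (the project's own code uses the MATLAB DR Toolbox; the function and toy matrix below are only my illustration):

```python
import numpy as np

def standardize(X):
    """Center each variable (column) and scale by its standard deviation, as in eq. (2)."""
    means = X.mean(axis=0)          # mean values x̄_1, ..., x̄_M
    stds = X.std(axis=0, ddof=1)    # sqrt(σ_11), ..., sqrt(σ_MM)
    return (X - means) / stds       # the standardized matrix X*

X = np.random.rand(5, 3)            # toy data: N = 5 samples, M = 3 variables
X_star = standardize(X)
print(X_star.mean(axis=0))          # each column now has mean ~0
```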
Step 2: Compute the covariance matrix of $X^*$, then perform a spectral decomposition to get the eigenvalues and their corresponding eigenvectors,
$$C = (X^*)^\top X^* = U \Lambda U^\top. \qquad (3)$$
Here $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M)$ with $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_M$, and $U = (u_1, u_2, \ldots, u_M)$; $\lambda_i$ and $u_i$ are respectively the $i$th eigenvalue and its corresponding eigenvector of the covariance matrix $C$.
Step 3: Determine the number of principal components to keep based on a predetermined threshold. Supposing that number to be $m$, the $i$th principal component can be computed as $X^* u_i$, and the reduced $(N \times m)$-dimensional representation is $X^* U_m$.
7 / 26
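Putting steps 1-3 together, a hedged Python/NumPy sketch of this covariance-based PCA (the report's implementation is in MATLAB; the shapes and names here are my own):

```python
import numpy as np

def pca(X, m):
    """Steps 1-3: standardize, eigendecompose the covariance, project.

    X : (N, M) data matrix, one variable per column.
    m : number of principal components to keep.
    """
    X_star = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # step 1
    C = X_star.T @ X_star                                  # step 2 (covariance up to a constant)
    eigvals, U = np.linalg.eigh(C)                         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                      # reorder so λ1 ≥ λ2 ≥ ... ≥ λM
    eigvals, U = eigvals[order], U[:, order]
    return X_star @ U[:, :m], eigvals                      # step 3: (N, m) representation

Y, eigvals = pca(np.random.rand(100, 10), m=2)
```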
Illustration 8 / 26
Loading vectors
Principal components are, in short, linear combinations of the original variables making up the matrix $X$. The vectors carrying the coefficients of these combinations are known as the loading vectors. The first loading vector $u_1$ must be chosen so that the magnitude of the first principal component is maximized.
9 / 26
First loading vector: Rayleigh quotient
$$u_1 = \arg\max_{\|u\|=1} \|y_1\|^2 \qquad (4)$$
$$\phantom{u_1} = \arg\max_{\|u\|=1} \|X^* u\|^2 \qquad (5)$$
$$\phantom{u_1} = \arg\max_{\|u\|=1} \frac{u^\top (X^*)^\top X^* u}{u^\top u} \qquad (6)$$
$$\phantom{u_1} = \arg\max_{\|u\|=1} \frac{u^\top C u}{u^\top u} \qquad (7)$$
The quantity being maximized is well known as the Rayleigh quotient for symmetric matrices. The solution to this optimization problem is known to be the eigenvector of $C$ corresponding to the eigenvalue of largest magnitude.
10 / 26
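As a quick numerical sanity check of this claim (my own snippet, not part of the report), the Rayleigh quotient at the top eigenvector should dominate that of any random direction:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6))
C = A.T @ A                                   # a symmetric positive semi-definite "covariance"
eigvals, U = np.linalg.eigh(C)
u1 = U[:, -1]                                 # eigenvector of the largest eigenvalue

rq = lambda u: (u @ C @ u) / (u @ u)          # the Rayleigh quotient
print(np.isclose(rq(u1), eigvals[-1]))        # True: the quotient attains λ_max at u1
for _ in range(5):
    u = rng.standard_normal(6)
    assert rq(u) <= rq(u1) + 1e-9             # no random direction beats u1
```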
Remaining loading vectors
To find the remaining loading vectors $u_k$ for $k = 2, \ldots, m$, we apply the same idea to the modified matrix $X^*_k$,
$$X^*_k = X^* - \sum_{i=1}^{k-1} X^* u_i u_i^\top, \qquad (8)$$
where all correlation with the previously found loading vectors has been removed.
11 / 26
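A small NumPy sketch of this deflation (eq. 8); the standardized matrix and the loading vectors found so far are assumed to be available:

```python
import numpy as np

def deflate(X_star, loadings):
    """Remove correlation with previously found loading vectors (eq. 8)."""
    X_k = X_star.copy()
    for u in loadings:                  # u_1, ..., u_{k-1}, unit-norm vectors
        X_k -= np.outer(X_star @ u, u)  # subtract the rank-one term X* u_i u_i^T
    return X_k
```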
Covariance matrix
With $C \in \mathbb{R}^{M \times M}$ as our symmetric covariance matrix, the set $\{u_j\}$ of unit eigenvectors of $C$, $j = 1, \ldots, M$, forms a basis of $\mathbb{R}^M$. The corresponding eigenvalues $\{\lambda_j\}$ are such that $\lambda_1 > \lambda_2 > \ldots > \lambda_M$.
So any vector $u^{(0)} \in \mathbb{R}^M$ can be written as
$$u^{(0)} = c_1 u_1 + c_2 u_2 + \ldots + c_M u_M \qquad (9)$$
for some $c_1, c_2, \ldots, c_M \in \mathbb{R}$.
12 / 26
Power method [4]
Assuming that $c_1 \neq 0$:
$$A u^{(0)} = c_1 \lambda_1 u_1 + c_2 \lambda_2 u_2 + \ldots + c_M \lambda_M u_M$$
$$A^k u^{(0)} = c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \ldots + c_M \lambda_M^k u_M$$
$$A^k u^{(0)} = \lambda_1^k \left( c_1 u_1 + c_2 \left( \frac{\lambda_2}{\lambda_1} \right)^k u_2 + \ldots + c_M \left( \frac{\lambda_M}{\lambda_1} \right)^k u_M \right).$$
So, as $k$ increases we get
$$u_1 \approx \frac{A^k u^{(0)}}{\|A^k u^{(0)}\|}. \qquad (10)$$
13 / 26
Algorithm³
Pick a starting vector $u^{(0)}$ with $\|u^{(0)}\| = 1$.
While $\|u^{(k)} - u^{(k-1)}\| > 10^{-6}$:
    Let $w = A u^{(k-1)}$
    Let $u^{(k)} = w / \|w\|$
Note: the convergence rate depends on the magnitude of the second largest eigenvalue (see the sketch below).
³ Gene H. Golub, Henk A. van der Vorst, Eigenvalue computation in the 20th century.
14 / 26
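A runnable Python/NumPy version of this algorithm (a sketch: the 10⁻⁶ tolerance matches the slide, while the starting vector and the cap on iterations are my own choices):

```python
import numpy as np

def power_iteration(A, tol=1e-6, max_iter=10_000):
    """Dominant eigenvector of a symmetric PSD matrix A by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(A.shape[0])
    u /= np.linalg.norm(u)                 # starting vector with ||u(0)|| = 1
    for _ in range(max_iter):
        w = A @ u                          # w = A u(k-1)
        u_new = w / np.linalg.norm(w)      # u(k) = w / ||w||
        if np.linalg.norm(u_new - u) < tol:
            return u_new
        u = u_new
    return u

# The ratio |λ2/λ1| governs the convergence rate: the closer to 1, the slower.
B = np.diag([10.0, 9.9, 1.0])              # λ2/λ1 = 0.99, so convergence is slow
u1 = power_iteration(B)
```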
Data set [5]
The matrix X has dimension: 2000 × 3
15 / 26
Variability: This number reflects the amount of variance captured by the reduction. It is the percentage of the total magnitude of the eigenvalues accounted for by the eigenvectors (loading vectors) that are kept.
DR Toolbox PCA vs. My PCA: 90% variability
16 / 26
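The variability percentage follows directly from the eigenvalues; a one-function sketch of the idea (my own illustration):

```python
import numpy as np

def variability(eigvals, m):
    """Percentage of total eigenvalue magnitude captured by the top m loading vectors."""
    lam = np.sort(np.abs(eigvals))[::-1]   # λ1 ≥ λ2 ≥ ...
    return 100.0 * lam[:m].sum() / lam.sum()

print(variability(np.array([5.0, 3.0, 1.0, 0.5, 0.5]), m=2))  # 80.0
```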
Data set [5]
The matrix X has dimension: 20000 × 3
17 / 26
DR Toolbox PCA vs. My PCA: 99% variability
18 / 26
Rand index [6]
The Rand index is a measure of agreement between two data clusterings. Given a set of $n$ elements $S$ and two partitions $X$ and $Y$ of the set $S$, the Rand index $r$ is given by
$$r = \frac{a + b}{C(n, 2)} \qquad (11)$$
where:
$a$ is the number of pairs of elements in $S$ that are in the same set in $X$ and in the same set in $Y$;
$b$ is the number of pairs of elements in $S$ that are in different sets in $X$ and in different sets in $Y$.
19 / 26
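A direct implementation of equation (11) (my own sketch; scikit-learn's sklearn.metrics.rand_score computes the same quantity):

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Rand index between two clusterings given as per-element label sequences (eq. 11)."""
    a = b = 0
    n = len(labels_x)
    for i, j in combinations(range(n), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1                          # pair together in both partitions
        elif not same_x and not same_y:
            b += 1                          # pair apart in both partitions
    return (a + b) / (n * (n - 1) / 2)      # C(n, 2) pairs in total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition up to relabeling
```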
K-means Clustering
The matrix X has dimension: 20002 × 60
Raw clustering labels vs. 2D labels: 83% agreement
20 / 26
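The raw-versus-reduced comparison might look like the following sketch (placeholder data and scikit-learn calls; using 2 clusters, say tumor versus normal, is my assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import rand_score

rng = np.random.default_rng(0)
X = rng.random((60, 2000))                    # placeholder: 60 samples x 2000 genes

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)   # linear reduction to 2 dimensions
labels_2d = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

print(rand_score(labels_raw, labels_2d))      # agreement between raw and 2D clusterings
```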
Hierarchical Clustering
The matrix X has dimension: 20002 × 60
Raw clustering labels vs. 2D labels: 73% agreement
21 / 26
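The hierarchical variant only swaps the clusterer; a SciPy sketch (ward linkage is my assumption, since the report does not state which linkage it uses):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((60, 2000))                       # placeholder data, as above

Z = linkage(X, method='ward')                    # build the dendrogram bottom-up
labels = fcluster(Z, t=2, criterion='maxclust')  # cut it into 2 flat clusters
```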
Variability preserved within data 23 / 26
Conclusion
PCA: K-means vs Hierarchical: 83% vs 73% agreement
Laplacian Eigenmap: K-means vs Hierarchical: ??% vs ??% agreement
PCA vs Laplacian Eigenmap: ??% vs ??% agreement
24 / 26
References:
[1] Jinlong Shi, Zhigang Luo, Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Computers in Biology and Medicine 40 (2010) 723-732.
[2] Larssono, September 2007, Microarray-schema. Via Wikipedia - DNA microarray page.
[3] Mikhail Belkin, Partha Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373-1396 (2003).
[4] Gene H. Golub, Henk A. van der Vorst, Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics 123 (2000) 35-65.
25 / 26
[5] Laurens van der Maaten, Delft University of Technology. Matlab Toolbox for Dimensionality Reduction (v0.8.1b), March 21, 2013.
[6] W. M. Rand (1971), Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66 (336): 846-850.
26 / 26