Singular Value Decomposition and Principal Component Analysis (PCA) I
Prof. Ned Wingreen
MOL 410/510

Microarray review

Data per array: ~10,000 genes, with two intensities $I^{(\text{green})}_i$ and $I^{(\text{red})}_i$ for each gene $i$. With ~100 arrays, this yields 1,000,000+ data points!

The expression matrix $X$ has entries of the form
$$x_{ij} = \log_2\!\left(\frac{I^{(\text{green})}_{ij}}{I^{(\text{red})}_{ij}}\right),$$
where the $i$th row $\vec g_i$ of the $m$ rows is the transcriptional response of gene $i$, and the $j$th column $\vec a_j$ of the $n$ columns is the expression profile of assay $j$. This is an $m \times n$ matrix, where the number of rows is $m \sim 10{,}000$ and the number of columns is $n \sim 100$.

Example: Assays are taken every 10 minutes after addition of some compound (nutrient, poison, signal molecule). How do we characterize the response of the genes?

Figure 1: Noisy genes ($\log_2[I(t)/I(0)]$ vs. $t$ for individual genes).

For any individual gene, it may be impossible to see the signal through the noise. Individual genes may respond with a superposition of different patterns and noise. How do we extract the important patterns from the noise? We will use SVD and
PCA to analyze the expression matrix $X$. But first we need to learn (or review) some linear algebra, so we can manipulate matrices like $X$.

Linear Algebra

Matrices

An $m \times n$ matrix and its transpose:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & \\ \vdots & & & \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \qquad
A^T = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & & & \\ \vdots & & & \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix},$$
where the transpose interchanges columns and rows. If $m = n$, then $A$ is called square.

Matrix multiplication

If $A$ is an $m \times n$ matrix and $B$ is an $n \times l$ matrix, then $AB$ is an $m \times l$ matrix where
$$(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}.$$
We can think of this as dot products of vectors: with the rows of $A$ written as vectors $\vec a_i^{\,T}$ and the columns of $B$ as vectors $\vec b_j$,
$$(AB)_{ij} = \vec a_i \cdot \vec b_j = \vec a_i^{\,T} \vec b_j.$$

Identity matrix

The square matrix
$$I = \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{pmatrix}$$
is the identity matrix because $AI = A$ and $IA = A$ for all matrices $A$.
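To make these definitions concrete, here is a minimal numpy sketch (arbitrary illustrative matrices, not data from the notes) checking matrix multiplication as row-column dot products, the transpose, and the identity:

```python
import numpy as np

# Illustrative matrices only.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # m x n = 2 x 3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0]])            # n x l = 3 x 2

# (AB)_ij is the dot product of row i of A with column j of B.
AB = A @ B
assert np.allclose(AB[0, 1], np.dot(A[0, :], B[:, 1]))

# The transpose interchanges rows and columns.
assert np.allclose(A.T[2, 0], A[0, 2])

# Multiplying by the identity leaves A unchanged.
I = np.eye(3)
assert np.allclose(A @ I, A)
print(AB)
```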
Inverse of a square matrix

$$A A^{-1} = A^{-1} A = I.$$
Note that $(A^{-1})^{-1} = A$. Finding the inverse of a matrix is slow in practice. In general:
$$A^{-1} = \frac{1}{|A|} (C_{ij})^T, \qquad \text{i.e. } (A^{-1})_{ij} = \frac{C_{ji}}{|A|},$$
where $|A| = \det A$. Therefore $A^{-1}$ is $1/\det A$ times the transpose of the cofactor matrix (aka the adjoint of $A$), with
$$C_{ij} = (-1)^{i+j} M_{ij},$$
where the minor $M_{ij}$ is the determinant of the submatrix obtained from $A$ by deleting row $i$ and column $j$.

Determinants of square matrices

Example: $2 \times 2$ matrix:
$$\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.$$
The determinant of larger matrices can be defined by induction. Example: $3 \times 3$ matrix:
$$\det \begin{pmatrix} i & j & k \\ a & b & c \\ d & e & f \end{pmatrix} = i(bf - ce) - j(af - cd) + k(ae - bd).$$
More generally,
$$|A| = \det A = \sum_{j=1}^{n} a_{ij} (-1)^{i+j} M_{ij},$$
where as above $M_{ij}$ is the determinant of the submatrix formed by $A$ without row $i$ and column $j$. Expanding along the first row,
$$|A| = a_{11} M_{11} - a_{12} M_{12} + \cdots$$

Example: Inverse of a $2 \times 2$ matrix
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}:$$
$$A^{-1} = \frac{1}{|A|} \begin{pmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{pmatrix} = \frac{1}{|A|} \begin{pmatrix} M_{11} & -M_{21} \\ -M_{12} & M_{22} \end{pmatrix} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$
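As a quick numerical sanity check of the cofactor formula (a minimal sketch with arbitrary illustrative entries; the algebraic check follows below):

```python
import numpy as np

# 2x2 example of the adjugate (transposed cofactor) formula for the inverse.
A = np.array([[2.0, 1.0],
              [5.0, 3.0]])

a, b = A[0]
c, d = A[1]
detA = a * d - b * c                    # determinant of a 2x2 matrix
A_inv = np.array([[ d, -b],
                  [-c,  a]]) / detA     # (1/det A) * transpose of the cofactor matrix

assert np.allclose(A_inv, np.linalg.inv(A))   # agrees with the library inverse
assert np.allclose(A @ A_inv, np.eye(2))      # A A^{-1} = I
print(A_inv)
```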
Check:
$$A^{-1} A = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix}
= \frac{1}{ad - bc} \begin{pmatrix} ad - bc & 0 \\ 0 & ad - bc \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I.$$

Linear independence

Vectors $\vec v_1, \vec v_2, \ldots, \vec v_n$ are linearly independent if
$$c_1 \vec v_1 + c_2 \vec v_2 + c_3 \vec v_3 + \cdots + c_n \vec v_n = 0 \;\Rightarrow\; c_1 = c_2 = \cdots = c_n = 0.$$
If $n$ vectors are not linearly independent, then they span a subspace of dimension $< n$. E.g., two vectors $\vec v_1$ and $\vec v_2$ are linearly dependent if and only if they are multiples of each other: $\vec v_1 = a \vec v_2$. In this case $\vec v_1$ and $\vec v_2$ only span a 1-dimensional subspace.

Figure 2: Linearly dependent vectors.

Orthogonality

Two vectors are orthogonal if $\vec x \cdot \vec y = 0$. For example,
$$\vec x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad \vec y = \begin{pmatrix} 4 \\ -2 \end{pmatrix}: \quad \vec x \cdot \vec y = 1 \cdot 4 + 2 \cdot (-2) = 0.$$
Orthogonal vectors are perpendicular.

Figure 3: Orthogonal vectors $\vec x \perp \vec y$.

Eigenvalues and eigenvectors of symmetric matrices

Eigen means characteristic, so these are also sometimes called characteristic values and vectors. Eigenvalues and eigenvectors are defined and related by the equation
$$A \vec v = \lambda \vec v,$$
where $\vec v$ is an eigenvector and $\lambda$ is an eigenvalue. (A consequence of this definition is that if $\vec v$ is an eigenvector, then so is $c \vec v$.) To find eigenvalues, we use the theorem that $(A - \lambda I)\vec v = 0$ has a nonzero solution $\vec v$ if and only if $\det(A - \lambda I) = 0$. This follows from a fact which is good to know: $\det X = \prod_i \lambda_i$. Similarly, $\operatorname{tr} X = \sum_i \lambda_i$. We can therefore find eigenvalues by solving
$$\det(A - \lambda I) = \det \begin{pmatrix} a_{11} - \lambda & a_{12} & \cdots \\ a_{21} & a_{22} - \lambda & \\ \vdots & & \ddots \\ & & & a_{nn} - \lambda \end{pmatrix} = 0,$$
which is an $n$th order polynomial in $\lambda$, with $n$ roots, i.e. solutions for $\lambda$ (these are either real or come in complex conjugate pairs). Once a $\lambda_i$ is known, $A \vec v = \lambda_i \vec v$ gives a set of linear equations which can be solved for the associated eigenvector $\vec v_i$.

Diagonalization

If $A$ is an $n \times n$ matrix with $n$ distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and associated eigenvectors $\vec v_i$, then let
$$U = \begin{pmatrix} \vec v_1 & \vec v_2 & \cdots & \vec v_n \end{pmatrix},$$
where the eigenvectors $\vec v_i$ are the columns of $U$. Now
$$AU = A(\vec v_1, \vec v_2, \ldots, \vec v_n) = (A\vec v_1, A\vec v_2, \ldots, A\vec v_n) = (\lambda_1 \vec v_1, \lambda_2 \vec v_2, \ldots, \lambda_n \vec v_n) = U \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix} = UD,$$
where $D$ is the diagonal matrix of eigenvalues. Manipulating the expression $AU = UD$ yields the identity $A = UDU^{-1}$. This is the eigen-decomposition theorem, and it is an example of a similarity transform. When $A$ is a symmetric matrix, i.e., when $A = A^T$, then it is possible to make the $\vec v$s orthonormal, in which case $U^{-1} = U^T$, so that $A = UDU^T$.
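A minimal numerical illustration, using an arbitrary symmetric matrix chosen here for the example: numpy's `eigh` returns the eigenvalues and orthonormal eigenvectors, and we can verify $A = UDU^T$, the determinant and trace facts quoted above, and how easily matrix powers follow (as derived in the next part):

```python
import numpy as np

# Arbitrary small symmetric matrix (illustrative values only).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, U = np.linalg.eigh(A)      # eigh: symmetric case, orthonormal eigenvectors
D = np.diag(eigvals)

# Eigen-decomposition: A = U D U^T (U^{-1} = U^T because A is symmetric).
assert np.allclose(A, U @ D @ U.T)

# Powers are easy once A is diagonalized: A^3 = U D^3 U^T.
assert np.allclose(np.linalg.matrix_power(A, 3), U @ np.diag(eigvals**3) @ U.T)

# det A = product of eigenvalues, tr A = sum of eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(eigvals))
assert np.isclose(np.trace(A), np.sum(eigvals))
print(eigvals)
```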
(Principal Component Analysis (PCA) is equivalent to diagonalization of a particular symmetric matrix, the covariance matrix.)

The eigendecomposition allows some nice results:
$$A^2 = AA = (UDU^{-1})(UDU^{-1}) = UDDU^{-1} = UD^2U^{-1} = U \begin{pmatrix} \lambda_1^2 & & & \\ & \lambda_2^2 & & \\ & & \ddots & \\ & & & \lambda_n^2 \end{pmatrix} U^{-1},$$
and in the same way
$$A^m = UD^mU^{-1} = U \begin{pmatrix} \lambda_1^m & & & \\ & \lambda_2^m & & \\ & & \ddots & \\ & & & \lambda_n^m \end{pmatrix} U^{-1}.$$
So if we know the eigenvalues of $A$, we immediately know the eigenvalues of $A^m$. In particular,
$$A^{-1} = (UDU^{-1})^{-1} = (U^{-1})^{-1} D^{-1} U^{-1} = UD^{-1}U^{-1}.$$
It is easy to check that
$$A^{-1} A = (UD^{-1}U^{-1})(UDU^{-1}) = UD^{-1}DU^{-1} = UU^{-1} = I.$$
Therefore
$$A^{-1} = U \begin{pmatrix} 1/\lambda_1 & & & \\ & 1/\lambda_2 & & \\ & & \ddots & \\ & & & 1/\lambda_n \end{pmatrix} U^{-1}.$$

Singular Value Decomposition (SVD)

Figure 4: Singular Value Decomposition (SVD): the action of $A$ on vectors $\vec v_1, \vec v_2$, with singular values $s_1, s_2$.

SVD is a generalization of diagonalization for non-symmetric matrices. If $A$ is an $m \times n$ matrix, then there exist $U$, $D$, $V$ such that
$$A = UDV^T,$$
where $U$ is an $m \times m$ matrix with orthonormal columns and $V$ is an $n \times n$ matrix with orthonormal rows. $D$ is an $m \times n$ diagonal matrix,
$$D = \begin{pmatrix} S_1 & & & \\ & S_2 & & \\ & & \ddots & \\ & & & S_n \\ & 0 & & \end{pmatrix}, \qquad S_1 \ge S_2 \ge \cdots \ge S_n \ge 0.$$
These $S_i$ are the singular values.
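A minimal sketch of the decomposition just defined, using a random illustrative matrix: `np.linalg.svd` returns $U$, the singular values, and $V^T$, and we can check the reconstruction and orthonormality directly:

```python
import numpy as np

# Random illustrative m x n matrix with m > n.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(A, full_matrices=True)
# U is m x m, Vt (= V^T) is n x n, S holds S_1 >= S_2 >= ... >= S_n >= 0.

# Rebuild the m x n "diagonal" matrix D and check A = U D V^T.
D = np.zeros_like(A)
np.fill_diagonal(D, S)
assert np.allclose(A, U @ D @ Vt)

# The columns of U (and of V) are orthonormal.
assert np.allclose(U.T @ U, np.eye(6))
assert np.allclose(Vt @ Vt.T, np.eye(4))
print(S)
```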
SVD and PCA II
Prof. Ned Wingreen
MOL 410/510

Let's see how we can first apply PCA and then SVD to extract information about gene regulation from a gene expression matrix. Recall the expression matrix $X$, where the $i$th row $\vec g_i$ of the $m$ rows is the transcriptional response of gene $i$, and the $j$th column $\vec a_j$ of the $n$ columns is the expression profile of assay $j$.

Consider the expression profile for each assay. This will be the vector
$$\vec a_k = (x_{1k}, x_{2k}, \ldots, x_{mk}).$$
Each assay is therefore associated with a point in the $m$-dimensional space of genes.

Figure 1: Gene space, with each assay ($\vec a_1$, $\vec a_2$, ...) plotted as a point on axes $g_1$ and $g_2$.

The graph, of course, shows only 2 dimensions of an $m$-dimensional space, but one can immediately see that there is a correlation between the expression of gene $g_1$ and gene $g_2$. How can we find correlations among multiple genes? E.g., how would we know if the expression of genes $g_1$, $g_2$, $g_{37}$, and $g_{64}$ are all correlated? We can learn this from PCA and/or SVD.

First, center the data by subtracting out the mean value for each gene:
$$y_{ik} = x_{ik} - \bar{x}_i, \qquad \text{where } \bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_{ik},$$
the average having been taken over the $n$ assays. This yields the vector
$$\tilde{\vec a}_k = \vec a_k - \bar{\vec x} = (y_{1k}, y_{2k}, \ldots, y_{mk}).$$
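A minimal numpy sketch of this centering step, using a small random stand-in for the expression matrix (sizes and values chosen here for illustration only):

```python
import numpy as np

# Hypothetical expression matrix: m genes x n assays of random illustrative data.
rng = np.random.default_rng(1)
m, n = 5, 8
X = rng.normal(size=(m, n))

gene_means = X.mean(axis=1, keepdims=True)   # one mean per gene (row), averaged over assays
Y = X - gene_means                           # y_ik = x_ik - mean_i

# Each row of the centered matrix now has zero mean over the assays.
assert np.allclose(Y.mean(axis=1), 0.0)
print(Y[:, 0])    # the centered expression profile of the first assay
```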
Figure 2: Re-centered gene space.

Notice in this example that even after subtracting out the means, the expression values for genes 1 and 2 are correlated over the set of assays. Biologically, this suggests that genes 1 and 2 are coregulated. How can we quantify this? In particular, how can we recognize if many sets of genes are coregulated (or counter-regulated)? Graphically, we'd like to find the directions in gene space (weighted combinations of genes) that best capture the observed correlations in the data. How do we do this mathematically, allowing for many correlated genes? Answer: these directions are the principal components of a particular matrix, the covariance matrix. With $n$ assays, this takes the form:
$$C_{ij} = \frac{1}{n} \sum_{k=1}^{n} y_{ik} y_{jk} = \frac{1}{n} \sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j) = E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)] = E[x_i x_j] - \bar{x}_i \bar{x}_j = \operatorname{cov}(x_i, x_j).$$
From the covariance matrix we can also define
$$\operatorname{cor}(x_i, x_j) = \frac{\operatorname{cov}(x_i, x_j)}{\sigma_i \sigma_j},$$
the correlation coefficient of $x_i$ and $x_j$ (often called $r$), which satisfies $|r| \le 1$. If $x_i$ and $x_j$ are independent,
$$\operatorname{cov}(x_i, x_j) = E[x_i x_j] - \bar{x}_i \bar{x}_j = \bar{x}_i \bar{x}_j - \bar{x}_i \bar{x}_j = 0.$$
Two other easy relations are
$$\operatorname{cov}(x_i, x_i) = E[(x_i - \bar{x}_i)^2] = \sigma_i^2 \qquad (\operatorname{cor}(x_i, x_i) = r = 1),$$
$$\operatorname{cov}(x_i, -x_i) = -\sigma_i^2 \qquad (\operatorname{cor}(x_i, -x_i) = r = -1).$$
The covariance matrix is symmetric: $\operatorname{cov}(x_i, x_j) = \operatorname{cov}(x_j, x_i)$, e.g. $\operatorname{cov}(x_1, x_2) = \operatorname{cov}(x_2, x_1)$.
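Here is a minimal sketch of building the covariance and correlation matrices from centered data (synthetic illustrative data in which the first two "genes" are constructed to be coregulated):

```python
import numpy as np

# Synthetic data: m genes x n assays; make the first two genes strongly correlated.
rng = np.random.default_rng(2)
m, n = 4, 200
X = rng.normal(size=(m, n))
X[1] = 0.8 * X[0] + 0.2 * rng.normal(size=n)

Y = X - X.mean(axis=1, keepdims=True)          # center each gene
C = (Y @ Y.T) / n                              # C_ij = (1/n) sum_k y_ik y_jk, m x m

assert np.allclose(C, C.T)                     # the covariance matrix is symmetric

# Correlation coefficient r = cov(x_i, x_j) / (sigma_i sigma_j), always between -1 and 1.
sigma = np.sqrt(np.diag(C))
r = C / np.outer(sigma, sigma)
print(r[0, 1])    # close to +1 for the two coregulated genes
```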
Diagonalizing the covariance matrix is called a Principal Component Analysis:
$$\operatorname{cov} = UDU^T = \begin{pmatrix} \vec u_1 & \vec u_2 & \cdots & \vec u_m \end{pmatrix} \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_m \end{pmatrix} \begin{pmatrix} \vec u_1^{\,T} \\ \vec u_2^{\,T} \\ \vdots \\ \vec u_m^{\,T} \end{pmatrix},$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge 0$. The $\vec u$s are orthonormal eigenvectors, called the principal component vectors, and the $\lambda$s, the eigenvalues of the covariance matrix, are called the principal component values. The eigenvalues give the variance of the data along the principal component directions, and the covariance between these directions is zero. So $\vec u_1$ gives the direction in which the data is most stretched and $\lambda_1$ is the variance of the data in this direction. Notice that
$$\text{Total variance of data} = \sum_i \operatorname{cov}(x_i, x_i) = \operatorname{tr} \operatorname{cov} = \sum_k \lambda_k,$$
where we've used the fact that the trace of a square matrix is the sum of its eigenvalues.

In terms of gene expression, $\vec u_1$ gives the weighted set of genes that are most strongly coregulated (or counter-regulated!), $\vec u_2$ gives the second strongest set, etc. A plot of the eigenvalues $\lambda_i$ in order from largest to smallest may reveal a transition from signal to noise:

Figure 3: Eigenvalue plot (fraction of total variance per component, ordered from largest to smallest, showing a signal regime followed by a noise floor).

Dimension reduction

If the PCA separates signal from noise, we can clean up the data by considering the projection of the data onto the principal components. For each assay:
$$\vec a_k = (x_{1k}, x_{2k}, \ldots, x_{mk}) = \tilde{\vec a}_k + \bar{\vec x},$$
we can find the projection of $\tilde{\vec a}_k$ onto the principal components:
$$\tilde y_{k1} = \vec u_1 \cdot \tilde{\vec a}_k, \qquad \tilde y_{k2} = \vec u_2 \cdot \tilde{\vec a}_k, \qquad \ldots$$

Figure 4: Values of $\tilde y$ (the projections $\tilde y_1, \ldots, \tilde y_5$ shown in the $(g_1, g_2)$ plane).

Keeping only the first few values of $\tilde y$ for each assay accounts for most of the signal in the data (a numerical sketch of this appears after the list of applications below). Plotting the assay data projected onto the first few principal components can also reveal correlations beyond linear ones. Sometimes, $\tilde y_1$ and $\tilde y_2$ will show no additional correlations (Figure 5), but sometimes correlations will be apparent even though $\langle \tilde y_1 \tilde y_2 \rangle = 0$ (Figure 6).

Figure 5: Uncorrelated ($\tilde y_2$ vs. $\tilde y_1$).

Other applications of PCA in biology:
- Molecular dynamics: reconstructing flexible modes of biomolecules from snapshots
- Synthetic lethality
- Immunology: antigen-antibody
- Residue-residue potentials
- Almost any large data set
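Returning to the projection onto principal components described above, here is a minimal sketch (synthetic data containing one hypothetical coherent pattern plus noise; all names and values are chosen here for illustration) that diagonalizes the covariance matrix, orders the eigenvalues, and projects each assay onto the leading principal components:

```python
import numpy as np

# Synthetic data: one coherent pattern shared weakly by all genes, plus noise.
rng = np.random.default_rng(3)
m, n = 50, 40                                   # genes x assays
pattern = np.sin(np.linspace(0, 2 * np.pi, n))  # hypothetical coherent response
weights = rng.normal(size=(m, 1))
X = weights * pattern + 0.5 * rng.normal(size=(m, n))

Y = X - X.mean(axis=1, keepdims=True)           # center each gene
C = (Y @ Y.T) / n                               # covariance matrix

lams, U = np.linalg.eigh(C)                     # eigh returns ascending eigenvalues
order = np.argsort(lams)[::-1]                  # sort from largest to smallest
lams, U = lams[order], U[:, order]

print("fraction of total variance in PC1:", lams[0] / lams.sum())

# Projection of each (centered) assay onto the first two principal components:
y_tilde = U[:, :2].T @ Y                        # shape (2, n); rows are y~_k1 and y~_k2
print(y_tilde.shape)
```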
Figure 6: Correlated ($\tilde y_2$ vs. $\tilde y_1$).

SVD

By summing over assays to produce the covariance matrix, we have thrown away information. Imagine that the assays were taken at fixed time intervals: how can we recapture the time dependence of the gene expression? Any one gene may have a noisy time course of expression (Figure 7).

Figure 7: Single gene expression over time ($x$ vs. $t$).

But maybe there is a coherent response that contributes a little bit to many genes (Figure 8).

Figure 8: Overall $x$ signal: a coherent response contributing slightly positively to many genes and slightly negatively to many others.

This coherent signal will contribute to the correlation of many genes, and so these genes will co-vary. So these genes will form a principal component in the PCA of the covariance matrix. So after performing a PCA, imagine putting time labels back on the data (Figure 9).

Figure 9: Relabeled data: the assays in gene space, now labeled by time point.

By finding the principal components and plotting the projections onto these components for each assay vs. time, we reconstruct the coherent signal hidden in noisy data (Figure 10). SVD allows us to do this in one step. For the $m \times n$ matrix $X$:
$$X = UDV^T = \begin{pmatrix} \vec u_1 & \vec u_2 & \cdots & \vec u_m \end{pmatrix} \begin{pmatrix} S_1 & & & \\ & S_2 & & \\ & & \ddots & \\ & & & S_n \\ & & & \end{pmatrix} \begin{pmatrix} \vec v_1^{\,T} \\ \vec v_2^{\,T} \\ \vdots \\ \vec v_n^{\,T} \end{pmatrix},$$
where $S_1 \ge S_2 \ge \cdots \ge S_n \ge 0$. In this decomposition, the $\vec u$s represent correlated sets of genes and the $\vec v$s represent coherent patterns of gene expression.
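As a minimal sketch of this idea (synthetic data with a hypothetical sinusoidal time course spread weakly across many genes; not data from the lecture), a single SVD of the centered matrix recovers the hidden pattern as the leading right singular vector:

```python
import numpy as np

# Synthetic expression matrix: m genes x n time-ordered assays.
rng = np.random.default_rng(4)
m, n = 200, 50
t = np.arange(n)
pattern = np.sin(2 * np.pi * t / n)                 # hypothetical coherent time course
weights = rng.normal(size=(m, 1))                   # each gene picks up a little of it
X = weights * pattern + rng.normal(size=(m, n))     # plus gene-by-gene noise

Y = X - X.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(Y, full_matrices=False)

# The leading right singular vector v_1 (first row of V^T) is the dominant pattern
# of the assays across time; up to an overall sign it tracks the hidden sine wave.
v1 = Vt[0]
corr = np.corrcoef(v1, pattern)[0, 1]
print("correlation of v_1 with the hidden pattern:", round(abs(corr), 3))
```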
The singular values $S$ are simply related to the principal components of the covariance matrix: $\lambda_k = S_k^2$ (up to the factor of $1/n$ in the definition of the covariance matrix, taking the SVD of the centered data). The matrix constructed from the leading $l$ singular values, and the associated right $\{\vec v\}$ and left $\{\vec u\}$ singular vectors, is the best rank-$l$ approximation to $X$; that is, choosing
$$X^{(l)} = \sum_{k=1}^{l} \vec u_k S_k \vec v_k^{\,T}$$
minimizes
$$\sum_{i,j} \left| x_{ij} - x^{(l)}_{ij} \right|^2.$$
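A minimal sketch of the rank-$l$ approximation (random illustrative matrix): keeping only the leading $l$ singular triplets, the squared reconstruction error equals the sum of the squares of the discarded singular values:

```python
import numpy as np

# Random illustrative matrix.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 20))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

l = 5
X_l = U[:, :l] @ np.diag(S[:l]) @ Vt[:l, :]        # X^(l) = sum_{k<=l} u_k S_k v_k^T

# For this best rank-l approximation, the squared (Frobenius) error equals the sum of
# the squares of the discarded singular values.
err = np.sum((X - X_l) ** 2)
assert np.isclose(err, np.sum(S[l:] ** 2))
print("rank-5 reconstruction error:", err)
```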
Figure 10: Explicit time plot ($\tilde y$ vs. $t$, recovering the coherent time course).