PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57
Announcements
HW1 solutions posted.
Supervised versus Unsupervised Learning
Unsupervised learning: learning from unlabeled observations.
  Dimensionality reduction: last class.
  Other latent variable models: this class, plus a review of PCA.
Outline
  Dimensionality reduction
  Linear Algebra background
  PCA
  Practical issues
  Probabilistic PCA
  Admixture models
  Population structure and GWAS
Raw data can be complex and high-dimensional
If we knew what to measure, we could find simple relationships.
Signals have redundancy.
  Example: genotypes measured at 500K SNPs; genotypes at neighboring SNPs are correlated.
Dimensionality reduction
Goal: find a more compact representation of the data.
Why?
  Visualize and discover hidden patterns.
  Preprocess for a supervised learning problem.
  Statistical: remove noise.
  Computational: reduce wasteful computation.
An example
We measure parent and offspring heights.
  Two measurements, so the data are points in $\mathbb{R}^2$.
How can we find a more compact representation?
  The two measurements are correlated, with some noise.
  Pick a direction and project onto it.
Goal: Minimize reconstruction error
Find the projection that minimizes the Euclidean distance between the original points and their projections.
Principal Components Analysis (PCA) solves this problem.
Principal Components Analysis
PCA: find a lower-dimensional representation of the data.
  Choose $K$.
  $X$ is the $N \times M$ raw data matrix.
  $X \approx Z W^T$, where
    $Z$ is the $N \times K$ reduced representation (PC scores),
    $W$ is the $M \times K$ matrix whose columns are the principal components.
Covariance matrix
$C = \frac{1}{N} X^T X$ (with the columns of $X$ centered)
  Generalizes to many features.
  $C_{i,i}$: variance of feature $i$.
  $C_{i,j}$: covariance of features $i$ and $j$.
  Symmetric.
  Positive semi-definite (PSD), sometimes written $C \succeq 0$.
Definition (positive semi-definite matrix): a matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite iff $v^T A v \ge 0$ for all $v \in \mathbb{R}^n$.
Why $C$ is PSD: for any $v \in \mathbb{R}^M$,
  $v^T C v = \frac{1}{N} v^T X^T X v = \frac{1}{N} (Xv)^T (Xv) = \frac{1}{N} \sum_{i=1}^{N} (Xv)_i^2 \ge 0$
All covariance matrices (being symmetric and PSD) have an eigendecomposition.
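The symmetry and PSD properties of the covariance matrix can be checked numerically. The following is a minimal sketch on random toy data; the helper name `covariance` and the data sizes are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def covariance(X):
    """Return C = (1/N) X^T X after centering each column (feature) of X."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / Xc.shape[0]

rng = np.random.default_rng(0)        # toy data, purely illustrative
X = rng.normal(size=(200, 5))         # N = 200 samples, M = 5 features
C = covariance(X)

# Symmetry: C[i, j] == C[j, i]
is_symmetric = np.allclose(C, C.T)

# PSD: v^T C v equals (1/N) * sum_i (Xc v)_i^2, which cannot be negative
v = rng.normal(size=5)
Xc = X - X.mean(axis=0)
quad_form = v @ C @ v
sum_of_squares = np.sum((Xc @ v) ** 2) / X.shape[0]
```

Here `quad_form` and `sum_of_squares` agree up to floating-point error, which is the PSD argument from the slide evaluated for one particular $v$.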
Eigenvector and eigenvalue
Definition (eigenvector and eigenvalue): a nonzero vector $v$ is an eigenvector of $A \in \mathbb{R}^{n \times n}$ if $Av = \lambda v$ for some scalar $\lambda$; $\lambda$ is the eigenvalue associated with $v$.
Eigendecomposition of a covariance matrix
$C$ is symmetric.
  Its eigenvectors $\{u_i\}$, $i \in \{1, \ldots, M\}$, can be chosen to be orthonormal: $u_i^T u_j = 0$ for $i \ne j$, and $u_i^T u_i = 1$.
  We can order the eigenvectors so that the eigenvalues are in decreasing order: $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_M$.
Arrange $U = [u_1 \ldots u_M]$. Since $C u_i = \lambda_i u_i$ for $i \in \{1, \ldots, M\}$:
  $CU = C[u_1 \ldots u_M] = [Cu_1 \ldots Cu_M] = [\lambda_1 u_1 \ldots \lambda_M u_M] = [u_1 \ldots u_M] \, \mathrm{diag}(\lambda_1, \ldots, \lambda_M) = U\Lambda$
Eigendecomposition of a covariance matrix
$CU = U\Lambda$
$U$ is an orthogonal matrix, so $UU^T = I_M$ and
  $C = CUU^T = U \Lambda U^T$
  $U$ is an $M \times M$ orthogonal matrix whose columns are the eigenvectors, sorted by eigenvalue.
  $\Lambda$ is the diagonal matrix of eigenvalues.
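The decomposition $C = U \Lambda U^T$ can be computed with a standard symmetric eigensolver. This is a small sketch using `numpy.linalg.eigh` on toy data (sizes and variable names are assumptions); note that `eigh` returns eigenvalues in ascending order, so they are re-sorted to match the decreasing order used in these slides.

```python
import numpy as np

rng = np.random.default_rng(1)        # toy data, purely illustrative
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)                # center the features
C = X.T @ X / X.shape[0]

# eigh is intended for symmetric matrices; eigenvalues come back ascending
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # re-sort so lambda_1 >= ... >= lambda_M
lam = eigvals[order]
U = eigvecs[:, order]                 # columns are the eigenvectors u_1, ..., u_M
Lambda = np.diag(lam)

reconstructed = U @ Lambda @ U.T      # should reproduce C
```

Up to floating-point error, `U.T @ U` is the identity, `C @ U` equals `U @ Lambda`, and `reconstructed` equals `C`.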
Eigendecomposition: Example
Covariance matrix: Ψ
Alternate characterization of eigenvectors
Eigenvectors are orthonormal directions of maximum variance; eigenvalues are the variances in these directions.
The first eigenvector is the direction of maximum variance, with variance $\lambda_1$.
Given a covariance matrix $C \in \mathbb{R}^{M \times M}$:
  $x^* = \arg\max_{x : \|x\|_2 = 1} x^T C x$
Solution: $x^* = u_1$, the first eigenvector of $C$.
This is an example of a constrained optimization problem. Why do we need the constraint?
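A quick numerical illustration of this characterization, as a sketch on toy data (the data and all names are assumptions): no random unit vector achieves a larger value of $x^T C x$ than $u_1$, and without the unit-norm constraint the objective can be inflated arbitrarily by rescaling $x$.

```python
import numpy as np

rng = np.random.default_rng(2)        # toy data with unequal variances
X = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(C)  # ascending order
lam1 = eigvals[-1]                    # largest eigenvalue
u1 = eigvecs[:, -1]                   # first eigenvector (unit norm)

# Evaluate v^T C v for many random unit vectors v (one per row of V)
V = rng.normal(size=(1000, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)
values = np.einsum("ij,jk,ik->i", V, C, V)
best_random = values.max()            # never exceeds lam1

# Dropping the constraint: scaling x by 10 scales the objective by 100
scaled = (10 * u1) @ C @ (10 * u1)
```

The last line suggests one answer to the question on the slide: without $\|x\|_2 = 1$ the maximization has no finite solution whenever $\lambda_1 > 0$.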
Back to PCA
Given $N$ data points $x_n \in \mathbb{R}^M$, $n \in \{1, \ldots, N\}$, find a linear transformation from a lower-dimensional space ($K < M$), $W \in \mathbb{R}^{M \times K}$, and projections $z_n \in \mathbb{R}^K$ so that we can reconstruct the original data from the lower-dimensional projections:
  $x_n \approx w_1 z_{n,1} + \ldots + w_K z_{n,K} = [w_1 \ldots w_K] \, (z_{n,1}, \ldots, z_{n,K})^T = W z_n$, with $z_n \in \mathbb{R}^K$
We assume the data are centered: $\sum_n x_{n,m} = 0$ for every feature $m$.
Compression: we go from storing $N \times M$ values to $M \times K + N \times K$ values.
How do we define the quality of the reconstruction?
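To make the compression count concrete, here is a small arithmetic sketch. The value M = 500,000 echoes the 500K SNPs mentioned earlier; N = 1000 individuals and K = 10 components are hypothetical numbers chosen only for illustration.

```python
def storage_counts(N, M, K):
    """Number of stored values before and after a rank-K PCA compression."""
    raw = N * M                   # the full data matrix X
    compressed = M * K + N * K    # W (M x K) plus Z (N x K)
    return raw, compressed

# Hypothetical sizes: 1000 individuals, 500K SNPs, 10 components
raw, compressed = storage_counts(N=1000, M=500_000, K=10)
ratio = raw / compressed          # roughly a 100-fold reduction here
```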
PCA
Find $z_n \in \mathbb{R}^K$ and $W \in \mathbb{R}^{M \times K}$ to minimize the reconstruction error
  $J(W, Z) = \frac{1}{N} \sum_n \|x_n - W z_n\|_2^2$, where $Z = [z_1, \ldots, z_N]^T$.
Require the columns of $W$ to be orthonormal.
The optimal solution is obtained by setting $\hat{W} = U_K$, where $U_K$ contains the $K$ eigenvectors associated with the $K$ largest eigenvalues of the covariance matrix $C$ of $X$.
The low-dimensional projection is $\hat{z}_n = \hat{W}^T x_n$.
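The solution stated above can be sketched in a few lines of NumPy. The function names `pca` and `reconstruction_error` and the toy data are assumptions made for illustration; this is one straightforward way to implement the eigendecomposition route, not necessarily how it would be done for large genetic datasets.

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the covariance matrix.
    Returns W (M x K), Z (N x K) and the feature means."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # center the data
    C = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:K]        # indices of the K largest
    W = eigvecs[:, top]                        # W_hat = U_K
    Z = Xc @ W                                 # row n is z_n = W^T x_n
    return W, Z, mean

def reconstruction_error(X, W, Z, mean):
    """J(W, Z) = (1/N) sum_n ||x_n - W z_n||^2 on the centered data."""
    residual = (X - mean) - Z @ W.T
    return np.sum(residual ** 2) / X.shape[0]

# Toy data: 5 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2)) @ rng.normal(size=(2, 5))
X = X + 0.1 * rng.normal(size=X.shape)

W, Z, mean = pca(X, K=2)
err_k2 = reconstruction_error(X, W, Z, mean)
err_k1 = reconstruction_error(X, *pca(X, K=1))
```

With this normalization, the reconstruction error for a given $K$ equals the sum of the discarded eigenvalues $\lambda_{K+1} + \ldots + \lambda_M$, so `err_k2` is smaller than `err_k1`.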
PCA: K = 1
  $J(w_1, z_1) = \frac{1}{N} \sum_n \|x_n - w_1 z_{n,1}\|_2^2$
  $= \frac{1}{N} \sum_n (x_n - w_1 z_{n,1})^T (x_n - w_1 z_{n,1})$
  $= \frac{1}{N} \sum_n \left( x_n^T x_n - 2 w_1^T x_n z_{n,1} + z_{n,1}^2 w_1^T w_1 \right)$
  $= \text{const} + \frac{1}{N} \sum_n \left( -2 w_1^T x_n z_{n,1} + z_{n,1}^2 \right)$
The last step uses $w_1^T w_1 = 1$; the constant is $\frac{1}{N} \sum_n x_n^T x_n$, which does not depend on $w_1$ or $z_1$.
To minimize this function, take derivatives with respect to $z_{n,1}$:
  $\frac{\partial J(w_1, z_1)}{\partial z_{n,1}} = 0 \Rightarrow z_{n,1} = w_1^T x_n$
Plugging back $z_{n,1} = w_1^T x_n$:
  $J(w_1) = \text{const} + \frac{1}{N} \sum_n \left( -2 w_1^T x_n z_{n,1} + z_{n,1}^2 \right)$
  $= \text{const} + \frac{1}{N} \sum_n \left( -2 z_{n,1} z_{n,1} + z_{n,1}^2 \right)$
  $= \text{const} - \frac{1}{N} \sum_n z_{n,1}^2$
Now, because the data are centered,
  $E[z_1] = \frac{1}{N} \sum_n z_{n,1} = \frac{1}{N} \sum_n w_1^T x_n = w_1^T \frac{1}{N} \sum_n x_n = 0$
so
  $\text{Var}[z_1] = E[z_1^2] - E[z_1]^2 = \frac{1}{N} \sum_n z_{n,1}^2 - 0 = \frac{1}{N} \sum_n z_{n,1}^2$
Putting it together:
  $J(w_1) = \text{const} - \text{Var}[z_1]$
Two views of PCA:
  Find a direction that minimizes the reconstruction error.
  Find a direction that maximizes the variance of the projected data.
  $\arg\min_{w_1} J(w_1) = \arg\max_{w_1} \text{Var}[z_1]$
PCA: K = 1
  $\arg\min_{w_1} J(w_1) = \arg\max_{w_1} \text{Var}[z_1]$
  $\text{Var}[z_1] = \frac{1}{N} \sum_n z_{n,1}^2 = \frac{1}{N} \sum_n w_1^T x_n \, w_1^T x_n = \frac{1}{N} \sum_n w_1^T x_n x_n^T w_1 = w_1^T \left( \frac{1}{N} \sum_n x_n x_n^T \right) w_1 = w_1^T C w_1$
So we need to solve
  $\arg\min_{w_1} J(w_1) = \arg\max_{w_1} \text{Var}[z_1] = \arg\max_{w_1} w_1^T C w_1$
Since we required the columns of $W$ to be orthonormal, we constrain $\|w_1\|_2 = 1$.
This objective function is maximized when $w_1$ is the first eigenvector of $C$.
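The two identities in the $K = 1$ derivation, $J(w_1) = \text{const} - \text{Var}[z_1]$ and $\text{Var}[z_1] = w_1^T C w_1$, can be verified numerically for any unit vector. The sketch below uses toy data; the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)        # toy correlated data, centered
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)
N = X.shape[0]
C = X.T @ X / N

def J(w):
    """Reconstruction error for a single unit direction w, with z_n = w^T x_n."""
    z = X @ w
    return np.sum((X - np.outer(z, w)) ** 2) / N

def projected_variance(w):
    """Variance of the projections z_n = w^T x_n (np.var normalizes by N)."""
    return np.var(X @ w)

const = np.sum(X ** 2) / N            # (1/N) sum_n x_n^T x_n

w = rng.normal(size=3)
w /= np.linalg.norm(w)                # an arbitrary unit direction

u1 = np.linalg.eigh(C)[1][:, -1]      # first eigenvector of C
```

For this `w`, `J(w)` matches `const - projected_variance(w)` and `projected_variance(w)` matches `w @ C @ w`; the first eigenvector `u1` gives an error no larger than that of `w`.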
PCA: K > 1
We can repeat the argument for $K > 1$.
Since we require the directions $w_k$ to be orthonormal, each step searches for the direction that maximizes the remaining variance and is orthogonal to the previously selected directions.
Computing eigendecompositions
Numerical algorithms to compute all eigenvalues and eigenvectors: $O(M^3)$. Infeasible for genetic datasets.
Computing the largest eigenvalue and its eigenvector: power iteration, $O(M^2)$ per iteration.
Since we are interested in covariance matrices, we can use algorithms that compute the singular value decomposition (SVD): $O(MN^2)$. (Will discuss later.)
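A minimal sketch of power iteration for the top eigenpair, together with the SVD route to the eigenvalues of the covariance matrix. The function name, stopping rule, tolerance and toy data are all assumptions made for illustration; production code would typically rely on a library routine.

```python
import numpy as np

def power_iteration(C, num_iters=1000, tol=1e-12, seed=0):
    """Estimate the largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=C.shape[0])
    v /= np.linalg.norm(v)
    lam = v @ C @ v
    for _ in range(num_iters):
        w = C @ v                          # one O(M^2) matrix-vector product
        norm = np.linalg.norm(w)
        if norm == 0.0:                    # v lies in the null space of C
            break
        v = w / norm
        new_lam = v @ C @ v                # Rayleigh quotient estimate
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, v

rng = np.random.default_rng(6)             # toy data with a clear top direction
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 1.5, 1.0, 0.5])
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]
lam1, u1 = power_iteration(C)

# SVD route: squared singular values of centered X, divided by N,
# are the eigenvalues of C (returned in decreasing order)
singular_values = np.linalg.svd(X, compute_uv=False)
eigvals_from_svd = singular_values ** 2 / X.shape[0]
```

The eigenvector is only determined up to sign, so comparisons against another solver should use the absolute value of the inner product.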
Practical issues
Choosing $K$:
  For visualization, $K = 2$ or $K = 3$.
  For other analyses, pick $K$ so that most of the variance in the data is retained.
Fraction of variance retained in the top $K$ eigenvectors:
  $\frac{\sum_{k=1}^{K} \lambda_k}{\sum_{m=1}^{M} \lambda_m}$
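The fraction of variance retained is a cumulative sum over the sorted eigenvalues. The sketch below picks the smallest $K$ that reaches a target fraction; the function names and the default threshold of 0.9 are hypothetical choices, since the slide only says "most of the variance".

```python
import numpy as np

def variance_retained(eigvals):
    """Fraction of variance retained by the top K eigenvectors, for K = 1..M."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # decreasing order
    return np.cumsum(lam) / np.sum(lam)

def choose_k(eigvals, threshold=0.9):
    """Smallest K whose retained fraction reaches the (hypothetical) threshold."""
    frac = variance_retained(eigvals)
    k = int(np.searchsorted(frac, threshold)) + 1
    return min(k, len(frac))              # guard against round-off at threshold 1.0

fractions = variance_retained([4.0, 3.0, 2.0, 1.0])   # 0.4, 0.7, 0.9, 1.0
k_for_80_percent = choose_k([4.0, 3.0, 2.0, 1.0], threshold=0.8)
```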
PCA: Example (figures)
PCA on HapMap (figure)
PCA on Human Genome Diversity Project (figure)
PCA on European genetic data (figure; Novembre et al., Nature 2008)
Probabilistic interpretation of PCA
  $z_n \overset{iid}{\sim} \mathcal{N}(0, I_K)$
  $p(x_n \mid z_n) = \mathcal{N}(W z_n, \sigma^2 I_M)$
Equivalently, $x_n = W z_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \sigma^2 I_M)$ independent of $z_n$.
Mean:
  $E[x_n \mid z_n] = W z_n$
  $E[x_n] = E[E[x_n \mid z_n]] = E[W z_n] = W E[z_n] = 0$
Covariance:
  $\text{Cov}[x_n] = E[x_n x_n^T] - E[x_n] E[x_n]^T$
  $= E[(W z_n + \epsilon_n)(W z_n + \epsilon_n)^T] - 0$
  $= E[W z_n z_n^T W^T] + E[W z_n \epsilon_n^T] + E[\epsilon_n z_n^T W^T] + E[\epsilon_n \epsilon_n^T]$
  $= W E[z_n z_n^T] W^T + W E[z_n] E[\epsilon_n]^T + E[\epsilon_n] E[z_n]^T W^T + \sigma^2 I_M$
  $= W I_K W^T + 0 + 0 + \sigma^2 I_M$
  $= W W^T + \sigma^2 I_M$
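The mean and covariance derived above can be checked by simulating from the generative model. This is a sketch with made-up parameters: the particular `W`, `sigma2` and sample size are assumptions chosen only so that the empirical moments are close to their theoretical values.

```python
import numpy as np

rng = np.random.default_rng(5)
M, K, N = 4, 2, 200_000                   # hypothetical sizes
sigma2 = 0.5                              # hypothetical noise variance
W = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, -1.0],
              [1.0, 1.0]])                # hypothetical M x K loading matrix

Z = rng.normal(size=(N, K))                              # z_n ~ N(0, I_K)
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, M))     # eps_n ~ N(0, sigma^2 I_M)
X = Z @ W.T + eps                                        # row n is x_n = W z_n + eps_n

empirical_mean = X.mean(axis=0)           # should be close to 0
empirical_cov = X.T @ X / N               # should be close to W W^T + sigma^2 I
model_cov = W @ W.T + sigma2 * np.eye(M)
```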
Probabilistic PCA
Log likelihood: $LL(W, \sigma^2) \equiv \log P(D \mid W, \sigma^2)$
Maximize over $W$ subject to the constraint that the columns of $W$ are orthogonal.
The maximum likelihood estimators are
  $\hat{W}_{ML} = U_K (\Lambda_K - \sigma^2_{ML} I_K)^{1/2}$
  $U_K = [u_1 \ldots u_K]$, $\Lambda_K = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$
  $\sigma^2_{ML} = \frac{1}{M-K} \sum_{j=K+1}^{M} \lambda_j$
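The closed-form estimators can be sketched directly from the eigendecomposition of the sample covariance. The function name `ppca_mle` and the simulated data are assumptions; the square root follows the standard probabilistic PCA result, which determines $W$ only up to a rotation of the latent space, so the check below compares $W W^T$ rather than $W$ itself.

```python
import numpy as np

def ppca_mle(X, K):
    """Closed-form maximum likelihood estimates (W, sigma^2) for probabilistic PCA.
    Requires K < M so that at least one eigenvalue is discarded."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]              # decreasing eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[K:].mean()                        # mean of the M-K discarded eigenvalues
    W = U[:, :K] @ np.diag(np.sqrt(lam[:K] - sigma2))
    return W, sigma2

# Simulate from the model with hypothetical parameters, then re-estimate them
rng = np.random.default_rng(7)
W_true = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0], [0.5, -1.0], [0.0, 0.5]])
sigma2_true = 0.3
N = 200_000
X = rng.normal(size=(N, 2)) @ W_true.T
X = X + np.sqrt(sigma2_true) * rng.normal(size=(N, 5))

W_hat, sigma2_hat = ppca_mle(X, K=2)
```

The columns of `W_hat` are orthogonal but not unit length: each eigenvector is scaled by the square root of its eigenvalue in excess of the noise variance.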
Probabilistic PCA
Computing the MLE:
  Compute the eigenvalues and eigenvectors.
  Hidden/latent variable problem: use EM.
Other advantages of Probabilistic PCA
Can use model selection to infer $K$.
  Choose $K$ to maximize the marginal likelihood $P(D \mid K)$.
  Use cross-validation and pick the $K$ that maximizes the likelihood on held-out data.
  Other model selection criteria such as AIC or BIC (see lecture 6 on clustering).
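The held-out likelihood option can be sketched as follows, using a single train/test split rather than full cross-validation. Everything here is an illustrative assumption: the function names, the simulated data with two true latent dimensions, and the closed-form fit, which uses the marginal distribution $x_n \sim \mathcal{N}(\mu, W W^T + \sigma^2 I_M)$ implied by the model.

```python
import numpy as np

def fit_ppca(X, K):
    """Closed-form probabilistic PCA fit; returns (mu, W, sigma^2). Requires K < M."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / Xc.shape[0]
    lam, U = np.linalg.eigh(C)
    lam, U = lam[::-1], U[:, ::-1]                 # decreasing eigenvalues
    sigma2 = lam[K:].mean()
    W = U[:, :K] * np.sqrt(lam[:K] - sigma2)       # scale each retained eigenvector
    return mu, W, sigma2

def avg_log_likelihood(X, mu, W, sigma2):
    """Average Gaussian log-likelihood under x ~ N(mu, W W^T + sigma^2 I)."""
    M = X.shape[1]
    Sigma = W @ W.T + sigma2 * np.eye(M)
    Xc = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", Xc, np.linalg.solve(Sigma, Xc.T).T)
    return np.mean(-0.5 * (M * np.log(2 * np.pi) + logdet + maha))

rng = np.random.default_rng(8)
M, K_true = 6, 2                                   # hypothetical sizes
W_true = 3.0 * rng.normal(size=(M, K_true))

def sample(n):
    """Draw n points from the model with unit noise variance."""
    return rng.normal(size=(n, K_true)) @ W_true.T + rng.normal(size=(n, M))

X_train, X_test = sample(2000), sample(2000)
scores = {K: avg_log_likelihood(X_test, *fit_ppca(X_train, K)) for K in range(1, M)}
best_K = max(scores, key=scores.get)               # K with the highest held-out likelihood
```

On data like this, too small a $K$ is penalized heavily because the model cannot explain the second strong direction of variance, while larger values of $K$ tend to change the held-out likelihood only slightly.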
Mini-Summary
Dimensionality reduction: linear methods.
  Exploratory analysis and visualization.
  Downstream inference: can use the low-dimensional features for other tasks.
Principal Components Analysis finds a linear subspace that minimizes the reconstruction error or, equivalently, maximizes the variance.
  Eigenvalue problem.
  Probabilistic interpretation also leads to EM.
Why may PCA not be appropriate for genetic data?