PCA and admixture models

PCA and admixture models. CM226: Machine Learning for Bioinformatics, Fall 2016. Sriram Sankararaman. Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price.

Announcements: HW1 solutions posted.

Supervised versus unsupervised learning. Unsupervised learning: learning from unlabeled observations. Dimensionality reduction (last class); other latent variable models (this class), plus a review of PCA.

Outline (current section: Dimensionality reduction): Dimensionality reduction; Linear algebra background; PCA; Practical issues; Probabilistic PCA; Admixture models; Population structure and GWAS.

Raw data can be complex and high-dimensional. If we knew what to measure, we could find simple relationships. Signals have redundancy: for example, genotypes are measured at 500K SNPs, and genotypes at neighboring SNPs are correlated.

Dimensionality reduction. Goal: find a more compact representation of the data. Why? Visualize and discover hidden patterns. Preprocessing for a supervised learning problem. Statistical: remove noise. Computational: reduce wasteful computation.

An example: we measure parent and offspring heights. Two measurements, so the data are points in $\mathbb{R}^2$. How can we find a more compact representation? The two measurements are correlated, with some noise: pick a direction and project onto it.

Goal: minimize reconstruction error. Find the projection that minimizes the Euclidean distance between the original points and their projections. Principal Components Analysis (PCA) solves this problem.

Principal Components Analysis. PCA finds a lower-dimensional representation of the data. Choose $K$. $X$ is the $N \times M$ raw data matrix, and $X \approx Z W^T$, where $Z$ is the $N \times K$ reduced representation (the PC scores) and $W$ is $M \times K$ (its columns are the principal components).
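
A minimal sketch (not from the slides) of these shapes using scikit-learn, with an assumed synthetic $100 \times 10$ data matrix: `fit_transform` returns the $N \times K$ score matrix $Z$, and the transpose of `components_` plays the role of $W$.

```python
# Illustrative sketch: the shapes in X ~ Z W^T, with N = 100, M = 10, K = 2.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # raw data, N x M

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                # PC scores, N x K
W = pca.components_.T                   # principal components as columns, M x K

print(Z.shape, W.shape)                 # (100, 2) (10, 2)
X_hat = Z @ W.T + pca.mean_             # reconstruction in the original space
```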

Outline (current section: Linear algebra background): Dimensionality reduction; Linear algebra background; PCA; Practical issues; Probabilistic PCA; Admixture models; Population structure and GWAS.

Covariance matrix: $C = \frac{1}{N} X^T X$. Generalizes variance to many features: $C_{i,i}$ is the variance of feature $i$; $C_{i,j}$ is the covariance of features $i$ and $j$. $C$ is symmetric.

Covariance matrix $C = \frac{1}{N} X^T X$: positive semi-definite (PSD), sometimes written $C \succeq 0$. Definition (positive semi-definite matrix): a matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite iff $v^T A v \ge 0$ for all $v \in \mathbb{R}^n$.

Covariance matrix C = 1 N XT X Positive semi-definite (PSD). Sometimes indicated as C 0 v T Cv v T X T Xv = (Xv) T Xv n 2 = (Xv) i i=1 PCA and admixture models Linear Algebra background 11 / 57

Covariance matrix $C = \frac{1}{N} X^T X$: all covariance matrices (being symmetric and PSD) have an eigendecomposition.

Eigenvector and eigenvalue. A nonzero vector $v$ is an eigenvector of $A \in \mathbb{R}^{n \times n}$ if $A v = \lambda v$; the scalar $\lambda$ is the eigenvalue associated with $v$.

Eigendecomposition of a covariance matrix. $C$ is symmetric, so its eigenvectors $\{u_i\}$, $i \in \{1, \dots, M\}$, can be chosen to be orthonormal: $u_i^T u_j = 0$ for $i \neq j$ and $u_i^T u_i = 1$. We can also order them so that the eigenvalues are decreasing: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M$.

Eigendecomposition of a covariance matrix. Arrange $U = [u_1 \dots u_M]$. Since $C u_i = \lambda_i u_i$ for $i \in \{1, \dots, M\}$, $C U = C [u_1 \dots u_M] = [C u_1 \dots C u_M] = [\lambda_1 u_1 \dots \lambda_M u_M] = [u_1 \dots u_M] \, \mathrm{diag}(\lambda_1, \dots, \lambda_M) = U \Lambda$.

Eigendecomposition of a covariance matrix. From $C U = U \Lambda$: $U$ is an orthogonal matrix, so $U U^T = I_M$, and therefore $C = C U U^T = U \Lambda U^T$.

Eigendecomposition of a covariance matrix. $C = U \Lambda U^T$, where $U$ is an $M \times M$ orthogonal matrix whose columns are the eigenvectors sorted by eigenvalue, and $\Lambda$ is the diagonal matrix of eigenvalues.
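
As a quick numerical illustration (my own example, not part of the lecture), the sketch below builds a covariance matrix from random centered data and checks that $C = U \Lambda U^T$ with orthonormal $U$ and non-negative eigenvalues.

```python
# Numerical check: C = X^T X / N is symmetric PSD, so C = U Lambda U^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)                      # center the data
C = X.T @ X / X.shape[0]                    # M x M covariance matrix

evals, U = np.linalg.eigh(C)                # eigh is for symmetric matrices
order = np.argsort(evals)[::-1]             # sort eigenvalues in decreasing order
evals, U = evals[order], U[:, order]
Lam = np.diag(evals)

assert np.allclose(U @ Lam @ U.T, C)        # C = U Lambda U^T
assert np.allclose(U.T @ U, np.eye(4))      # columns of U are orthonormal
assert np.all(evals >= -1e-12)              # PSD: eigenvalues are non-negative
```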

Eigendecomposition: example (figure showing a covariance matrix $\Psi$ and its eigendecomposition).

Alternate characterization of eigenvectors: eigenvectors are orthonormal directions of maximum variance, and eigenvalues are the variances in these directions. The first eigenvector is the direction of maximum variance, with variance $\lambda_1$.

Alternate characterization of eigenvectors. Given a covariance matrix $C \in \mathbb{R}^{M \times M}$, consider $x^* = \arg\max_{x : \|x\|_2 = 1} x^T C x$. Solution: $x^* = u_1$, the first eigenvector of $C$. This is an example of a constrained optimization problem. Why do we need the constraint? Without it, $x^T C x$ could be made arbitrarily large by scaling $x$.
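
The following toy check (my own, using an arbitrary PSD matrix) compares $x^T C x$ over many random unit vectors against the top eigenvector $u_1$; no random direction should exceed $\lambda_1 = u_1^T C u_1$.

```python
# Among unit vectors, x^T C x is maximized by the first eigenvector u_1.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
C = A @ A.T                                  # a symmetric PSD matrix

evals, U = np.linalg.eigh(C)
u1, lam1 = U[:, -1], evals[-1]               # eigh sorts ascending: last = largest

best = -np.inf
for _ in range(20000):
    x = rng.normal(size=5)
    x /= np.linalg.norm(x)                   # enforce the constraint ||x||_2 = 1
    best = max(best, x @ C @ x)

print(best, "<=", lam1, "=", u1 @ C @ u1)    # random directions never beat u_1
```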

Outline (current section: PCA): Dimensionality reduction; Linear algebra background; PCA; Practical issues; Probabilistic PCA; Admixture models; Population structure and GWAS.

Back to PCA. Given $N$ data points $x_n \in \mathbb{R}^M$, $n \in \{1, \dots, N\}$, find a linear transformation from a lower-dimensional space ($K < M$), $W \in \mathbb{R}^{M \times K}$, and projections $z_n \in \mathbb{R}^K$, so that we can reconstruct the original data from the lower-dimensional projection: $x_n \approx w_1 z_{n,1} + \dots + w_K z_{n,K} = [w_1 \dots w_K]\,(z_{n,1}, \dots, z_{n,K})^T = W z_n$, with $z_n \in \mathbb{R}^K$. We assume the data is centered: $\sum_n x_{n,m} = 0$ for each feature $m$. Compression: we go from storing $N \times M$ numbers to $M \times K + N \times K$. How do we define the quality of the reconstruction?

PCA. Find $z_n \in \mathbb{R}^K$ and $W \in \mathbb{R}^{M \times K}$ to minimize the reconstruction error $J(W, Z) = \frac{1}{N} \sum_n \|x_n - W z_n\|_2^2$, where $Z = [z_1, \dots, z_N]^T$, and require the columns of $W$ to be orthonormal. The optimal solution is obtained by setting $\hat{W} = U_K$, where $U_K$ contains the $K$ eigenvectors associated with the $K$ largest eigenvalues of the covariance matrix $C$ of $X$. The low-dimensional projection is $\hat{z}_n = \hat{W}^T x_n$.
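
A from-scratch sketch of this recipe, under the assumption that the data have already been centered (function and variable names are mine, not the lecture's).

```python
# W_hat = top-K eigenvectors of C; z_n = W_hat^T x_n; report the reconstruction error.
import numpy as np

def pca_fit(X, K):
    """X: N x M centered data. Returns W_hat (M x K) and Z (N x K)."""
    N = X.shape[0]
    C = X.T @ X / N                          # covariance matrix
    evals, U = np.linalg.eigh(C)
    idx = np.argsort(evals)[::-1][:K]        # indices of the K largest eigenvalues
    W_hat = U[:, idx]                        # M x K, orthonormal columns
    Z = X @ W_hat                            # rows are z_n = W_hat^T x_n
    return W_hat, Z

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # correlated features
X = X - X.mean(axis=0)

W_hat, Z = pca_fit(X, K=2)
err = np.mean(np.sum((X - Z @ W_hat.T) ** 2, axis=1))     # J(W, Z) as above
print("reconstruction error:", err)
```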

PCA: $K = 1$. $J(w_1, z_1) = \frac{1}{N} \sum_n \|x_n - w_1 z_{n,1}\|^2 = \frac{1}{N} \sum_n (x_n - w_1 z_{n,1})^T (x_n - w_1 z_{n,1}) = \frac{1}{N} \sum_n \left( x_n^T x_n - 2 w_1^T x_n z_{n,1} + z_{n,1}^2 w_1^T w_1 \right) = \text{const} + \frac{1}{N} \sum_n \left( -2 w_1^T x_n z_{n,1} + z_{n,1}^2 \right)$, using $w_1^T w_1 = 1$. To minimize this function, take the derivative with respect to $z_{n,1}$: $\frac{\partial J(w_1, z_1)}{\partial z_{n,1}} = 0 \Rightarrow z_{n,1} = w_1^T x_n$.

PCA: $K = 1$. Plugging back $z_{n,1} = w_1^T x_n$: $J(w_1) = \text{const} + \frac{1}{N} \sum_n \left( -2 w_1^T x_n z_{n,1} + z_{n,1}^2 \right) = \text{const} + \frac{1}{N} \sum_n \left( -2 z_{n,1}^2 + z_{n,1}^2 \right) = \text{const} - \frac{1}{N} \sum_n z_{n,1}^2$. Now, because the data is centered, $E[z_1] = \frac{1}{N} \sum_n z_{n,1} = \frac{1}{N} \sum_n w_1^T x_n = w_1^T \left( \frac{1}{N} \sum_n x_n \right) = 0$.

PCA: $K = 1$. $J(w_1) = \text{const} - \frac{1}{N} \sum_n z_{n,1}^2$, and $\mathrm{Var}[z_1] = E[z_1^2] - E[z_1]^2 = \frac{1}{N} \sum_n z_{n,1}^2 - 0 = \frac{1}{N} \sum_n z_{n,1}^2$.

PCA: $K = 1$. Putting it together: $J(w_1) = \text{const} - \frac{1}{N} \sum_n z_{n,1}^2$ and $\mathrm{Var}[z_1] = \frac{1}{N} \sum_n z_{n,1}^2$, so $J(w_1) = \text{const} - \mathrm{Var}[z_1]$. Two views of PCA: find the direction that minimizes the reconstruction error, or find the direction that maximizes the variance of the projected data; $\arg\min_{w_1} J(w_1) = \arg\max_{w_1} \mathrm{Var}[z_1]$.
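
The identity $J(w_1) = \text{const} - \mathrm{Var}[z_1]$ can be checked numerically; the sketch below (my own, with the constant taken to be $\frac{1}{N}\sum_n \|x_n\|^2$ as in the derivation above) verifies it for an arbitrary unit-norm direction.

```python
# For any unit vector w: reconstruction error = const - Var[z_1], with Var[z_1] = w^T C w.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
N = X.shape[0]
C = X.T @ X / N
const = np.mean(np.sum(X ** 2, axis=1))      # (1/N) sum_n ||x_n||^2

w = rng.normal(size=5)
w /= np.linalg.norm(w)                       # any unit-norm direction
z = X @ w                                    # z_{n,1} = w^T x_n
recon_err = np.mean(np.sum((X - np.outer(z, w)) ** 2, axis=1))

assert np.isclose(recon_err, const - w @ C @ w)   # J(w) = const - Var[z_1]
```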

PCA: $K = 1$. $\arg\min_{w_1} J(w_1) = \arg\max_{w_1} \mathrm{Var}[z_1]$, and $\mathrm{Var}[z_1] = \frac{1}{N} \sum_n z_{n,1}^2 = \frac{1}{N} \sum_n (w_1^T x_n)(w_1^T x_n) = \frac{1}{N} \sum_n w_1^T x_n x_n^T w_1 = w_1^T \left( \frac{1}{N} \sum_n x_n x_n^T \right) w_1 = w_1^T C w_1$.

PCA: $K = 1$. So we need to solve $\arg\max_{w_1} w_1^T C w_1$. Since we required the columns of $W$ to be orthonormal, we add the constraint $\|w_1\|_2 = 1$. This objective is maximized when $w_1$ is the first eigenvector of $C$.

PCA: $K > 1$. We can repeat the argument for $K > 1$: since we require the directions $w_k$ to be orthonormal, we repeatedly search for the direction that maximizes the remaining variance and is orthogonal to the previously selected directions.

Computing eigendecompositions. Numerical algorithms that compute all eigenvalues and eigenvectors cost $O(M^3)$, which is infeasible for genetic datasets. Computing the largest eigenvalue and eigenvector with power iteration costs $O(M^2)$ per iteration. Since we are interested in covariance matrices, we can instead use algorithms that compute the singular value decomposition (SVD) of $X$: $O(MN^2)$ (will discuss later).
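
A rough sketch of both routes (iteration counts and problem sizes are arbitrary choices of mine): power iteration on $C$ for the top eigenvector, and an SVD of $X$, whose right singular vectors are the eigenvectors of $C$ with eigenvalues $S^2 / N$.

```python
# Power iteration for the top eigenvector, and the SVD route that avoids eig(C).
import numpy as np

def power_iteration(C, n_iter=1000):
    v = np.random.default_rng(5).normal(size=C.shape[0])
    for _ in range(n_iter):
        v = C @ v                         # O(M^2) per iteration
        v /= np.linalg.norm(v)
    return v, v @ C @ v                   # eigenvector and Rayleigh quotient (eigenvalue)

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 50))
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]

v, lam = power_iteration(C)

# SVD of X: X = U S V^T, so C = X^T X / N = V (S^2 / N) V^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(lam, (S[0] ** 2) / X.shape[0])      # top eigenvalue from both routes
print(abs(v @ Vt[0]))                     # ~1: same direction up to sign
```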

Practical issues: choosing $K$. For visualization, $K = 2$ or $K = 3$. For other analyses, pick $K$ so that most of the variance in the data is retained. The fraction of variance retained by the top $K$ eigenvectors is $\frac{\sum_{k=1}^{K} \lambda_k}{\sum_{m=1}^{M} \lambda_m}$.
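
A short sketch of this rule (my own example; the 90% threshold is an arbitrary illustrative choice).

```python
# Choose the smallest K whose top-K eigenvalues retain at least 90% of the variance.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 20)) @ rng.normal(size=(20, 20))
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]

evals = np.sort(np.linalg.eigvalsh(C))[::-1]       # eigenvalues, decreasing
frac = np.cumsum(evals) / np.sum(evals)            # fraction retained by the top K
K = int(np.searchsorted(frac, 0.90) + 1)           # smallest K retaining 90%
print(frac[:5], "->", K)
```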

PCA: example (figure).

PCA on HapMap data (figure).

PCA on the Human Genome Diversity Project (figure).

PCA on European genetic data (figure; Novembre et al., Nature 2008).

Probabilistic interpretation of PCA. Model: $z_n \overset{\text{iid}}{\sim} N(0, I_K)$ and $p(x_n \mid z_n) = N(W z_n, \sigma^2 I_M)$.

Probabilistic interpretation of PCA. Under this model, $E[x_n \mid z_n] = W z_n$, so $E[x_n] = E[E[x_n \mid z_n]] = E[W z_n] = W E[z_n] = 0$.

Probabilistic interpretation of PCA. Writing $x_n = W z_n + \epsilon_n$ with $\epsilon_n \sim N(0, \sigma^2 I_M)$ independent of $z_n$: $\mathrm{Cov}[x_n] = E[x_n x_n^T] - E[x_n] E[x_n]^T = E[(W z_n + \epsilon_n)(W z_n + \epsilon_n)^T] - 0 = E[W z_n z_n^T W^T] + E[2 W z_n \epsilon_n^T] + E[\epsilon_n \epsilon_n^T] = W E[z_n z_n^T] W^T + 2 W E[z_n] E[\epsilon_n]^T + \sigma^2 I_M = W I_K W^T + 0 + \sigma^2 I_M = W W^T + \sigma^2 I_M$.
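
A quick simulation (my own sanity check of the model above): draw $z_n$ and $\epsilon_n$, form $x_n = W z_n + \epsilon_n$, and compare the empirical covariance to $W W^T + \sigma^2 I_M$.

```python
# Sample from the probabilistic PCA model and check Cov[x_n] ~ W W^T + sigma^2 I_M.
import numpy as np

rng = np.random.default_rng(8)
M, K, sigma2, N = 5, 2, 0.5, 200000
W = rng.normal(size=(M, K))

Z = rng.normal(size=(N, K))                        # z_n ~ N(0, I_K)
E = rng.normal(scale=np.sqrt(sigma2), size=(N, M)) # eps_n ~ N(0, sigma^2 I_M)
X = Z @ W.T + E                                    # x_n = W z_n + eps_n

emp_cov = X.T @ X / N                              # empirical covariance (population mean is 0)
print(np.max(np.abs(emp_cov - (W @ W.T + sigma2 * np.eye(M)))))   # small
```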

Probabilistic PCA. Log likelihood: $LL(W, \sigma^2) \equiv \log P(D \mid W, \sigma^2)$. Maximize over $W$ subject to the constraint that the columns of $W$ are orthonormal. The maximum likelihood estimators are $\hat{W}_{ML} = U_K (\Lambda_K - \sigma^2 I_K)^{1/2}$, where $U_K = [u_1 \dots u_K]$ and $\Lambda_K = \mathrm{diag}(\lambda_1, \dots, \lambda_K)$, and $\hat{\sigma}^2_{ML} = \frac{1}{M - K} \sum_{j=K+1}^{M} \lambda_j$.
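
A sketch of these closed-form estimates (my own code; note the MLE is only determined up to a rotation of the columns of $W$, which this ignores).

```python
# Closed-form PPCA ML estimates from the eigendecomposition of the covariance matrix.
import numpy as np

def ppca_mle(X, K):
    """X: N x M centered data. Returns W_ML (M x K) and sigma2_ML."""
    N, M = X.shape
    C = X.T @ X / N
    evals, U = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    sigma2 = evals[K:].mean()                            # average of the discarded eigenvalues
    W = U[:, :K] @ np.diag(np.sqrt(evals[:K] - sigma2))  # U_K (Lambda_K - sigma^2 I_K)^{1/2}
    return W, sigma2

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))
X = X - X.mean(axis=0)
W_ml, s2_ml = ppca_mle(X, K=3)
print(W_ml.shape, s2_ml)
```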

Probabilistic PCA: computing the MLE. Either compute the eigenvalues and eigenvectors directly, or treat it as a hidden/latent variable problem and use EM.

Other advantages of probabilistic PCA. We can use model selection to infer $K$: choose $K$ to maximize the marginal likelihood $P(D \mid K)$; use cross-validation and pick the $K$ that maximizes the likelihood on held-out data; or use other model selection criteria such as AIC or BIC (see lecture 6 on clustering).
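
One possible sketch of the cross-validation approach, assuming scikit-learn's `PCA`, whose `score` method returns the average held-out log-likelihood under the probabilistic PCA model (the synthetic data and the range of candidate $K$ are my own choices).

```python
# Pick K by cross-validated held-out log-likelihood under the PPCA model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
W_true = rng.normal(size=(20, 3))                       # true latent dimension is 3
X = rng.normal(size=(500, 3)) @ W_true.T + 0.3 * rng.normal(size=(500, 20))

scores = {K: cross_val_score(PCA(n_components=K), X, cv=5).mean()
          for K in range(1, 9)}
best_K = max(scores, key=scores.get)
print(best_K)                                           # often recovers K = 3
```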

Mini-summary. Dimensionality reduction: linear methods. Useful for exploratory analysis and visualization, and for downstream inference (the low-dimensional features can be used for other tasks). Principal Components Analysis finds a linear subspace that minimizes the reconstruction error or, equivalently, maximizes the variance; this is an eigenvalue problem. The probabilistic interpretation also leads to an EM algorithm. Why might PCA not be appropriate for genetic data?