Principal components analysis with snpStats

David Clayton

October 30, 2018

Principal components analysis has been widely used in population genetics in order to study population structure in genetically heterogeneous populations. More recently, it has been proposed as a method for dealing with the problem of confounding by population structure in genome-wide association studies.

The maths

Usually, principal components analysis is carried out by calculating the eigenvalues and eigenvectors of the correlation matrix. With N cases and P variables, if we write X for the N x P matrix which has been standardised so that columns have zero mean and unit standard deviation, we find the eigenvalues and eigenvectors of the P x P matrix X^T.X (which is N or (N-1) times the correlation matrix, depending on which denominator was used when calculating standard deviations). The first eigenvector gives the loadings of each variable in the first principal component, the second eigenvector gives the loadings in the second component, and so on. Writing the first C component loadings as columns of the P x C matrix B, the N x C matrix of subjects' principal component scores, S, is obtained by applying the factor loadings to the original data matrix, i.e. S = X.B. The sum of squares and products matrix, S^T.S = D, is diagonal with elements equal to the first C eigenvalues of the X^T.X matrix, so that the variances of the principal components can be obtained by dividing the eigenvalues by N or (N-1).

This standard method is rarely feasible for genome-wide data since P is very large indeed and calculating the eigenvectors of X^T.X becomes impossibly onerous. However, the calculations can also be carried out by calculating the eigenvalues and eigenvectors of the N x N matrix X.X^T. The (non-zero) eigenvalues of this matrix are the same as those of X^T.X, and its eigenvectors are proportional to the principal component scores defined above; writing the first C eigenvectors of X.X^T as the columns of the N x C matrix U, then U = S.D^{-1/2}. Since for many purposes we are not too concerned about the scaling of the principal components, it will often be acceptable to use the eigenvectors, U, in place of the more conventionally scaled principal components. However, some attention should be paid to the corresponding eigenvalues since, as noted above, these are proportional to the variances of the conventional principal components. The factor loadings may be calculated by B = X^T.U.D^{-1/2}. Using this method of calculation, it is only (!) necessary to find the eigenvalues and eigenvectors of an N x N matrix. Current microarray-based genotyping studies are such that N is typically a few thousands while P may be in excess of one million.

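These identities are easy to verify on a small simulated data matrix. The following is a minimal sketch using only base R, with toy dimensions that have nothing to do with the SNP data:

> set.seed(1)
> N <- 20; P <- 100; C <- 3
> X <- scale(matrix(rnorm(N*P), N, P))                  # standardised N x P matrix
> eig <- eigen(X %*% t(X), symmetric=TRUE)              # the small N x N problem
> U <- eig$vectors[, 1:C]
> D <- diag(eig$values[1:C])
> B <- t(X) %*% U %*% diag(1/sqrt(eig$values[1:C]))     # loadings: B = X^T.U.D^{-1/2}
> S <- X %*% B                                          # scores: S = X.B
> all.equal(S, U %*% sqrt(D), check.attributes=FALSE)   # S = U.D^{1/2}, i.e. U = S.D^{-1/2}
> all.equal(eig$values[1:C], eigen(crossprod(X), symmetric=TRUE)$values[1:C])

Both all.equal calls should return TRUE: the leading eigenvalues of the small N x N problem coincide with those of the P x P problem, and the scores computed from the loadings are just the eigenvectors rescaled by the square roots of the eigenvalues.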
An example

In this exercise, we shall calculate principal component loadings in controls alone and then apply these loadings to the whole data. This is more complicated than the simpler procedure of calculating principal components in the entire dataset, but it avoids component loadings which unduly reflect case/control differences; using such components to correct for population structure would seriously reduce the power to detect association, since one would, to some extent, be correcting for case/control differences [1]. We will also thin the data by taking only every tenth SNP. We do this mainly to reduce computation time, but thinning is often employed to minimize the impact of linkage disequilibrium (LD), reducing the risk that the larger components may simply reflect unusually long stretches of LD rather than population structure. Of course, this would require a more sophisticated approach to thinning than that used in this demonstration; for example, one might use the output of snp.imputation to eliminate all but one of a group of SNPs in strong LD.

[1] An alternative approach is to standardise the X matrix so that each column has zero mean in both cases and controls. This can be achieved by using the strata argument in the call to xxt. Here, however, we have used controls only, since this reduces the size of the matrix for the eigenvalue and eigenvector calculations.

We shall use the data introduced in the main vignette. We shall first load the data and extract the controls.

> require(snpStats)
> data(for.exercise)
> controls <- rownames(subject.support)[subject.support$cc==0]
> use <- seq(1, ncol(snps.10), 10)
> ctl.10 <- snps.10[controls,use]

The next step is to standardize the data to zero mean and unit standard deviation and to calculate the X.X^T matrix. These operations are carried out using the function xxt.

> xxmat <- xxt(ctl.10, correct.for.missing=FALSE)

The argument correct.for.missing=FALSE selects a very simple missing data treatment, i.e. replacing missing values by their mean. This is quite adequate when the proportion of missing data is low. The default method is more sophisticated but introduces complications later, so we will keep it simple here.

When performing a genome-wide analysis, it will usually be the case that all the data cannot be stored in a single SnpMatrix object. Usually they will be organized with one matrix for each chromosome. In these cases, it is straightforward to write a script which carries out the above calculations for each chromosome in turn, saving the resultant matrix to disk each time. When all chromosomes have been processed, the X.X^T matrices are read back and added together.

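The following is a minimal sketch of that per-chromosome strategy. The file and object names (chr1.RData, ..., chr22.RData, each containing a SnpMatrix of control genotypes called ctl.snps) are purely illustrative, the matrices are assumed to hold the same subjects in the same order, and for brevity the running sum is accumulated in memory rather than saved to disk chromosome by chromosome:

> xxmat <- NULL
> for (chr in 1:22) {
+   load(paste0("chr", chr, ".RData"))           # loads the hypothetical object ctl.snps
+   use <- seq(1, ncol(ctl.snps), 10)            # same thinning as above
+   part <- xxt(ctl.snps[, use], correct.for.missing=FALSE)
+   xxmat <- if (is.null(xxmat)) part else xxmat + part
+ }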
The next step of the calculation requires us to calculate the eigenvalues and eigenvectors of the X.X^T matrix. This can be carried out using a standard R function. We will save the first five components.

> evv <- eigen(xxmat, symmetric=TRUE)
> pcs <- evv$vectors[,1:5]
> evals <- evv$values[1:5]
> evals

[1] 167181   8890   8709   8504   8379

Here, pcs refers to the scaled principal components (i.e. to the columns of the matrix U in our mathematical introduction), and all have the same variance. The eigenvalues give an idea of the relative magnitude of these sources of variation. The first principal component has a markedly larger eigenvalue, and we might hope that this reflects population structure. In fact these data were drawn from two very different populations, as indicated by the stratum variable in the subject support frame. The next set of commands extracts this variable for the controls and plots box plots of the first two components by stratum.

> pop <- subject.support[controls,"stratum"]
> par(mfrow=c(1,2))
> boxplot(pcs[,1]~pop)
> boxplot(pcs[,2]~pop)

[Figure: box plots of the first and second principal components by stratum (CEU and JPT+CHB)]

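The same separation can be seen in a scatter plot of the first two components, coloured by stratum; a minimal sketch using base graphics only:

> plot(pcs[,1], pcs[,2], col=as.numeric(pop), xlab="PC 1", ylab="PC 2")  # one colour per stratum
> legend("topright", legend=levels(pop), col=1:nlevels(pop), pch=1)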
Clearly the first component has captured the difference between the populations. Equally clearly, the second principal component has not.

The next step in the calculation is to obtain the SNP loadings in the components. This requires calculation of B = X^T.U.D^{-1/2}. Here we calculate the transpose of this matrix, B^T = D^{-1/2}.U^T.X, using the special function snp.pre.multiply, which pre-multiplies a SnpMatrix object by a matrix after first standardizing it to zero mean and unit standard deviation.

> btr <- snp.pre.multiply(ctl.10, diag(1/sqrt(evals)) %*% t(pcs))

We can now apply these loadings back to the entire dataset (cases as well as controls) to derive scores that we can use to correct for population structure. To do that we use the function snp.post.multiply, which post-multiplies a SnpMatrix by a general matrix, after first standardizing the columns of the SnpMatrix to zero mean and unit standard deviation. Note that it is first necessary to select out those SNPs that we have actually used in the calculation of the components.

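Before applying the loadings to the full dataset, one optional sanity check is to apply them back to the controls: this should reproduce the control eigenvectors up to scaling by the square roots of the eigenvalues, so the matched columns should be almost perfectly correlated (only approximately so, given the simple missing-data treatment). A minimal sketch:

> ctl.scores <- snp.post.multiply(ctl.10, t(btr))   # X.B for the controls, i.e. approximately U.D^{1/2}
> round(diag(cor(ctl.scores, pcs)), 3)              # each column should correlate close to 1 with U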
> pcs <- snp.post.multiply(snps.10[,use], t(btr))

Finally, we shall evaluate how successful the first principal component is in correcting for population structure effects. (snp.rhs.tests returns glm objects.)

> cc <- subject.support$cc
> uncorrected <- single.snp.tests(cc, snp.data=snps.10)
> corrected <- snp.rhs.tests(cc~pcs[,1], snp.data=snps.10)
> par(mfrow=c(1,2), cex.sub=0.85)
> qq.chisq(chi.squared(uncorrected, 1), df=1)

        N   omitted    lambda
28497.000     0.000     1.712

> qq.chisq(chi.squared(corrected), df=1)

        N   omitted    lambda
28497.000     0.000     1.009

[Figure: QQ plots of the observed test statistics against the expected chi-squared (1 df) distribution, uncorrected (left) and corrected (right)]

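For comparison with the closing remarks below, the same over-dispersion check can be run after stratifying by the known stratum variable (using the stratum argument of single.snp.tests, with the subject support frame from the main vignette), or after adjusting for more than one component; a minimal sketch:

> stratified <- single.snp.tests(cc, stratum, data=subject.support, snp.data=snps.10)
> qq.chisq(chi.squared(stratified, 1), df=1)
> ## Adjusting for the first two principal components instead of one:
> corrected2 <- snp.rhs.tests(cc~pcs[,1]+pcs[,2], snp.data=snps.10)
> qq.chisq(chi.squared(corrected2), df=1)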
The use of the first principal component as a covariate has been quite successful in reducing the serious over-dispersion due to population structure. Indeed, it is just as successful as stratification by the observed stratum variable.

Acknowledgement

Thanks to David Poznik for pointing out an error in earlier versions of this vignette.