Asymptotic distribution of the largest eigenvalue with application to genetic data

Asymptotic distribution of the largest eigenvalue with application to genetic data Chong Wu University of Minnesota September 30, 2016 T32 Journal Club Chong Wu 1 / 25

Table of Contents
1. Background
   - Gene-gene interaction
   - Population Stratification
2. Theory (Johnstone, 2009)
3. Application
   - Gene-Gene Interaction
   - Population Structure
4. Discussion

Gene-gene interaction
- Gene-gene interactions play an important role in many complex diseases
- Studying them improves our understanding of genetic regulation
- Only a few replicable human gene-gene interactions have been found
- Reasons: poor power, confounding, measurement error, etc.
- Power can be improved by global testing

Gene-gene interaction
- Only binary phenotypes, such as disease status, are studied
- Notation:
  - $Y_{n \times 1}$: phenotype
  - $G_{n \times p}$: genetic markers, such as SNPs
  - $C_{n \times q}$: covariates
  - $P_1$, $P_0$: population partial correlation matrices for cases ($Y = 1$) and controls ($Y = 0$), conditional on $C$
- Global test: $H_0: P_1 = P_0$ versus $H_A: P_1 \neq P_0$

Gene-gene interaction
- Idea: if $P_1 = P_0$, the largest eigenvalues of $P_1$ and $P_0$ are equal
- If we know the distribution of the largest eigenvalue, we can calculate a p-value
- We will revisit this problem later

Population structure
- Humans originally spread out many thousands of years ago
- Migration and genetic drift led to genetic diversity
- Inferring population structure is an important step in a genetic study

Inferring population structure with PCA
- PCA is the most widely used method
- Apply PCA to the genotype data and take the top principal components (PCs)
- PCs explain differences among samples
- The top PCs often reflect genetic variation due to ancestry in the sample
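The PCA step can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration: the function name `top_pcs`, the two hypothetical populations and their allele frequencies, and the choice to scale each SNP by the square root of f(1 - f), where f is its estimated allele frequency. It is not the analysis from the talk.

```python
import numpy as np

def top_pcs(genotypes, k=2):
    """Return the top-k principal component scores of an n x p genotype matrix."""
    # Center each SNP column by its mean allele count.
    col_mean = genotypes.mean(axis=0)
    freq = col_mean / 2.0
    # Scale by sqrt(f(1 - f)); the small floor avoids division by zero.
    scale = np.sqrt(np.maximum(freq * (1.0 - freq), 1e-8))
    z = (genotypes - col_mean) / scale
    # SVD of the standardized matrix: columns of u * s are the PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :k] * s[:k]

rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 500
# Hypothetical allele frequencies for two populations (illustrative only).
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
])
labels = np.repeat([0, 1], n_per_pop)

pcs = top_pcs(geno, k=2)
pc1_label_corr = abs(np.corrcoef(pcs[:, 0], labels)[0, 1])
print(f"|corr(PC1, population label)| = {pc1_label_corr:.2f}")
```

With frequencies this different, the first PC should track the population label closely. With a single homogeneous population, the same code would return PCs that reflect only sampling noise.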

Inferring population structure with PCA
- Before applying PCA, one often wishes to determine whether the samples come from a population that has structure
- If not, the PCs probably capture noise and decrease power
- We are statisticians, so we use formal hypothesis testing

Inferring population structure with PCA
- Idea: if the top PCs correspond to large eigenvalues, we expect nonrandom population structure
- Problem: how large is large?
- Both problems above (gene-gene interaction and population structure) can be addressed by studying the distribution of the largest eigenvalue of a random matrix

Theory
- $x_1, \dots, x_n \sim N_p(\mu, \Sigma)$
- $X_{n \times p} = (x_1, \dots, x_n)^\top$
- $A = X^\top X$ follows a Wishart distribution, $A \sim W_p(\Sigma, n)$
- For $p = 1$, $A \sim \sigma^2 \chi^2_{(n)}$

Definition ($\theta$): Let $A \sim W_p(I, m)$ be independent of $B \sim W_p(I, n)$, where $m \geq p$. Then the largest eigenvalue $\theta$ of $(A + B)^{-1} B$ is called the greatest root statistic, and its distribution is denoted $\theta(p, m, n)$.

Theory
- $W(p, m, n) = \log\left(\dfrac{\theta(p, m, n)}{1 - \theta(p, m, n)}\right)$
- $\dfrac{W - \mu(p, m, n)}{\sigma(p, m, n)} \Rightarrow F_1$ (Tracy-Widom)
- $F_1$ is asymmetric, with an exponentially decaying tail

Figure: Density of the Tracy-Widom distribution $F_1$
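The greatest root statistic and its logit transform can be computed directly, as in the sketch below. The helper names `wishart_identity` and `greatest_root` and the values of p, m, and n are illustrative assumptions. The centering and scaling constants μ(p, m, n) and σ(p, m, n) are not reproduced here.

```python
import numpy as np

def wishart_identity(rng, p, df):
    """Draw from W_p(I, df) as X'X with X a df x p standard normal matrix."""
    x = rng.standard_normal((df, p))
    return x.T @ x

def greatest_root(a, b):
    """Largest eigenvalue of (A + B)^{-1} B for positive definite A + B."""
    # solve(A + B, B) forms (A + B)^{-1} B without an explicit inverse.
    eigvals = np.linalg.eigvals(np.linalg.solve(a + b, b))
    # The eigenvalues are real in exact arithmetic; drop round-off imaginary parts.
    return float(np.max(eigvals.real))

rng = np.random.default_rng(1)
p, m, n = 5, 40, 30  # illustrative values satisfying m >= p
a = wishart_identity(rng, p, m)
b = wishart_identity(rng, p, n)

theta = greatest_root(a, b)
w = np.log(theta / (1.0 - theta))  # logit transform W = log(theta / (1 - theta))
print(f"theta = {theta:.3f}, W = {w:.3f}")
```

Because θ always lies strictly between 0 and 1 when A and B are positive definite, the logit maps it onto the whole real line, which is the scale on which the Tracy-Widom approximation is stated.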

Theory
Theorem 1: Suppose that independent samples from two normal distributions $N_p(\mu_1, \Sigma_1)$ and $N_p(\mu_2, \Sigma_2)$ lead to covariance estimates $S_i$ with $n_i S_i \sim W_p(\Sigma_i, n_i)$ for $i = 1, 2$. Then the largest root test of the null hypothesis $H_0: \Sigma_1 = \Sigma_2$ is based on the largest eigenvalue $\theta$ of $(n_1 S_1 + n_2 S_2)^{-1} n_2 S_2$.

Note:
- We assume a normal distribution
- $p \leq n_1$
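A minimal sketch of the largest root test follows. One substitution should be stated plainly: instead of the Tracy-Widom approximation, the p-value is calibrated by Monte Carlo simulation of the null distribution, which is valid under normality because θ does not depend on the common covariance under H0. The function names, sample sizes, and simulated data are all assumptions for illustration. The p-value is upper-tailed, so it is sensitive to Σ2 being larger than Σ1 in some direction.

```python
import numpy as np

def largest_root(s1, n1, s2, n2):
    """Largest eigenvalue of (n1*S1 + n2*S2)^{-1} (n2*S2)."""
    a, b = n1 * s1, n2 * s2
    return float(np.max(np.linalg.eigvals(np.linalg.solve(a + b, b)).real))

def sample_cov(x):
    """Covariance estimate S and its degrees of freedom (rows - 1)."""
    return np.cov(x, rowvar=False), x.shape[0] - 1

def monte_carlo_pvalue(theta_obs, p, n1, n2, n_sim=2000, seed=0):
    """Upper-tail p-value from simulated null draws of theta(p, n1, n2)."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_sim)
    for i in range(n_sim):
        x1 = rng.standard_normal((n1, p))
        x2 = rng.standard_normal((n2, p))
        # Identity covariance suffices: theta is invariant to the common Sigma.
        null[i] = largest_root(x1.T @ x1 / n1, n1, x2.T @ x2 / n2, n2)
    return float((1 + np.sum(null >= theta_obs)) / (n_sim + 1))

rng = np.random.default_rng(3)
p = 4
group1 = rng.standard_normal((60, p))        # covariance I
group2 = 3.0 * rng.standard_normal((80, p))  # covariance 9 * I (clear alternative)

s1, n1 = sample_cov(group1)
s2, n2 = sample_cov(group2)
theta_obs = largest_root(s1, n1, s2, n2)
p_value = monte_carlo_pvalue(theta_obs, p, n1, n2)
print(f"theta = {theta_obs:.3f}, Monte Carlo p-value = {p_value:.4f}")
```

The simulation trades speed for freedom from asymptotic constants. For large p or many tests, the Tracy-Widom approximation described in the talk would be far cheaper.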

Theory
Theorem 2 (Johnstone, 2001): $A = X^\top X$ is a Wishart matrix. Let $\{\lambda_i\}_{1 \leq i \leq p}$ be the eigenvalues of $A$. After centering and scaling, the distribution of the largest eigenvalue $\lambda_1$ is approximately Tracy-Widom.

Note:
- We assume an independent normal distribution for each $X_i$, $1 \leq i \leq n$
- $n/p \to \gamma \geq 1$, or $n < p$ with both $n$ and $p$ large
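A small simulation can make the approximation concrete. The sketch below uses the centering and scaling constants from Johnstone (2001) for an n x p matrix with independent standard normal entries: μ = (√(n-1) + √p)² and σ = (√(n-1) + √p)(1/√(n-1) + 1/√p)^(1/3). The function name `johnstone_standardize` and the values of n, p, and the number of replicates are illustrative assumptions.

```python
import numpy as np

def johnstone_standardize(lam1, n, p):
    """Center and scale the largest eigenvalue of X'X for an n x p standard normal X."""
    root_sum = np.sqrt(n - 1) + np.sqrt(p)
    mu = root_sum ** 2
    sigma = root_sum * (1.0 / np.sqrt(n - 1) + 1.0 / np.sqrt(p)) ** (1.0 / 3.0)
    return (lam1 - mu) / sigma

rng = np.random.default_rng(2)
n, p, n_sim = 100, 50, 500
standardized = np.empty(n_sim)
for i in range(n_sim):
    x = rng.standard_normal((n, p))
    # The largest eigenvalue of X'X is the squared largest singular value of X.
    lam1 = np.linalg.svd(x, compute_uv=False)[0] ** 2
    standardized[i] = johnstone_standardize(lam1, n, p)

sim_mean = standardized.mean()
sim_sd = standardized.std(ddof=1)
print(f"mean = {sim_mean:.2f}, sd = {sim_sd:.2f}")
```

The Tracy-Widom F1 distribution has mean of about -1.21 and standard deviation of about 1.27, so the simulated summary statistics should land near those values if the approximation is working at this n and p.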

Gene-Gene Interaction: Methods (GET)
- $S_0$, $S_1$: sample partial correlation matrices whose elements contain the correlation between each pair of genetic markers, conditional on the values of the covariates $C$
- $n = \sum_{i=1}^{n} I(Y_i = 1)$, $d = \sum_{i=1}^{n} I(Y_i = 0)$
- By Theorem 1, the largest eigenvalue of $(d S_1 + (n - d) S_0)^{-1} (n - d) S_0$ follows a Tracy-Widom distribution asymptotically
- Calculate the p-value based on the asymptotic Tracy-Widom distribution

Gene-Gene Interaction: Real Data Analysis
- Analyze GWAS data from the GLAUGEN study
- Aim: characterize genetic markers and gene-environment interactions associated with primary open-angle glaucoma
- After QC: 976 cases, 1,136 controls, and 200,432 SNPs
- For each phenotype:
  - Perform filtering and select only 100 SNPs ($n/p \approx 20$)
  - Test SNP-SNP interactions

Gene-Gene Interaction: Real Data Analysis [results figure]

Population Structure: Methods (Patterson, 2006)
- By Theorem 2, test whether the largest eigenvalue of $A$ is significant
- Linkage disequilibrium among SNPs reduces the effective number of markers, but this can be adjusted for

Population Structure: Results
Figure: P-P plot corresponding to a sample size of n = 200 and p = 50,000 markers.

Discussion: Normal distribution
- Genetic data ($G$) do not follow a normal distribution
- Theorem 2 can still be applied if the high-order moments of each cell are no greater than those of the normal distribution (Soshnikov, 2002)
- For Theorem 1, there is no theoretical guarantee if $G$ does not follow a normal distribution

Discussion: Tracy-Widom approximation
- Conservative in nearly all cases
- Can be used for initial screening

Take-home message: after suitable centering and scaling, the largest eigenvalue of a Wishart-type random matrix follows a Tracy-Widom distribution asymptotically, and this result can be applied in genetic data analysis.

References
- Johnstone, Iain M. Approximate null distribution of the largest root in multivariate analysis. The Annals of Applied Statistics 3.4 (2009): 1616.
- Johnstone, Iain M. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics (2001): 295-327.
- Frost, H. Robert, Christopher I. Amos, and Jason H. Moore. A global test for gene-gene interactions based on random matrix theory. Genetic Epidemiology (2016).
- Patterson, Nick, Alkes L. Price, and David Reich. Population structure and eigenanalysis. PLoS Genetics 2.12 (2006): e190.