Methods for Cryptic Structure. Methods for Cryptic Structure

Similar documents
Case-Control Association Testing. Case-Control Association Testing

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Asymptotic distribution of the largest eigenvalue with application to genetic data

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Genetic Association Studies in the Presence of Population Structure and Admixture

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Overview of clustering analysis. Yuehua Cui

p(d g A,g B )p(g B ), g B

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Package AssocTests. October 14, 2017

(Genome-wide) association analysis

1. Understand the methods for analyzing population structure in genomes

An Efficient and Accurate Graph-Based Approach to Detect Population Substructure

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Computational Systems Biology: Biology X

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Interpreting principal components analyses of spatial population genetic variation

Implicit Causal Models for Genome-wide Association Studies

A Unified Approach for Quantifying, Testing and Correcting Population Stratification in Case-Control Association Studies

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

PCA, admixture proportions and SFS for low depth NGS data. Anders Albrechtsen

A Robust Test for Two-Stage Design in Genome-Wide Association Studies

EFFICIENT COMPUTATION WITH A LINEAR MIXED MODEL ON LARGE-SCALE DATA SETS WITH APPLICATIONS TO GENETIC STUDIES

Covariance between relatives

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

Linkage and Linkage Disequilibrium

Calculation of IBD probabilities

Case-Control Association Testing with Related Individuals: A More Powerful Quasi-Likelihood Score Test

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Calculation of IBD probabilities

Multidimensional heritability analysis of neuroanatomical shape. Jingwei Li

Research Statement on Statistics Jun Zhang

Association studies and regression

Package LBLGXE. R topics documented: July 20, Type Package

Lecture 9. QTL Mapping 2: Outbred Populations

An analytical comparison of the principal component method and the mixed effects model for genetic association studies

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

arxiv: v1 [stat.ap] 22 Sep 2009

PCA and admixture models

G E INTERACTION USING JMP: AN OVERVIEW

Introduction to Quantitative Genetics. Introduction to Quantitative Genetics

SUPPLEMENTARY INFORMATION

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Conditions under which genome-wide association studies will be positively misleading

FaST Linear Mixed Models for Genome-Wide Association Studies

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

Pedigree and genomic evaluation of pigs using a terminal cross model

Genotype Imputation. Biostatistics 666

I Have the Power in QTL linkage: single and multilocus analysis

Mathematical models in population genetics II

A Short Course in Basic Statistics

Bayesian Regression (1/31/13)

The Quantitative TDT

Latent Variable Methods for the Analysis of Genomic Data

25 : Graphical induced structured input/output models

A SPECTRAL GRAPH APPROACH TO DISCOVERING GENETIC ANCESTRY 1

Variance Component Models for Quantitative Traits. Biostatistics 666

ARTICLE MASTOR: Mixed-Model Association Mapping of Quantitative Traits in Samples with Related Individuals

Chapter 4: Factor Analysis

Cover Page. The handle holds various files of this Leiden University dissertation

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013

Detecting selection from differentiation between populations: the FLK and hapflk approach.

Microsatellite data analysis. Tomáš Fér & Filip Kolář

9.1 Orthogonal factor model.

Populations in statistical genetics

Régression en grande dimension et épistasie par blocs pour les études d association

Supplementary Information

FaST linear mixed models for genome-wide association studies

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II

Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations

Resemblance between relatives

SNP Association Studies with Case-Parent Trios

GENETIC association studies aim to map phenotypeinfluencing

Notes on Population Genetics

Power and sample size calculations for designing rare variant sequencing association studies.

Distinctive aspects of non-parametric fitting

Introduction to QTL mapping in model organisms

Principal component analysis (PCA) for clustering gene expression data

NIH Public Access Author Manuscript Genet Epidemiol. Author manuscript; available in PMC 2014 June 23.

News Shocks: Different Effects in Boom and Recession?

Multivariate analysis of genetic data: an introduction

GBLUP and G matrices 1

Principal Components Analysis (PCA)

Similarity Measures and Clustering In Genetics

Analytic power calculation for QTL linkage analysis of small pedigrees

BS 50 Genetics and Genomics Week of Oct 3 Additional Practice Problems for Section. A/a ; B/B ; d/d X A/a ; b/b ; D/d

PCA vignette Principal components analysis with snpstats

Genetics Studies of Multivariate Traits

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Relationship between Genomic Distance-Based Regression and Kernel Machine Regression for Multi-marker Association Testing

Improved linear mixed models for genome-wide association studies

1 Preliminary Variance component test in GLM Mediation Analysis... 3

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

TOPICS IN STATISTICAL METHODS FOR HUMAN GENE MAPPING

Inférence en génétique des populations IV.

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Transcription:

Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases (affected individuals) and the controls (unaffected individuals). AA AA Aa Aa Aa aa AA Aa AA Aa aa Aa Cases Controls

Population Structure and Association Testing The observations in genome-wide case-control association studies can have several sources of dependence. Population structure, the presence of subgroups in the population with ancestry differences, is a major concern for association studies Population structure is often cryptic. Neglecting such structure in the data can lead to seriously spurious associations.

Balding-Nichols Model A model that is often used for population structure is the Balding-Nichols model (Balding and Nichols, 1995). Consider unrelated outbred individuals that are sampled from a population with K subpopulations, i.e., the subpopulations are 1, 2,..., K. Assume that an individual can be a member of only one subpopulation, i.e., there is no admixture. Under the Balding-Nichols model, the allele frequency for subpopulation k, where 1 k K, is a random draw from a beta distribution with parameters p(1 F stk )/F stk and (1 p)(1 F stk )/F stk, where 0 < p < 1 The parameter p can be viewed as the ancestral allele frequency and F stk can be viewed as Wright s standardized measure of variation for subpopulation k.

Balding-Nichols Model: Covariance Structure Consider a single bi-allelic marker (e.g. a SNP) with allele labels 0 and 1 Let N be the number of sampled individuals with genotype data at the marker. Let Y = (Y 1,... Y N ) where Y i =the number of alleles of type 1 in individual i, so the value of Y i is 0, 1, or 2. Under the Balding-Nichols model: Individual i has inbreeding coefficient equal to F st If individuals i and j are are both from the same subpopulation k, then Corr(Y i, Y j ) = F stk If i and j are from different subpopulations then Corr(Y i, Y j ) = 0 The F stk values, the number of subpopulations K, and the subpopulation memberships for the sample individuals will generally be unknown when there is cryptic population structure.

Balding-Nichols Model: Covariance Structure If there is no structure then the covariance matrix of Y will be a function of the identity matrix: 1 0... 0 0 1... 0 I 0 =........, 0 0... 1 If there is structure then the covariance matrix of Y will be a function of : Σ 0 = 1 + F st1 F st1... 0 F st1 1 + F st1... 0........ 0 0... 1 + F stk,

Case-Control Design Methods have been proposed to correct for cryptic population structure in case-control studies by using data across the genome. Let N be the number of individuals in the study. Let X = (X 1,... X N ) be a phenotype indicator vector for case control status where X i = 1 if i is a case and X i = 0 if i is a control Let M be the number of bi-allelic markers (e.g. SNPs) in the data. Consider a marker s, where 1 s M, and let Y s = (Y 1s,... Y Ns ) where Y is =the number of alleles of type 1 in individual i at marker s.

Genomic Control Devlin and Roeder (1999) proposed correcting for substructure via a method called genomic control. For each marker s, the Armitage trend statistic is calculated A rs = Nr 2 XY s where rxy 2 s is the squared correlation between the genotype variable Y s for marker s and the binary phenotype variable X. If there is no population structure, the distribution of A rs will approximately follow a χ 2 distribution with 1 degree of freedom. If there is population structure, the statistic will deviate from a χ 2 1 distribution due to an inflated variance.

Genomic Control Use λ = median(ar 1,...,Ars,...Ar M ).456 as a correction factor for cryptic structure, where.456 is the median of a χ 2 1 distribution. The uniform inflation factor λ is then applied to the Armitage trend statistic values à rs = A r s λ à rs will approximately follow a χ 2 distribution with 1 degree of freedom.

Genomic Control Another way to view genomic control is as follows: A rs = T 2 Var 0 (T ) where T is a measure of allele frequency differences between cases and controls and Var 0 (T ) is the variance of T under the null hypothesis For the Armitage statistic, Var 0 (T ) is calculated assuming individuals are unrelated (calculation based on the identity matrix). Genomic control inflates this variance to account for the structure (unknown F st values) Ã rs = T 2 λvar 0 (T )

Principal Components Analysis Price et al. (2006) proposed corrected for structure in association studies by using principal components analysis (PCA) They developed a method called EIGENSTRAT for association testing in structured populations. Calculate an empirical covariance matrix ˆΨ with components ˆψ ij : ˆψ ij = 1 M (Y is 2ˆp s )(Y js 2ˆp s ) M ˆp s (1 ˆp s ) s=1 where ˆp s is an allele frequency estimate for the type 1 allele at marker s

Principal Components Analysis Principal components (eigenvectors) for ˆΨ are obtained. The top principal components are viewed as continuous axes of variation that reflect subpopulation genetic variation in the sample. Individuals with similar values for a particular top principal component will have similar ancestry for that axes. The top principal components (highest eigenvalues) are used as covariates in a multi-linear regression. Y s = β 0 + β 1 X + β 2 PC 1 + β 3 PC 2 + β 4 PC 3 + + ɛ H 0 : β 1 = 0 vs. H a : β 1 0

Computation of Eigenstrat The kth axis of variation is defined to be the kth eigenvector of ˆΨ (that is, the eigenvector with kth largest eigenvalue). Thus, the ancestry a jk of individual j along the kth axis of variation equals coordinate j of the kth eigenvector. Adjustments of genotypes for ancestry Let Y js be the genotype of individual j (Y js = 0, 1 or 2) at marker s, and let a j be the ancestry of individual j along a given axis of variation. = Y js γ s a j, where γ s = i a iy is / i a2 i γ s is a regression coefficient for ancestry predicting genotype across individuals j with valid genotypes at marker s. Y adjusted js A similar adjustment is performed for each axis of variation.

Computation of Eigenstrat Adjustments of phenotypes for ancestry X j = 1 if j is a case and X j = 0 if j is a control = X j δa i, where δ = i a ix i / i a2 i The EIGENSTRAT statistic adjusted for ancestry at marker s is X adjusted j [ ( EIG s = (N K 1) corr Ys adjusted, X adjusted)] 2 where N is the number of subjects in the sample and K is the number of axes of variation used to adjust for ancestry. EIG s will approximately follow a χ 2 distribution with 1 degree of freedom.

Structured Samples with Related Individuals Previous population structure methods have been developed for samples with unrelated individuals Methods may not be valid in samples with related individuals (known and/or unknown) Thornton and McPeek (2010) proposed ROADTRIPS for association testing in samples with related individuals and population structure ROADTRIPS: RObust Association Detection Test for Related Individuals with Population Structure

Structured Samples with Related Individuals Many genetic studies have samples with related individuals Consider testing for association between a trait and a single genetic marker in a case-control design in which some individuals are related Assume that there are some related individuals in the sample from a structured population. If individuals i and j are from population k and they are related, the correlation, however, is no longer F stk! The correlation for the pair is now ψ ij, where ψ ij is a function of both the kinship coefficient for individuals i and j and F stk.

ROADTRIPS Similar to EIGENSTRAT, the ROADTRIPS approach calculates an empirical covariance matrix Ψ. The top principal components in EIGENSTRAT are not able to capture the complicated covariance structure due to the related individuals in the samples. Instead of taking the top principal components of the matrix Ψ, ROADTRIPS uses the entire matrix to correct the variance of a general class of statistics for unknown structure ROADTRIPS extensions have been developed for a number of association tests including Pearson χ 2 test, the Armitage trend test, the corrected χ 2 test, the W QLS test, and the M QLS test. ROADTRIPS is a valid association method for partially or completely unknown population and pedigree structure.

Structured Association Method Pritchard et al. (2000) developed the program STRUCTURE This method uses MCMC to cluster genetically similar individuals into subpopulations. Association testing is performed within each of the subpopulations Assignments of individuals are highly sensitive to the number of subpopulations K in the sample, which is often unknown when there is cryptic structure Not computationally feasible for genome-wide association studies

References Balding DJ, Nichols RA (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identify and paternity. Genetica 96, 3-12. Devlin B, Roeder K (1999). Genomic control for association studies. Biometrics 55, 997-1004. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904-909.

References Pritchard J, Stephens M, Donnelly P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945-959. Thornton T, McPeek MS (2010). ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure. Am. J. Hum. Genet. 86, 172-184.