Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics
Lee H. Dicker, Rutgers University and Amazon, NYC
Based on joint work with Ruijun Ma (Rutgers), Murat A. Erdogdu (Stanford/MSR), and Yakir Reshef (Harvard)
Borchard Colloquium, July 4, 2017
Work conducted prior to joining Amazon

Heritability in genetics

Heritability is a fundamental concept in genetics. It represents the extent to which an exhibited trait in an individual is attributable to genetics. (Slightly) more technical formulation: the proportion of variation in a phenotype that can be explained by the genotype. Population-level estimates of heritability can be obtained from phenotype data (e.g. human height or milk fat percentage in cows) and genotype data (e.g. pedigree information or GWAS data). Heritability estimation has a long history in statistics, going back to R.A. Fisher. Currently, LMM-based methods are probably the most prevalent approach for estimating heritability with GWAS data. LMM-based methods for heritability estimation are not new (Henderson, 1950, Ann. Math. Stat.). Modern LMM-based methods for GWAS data took off after Yang et al. (2010, Nat. Genet.), which gave estimates of heritability for human height. Thousands of papers on LMM-based heritability for GWAS data have appeared since 2010.

Heritability in genetics

Hayes et al. (2010), PLoS Genet. Much of the recent work on LMM-based methods for heritability estimation is built on older research into cattle breeding.

This talk

Some new statistical perspectives on heritability estimation. Main messages: 1. Standardize your predictors. 2. There's a big opportunity for statistics to make an impact in this area with thoughtful modeling and technical expertise.

Genetic relatedness and heritability

Let $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ be a vector of centered real-valued outcomes, where $y_i$ is the phenotype value for individual $i$ in some population. Assume that

$$y = g + e \qquad (1)$$

can be decomposed into an additive genetic effect $g \sim \mathrm{MV}(0, \sigma_g^2 K)$ and an uncorrelated noise vector $e \sim \mathrm{MV}(0, \sigma_e^2 I)$. Here $K = (K_{ij})$ is the genetic relationship matrix (GRM) and $K_{ij}$ measures genetic similarity between individuals $i$ and $j$; the GRM is standardized so that $K_{ii} = 1$. The noise vector $e$ may contain environmental noise, measurement error, and other non-additive genetic effects. Frequently, the data are transformed before reaching the representation (1), e.g. by projecting out covariates or other fixed effects. The heritability coefficient is

$$h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.$$

Since $E(yy^\top) = \sigma_g^2 K + \sigma_e^2 I$, the heritability coefficient is the $R^2$ for regressing $y_i y_j$ on $K_{ij}$, i.e. regressing phenotype similarity on genetic similarity.

Genetic relatedness and heritability

To estimate $h^2$, Henderson (1950) used least squares for regressing $y_i y_j$ on $K_{ij}$. Maximum likelihood is also widely used: let

$$\ell(s_g^2, s_e^2) = \frac{1}{2} \log\det(s_g^2 K + s_e^2 I) + \frac{1}{2} y^\top (s_g^2 K + s_e^2 I)^{-1} y$$

be the negative log-likelihood (up to constants) for $(\sigma_g^2, \sigma_e^2)$, under the assumption that $g$ and $e$ are Gaussian; minimize $\ell$ to get the MLE for $(\sigma_g^2, \sigma_e^2)$ and hence $h^2$. Henderson's estimator and the MLE are both reasonable estimators for a given GRM $K$. But how should we choose $K$? Pre-GWAS: pedigree/familial information determines $K$ and describes how individuals are related. Post-GWAS: molecular genetic data gives a fine-grained measure of genetic similarity.
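
To make these two estimators concrete, here is a minimal numerical sketch (our own illustration, not the speaker's code): it simulates $y = g + e$ for a given GRM $K$, computes the MLE of $(\sigma_g^2, \sigma_e^2)$ by minimizing the negative log-likelihood above with scipy, and computes a Henderson-style least-squares regression of $y_i y_j$ on $K_{ij}$. The function names and the toy low-rank GRM are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, y, K):
    """Negative Gaussian log-likelihood (up to constants) in (log sigma_g^2, log sigma_e^2)."""
    s2g, s2e = np.exp(params)                      # log-scale keeps the variances positive
    V = s2g * K + s2e * np.eye(len(y))
    _, logdet = np.linalg.slogdet(V)
    return 0.5 * logdet + 0.5 * y @ np.linalg.solve(V, y)

def mle_heritability(y, K):
    """MLE of h^2 = sigma_g^2 / (sigma_g^2 + sigma_e^2) for a given GRM K."""
    res = minimize(neg_log_lik, np.zeros(2), args=(y, K), method="Nelder-Mead")
    s2g, s2e = np.exp(res.x)
    return s2g / (s2g + s2e)

def henderson_heritability(y, K):
    """Least-squares regression (no intercept) of y_i*y_j on K_ij over pairs i < j."""
    iu = np.triu_indices(len(y), k=1)
    s2g = np.sum(K[iu] * np.outer(y, y)[iu]) / np.sum(K[iu] ** 2)
    s2e = np.mean(y ** 2) - s2g                    # E[y_i^2] = sigma_g^2 K_ii + sigma_e^2, K_ii = 1
    return s2g / (s2g + s2e)

rng = np.random.default_rng(0)
n, s2g, s2e = 300, 0.5, 0.5                        # true h^2 = 0.5
Z = rng.standard_normal((n, 50))
K = np.corrcoef(Z)                                 # toy GRM: K_ii = 1, low-rank relatedness structure
y = rng.multivariate_normal(np.zeros(n), s2g * K) + rng.normal(scale=np.sqrt(s2e), size=n)
print("MLE h^2:      ", round(mle_heritability(y, K), 3))
print("Henderson h^2:", round(henderson_heritability(y, K), 3))
```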

Genetic relatedness and LMMs for GWAS data

For GWAS data, each individual's genetic data can be encoded in a vector $x_i = (x_{i1}, \ldots, x_{im}) \in \mathbb{R}^m$, where $x_{ij}$ represents the (frequently standardized) minor allele count for the $j$-th single nucleotide polymorphism (SNP); before standardization, $x_{ij}$ takes one of three values (0, 1, or 2). $m$ can be in the hundreds of thousands to millions; $n$ may be in the thousands to tens of thousands. The GRM is determined by a kernel function $K$ with $K_{ij} = K(x_i, x_j)$. The linear kernel

$$K(x_i, x_j) = \frac{1}{m} x_i^\top x_j$$

is the most widely used in practice. With the linear kernel, the original model $y = g + e$ can be rewritten as a linear random-effects model (LMM)

$$y = Xb + e,$$

where $X = (x_1, \ldots, x_n)^\top$, $b = (b_1, \ldots, b_m) \in \mathbb{R}^m$, and $b_1, \ldots, b_m \sim \mathrm{MV}(0, \sigma_g^2/m)$ are iid.
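
As a small illustration of the linear-kernel GRM and its LMM reinterpretation (a sketch with simulated genotypes; the minor allele frequencies and all variable names are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5000

# Toy genotypes: minor allele counts in {0, 1, 2}, then column-standardized
# (mean 0, variance 1), as is commonly done before forming the linear-kernel GRM.
maf = rng.uniform(0.05, 0.5, size=m)
G = rng.binomial(2, maf, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Linear-kernel GRM: K_ij = x_i . x_j / m; after standardization the diagonal is approximately 1.
K = X @ X.T / m
print("mean(diag(K)) =", round(K.diagonal().mean(), 3))

# LMM equivalence: if b_1, ..., b_m are iid with variance sigma_g^2 / m, then
# g = X b has Cov(g) = sigma_g^2 * X X^T / m = sigma_g^2 * K, matching y = g + e.
sigma_g2 = 0.5
b = rng.normal(scale=np.sqrt(sigma_g2 / m), size=m)
g = X @ b
```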

LMMs for heritability estimation: Questions

$$y = Xb + e, \qquad (2)$$

Variance components methods for estimating $h^2$ under (2) have emerged as one of the most popular strategies for heritability estimation in GWAS. However, a number of challenges remain. Causal SNPs and linkage disequilibrium: in most generative models linking GWAS data and outcomes, there is a fixed collection of causal SNPs $C \subseteq [m]$ and $b$ is assumed to be a sparse vector supported on $C$. If the SNPs $x_i$ are highly correlated, i.e. they are in linkage disequilibrium (LD), the LMM approach can give badly biased estimates of $h^2$ (Speed et al., 2012, Am. J. Hum. Genet.). Partitioning heritability: sometimes it is desirable to estimate the heritability attributable to a subset of SNPs, $S \subseteq [m]$. To date, there is no consensus about how this should be done, and existing solutions have significant drawbacks.

Causal SNPs and linkage disequilibrium

Simulations show that standard LMM estimators of $h^2$ are biased when causal SNPs are located in regions with low (or high) LD. Settings: $n = 500$, $m = 1000$, $\sigma_e^2 = 0.5$,

$$b = \frac{1}{\sqrt{m}}(z_1, \ldots, z_{m/2}, 0, \ldots, 0) \in \mathbb{R}^m \quad \text{with } z_1, \ldots, z_{m/2} \sim N(0, 1) \text{ iid},$$

so $C = \{1, \ldots, m/2\}$, and $x_i \sim N(0, \Sigma)$ with

$$\Sigma = \begin{pmatrix} \mathrm{AR}(0.2) & 0 \\ 0 & \mathrm{AR}(0.8) \end{pmatrix},$$

so the causal SNPs sit in the low-LD block. Then $\sigma_g^2 = 0.5$ and $h^2 = 0.5$.

Results (summary over 50 independent simulated datasets): true $h^2 = 0.5$; $\hat h^2$ mean 0.432, 95% CI (0.404, 0.460).
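
A minimal sketch of a simulation in this spirit (our own code; the AR(1) covariance blocks, the $1/\sqrt{m}$ scaling of $b$, and the plain linear-kernel MLE are assumptions based on the settings above, and we run fewer replicates than the slide):

```python
import numpy as np
from scipy.optimize import minimize

def ar1_cov(p, rho):
    """AR(1) covariance matrix: Sigma_jk = rho^|j-k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def mle_h2(y, K):
    """MLE of h^2 by minimizing the negative Gaussian log-likelihood over (s2g, s2e)."""
    n = len(y)
    def nll(params):
        s2g, s2e = np.exp(params)
        V = s2g * K + s2e * np.eye(n)
        return 0.5 * np.linalg.slogdet(V)[1] + 0.5 * y @ np.linalg.solve(V, y)
    s2g, s2e = np.exp(minimize(nll, np.zeros(2), method="Nelder-Mead").x)
    return s2g / (s2g + s2e)

rng = np.random.default_rng(2)
n, m, s2e = 500, 1000, 0.5
Sigma = np.zeros((m, m))
Sigma[: m // 2, : m // 2] = ar1_cov(m // 2, 0.2)   # low-LD block (contains the causal SNPs)
Sigma[m // 2 :, m // 2 :] = ar1_cov(m // 2, 0.8)   # high-LD block (no causal SNPs)
L = np.linalg.cholesky(Sigma)

h2_hats = []
for _ in range(10):                                 # 50 replicates on the slide; fewer here for speed
    X = rng.standard_normal((n, m)) @ L.T           # x_i ~ N(0, Sigma)
    b = np.concatenate([rng.standard_normal(m // 2), np.zeros(m // 2)]) / np.sqrt(m)
    y = X @ b + rng.normal(scale=np.sqrt(s2e), size=n)
    K = X @ X.T / m                                 # linear-kernel GRM (no whitening)
    h2_hats.append(mle_h2(y, K))

print("mean hat h^2 =", round(float(np.mean(h2_hats)), 3), "(true h^2 = 0.5)")
```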

Partitioning heritability

One method for partitioning heritability is to assume an LMM with multiple variance components (Finucane et al., 2015, Nat. Genet.):

$$y = X_S b_S + X_{S^c} b_{S^c} + e, \qquad (3)$$

where

$$b_i \sim \begin{cases} \mathrm{MV}(0, \sigma_S^2 / m_S) & \text{if } i \in S, \\ \mathrm{MV}(0, \sigma_{S^c}^2 / m_{S^c}) & \text{if } i \notin S, \end{cases}$$

are all independent (here $m_S = |S|$ and $m_{S^c} = |S^c|$). The $S$-partitioned heritability is

$$h_S^2 = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_{S^c}^2 + \sigma_e^2}.$$

$h_S^2$ can be estimated, for instance, using maximum likelihood for the model (3) under a Gaussian random-effects assumption. However, estimating the total heritability $h^2 = (\sigma_S^2 + \sigma_{S^c}^2)/(\sigma_S^2 + \sigma_{S^c}^2 + \sigma_e^2)$ under this model has the same issues noted in the previous slide/simulation.
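
A hedged sketch of how model (3) could be fit by maximum likelihood with one GRM per SNP partition (our own illustration; the function name and the per-partition GRM scaling are assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize

def partitioned_mle(y, X, S):
    """ML estimates of (s2_S, s2_Sc, s2_e) under the two-component LMM
    y = X_S b_S + X_Sc b_Sc + e, using one GRM per SNP partition."""
    n, m = X.shape
    Sc = np.setdiff1d(np.arange(m), S)
    K_S = X[:, S] @ X[:, S].T / len(S)         # GRM from SNPs in S
    K_Sc = X[:, Sc] @ X[:, Sc].T / len(Sc)     # GRM from the remaining SNPs

    def nll(params):
        s2_S, s2_Sc, s2_e = np.exp(params)
        V = s2_S * K_S + s2_Sc * K_Sc + s2_e * np.eye(n)
        return 0.5 * np.linalg.slogdet(V)[1] + 0.5 * y @ np.linalg.solve(V, y)

    s2_S, s2_Sc, s2_e = np.exp(minimize(nll, np.zeros(3), method="Nelder-Mead").x)
    h2_S = s2_S / (s2_S + s2_Sc + s2_e)        # S-partitioned heritability
    return s2_S, s2_Sc, s2_e, h2_S
```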

LMMs for heritability estimation

Challenges arise when the location of causal SNPs is correlated with LD structure. Solution: remove LD structure, i.e. whiten.

LMMs for heritability estimation: Mahalanobis kernel

Our proposal: use the Mahalanobis kernel to measure genetic similarity,

$$K_{ij} = \frac{1}{m} x_i^\top \Sigma^{-1} x_j,$$

where $\mathrm{Cov}(x_i) = \Sigma$. This resolves the causal SNPs/linkage disequilibrium and partitioning heritability problems, and clears a path for more modeling progress and results. Argument:

$$K_{ij} = \frac{1}{m} x_i^\top \Sigma^{-1} x_j \;\Longleftrightarrow\; y = Xb + e \text{ with } b \sim \mathrm{MV}(0, \sigma_g^2 \Sigma^{-1}/m) \;\Longleftrightarrow\; y = X_\Sigma b_\Sigma + e \text{ (a fixed-effects model)},$$

where $X_\Sigma = X\Sigma^{-1/2}$ and $b_\Sigma = \Sigma^{1/2} b$. The Mahalanobis kernel has been used extensively elsewhere in genetics (e.g. for association testing), but not, to our knowledge, for heritability estimation.
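
A quick numerical check (our own, with simulated correlated predictors) that the Mahalanobis-kernel GRM is exactly the linear-kernel GRM of the whitened predictors $X\Sigma^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 400

# Toy LD structure and correlated predictors x_i ~ N(0, Sigma).
idx = np.arange(m)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])
X = rng.standard_normal((n, m)) @ np.linalg.cholesky(Sigma).T

# Mahalanobis-kernel GRM vs. linear-kernel GRM of the whitened predictors.
K_mahal = X @ np.linalg.solve(Sigma, X.T) / m
evals, evecs = np.linalg.eigh(Sigma)
X_white = X @ (evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T)   # X Sigma^{-1/2}
K_white = X_white @ X_white.T / m
print(np.allclose(K_mahal, K_white))   # True: whitening and the Mahalanobis kernel coincide
```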

Revisiting causal SNPs/LD and partitioning heritability

To find the Mahalanobis-MLE $\hat h^2_\Sigma$, replace $X$ by $X\Sigma^{-1/2}$ and then find the MLE for $h^2$ using the linear kernel, i.e. whiten/decorrelate/standardize the predictors and compute the usual MLE. In the simulation from before (true $h^2 = 0.5$):

linear-kernel MLE $\hat h^2$: mean 0.432, 95% CI (0.404, 0.460)
Mahalanobis MLE $\hat h^2_\Sigma$: mean 0.499, 95% CI (0.473, 0.525)

For partitioning, let

$$\Sigma = \begin{pmatrix} \Sigma_{S,S} & \Sigma_{S,S^c} \\ \Sigma_{S^c,S} & \Sigma_{S^c,S^c} \end{pmatrix}.$$

The partitioned heritability is defined via the decomposition $y = g_S + g_{S^c} + e$, where

$$g_S = X_S\bigl(b_S + \Sigma_{S,S}^{-1} \Sigma_{S,S^c} b_{S^c}\bigr), \qquad g_{S^c} = \bigl(X_{S^c} - X_S \Sigma_{S,S}^{-1} \Sigma_{S,S^c}\bigr) b_{S^c}.$$

Under the LMM $y = Xb + e$ with $b \sim \mathrm{MV}(0, \sigma_g^2 \Sigma^{-1}/m)$, $g_S$ and $g_{S^c}$ are uncorrelated.
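
A self-contained sketch of exactly this recipe (our own code, repeating the earlier simulation setup): simulate one dataset in which the causal SNPs sit in the low-LD block, then compare the linear-kernel MLE with the MLE computed after whitening $X \mapsto X\Sigma^{-1/2}$:

```python
import numpy as np
from scipy.optimize import minimize

def mle_h2(y, K):
    """MLE of h^2 for a given GRM K (same routine as in the earlier sketch)."""
    n = len(y)
    def nll(params):
        s2g, s2e = np.exp(params)
        V = s2g * K + s2e * np.eye(n)
        return 0.5 * np.linalg.slogdet(V)[1] + 0.5 * y @ np.linalg.solve(V, y)
    s2g, s2e = np.exp(minimize(nll, np.zeros(2), method="Nelder-Mead").x)
    return s2g / (s2g + s2e)

def ar1_cov(p, rho):
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(4)
n, m, s2e = 500, 1000, 0.5
Sigma = np.zeros((m, m))
Sigma[: m // 2, : m // 2] = ar1_cov(m // 2, 0.2)
Sigma[m // 2 :, m // 2 :] = ar1_cov(m // 2, 0.8)
X = rng.standard_normal((n, m)) @ np.linalg.cholesky(Sigma).T
b = np.concatenate([rng.standard_normal(m // 2), np.zeros(m // 2)]) / np.sqrt(m)
y = X @ b + rng.normal(scale=np.sqrt(s2e), size=n)

# Whiten the predictors (X -> X Sigma^{-1/2}) and compute the usual linear-kernel MLE.
evals, evecs = np.linalg.eigh(Sigma)
X_white = X @ (evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T)
print("linear kernel:     ", round(mle_h2(y, X @ X.T / m), 3))
print("Mahalanobis kernel:", round(mle_h2(y, X_white @ X_white.T / m), 3))
```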

Why does this work?

Model misspecification results for variance components estimation problems (D & Erdogdu, 2016, 2017). Meta-result. Assume: (i) $y = Xb + e \in \mathbb{R}^n$; (ii) $\mathrm{Cov}(e) = \sigma_e^2 I$; (iii) $b \in \mathbb{R}^m$ is any (fixed or random) vector with $\|b\|_2 \leq 1$. If $x_i \sim N(0, \Sigma)$ are iid, then the variance component estimators for the Mahalanobis kernel $(\hat\sigma_g^2, \hat\sigma_e^2)$ are nice: (i) $(\hat\sigma_g^2, \hat\sigma_e^2) \to (\sigma_g^2, \sigma_e^2)$, where $\sigma_g^2 = \lim \|b\|^2$; (ii) $(\hat\sigma_g^2, \hat\sigma_e^2)$ are approximately normal. Implication: if genotypes are random, then the Mahalanobis kernel for estimating heritability works even with fixed genetic effects and, in particular, when the location of causal SNPs is associated with LD structure.

Why does this work?

Proof idea: in the fixed-effects model with random genotypes, $b$ inherits randomness from $X$. Let $\hat\theta = (\hat\sigma_g^2, \hat\sigma_e^2)$ be the MLE and assume WLOG that $x_i \sim N(0, I)$. Key point: since $X \overset{D}{=} XU$ for any $m \times m$ orthogonal matrix $U$,

$$\hat\theta(y, X) \overset{D}{=} \hat\theta(\tilde y, X), \quad \text{where } \tilde y = X\tilde b + e, \quad \tilde b \sim \mathrm{Uniform}\{S^{m-1}(\sigma_g^2)\}$$

(the sphere $\{v \in \mathbb{R}^m : \|v\|^2 = \sigma_g^2\}$). In other words, $\hat\theta = \hat\theta(y, X)$ has the same distribution as $\hat\theta(\tilde y, X)$, where the data are drawn from a random-effects model. Since $\tilde b$ is approximately Gaussian for large $m$, we can use tools for variance components estimation in Gaussian random-effects models to approximate the distribution of $\hat\theta$: finite-sample normal approximation results for quadratic forms. Other comments: we can get closed-form expressions for the asymptotic variance of $\hat\theta$ using results on the Marčenko-Pastur distribution.
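
For readers who want the key distributional identity spelled out, here is our reconstruction of the rotation argument (an expansion of the sketch above, not the speaker's exact derivation):

```latex
\[
\begin{aligned}
&\text{Let } U \text{ be Haar-distributed on the } m \times m \text{ orthogonal group, independent of } (X, e).\\[2pt]
&\text{(1) Since the rows } x_i \sim N(0, I) \text{ are iid, } (XU, e) \overset{D}{=} (X, e), \text{ hence }
  \hat\theta(XUb + e,\, XU) \overset{D}{=} \hat\theta(Xb + e,\, X) = \hat\theta(y, X).\\[2pt]
&\text{(2) With the linear kernel, the likelihood depends on the design only through the GRM, and }
  (XU)(XU)^\top = XX^\top, \text{ so } \hat\theta(\,\cdot\,, XU) = \hat\theta(\,\cdot\,, X).\\[2pt]
&\text{Writing } \tilde b = Ub, \text{ uniform on the sphere } \|v\|^2 = \|b\|^2 = \sigma_g^2,
  \text{ (1) and (2) give } \hat\theta(y, X) \overset{D}{=} \hat\theta(X\tilde b + e,\, X) = \hat\theta(\tilde y, X).
\end{aligned}
\]
```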

Open questions

Results for non-Gaussian $x_i$: SNP genotype data is always non-Gaussian, but simulations suggest the results may still hold. Binary outcomes: Erdogdu, Bayati & D (2016). Applications to association testing problems.