Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Lee H. Dicker (Rutgers University and Amazon, NYC)
Based on joint work with Ruijun Ma (Rutgers), Murat A. Erdogdu (Stanford/MSR), and Yakir Reshef (Harvard)

Borchard Colloquium, July 4, 2017
Work conducted prior to joining Amazon
Heritability in genetics

Heritability is a fundamental concept in genetics: it represents the extent to which an exhibited trait in an individual is attributable to genetics. A (slightly) more technical formulation: the proportion of variation in a phenotype that can be explained by the genotype.

Population-level estimates of heritability can be obtained from phenotype data (e.g. human height or milk fat percentage in cows) and genotype data (e.g. pedigree information or GWAS data).

Heritability estimation has a long history in statistics, going back to R.A. Fisher. Currently, LMM-based methods are probably the most prevalent approach for estimating heritability with GWAS data.
- LMM-based methods for heritability estimation are not new (Henderson, 1950, Ann. Math. Stat.).
- Modern LMM-based methods for GWAS data took off after Yang et al. (2010, Nat. Genet.), which gave estimates of heritability for human height.
- Thousands of papers on LMM-based heritability for GWAS data have appeared since 2010.
Heritability in genetics

Hayes et al. (2010), PLoS Genet.
- Much of the recent work on LMM-based methods for heritability estimation is built on older research into cattle breeding.
This talk

Some new statistical perspectives on heritability estimation. Main messages:
1. Standardize your predictors.
2. There's a big opportunity for statistics to make an impact in this area with thoughtful modeling and technical expertise.
Genetic relatedness and heritability

Let y = (y_1, ..., y_n)' in R^n be a vector of centered real-valued outcomes, where y_i is the phenotype value for individual i in some population. Assume that

    y = g + e    (1)

can be decomposed into an additive genetic effect g ~ MV(0, sigma_g^2 K) and an uncorrelated noise vector e ~ MV(0, sigma_e^2 I).
- K = (K_ij) is the genetic relationship matrix (GRM); K_ij measures genetic similarity between individuals i and j, and the GRM is standardized so that K_ii = 1.
- The noise vector e may contain environmental noise, measurement error, and other non-additive genetic effects.
- Frequently, the data are transformed before reaching the representation (1), e.g. by projecting out covariates or other fixed effects.

The heritability coefficient is

    h^2 = sigma_g^2 / (sigma_g^2 + sigma_e^2).

Since E(yy') = sigma_g^2 K + sigma_e^2 I, the heritability coefficient is the R^2 for regressing y_i y_j on K_ij, i.e. regressing phenotype similarity on genetic similarity.
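The "regress phenotype similarity on genetic similarity" view above can be turned into a few lines of code. A minimal sketch (not the talk's implementation; the function name and toy simulation are illustrative): regress the off-diagonal products y_i y_j on K_ij by least squares, then recover sigma_e^2 from the diagonal using K_ii = 1.

```python
import numpy as np

def henderson_h2(y, K):
    """Least-squares regression of the products y_i y_j (i != j) on K_ij."""
    n = len(y)
    P = np.outer(y, y)
    off = ~np.eye(n, dtype=bool)                # off-diagonal mask
    s2g = np.sum(P[off] * K[off]) / np.sum(K[off] ** 2)
    s2e = np.mean(y ** 2) - s2g                 # E(y_i^2) = s2g + s2e since K_ii = 1
    return s2g, s2e, s2g / (s2g + s2e)

# Toy usage: linear-kernel GRM with true h^2 = 0.5.
rng = np.random.default_rng(0)
n, m = 1000, 500
X = rng.standard_normal((n, m))
K = X @ X.T / m
b = rng.normal(0.0, np.sqrt(0.5 / m), size=m)   # sigma_g^2 = 0.5
y = X @ b + rng.normal(0.0, np.sqrt(0.5), size=n)
s2g, s2e, h2 = henderson_h2(y, K)
```

This is the moment-based counterpart of the likelihood approach on the next slide; it needs no distributional assumptions beyond the covariance model E(yy') = sigma_g^2 K + sigma_e^2 I.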
Genetic relatedness and heritability

To estimate h^2, Henderson (1950) used least squares for regressing y_i y_j on K_ij. Maximum likelihood is also widely used: let

    l(s_g^2, s_e^2) = (1/2) log det(s_g^2 K + s_e^2 I) + (1/2) y'(s_g^2 K + s_e^2 I)^{-1} y

be the negative log-likelihood for (sigma_g^2, sigma_e^2) under the assumption that g, e are Gaussian; minimize l to get the MLE for (sigma_g^2, sigma_e^2) and h^2.

Henderson's estimator and the MLE are both reasonable estimators for a given GRM K. But how should we choose K?
- Pre-GWAS: pedigree/familial information determines K and describes how individuals are related.
- Post-GWAS: molecular genetic data gives a fine-grained measure of genetic similarity.
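The MLE above can be computed with a single eigendecomposition of K, since s_g^2 K + s_e^2 I has eigenvalues s_g^2 lambda_i + s_e^2 in the eigenbasis of K. A minimal sketch (illustrative names; not the software used in the talk):

```python
import numpy as np
from scipy.optimize import minimize

def grm_mle(y, K):
    """Gaussian MLE of (sigma_g^2, sigma_e^2) for y ~ N(0, s2g*K + s2e*I)."""
    lam, U = np.linalg.eigh(K)        # one eigendecomposition suffices
    z = U.T @ y                       # rotated outcomes

    def negloglik(theta):
        s2g, s2e = np.exp(theta)      # log-scale parameters keep variances positive
        d = s2g * lam + s2e           # eigenvalues of s2g*K + s2e*I
        return 0.5 * np.sum(np.log(d)) + 0.5 * np.sum(z ** 2 / d)

    res = minimize(negloglik, x0=np.log([0.5, 0.5]), method="Nelder-Mead")
    s2g, s2e = np.exp(res.x)
    return s2g, s2e, s2g / (s2g + s2e)

# Toy usage: y = g + e with sigma_g^2 = sigma_e^2 = 0.5, so h^2 = 0.5.
rng = np.random.default_rng(1)
n, m = 500, 1000
X = rng.standard_normal((n, m))
K = X @ X.T / m
g = rng.multivariate_normal(np.zeros(n), 0.5 * K)
y = g + rng.normal(0.0, np.sqrt(0.5), size=n)
s2g, s2e, h2 = grm_mle(y, K)
```

The eigendecomposition trick makes each likelihood evaluation O(n) after an O(n^3) one-time cost, which is what makes maximum likelihood practical at GWAS sample sizes.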
Genetic relatedness and LMMs for GWAS data

For GWAS data, each individual's genetic data can be encoded in a vector x_i = (x_i1, ..., x_im)' in R^m, where x_ij represents the (frequently standardized) minor allele count for the j-th single nucleotide polymorphism (SNP).
- Before standardization, x_ij is ternary (0, 1, or 2).
- m can be in the 100Ks-1Ms; n may be in the 1Ks-10Ks.

The GRM is determined by a kernel function K with K_ij = K(x_i, x_j). The linear kernel

    K(x_i, x_j) = (1/m) x_i' x_j

is the most widely used in practice. With the linear kernel, the original model y = g + e can be rewritten as a linear random-effects model (LMM)

    y = Xb + e,

where X = (x_1, ..., x_n)', b = (b_1, ..., b_m)' in R^m, and b_1, ..., b_m ~ MV(0, sigma_g^2/m) are iid.
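Building the linear-kernel GRM from raw minor allele counts can be sketched as follows (the per-SNP standardization shown is one common convention, dividing by the binomial standard deviation sqrt(2p(1-p)); the function name and simulated genotypes are illustrative):

```python
import numpy as np

def linear_grm(G):
    """Linear-kernel GRM K = X X' / m from minor allele counts G (n x m)."""
    p = G.mean(axis=0) / 2.0                           # estimated allele frequencies
    X = (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))   # standardized genotypes
    return X @ X.T / G.shape[1]

# Toy usage: simulate genotypes for n = 100 individuals at m = 2000 SNPs.
rng = np.random.default_rng(2)
p_true = rng.uniform(0.1, 0.5, size=2000)
G = rng.binomial(2, p_true, size=(100, 2000))
K = linear_grm(G)
```

With this standardization the diagonal of K is close to 1 on average, matching the K_ii = 1 convention on slide 4.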
LMMs for heritability estimation: Questions

    y = Xb + e,    (2)

Variance components methods for estimating h^2 under (2) have emerged as one of the most popular strategies for heritability estimation in GWAS. However, a number of challenges have also emerged.
- Causal SNPs and linkage disequilibrium. In most generative models linking GWAS data and outcomes, there is a fixed collection of causal SNPs C, a subset of [m], and b is assumed to be a sparse vector supported on C. If the SNPs x_i are highly correlated, i.e. they are in linkage disequilibrium (LD), the LMM approach can give badly biased estimates of h^2 (Speed et al., 2012, Am. J. Hum. Genet.).
- Partitioning heritability. Sometimes it's desirable to estimate the heritability attributable to a subset of SNPs S, a subset of [m]. To date, there's no consensus about how this should be done, and existing solutions have significant drawbacks.
Causal SNPs and linkage disequilibrium

Simulations show that standard LMM estimators of h^2 are biased when causal SNPs are located in regions with low (or high) LD. Settings:
- n = 500, m = 1000, sigma_e^2 = 0.5.
- b = (1/sqrt(m)) (z_1, ..., z_{m/2}, 0, ..., 0)' in R^m with z_1, ..., z_{m/2} ~ N(0, 1) iid, so C = {1, ..., m/2}.
- x_i ~ N(0, Sigma) with Sigma = blockdiag(AR(0.2), AR(0.8)).
- Then sigma_g^2 = 0.5 and h^2 = 0.5.

    h^2 = 0.5 | hat-h^2: mean 0.432, 95% CI (0.404, 0.460)

Table: Summary of results based on simulating 50 independent datasets.
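The data-generating process on this slide can be sketched as follows (illustrative code; the causal SNPs sit in the low-LD AR(0.2) block, the high-LD AR(0.8) block is null, and the slide's table reports the downward bias of the linear-kernel estimator on data generated this way):

```python
import numpy as np

def ar1_cov(k, rho):
    """AR(1) covariance matrix: Sigma_ij = rho^|i-j|."""
    idx = np.arange(k)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(3)
n, m = 500, 1000
Sigma = np.zeros((m, m))
Sigma[: m // 2, : m // 2] = ar1_cov(m // 2, 0.2)   # low-LD block (causal)
Sigma[m // 2 :, m // 2 :] = ar1_cov(m // 2, 0.8)   # high-LD block (null)
X = rng.standard_normal((n, m)) @ np.linalg.cholesky(Sigma).T  # x_i ~ N(0, Sigma)

b = np.zeros(m)
b[: m // 2] = rng.standard_normal(m // 2) / np.sqrt(m)  # support C = {1,...,m/2}
y = X @ b + rng.normal(0.0, np.sqrt(0.5), size=n)       # sigma_e^2 = 0.5
```

The genetic variance of each y_i is b'Sigma b, which concentrates around 0.5 here, so the target heritability is h^2 = 0.5 as stated on the slide.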
Partitioning heritability

One method for partitioning heritability is to assume an LMM with multiple variance components (Finucane et al., 2015, Nat. Genet.):

    y = X_S b_S + X_{S^c} b_{S^c} + e,    (3)

where

    b_i ~ MV(0, sigma_S^2 / m_S) if i is in S,    b_i ~ MV(0, sigma_{S^c}^2 / m_{S^c}) if i is not in S,

and the b_i and e are all independent (m_S = |S|). The S-partitioned heritability is

    h_S^2 = sigma_S^2 / (sigma_S^2 + sigma_{S^c}^2 + sigma_e^2).

h_S^2 can be estimated, for instance, using maximum likelihood for the model (3) under a Gaussian random-effects assumption. However, estimating the total heritability h^2 = (sigma_S^2 + sigma_{S^c}^2) / (sigma_S^2 + sigma_{S^c}^2 + sigma_e^2) under this model has the same issues noted on the previous slide/simulation.
LMMs for heritability estimation

Challenges arise when the location of causal SNPs is correlated with LD structure.

Solution: Remove LD structure, i.e. whiten.
LMMs for heritability estimation: Mahalanobis kernel

Our proposal: Use the Mahalanobis kernel

    K_ij = x_i' Sigma^{-1} x_j,

where Cov(x_i) = Sigma, to measure genetic similarity. This resolves the causal SNPs/linkage disequilibrium and partitioning heritability problems, and clears a path for more modeling progress and results.

Argument:

    K_ij = x_i' Sigma^{-1} x_j
    <=> y = Xb + e, with b ~ MV(0, sigma_g^2 Sigma^{-1}/m)
    <=> y = X_Sigma b_Sigma + e, a fixed-effects model, where X_Sigma = X Sigma^{-1/2} and b_Sigma = Sigma^{1/2} b.

The Mahalanobis kernel has been used extensively elsewhere in genetics (e.g. for association testing), but, to our knowledge, not for heritability estimation.
Revisiting causal SNPs/LD and partitioning heritability

To find the Mahalanobis-MLE hat-h_Sigma^2, replace X by X Sigma^{-1/2} and then find the MLE for h^2 using the linear kernel, i.e. whiten/decorrelate/standardize the predictors and compute the usual MLE.

    h^2 = 0.5 | hat-h^2: mean 0.432, 95% CI (0.404, 0.460) | hat-h_Sigma^2: mean 0.499, 95% CI (0.473, 0.525)

For partitioning, write Sigma in block form,

    Sigma = [ Sigma_{S,S}      Sigma_{S,S^c}
              Sigma_{S^c,S}    Sigma_{S^c,S^c} ].

The partitioned heritability is defined via the decomposition y = g_S + g_{S^c} + e, where

    g_S = X_S (b_S + Sigma_{S,S}^{-1} Sigma_{S,S^c} b_{S^c}),
    g_{S^c} = (X_{S^c} - X_S Sigma_{S,S}^{-1} Sigma_{S,S^c}) b_{S^c}.

Under the LMM y = Xb + e with b ~ MV(0, sigma_g^2 Sigma^{-1}/m), g_S and g_{S^c} are uncorrelated.
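Whitening with Sigma^{-1/2} is the entire computational change needed for the Mahalanobis-MLE. A minimal sketch, assuming Sigma is known (in practice it must be estimated; the function names and AR(0.8) toy are illustrative):

```python
import numpy as np

def inv_sqrt(Sigma):
    """Sigma^{-1/2} for a symmetric positive definite Sigma."""
    lam, U = np.linalg.eigh(Sigma)
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T

rng = np.random.default_rng(4)
n, m = 5000, 50
idx = np.arange(m)
Sigma = 0.8 ** np.abs(idx[:, None] - idx[None, :])   # AR(0.8) LD structure
X = rng.standard_normal((n, m)) @ np.linalg.cholesky(Sigma).T
X_w = X @ inv_sqrt(Sigma)    # whitened design X_Sigma = X Sigma^{-1/2}
C = X_w.T @ X_w / n          # sample covariance of the whitened SNPs
```

After whitening, the sample covariance of the columns of X_w is close to the identity, so any linear-kernel variance-components routine can be applied to (y, X_w) unchanged.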
Why does this work?

Model misspecification results for variance components estimation problems (D. & Erdogdu, 2016, 2017).

Meta-result. Assume:
(i) y = Xb + e in R^n;
(ii) Cov(e) = sigma_e^2 I;
(iii) b in R^m is any (fixed or random) vector with ||b||^2 <= 1.
If the x_i ~ N(0, Sigma) are iid, then the variance component estimators (hat-sigma_g^2, hat-sigma_e^2) for the Mahalanobis kernel are nice:
(i) (hat-sigma_g^2, hat-sigma_e^2) -> (sigma_g^2, sigma_e^2), where sigma_g^2 = lim ||b||^2;
(ii) (hat-sigma_g^2, hat-sigma_e^2) are approximately normal.

Implication: If genotypes are random, then the Mahalanobis kernel for estimating heritability works even with fixed genetic effects and, in particular, when the location of causal SNPs is associated with LD structure.
Why does this work?

Proof idea: In the fixed-effects model with random genotypes, b inherits randomness from X. Let hat-theta = (hat-sigma_g^2, hat-sigma_e^2) be the MLE and assume WLOG that the x_i ~ N(0, I).

Key point: Since X =_d XU for any m x m orthogonal matrix U,

    hat-theta(y, X) =_d hat-theta(y~, X), where y~ = X b~ + e, b~ ~ uniform{S^{m-1}(sigma_g^2)}.

In other words, hat-theta = hat-theta(y, X) has the same distribution as hat-theta(y~, X), where the data are drawn from a random-effects model. Since b~ is approximately Gaussian for large m, we can use tools for variance components estimation in Gaussian random-effects models to approximate the distribution of hat-theta; the key technical tools are finite-sample normal approximation results for quadratic forms.

Other comments: We can get closed-form expressions for the asymptotic variance of hat-theta using results on the Marčenko-Pastur distribution.
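The rotation step can be written out as a short display. This is a sketch; it assumes the estimator depends on the design only through the linear-kernel GRM XX'/m (true for the MLE above), so that replacing X by XU leaves the estimator unchanged:

```latex
\begin{align*}
\hat\theta(y, X)
  &= \hat\theta(Xb + e,\; X) \\
  &\overset{d}{=} \hat\theta\big((XU)b + e,\; XU\big)
     && \text{since } X \overset{d}{=} XU \text{ for fixed orthogonal } U,\ e \perp X \\
  &= \hat\theta\big(X(Ub) + e,\; X\big)
     && \text{since } \hat\theta \text{ depends on the design only through } XX^\top/m \\
  &\overset{d}{=} \hat\theta(X\tilde b + e,\; X),
     \qquad \tilde b \sim \mathrm{uniform}\{S^{m-1}(\sigma_g^2)\},
\end{align*}
```

where the last step takes U Haar-distributed and independent of (X, e), so that Ub is uniform on the sphere of squared radius ||b||^2.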
Open questions

- Results for non-Gaussian x_i. SNP genotype data is always non-Gaussian, but simulations suggest the results may still hold.
- Binary outcomes (Erdogdu, Bayati & D., 2016).
- Applications to association testing problems.