Quasi-regression for heritability


Art B. Owen
Stanford University
March 2012

Abstract

We show in an idealized model that the narrow sense (linear) heritability from $d$ autosomal SNPs can be estimated without bias from $n$ independently sampled individuals, by a method of moments strategy based on quasi-regression. The variance of this estimator is below $C_1/n + C_2 d^2/n^3$ for constants $C_i < \infty$, uniformly in $n$ and $d \ge 1$. In particular $d \gg n$ is allowed, and heritability can be consistently estimated in some limits where the effects of individual SNPs cannot be consistently estimated due to non-sparsity. Furthermore, when the $C_2 d^2/n^3$ term dominates, then doubling the sample size reduces the variance by almost eight fold. The idealization is that the model assumes complete linkage equilibrium, although it does allow for an arbitrary pattern of regression coefficients. In particular the coefficients need not be sparse or Gaussianly distributed, nor independent of the minor allele frequency. The method and the rate of convergence extend to additive heritability. For a full quadratic model encompassing pairwise interactions with $\bar d = 2d + d(d-1)/2$ coefficients the error variance is $O(d^4/n^3)$.

1 Introduction

A missing heritability problem has emerged in genome wide association studies (Manolio et al., 2009). A phenotype known to be largely heritable, for example height, may be found to have only a few significantly predictive SNPs, and those few may explain in total only a small portion of the known heritability. A typical data set has measurements of $d$ SNPs on $n$ subjects where frequently $d \gg n$. Multiple linear regression of the phenotype on SNPs is a useful way to study their effects. Even when $d \gg n$, it is still possible to estimate the needed regression coefficients, if the true coefficient vector is sparse. See for example Zhao and Yu (2006). When that coefficient vector is not sparse, then it is not possible to estimate it accurately (Candès and Davenport, 2011).

2 Notation and models

In this paper, the meaning of $a(n,d) = O(b(n,d))$ is that there is some constant $C < \infty$ for which $a(n,d) \le C\,b(n,d)$ holds for all $n \ge 1$ and all $d \ge 1$. That is, we use non-asymptotic order bounds.

Let $y_i$ be a phenotype for subject $i$. We assume that the population average of this phenotype has been subtracted from $y_i$ so that $E(y_i) = 0$. We will suppose that the phenotype is bounded, $|y_i| \le c < \infty$. Boundedness is not necessary, but we prefer to write $c^3$ instead of, for example, $E(y^{12})^{1/4}$. Conversely, we find $E(y^2)$ more informative than $c^2$, so we do not use the bound everywhere.

Let $\tilde x_{ij}$ be a genetic measure for SNP $j$ on subject $i$. For an autosomal SNP, $\tilde x_{ij}$ might be $1$ if subject $i$ has one or more copies of the minor allele and zero otherwise. Alternatively, $\tilde x_{ij} \in \{0, 1, 2\}$ might be the number of copies of the minor allele in subject $i$. Let $x_{ij} = (\tilde x_{ij} - E(\tilde x_{ij}))/\mathrm{Var}(\tilde x_{ij})^{1/2}$ so that $E(x_{ij}) = 0$ and $\mathrm{Var}(x_{ij}) = 1$ in the relevant population. We assume additionally that the components $x_{i1}, \dots, x_{id}$ of $x_i$ are independent of each other. This neglects linkage disequilibrium (LD). LD only affects a small fraction of the $O(d^2)$ SNP pairs, but we expect it could be a significant consideration in aggregate because there are so many such pairs.

The $k$'th moment of $x_j$ is $\mu_{kj} = E(x_j^k)$. We will need $\mu_{4j}$ for each SNP. If SNP $j$ has minor allele frequency $\epsilon_j > 0$, then $\mu_{4j} = O(\epsilon_j^{-1})$. The variable $x_j$ is bounded. In the motivating examples, $|x_j| \le c_j$ where $c_j = O(\epsilon_j^{-1/2})$. In most work, we assume that $\epsilon_j \ge \epsilon_0$ holds for some $\epsilon_0 > 0$. Then the $\mu_{4j}$ are uniformly bounded. Some intermediate expressions use the average $\bar\mu_4 = (1/d)\sum_{j=1}^d \mu_{4j}$. These allow us to consider a limit in which alleles with smaller MAF become eligible for use as $n$ increases. For example, if $\epsilon_j \ge n^{-1/2}$ then $\bar\mu_4 = O(n^{1/2})$, and if we only assume $\epsilon_j \ge m/n$ (perhaps the smallest reasonable MAFs to use) for some $m > 0$, then $\bar\mu_4 = O(n)$.

The Euclidean norm of $x_i$ is denoted $\|x_i\|$. We easily find that $E(\|x\|^2) = d$ and $E(\|x\|^4) = d^2 + d(\bar\mu_4 - 1)$, so $\mathrm{Var}(\|x\|^2) = d(\bar\mu_4 - 1)$. As a result $\|x\|^2/d = 1 + O_p(d^{-1/2}\bar\mu_4^{1/2})$. In particular

$$E\Bigl(\frac1n\sum_{i=1}^n\Bigl(\frac{\|x_i\|^2}{d} - 1\Bigr)\Bigr)^2 = \frac{\bar\mu_4 - 1}{nd},$$

and so if $d \gg n$, then $\|x_i\|^2 \approx d$ is a good approximation and one that underlies some of our methods.

We consider several models for heritability. The linear model is

$$y_i = \sum_{j=1}^d x_{ij}\beta_j + \varepsilon_i, \qquad i = 1, \dots, n \tag{1}$$

where $\varepsilon_i$ is a random error uncorrelated with $x_i = (x_{i1}, \dots, x_{id})$ having mean $0$ and variance $\sigma^2$. Effects of the environment, interactions among the SNPs, and interactions between SNPs and the environment are subsumed into $\varepsilon_i$. The pairs $(x_i, y_i)$ are independent.
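As a concrete illustration of this setup, the following minimal sketch (Python with NumPy; the simulation settings and all variable names are illustrative, not from the paper) standardizes simulated genotype counts, computes a sample version of $\bar\mu_4$, checks $E(\|x\|^2) = d$, and draws one non-sparse phenotype from the linear model (1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2000                         # d >> n is allowed
eps = rng.uniform(0.05, 0.5, size=d)     # MAFs bounded away from 0

# Genotype counts under Hardy-Weinberg equilibrium, independent SNPs (no LD)
G = rng.binomial(2, eps, size=(n, d))
X = (G - 2 * eps) / np.sqrt(2 * eps * (1 - eps))  # E(x_ij) = 0, Var(x_ij) = 1

mu4bar = (X ** 4).mean()                 # sample estimate of bar(mu)_4
print((X ** 2).sum(axis=1).mean() / d)   # ~1, since E(||x||^2) = d

# One non-sparse draw from the linear model (1), with sigma2_L = 0.4
beta = rng.normal(0.0, np.sqrt(0.4 / d), size=d)
y = X @ beta + rng.normal(0.0, np.sqrt(0.6), size=n)
```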

Interest naturally centers on the vector $\beta = (\beta_1, \dots, \beta_d)$. Estimation of $\beta$ is difficult because usually $n \ll d$. If $\beta$ is sparse, then it can still be estimated using compressed sensing methods. But $\beta$ might not be sparse and then there are lower bounds (Candès and Davenport, 2011) on its estimation error.

To study linear heritability we are interested in $\sigma^2_L = \sum_{j=1}^d \beta_j^2 = \|\beta\|^2$. If $\beta$ is not sparse enough to allow an accurate estimate $\hat\beta$ of $\beta$, then we cannot rely on $\|\hat\beta\|^2$ to be a good estimate of $\sigma^2_L$. Accordingly, we turn our attention to strategies for estimating $\|\beta\|^2$ directly without requiring an accurate estimate of $\beta$. Our starting point is the bias corrected quasi-regression estimator of $\sigma^2_L$ from Owen (2000) given below. That algorithm is able to estimate $\sigma^2_L$ in an asymptotic regime where $n/d^{2/3} \to \infty$, without assuming sparsity of $\beta$. That is, the sample size $n$ can be sublinear in $d$. We choose an asymptotic regime with $n$ and $d$ growing together not because more SNPs will be found, but because the usual framework with fixed $d$ and $n \to \infty$ is not a good description for a finite data set with $n \ll d$.

In addition to the linear model, we may also be interested in an additive model which contains $x_j^2$ terms. If $\tilde x_j$ only takes $2$ levels, then the centered value $x_j^2 - 1$ is linearly dependent on $x_j$, and we gain nothing from incorporating it into the linear model. When SNPs are recorded at $3$ levels, this additive model allows one to investigate dominance effects. Let

$$z_j = \frac{x_j^2 - \mu_{3j}x_j - 1}{\sqrt{\mu_{4j} - \mu_{3j}^2 - 1}}. \tag{2}$$

Then $E(z_j) = 0$, $E(z_j^2) = 1$ and $E(x_jz_j) = 0$. The additive model for heritability is

$$y_i = \sum_{j=1}^d x_{ij}\beta_j + \sum_{j=1}^d z_{ij}\gamma_j + \varepsilon_i, \qquad i = 1, \dots, n, \tag{3}$$

where $\varepsilon_i$ is an error uncorrelated with the $x_{ij}$ and the $z_{ij}$, and does not ordinarily take the same numerical value in models (1), (3) and others that we consider. We are also interested in the squares model

$$y_i = \sum_{j=1}^d z_{ij}\gamma_j + \varepsilon_i, \qquad i = 1, \dots, n, \tag{4}$$

the pure interaction model

$$y_i = \sum_{j=1}^{d-1}\sum_{k=j+1}^d x_{ij}x_{ik}\beta_{jk} + \varepsilon_i, \qquad i = 1, \dots, n, \tag{5}$$

and the quadratic model

$$y_i = \sum_{j=1}^d x_{ij}\beta_j + \sum_{j=1}^d z_{ij}\gamma_j + \sum_{j=1}^{d-1}\sum_{k=j+1}^d x_{ij}x_{ik}\beta_{jk} + \varepsilon_i, \qquad i = 1, \dots, n. \tag{6}$$
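The transformation (2) is easy to implement and to verify numerically. A minimal sketch (sample moments standing in for the population $\mu_{3j}$ and $\mu_{4j}$; the helper name is ours), with the orthogonality checks holding up to sampling error:

```python
import numpy as np

def squares_features(X):
    """Build z_j of equation (2) from a standardized SNP matrix X,
    using sample moments in place of the population mu_{3j}, mu_{4j}."""
    mu3 = (X ** 3).mean(axis=0)
    mu4 = (X ** 4).mean(axis=0)
    return (X ** 2 - mu3 * X - 1.0) / np.sqrt(mu4 - mu3 ** 2 - 1.0)

# Sanity check on 3-level genotypes: E(z) = 0, E(z^2) = 1, E(xz) = 0
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(100_000, 3))
X = (G - 0.6) / np.sqrt(0.42)            # standardize Binomial(2, 0.3)
Z = squares_features(X)
print(Z.mean(axis=0))                    # ~0
print((Z ** 2).mean(axis=0))             # ~1
print((X * Z).mean(axis=0))              # ~0
```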

Some of these are of direct biological interest and others play a role in our proofs. Corresponding to the model parameters above are the sums of squares $\sigma^2_L = \sum_j \beta_j^2$, $\sigma^2_S = \sum_j \gamma_j^2$, and $\sigma^2_I = \sum_{j<k} \beta_{jk}^2$, along with the combinations $\sigma^2_A = \sigma^2_L + \sigma^2_S$ and $\sigma^2_Q = \sigma^2_A + \sigma^2_I$. These sums of squares take the same value in any model for which all their parts are defined. The following inequality

$$\max(\sigma^2_L, \sigma^2_S, \sigma^2_I) \le \sigma^2_Q \le E(y^2) \tag{7}$$

will be useful in bounding some error terms. For instance, even though $\sigma^2_I$ is the sum of $d(d-1)/2$ nonnegative contributions, that sum is $O(1)$. Also, introducing artificial phenotypes like $y^2 - E(y^2)$, and applying a version of (7) for them, yields further bounds that we use.

3 Quasi-regression for linear heritability

We begin with estimation of $\sigma^2_L$, for linear heritability. For $d \gg n$ it is infeasible to estimate $\beta$ by least squares. The quasi-regression estimator of $\beta_j$ is

$$\tilde\beta_j = \frac1n\sum_{i=1}^n x_{ij}y_i.$$

We easily find that $E(\tilde\beta_j) = \beta_j$. The naive estimator of $\sigma^2_L$ is

$$\hat\sigma^2_L = \sum_{j=1}^d \tilde\beta_j^2.$$

Proposition 1 gives the bias of $\hat\sigma^2_L$.

Proposition 1.
$$E(\hat\sigma^2_L) = \frac{n-1}{n}\sigma^2_L + \frac1n E(\|x\|^2y^2).$$

Proof. First

$$nE(\tilde\beta_j^2) = \frac1n E\Bigl(\sum_i\sum_{i'} x_{ij}x_{i'j}y_iy_{i'}\Bigr) = E(x_j^2y^2) + (n-1)E(x_jy)^2 = E(x_j^2y^2) + (n-1)\beta_j^2.$$

Summing over $j$ yields the result. $\square$
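A minimal sketch of the naive estimator, illustrating the bias of Proposition 1 on simulated two-level SNPs (all settings illustrative; with $\pm 1$ coding, $E(\|x\|^2y^2)/n = dE(y^2)/n$ exactly):

```python
import numpy as np

def naive_sigma2_L(X, y):
    """Naive quasi-regression estimate of sigma2_L = ||beta||^2."""
    n = len(y)
    beta_tilde = X.T @ y / n          # tilde beta_j = (1/n) sum_i x_ij y_i
    return np.sum(beta_tilde ** 2)

rng = np.random.default_rng(2)
n, d = 200, 1000
X = rng.choice([-1.0, 1.0], size=(n, d))          # standardized 2-level SNPs
beta = rng.normal(0.0, np.sqrt(0.5 / d), size=d)  # dense: ||beta||^2 ~ 0.5
y = X @ beta + rng.normal(0.0, np.sqrt(0.5), size=n)

# Proposition 1: upward bias ~ E(||x||^2 y^2)/n = d E(y^2)/n = 5 here
print(naive_sigma2_L(X, y))    # roughly 5.5, far above sigma2_L ~ 0.5
```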

Because $E(\|x\|^2) = d$, the bias in $\hat\sigma^2_L$ is of order $d/n$, which is severe when $d \gg n$. But this bias is easily removed. Let

$$B_L = \frac1{n^2}\sum_{i=1}^n \|x_i\|^2y_i^2. \tag{8}$$

We can obtain a bias corrected estimate

$$\hat\sigma^2_{L,\mathrm{BC}} = \frac{n}{n-1}\bigl(\hat\sigma^2_L - B_L\bigr), \tag{9}$$

because

$$E(\hat\sigma^2_{L,\mathrm{BC}}) = \frac{n}{n-1}\Bigl(\frac{n-1}{n}\sigma^2_L + \frac1nE(\|x\|^2y^2) - \frac1nE(\|x\|^2y^2)\Bigr) = \sigma^2_L.$$

A simple approximation to the bias adjustment can be obtained via the concentration of measure approximation $\|x_i\|^2 \approx d$. Letting

$$\tilde B_L = \frac{d}{n^2}\sum_{i=1}^n y_i^2,$$

the resulting estimate is

$$\hat\sigma^2_{L,\mathrm{BC,CM}} = \frac{n}{n-1}\bigl(\hat\sigma^2_L - \tilde B_L\bigr). \tag{10}$$

Proposition 2. Suppose that the phenotype $y$ satisfies the bound $|y| \le c$. Then

$$E\bigl((B_L - \tilde B_L)^2\bigr) \le \frac{c^4\bar\mu_4 d}{n^2}.$$

Proof. First

$$\tilde B_L - B_L = \frac1{n^2}\sum_{i=1}^n (d - \|x_i\|^2)y_i^2.$$

Therefore

$$E\bigl((\tilde B_L - B_L)^2\bigr) = \frac1{n^3}E\bigl((d - \|x\|^2)^2y^4\bigr) + \frac{n-1}{n^3}E\bigl((d - \|x\|^2)y^2\bigr)^2 \le \frac{c^4\bar\mu_4 d}{n^3} + \frac{c^4(n-1)}{n^3}d\bar\mu_4 \le \frac{c^4\bar\mu_4 d}{n^2}. \square$$

Proposition 2 shows that the concentration of measure approximation changes our estimate of $\sigma^2_L$ by an amount with root mean square (RMS) $O(d^{1/2}n^{-1}\bar\mu_4^{1/2})$, where $c$ is assumed to be constant in $n$ and $d$ for the phenotype of interest. In an asymptotic regime with $\bar\mu_4$ bounded, we find that the concentration of measure approximation makes a vanishing RMS effect on our estimate if $d = o(n^2)$. Simulations in Owen (2000) showed very little difference between approximations with and without the concentration of measure approximation.
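Both bias corrected estimates (9) and (10) are short computations given the design matrix. An illustrative sketch (function name ours; simulated data as before), where both versions land near $\sigma^2_L$ up to the sampling noise quantified by Theorem 1 below:

```python
import numpy as np

def sigma2_L_BC(X, y, concentration=False):
    """Bias corrected estimate (9); with concentration=True, the
    concentration of measure shortcut (10)."""
    n, d = X.shape
    beta_tilde = X.T @ y / n
    naive = np.sum(beta_tilde ** 2)
    if concentration:
        B = d * np.sum(y ** 2) / n ** 2                      # tilde B_L
    else:
        B = np.sum((X ** 2).sum(axis=1) * y ** 2) / n ** 2   # B_L of (8)
    return n / (n - 1) * (naive - B)

rng = np.random.default_rng(3)
n, d = 200, 1000
eps = rng.uniform(0.05, 0.5, size=d)
G = rng.binomial(2, eps, size=(n, d))
X = (G - 2 * eps) / np.sqrt(2 * eps * (1 - eps))
beta = rng.normal(0.0, np.sqrt(0.5 / d), size=d)
y = X @ beta + rng.normal(0.0, np.sqrt(0.5), size=n)
print(sigma2_L_BC(X, y), sigma2_L_BC(X, y, concentration=True))
```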

Theorem 1. Let $\hat\sigma^2_L$ be the naive quasi-regression variance estimate and assume that the phenotype satisfies the bound $|y| \le c$. Then

$$E\bigl((\hat\sigma^2_{L,\mathrm{BC}} - \sigma^2_L)^2\bigr) = \mathrm{Var}(\hat\sigma^2_{L,\mathrm{BC}}) = O\Bigl(\frac1n + \frac{d^2}{n^3}\Bigr).$$

Proof. It suffices to show that $\mathrm{Var}(\hat\sigma^2_L - B_L) = O(n^{-1} + d^2n^{-3})$. A lengthy derivation (see Lemma 3 of the appendix) yields

$$\mathrm{Var}(\hat\sigma^2_L) \le \frac4nc^2\sigma^2_L + \frac1{n^2}\Bigl(4\sigma_Lc^3(d^2 + d\bar\mu_4)^{1/2} + 2(d\bar\mu_4 + 2)E(y^4)\Bigr) + \frac1{n^3}c^4(d^2 + d\bar\mu_4) = O(n^{-1} + dn^{-2} + d^2n^{-3}) = O(n^{-1} + d^2n^{-3}).$$

Next

$$\mathrm{Var}(B_L) = \frac1{n^3}\Bigl(E(\|x\|^4y^4) - E(\|x\|^2y^2)^2\Bigr) \le \frac{c^4}{n^3}(d^2 + d\bar\mu_4).$$

Thus $\mathrm{Var}(B_L) = O(d^2n^{-3})$ and $\mathrm{Var}(\hat\sigma^2_L) = O(n^{-1} + d^2n^{-3})$, so

$$\mathrm{Var}(\hat\sigma^2_L - B_L) \le \mathrm{Var}(\hat\sigma^2_L) + \mathrm{Var}(B_L) + 2\bigl(\mathrm{Var}(\hat\sigma^2_L)\mathrm{Var}(B_L)\bigr)^{1/2} = O(n^{-1} + d^2n^{-3}). \square$$

When $d \gg n$, the dominant term in the bound for $\mathrm{Var}(\hat\sigma^2_{L,\mathrm{BC}})$ is $4c^4d^2/n^3$. In that case, doubling $n$ reduces the variance by nearly a factor of eight. The concentration of measure approximation is accurate to within $O(d^{1/2}n^{-1})$, while the standard deviation of $\hat\sigma^2_{L,\mathrm{BC}}$ is $O(n^{-1/2} + dn^{-3/2})$. As a result, the bias introduced by concentration of measure is of order $(\sqrt{n/d} + \sqrt{d/n}\,)^{-1}$ standard errors. It is thus negligible for the values of $n$ and $d$ used in many genome-wide association studies. Both $\hat\sigma^2_{L,\mathrm{BC}}$ and $\hat\sigma^2_{L,\mathrm{BC,CM}}$ require $O(nd)$ computation, but the latter has a smaller implicit constant.
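A small Monte Carlo experiment (illustrative settings, not from the paper) can exhibit the $d^2/n^3$ regime and the near eight fold variance drop when $n$ doubles:

```python
import numpy as np

def mc_variance(n, d, sigma2_L=0.5, reps=300, seed=0):
    """Monte Carlo variance of the bias corrected estimator, to
    illustrate the O(1/n + d^2/n^3) bound of Theorem 1."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, np.sqrt(sigma2_L / d), size=d)  # fixed truth
    ests = []
    for _ in range(reps):
        X = rng.choice([-1.0, 1.0], size=(n, d))
        y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - sigma2_L), size=n)
        bt = X.T @ y / n
        B = np.sum((X ** 2).sum(axis=1) * y ** 2) / n ** 2
        ests.append(n / (n - 1) * (np.sum(bt ** 2) - B))
    return np.var(ests)

# With d >> n the d^2/n^3 term dominates, so doubling n should cut
# the variance by roughly a factor of eight:
print(mc_variance(100, 2000), mc_variance(200, 2000))
```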

4 Additive and quadratic heritability

The additive model predicts $y$ by a linear combination of $x_j$ and $z_j$ for $j = 1, \dots, d$. Because $x_1, \dots, x_d$ are independent, it follows that the entire list $x_1, \dots, x_d, z_1, \dots, z_d$ consists of $2d$ mutually uncorrelated predictors. The analysis of the additive model is similar to that of the linear model, except that there are twice as many predictors and the fourth moments of $z_j$ are larger than those of $x_j$. The quadratic model incorporates predictors $x_jx_k$ for $j < k$. For distinct pairs $j < k$ and $j' < k'$ with $j \ne j'$ or $k \ne k'$, the predictors $x_jx_k$ and $x_{j'}x_{k'}$ are uncorrelated. Similarly $x_jx_k$ is uncorrelated with all of the $x_l$ and $z_l$, whether or not $l = j$ or $l = k$. The big difference in the quadratic model is that there are now more than $d^2/2$ coefficients to estimate instead of $O(d)$ coefficients.

For the additive model, we estimate

$$\tilde\beta_j = \frac1n\sum_{i=1}^n x_{ij}y_i \quad\text{and}\quad \tilde\gamma_j = \frac1n\sum_{i=1}^n z_{ij}y_i,$$

and the naive estimate of $\sigma^2_A$ is

$$\hat\sigma^2_A = \sum_{j=1}^d (\tilde\beta_j^2 + \tilde\gamma_j^2).$$

Proposition 3.
$$E(\hat\sigma^2_A) = \frac{n-1}{n}\sigma^2_A + \frac1n E\bigl((\|x\|^2 + \|z\|^2)y^2\bigr).$$

Proof. The same argument used in Proposition 1 applies here. $\square$

Letting

$$B_A = \frac1{n^2}\sum_{i=1}^n (\|x_i\|^2 + \|z_i\|^2)y_i^2 \quad\text{and}\quad \tilde B_A = \frac{2d}{n^2}\sum_{i=1}^n y_i^2,$$

we obtain the estimates

$$\hat\sigma^2_{A,\mathrm{BC}} = \frac{n}{n-1}\bigl(\hat\sigma^2_A - B_A\bigr) \quad\text{and}\quad \hat\sigma^2_{A,\mathrm{BC,CM}} = \frac{n}{n-1}\bigl(\hat\sigma^2_A - \tilde B_A\bigr).$$

As in the linear case we find that $E(\hat\sigma^2_{A,\mathrm{BC}}) = \sigma^2_A$. Also, from the decomposition $A = L + S$ we get

$$E\bigl((B_A - \tilde B_A)^2\bigr) \le \frac{c^4\bigl(\bar\mu_4(x) + \bar\mu_4(z) + 2(\bar\mu_4(x)\bar\mu_4(z))^{1/2}\bigr)d}{n^2},$$

where $\bar\mu_4(x)$ and $\bar\mu_4(z)$ are the average fourth moments of the $x_j$ and $z_j$ respectively.

Theorem 2. Let $\hat\sigma^2_A$ be the naive quasi-regression variance estimate of $\sigma^2_A$ and assume that the phenotype satisfies the bound $|y| \le c$. Then

$$\mathrm{Var}(\hat\sigma^2_{A,\mathrm{BC}}) = O\Bigl(\frac1n + \frac{d^2}{n^3}\Bigr).$$

Proof. The additive variance explained is $\sigma^2_A = \sigma^2_L + \sigma^2_S$, and the estimate partitions accordingly, as $\hat\sigma^2_{A,\mathrm{BC}} = \hat\sigma^2_{L,\mathrm{BC}} + \hat\sigma^2_{S,\mathrm{BC}}$. We can apply Theorem 1 to both parts, letting $z_j$ take the role of $x_j$ in the second application. $\square$
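The additive estimator reuses the linear machinery on the concatenated predictors. A sketch (assuming Z was built with the squares_features helper sketched after equation (2)):

```python
import numpy as np

def sigma2_A_BC(X, Z, y):
    """Bias corrected additive estimate of sigma2_A = sigma2_L + sigma2_S.
    Z holds the z_j of equation (2); the 2d columns of (X, Z) are
    mutually uncorrelated, so the linear machinery applies unchanged."""
    n = len(y)
    bt = X.T @ y / n                       # tilde beta_j
    gt = Z.T @ y / n                       # tilde gamma_j
    naive = np.sum(bt ** 2) + np.sum(gt ** 2)
    B_A = np.sum(((X ** 2).sum(axis=1) + (Z ** 2).sum(axis=1)) * y ** 2) / n ** 2
    return n / (n - 1) * (naive - B_A)     # unbiased for sigma2_A
```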

For the quadratic model, we incorporate the estimates

$$\tilde\beta_{jk} = \frac1n\sum_{i=1}^n x_{ij}x_{ik}y_i$$

for $1 \le j < k \le d$. The naive estimate of $\sigma^2_Q$ is

$$\hat\sigma^2_Q = \sum_{j=1}^d(\tilde\beta_j^2 + \tilde\gamma_j^2) + \sum_{j<k}\tilde\beta_{jk}^2.$$

Below we write $\bar x_i = (x_{i1}x_{i2}, x_{i1}x_{i3}, \dots, x_{i,d-1}x_{id}) \in \mathbb{R}^{d(d-1)/2}$ for the vector of pairwise products, so that $\|\bar x_i\|^2 = \sum_{j<k}x_{ij}^2x_{ik}^2 = (\|x_i\|^4 - \sum_jx_{ij}^4)/2$.

Proposition 4.
$$E(\hat\sigma^2_Q) = \frac{n-1}{n}\sigma^2_Q + \frac1n E\bigl((\|x\|^2 + \|z\|^2 + \|\bar x\|^2)y^2\bigr).$$

Proof. Recall that $\sigma^2_Q = \sigma^2_L + \sigma^2_S + \sigma^2_I$. To take account of the $\sigma^2_I$ component, write for $j < k$,

$$nE(\tilde\beta_{jk}^2) = \frac1n\sum_i\sum_{i'}E\bigl(x_{ij}x_{ik}x_{i'j}x_{i'k}y_iy_{i'}\bigr) = E(x_j^2x_k^2y^2) + (n-1)\beta_{jk}^2.$$

Summing over $1 \le j < k \le d$ yields

$$\sum_{j<k}E(\tilde\beta_{jk}^2) = \frac1nE(\|\bar x\|^2y^2) + \frac{n-1}{n}\sigma^2_I. \square$$

Once again, we get a bias correction and a concentration of measure approximation:

$$B_Q = \frac1{n^2}\sum_{i=1}^n\bigl(\|x_i\|^2 + \|z_i\|^2 + \|\bar x_i\|^2\bigr)y_i^2 \quad\text{and}\quad \tilde B_Q = \Bigl(2d + \frac{d^2 + d\bar\mu_4}{2}\Bigr)\frac1{n^2}\sum_{i=1}^n y_i^2.$$

The bias is now of the much larger order $d^2$, consistent with there being more than $d^2/2$ parameters to estimate. Using these expressions, we obtain the estimates

$$\hat\sigma^2_{Q,\mathrm{BC}} = \frac{n}{n-1}\bigl(\hat\sigma^2_Q - B_Q\bigr) \quad\text{and}\quad \hat\sigma^2_{Q,\mathrm{BC,CM}} = \frac{n}{n-1}\bigl(\hat\sigma^2_Q - \tilde B_Q\bigr).$$

Here $E(\hat\sigma^2_{Q,\mathrm{BC}}) = \sigma^2_Q$.
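An illustrative implementation of the interaction part (assuming the reconstruction of $B_Q$ above): rather than materializing all $d(d-1)/2$ products, the $\tilde\beta_{jk}$ can be collected in the $d \times d$ matrix $M = X^\mathsf{T}\mathrm{diag}(y)X/n$, and $\|\bar x_i\|^2$ computed from the square-of-sums identity above. The bias corrected quadratic estimate is then the sum of the additive and interaction parts, $\hat\sigma^2_{Q,\mathrm{BC}} = \hat\sigma^2_{A,\mathrm{BC}} + \hat\sigma^2_{I,\mathrm{BC}}$:

```python
import numpy as np

def sigma2_I_BC(X, y):
    """Bias corrected estimate of the interaction variance sigma2_I.
    M[j,k] = (1/n) sum_i x_ij x_ik y_i collects every tilde beta_{jk};
    forming M costs O(n d^2) time and O(d^2) memory."""
    n, d = X.shape
    M = (X * y[:, None]).T @ X / n
    naive = (np.sum(M ** 2) - np.sum(np.diag(M) ** 2)) / 2  # sum over j < k
    # ||xbar_i||^2 = sum_{j<k} x_ij^2 x_ik^2 via the square-of-sums identity
    xbar_sq = ((X ** 2).sum(axis=1) ** 2 - (X ** 4).sum(axis=1)) / 2
    B_I = np.sum(xbar_sq * y ** 2) / n ** 2
    return n / (n - 1) * (naive - B_I)
```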

In the model equation $Q = L + S + I$ we have already seen how the linear and square parts have an accurate concentration of measure approximation. It remains to consider $B_I - \tilde B_I$.

Proposition 5. For the interaction model, the bias adjustments satisfy

$$E\bigl((B_I - \tilde B_I)^2\bigr) = O\Bigl(\frac{d^3}{n^2}\Bigr).$$

Proof. Here $B_I = (1/n^2)\sum_i\|\bar x_i\|^2y_i^2$ and $\tilde B_I = ((d^2 + d\bar\mu_4)/(2n^2))\sum_iy_i^2$, with $2\|\bar x_i\|^2 = \|x_i\|^4 - \sum_jx_{ij}^4$. Therefore

$$E\bigl((B_I - \tilde B_I)^2\bigr) = \frac1{4n^4}E\Bigl(\Bigl(\sum_{i=1}^n\bigl((d^2 + d\bar\mu_4) - 2\|\bar x_i\|^2\bigr)y_i^2\Bigr)^2\Bigr)$$
$$= \frac1{4n^3}E\bigl(\bigl((d^2 + d\bar\mu_4) - 2\|\bar x\|^2\bigr)^2y^4\bigr) + \frac{n-1}{4n^3}E\bigl(\bigl((d^2 + d\bar\mu_4) - 2\|\bar x\|^2\bigr)y^2\bigr)^2 \le \frac{c^4}{n^2}\bigl(\mathrm{Var}(\|x\|^4) + O(d^2)\bigr) = O\Bigl(\frac{d^3}{n^2}\Bigr). \square$$

Theorem 3. Let $\hat\sigma^2_Q$ be the naive quasi-regression variance estimate of $\sigma^2_Q$ and assume that the phenotype satisfies the bound $|y| \le c$. Then

$$\mathrm{Var}(\hat\sigma^2_{Q,\mathrm{BC}}) = O\Bigl(\frac1n + \frac{d^4}{n^3}\Bigr).$$

Proof. The error from the additive parts is already seen to be $O(1/n + d^2/n^3)$. Therefore it suffices to show that $\mathrm{Var}(\hat\sigma^2_{I,\mathrm{BC}}) = O(1/n + d^4/n^3)$. This is exactly what we would expect from a linear model with $d' = d(d-1)/2$ independent predictors. But the predictors in the interaction model, while uncorrelated, are not independent. We can follow the argument for the linear case using the $d'$ predictors $x_jx_k$ for $j < k$ in Lemma 3, replacing each $x_j$ there by $x_rx_s$ for $r < s$, each $x_k$ by $x_jx_k$ for $j < k$, and the vector $x$ by $\bar x = (x_1x_2, x_1x_3, \dots, x_{d-1}x_d) \in \mathbb{R}^{d'}$. We then use $E(x_jx_ky) = \beta_{jk}$ and replace $\varepsilon_L$ by $\varepsilon_I = y - \sum_{j<k}x_jx_k\beta_{jk}$. The analogy holds in a straightforward way except for the term

$$\frac{n-1}{n^3}\sum_{j<k}\sum_{r<s}E\bigl(x_jx_kx_rx_sy^2\bigr)^2$$

which now involves four way sums over components of $x$. We need this sum of squares to be $O(d^2)$. If we use the artificial phenotype $y^2 - E(y^2)$ and consider a pure quartic model with $d(d-1)(d-2)(d-3)/24$ predictors $x_jx_kx_rx_s$ with $j$, $k$, $r$, and $s$ all distinct, then the terms with no ties among $j$, $k$, $r$, $s$ sum to at most a bounded multiple of $E(y^4)$. What remains are the terms where $\{j, k\} = \{r, s\}$. There are $O(d^2)$ such terms and they are uniformly bounded. Thus while it is possible to estimate the interaction inheritance, and hence the quadratic inheritance, with $n \ll d^2$, we still need $n$ of larger order than $d^{4/3}$. $\square$

Practically, this implies that we may be able to investigate interactions among strategically chosen subsets of SNPs, but perhaps not the entire set of interactions at practical sample sizes.

The concentration of measure approximation for the interaction inheritance is $\|x\|^4 \approx d^2 + d\bar\mu_4$. The root mean square change from this shortcut is $O(d^{3/2}/n)$, while the standard deviation of the estimate is $O(n^{-1/2} + d^2n^{-3/2})$. Suppose that $n$ is just barely large enough to allow estimation of $\sigma^2_I$, perhaps $n = d^{4/3+\epsilon}$. Then the RMS difference between $\hat\sigma^2_{I,\mathrm{BC}}$ and $\hat\sigma^2_{I,\mathrm{BC,CM}}$ is of order $d^{3/2}n^{-1} = d^{1/6-\epsilon}$, which diverges for small $\epsilon$. As a result, the concentration of measure shortcut is not recommended for the interaction or quadratic model.
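For example, restricting the estimator to a candidate subset (reusing the sigma2_I_BC sketch above; the subset indices and simulation settings are hypothetical), where with $|S| = m$ SNPs there are only $m(m-1)/2$ pairs and consistency needs $n \gg m^{4/3}$ rather than $n \gg d^{4/3}$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 2000, 500
X = rng.choice([-1.0, 1.0], size=(n, d))
y = 0.5 * X[:, 10] * X[:, 42] + rng.normal(0.0, 1.0, size=n)  # one real pair

S = np.array([10, 42, 77, 105])          # hypothetical candidate SNPs
print(sigma2_I_BC(X[:, S], y))           # ~0.25 = beta_{10,42}^2
```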

Acknowledgments

I thank Or Zuk for helpful discussions. This work was supported by the NSF under grant DMS.

References

Candès, E. and Davenport, M. A. (2011). How well can we estimate a sparse vector? Technical report, Stanford University.

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., Cho, J. H., Guttmacher, A. E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C. N., Slatkin, M., Valle, D., Whittemore, A. S., Boehnke, M., Clark, A. G., Eichler, E. E., Gibson, G., Haines, J. L., Mackay, T. F. C., McCarroll, S. A., and Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature, 461:747–753.

Owen, A. B. (2000). Assessing linearity in high dimensions. The Annals of Statistics, 28(1):1–19.

Zhao, P. and Yu, B. (2006). On model selection consistency of the lasso. Journal of Machine Learning Research, 7:2541–2563.

Appendix

Here we prove some basic Lemmas used in the main Theorem. The following basic properties of $\|x\|$ are used below:

$$E(\|x\|^2) = d, \qquad E(\|x\|^4) = d^2 + d(\bar\mu_4 - 1), \qquad E\bigl((\|x\|^2 - d)^2\bigr) = d(\bar\mu_4 - 1),$$

and $E\bigl((\|x\|^2 - d)^r\bigr) = O(d^{r-1})$ for integer $r \ge 3$.
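These moment identities are easy to spot check numerically (illustrative settings; three-level genotypes are used because two-level $\pm 1$ coding makes $\|x\|^2 = d$ exactly, which would make the check degenerate):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200_000, 50
G = rng.binomial(2, 0.3, size=(n, d))
X = (G - 0.6) / np.sqrt(0.42)
s = (X ** 2).sum(axis=1)
mu4bar = (X ** 4).mean()
print(s.mean(), d)                                   # E(||x||^2) = d
print((s ** 2).mean(), d ** 2 + d * (mu4bar - 1))    # E(||x||^4)
print(((s - d) ** 2).mean(), d * (mu4bar - 1))       # Var(||x||^2)
```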

Lemma 1 uses the artificial phenotype $y^2 - E(y^2)$ to bound a quantity which appears in $\mathrm{Var}(\hat\sigma^2_L)$.

Lemma 1.
$$\sum_{j=1}^d\sum_{k=1}^d E(x_jx_ky^2)^2 \le (d\bar\mu_4 + 2)E(y^4).$$

Proof. We write

$$\sum_{j,k}E(x_jx_ky^2)^2 = \sum_{j\ne k}E(x_jx_ky^2)^2 + \sum_jE(x_j^2y^2)^2.$$

For the first term, consider the artificial phenotype $y^2 - E(y^2)$. Then

$$\sum_{j\ne k}E(x_jx_ky^2)^2 = 2\sum_{j<k}E\bigl(x_jx_k(y^2 - E(y^2))\bigr)^2 \le 2E\bigl((y^2 - E(y^2))^2\bigr) \le 2E(y^4),$$

because of inequality (7) applied to the interaction model for this artificial phenotype. Next

$$\sum_jE(x_j^2y^2)^2 \le \sum_jE(x_j^4)E(y^4) \le E(y^4)\,d\bar\mu_4. \square$$

Lemma 2.
$$E\bigl((y - \varepsilon_L)\|x\|^2y^3\bigr) \le \sigma_Lc^3(d^2 + d\bar\mu_4)^{1/2}.$$

Proof. Using Cauchy–Schwarz, boundedness of $y$, and the above results on $\|x\|$:

$$E\bigl((y - \varepsilon_L)\|x\|^2y^3\bigr) \le E\bigl((y - \varepsilon_L)^2\bigr)^{1/2}E\bigl(\|x\|^4y^6\bigr)^{1/2} \le \sigma_Lc^3(d^2 + d\bar\mu_4)^{1/2}. \square$$

Lemma 3 is the main lemma.

Lemma 3. Under the conditions of Theorem 1,

$$\mathrm{Var}(\hat\sigma^2_L) \le \frac4nc^2\sigma^2_L + \frac1{n^2}\Bigl(4\sigma_Lc^3(d^2 + d\bar\mu_4)^{1/2} + 2(d\bar\mu_4 + 2)E(y^4)\Bigr) + \frac1{n^3}c^4(d^2 + d\bar\mu_4).$$

Proof. First,

$$\mathrm{Var}(\hat\sigma^2_L) = E\Bigl(\Bigl(\sum_j\tilde\beta_j^2\Bigr)^2\Bigr) - \Bigl(E\Bigl(\sum_j\tilde\beta_j^2\Bigr)\Bigr)^2.$$

We simplify some expressions below using $E(x_jy) = \beta_j$ and $\varepsilon_L = y - x^\mathsf{T}\beta$. For each pair $j, k \in \{1, 2, \dots, d\}$,

$$E(\tilde\beta_j^2\tilde\beta_k^2) = E\Bigl(\Bigl(\frac1n\sum_ix_{ij}y_i\Bigr)^2\Bigl(\frac1n\sum_ix_{ik}y_i\Bigr)^2\Bigr) = \frac1{n^4}E\Bigl(\sum_{i_1}\sum_{i_2}\sum_{i_3}\sum_{i_4}x_{i_1j}x_{i_2j}x_{i_3k}x_{i_4k}y_{i_1}y_{i_2}y_{i_3}y_{i_4}\Bigr)$$
$$= \frac{(n-1)(n-2)(n-3)}{n^3}E(x_jy)^2E(x_ky)^2$$
$$+ \frac{(n-1)(n-2)}{n^3}\Bigl(E(x_j^2y^2)E(x_ky)^2 + E(x_k^2y^2)E(x_jy)^2 + 4E(x_jx_ky^2)E(x_jy)E(x_ky)\Bigr)$$
$$+ \frac{n-1}{n^3}\Bigl(2E(x_jy)E(x_jx_k^2y^3) + 2E(x_ky)E(x_j^2x_ky^3) + E(x_j^2y^2)E(x_k^2y^2) + 2E(x_jx_ky^2)^2\Bigr)$$
$$+ \frac1{n^3}E(x_j^2x_k^2y^4).$$

Summing over $j$ and $k$,

$$\sum_{j,k}E(\tilde\beta_j^2\tilde\beta_k^2) = \frac{(n-1)(n-2)(n-3)}{n^3}\sigma_L^4 + \frac{(n-1)(n-2)}{n^3}\Bigl(2\sigma^2_LE(\|x\|^2y^2) + 4E\bigl((y - \varepsilon_L)^2y^2\bigr)\Bigr)$$
$$+ \frac{n-1}{n^3}\Bigl(4E\bigl((y - \varepsilon_L)\|x\|^2y^3\bigr) + E(\|x\|^2y^2)^2 + 2\sum_{j,k}E(x_jx_ky^2)^2\Bigr) + \frac1{n^3}E(\|x\|^4y^4).$$

Next $E(\tilde\beta_j^2) = E(x_j^2y^2)/n + (n-1)\beta_j^2/n$. Therefore

$$\Bigl(E\Bigl(\sum_j\tilde\beta_j^2\Bigr)\Bigr)^2 = \frac1{n^2}E(\|x\|^2y^2)^2 + \frac{2(n-1)}{n^2}E(\|x\|^2y^2)\sigma^2_L + \frac{(n-1)^2}{n^2}\sigma_L^4.$$

Subtracting,

$$\mathrm{Var}(\hat\sigma^2_L) = \Bigl(\frac{(n-1)(n-2)(n-3)}{n^3} - \frac{(n-1)^2}{n^2}\Bigr)\sigma_L^4 + \Bigl(\frac{2(n-1)(n-2)}{n^3} - \frac{2(n-1)}{n^2}\Bigr)E(\|x\|^2y^2)\sigma^2_L$$
$$+ \Bigl(\frac{n-1}{n^3} - \frac1{n^2}\Bigr)E(\|x\|^2y^2)^2 + \frac{4(n-1)(n-2)}{n^3}E\bigl((y - \varepsilon_L)^2y^2\bigr)$$
$$+ \frac{n-1}{n^3}\Bigl(4E\bigl((y - \varepsilon_L)\|x\|^2y^3\bigr) + 2\sum_{j,k}E(x_jx_ky^2)^2\Bigr) + \frac1{n^3}E(\|x\|^4y^4).$$

The first three terms above are less than or equal to zero. Therefore

$$\mathrm{Var}(\hat\sigma^2_L) \le \frac{4(n-1)(n-2)}{n^3}E\bigl((y - \varepsilon_L)^2y^2\bigr) + \frac{n-1}{n^3}\Bigl(4E\bigl((y - \varepsilon_L)\|x\|^2y^3\bigr) + 2\sum_{j,k}E(x_jx_ky^2)^2\Bigr) + \frac1{n^3}E(\|x\|^4y^4).$$

We can simplify this bound using $E\bigl((y - \varepsilon_L)^2y^2\bigr) \le c^2\sigma^2_L$, Lemmas 1 and 2, and $E(\|x\|^4y^4) \le c^4(d^2 + d\bar\mu_4)$, yielding

$$\mathrm{Var}(\hat\sigma^2_L) \le \frac{4(n-1)(n-2)}{n^3}c^2\sigma^2_L + \frac{n-1}{n^3}\Bigl(4\sigma_Lc^3(d^2 + d\bar\mu_4)^{1/2} + 2(d\bar\mu_4 + 2)E(y^4)\Bigr) + \frac1{n^3}c^4(d^2 + d\bar\mu_4)$$
$$\le \frac4nc^2\sigma^2_L + \frac1{n^2}\Bigl(4\sigma_Lc^3(d^2 + d\bar\mu_4)^{1/2} + 2(d\bar\mu_4 + 2)E(y^4)\Bigr) + \frac1{n^3}c^4(d^2 + d\bar\mu_4). \square$$
