Correction for Hidden Confounders in the Genetic Analysis of Gene Expression


Supplemental Information for: Correction for Hidden Confounders in the Genetic Analysis of Gene Expression

Jennifer Listgarten, Carl Kadie, Eric E. Schadt, and David Heckerman

Proceedings of the National Academy of Sciences, 2010

Supplemental Methods

We review linear mixed-effects models, how they have been used for correction of population structure (PS) or expression heterogeneity (EH) in the genetic association literature, how a problem can arise in the application of EH models to data like our human and mouse data, and how to fix it. Then, we introduce our model for joint correction of EH and PS, called LMM-EH-PS.

Linear mixed-effects models can be understood at an intuitive level from a variety of viewpoints. We view the linear mixed-effects model as a probabilistic generative model [1-3] in which a target variable, y_g, in our case a vector of gene-probe expression levels across all individuals for one gene probe g (y_g is of dimension N × 1 for N individuals), is modeled as the linear combination of three quantities: (1) fixed effects X (dimension N × Q), which contain the effects of S SNPs and Q − S other fixed effects (e.g., a bias/offset term and covariates such as gender and age), (2) random effects, u_g (of dimension N × 1 for N individuals), which model hidden effects such as PS or EH (and will be integrated out given a suitable distribution), and (3) independent Gaussian noise, e_g (dimension N × 1):

    y_g = X β_g + τ_g u_g + σ_g e_g,

where τ_g and σ_g (scalars) are, respectively, the magnitudes of the confounding structure and of the residual noise, and β_g (dimension Q × 1) are the coefficients of the fixed effects. The conditional likelihood is also sometimes written in the following equivalent manner, more convenient for exposition of our method,

    p(y_g | u_g, X, β_g, τ_g, σ_g) = N(y_g | X β_g + u_g τ_g, I σ_g²),

where N(r | m, Σ) denotes a Gaussian distribution for r with mean m and covariance matrix Σ, and I denotes the N × N identity matrix.

Shortly we will describe how to jointly model all gene probes. Note that because in our work we test for only one SNP at a time, we do not incorporate all SNPs into the model jointly, instead restricting the fixed effects to include just one SNP at a time.

When it is assumed that some individuals share the same random effects, u_g, for example in the case where there are multiple individuals from the same strain in a genetic association study (who by definition have identical genetics), one can augment the model with an indicator matrix, Z, for which Z_nt = 1 if and only if individual n belongs to strain t, and is zero otherwise, so that

    p(y_g | u_g, X, β_g, Z, τ_g, σ_g) = N(y_g | X β_g + Z u_g τ_g, I σ_g²),

where now u_g is of dimension T × 1 for T strains. We use this more general model going forward, because it also encompasses the case where each individual comes from a unique strain (T = N). As an aside, one could instead add a level to the hierarchy so that the jth individual from strain t gets a random effect w_g^tj which is a noisy version of u_g^t, the hidden variable for strain t; that is, p(w_g^tj) = N(w_g^tj | u_g^t, σ_g^t²). In some circumstances this addition could improve the model, although we did not use it here.

The distribution of u_g is assumed to be a zero-mean Gaussian distribution with covariance matrix K of dimension T × T,

    p(u_g | K) = N(u_g | 0, K).    (1)

Intuitively, this distribution captures similarity between strains/individuals through the covariance matrix, K, which (to the extent that it has off-diagonal elements) couples the observations among individuals (Equation 5). In so doing, the model can correct for the fact that individuals in a study are not perfectly balanced for various factors such as race. Note that data for individuals are independent from one another given u_g; that is, knowing the value of the hidden variables decouples the observations in the model. Of course, in practice u_g is hidden and this decoupling does not occur.

When correcting for population structure, an Identity-By-Descent (IBD), Identity-By-State (IBS), or covariance matrix has been used for K (e.g., [4-7]). In separate experiments, we found them to yield comparable results. Thus we used only the covariance matrix in this study [4,7]. When modeling expression heterogeneity, Kang et al. use the covariance matrix of the gene expression data in their ICE model [8]. The K we use for modeling expression heterogeneity is described in the next section.

The likelihood, L, for the mixed-effects model can be found by integrating out the random effects, u_g, giving

    L_g ≡ p(y_g | X, β_g, τ_g, σ_g, Z, K)    (2)
        = ∫ p(y_g, u_g | X, β_g, τ_g, σ_g, Z, K) du_g    (3)
        = ∫ N(y_g | X β_g + Z u_g τ_g, I σ_g²) p(u_g | K) du_g    (4)
        = N(y_g | X β_g, τ_g² Z K Z' + I σ_g²),    (5)

where Z' denotes the transpose of Z.

Now let us consider the joint model for all gene probes, g = 1, ..., G. We assume that, conditioned on K, Z, X and β_g, τ_g, σ_g for all g, the expression data for each gene are mutually independent. That is, the likelihood of Y = [y_1, y_2, ..., y_G] (dimension N × G) is given by

    p(Y | X, {β_g, τ_g, σ_g}, Z, K) = ∏_g p(y_g | X, β_g, τ_g, σ_g, Z, K) = ∏_g N(y_g | X β_g, τ_g² Z K Z' + I σ_g²),    (6)

where {β_g, τ_g, σ_g} denotes the parameters for all gene probes. This generative model is shown as a graphical model in Figure 1. Note that when K is unknown, the gene probes are coupled together. Later we show that, in the case of expression heterogeneity correction, we treat K as an unknown parameter and learn it from all of the gene expression data jointly. In the meantime, we treat K as known and thus model gene probes independently.

The parameters {β_g, τ_g, σ_g} can be estimated using the principle of REML (Restricted Maximum Likelihood), which is a degrees-of-freedom correction to the principle of ML (Maximum Likelihood) parameter fitting and can be viewed as integrating out the fixed effects when estimating the variance parameters [9]. Alternatively, one can simply use the principle of Maximum Likelihood. We use ML parameter estimation, having found no discernible empirical difference between ML and REML in experiments when only a handful of fixed effects are used, as is the case in this paper. The extensions to linear mixed-effects models that we develop here can easily be adapted to the REML framework. It turns out that the ML parameter estimates for β_g, τ_g, σ_g for a given g are an analytic function of δ_g ≡ σ_g²/τ_g² (see, e.g., [4]). Thus, the likelihood L_g can be written in terms of δ_g only, and can be maximized with respect to δ_g alone using numerical optimization. Once this value for δ_g is identified, the values for the remaining parameters follow.
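To make this parameterization concrete, the following is a minimal numpy sketch (our illustration, not code from the paper) of the profiled negative log-likelihood of Equation 5 as a function of δ_g alone, with β_g and τ_g² eliminated analytically; the function name and interface are ours.

    import numpy as np

    def neg_log_marginal_likelihood(delta, y, X, ZKZt):
        """Profiled -log L_g(delta) for y ~ N(X beta, tau^2 (ZKZ' + delta I)).

        beta and tau^2 are profiled out in closed form (ML), so the objective
        depends on delta = sigma^2 / tau^2 only (cf. Equation 5 and [4])."""
        N = y.shape[0]
        H = ZKZt + delta * np.eye(N)            # covariance up to the overall scale tau^2
        Hinv = np.linalg.inv(H)                 # naive O(N^3); the spectral trick below avoids this
        beta = np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv @ y)  # generalized least squares
        r = y - X @ beta
        tau2 = (r @ Hinv @ r) / N               # ML estimate of tau^2 given delta
        sign, logdetH = np.linalg.slogdet(H)
        return 0.5 * (N * np.log(2 * np.pi * tau2) + logdetH + N)

Minimizing this function over δ_g with a one-dimensional optimizer, and then reading off β_g, τ_g², and σ_g² = δ_g τ_g² at the minimizer, reproduces the ML fit described above.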

Figure 1. Graphical model depiction of a linear mixed-effects model with one random effect, for all gene probes. Letters enclosed in a circle denote random variables. Observed variables are shaded. Indexes 1, ..., N denote N individuals. The square annotated with "gene probe" is a plate, denoting that the sub-model within the plate is repeated for each gene probe. Letters β, σ², τ² within the plate represent gene-probe-specific parameters (by convention we have dropped the g subscript used elsewhere for these parameters because the plate makes this dependence apparent). Parameter K, residing outside the plate, is a covariance matrix describing the similarity among the hidden variables ("confounder coefficients"), u = [u_1, ..., u_N], and is shared across all gene probes. The variable x_n denotes the slice of the design matrix, X = [x_1, ..., x_N], for individual n and includes SNP features, a bias/offset term, and other covariates. y = [y_1, ..., y_N] are the measurements for one gene probe and all individuals.

Additionally, it has been shown that this numerical optimization can be made orders of magnitude faster than the naive approach by way of an algebraic equivalence that changes the time complexity for W iterations from O(W N³) to O(W N + N³) [4,10]. By re-writing the likelihood as a function of the matrix H_g = H_g(δ_g) ≡ Z K Z' + I δ_g, and then using a singular value decomposition (SVD) of K, the computation of |H_g(δ_g)| (where |K| denotes the determinant of K) and of H_g(δ_g)^{-1} becomes only O(N) (conditioned on having computed the SVD of K), rather than the O(N³) required by the naive approach. This reformulation enables more starting points to be used for the optimization over δ_g, thereby lessening the problem of local maxima in the likelihood function during parameter estimation. This approach can be found in the EMMA algorithm [4], which forms a core component of our algorithms.

When using mixed-effects models to look for associations between SNPs and gene expression, one obtains a p-value for each hypothesis in the usual way, that is, by fitting the model parameters under a null model (without SNPs) and under an alternative model (with SNPs), and then using, for example, a likelihood ratio test (LRT) or F-test [4]. In our empirical studies, we have found little difference between the two tests and use the LRT in our results.
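As a concrete illustration of this testing step (a sketch under the usual asymptotic assumptions, not the authors' implementation), the LRT p-value for one SNP-gene-probe hypothesis can be computed from the two maximized log-likelihoods as follows; loglik_null and loglik_alt are assumed to come from fits of the model without and with the SNP in X.

    from scipy.stats import chi2

    def lrt_pvalue(loglik_null, loglik_alt, df=1):
        """Likelihood ratio test: 2*(LL_alt - LL_null) ~ chi^2(df) under the null.

        df=1 corresponds to adding a single SNP coefficient to the fixed effects."""
        stat = 2.0 * (loglik_alt - loglik_null)
        return chi2.sf(max(stat, 0.0), df)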

Pitfall of Pre-Computing K for EH Correction

As we have seen, the parameters {β_g, σ_g, τ_g} are estimated using, say, ML, whereas the matrix K has been computed independently and ahead of time in a pre-processing step, often from the data itself (i.e., SNP or expression data). However, in the case of an EH-correction model such as ICE [8], there is a theoretical and practical concern with such an approach. The concern arises because the mixed-effects model, in the case of EH correction, is a generative model for the expression data, and the parameter K is being estimated from the expression data, but in a way that is not, in general, consistent with the generative model. That is, the ICE estimate for K is not a consistent estimator for the model being used (e.g., an ML or REML estimate), as specified by the joint likelihood of the data given in Equation 6. Readers familiar with graphical models will see from Figure 1 that, roughly speaking, the parameter K should be estimated from the hidden variables u_g (Equations 12-17) rather than from the observed variables y_g, as is done in ICE.

One consistent estimate for K is the ML estimate for the model specified in Equation 6. This estimate is precisely what we use. Note, however, that one must have enough gene probes to support such an estimate, because the number of parameters in K scales quadratically with the number of individuals. In many applications of mixed-effects models, estimating K with no constraints as we do here would not be feasible. Because of the large number of simultaneous mixed-effects sub-models (one for each gene probe g), all sharing K in our application, however, we can take such an approach.

As an aside, it is interesting to note that given a null model for mean-centered data as in [8], where the parameters σ_g and τ_g are the same for all gene probes, the parameters in the covariance of the observed data, K τ_g² + I σ_g², are not identifiable. Consequently, we can set τ_g = 1 and σ_g = 0, or equivalently u_g = y_g, without loss of generality, and the ICE procedure in [8] is in that case a consistent estimator. Such a model, however, is neither flexible nor likely to be useful in practice.

The issue of using an estimator that is not consistent is not just a theoretical one, but can have a practical, negative effect on our ability to uncover eQTL associations in data. The issue can manifest itself by giving rise to seriously deflated p-value distributions, which in turn leads to overly conservative FDR (False Discovery Rate) estimates [11]. Additionally, by modeling K correctly in EH models, one can increase power. Both of these points are demonstrated in the main paper. An experiment we performed that highlights the issue of the inconsistent estimator was to (1) estimate the ICE model parameters on real data (with no SNP data), (2) generate synthetic data from these ICE models with no SNP fixed effects (i.e., strictly null eQTL data), and (3) test a set of randomly selected SNPs with ICE to obtain a p-value distribution that theoretically should be uniform on [0, 1], because only null hypotheses were generated. In our experiments of this nature, we repeatedly found that the resulting p-value distribution was deflated; for example, we typically observed clearly deflated values of λ. This phenomenon is of course data-dependent, but we have broadly observed this trend in experiments based on real data. Interestingly, the ICE algorithm applied to the HapMap data analyzed in the ICE paper [8] shows no deflation.
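The flavor of such a calibration check can be sketched as follows; this is a deliberately simplified, self-contained illustration (null traits drawn from an assumed covariance and tested with a plain OLS-based LRT, purely for brevity), not a reproduction of the ICE experiment, and all names are ours.

    import numpy as np
    from scipy.stats import chi2

    def simulate_null_and_lambda(X, Sigma, n_probes=500, seed=0):
        """Draw strictly null traits y ~ N(X beta, Sigma), test one random SNP per
        trait, and return lambda = median(chi^2 statistic) / median of chi^2_1."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        L = np.linalg.cholesky(Sigma)
        stats = []
        for _ in range(n_probes):
            y = X @ rng.normal(size=X.shape[1]) + L @ rng.normal(size=N)  # no SNP effect
            snp = rng.integers(0, 3, size=N).astype(float)                # random test SNP
            rss0 = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
            X1 = np.column_stack([X, snp])
            rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
            stats.append(N * np.log(rss0 / rss1))                         # 1-dof LRT statistic
        return np.median(stats) / chi2.ppf(0.5, 1)                        # approx. 0.456

A well-calibrated procedure should give λ near 1 on such strictly null data; values well below 1 correspond to the deflation discussed above.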

In the next section, we introduce our LMM-EH-PS model and describe our method for estimating K for EH, including the special case where only correction for EH is required (no PS). Note that the issue we have highlighted is not restricted to mixed-effects models, but in general may apply to any method that tackles the problem of expression heterogeneity using a two-step statistical modeling approach, in which some parameters related to confounding EH structure are first estimated from expression data and then plugged into a generative model for that same expression data. If the first step does not use a consistent estimator for the parameters in the generative model used in the second step, then the estimation of K will be inconsistent. Surrogate Variable Analysis (SVA) suffers from this inconsistency. In particular, the prescribed way to use SVA for our eQTL study is to first compute the SVs from the expression data and then construct a linear regression model, independently for each gene probe, that uses the SVs as covariates and the gene expression data as targets. Because our own statistical approach in this paper is centered on a mixed-effects model, we did not investigate this issue in great detail with SVA. However, we did find that SVA used in this manner on our data performed poorly in terms of calibration of p-values, the ability to detect cis-eQTL more strongly than trans-eQTL on real data, and power to detect eQTL on synthetic data (see Results in the main paper and below).

LMM-EH-PS: Mixed-Effects Model for Joint Correction of EH and PS

In our development of the standard linear mixed-effects model, we assumed a single random effect, u_g, arising from a single distribution with covariance matrix K. Indeed, this is the model that has been used to date for PS correction, and separately for EH correction. However, to model both EH and PS, we must include multiple random effects (or, multiple variance components) in our model. Although we present such a model with two components (for EH and PS), the model and associated algorithms (i.e., parameter estimation) are easily extended to any number of components should the need arise. Our model, LMM-EH-PS, for a single vector of gene-probe expression measurements across all individuals, is fully specified by

    p(y_g | u_g, v_g, X, β_g, τ_PS,g, Z, τ_EH,g, σ_g) = N(y_g | X β_g + Z u_g τ_PS,g + v_g τ_EH,g, I σ_g²)
    p(u_g | K_PS) = N(u_g | 0, K_PS)
    p(v_g | K_EH) = N(v_g | 0, K_EH),

so that the likelihood is given by

    L_g ≡ p(y_g | X, β_g, τ_PS,g, Z, τ_EH,g, σ_g, K_EH, K_PS)    (7)
        = ∫∫ N(y_g | X β_g + Z u_g τ_PS,g + v_g τ_EH,g, I σ_g²) N(u_g | 0, K_PS) N(v_g | 0, K_EH) du_g dv_g    (8)
        = ∫ N(y_g | X β_g + Z u_g τ_PS,g, τ_EH,g² K_EH + I σ_g²) N(u_g | 0, K_PS) du_g    (9)
        = N(y_g | X β_g, τ_EH,g² K_EH + τ_PS,g² Z K_PS Z' + I σ_g²)    (10)
        = N(y_g | X β_g, τ_g² [w_g K_EH + (1 − w_g) Z K_PS Z'] + I σ_g²),    (11)

where K_EH and K_PS encode the EH and PS structure, respectively, and v_g and u_g are the corresponding random-effect vectors. In the last line we re-write the likelihood in a more convenient form, where w_g ≡ τ_EH,g² / (τ_EH,g² + τ_PS,g²) ∈ [0, 1] denotes the relative proportion of K_EH (versus K_PS) to be used in the model for gene probe g, and τ_g² ≡ τ_EH,g² + τ_PS,g² scales K_EH and K_PS appropriately. Equation 10 is obtained by sequential application of the transformation from Equation 4 to Equation 5 for a single component.

We use the form in Equation 11 rather than Equation 10 because it is convenient for parameter estimation. Note that K_PS is of dimension T × T as before, whereas K_EH is of dimension N × N, because each individual has different gene expression measurements even if they belong to the same strain. It is useful to note that Equation 11, which was derived from a mixed-effects model with two random effects, can also be interpreted as a mixed-effects model with a single random effect with covariance K ≡ w_g K_EH + (1 − w_g) Z K_PS Z'. This fact becomes useful later in deriving our algorithm for estimating K_EH. Note too that this equivalence generalizes to any number of random effects (variance components).

Parameter Estimation for Multiple Random Effects

As discussed, we estimate parameters using the ML criterion. First, let us consider the (inconsistent) case where we compute K_EH and K_PS ahead of time, using, for example, the ICE covariance matrix and the SNP covariance matrix, respectively. In the following section, we will then consider the (consistent) case where K_EH and the other parameters are estimated together. Given K_EH and K_PS, we can estimate the parameters β_g, τ_g, w_g and σ_g independently for each gene probe g. We do so by wrapping EMMA inside a numerical optimization algorithm that searches for the best linear combination of K_EH and K_PS. Specifically, we apply Brent's Method [12] to optimize the likelihood as a function of w_g, L_g(w_g). Each time Brent's Method is called, it evaluates L_g(w_g) for a new value of w_g using EMMA with fixed K = w_g K_EH + (1 − w_g) Z K_PS Z'. In our experiments, we use Brent's Method to search for the optimal w_g within six evenly-spaced intervals in [0, 1]. We also compute the values of L_g(0) and L_g(1) to ensure that a model with only a single one of our random effects cannot, by virtue of local maxima, do better than the mixture. Within each call to EMMA, we use 1 starting points for δ_g evenly spaced in log space between -1 and 1.

We note that when there are R > 2 random effects, one must optimize multiple mixing coefficients analogous to w_g, namely w_g ≡ (w_g1, w_g2, ..., w_gR). One option is to use a generalization of Brent's Method. Another is to abandon the use of EMMA as the inner loop of a Brent optimization and instead use a more general-purpose numerical optimizer, such as the L-BFGS quasi-Newton method [13], to simultaneously optimize β_g, w_g (for R random effects), and σ_g. Such an approach requires as input the partial derivatives of L_g(w_g) with respect to each of the parameters, along with L_g(w_g). This approach can also be used when there are only two random effects, but we found that the Brent's Method approach performed roughly as well or better in all cases. Both approaches for estimating parameters in our LMM-EH-PS model (with two random effects) have time complexity O(W N³) for W iterations and N individuals. The L-BFGS approach scales linearly in the number of components, with time complexity O(R W N³) for R variance components, whereas Brent-like alternatives increase more steeply.
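The outer search over w_g can be sketched with scipy's bounded (Brent-style) scalar minimizer as follows; this is our own illustrative reconstruction, not the authors' code, it reuses the neg_log_marginal_likelihood sketch given earlier in place of EMMA, and the grid of six sub-intervals and the log-δ search range are assumptions.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def fit_weight(y, X, K_EH, ZK_PS_Zt, n_intervals=6):
        """Search for w in [0, 1] mixing K = w*K_EH + (1-w)*Z K_PS Z' (Equation 11)."""
        def neg_loglik_given_w(w):
            K = w * K_EH + (1.0 - w) * ZK_PS_Zt
            # Inner 1-D fit over delta; several restarts would be used in practice.
            inner = minimize_scalar(
                lambda log_delta: neg_log_marginal_likelihood(np.exp(log_delta), y, X, K),
                bounds=(-10.0, 10.0), method="bounded")
            return inner.fun

        candidates = []
        edges = np.linspace(0.0, 1.0, n_intervals + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):         # Brent-style search per sub-interval
            res = minimize_scalar(neg_loglik_given_w, bounds=(lo, hi), method="bounded")
            candidates.append((res.fun, res.x))
        for w in (0.0, 1.0):                              # also evaluate the single-kernel endpoints
            candidates.append((neg_loglik_given_w(w), w))
        best_val, best_w = min(candidates)
        return best_w, best_val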

Estimation of Parameters Including K_EH

As discussed earlier, use of the covariance of the expression data for K_EH, as is done in ICE, is not consistent with the mixed-effects model and, in our experience, leads to deflation of p-values and loss of power (see main paper). To avoid these practical problems, we estimate K_EH along with the remaining parameters from our model and the data. Because the estimate of K_EH depends on all gene probes, we must perform this estimation across all gene probes simultaneously (although, conditioned on a fixed value of K_EH, the other parameters can still be learned independently for each gene probe). We do not include SNP fixed effects when learning K_EH, similar to the approaches taken in [8,14]. Thus, in practice, we first learn K_EH without SNPs, then assume that K_EH is fixed and apply our mixed-model approach as outlined earlier. We have found our method for learning K_EH to be robust to this approximation of ignoring the SNPs, as shown later in the section "Robustness of Estimating K_EH in the Presence of True Associations".

The joint parameter estimation can be accomplished using a variety of algorithms. The one we describe here builds on the parameter estimation for {β_g, τ_g, w_g, σ_g} developed in the previous section. In particular, our algorithm uses a coordinate ascent approach and alternates between

1. identifying the ML values of {β_g, τ_g, w_g, σ_g}, conditioned on a fixed value of K_EH, and
2. identifying the ML value of K_EH, conditioned on fixed values for {β_g, τ_g, w_g, σ_g}.

In each step the likelihood will either increase or remain the same. As we have already described step 1 earlier in the exposition, we next consider step 2. Note that step 1 can be done independently for each gene probe, as in the usual mixed-model framework, whereas step 2 requires use of all of the gene expression data simultaneously.

When we want to do joint modeling of EH and PS, as for our LMM-EH-PS model described in Equation 11, we must estimate K_EH in the context of a fixed K_PS; that is, we include K_PS in the model and find the K_EH that gives the best likelihood over all gene probes, when each gene probe is allowed to have its own value for w_g. If not for this latter flexibility, there would be no point in including K_PS during estimation, as the final estimate for K would be the same in either case (i.e., K = (1 − w_g) Z K_PS Z' + w_g K_EH = w K_EH, where w K_EH would be what was learned if K_PS were not included during parameter learning, and (1 − w_g) Z K_PS Z' + w_g K_EH would be what was learned if K_PS were included). When we model only EH, that is, when PS is not present, we say that we are estimating K_EH with no context. Although it would be possible to infer K_EH with no context and then add K_PS after the fact to model the data, we have found that such an approach is inadequate.

Let us now consider the algorithm where K_EH is estimated in the context of K_PS. The no-context algorithm is a special case. We use the Expectation-Maximization (EM) algorithm [15] to fit the parameter K_EH in step 2, which proceeds roughly as follows. One starts with an initialization of the parameter K_EH (we initialize with the ICE estimate in our experiments). Then one computes, for each gene probe g, the distribution of v_g given the data y_g and the parameters β_g, τ_g, σ_g, w_g, K_EH, K_PS (E-step). Then, given these distributions, one re-estimates the parameter of interest, K_EH (M-step). The algorithm iterates between these two steps, in our case stopping after a fixed number of iterations. The E-step is given by

    p(v_g | y_g) ∝ p(y_g | v_g) p(v_g) = N(v_g | c_g, C_g), where    (12)
    C_g = (K_EH^{-1} + A_g^{-1})^{-1}, and c_g = C_g A_g^{-1} a_g, and    (13)
    A_g = (I σ_g² + (1 − w_g) τ_g² Z K_PS Z') / (τ_g² w_g), and a_g = (y_g − X β_g) / (τ_g √w_g),    (14)

where we have used the Gaussian identities from, respectively, [16] and [17], which state

    p(x | a, A) p(x | b, B) = p(x | c, C) p(a | b, A + B), where C = (A^{-1} + B^{-1})^{-1} and c = C(A^{-1} a + B^{-1} b),    (15)
    N(A x | m, M) ∝ N(x | A^{-1} m, A^{-1} M A^{-T}).    (16)

The M-step is given by

    K̂_EH(i, j) = (1/G) Σ_g [C_g(i, j) + c_g(i) c_g(j)].    (17)

A single E- and M-step has time complexity O(G N³) for G gene probes and N individuals. In practice, we have found that the following schedule of alternating optimizations yields good estimates after 1-2 repetitions:

i. Determine the ML values of {β_g, τ_g, w_g, σ_g}, conditioned on a fixed value of K_EH.
ii. Use 3 iterations of EM to improve the value of the likelihood with respect to K_EH, conditioned on fixed values of {β_g, τ_g, w_g, σ_g}.
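A direct, naive-O(GN³) numpy transcription of one such EM pass for K_EH, following Equations 12-17, might look as follows; it assumes the per-probe parameters from step i have already been fit, and the variable names and interface are ours.

    import numpy as np

    def em_update_K_EH(Y, X, betas, tau2, w, sigma2, K_EH, ZK_PS_Zt):
        """One E+M pass for K_EH (Equations 12-17), in the context of a fixed K_PS.

        Y: N x G expression matrix; betas: Q x G; tau2, w, sigma2: length-G arrays."""
        N, G = Y.shape
        K_EH_inv = np.linalg.inv(K_EH)
        K_new = np.zeros((N, N))
        for g in range(G):
            # E-step: posterior N(v_g | c_g, C_g) of the EH random effect (Equations 12-14).
            A_g = (sigma2[g] * np.eye(N) + (1.0 - w[g]) * tau2[g] * ZK_PS_Zt) / (tau2[g] * w[g])
            a_g = (Y[:, g] - X @ betas[:, g]) / np.sqrt(tau2[g] * w[g])
            A_g_inv = np.linalg.inv(A_g)
            C_g = np.linalg.inv(K_EH_inv + A_g_inv)
            c_g = C_g @ (A_g_inv @ a_g)
            # M-step accumulation: E[v_g v_g'] = C_g + c_g c_g' (Equation 17).
            K_new += C_g + np.outer(c_g, c_g)
        return K_new / G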

If we want to estimate K_EH in a model that does not include K_PS, we simply remove the relevant terms from the equations above. In particular, we need only change Equation 14 to

    A_g = I σ_g² / τ_g², and a_g = (y_g − X β_g) / τ_g,

and in this case, because A_g is now diagonal, it is possible to use the same SVD trick as in [4,10] to factorize K_EH, so that the naive computation of C_g = (K_EH^{-1} + A_g^{-1})^{-1} across all gene probes can be changed from O(G N³) to O(N³). After K_EH is learned in this manner (without SNP data), we then treat K_EH as fixed and use the mixed-model parameter learning presented earlier, with LRT tests, to obtain a p-value for each single-SNP gene-probe hypothesis.

Time Complexity and Running Time

The total time complexity for the K_EH estimation procedure in a PS context is O(I W G N³), for W iterations to estimate {β_g, τ_g, w_g, σ_g}, G gene probes, N individuals, and I iterations of steps i and ii outlined above. With no PS context, the K_EH estimation procedure becomes O(I G W N + N³). For our mouse data set, where we estimated K_EH with a PS context, with N = 188, I = 2, and G gene probes, estimation of K_EH took approximately 1 hours on a 11 processor cluster. For our human data set, where we estimated K_EH with no context, with N = 378 and I = 2, estimation of K_EH took approximately 5 hours on a 11 processor cluster. Because all parts of parameter estimation are trivially parallelizable (step 1 above can be done independently for each gene probe, and the E-step in step 2 can be broken down into independent parts for each gene probe), running the algorithm in parallel is straightforward. Instead of specifying W directly, we used a tolerance criterion of 1e-5 on the log likelihood in the Brent searches, which in turn dictated W.

Note that there are approaches in related areas (e.g., Gaussian Process Regression [16]) to speed up the O(N³) computations. We have also considered the idea of using an approximation to the full covariance matrix K obtained by forcing zeros in the corresponding precision matrix, K^{-1} (which in turn corresponds to the use of a non-fully-connected graphical model), for which computations can be less than cubic in N. These approaches may be fruitful avenues for further investigation.

Supplemental Experiments

Power Experiments

Figure 2 shows the results of power experiments on the synthetic mouse data from the main paper, which contained 5% associations at strengths found in the real data as well as strengths multiplied by a factor of 3 and 5. Only factor 3 was shown in the main paper (Figure 4 in the main paper).

[Figure 2 comprises ROC panels (TPR versus FPR) at association strengths 1, 3, and 5 comparing LMM-EH-PS, ICE-PS, ICE, LMM-EH, LINREG, LMM-PS, SVA, and SVA-PS, plus a panel of number of hits versus estimated FDR at strength 3 for LMM-EH-PS and ICE-PS.]

Figure 2. Power curves for synthetic mouse data. Clockwise, the first three plots show ROC curves for different strengths of association. The red line denotes what random guessing would have achieved. For the lowest strength of associations (first plot), all models recovered only a few associations. For stronger strengths of association, our model, LMM-EH-PS, achieved maximum power. The final plot illustrates how the best ICE-based model (ICE-PS), which yielded deflated p-values, penalized itself because of its overly conservative estimated FDR.
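For reference, ROC curves like those in Figure 2 can be traced from per-hypothesis p-values and the known synthetic labels roughly as follows; this is our own sketch, and the names are illustrative.

    import numpy as np

    def roc_curve_from_pvalues(pvalues, is_true_association):
        """Sweep a p-value threshold and return (FPR, TPR) points for an ROC curve."""
        p = np.asarray(pvalues)
        truth = np.asarray(is_true_association, dtype=bool)
        order = np.argsort(p)                  # most significant hypotheses first
        hits = truth[order]
        tpr = np.cumsum(hits) / max(truth.sum(), 1)
        fpr = np.cumsum(~hits) / max((~truth).sum(), 1)
        return fpr, tpr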

Genomic Control

Given the substantial deflation observed with ICE-based models, one might consider the application of Genomic Control (GC) [18] to alleviate deflation in the ICE results. However, it is worth noting that Genomic Control was developed theoretically to mitigate inflation, not deflation, of p-values, and in fact prescribes that no adjustment be made when λ < 1. Nevertheless, we applied GC (making the correction even when λ < 1) to the p-values obtained by all models presented in the main paper, with results shown in Figures 3 and 4. One can see that, on the mouse data, where a large amount of deflation was observed for ICE-based models, a Genomic Control correction did not fix the problem. For minor deflation, such as in the human data set, GC appears to have made the distribution more reasonable. However, note that because GC only re-scales all test statistic values by a single factor, GC cannot change the rank order of hypotheses and thus has no effect on two of the three power experiments (i.e., the ROC curve and the non-parametric permutation test), nor on the cis-enrichment test.

[Figure 3 comprises p-value histograms, each with λ = 1.0 after correction, for LINREG, SVA, ICE, and LMM-EH.]

Figure 3. Application of Genomic Control to histograms of p-values from the real (left) and synthetic (right) human data. The corresponding histograms without GC are shown in Figure 1 of the main paper.

[Figure 4 comprises p-value histograms, each with λ = 1.0 after correction, for SVA-PS, SVA, LMM-PS, LMM-EH, LINREG, ICE, ICE-PS, and LMM-EH-PS.]

Figure 4. Application of Genomic Control to histograms of p-values from the real (left) and synthetic (right) mouse data. The corresponding histograms without GC are shown in Figure 3 of the main paper. Although λ = 1 now, distributions which were non-uniform before correction remain non-uniform.
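The GC rescaling applied here (including when λ < 1, as noted above) amounts to the following; a short sketch using 1-d.o.f. χ² statistics derived from the p-values, with illustrative names.

    import numpy as np
    from scipy.stats import chi2

    def genomic_control(pvalues):
        """Convert p-values to 1-dof chi^2 statistics, rescale by lambda, convert back.

        Unlike the usual prescription, the rescaling is applied even when lambda < 1,
        matching the experiment described in the text."""
        p = np.clip(np.asarray(pvalues), 1e-300, 1.0)
        stats = chi2.isf(p, 1)                     # chi^2 statistic per hypothesis
        lam = np.median(stats) / chi2.ppf(0.5, 1)  # inflation/deflation factor
        return chi2.sf(stats / lam, 1), lam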

Robustness of Estimating K_EH in the Presence of True Associations

A concern one might have in estimating K_EH as we have done is whether the approximation of ignoring SNP fixed effects during estimation is a good one in practice. To address this concern, we plotted, as shown in Figure 5, only the true nulls from the synthetic mouse power experiments. In these experiments, K_EH was in each case estimated from all data, including the 5% of gene-probe-SNP pairs that were not null (a reasonable upper bound for a real data set). Even as we increased the strength multiplier of the SNP fixed effects up to five (producing extremely strong associations, as can be seen in the ROC curves in Figure 2), the estimated K_EH continued to yield calibrated p-values.

Experiment with Data Generated from SVA-PS

To further assess the robustness of our method, we also generated null data from a linear mixed model with a PS correction and SVA covariates (SVA-PS) fitted to the real data. The SVA algorithm found 29 SVs on the real data (so we generated using 29), and then found 24 on the synthetically generated data. (In both cases we set SVA to use 1 permutations over a range of eigengene significance from .1 to .5; it always achieved the same number of SVs across this range.) Figure 6 shows the resulting p-value histograms. As expected, linear regression (LINREG) and a linear mixed model with just a PS correction (LMM-PS) showed inflation. Interestingly, SVA-PS, the model from which we generated the data, also showed inflation (λ = 2.5), which may result from overfitting. In particular, although SVA found 24 statistically significant SVs when looking at all of the expression data jointly, when we then used these 24 covariates in each gene probe's model, there were only 188 data points per gene probe with which to fit these 24 parameters. This hypothesis of overfitting is reinforced by the fact that, when we performed a similar experiment using linear regression with SV covariates, we observed similar inflation, but when we then tripled the sample size, the inflation disappeared. Lastly, we note that our model was able to successfully capture the confounding structure generated by SVA-PS.

[Figure 5 comprises four QQ plots, one per strength multiplier (0, 1, 3, 5), each labeled with its λ (all close to 1).]

Figure 5. QQ plots of (natural) log p-values for synthetic mouse null data where K_EH was estimated in the presence of increasingly strong associations. The vertical axis displays the observed quantiles, while the horizontal axis displays the expected quantiles. Each figure is based on synthetic data of which 3% are true associations, increasing in strength multiplier with each figure. The plots show the p-values only for the null tests, to demonstrate that even in the presence of strong associations, our model was able to infer a good value for K_EH such that the p-values remained calibrated. Error bars on QQ plots of log p-values, shown in dashed red, were computed as follows. We assumed that there are M independent p-values, drawn from a uniform distribution on [0, 1]. The kth smallest p-value from this draw is the kth order statistic, which is known to have a Beta(k, M − k + 1) distribution. The mean of this distribution is k/(M + 1). Thus, at (log) k/(M + 1) on the x-axis, we plot as error bars the (log) 90% confidence error envelope of this Beta distribution.

A two-sided, one-sample KS test against uniformity indicated that the p-value distribution of our model was not significantly different from the theoretical null distribution (p = .24), but that the other three models' distributions were significantly different at a significance level of α = 1e-3. This result reinforces the view that our model is robust to various types of confounding structure, and also that SVA easily induces spurious associations, even on data conforming to its own assumptions.
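The QQ-plot error envelope described in the caption of Figure 5 can be computed as in the following sketch of the stated Beta order-statistic construction; the 90% band and M independent p-values are as in the caption, and the function name is ours.

    import numpy as np
    from scipy.stats import beta

    def qq_envelope(M, level=0.90):
        """Expected quantiles and a confidence envelope for M uniform p-values.

        The kth smallest of M uniforms is Beta(k, M - k + 1); we return its mean
        k/(M+1) together with the central `level` interval, on the natural log scale."""
        k = np.arange(1, M + 1)
        expected = k / (M + 1.0)
        lower = beta.ppf((1.0 - level) / 2.0, k, M - k + 1)
        upper = beta.ppf(1.0 - (1.0 - level) / 2.0, k, M - k + 1)
        return np.log(expected), np.log(lower), np.log(upper)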

[Figure 6 comprises p-value histograms for SVA-PS (λ = 2.53), LMM-PS (λ = 1.4), LINREG (λ = 5.4), and LMM-EH-PS (λ = .94).]

Figure 6. P-value histograms on null synthetic mouse data generated from SVA-PS. All models, except for ours (LMM-EH-PS), show inflation. Even the model from which the data were generated, SVA-PS, shows inflation.

Modification of SVA

Because SVA contains a number of steps beyond using just projections into eigengene space, we also extended our evaluation from the main paper to include a model that omits these steps and instead directly projects the expression data of each individual into eigengene space and then uses these projections as covariates, which we refer to as Eigenstrat Extension covariates, in different models (using the same number of eigengenes as deemed significant by SVA). We use EE to denote a linear regression model with Eigenstrat Extension covariates. We use EE-PS to denote a mixed model with PS correction and these covariates. As in the main paper, we continue to use SVA and SVA-PS to denote, respectively, linear regression with SVA covariates, and a mixed model with PS correction and SVA covariates.

Figure 7 shows p-value histograms on synthetic mouse data containing no associations (the same data as used in the main paper for Figures 3 and 4). We see that EE has less inflation than SVA, and that EE-PS has less inflation than SVA-PS, suggesting that the EE-based models perform better than the SVA-based models. Furthermore, Figure 8 shows that EE-based models have better power than their corresponding SVA-based models. Also, note that all four models perform worse than linear regression. In contrast, we see that our model and others outperform linear regression (see main paper).
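One natural reading of the Eigenstrat Extension construction described above is sketched below: project each individual's (centered) expression profile onto the top eigengenes via an SVD and use the resulting scores as fixed-effect covariates. This is our illustration, not the authors' code, and the number of eigengenes is assumed to come from SVA, as in the text.

    import numpy as np

    def eigengene_covariates(Y, n_eigengenes):
        """Per-individual scores on the top eigengenes of an N x G expression matrix Y.

        The returned N x n_eigengenes matrix would be appended to the fixed effects."""
        Yc = Y - Y.mean(axis=0, keepdims=True)           # center each gene probe
        U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
        return U[:, :n_eigengenes] * S[:n_eigengenes]    # eigengene-space projections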

[Figure 7 comprises p-value histograms for SVA (λ = 3.8), EE (λ = 3.3), SVA (λ = 3.58), EE (λ = 3.52), SVA-PS (λ = 2.77), EE-PS (λ = 2.38), SVA-PS (λ = 3.13), and EE-PS (λ = 2.25).]

Figure 7. P-value histograms on mouse data (as described in the main paper). Comparison of SVA-based algorithms and EE-based algorithms for the real data (left) and the synthetic null data (right). EE and EE-PS show less inflation respectively than SVA and SVA-PS.

[Figure 8 shows ROC curves (TPR versus FPR) at strength 3 for LINREG, EE, SVA, EE-PS, and SVA-PS.]

Figure 8. Power curves for synthetic mouse data. Comparison of EE-based algorithms to SVA-based algorithms and linear regression (all but EE and EE-PS are shown in Figure 4 of the main paper, which also shows our model outperforming linear regression). Linear regression performed better than EE and SVA based models. EE and SVA performed better when not combined with a PS correction. EE-based models performed better than SVA-based models.

[Panels: LMM-EH, ICE, LINREG, and SVA.]

Figure 9. Human 2D gene-probe-SNP map. Data shown are from chromosome 1, as described in the main paper (real, not synthetic). The horizontal axis denotes enumerated SNPs and the vertical axis denotes enumerated gene probes (both displayed according to relative ordering along the chromosome, not absolute position). The darker a pixel, the more significant the hypotheses in that region. A cis diagonal band is visible in all but linear regression. We know from the cis-enrichment experiments in the main paper that the models are detecting an appreciable number of cis associations. Trans-bands, presumably mostly spurious, are visually apparent in linear regression, and more subtly in SVA, but not in the other models. Because the density of hypotheses is much higher than screen or paper resolution, naive plotting of these data is sub-optimal. Thus, to improve visual clarity, we used an averaging filter of window size 20 × 20 (i.e., each p-value becomes the average of the p-values for all hypotheses within a 20 × 20 gene-probe-SNP window in the original matrix of p-values). Then these averaged values are sub-sampled, taking every twentieth averaged value in either direction (SNP and gene probe). Finally, we took the log of this matrix and displayed it using Matlab's image mapping functionality (imagesc with default settings).
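The image construction described in the caption of Figure 9 (block averaging of the p-value matrix, sub-sampling, log transform, and image display) can be sketched as follows; this is our own approximation of the procedure, with the window size as a parameter.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_eqtl_map(pvalue_matrix, window=20):
        """Block-average a (gene probe x SNP) matrix of p-values and display its log.

        Averaging within window x window blocks and keeping one value per block
        roughly combines the smoothing and sub-sampling steps in the caption."""
        P = np.asarray(pvalue_matrix, dtype=float)
        n_rows = (P.shape[0] // window) * window
        n_cols = (P.shape[1] // window) * window
        blocks = P[:n_rows, :n_cols].reshape(n_rows // window, window, n_cols // window, window)
        averaged = blocks.mean(axis=(1, 3))
        plt.imshow(np.log(averaged), aspect="auto", cmap="gray")  # analogous to Matlab's imagesc
        plt.xlabel("SNPs (ordered along chromosome)")
        plt.ylabel("gene probes (ordered along chromosome)")
        plt.colorbar()
        plt.show()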

[Panels: LMM-EH-PS, ICE-PS, LMM-EH, ICE, LMM-PS, LINREG, SVA-PS, and SVA.]

Figure 10. Mouse 2D gene-probe-SNP map. Data shown are from chromosome 1, as described in the main paper (real, not synthetic). The horizontal axis denotes SNPs and the vertical axis denotes gene probes, as in Figure 9. A cis diagonal band is subtly visible for LMM-EH-PS, ICE-PS, LMM-EH, and ICE, but less so or not at all for the others. We know from the cis-enrichment experiments in the main paper that the models are detecting an appreciable number of cis associations. Trans-bands, presumably mostly spurious, are visually apparent in linear regression and LMM-PS, and more subtly in SVA and SVA-PS, but not in the other models. See the caption of Figure 9 for a note about the construction of these images, although here we used an averaging window size of 15 and a subsample rate of 1.

References

1. Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, 2006).
2. Heckerman, D. A tutorial on learning with Bayesian networks. In Learning in Graphical Models (Kluwer, 1998).
3. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, San Mateo, CA, 1988).
4. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics (2008).
5. Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38 (2006).
6. Zhao, K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet. 3, e4 (2007).
7. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38 (2006).
8. Kang, H. M., Ye, C. & Eskin, E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics 180 (2008).
9. Harville, D. Bayesian inference for variance components using only error contrasts. Biometrika (1974).
10. Patterson, H. & Thompson, R. Recovery of inter-block information when block sizes are unequal. Biometrika 58 (1971).
11. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. 100 (2003).
12. Brent, R. P. Algorithms for Minimization without Derivatives (Prentice-Hall, Englewood Cliffs, New Jersey, 1973).

13. Nocedal, J. & Wright, S. J. Numerical Optimization (Springer, 2006).
14. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161 (2007).
15. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38 (1977).
16. Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (The MIT Press, 2006).
17. Petersen, K. B. & Pedersen, M. S. The Matrix Cookbook. Tech. Rep. (2006).
18. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55 (1999).


Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Using Expectation-Maximization for Reinforcement Learning

Using Expectation-Maximization for Reinforcement Learning NOTE Communicated by Andrew Barto and Michael Jordan Using Expectation-Maximization for Reinforcement Learning Peter Dayan Department of Brain and Cognitive Sciences, Center for Biological and Computational

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Self Adaptive Particle Filter

Self Adaptive Particle Filter Self Adaptive Particle Filter Alvaro Soto Pontificia Universidad Catolica de Chile Department of Computer Science Vicuna Mackenna 4860 (143), Santiago 22, Chile asoto@ing.puc.cl Abstract The particle filter

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Uncertainty quantification and visualization for functional random variables

Uncertainty quantification and visualization for functional random variables Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Alternative implementations of Monte Carlo EM algorithms for likelihood inferences

Alternative implementations of Monte Carlo EM algorithms for likelihood inferences Genet. Sel. Evol. 33 001) 443 45 443 INRA, EDP Sciences, 001 Alternative implementations of Monte Carlo EM algorithms for likelihood inferences Louis Alberto GARCÍA-CORTÉS a, Daniel SORENSEN b, Note a

More information

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION JOEL A. TROPP Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional

More information

Estimation of linear non-gaussian acyclic models for latent factors

Estimation of linear non-gaussian acyclic models for latent factors Estimation of linear non-gaussian acyclic models for latent factors Shohei Shimizu a Patrik O. Hoyer b Aapo Hyvärinen b,c a The Institute of Scientific and Industrial Research, Osaka University Mihogaoka

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence

More information

Fast and Accurate Causal Inference from Time Series Data

Fast and Accurate Causal Inference from Time Series Data Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

DS-GA 1002 Lecture notes 12 Fall Linear regression

DS-GA 1002 Lecture notes 12 Fall Linear regression DS-GA Lecture notes 1 Fall 16 1 Linear models Linear regression In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable,

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture

More information