A Penalized Regression Model for the Joint Estimation of eQTL Associations and Gene Network Structure


Micol Marchetti-Bowick
Machine Learning Department, Carnegie Mellon University

Abstract

Background: A critical task in the study of biological systems is understanding how gene expression is regulated within the cell. This problem is typically divided into multiple separate tasks, including performing eQTL mapping to identify SNP-gene relationships and estimating gene network structure to identify gene-gene relationships.

Aim: In this work, we pursue a holistic approach to discovering the patterns of gene regulation in the cell. We present a new method for jointly performing eQTL mapping and gene network estimation while encouraging a transfer of information between the two tasks.

Data: We evaluate our approach on both synthetic data and on a real yeast eQTL dataset that contains 1,157 SNP genotypes and 1,409 gene expression measurements for 114 yeast samples.

Methods: To construct a unified model for jointly performing eQTL mapping and gene network inference, we formulate the problem as a multiple-output regression task in which we aim to learn the regression coefficients while simultaneously estimating the conditional independence relationships among the set of response variables. The approach we develop uses structured sparsity penalties to encourage the sharing of information between the regression coefficients and the output network in a mutually beneficial way. Our model, the inverse-covariance-fused lasso, is formulated as a biconvex optimization problem that we solve via alternating minimization. We derive new, efficient optimization routines, based on existing state-of-the-art methods, to solve each convex sub-problem.

Results: We demonstrate the value of our approach by applying our method to both simulated data and a real yeast eQTL dataset. Experimental results demonstrate that our approach outperforms a large number of existing methods on the recovery of the true sparse structure of both the eQTL associations and the gene network.

Conclusions: We show that the inverse-covariance-fused lasso can be used to perform joint eQTL mapping and gene network estimation on a yeast dataset, yielding more biologically coherent results than previous work. Furthermore, the same problem setting appears in many different applications, and therefore our model can be deployed in a wide range of domains.

1 Introduction

A critical task in the study of biological systems is understanding how gene expression is regulated within the cell. Although this problem has been studied extensively over the past few decades, it has recently gained momentum due to the rapid advancement in techniques for high-throughput data acquisition. Within this broad task, two sub-problems of particular interest are (a) understanding how various genetic loci regulate gene expression, a problem known as eQTL mapping (Rockman and Kruglyak, 2006), and (b) determining which genes have a direct influence on the expression of other genes, a problem known as gene network estimation (Gardner and Faith, 2005). Prior work on learning regulatory associations has largely treated eQTL mapping and gene network estimation as completely separate problems.

In this work, we pursue a holistic approach to discovering the patterns of gene regulation in the cell by integrating eQTL mapping and gene network estimation into a single model and performing joint inference. Specifically, given a dataset that contains both genotype information for a set of single nucleotide polymorphisms (SNPs) and mRNA expression measurements for a set of genes, we aim to simultaneously learn the SNP-gene and gene-gene relationships. The key element of our approach is that we enable the transfer of knowledge between these two tasks in order to yield more accurate solutions to both problems. In particular, we assume that two genes that are tightly linked in a regulatory network are likely to be associated with similar sets of SNPs in an eQTL map, and vice versa. This allows us to use information about gene-gene relationships to learn more accurate eQTL associations, and similarly to use information about SNP-gene relationships to learn a more accurate gene network.

We construct a unified model for this problem by formulating it as a multiple-output regression task in which we jointly estimate the regression coefficients and the inverse covariance structure among the response variables. Specifically, given SNPs x = (x_1, ..., x_p) and genes y = (y_1, ..., y_q), our goal is to regress y on x and simultaneously estimate the inverse covariance of y. Under this model, the matrix of regression coefficients encodes the SNP-gene relationships in the eQTL map, whereas the inverse covariance matrix captures the gene-gene relationships in the gene network. In order to ensure that information is transferred between the two components of the model, we incorporate a regularization penalty that explicitly encourages pairs of genes that have a high weight in the inverse covariance matrix to also have similar regression coefficient values. This structured penalty enables the two estimates to learn from one another as well as from the data.

A large number of techniques have been developed to address eQTL mapping and gene network estimation in isolation. The traditional approach to eQTL mapping is to examine each SNP-gene pair independently and perform a univariate statistical test to determine whether an association exists between the two. Recently, a series of more complex multivariate models have been utilized for this task in order to jointly capture the effects of several SNPs on one gene (Michaelson et al., 2010). Going one step further, a few approaches have been developed that jointly model multiple SNPs and multiple genes (Kim and Xing, 2009; Kim et al., 2012).
Conversely, the simplest method for gene network inference is to construct a graph in which each edge is weighted by the marginal correlation between the corresponding genes. Another related method that has recently gained popularity, known equivalently as Gaussian graphical model estimation or inverse covariance estimation, produces a network in which the edges are instead weighted by the conditional correlations (Dempster, 1972). Other approaches include using Bayesian networks or ordinary differential equations to model the relationships between genes (Marbach et al., 2010).

In this work, we will use a multiple-output (also called multi-task) regression model to perform eQTL mapping and incorporate a Gaussian graphical model over the genes in order to simultaneously infer the gene network. However, it is well known that model estimation is challenging in the

high-dimensional setting in which we have more variables in the model than samples from which to learn. This is precisely the setting frequently encountered in genomics, where we see datasets with at most a few hundred samples but as many as a million SNPs and thousands of genes. As a result, we will incorporate several regularization penalties in order to facilitate model estimation.

The challenge of high-dimensional data has already inspired a remarkable number of useful methods for penalized multi-task regression. Parameter estimation can be improved by leveraging sparse regularizers to encourage the outputs to share an underlying representation. For example, mixed-norm penalties such as the ℓ1,2 or ℓ1,∞ norm can be used to encourage all outputs to have nonzero coefficients for the same set of inputs (Obozinski et al., 2008). In other cases, structured sparsity can be used to encourage similarities only among responses that are known to be related. Examples of such approaches include the group lasso (Yuan and Lin, 2006), which encourages groups of related outputs to have nonzero coefficients for the same subset of inputs, and the graph-guided fused lasso (Kim and Xing, 2009), which encourages pairs of outputs that are connected in a graph to have similar coefficient values. These penalties leverage prior knowledge of the relationships among the outputs (e.g., group or graph structures) to enforce similarity constraints.

The principal limitation of using such structured sparsity in multi-task regression is that it typically requires extensive prior knowledge of the relationships among the outputs, which is not always readily available. To circumvent this, another class of models has been developed that jointly learns the regression coefficients along with the output dependency structure (Rothman et al., 2010; Zhang and Yeung, 2010; Lee and Liu, 2012; Sohn and Kim, 2012; Rai et al., 2012). However, none of these approaches use a regularization penalty to explicitly encourage shared structure between the estimate of the regression parameters and the output network. Furthermore, the majority of these methods do not learn the covariance structure of the outputs y but rather the conditional covariance of the outputs given the inputs, y | x. This can be interpreted as the covariance of the noise matrix, and does not capture the true relationships between outputs.

In high-dimensional settings, inverse covariance estimation models can also benefit from penalties that induce sparsity or encourage a particular structure. For example, Tan et al. (2014) use a row-column overlap norm penalty to encourage a network structure with hubs. When the goal is to learn a series of related networks for different conditions or time points, regularization can be used to encourage similarities between the networks (Mohan et al., 2014; Peterson et al., 2015).

Here we present a novel approach that jointly performs eQTL mapping and gene network inference while simultaneously leveraging structured sparsity.
In particular, our work has three significant contributions: (1) we construct a novel multiple-output regression model that jointly estimates the regression coefficients and the output network while encouraging shared structure; (2) we develop an efficient alternating minimization procedure for our model, which has a biconvex objective, and derive nontrivial extensions to two state-of-the-art optimization techniques to solve each sub-problem; (3) we perform synthetic and real data experiments to demonstrate that our approach leads to more accurate estimation of the sparsity pattern of the regression coefficients, more accurate estimation of the sparse network structure, and a lower prediction error than baseline methods. To the best of our knowledge, there are no existing approaches that jointly estimate regression coefficients and output relationships while also incorporating regularization penalties to encourage the explicit sharing of information between the two estimates.

The remainder of this article is organized as follows. We begin by providing some background in Section 2, and then present our model and optimization technique in Sections 3 and 4. Next we demonstrate the value of our approach in Sections 5 and 6 by applying our model to both synthetic and real data, and comparing the results to several baselines. Finally, we conclude in Section 7 by discussing the implications of our findings.

2 Background

Before introducing our approach, we provide some background on the problems of penalized multiple-output regression and sparse inverse covariance estimation, which will form the building blocks of our unified model. In what follows, we assume X is an n-by-p dimensional matrix of SNP genotypes, which we also call inputs, and Y is an n-by-q dimensional matrix of gene expression values, which we also call outputs. Here n is the number of samples, p is the number of SNPs, and q is the number of genes. The element x_ij ∈ {0, 1, 2} represents the genotype value of sample i at SNP j, encoded as 0 for two copies of the major allele, 1 for one copy of the minor allele, and 2 for two copies of the minor allele. Similarly, y_ik ∈ R represents the expression value of sample i in gene k. We assume that the expression values for each gene are mean-centered.

2.1 Multiple-Output Lasso

Given input matrix X and output matrix Y, the standard ℓ1-penalized multiple-output regression problem, also known as the multi-task lasso (Tibshirani, 1996), is given by

    min_B (1/n) ||Y - XB||_F^2 + λ ||B||_1    (1)

where B is a p-by-q dimensional matrix and β_jk is the regression coefficient that maps SNP x_j to gene y_k. Here ||B||_1 is an ℓ1-norm penalty that induces sparsity among the estimated coefficients, and λ is a regularization parameter that controls the degree of sparsity. The objective function given above is derived from the penalized negative log likelihood of a multivariate Gaussian distribution, assuming y | x ~ N(x^T B, ε^2 I), where we let ε^2 = 1 for simplicity. Although this problem is formulated in a multiple-output framework, the ℓ1-norm penalty merely encourages sparsity, and does not enforce any shared structure between the regression coefficients of different outputs. As a result, the objective function given in (1) decomposes into q independent regression problems.

2.2 Graph-Guided Fused Lasso

Given a weighted graph W ∈ R^{q×q} that encodes a set of pairwise relationships among the outputs, we can modify the regression problem by imposing an additional fusion penalty that encourages genes y_k and y_m to have similar parameter vectors β_k and β_m when the weight of the edge connecting them is large. This problem is known as the graph-guided fused lasso (Kim et al., 2009; Kim and Xing, 2009; Chen et al., 2010) and is given by

    min_B (1/n) ||Y - XB||_F^2 + λ ||B||_1 + γ Σ_{(k,m)} |w_km| ||β_k - sgn(w_km) β_m||_1    (2)

Here the ℓ1-norm penalty again encourages sparsity in the estimated coefficient matrix. In contrast, the second penalty term, known as a graph-guided fusion penalty, encourages similarity among the regression parameters for all pairs of outputs. The weight of each term in the fusion penalty is dictated by w_km, which encodes the strength of the relationship between y_k and y_m. Furthermore, the sign of w_km determines whether to encourage a positive or negative relationship between parameters; if w_km > 0 (i.e., genes y_k and y_m are positively correlated), then we encourage β_k to be equal to β_m, but if w_km < 0 (i.e., genes y_k and y_m are negatively correlated), we encourage β_k to be equal to -β_m. If w_km = 0, then genes y_k and y_m are unrelated, and so we don't fuse their respective regression coefficients.
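As a concrete illustration of the two objectives above, the following sketch (not code from the paper; the variable names X, Y, B, W and the helper functions are assumptions made for illustration) evaluates the multi-task lasso objective in (1) and the graph-guided fused lasso objective in (2) for a given coefficient matrix.

```python
import numpy as np

def multitask_lasso_objective(X, Y, B, lam):
    """Eq. (1): squared Frobenius loss plus an l1 penalty on B."""
    n = X.shape[0]
    squared_loss = np.sum((Y - X @ B) ** 2) / n        # (1/n) ||Y - XB||_F^2
    return squared_loss + lam * np.sum(np.abs(B))      # + lambda ||B||_1

def graph_guided_fused_lasso_objective(X, Y, B, W, lam, gamma):
    """Eq. (2): adds a fusion penalty over output pairs weighted by the graph W."""
    obj = multitask_lasso_objective(X, Y, B, lam)
    q = B.shape[1]
    for k in range(q):
        for m in range(k + 1, q):
            w = W[k, m]
            if w != 0:
                obj += gamma * abs(w) * np.sum(np.abs(B[:, k] - np.sign(w) * B[:, m]))
    return obj
```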

2.3 Sparse Inverse Covariance Estimation

In the graph-guided fused lasso model defined in (2), the graph W must be known ahead of time. However, it is also possible to learn a network over the set of genes. One way to do this is to estimate their pairwise conditional independence relationships. If we assume y ~ N(μ, Σ), where we let μ = 0 for simplicity, then these conditional independencies are encoded in the inverse covariance matrix, or precision matrix, defined as Θ = Σ^{-1}. We can obtain a sparse estimate of the precision matrix using the graphical lasso (Friedman et al., 2008), given by

    min_Θ (1/n) tr(Y^T Y Θ) - log det(Θ) + λ ||Θ||_1    (3)

This objective is again derived from the penalized negative log likelihood of a Gaussian distribution, where this time the ℓ1 penalty term encourages sparsity among the entries of the precision matrix.

3 The Inverse-Covariance-Fused Lasso

In this section, we introduce a new approach for jointly estimating the coefficients in a multiple-output regression problem and the edges of a network over the regression outputs. We apply this technique to the problem of simultaneously learning an eQTL map and a gene regulatory network from genome (SNP) data and transcriptome (gene expression) data. Although we focus exclusively on this application, the same problem formulation appears in other domains as well, such as the task of modeling the dependencies between a set of economic indicators and the stock prices of various companies while also understanding the relationships among the companies.

3.1 A Joint Regression and Network Estimation Model

Given SNPs x ∈ R^p and genes y ∈ R^q, in order to jointly model the p-by-q regression parameter matrix B and the q-by-q inverse covariance matrix Θ, we begin with two core modeling assumptions,

    x ~ N(0, T)    (4)
    y | x ~ N(x^T B, E)    (5)

where T is the covariance of x and E is the conditional covariance of y | x. Given the above model, we can also derive the marginal distribution of y. To do this, we first use the fact that the marginal distribution p(y) is Gaussian. We can then use the law of total expectation and the law of total variance to derive the mean and covariance of y, as follows.

    E_y(y) = E_x(E_{y|x}(y | x)) = 0    (6)
    Cov_y(y) = E_x(Cov_{y|x}(y | x)) + Cov_x(E_{y|x}(y | x)) = E + B^T T B    (7)

Using these facts, we conclude that the distribution of y is given by

    y ~ N(0, Θ^{-1})    (8)

where Θ^{-1} = E + B^T T B denotes the marginal covariance of y. This allows us to explicitly relate Θ, the inverse covariance of y, to B, the matrix of regression parameters (we refer the reader to Equation B.44 of Appendix B in Bishop, 2006). Lastly, we simplify our model by assuming T = τ^2 I_{p×p} and E = ε^2 I_{q×q}. With this change, the relationship between B and Θ can be summarized as Θ^{-1} ∝ B^T B, because B is now the only term that contributes to the off-diagonal entries of Θ^{-1} and hence determines the inverse covariance structure among the genes.
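The marginal covariance identity in (7)-(8) can be checked numerically. The sketch below (toy dimensions, a fixed random seed, and the variable names are assumptions, not taken from the paper) samples from the generative model in (4)-(5) and compares the empirical covariance of y against E + B^T T B.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 5, 4, 200000
tau2, eps2 = 0.5, 0.3
T = tau2 * np.eye(p)                                 # covariance of x
E = eps2 * np.eye(q)                                 # conditional covariance of y given x
B = rng.normal(size=(p, q))

X = rng.multivariate_normal(np.zeros(p), T, size=n)           # x ~ N(0, T)
Y = X @ B + rng.multivariate_normal(np.zeros(q), E, size=n)   # y | x ~ N(x^T B, E)

empirical = np.cov(Y, rowvar=False)
analytic = E + B.T @ T @ B                           # marginal covariance from Eq. (7)
print(np.max(np.abs(empirical - analytic)))          # small for large n
```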

3.2 Estimating Model Parameters with a Fusion Penalty

Now that we have a model that explicitly captures B and Θ, we want to jointly estimate these parameters from the data while encouraging the relationship Θ^{-1} ∝ B^T B. To do this, we formulate our model as a convex optimization problem with an objective function of the form

    loss_{y|x}(B) + loss_y(Θ) + penalty(B, Θ)    (9)

where loss_{y|x}(B) is a loss function derived from the negative log likelihood of y | x, loss_y(Θ) is a loss function derived from the negative log likelihood of y, and penalty(B, Θ) is a penalty term that encourages shared structure between the estimates of B and Θ. Given n i.i.d. observations of x and y, let X be a matrix that contains one observation of x per row and let Y be a matrix that contains one observation of y per row. Then we define the inverse-covariance-fused lasso optimization problem as

    min_{B,Θ} (1/n) ||Y - XB||_F^2 + (1/n) tr(Y^T Y Θ) - log det(Θ) + λ_1 ||B||_1 + λ_2 ||Θ||_1 + γ Σ_{(k,m)} |θ_km| ||β_k + sgn(θ_km) β_m||_1    (10)

This objective effectively boils down to a combination of problems (1) and (3), with the addition of a graph-guided fusion penalty from (2) to encourage transfer learning between the estimates of B and Θ. We deconstruct the above objective by describing the role of each term in the model:

- The first term (1/n) ||Y - XB||_F^2 is the standard squared error loss for multiple-output regression, and is derived from the log likelihood of y | x ~ N(x^T B, ε^2 I). Its role is to encourage the coefficients B to map X to Y.
- The second term (1/n) tr(Y^T Y Θ) - log det(Θ) is the loss function for inverse covariance estimation over Y, and is derived from the log likelihood of y ~ N(0, Θ^{-1}). Its role is to encourage the network Θ to reflect the partial correlations among the outputs.
- The third term λ_1 ||B||_1 = λ_1 Σ_{j,k} |β_jk| is an ℓ1-norm penalty over the matrix of regression coefficients that induces sparsity in B.
- The fourth term λ_2 ||Θ||_1 = λ_2 Σ_{k,m} |θ_km| is an ℓ1-norm penalty over the precision matrix that induces sparsity in Θ.
- The final term γ Σ_{(k,m)} |θ_km| ||β_k + sgn(θ_km) β_m||_1 is a graph-guided fusion penalty that encourages similarity between the coefficients of closely related outputs; specifically, when y_k and y_m have a positive partial correlation and β_jk ≠ β_jm for any j, it imposes a penalty proportional to |θ_km|, and when y_k and y_m have a negative partial correlation and β_jk ≠ -β_jm for any j, it imposes a penalty proportional to |θ_km|.

Note that θ_km is negatively proportional to the partial correlation between y_k and y_m, meaning that a negative value of θ_km indicates a positive partial correlation and a positive value of θ_km indicates a negative partial correlation. The partial correlation coefficient between y_j and y_k is given by ρ_jk = -θ_jk / sqrt(θ_jj θ_kk) and is defined as the correlation between the errors of the best linear predictions of y_j and y_k when conditioned on the covariates {y_m : m ≠ j, k} (see, e.g., Peng et al. (2009)). This explains why the sign is flipped in the fusion penalty in (10) relative to the one in (2).
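To make the role of each term explicit, the following sketch (an illustrative evaluation routine, not the authors' implementation; all names are assumptions) computes the value of the inverse-covariance-fused lasso objective in (10) at a given pair (B, Θ). Following (10), the fusion term uses β_k + sgn(θ_km) β_m, i.e., the sign convention is flipped relative to (2).

```python
import numpy as np

def iclasso_objective(X, Y, B, Theta, lam1, lam2, gamma):
    n, q = Y.shape
    reg_loss = np.sum((Y - X @ B) ** 2) / n                       # (1/n) ||Y - XB||_F^2
    cov_loss = np.trace(Y.T @ Y @ Theta) / n - np.log(np.linalg.det(Theta))
    sparsity = lam1 * np.sum(np.abs(B)) + lam2 * np.sum(np.abs(Theta))
    fusion = 0.0
    for k in range(q):                                            # graph-guided fusion term
        for m in range(k + 1, q):
            t = Theta[k, m]
            if t != 0:
                fusion += abs(t) * np.sum(np.abs(B[:, k] + np.sign(t) * B[:, m]))
    return reg_loss + cov_loss + sparsity + gamma * fusion
```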

A derivation of the two loss functions can be found in Appendix A. These come directly out of the modeling assumptions given in (5) and (8). The two ℓ1-norm penalties induce sparsity in the estimates of B and Θ, which makes estimation feasible in the high-dimensional setting where p, q > n, and furthermore is necessary for interpreting the eQTL map and gene network. Lastly, we show that the graph-guided fused lasso penalty encourages the structure Θ^{-1} ∝ B^T B, thereby linking the two estimates. Note that λ_1, λ_2, and γ are regularization parameters that determine the relative weight of each penalty term.

First we consider the optimization problem Θ̂ = argmin_Θ f(Θ) := tr(B^T B Θ) - log det(Θ). We can solve this problem in closed form by taking the gradient ∇_Θ f(Θ) = B^T B - Θ^{-1} and setting it to 0, which yields the solution Θ̂ = (B^T B)^{-1}. This suggests that the penalty tr(B^T B Θ) encourages the desired structure, while the log determinant term enforces the constraint that Θ be positive semidefinite, which is necessary for Θ to be a valid inverse covariance matrix. Instead of directly using this penalty in our model, we demonstrate its equivalence to the graph-guided fused lasso penalty shown in (2) when we substitute Θ for W. We compare the trace penalty, denoted TRP, and the graph-guided fused lasso penalty, denoted GFL, below.

    TRP(B, Θ) = tr(B^T B Θ) = Σ_{k=1}^q Σ_{m=1}^q θ_km β_k^T β_m    (11)
    GFL(B, Θ) = Σ_{k=1}^q Σ_{m=1}^q |θ_km| ||β_k + sgn(θ_km) β_m||_1    (12)

We show how these penalties are equivalent by considering three different cases.

Case of θ_km = 0. When θ_km = 0, the corresponding term in (11) becomes 0 · β_k^T β_m = 0 and the corresponding term in (12) becomes 0 · ||β_k + sgn(θ_km) β_m||_1 = 0. Therefore there is nothing linking β_k and β_m when θ_km = 0 in either penalty.

Case of θ_km < 0. When θ_km < 0, the corresponding term in (11) is minimized when β_k^T β_m is large and positive, which occurs when β_k and β_m point in the same direction. Similarly, the corresponding term in (12) becomes |θ_km| ||β_k - β_m||_1, and the penalty is minimized when β_k = β_m. Therefore when θ_km is negative, both penalties encourage similarity between β_k and β_m with strength proportional to the magnitude of θ_km.

Case of θ_km > 0. When θ_km > 0, the corresponding term in (11) is minimized when β_k^T β_m is large and negative, which occurs when β_k and β_m point in opposite directions. Similarly, the corresponding term in (12) becomes θ_km ||β_k + β_m||_1, and the penalty is minimized when β_k = -β_m. Therefore when θ_km is positive, both penalties encourage similarity between β_k and -β_m with strength proportional to the magnitude of θ_km.

We choose to use the graph-guided fused lasso penalty instead of the trace penalty because it more strictly enforces the relationship between B and Θ by fusing the regression parameter values of highly correlated genes.
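The case analysis above can be seen on a toy example. In the sketch below (purely illustrative; the numbers are arbitrary), both the trace-penalty term θ_km β_k^T β_m from (11) and the fusion term |θ_km| ||β_k + sgn(θ_km) β_m||_1 from (12) are smallest when the two coefficient vectors agree in the direction favored by the sign of θ_km.

```python
import numpy as np

def trace_term(theta_km, bk, bm):
    return theta_km * (bk @ bm)                    # single term from Eq. (11)

def fusion_term(theta_km, bk, bm):
    return abs(theta_km) * np.sum(np.abs(bk + np.sign(theta_km) * bm))  # term from Eq. (12)

bk = np.array([1.0, -2.0, 0.5])
for theta_km in (-0.8, 0.8):                       # negative vs. positive theta_km
    for bm in (bk, -bk):                           # aligned vs. anti-aligned coefficients
        print(f"theta={theta_km:+.1f}  bm==bk: {np.allclose(bm, bk)}  "
              f"trace={trace_term(theta_km, bk, bm):+.2f}  "
              f"fusion={fusion_term(theta_km, bk, bm):.2f}")
```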

3.3 Relationship to Other Methods

There are currently two existing approaches that jointly estimate regression coefficients and network structure: multivariate regression with covariance estimation (MRCE), from Rothman et al. (2010), and conditional Gaussian graphical models (CGGMs), originally from Sohn and Kim (2012) and further developed by Wytock and Kolter (2013) and Yuan and Zhang (2014). In this section, we describe how our approach differs from these others. All three methods (including ours) assume that the inputs X and outputs Y are related according to the basic linear model Y = XB + E, where E is a matrix of Gaussian noise. However, each approach imposes a different set of additional assumptions on top of this, which we discuss below.

MRCE: This method assumes that E ~ N(0, Ω^{-1}), which leads to Y | X ~ N(XB, Ω^{-1}). MRCE estimates B and Ω by solving the following objective:

    min_{B,Ω} (1/n) tr((Y - XB)^T (Y - XB) Ω) - log det(Ω) + λ_1 ||B||_1 + λ_2 ||Ω||_1    (13)

It's very important to note that Ω is the conditional inverse covariance of Y | X, which actually corresponds to the inverse covariance of the noise matrix E rather than the inverse covariance of the output matrix Y. We therefore argue that Ω doesn't capture any patterns that are shared with the regression coefficients B, since by definition Ω encodes the structure in Y that cannot be explained by XB.

CGGM: This approach makes an initial assumption that X and Y are jointly Gaussian with the following distribution,

    (x, y) ~ N( 0, [ Γ  Λ ; Λ^T  Ω ]^{-1} )

where the block matrix is the joint precision (inverse covariance) matrix. In this formulation, the distribution of Y | X is given by Y | X ~ N(-XΛΩ^{-1}, Ω^{-1}). This corresponds to the reparameterization of B as -ΛΩ^{-1}, where Ω is the conditional inverse covariance matrix and Λ represents the direct influence of X on Y. CGGMs estimate Λ and Ω by solving the following optimization problem, where sparsity penalties are applied to Λ and Ω instead of B and Ω as was the case in (13):

    min_{Λ,Ω} (1/n) tr((Y + XΛΩ^{-1})^T (Y + XΛΩ^{-1}) Ω) - log det(Ω) + λ_1 ||Λ||_1 + λ_2 ||Ω||_1    (14)

Here the meaning of Ω has not changed, and it once again represents the inverse covariance of the noise matrix.

ICLasso: Our method implicitly assumes two underlying models: Y | X ~ N(XB, I) and Y ~ N(0, Θ^{-1}). In this case, Θ represents the marginal inverse covariance of Y rather than the conditional inverse covariance of Y | X, which was captured by Ω in (13) and (14). The optimization problem in (10) is obtained by combining the loss functions derived from the log likelihood of each model and then incorporating sparsity penalties over B and Θ and an additional graph-guided fusion penalty to encourage shared structure.

Both MRCE and CGGMs have two important drawbacks that are not shared by our approach. First, both of these methods estimate Ω, the precision matrix of the noise term, rather than Θ, the precision matrix of the outputs Y. This means that they are estimating a network over the outputs in which each edge represents a partial correlation after conditioning on both the xs and all other ys. Second, neither method incorporates a structured sparsity penalty that explicitly encourages shared structure between the network and the regression coefficients. In fact, it would not make sense for these methods to apply a joint penalty over B and Ω because, as discussed above, we wouldn't expect these parameters to have any shared structure. By comparison, our method learns the true output network Θ and uses a graph-guided fused lasso penalty to explicitly encourage outputs that are closely related in Θ to have similar parameter values in B.

4 Optimization via Alternating Minimization

In this section we present an efficient algorithm to solve the inverse-covariance-fused lasso problem defined in (10). We start by rewriting the fusion penalty as follows:

    GFL(B, Θ) = γ Σ_{(k,m)} |θ_km| ||β_k + sgn(θ_km) β_m||_1
              = γ Σ_{(k,m)} max{θ_km, 0} ||β_k + β_m||_1 + γ Σ_{(k,m)} max{-θ_km, 0} ||β_k - β_m||_1,

from which it is clear that GFL is a bi-convex function, meaning that with fixed B, it is convex in Θ, and vice versa. Thus, upon defining

    g(B) = (1/n) ||Y - XB||_F^2 + λ_1 ||B||_1
    h(Θ) = (1/n) tr(Y^T Y Θ) - log det(Θ) + λ_2 ||Θ||_1,

we can rewrite the original objective as

    min_{B,Θ} g(B) + h(Θ) + GFL(B, Θ).    (15)

The function g is the usual lasso formulation in (1), the function h is the graphical lasso formulation in (3), and the graph-guided fusion penalty couples the two problems. Fortunately, since GFL is bi-convex, we can solve the joint problem (15) using an alternating minimization strategy. Next we leverage and extend state-of-the-art convex optimization routines to solve each convex sub-problem.

Fix Θ, Minimize B. When Θ is fixed, minimizing the objective over B reduces to the well-known graph-guided fused lasso problem,

    f_Θ(B) = g(B) + GFL(B, Θ),    (16)

which we optimize using the proximal-average proximal gradient descent (PA-PG) algorithm from Yu (2013). This algorithm is very simple. On each iteration, we first take a gradient step of the form B ← B - η X^T (XB - Y) using some small step size η. Then we compute the weighted average of the component proximal operators for each pair of outputs, where the prox that corresponds to pair (k, m) is given by

    B̂ = argmin_B (1/(2η)) ||B - Z||_F^2 + ||β_{·k} + sgn(θ_km) β_{·m}||_1    (17)

and the weight of this term is given by |θ_km| / θ_tot, where θ_tot = Σ_{(k,m)} |θ_km|. Due to the separability of (17) over the rows of B, we can solve for each row β_j independently. Furthermore, it's clear that for any i ∉ {k, m}, we have β̂_ji = z_ji. Solving for the remaining elements β_jk and β_jm leads to the following two-dimensional subproblem,

    (β̂_jk, β̂_jm) = argmin_{β_jk, β_jm} (1/(2η)) [ (β_jk - z_jk)^2 + (β_jm - z_jm)^2 ] + |β_jk + sgn(θ_km) β_jm|,    (18)

which can be solved in closed form. Therefore the full solution to the prox operator can be written compactly as follows, where d_km = z_{·k} + sgn(θ_km) z_{·m} and the operations are applied element-wise:

    β̂_{·i} = z_{·i} for i ∉ {k, m}
    β̂_{·k} = z_{·k} - sgn(d_km) min{η, |d_km| / 2}
    β̂_{·m} = z_{·m} - sgn(θ_km) sgn(d_km) min{η, |d_km| / 2}
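The closed-form prox above is simple to implement. The sketch below (a minimal illustration under the notation of (17)-(18); the function and variable names are assumptions) applies the proximal operator for a single output pair (k, m) to every row of a matrix Z at once.

```python
import numpy as np

def pair_fusion_prox(Z, k, m, sign_km, eta):
    """Prox of eta * ||beta_k + sign_km * beta_m||_1 at Z; only columns k and m change."""
    Zp = Z.copy()
    d = Z[:, k] + sign_km * Z[:, m]                          # d_km, computed row-wise
    step = np.sign(d) * np.minimum(eta, np.abs(d) / 2.0)
    Zp[:, k] = Z[:, k] - step
    Zp[:, m] = Z[:, m] - sign_km * step
    return Zp
```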

From these formulas, we can see that β_jk and sgn(θ_km) β_jm are always fused towards each other. For example, when sgn(θ_km) < 0, we want to push β_jk and β_jm towards the same value. In this case, the larger of z_jk and z_jm will be decremented and the smaller value will be incremented by the same quantity. In contrast, when sgn(θ_km) > 0, the solution pushes β_jk towards -β_jm. We summarize the procedure in Algorithm 1 of Appendix B. In practice, we use the accelerated version of the algorithm, PA-APG. Using the argument from Yu (2013), we can prove that this accelerated algorithm converges to an ε-optimal solution in at most O(1/ε) steps, which is significantly better than the O(1/ε^2) convergence rate of subgradient descent.

Fix B, Minimize Θ. When B is fixed, minimizing the objective over Θ reduces to a variation of the well-known graphical lasso problem,

    f_B(Θ) = h(Θ) + GFL(B, Θ),    (19)

which can be optimized by adapting the block coordinate descent (BCD) algorithm of Friedman et al. (2008). Indeed, we can rewrite the objective by introducing two q-by-q dimensional coefficient matrices U and L whose elements are defined as

    U_km = (1/n) Y_{·k}^T Y_{·m} + λ_2 + γ ||β_{·k} + β_{·m}||_1    (20)
    L_km = (1/n) Y_{·k}^T Y_{·m} - λ_2 - γ ||β_{·k} - β_{·m}||_1.    (21)

Using this notation, we collect all linear terms involving Θ_+ := max{Θ, 0} and Θ_- := max{-Θ, 0} and reformulate the objective given in (19) as

    min_Θ -log det(Θ) + ⟨Θ_+, U⟩ - ⟨Θ_-, L⟩.    (22)

The graphical lasso is recovered as the special case γ = 0, in which U and L differ only by the constant 2λ_2; in our case, U and L differ entry-wise because of the structure of the GFL penalty. Nevertheless, we can derive a block coordinate algorithm for this more general setting. First we dualize (22) to get the following problem,

    max_{L ≤ Ξ ≤ U} log det(Ξ),    (23)

where Θ = Ξ^{-1}. Then it can be shown that the diagonal of the covariance Ξ must attain the upper bound, i.e., we must have Ξ_jj = U_jj for j = 1, ..., q. Next, we perform block coordinate descent by cycling through each column (or row, due to symmetry) of Ξ. We denote an arbitrary column of Ξ by ξ_j, with corresponding columns u_j and l_j in U and L, respectively. Let Ξ_{-j} be the submatrix of Ξ obtained by deleting column j and row j. Then, by applying Schur's complement, maximizing (23) with respect to ξ_j with all other columns fixed amounts to

    min_{l_j ≤ ξ_j ≤ u_j} (1/2) ξ_j^T Ξ_{-j}^{-1} ξ_j.    (24)

Dualizing again, with ξ_j = Ξ_{-j} α, we obtain

    min_α (1/2) α^T Ξ_{-j} α + u_j^T α_- - l_j^T α_+,    (25)

which is essentially a lasso problem that we can solve using any known algorithm. Algorithm 2 of Appendix B outlines our procedure for solving (19). We use coordinate descent and apply a variant of the soft-thresholding operator to solve for each coordinate. This algorithm converges very quickly because there is no tuning of the step size, and each iteration involves only a matrix-vector product.
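The column subproblem can be solved with a simple coordinate-descent loop in which each update is an asymmetric variant of soft-thresholding. The sketch below is an illustration under the sign conventions used in the reconstruction of (25) above, not the authors' Algorithm 2; the function and variable names are assumptions, and the routine minimizes exactly the objective stated in its docstring.

```python
import numpy as np

def solve_column_subproblem(Xi_sub, u, l, n_sweeps=100):
    """Minimize 0.5 * a^T Xi_sub a + u^T max(-a, 0) - l^T max(a, 0) by coordinate descent."""
    d = len(u)
    a = np.zeros(d)
    for _ in range(n_sweeps):
        for i in range(d):
            c = Xi_sub[i] @ a - Xi_sub[i, i] * a[i]    # contribution of the other coordinates
            if l[i] > c:                               # optimum lies in the positive orthant
                a[i] = (l[i] - c) / Xi_sub[i, i]
            elif u[i] < c:                             # optimum lies in the negative orthant
                a[i] = (u[i] - c) / Xi_sub[i, i]
            else:                                      # asymmetric soft-threshold to zero
                a[i] = 0.0
    return a
```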

Figure 1: A comparison of results on a single synthetic dataset with p = 60, q = 60, and a block-structured output graph. The far left panel contains the ground truth for B and Θ. The remaining panels show the estimates of the regression coefficients for each method (top) along with the graph structure that was used or estimated by the method (bottom).

5 Simulation Study

We first evaluate our method on synthetic data with known values of B and Θ so that we can directly measure how well the true parameter values are recovered. We compare our regression estimates B̂ to several baselines, including the standard multi-task lasso (Lasso), the graph-guided fused lasso using a sparse covariance matrix as its graph (GFLasso), the graph-guided fused lasso using a sparse precision matrix as its graph (GFLasso2), sparse multivariate regression with covariance estimation (MRCE), and the conditional Gaussian graphical model (CGGM). We also compare our network estimates Θ̂ to the graphical lasso (GLasso). For each method, we selected hyperparameter values by minimizing the error on a held-out validation set.

For this analysis, we used two different synthetic data settings, each imposing a different network structure over the outputs:

- block-structured network: the outputs are divided into non-overlapping groups, and each group forms a fully connected sub-graph in the network
- tree-structured network: the outputs are related to one another according to a tree with a single root node and a branching factor of three, and all nodes share edges with their parents and children in the network

Each synthetic dataset is generated using p = 60 inputs, q ∈ {30, 60, 90, 120} outputs, and n = 50 samples. After constructing a network over the outputs using one of the structures described above, we generate the sparsity pattern and parameter values for the coefficient matrix B. For the block-structured network, we choose a random 5% of inputs for each group, and assign positive weights between these inputs and all nodes in the group. We draw coefficient values according to v_j ~ Unif(0.2, 0.8) and set β_jk = v_j for all k in the group, such that all members of the group have the same coefficient value for these inputs. We also randomly select 5% additional inputs for each group that will be shared across all members of that group and one other group, and generate coefficients in the same way. For the tree-structured network, we choose a random 5% of inputs for each node, and assign a positive weight between those inputs and both the node and each of its children. We similarly generate coefficient values according to v_j ~ Unif(0.2, 0.8) and set β_jk = v_j for all k in {parent, children}. Next we generate X ~ N(0, Σ) where Σ_jk = 0.6^{|j-k|}. Finally, given X and B, we generate Y ~ N(XB, I). We create multiple synthetic datasets by varying the network structure and the number of outputs, and for each setting we average our results over 15 synthetic datasets.
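For reference, the sketch below generates a simplified version of the block-structured setting described above (it omits the extra 5% of inputs shared between neighboring groups and the tree-structured case; the dimensions, random seed, and variable names are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n, n_groups = 60, 60, 50, 6
groups = np.array_split(np.arange(q), n_groups)

B = np.zeros((p, q))
for group in groups:
    chosen = rng.choice(p, size=max(1, p // 20), replace=False)   # ~5% of inputs per group
    for j in chosen:
        B[j, group] = rng.uniform(0.2, 0.8)      # one shared coefficient value per input

Sigma = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # Sigma_jk = 0.6^|j-k|
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = X @ B + rng.normal(size=(n, q))              # Y ~ N(XB, I)
```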

A synthetic data example is shown in Figure 1. The ground truth for both B and Θ is given in the far left panel. The next three columns show the estimated values of B for three competing methods, and the results of our method are shown on the far right. In this example, the drawbacks of each of the baseline methods are evident. The covariance matrix used for the network structure in GFLasso captures many spurious patterns in Y that don't correspond to true patterns in the regression map, which confuses the estimate of B. The precision matrix used for the network structure in GFLasso2 does not accurately capture the true inverse covariance structure because of the low signal-to-noise ratio in Y. This prevents the fusion penalty from effectively influencing the estimate of B. Finally, although CGGM gets a reasonable estimate of the network, this structure is not explicitly enforced in B, which still leads to a poor estimate of the regression parameters. In contrast, the cleanest estimate of both B̂ and Θ̂ comes from ICLasso.

Next we quantitatively evaluate how well each method is able to estimate the sparsity structure of B and Θ. Since the hyperparameter values selected by minimizing the error on a validation set do not usually correspond to the values that lead to the most accurate sparsity, we calculate a precision-recall curve on top of B̂ and Θ̂. To do this, we vary the threshold at which each parameter value is considered zero or nonzero and we calculate the precision and recall at each threshold.
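The thresholding procedure can be written out directly. The sketch below (helper names are illustrative, not from the paper) sweeps a threshold over the entries of |B̂| and computes the precision and recall of the recovered support against the true support of B; the same routine applies to Θ̂.

```python
import numpy as np

def support_precision_recall(B_true, B_hat, n_thresholds=50):
    truth = (B_true != 0).ravel()
    scores = np.abs(B_hat).ravel()
    curve = []
    for t in np.linspace(0.0, scores.max(), n_thresholds):
        pred = scores > t                    # entries considered nonzero at this threshold
        tp = np.sum(pred & truth)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(truth.sum(), 1)
        curve.append((t, precision, recall))
    return curve
```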

Figure 2: Average precision-recall curves for B (left) and Θ (right) for synthetic datasets with two different parameter settings: p = 60, q = 120 with a block-structured graph (top), and p = 60, q = 60 with a tree-structured graph (bottom).

Figure 2 shows the precision-recall curves for B and Θ averaged over 15 datasets in two different parameter settings. The top two plots show results for a block-structured network with q = 120. Here it is evident that ICLasso significantly outperforms all of the baselines in its estimate of both the regression map and network structure. The bottom two plots show results for a tree-structured network with q = 60. Although ICLasso wins on B by a small margin, its true advantage in this setting is its estimate of Θ. To further quantify these results, the average area under the PR curve for B is 0.87 ± 0.01 for ICLasso and 0.79 ± 0.02 for GFLasso2, which is the closest baseline. The average area under the PR curve for Θ is 0.91 ± 0.01 for ICLasso and 0.82 ± 0.06 for GLasso.

Figure 3: Area under the precision-recall curve and best F1 score over all regularization parameter values for B and Θ, shown for different numbers of outputs q and averaged over 30 synthetic datasets generated with both graph types.

Figure 4: Test set prediction error on synthetic data with p = 60, q = 120, and a block-structured graph. Each box shows the mean (red), standard error interval (blue), and standard deviation (black) over 15 random datasets.

Figure 3 shows the average area under the precision-recall curve and best F1 score (over all hyperparameter values) for B and Θ, averaged over both network structures for each value of q. This shows that ICLasso consistently outperforms all baselines over multiple random datasets generated with multiple parameter settings. Finally, we examine the prediction error of each model when using B̂ to predict Y from X. Figure 4 shows a box plot of the error values obtained in one parameter setting. This demonstrates that ICLasso obtains lower error than the other methods. Overall, the results on synthetic data demonstrate that ICLasso performs significantly better than competing methods on a wide variety of metrics (accurate recovery of the sparse structure in B, accurate recovery of the sparse structure in Θ, and prediction of Y) when our data assumptions hold.
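The two summary statistics reported in Figure 3 can be computed from the same scores. The sketch below (an illustrative implementation using scikit-learn, not the paper's code) returns the area under the precision-recall curve and the best F1 score over all thresholds for an estimated coefficient matrix.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def aucpr_and_best_f1(B_true, B_hat):
    y_true = (B_true != 0).ravel().astype(int)
    scores = np.abs(B_hat).ravel()
    precision, recall, _ = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return auc(recall, precision), f1.max()
```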

6 Yeast eQTL Mapping

In order to evaluate our approach in a real-world setting, we applied ICLasso to a yeast eQTL dataset from Brem and Kruglyak (2005) that consists of 2,956 SNP genotypes and 5,637 gene expression measurements across 114 yeast samples. Before running our method, we removed 1,799 SNPs with duplicate genotypes and retained only the top 25% of genes with the highest variance in expression, leaving p = 1,157 SNPs and q = 1,409 genes in our analysis.

We used our approach to jointly perform eQTL mapping and gene network inference on the yeast dataset, treating the SNPs as inputs X and the genes as outputs Y. We trained our model on 91 samples and used the remaining 23 samples as a validation set for tuning the hyperparameters, which yielded λ_1 = 0.54, λ_2 = 0.57, and γ = 0.8. Given the trained model, we read the eQTL associations from the regression coefficient matrix B̂, which encodes SNP-gene relationships, and obtained the gene network from the inverse covariance matrix Θ̂, which encodes gene-gene relationships. In addition to ICLasso, we ran Lasso and GFLasso on the yeast data to obtain two additional estimates of B, and ran the graphical lasso to obtain another estimate of Θ (the hyperparameters for each of these methods were selected using the same validation set, which yielded λ = 0.23 for GFLasso and λ = 0.47 for GLasso).

Table 1 shows the density of B̂ obtained with each method, along with the prediction error of Y on the training set and on the held-out validation set, which were calculated using ||Y_train - X_train B̂||_F^2 and ||Y_valid - X_valid B̂||_F^2, respectively. We chose not to sacrifice any data for a test set, but these results indicate that ICLasso achieves an equivalent or better fit to the training and validation sets than Lasso and GFLasso.

Table 1: Regression error on yeast data (density of B̂, training error, and validation error for each method).

    method     density
    Lasso      1.65%
    GFLasso    2.87%
    ICLasso    6.88%

6.1 Quantitative Analysis

Because the true yeast eQTLs and gene network structure are not known, there is no ground truth for this problem. We instead analyzed the output of each method by performing a series of enrichment analyses that together provide a comprehensive picture of the biological coherence of the results. An enrichment analysis uses gene annotations to identify specific biological processes, functions, or structures that are over-represented among a group of genes relative to the full set of genes that is examined (Subramanian et al., 2005). To evaluate our yeast data results, we performed three types of enrichment analyses: biological process and molecular function enrichment using annotations from the Gene Ontology (GO) database (Ashburner et al., 2000), and pathway enrichment using annotations from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa and Goto, 2000). We used a hypergeometric test to compute a p-value for each term, and then adjusted the values to account for multiple hypothesis testing. Significance was determined using a cutoff on the adjusted p-values.
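The per-term test used here is a standard hypergeometric test. The sketch below is illustrative; the Benjamini-Hochberg adjustment is an assumption, since the text does not name the exact multiple-testing correction used.

```python
import numpy as np
from scipy.stats import hypergeom

def term_enrichment_pvalue(n_background, n_term, n_selected, n_overlap):
    # P(overlap >= n_overlap) when n_selected genes are drawn from a background of
    # n_background genes, n_term of which carry the annotation term.
    return hypergeom.sf(n_overlap - 1, n_background, n_term, n_selected)

def benjamini_hochberg(pvals):
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
    out = np.empty_like(p)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```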

Table 2: GO and KEGG enrichment analysis on the yeast eQTL map. For each method (Lasso, GFLasso, ICLasso), the table reports the number of enriched terms and the number of enriched SNPs in each category (GO-BP, GO-MF, KEGG), along with the average increase.

Table 3: GO and KEGG enrichment analysis on the yeast gene network. For each method (GLasso, ICLasso), the table reports the number of enriched terms and the number of enriched clusters in each category (GO-BP, GO-MF, KEGG), along with the average increase.

We first analyzed B̂ by performing a per-SNP enrichment analysis. For each SNP j, we used the nonzero elements in β_j to identify the set of genes associated with the SNP. Next we performed GO and KEGG enrichment analyses on this group of genes by comparing their annotations to the full set of 1,409 genes that we included in our study. We repeated this procedure for each SNP, and calculated the total number of terms that were enriched over all SNPs to obtain a global measure of enrichment for B̂. In addition, we calculated the total number of SNPs that were enriched for at least one term in each category. These results are summarized in Table 2. It is evident that ICLasso vastly outperforms both GFLasso and Lasso on estimating the regression coefficients, since it has more than twice as many enriched terms in GO biological process, GO molecular function, and KEGG than either baseline method.

Next we used a similar approach to evaluate the structure present in Θ̂. We first obtained groups of genes by using spectral clustering to perform community detection among the genes based on the inferred network structure. After clustering the genes into 100 groups (we also clustered with 25, 50, and 200 groups and obtained similar results), we performed GO and KEGG enrichment analyses on each cluster and calculated the total number of enriched terms along with the total number of clusters that were enriched for at least one term. These results are summarized in Table 3. Once again, our approach has more enrichment than the baseline in every category, which implies that the gene network estimated by ICLasso has a much more biologically correct structure than the network estimated by GLasso.

6.2 Qualitative Analysis

The quantitative results in Tables 2 and 3 indicate that, compared to other methods, our approach identifies more eQTLs that are associated with genes significantly enriched in certain biological processes and pathways. A more detailed examination of our results revealed that many of the enriched terms correspond to metabolic pathways, and that the eQTLs we identified agree with those discovered in a previous study that analyzed the effect of genetic variations on the yeast metabolome.
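The community-detection step described in Section 6.1 can be reproduced with off-the-shelf spectral clustering. The sketch below (an illustrative choice of tools, not necessarily the one used by the authors) clusters the genes using the magnitudes of the estimated precision-matrix entries as a precomputed affinity.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_gene_network(Theta_hat, n_clusters=100, seed=0):
    affinity = np.abs(Theta_hat).astype(float)
    np.fill_diagonal(affinity, 0.0)                 # keep only gene-gene edge weights
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(affinity)              # cluster label for each gene
```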

Breunig et al. (2014) identified the metabolite quantitative trait loci (mQTLs) for 34 metabolites and then examined each mQTL for the presence of metabolic genes in the same pathway as the linked metabolite. We found that 10 of these 34 metabolites were linked to metabolic genes where our identified eQTLs reside. For example, Breunig et al. determined that the metabolite valine is linked to an mQTL in a region spanned by the ILV6 gene, which encodes a protein involved in valine biosynthesis. In our study, we also identified an eQTL located in ILV6. Moreover, we found that the eQTL in ILV6 is associated with 365 genes that are significantly enriched for pathways involved in the metabolism and biosynthesis of various amino acids. This is consistent with the fact that the metabolism and biosynthesis of amino acids in the cell need to be coordinated.

Furthermore, our enrichment analysis shows that the eQTL-associated genes we identified are enriched for various metabolic pathways (e.g., sulfur, riboflavin, protein, starch, and sucrose metabolism, oxidative phosphorylation, glycolysis), as well as more general pathways, such as cell cycle pathways and MAPK pathways. This is consistent with the roles of the mQTLs identified by Breunig et al. Interestingly, among these genes, SAM1, encoding an S-adenosylmethionine synthetase, is also among the eQTLs in our list. Our results show that the eQTL we found in SAM1 is associated with 252 genes that are enriched for cytoplasmic translation and ribosome functions, consistent with the fact that S-adenosylmethionine is the methyl donor in most methylation reactions and is essential for the methylation of proteins, nucleic acids, and lipids (Roberts and Selker, 1995).

7 Conclusion

In this work, we propose a new model called the inverse-covariance-fused lasso, which jointly estimates regression coefficients B and an output network Θ while using a graph-guided fused lasso penalty to explicitly encourage shared structure. Our model is formulated as a biconvex optimization problem, and we derive new, efficient optimization routines for each convex sub-problem based on existing methods. Our results on both synthetic and real data unequivocally demonstrate that our model achieves significantly better performance on recovery of the structure of B, recovery of the structure of Θ, and prediction error than all six baselines that we evaluated.

In this paper, we demonstrated that our approach can effectively be used to perform joint eQTL mapping and gene network estimation on a yeast dataset, yielding more biologically coherent results than previous work. However, the same problem setting appears in many different applications, and the inverse-covariance-fused lasso model can therefore be effectively used within a wide range of domains.

References

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25-29.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Brem, R. B. and Kruglyak, L. (2005). The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences, 102(5):1572-1577.

Breunig, J. S., Hackett, S. R., Rabinowitz, J. D., and Kruglyak, L. (2014). Genetic basis of metabolome variation in yeast. PLoS Genetics, 10(3).

Chen, X., Kim, S., Lin, Q., Carbonell, J. G., and Xing, E. P. (2010). Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint.

Dempster, A. P. (1972). Covariance selection. Biometrics, 28(1):157-175.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441.

Gardner, T. S. and Faith, J. J. (2005). Reverse-engineering transcription control networks. Physics of Life Reviews, 2(1).

Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27-30.

Kim, S., Sohn, K.-A., and Xing, E. P. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204-i212.

Kim, S. and Xing, E. P. (2009). Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8).

Kim, S., Xing, E. P., et al. (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. The Annals of Applied Statistics, 6(3).

Lee, W. and Liu, Y. (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis, 111.

Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. (2010). Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14).

Michaelson, J. J., Alberts, R., Schughart, K., and Beyer, A. (2010). Data-driven assessment of eQTL mapping methods. BMC Genomics, 11(1):502.

Mohan, K., London, P., Fazel, M., Witten, D., and Lee, S.-I. (2014). Node-based learning of multiple Gaussian graphical models. Journal of Machine Learning Research, 15(1).


Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.

More information

Hub Gene Selection Methods for the Reconstruction of Transcription Networks

Hub Gene Selection Methods for the Reconstruction of Transcription Networks for the Reconstruction of Transcription Networks José Miguel Hernández-Lobato (1) and Tjeerd. M. H. Dijkstra (2) (1) Computer Science Department, Universidad Autónoma de Madrid, Spain (2) Institute for

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Learning discrete graphical models via generalized inverse covariance matrices

Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

Chapter 17: Undirected Graphical Models

Chapter 17: Undirected Graphical Models Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Structure estimation for Gaussian graphical models

Structure estimation for Gaussian graphical models Faculty of Science Structure estimation for Gaussian graphical models Steffen Lauritzen, University of Copenhagen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 3 Slide 1/48 Overview of

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Learning Gaussian Graphical Models with Unknown Group Sparsity

Learning Gaussian Graphical Models with Unknown Group Sparsity Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation

More information

Node-Based Learning of Multiple Gaussian Graphical Models

Node-Based Learning of Multiple Gaussian Graphical Models Journal of Machine Learning Research 5 (04) 445-488 Submitted /; Revised 8/; Published /4 Node-Based Learning of Multiple Gaussian Graphical Models Karthik Mohan Palma London Maryam Fazel Department of

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Sparse Additive Functional and kernel CCA

Sparse Additive Functional and kernel CCA Sparse Additive Functional and kernel CCA Sivaraman Balakrishnan* Kriti Puniyani* John Lafferty *Carnegie Mellon University University of Chicago Presented by Miao Liu 5/3/2013 Canonical correlation analysis

More information

Lecture 25: November 27

Lecture 25: November 27 10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

Learning Markov Network Structure using Brownian Distance Covariance

Learning Markov Network Structure using Brownian Distance Covariance arxiv:.v [stat.ml] Jun 0 Learning Markov Network Structure using Brownian Distance Covariance Ehsan Khoshgnauz May, 0 Abstract In this paper, we present a simple non-parametric method for learning the

More information

Sparse Graph Learning via Markov Random Fields

Sparse Graph Learning via Markov Random Fields Sparse Graph Learning via Markov Random Fields Xin Sui, Shao Tang Sep 23, 2016 Xin Sui, Shao Tang Sparse Graph Learning via Markov Random Fields Sep 23, 2016 1 / 36 Outline 1 Introduction to graph learning

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

The lasso: some novel algorithms and applications

The lasso: some novel algorithms and applications 1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Bayesian Inference of Multiple Gaussian Graphical Models

Bayesian Inference of Multiple Gaussian Graphical Models Bayesian Inference of Multiple Gaussian Graphical Models Christine Peterson,, Francesco Stingo, and Marina Vannucci February 18, 2014 Abstract In this paper, we propose a Bayesian approach to inference

More information

Proximity-Based Anomaly Detection using Sparse Structure Learning

Proximity-Based Anomaly Detection using Sparse Structure Learning Proximity-Based Anomaly Detection using Sparse Structure Learning Tsuyoshi Idé (IBM Tokyo Research Lab) Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center) 2009/04/ SDM 2009 /

More information

Robust Inverse Covariance Estimation under Noisy Measurements

Robust Inverse Covariance Estimation under Noisy Measurements .. Robust Inverse Covariance Estimation under Noisy Measurements Jun-Kun Wang, Shou-De Lin Intel-NTU, National Taiwan University ICML 2014 1 / 30 . Table of contents Introduction.1 Introduction.2 Related

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim Gene Regulatory Networks II 02-710 Computa.onal Genomics Seyoung Kim Goal: Discover Structure and Func;on of Complex systems in the Cell Identify the different regulators and their target genes that are

More information

Predicting causal effects in large-scale systems from observational data

Predicting causal effects in large-scale systems from observational data nature methods Predicting causal effects in large-scale systems from observational data Marloes H Maathuis 1, Diego Colombo 1, Markus Kalisch 1 & Peter Bühlmann 1,2 Supplementary figures and text: Supplementary

More information

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Robust and sparse Gaussian graphical modelling under cell-wise contamination

Robust and sparse Gaussian graphical modelling under cell-wise contamination Robust and sparse Gaussian graphical modelling under cell-wise contamination Shota Katayama 1, Hironori Fujisawa 2 and Mathias Drton 3 1 Tokyo Institute of Technology, Japan 2 The Institute of Statistical

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Heterogeneous Multitask Learning with Joint Sparsity Constraints

Heterogeneous Multitask Learning with Joint Sparsity Constraints Heterogeneous ultitas Learning with Joint Sparsity Constraints Xiaolin Yang Department of Statistics Carnegie ellon University Pittsburgh, PA 23 xyang@stat.cmu.edu Seyoung Kim achine Learning Department

More information

Grouping of correlated feature vectors using treelets

Grouping of correlated feature vectors using treelets Grouping of correlated feature vectors using treelets Jing Xiang Department of Machine Learning Carnegie Mellon University Pittsburgh, PA 15213 jingx@cs.cmu.edu Abstract In many applications, features

More information

Exact Hybrid Covariance Thresholding for Joint Graphical Lasso

Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Exact Hybrid Covariance Thresholding for Joint Graphical Lasso Qingming Tang Chao Yang Jian Peng Jinbo Xu Toyota Technological Institute at Chicago Massachusetts Institute of Technology Abstract. This

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

An Introduction to Graphical Lasso

An Introduction to Graphical Lasso An Introduction to Graphical Lasso Bo Chang Graphical Models Reading Group May 15, 2015 Bo Chang (UBC) Graphical Lasso May 15, 2015 1 / 16 Undirected Graphical Models An undirected graph, each vertex represents

More information

6. Regularized linear regression

6. Regularized linear regression Foundations of Machine Learning École Centrale Paris Fall 2015 6. Regularized linear regression Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe agathe.azencott@mines paristech.fr

More information

Within Group Variable Selection through the Exclusive Lasso

Within Group Variable Selection through the Exclusive Lasso Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,

More information

Discovering molecular pathways from protein interaction and ge

Discovering molecular pathways from protein interaction and ge Discovering molecular pathways from protein interaction and gene expression data 9-4-2008 Aim To have a mechanism for inferring pathways from gene expression and protein interaction data. Motivation Why

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation

A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation Cho-Jui Hsieh Dept. of Computer Science University of Texas, Austin cjhsieh@cs.utexas.edu Inderjit S. Dhillon Dept. of Computer Science

More information

Sparse Principal Component Analysis via Alternating Maximization and Efficient Parallel Implementations

Sparse Principal Component Analysis via Alternating Maximization and Efficient Parallel Implementations Sparse Principal Component Analysis via Alternating Maximization and Efficient Parallel Implementations Martin Takáč The University of Edinburgh Joint work with Peter Richtárik (Edinburgh University) Selin

More information

Predicting Workplace Incidents with Temporal Graph-guided Fused Lasso

Predicting Workplace Incidents with Temporal Graph-guided Fused Lasso Predicting Workplace Incidents with Temporal Graph-guided Fused Lasso Keerthiram Murugesan 1 and Jaime Carbonell 1 1 Language Technologies Institute Carnegie Mellon University Pittsburgh, USA CMU-LTI-15-??

More information

A note on the group lasso and a sparse group lasso

A note on the group lasso and a sparse group lasso A note on the group lasso and a sparse group lasso arxiv:1001.0736v1 [math.st] 5 Jan 2010 Jerome Friedman Trevor Hastie and Robert Tibshirani January 5, 2010 Abstract We consider the group lasso penalty

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Correlate. A method for the integrative analysis of two genomic data sets

Correlate. A method for the integrative analysis of two genomic data sets Correlate A method for the integrative analysis of two genomic data sets Sam Gross, Balasubramanian Narasimhan, Robert Tibshirani, and Daniela Witten February 19, 2010 Introduction Sparse Canonical Correlation

More information

Joint Gaussian Graphical Model Review Series I

Joint Gaussian Graphical Model Review Series I Joint Gaussian Graphical Model Review Series I Probability Foundations Beilun Wang Advisor: Yanjun Qi 1 Department of Computer Science, University of Virginia http://jointggm.org/ June 23rd, 2017 Beilun

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives Paul Grigas May 25, 2016 1 Boosting Algorithms in Linear Regression Boosting [6, 9, 12, 15, 16] is an extremely

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Inferring Transcriptional Regulatory Networks from High-throughput Data

Inferring Transcriptional Regulatory Networks from High-throughput Data Inferring Transcriptional Regulatory Networks from High-throughput Data Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20

More information