10-708: Probabilistic Graphical Models, Spring
25 : Graphical induced structured input/output models
Lecturer: Eric P. Xing    Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche)

1 Graph structured fused lasso

So far we have seen how probabilistic graphical models can be used to express structure, i.e. the dependence between the input and output variables. This lecture discusses alternative graph-based models which, instead of modeling probabilistic assumptions about the data, encode the input/output structure of the data in an objective function via novel regularizers. The basic form of such graph-based objective functions involves an empirical loss term and a regularizer that captures the graphical dependencies between the parameters.

Such graph-induced models have a wide range of applications, particularly in computational biology. One such application is in genetics, where the goal is to find SNPs (variations in the genome) that are relevant to a disease. The hypothesis is that multiple such mutations/variations are jointly responsible for a disease. However, since the number of SNPs is very large (a few million), checking all combinations of SNPs for correlation with the disease is computationally hard. The problem is further exacerbated by the presence of multiple phenotypes, i.e. characteristics relevant to the disease, making it an association mapping problem whose goal is to find strong associations or correlations between several SNPs (the genotype) and multiple phenotypes (such as allergy, blood pressure, etc.). In addition, we want to consider the multiple correlated phenotypes jointly while finding the association with the genotype.

The statistical challenge is: given multivariate input X (the SNPs) and multivariate output Y (the phenotypes), identify the association between X and Y, where the output covariates Y can carry structure such as a graph connecting the phenotypes or a tree structure connecting genes. The work by [1] proposes a formulation for the case where the output Y is a graph of traits, with edges between traits indicating their correlation. A further challenge is that the number of available examples is on the order of a few thousand, whereas the number of covariates/features in X is on the order of a few million, which makes the problem statistically under-determined.

Let $X$ be an $N \times J$ design matrix of genotypes for $N$ individuals and $J$ SNPs. Let $Y$ denote an $N \times K$ matrix of quantitative-trait (i.e. phenotype) measurements over the same set of individuals. We use $y_k$ to denote the k-th column (i.e. trait) of $Y$. Let $G$, with vertices $V$ and edges $E$, be the graph representing the relationship between the traits (called the Quantitative Trait Network). Due to the multivariate output, the parameters form a matrix $B \in \mathbb{R}^{J \times K}$ instead of a vector. This matrix $B = (\beta_1, \beta_2, \ldots, \beta_K)$, where $\beta_k = (\beta_{1k}, \ldots, \beta_{Jk}) \in \mathbb{R}^J$, encodes the structure and strength of the association; the parameter $\beta_{jk}$ represents the association strength between SNP $j$ and trait $k$. See Figure 1 for an illustration. We want to bias our learning towards finding sparse associations, because these are biologically more meaningful and at the same time make the problem statistically viable. The standard approach for finding sparse models is the lasso. However, the lasso cannot be applied directly here, as it fails to capture the dependence amongst the parameters $\beta_{jk}$.
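
As a concrete illustration of this setup, the following sketch builds a small synthetic dataset with the shapes defined above; the dimensions, genotype coding, and noise level are assumptions made purely for this illustration (real association-mapping data would have $J$ in the millions and $N$ in the thousands).

import numpy as np

rng = np.random.default_rng(0)
N, J, K = 100, 50, 5                                 # individuals, SNPs, traits
X = rng.integers(0, 3, size=(N, J)).astype(float)    # genotypes coded as 0/1/2 minor-allele counts
B_true = np.zeros((J, K))
B_true[:3, :2] = 1.0                                 # a few SNPs associated with the first two traits
Y = X @ B_true + 0.1 * rng.standard_normal((N, K))   # quantitative trait measurements
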
The work by [1] uses two regularizer terms: the first is similar to the lasso penalty and enforces the sparsity constraint, and the second enforces the graph structure constraints and is called the graph-constrained fusion penalty. They propose two objective functions to express the graph structure.

Model G_cFlasso: This model uses the graph without the edge weights. The objective function with the two regularizers and a squared-error loss term has the form:

    \hat{B} = \arg\min_B \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_k \sum_j |\beta_{jk}| + \gamma \sum_{(m,l)\in E} \sum_j |\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl}|    (1)

The fusion penalty $|\beta_{jk} - \beta_{jm}|$, where $\beta_{jk}$ and $\beta_{jm}$ are the association strengths of SNP $j$ with the correlated traits $k$ and $m$, tries to enforce similar association strengths for the two traits. See Figure 1 for an illustration.

Model G_wFlasso: This model uses the edge weights in the graph. The objective function has the form:

    \hat{B} = \arg\min_B \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_k \sum_j |\beta_{jk}| + \gamma \sum_{(m,l)\in E} f(r_{ml}) \sum_j |\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl}|    (2)

Figure 1: A figure illustrating the association strengths for two correlated traits k and m.

If two traits m and l are highly correlated in the graph G, with a relatively large edge weight, the fusion effect over the two traits intensifies, and as a result the difference between the two corresponding regression coefficients $\beta_{jm}$ and $\beta_{jl}$ is penalized more heavily than for pairs of traits with weaker correlation. Compared to G_cFlasso, G_wFlasso is significantly more flexible because it uses the edge weights to incorporate the strength of correlation. For example, when two groups of highly correlated traits show a relatively weaker correlation across the two subnetworks, G_wFlasso can handle the hierarchical subgroup structure and adjust the amount of fusion accordingly by weighting each fusion term.
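
To make the objectives concrete, here is a minimal sketch (not the implementation from [1]) that evaluates the G_wFlasso objective of Eqn (2) for a given coefficient matrix; the edge-weight function $f(r_{ml}) = |r_{ml}|$, the dictionary-based edge representation, and all names are assumptions made for this illustration.

import numpy as np

def gwflasso_objective(B, X, Y, edges, r, lam, gamma):
    """Evaluate Eqn (2).  B: J x K coefficients, X: N x J genotypes, Y: N x K traits,
    edges: list of trait pairs (m, l), r[(m, l)]: correlation between traits m and l.
    Assumes the weight function f(r) = |r|."""
    resid = Y - X @ B
    loss = np.sum(resid ** 2)                # squared-error loss, summed over traits
    sparsity = lam * np.sum(np.abs(B))       # lasso penalty
    fusion = 0.0                             # graph-constrained fusion penalty
    for (m, l) in edges:
        sgn = np.sign(r[(m, l)])
        fusion += abs(r[(m, l)]) * np.sum(np.abs(B[:, m] - sgn * B[:, l]))
    return loss + sparsity + gamma * fusion

Setting $f(r) \equiv 1$ in the fusion term recovers the unweighted G_cFlasso objective of Eqn (1).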

The optimization problems in Eqn (1) and Eqn (2) are convex and can be formulated as quadratic programs using an approach similar to that used for the fused lasso. Although many publicly available software packages efficiently solve such quadratic programs, these approaches do not scale in computation time to large problems involving hundreds or thousands of traits. Since the main difficulty in directly optimizing Eqn (1) and Eqn (2) arises from the non-smooth $\ell_1$ norm, the problem can be transformed into an equivalent form that involves only smooth functions, and a fast coordinate-descent algorithm can then be used to find the estimates of the regression coefficients. It can be shown that solving the optimization problem in Eqn (2) is equivalent to solving the following problem, which involves only smooth squared $\ell_2$ terms:

    \min_{\beta_{jk},\, d_{jk},\, d_{j,ml}} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_k \frac{(\beta_{jk})^2}{d_{jk}} + \gamma \sum_{(m,l)\in E} \sum_j f(r_{ml})^2 \frac{(\beta_{jm} - \mathrm{sign}(r_{ml}) \beta_{jl})^2}{d_{j,ml}}    (3)

    subject to  \sum_{j,k} d_{jk} = 1,  \sum_j \sum_{(m,l)\in E} d_{j,ml} = 1,  d_{jk} \ge 0 for all j, k,  d_{j,ml} \ge 0 for all j and (m,l) \in E,

where the $d_{jk}$'s and $d_{j,ml}$'s are additional variables that we need to estimate. We solve this problem with a coordinate-descent approach that iteratively updates the variables of interest, $\beta_{jk}$ and $(d_{jk}, d_{j,ml})$, until there is little improvement in the value of the objective function. Using this approach, we first fix the values of $d_{jk}$ and $d_{j,ml}$, and find the update equation for $\beta_{jk}$ by differentiating the objective in Eqn (3) with respect to each $\beta_{jk}$ and setting it to zero. The coordinate-descent procedure finds the optimal $\hat{B}$ for fixed regularization parameters $\lambda$ and $\gamma$; these parameters can be determined by cross-validation or by using a validation set.
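
The following is a simplified sketch of this alternating scheme, assuming $f(r) = |r|$: for fixed $B$ the auxiliary variables $d$ have closed-form minimizers, and for fixed $d$ the sketch updates one trait's coefficient vector at a time by a ridge-like linear solve (the lecture's procedure updates individual coordinates $\beta_{jk}$ in closed form). It illustrates the structure of the updates only and is not the algorithm of [1]; all names are hypothetical.

import numpy as np

def gflasso_coordinate_descent(X, Y, edges, r, lam, gamma, n_iter=50, eps=1e-8):
    """Alternate between closed-form updates of the auxiliary variables d in
    Eqn (3) and ridge-like solves for each trait's coefficients beta_k."""
    J, K = X.shape[1], Y.shape[1]
    B = np.zeros((J, K))
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(n_iter):
        # d-updates: minimizers of Eqn (3) for fixed B (up to normalization)
        d_lasso = np.abs(B) + eps
        d_lasso /= d_lasso.sum()
        d_fuse = {e: abs(r[e]) * np.abs(B[:, e[0]] - np.sign(r[e]) * B[:, e[1]]) + eps
                  for e in edges}
        total = sum(v.sum() for v in d_fuse.values())
        d_fuse = {e: v / total for e, v in d_fuse.items()}
        # beta-updates: solve the smooth quadratic subproblem for each trait k
        for k in range(K):
            A = XtX + lam * np.diag(1.0 / d_lasso[:, k])
            b = XtY[:, k].copy()
            for (m, l) in edges:
                if k not in (m, l):
                    continue
                other = l if k == m else m
                w = gamma * r[(m, l)] ** 2 / d_fuse[(m, l)]   # f(r)^2 = r^2
                A = A + np.diag(w)
                b = b + w * np.sign(r[(m, l)]) * B[:, other]
            B[:, k] = np.linalg.solve(A, b)
    return B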

2 Tree-guided Group Lasso

2.1 Motivation

In a univariate-output regression setting, sparse regression methods that extend the lasso have been proposed to allow the recovered relevant inputs to reflect the underlying structural information among the inputs. The group lasso applies an $\ell_1$ norm of the lasso penalty over groups of inputs, while using an $\ell_2$ norm for the input variables within each group. This $\ell_1/\ell_2$ norm has been extended to a more general setting that encodes prior knowledge about various sparsity patterns, where the key idea is to allow the groups to overlap. However, overlapping groups in such regularizers can cause an imbalance among different outputs, because the regression coefficients of an output that appears in a large number of groups are penalized more heavily than those of outputs belonging to fewer groups. Tree-guided group lasso for multi-task regression with structured sparsity was therefore proposed. It uses a novel weighting scheme that systematically weights each group in the tree-guided group-lasso penalty so that clusters of strongly correlated outputs are encouraged to share common covariates more than clusters of weakly correlated outputs. This model is also motivated by the genetic association mapping problem, where the goal is to identify a small number of SNPs (inputs), out of millions, that influence phenotypes (outputs) such as gene-expression measurements.

2.2 Background on Sparse Regression and Multi-task Learning

The basic linear model for multi-task regression is $y_k = X\beta_k + \epsilon_k$, $k = 1, 2, \ldots, K$, where $\beta_k$ is a vector of $J$ regression coefficients for the k-th output, and $\epsilon_k$ is a vector of $N$ independent error terms with mean 0 and constant variance. $X$ denotes the $N \times J$ input matrix, and $Y$ denotes the $N \times K$ output matrix. When $J$ is large and the number of inputs relevant to the output is small, the lasso offers an effective feature-selection method for this model. Let $B = (\beta_1, \beta_2, \ldots, \beta_K)$ denote the $J \times K$ matrix of regression coefficients for all $K$ outputs. Then the lasso obtains $\hat{B}^{lasso}$ by solving the following optimization problem:

    \hat{B}^{lasso} = \arg\min_B \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_k \sum_j |\beta_{jk}|,

where $\lambda$ is a tuning parameter that controls the amount of sparsity in the solution. In multi-task learning, where the goal is to select input variables that are relevant to at least one task, an $\ell_1/\ell_2$ penalty has been used to take advantage of the relatedness of the outputs. The $\ell_1/\ell_2$-penalized multi-task regression is defined as the following optimization problem:

    \hat{B}^{\ell_1/\ell_2} = \arg\min_B \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \|\beta^j\|_2,

where $\beta^j$ denotes the j-th row of $B$. The $\ell_1$ part of the penalty selects inputs relevant to at least one task, and the $\ell_2$ part combines information across tasks. Since the $\ell_2$ penalty does not encourage sparsity, if the j-th input is selected as relevant, all of the elements of $\beta^j$ take non-zero values. Thus, the estimate $\hat{B}^{\ell_1/\ell_2}$ is sparse only across inputs, not across outputs.

2.3 Tree-Guided Group Lasso for Sparse Multiple-output Regression

We assume that the relationships among the outputs can be represented as a tree $T$ with a set of vertices $V$ of size $|V|$. Given this tree $T$ over the outputs, we generalize the $\ell_1/\ell_2$ regularization to a tree regularization as follows. We expand the $\ell_2$ part of the $\ell_1/\ell_2$ penalty into a group-lasso penalty, where the groups are defined by $T$: each node $v \in V$ is associated with a group $G_v$ whose members are all of the output variables (leaf nodes) in the subtree rooted at node $v$. Given these groups of outputs, tree-guided group lasso can be written as

    \hat{B}^{Tree} = \arg\min_B \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2,

where $\beta^j_{G_v}$ is the vector of regression coefficients $(\beta_{jm} : m \in G_v)$. Each group of regression coefficients $\beta^j_{G_v}$ is weighted by $w_v$, which reflects the strength of correlation within the group. In order to define the weights $w_v$, we first associate each internal node $v$ of the tree $T$ with two quantities $s_v$ and $g_v$ satisfying $s_v + g_v = 1$: $s_v$ represents the weight for selecting the output variables associated with each of the children of node $v$ separately, and $g_v$ represents the weight for selecting them jointly. Given an arbitrary tree $T$, we recursively apply this construction from the root node towards the leaf nodes, so that

    \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 = W_j(v_{root}),

where

    W_j(v) = s_v \sum_{c \in \mathrm{Children}(v)} W_j(c) + g_v \|\beta^j_{G_v}\|_2   if v is an internal node,
    W_j(v) = \sum_{m \in G_v} |\beta_{jm}|                                            if v is a leaf node.
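
The recursion is straightforward to evaluate directly. Below is a minimal sketch that computes $W_j(v_{root})$ for one input $j$, given a tree encoded as child lists; the encoding and all names are assumptions made for this illustration.

def tree_penalty(v, children, s, g, beta_abs):
    """Evaluate W_j(v) recursively for one input j.
    children[v]: list of children of node v (empty list for a leaf),
    s[v], g[v]: separate/joint selection weights with s[v] + g[v] = 1,
    beta_abs[m]: |beta_{jm}| for each output (leaf) m."""
    if not children[v]:                                  # leaf node
        return beta_abs[v]
    leaves = collect_leaves(v, children)                 # the group G_v rooted at v
    group_norm = sum(beta_abs[m] ** 2 for m in leaves) ** 0.5
    return (s[v] * sum(tree_penalty(c, children, s, g, beta_abs) for c in children[v])
            + g[v] * group_norm)

def collect_leaves(v, children):
    if not children[v]:
        return [v]
    return [m for c in children[v] for m in collect_leaves(c, children)]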

It can be shown that the following relationship holds between the $w_v$'s and the pairs $(s_v, g_v)$:

    w_v = g_v \prod_{m \in \mathrm{Ancestors}(v)} s_m   if v is an internal node,
    w_v = \prod_{m \in \mathrm{Ancestors}(v)} s_m       if v is a leaf node.

This weighting scheme extends the elastic-net-like penalty hierarchically. At each internal node $v$, a high value of $s_v$ encourages a separate selection of inputs for the outputs associated with node $v$, whereas a high value of $g_v$ encourages a joint covariate selection across those outputs. If $s_v = 1$ and $g_v = 0$ for all $v \in V$, only separate selections are performed, and the tree-guided group lasso penalty reduces to the lasso penalty. On the other hand, if $s_v = 0$ and $g_v = 1$ for all $v \in V$, the penalty reduces to the $\ell_1/\ell_2$ penalty, which performs only a joint covariate selection over all outputs.

2.4 Example

Figure 2: Tree-guided group lasso.

Given the tree in the figure above (leaves $v_1, v_2, v_3$, internal node $v_4$ with children $v_1$ and $v_2$, and root $v_5$), the tree-guided group-lasso penalty for the j-th input is given as follows:

    W_j(v_{root}) = W_j(v_5) = g_{v_5} \|\beta^j_{G_{v_5}}\|_2 + s_{v_5} \big( W_j(v_4) + W_j(v_3) \big)
                  = g_{v_5} \|\beta^j_{G_{v_5}}\|_2 + s_{v_5} \big( g_{v_4} \|\beta^j_{G_{v_4}}\|_2 + s_{v_4} ( W_j(v_1) + W_j(v_2) ) \big) + s_{v_5} |\beta_{j3}|
                  = g_{v_5} \|\beta^j_{G_{v_5}}\|_2 + s_{v_5} g_{v_4} \|\beta^j_{G_{v_4}}\|_2 + s_{v_5} s_{v_4} ( |\beta_{j1}| + |\beta_{j2}| ) + s_{v_5} |\beta_{j3}|.
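
As a quick sanity check on the weighting scheme, the following sketch plugs made-up values of $s_v$, $g_v$, and the coefficients (all hypothetical) into the example tree and verifies numerically that the closed-form weights $w_v$ reproduce the recursive expansion above.

import numpy as np

# Example tree of Figure 2: leaves v1, v2, v3; internal node v4 = {v1, v2}; root v5.
beta = {1: 0.5, 2: -1.0, 3: 2.0}       # beta_{j1}, beta_{j2}, beta_{j3} (made up)
s = {4: 0.6, 5: 0.3}
g = {4: 0.4, 5: 0.7}                   # s_v + g_v = 1 at each internal node

def group_norm(members):
    return np.sqrt(sum(beta[m] ** 2 for m in members))

# Closed-form weights: product of the ancestors' s_v, times g_v for internal nodes.
w = {1: s[4] * s[5], 2: s[4] * s[5], 3: s[5], 4: g[4] * s[5], 5: g[5]}
penalty_weights = (w[1] * abs(beta[1]) + w[2] * abs(beta[2]) + w[3] * abs(beta[3])
                   + w[4] * group_norm([1, 2]) + w[5] * group_norm([1, 2, 3]))

# The recursive expansion of W_j(v_root) derived above.
penalty_recursion = (g[5] * group_norm([1, 2, 3]) + s[5] * g[4] * group_norm([1, 2])
                     + s[5] * s[4] * (abs(beta[1]) + abs(beta[2])) + s[5] * abs(beta[3]))

assert np.isclose(penalty_weights, penalty_recursion)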

2.5 Parameter Estimation

In order to estimate the regression coefficients in tree-guided group lasso, we use an alternative formulation of the problem, previously introduced for the group lasso, given as

    \hat{B}^{Tree} = \arg\min_B \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \Big( \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 \Big)^2.

Since the $\ell_1/\ell_2$ norm in this equation is a non-smooth function, it is not trivial to optimize directly. We make use of the fact that the variational formulation of a mixed-norm regularization equals a weighted $\ell_2$ regularization:

    \Big( \sum_{v \in V} w_v \|\beta^j_{G_v}\|_2 \Big)^2 \le \sum_{v \in V} \frac{w_v^2 \|\beta^j_{G_v}\|_2^2}{d_{j,v}},

where $\sum_{v \in V} d_{j,v} = 1$ and $d_{j,v} \ge 0$ for all $j, v$, and the equality holds for

    d_{j,v} = \frac{w_v \|\beta^j_{G_v}\|_2}{\sum_{v' \in V} w_{v'} \|\beta^j_{G_{v'}}\|_2}.

Thus, we can rewrite the problem so that it contains only smooth functions:

    \hat{B}^{Tree} = \arg\min_{B,\, d} \sum_k (y_k - X\beta_k)^T (y_k - X\beta_k) + \lambda \sum_j \sum_{v \in V} \frac{w_v^2 \|\beta^j_{G_v}\|_2^2}{d_{j,v}}
    subject to  \sum_{v \in V} d_{j,v} = 1,  d_{j,v} \ge 0  for all j, v,

where we have introduced additional variables $d_{j,v}$ that need to be estimated. We solve this problem by optimizing the $\beta_k$'s and the $d_{j,v}$'s alternately over iterations until convergence. In each iteration, we first fix the values of the $\beta_k$'s and update the $d_{j,v}$'s; then we hold the $d_{j,v}$'s constant and optimize over the $\beta_k$'s. Differentiating the objective with respect to $\beta_k$, setting it to zero, and solving for $\beta_k$ gives the update equation

    \beta_k = (X^T X + \lambda D)^{-1} X^T y_k,

where $D$ is a $J \times J$ diagonal matrix with $\sum_{v \in V} w_v^2 / d_{j,v}$ as the j-th element along the diagonal. Finally, the regularization parameter $\lambda$ can be selected using cross-validation.
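
The alternating procedure can be summarized in a few lines. The following is a simplified sketch, not the implementation of [2]: the $d_{j,v}$'s are updated with their closed-form minimizers given above, and the $\beta_k$'s with the ridge-like solve; the group encoding and all names are assumptions made for this illustration.

import numpy as np

def tree_group_lasso(X, Y, groups, w, lam, n_iter=50, eps=1e-8):
    """Alternating estimation for tree-guided group lasso.
    groups: list of index sets G_v over the K outputs, w: the matching weights w_v."""
    J, K = X.shape[1], Y.shape[1]
    B = np.zeros((J, K))
    XtX, XtY = X.T @ X, X.T @ Y
    for _ in range(n_iter):
        # d-update: d_{j,v} proportional to w_v * ||beta^j_{G_v}||_2 (eps added for stability)
        norms = np.stack([w[v] * np.linalg.norm(B[:, list(G)], axis=1)
                          for v, G in enumerate(groups)], axis=1)          # shape J x |V|
        d = (norms + eps) / (norms + eps).sum(axis=1, keepdims=True)
        # beta-update: beta_k = (X^T X + lam * D)^{-1} X^T y_k,
        # with D diagonal and D_jj = sum_v w_v^2 / d_{j,v}
        D_diag = (np.array([w[v] ** 2 for v in range(len(groups))]) / d).sum(axis=1)
        for k in range(K):
            B[:, k] = np.linalg.solve(XtX + lam * np.diag(D_diag), XtY[:, k])
    return B

In practice the loop would terminate once the decrease in the objective value falls below a tolerance, and $\lambda$ would be chosen by cross-validation as described above.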

References

[1] Seyoung Kim and Eric P. Xing. Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network. PLoS Genetics, 2009.

[2] Seyoung Kim and Eric P. Xing. Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity.
