Covariance decomposition in multivariate analysis


By BEATRIX JONES¹ & MIKE WEST
Institute of Statistics and Decision Sciences, Duke University, Durham NC
{trix,mw}@stat.duke.edu

Summary

The covariance between two variables in a multivariate Gaussian distribution is decomposed into a sum of path weights for all paths connecting the two variables in an undirected graph. These weights are useful in determining which variables are important in mediating correlation between the two path endpoints. The decomposition arises in undirected Gaussian graphical models and does not require or involve any assumptions of causality. The novel theory underlying this covariance decomposition is derived using basic linear algebra. The resulting weights can be interpreted as involving two components: a measure of the strengths of the relationships between the variables along the path, conditioning on all other variables; and a function of the entropy lost in that conditioning. The decomposition is feasible for very large numbers of variables if the corresponding precision matrix is sparse, a circumstance that arises in currently topical examples such as gene expression studies in functional genomics. Additional computational efficiencies are possible when the undirected graph is derived from an acyclic directed graph.

Some key words: Cofactor expansions; Concentration graph; Conditional independence; Covariance selection; Path analysis.

¹ From summer the mailing address for the first author will be: Beatrix Jones, Institute of Information and Mathematical Sciences, Massey University, Private Bag, North Shore Mail Centre, Auckland, New Zealand.

1 Introduction

Graphical models over undirected graphs (Lauritzen 1996) are increasingly used to exhibit the conditional (in)dependence structures in multivariate distributions, and advances in computational techniques are providing increased access to methods associated with graphical modelling in problems of increasing complexity and dimension (Dobra et al. 2004; Rich et al. 2004). Undirected Gaussian graphical models, where edges in the graph correspond to non-zero entries in the precision matrix, are a special case of particular interest (Dempster 1972; Giudici 1996; Jones et al. 2004). In exploring and aiming to interpret relationships exhibited in such multivariate analyses, we are often faced with questions about how subsets of possibly many variables play roles in, or mediate, observed relationships between pairs of variables.

In this communication, we present a novel representation of the covariance between two random variables in a multivariate Gaussian distribution in terms of sums of components related to individual paths between the variables in an underlying graphical model. This decomposition directly defines path weights that contribute to the overall pairwise association, and so highlights, and quantifies, the nature of the roles played by intervening variables along multiple such paths. The covariance decomposition we present is derived using basic linear algebra, relying only on analytical expressions using cofactors for matrix determinants and inverses. The linear algebra literature has used similar approaches to, for instance, provide alternate expressions for the determinant of a matrix (Johnson et al. 1994); however, to our knowledge they have not been used to aid in interpretation of covariance patterns and associations in graphical models.

We envision this decomposition as providing an alternative to the venerable results on path coefficients introduced by Wright (1921), which provide covariance decomposition along paths between two variables in a context of directional modelling. Originally developed in the context of quantitative genetics, Wright's results are now used in a variety of fields to examine the relative importance of direct and indirect variate relationships (e.g. sociology, Duncan, 1966, and ecology, Grace & Pugesek, 1998). However, use of the method of path coefficients is predicated on an a priori understanding or hypothesis of directed causal relationships between the variables (Wright, 1934). Application relies on successive marginalisation over variables without descendants, so altering the direction of one or more edges produces a different decomposition, even if the corresponding covariance matrix is unaltered. In contrast, our new theory of covariance decomposition arises directly from the covariance matrix and is therefore not reliant on directionality; consequently, it is suitable for use in exploratory analysis of (roughly Gaussian) multivariate data, where the interest lies in generating insights into the nature of contributions to an estimated or inferred pairwise association from potentially many intervening variables. The implied weight on any single path between two variables can be represented, and interpreted, in terms of the product of two components: one component is a measure of the strength of the relationships between the variables along the path in the distribution conditional on all other variables; the second component is a simple function of the entropy lost in the conditioning.

As in Wright's directional path decompositions, these path weights will help prioritize which inter-variate relationships are most worthy of additional investigation. For example, the precision matrix of many gene expression levels may be inferred from an observational study of human patients; the highest weighted path between a pair of genes of interest may then generate insight into underlying biological pathways, or suggest which additional experiments in cell lines or model systems will be most fruitful for elucidating the major relationships between genes in the system.

2 Covariance Decomposition Over Paths

Consider an n-dimensional multivariate Gaussian distribution with a finite and non-singular covariance matrix Σ. It is convenient to work in terms of the inverse of Σ, the precision matrix Ω. In a graphical model over an undirected graph, the n nodes of the graph represent variables, and the absence of an edge between any two nodes represents the fact that the corresponding two variables are conditionally independent given all other variables. In the case of a Gaussian distribution, conditional independence of two variables implies, and is implied by, a zero entry in the corresponding element of Ω (Dempster 1972); i.e., the pattern of non-zero off-diagonal elements of the precision matrix is that of the adjacency matrix of the graph.

The elements of the covariance matrix Σ can be written in terms of the cofactors $\Omega_{ij}$ of Ω:
$$
\Sigma = \frac{1}{\det(\Omega)}
\begin{pmatrix}
\Omega_{11} & \Omega_{21} & \cdots & \Omega_{n1} \\
\Omega_{12} & \Omega_{22} & \cdots & \Omega_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
\Omega_{1n} & \Omega_{2n} & \cdots & \Omega_{nn}
\end{pmatrix}.
$$
Thus $\sigma_{xy} = \Omega_{xy}/\det(\Omega)$. Recalling that $\Omega_{ij}$ is $(-1)^{i+j}$ times the determinant of the matrix resulting from eliminating the $i$th row and $j$th column from Ω, we see that the expression can be further expanded by expressing $\Omega_{xy}$ using a cofactor expansion along the column corresponding to the variable $x$. Let $\Omega_{xy,ix}$ be the appropriate power of $-1$ times the determinant of the matrix where we have eliminated the row corresponding to variable $x$, the column corresponding to $y$, and then the row corresponding to variable $i$ and the column corresponding to $x$. Then
$$
\sigma_{xy} = \frac{1}{\det(\Omega)}\left(\omega_{1x}\Omega_{xy,1x} + \omega_{2x}\Omega_{xy,2x} + \cdots + \omega_{nx}\Omega_{xy,nx}\right).
$$
Terms in this sum will drop out for cases with $\omega_{ix} = 0$. In the Gaussian case we are considering, this corresponds exactly to cases where there is no edge between variables $i$ and $x$ in the corresponding graphical model.
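As a quick numerical illustration (not part of the original paper), the following Python sketch checks the cofactor identity $\sigma_{xy} = \Omega_{xy}/\det(\Omega)$ on an arbitrary positive definite precision matrix; indices are zero-based and the matrix values are assumed purely for illustration.

```python
# Minimal numerical check of sigma_xy = Omega_xy / det(Omega)
# on an arbitrary (assumed) positive definite precision matrix.
import numpy as np

def cofactor(M, i, j):
    """(-1)^(i+j) times the determinant of M with row i and column j removed."""
    sub = np.delete(np.delete(M, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(sub)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Omega = A @ A.T + 5 * np.eye(5)      # a positive definite precision matrix
Sigma = np.linalg.inv(Omega)         # the corresponding covariance matrix

x, y = 0, 4
print(Sigma[x, y])                                   # sigma_xy from the inverse
print(cofactor(Omega, x, y) / np.linalg.det(Omega))  # Omega_xy / det(Omega); same value
```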

Note that there is no term with $\omega_{xx}$ because it has already been eliminated from Ω in producing $\Omega_{xy}$. The non-zero terms can then be expanded further in terms of adjacent edges. For example, for the first term we can consider paths through node 1:
$$
\omega_{1x}\Omega_{xy,1x} = \omega_{1x}\omega_{21}\Omega_{xy,1x,21} + \cdots + \omega_{1x}\omega_{n1}\Omega_{xy,1x,n1},
$$
again letting $\Omega_{xy,ix,ji}$ represent the determinant of the matrix where we have successively eliminated the row and column corresponding to variables $x$ and $y$, $i$ and $x$, and finally $j$ and $i$, and also subsuming the signs in this term. Imagine continuing the expansion along adjacent edges until either (i) an edge to $y$ is reached, or (ii) there are no more adjacent edges. In the latter case, the determinant of the remaining matrix is zero, and the term does not contribute to $\sigma_{xy}$. The first case constitutes a path from $x$ to $y$; the remaining cofactor term is the determinant of Ω with the variables in the path removed. The contribution of a path through variables $P = p_1, p_2, \ldots, p_m$ is then
$$
(-1)^{m+1}\,\omega_{p_2 p_1}\omega_{p_3 p_2}\cdots\omega_{p_m p_{m-1}}\,\frac{\det(\Omega_{\setminus P})}{\det(\Omega)}, \qquad (1)
$$
where $\Omega_{\setminus P}$ is the precision matrix with only the rows and columns for variables not in $P$. The fact that the relevant power of $-1$ corresponds to the number of edges in the path can be seen by considering the matrix where $x$ is the first variable listed, $y$ is the $m$th variable listed, and the intervening variables are listed in the order they are traversed from $x$ to $y$, with variables not in the path listed after $m$. For this case, when we eliminate the row corresponding to $x$ (the first) and the column corresponding to $y$ (the $m$th), we introduce a factor of $(-1)^{m+1}$. The rest of the eliminations involve the first row and first column, so the sign is not altered. This argument generalizes to any ordering of the variables, since changing the index of a particular variable involves two elementary operations, a row swap and a column swap, and thus does not alter the relevant determinant. Each $\omega_{ij}$ has a sign opposite to that of the associated partial regression coefficients, so the path weight has the same sign as the product of the partial regression coefficients associated with the edges.

An alternate representation of (1) is useful in its interpretation. It can be written $S\,V$, where
$$
S = (-1)^{m+1}\,\frac{\omega_{p_2 p_1}\omega_{p_3 p_2}\cdots\omega_{p_m p_{m-1}}}{\det(\Omega_P)}, \qquad
V = \frac{\det(\Omega_P)\det(\Omega_{\setminus P})}{\det(\Omega)},
$$
and $\Omega_P$ is the submatrix of Ω containing only the rows and columns for variables in $P$. $S$ is the path weight in the reduced system obtained by conditioning on all variables not in the path. Unless the path variables are independent of all other variables in the system, this conditioning will reduce the variance of the path variables. $V$ reflects the increase in the covariance resulting from restoring this variance. Further, $V = \exp(-2H)$, where $H$ is the entropy lost in the conditioning:
$$
H = \frac{1}{2}\log\left(\frac{\det(\Sigma_{P \mid \setminus P})}{\det(\Sigma_P)}\right)
  = \frac{1}{2}\log\left(\frac{\det(\Omega)}{\det(\Omega_P)\det(\Omega_{\setminus P})}\right),
$$
with $\Sigma_{P \mid \setminus P}$ the conditional covariance matrix of the path variables given the others, and $\Sigma_P$ their marginal covariance matrix. $H$ is always non-positive, so $\exp(-2H) \geq 1$.
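To make the decomposition concrete, the sketch below (the helper names and the sparse precision matrix are assumed for illustration; this is not the authors' code) implements expression (1), enumerates all paths between two non-adjacent variables along the edges of the graph, checks that the path weights sum to $\sigma_{xy}$, and factorises each weight into $S$ and $V$, with $V \geq 1$ as noted above.

```python
import numpy as np

def path_weight(Omega, path):
    """Weight of a path under expression (1); path lists variable indices,
    endpoints included, traversed from x to y."""
    m = len(path)
    edge_prod = np.prod([Omega[path[k + 1], path[k]] for k in range(m - 1)])
    rest = [i for i in range(Omega.shape[0]) if i not in path]
    det_rest = np.linalg.det(Omega[np.ix_(rest, rest)]) if rest else 1.0
    return (-1) ** (m + 1) * edge_prod * det_rest / np.linalg.det(Omega)

def S_and_V(Omega, path):
    """Factorise the path weight into S (weight in the system conditioned on the
    non-path variables) and V (the entropy factor, always >= 1)."""
    m = len(path)
    edge_prod = np.prod([Omega[path[k + 1], path[k]] for k in range(m - 1)])
    rest = [i for i in range(Omega.shape[0]) if i not in path]
    det_P = np.linalg.det(Omega[np.ix_(path, path)])
    det_rest = np.linalg.det(Omega[np.ix_(rest, rest)]) if rest else 1.0
    return (-1) ** (m + 1) * edge_prod / det_P, det_P * det_rest / np.linalg.det(Omega)

def simple_paths(Omega, start, end):
    """All simple paths from start to end that follow edges of the graph,
    i.e. nonzero off-diagonal entries of Omega."""
    n = Omega.shape[0]
    out = []
    def dfs(node, visited):
        if node == end:
            out.append(list(visited))
            return
        for nxt in range(n):
            if nxt != node and nxt not in visited and Omega[node, nxt] != 0:
                dfs(nxt, visited + [nxt])
    dfs(start, [start])
    return out

# Hypothetical five-variable precision matrix (values assumed for illustration only).
# Variables 0 and 4 are not adjacent, so every contribution to sigma_04 runs
# through intervening variables.
Omega = np.array([[3.0, 0.8, 0.6, 0.0, 0.0],
                  [0.8, 3.0, 0.4, 0.7, 0.0],
                  [0.6, 0.4, 3.0, 0.0, 0.9],
                  [0.0, 0.7, 0.0, 3.0, 0.5],
                  [0.0, 0.0, 0.9, 0.5, 3.0]])
Sigma = np.linalg.inv(Omega)

x, y = 0, 4
weights = {tuple(p): path_weight(Omega, p) for p in simple_paths(Omega, x, y)}
for p, w in weights.items():
    S, V = S_and_V(Omega, list(p))
    print(p, round(w, 5), round(S * V, 5), round(V, 3))   # w == S * V, and V >= 1
print("sum of weights:", round(sum(weights.values()), 5),
      " sigma_xy:", round(Sigma[x, y], 5))
```

The final two printed numbers agree, which is exactly the statement that the path weights decompose the covariance between the two endpoints.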

3 Examples

Two examples provide useful illustration of the decomposition. The figures of graphs in these examples were produced using GraphExplore (Wang & Dobra, 2004).

3.1 An Illustrative Synthetic Example

Consider the following simple four-variable example. Variables x, y, a and b have the precision matrix given in Table 1, corresponding to the graph in Figure 1.

Table 1: Precision matrix for the variables depicted in Figure 1 (rows and columns ordered x, a, b, y).

Figure 1: Undirected graph corresponding to the precision matrix in Table 1.

There are three paths from a to b, with weights given in Table 2.

Table 2: Path weights for all paths between a and b, and between x and y.

    a to b           weight        x to y           weight
    a-b                            x-a-y
    a-x-b                          x-b-y
    a-y-b                          x-a-b-y
                                   x-b-a-y
    Total (σ_ab)                   Total (σ_xy)

These weights indicate that the direct path between a and b does not contribute as much to the correlation as the indirect paths via x and y. The direct path also has sign opposite to the marginal correlation of a and b. We will call paths of this type moderating paths, and paths with the same sign as the marginal correlation mediating paths. We do not use the terms mediating and moderating to imply any causal relationship; any causal relationship compatible with the undirected graph is possible, possibly including unobserved variables. Nevertheless, the breakdown of the path weights implies that study of the mediating variables x and y, rather than just the direct interactions between a and b, is important in understanding the bulk of the correlation between a and b.

Now consider the four paths from x to y in this same graph, with weights also given in Table 2. Here we see that a and b are each utilized in one mediating path and in two moderating paths of smaller magnitude.

In such situations it may be more useful to classify variables as mediating or moderating based on the sum of the weights of the paths through them, rather than examining individual paths. Here the sums for a and b (the latter equal to 0.091) both have the same sign as σ_xy, so both are mediating variables. Similarly, one can think of the importance of a variable to the covariance in question as the sum of the path weights for paths through that variable.

3.2 An Illustrative Example from Gene Expression Analysis

A larger example arises from exploration of aspects of large-scale graphical models developed for analyses of gene expression data in brain cancer genomics. This utilises a (sparse) acyclic directed graph, and the associated precision matrix, fitted to the gene expression data from Rich et al. (2004). These data consist of 8408 variables, each representing the expression level of a particular gene in tumor tissue. The variables are observed in 41 glioblastoma patients, and have been transformed so that they have roughly a multivariate Gaussian distribution. The graph was fit using the tools provided in Dobra (2004). A subgraph of the system is shown in Figure 2. The directed graph can be seen by considering the edges with direction indicated. The edges without direction were added during the conversion to an undirected graph, which requires linking the parents of each node (called moralisation).
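A minimal sketch of the moralisation step, assuming the directed acyclic graph is stored as a mapping from each node to its parents (a generic illustration, not the software used for the analysis in the paper):

```python
# Moral graph of a DAG: keep every edge with direction dropped, and additionally
# link ("marry") the parents of each node.
from itertools import combinations

def moralise(parents):
    """parents: dict mapping each node to the list of its parents in the DAG.
    Returns the set of undirected edges (as frozensets) of the moral graph."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))      # original edge, now undirected
        for u, v in combinations(pa, 2):
            edges.add(frozenset((u, v)))          # moralised edge between co-parents
    return edges

# toy DAG a -> c <- b, c -> d: moralisation adds the undirected edge a - b
print(moralise({"c": ["a", "b"], "d": ["c"]}))
```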

It is appealing to work with the undirected graph, because the inferred covariance matrix is consistent with many acyclic directed graphs in addition to the one depicted, and there is no a priori preference among these. Suppose our interest is in whether the expression of gene KIAA0913, which encodes a protein of unknown function, is worth investigating as a potential intermediary between the genes ZBP1 and TGM1. The expression levels for ZBP1 and TGM1 have covariance 0.68. We will consider all paths between ZBP1 and TGM1.

Figure 2: Subgraph of an 8408 node graph representing gene expression in glioblastomas.

In working with this large example, two observations led to important efficiencies in accomplishing the computation. First, since Ω is very large compared to the number of variables in the paths, computation of the ratio between $\det(\Omega_{\setminus P})$ and $\det(\Omega)$ is much more stable if it is performed using the identity
$$
\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A)\,\det(D - CA^{-1}B), \qquad (2)
$$
with $A = \Omega_{\setminus P}$ and $D = \Omega_P$. Note that $A^{-1}$ need not be computed; $B$ has relatively few columns, so solving for each column of $A^{-1}B$ is easier than computation of the full inverse, especially as both matrices are sparse in this example. Computation of the path weights in Table 3 was accomplished in Matlab in less than one minute.
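A sketch of this computation in Python using scipy's sparse LU factorisation (the paper's own implementation was in Matlab; the function name and the toy matrix below are assumed for illustration):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def det_ratio(Omega, path):
    """Ratio det(Omega with the path's rows and columns removed) / det(Omega),
    computed via identity (2) with A the block for the non-path variables and
    D the block for the path variables; Omega is a scipy sparse matrix."""
    n = Omega.shape[0]
    path = list(path)
    rest = [i for i in range(n) if i not in set(path)]
    A = Omega[rest][:, rest].tocsc()      # large but sparse block (non-path variables)
    B = Omega[rest][:, path].toarray()    # only a few columns
    C = Omega[path][:, rest]
    D = Omega[path][:, path].toarray()
    X = spla.splu(A).solve(B)             # columns of A^{-1} B via sparse LU solves
    schur = D - C @ X                     # D - C A^{-1} B, a small dense matrix
    return 1.0 / np.linalg.det(schur)     # since det(Omega) = det(A) det(D - C A^{-1} B)

# toy check against the direct dense computation
dense = 3 * np.eye(6) + 0.5 * (np.eye(6, k=1) + np.eye(6, k=-1))
Om, p, rest = sp.csr_matrix(dense), [0, 2, 5], [1, 3, 4]
print(det_ratio(Om, p))
print(np.linalg.det(dense[np.ix_(rest, rest)]) / np.linalg.det(dense))   # same value
```

Only a few sparse triangular solves and one small dense determinant are needed, regardless of how large Ω is.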

Second, the structure of the underlying acyclic directed graph can be used to eliminate the need to consider some paths. While the moralised edges added between parents of the same node correspond to non-zero elements in the inverse covariance matrix, they do not represent marginal correlation. When every path involving a moralised edge can replace that edge with the structure through the common offspring implied by the moralisation, the set of paths through the endpoints of the moralised edge has total weight zero. For example, in Figure 2, the paths ZBP1-PAIP1 3-TGM1 and ZBP1-PAIP1 3-LOC -TGM1 have path weights of exactly the same magnitude and opposite sign. These sets of canceling paths can be removed from consideration before computations begin, resulting in major savings. The ZBP1-PAIP1 3-TGM1 path is included in Figure 2 just as an example: if all paths that occurred in canceling sets were included, there would be more than 60 additional paths (and the associated nodes) in the drawing.

Table 3: Paths contributing to the covariance of ZBP1 and TGM1. All paths start at ZBP1 and end at TGM1, so these endpoints have been omitted.

    Path via                                 S        V        Weight
    F2
    F2-KCNJ12
    MGC31957-KIAA0913-DIPA 1-KCNJ12
    MGC31957-KIAA0913-DIPA 1-KCNJ12-F2
    Total                                                      0.68

Only four paths (all shown) actually contribute to the covariance. Their path weights are given in Table 3; the path endpoints, ZBP1 and TGM1, are common to all and have been omitted. Note that not all paths involving moralised edges can be disregarded; for example, ZBP1-F2-KCNJ12-TGM1 is an important moderating path. We also see that the paths through KIAA0913 do make an important contribution, accounting for about 40% of the covariance between ZBP1 and TGM1. Finally, Table 3 also gives the factorization of the path weights into S and V. Much of the variance of the path variables is explained by the over 8,000 additional variables in the system, so V has a major impact on the path weights.

Acknowledgments

Partial support for the research described here was provided by the Keck Foundation through the Keck Center for Neurooncogenomics at Duke University, and the National Science Foundation under grant DMS. We would also like to acknowledge Carlos Carvalho for helpful conversations, and Adrian Dobra and Quanli Wang for software for graphical model fitting and visualisation.

References

Dempster, A. (1972). Covariance selection. Biometrics 28.

Dobra, A. (2004). Hdbcs: Bayesian covariance selection in high dimensions. adobra/hdbcs.html.

Dobra, A., Hans, C., Jones, B., Nevins, J. & West, M. (2004). Sparse graphical models for exploring gene expression data. J. Mult. Anal.

Duncan, O. (1966). Path analysis: Sociological examples. Am. J. Sociol. 72.

Giudici, P. (1996). Learning in graphical Gaussian models. In Bayesian Statistics 5, J. M. Bernardo, J. O. Berger, A. P. Dawid & A. F. M. Smith, eds. Oxford University Press.

Grace, J. & Pugesek, B. (1998). On the use of path analysis and related procedures for the investigation of ecological problems. Am. Nat. 152.

Johnson, C., Olesky, D., Tsatsomeros, M. & van den Driessche, P. (1994). An identity for the determinant. Linear Multilinear A. 36.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. & West, M. (2004). Experiments in stochastic computation for high-dimensional graphical models. ISDS Discussion Paper, Duke University (submitted for publication).

Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford.

Rich, J., Hans, C., Jones, B., Dobra, A., Dressman, H., Rashed, A., McClendon, R., Nevins, J., Bigner, D. & West, M. (2004). Predictive gene expression profiling and graphical association studies in glioblastoma survival. Submitted for publication.

Wang, Q. & Dobra, A. (2004). GraphExplore: A software tool for graph visualization.

Wright, S. (1921). Correlation and causation. J. Agr. Res. 20.

Wright, S. (1934). The method of path coefficients. Ann. Math. Stat. 5.
