Bayesian Analysis of Matrix Normal Graphical Models


By HAO WANG and MIKE WEST
Department of Statistical Science, Duke University, Durham, North Carolina, U.S.A.

Summary

We develop Bayesian analysis of matrix-variate normal data with conditional independence graphical structuring of the characterising variance matrix parameters. This leads to fully Bayesian analysis of matrix normal graphical models, including discussion of novel prior specifications, the resulting problems of posterior computation addressed using Markov chain Monte Carlo methods, and graphical model assessment involving approximate evaluation of marginal likelihood functions under specified graphical models. Modelling and inference for spatial/image data via a novel class of Markov random fields, arising as natural examples of matrix normal graphical models, is discussed. This is complemented by the development of a broad class of dynamic models for matrix-variate time series within which the stochastic elements defining time series errors and structural changes over time are subject to graphical model structuring. Three examples illustrate these developments and highlight questions of graphical model uncertainty and comparison in matrix data contexts.

Some key words: Gaussian graphical models; Graphical model comparison; Hyper-inverse Wishart distributions; Markov random fields; Marginal likelihood evaluation; Matrix normal models; Matrix-variate dynamic graphical models; Posterior simulation.

1 Introduction

Matrix-variate normal distributions (Dawid, 1981; Gupta & Nagar, 2000) have been studied in analyses of two-factor linear models for cross-classified multivariate data (Finn, 1974; Galecki, 1994; Naik & Rao, 2001), in spatio-temporal models (Mardia & Goodall, 1993; Huizenga et al., 2002) and in other areas. Some computational and inferential developments, including iterative calculation of maximum likelihood estimates, have appeared (Dutilleul, 1999; Mitchell et al., 2005, 2006), and some empirical Bayesian methodology has recently been introduced for Procrustes analysis with matrix models (Theobald & Wuttke, 2006). The current paper develops a complete Bayesian analysis of matrix normal graphical models, i.e., matrix normal distributions in which each of the characterising variance matrices is constrained by a set of conditional independence restrictions consistent with an underlying graphical model (Whittaker, 1990; Lauritzen, 1996; Giudici, 1996; Giudici & Green, 1999; Jones et al., 2005). The framework includes fully Bayesian analysis of the matrix normal (full graphs) as a special case, and effective computational methods for marginal likelihood evaluation on a specified graphical model that underlies inference about conditional independence structures. The developments include novel Markov random field models, with potential utility in spatial and image analysis, that emerge naturally as a sub-class of matrix normal graphical models. The paper then extends the random sampling framework to matrix-variate time series models that inherit the graphical model structure to represent conditional independencies in matrix series over time.

Part of our motivation lies in the interest in scaling multivariate and matrix-variate models to deal with increasingly higher-dimensional problems, such as multiple economic indicators or assets measured across multiple funds, companies, sectors or countries. Applied contexts exist in many other areas in the natural and engineering as well as the economic and social sciences.

As variable dimension increases, the scientific and empirical rationale for increased sparsity of the structure of dependencies among variables becomes increasingly forceful. In parallel, increased parametric constraints reflecting that sparsity are vital if computations for model implementation are to scale satisfactorily. Our models combine sparsity-enabling graphical models in priors for each of the two variance matrices in a matrix normal model to address these issues. We focus on decomposable graphs in these initial studies to maintain the focus and scope of the paper, though the general ideas and approach extend to non-decomposable models.

Beginning with preliminaries and notation for matrix normal models in Section 2, we describe matrix normal graphical models in Section 3. This includes a novel extension of the hyper-Markov inverse Wishart (hereafter HIW) distribution that uses hyper-Markov properties (Dawid & Lauritzen, 1993) and parameter expansion (Liu et al., 1998; Liu & Wu, 1999) ideas to incorporate inherent parameter identification constraints. Section 4 describes Gibbs sampling for posterior computation, and a custom method of approximating model marginal likelihoods based on the posterior simulation. Section 5 provides a simple example of analysis of a simulated data set, illustrating aspects of the computation. Section 6 introduces a novel class of Markov random fields of potential utility in spatial, imaging and texture modelling; this is a specific example of a sub-class of matrix normal graphical models of interest in its own right, and it is also used here to illustrate marginal likelihood computation as a key component of model comparison. Section 7 shows how the matrix graphical structure can be naturally embedded in a broad class of matrix time series models, and develops a detailed analysis of a macro-economic data set for additional illustration of the effectiveness and utility of the new matrix-variate models.

2 Matrix Variate Normal Distribution and Notation

2.1 Matrix Variate Normal Distribution

The $q \times p$ random matrix $Y$ has a matrix-variate normal distribution with mean matrix $M$, left (or column) and right (or row) non-singular variance matrices $U = (u_{ij})$ of dimension $q \times q$ and $V = (v_{ij})$ of dimension $p \times p$ respectively, and defining notation $Y \sim N(M, U, V)$, when the density function is
$$p(Y) \equiv p(Y \mid U, V) = k(U, V)\,\mathrm{etr}\{-(Y - M)'U^{-1}(Y - M)V^{-1}/2\}$$
with $k(U, V) = (2\pi)^{-qp/2}|U|^{-p/2}|V|^{-q/2}$. Denote row $i = 1{:}q$ of $Y$ by $y_i'$ and column $j = 1{:}p$ by $y^j$, so that $Y = (y_{ij}) = (y^1, \ldots, y^p)$. Use the same notation for elements, rows and columns of $M$. Then $y_i \sim N(m_i, u_{ii}V)$ for each row $i = 1{:}q$, and $y^j \sim N(m^j, v_{jj}U)$ for each column $j = 1{:}p$.

2.2 Precision Matrices and Conditional Dependencies

Write $\Omega = U^{-1} = (\omega_{ij})$ and $\Lambda = V^{-1} = (\lambda_{ij})$ for the left and right precision matrices, respectively. For each row $i = 1{:}q$ we have a complete conditional $p$-variate normal distribution with
$$E(y_i \mid y_{-i}) = m_i - \omega_{ii}^{-1}\sum_{s \in (1:q)\setminus i} \omega_{is}(y_s - m_s), \qquad V(y_i \mid y_{-i}) = \omega_{ii}^{-1} V.$$
Similarly, for each column $j = 1{:}p$ we have a complete conditional $q$-variate normal distribution with
$$E(y^j \mid y^{-j}) = m^j - \lambda_{jj}^{-1}\sum_{t \in (1:p)\setminus j} \lambda_{tj}(y^t - m^t), \qquad V(y^j \mid y^{-j}) = \lambda_{jj}^{-1} U.$$
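The defining equivalence $\mathrm{vec}(Y) \sim N(\mathrm{vec}(M), V \otimes U)$ behind these row and column marginals is easy to exercise numerically. The following sketch is our illustration (not from the paper): it draws from $N(M, U, V)$ via Cholesky factors and checks the stated row and column marginals by simulation; all names and numerical settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
q, p, n = 3, 4, 100_000

# Arbitrary positive-definite left/right variance matrices U (q x q), V (p x p).
A = rng.standard_normal((q, q)); U = A @ A.T + q * np.eye(q)
B = rng.standard_normal((p, p)); V = B @ B.T + p * np.eye(p)
M = rng.standard_normal((q, p))

# Y ~ N(M, U, V) is equivalent to vec(Y) ~ N(vec(M), V kron U),
# so a draw is M + Lu Z Lv' with U = Lu Lu' and V = Lv Lv'.
Lu, Lv = np.linalg.cholesky(U), np.linalg.cholesky(V)
Z = rng.standard_normal((n, q, p))
Y = M + Lu @ Z @ Lv.T                  # n draws, each q x p

# Row i has covariance u_ii V; column j has covariance v_jj U (here i = j = 0).
row0, col0 = Y[:, 0, :], Y[:, :, 0]
print(np.allclose(np.cov(row0.T), U[0, 0] * V, atol=0.05 * np.abs(U[0, 0] * V).max()))
print(np.allclose(np.cov(col0.T), V[0, 0] * U, atol=0.05 * np.abs(V[0, 0] * U).max()))
```

Both checks print True up to Monte Carlo error, confirming the marginal forms quoted above.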

For scalar elements $y_{ij}$, the complete conditional univariate normals have
$$E(y_{ij} \mid y_{-(ij)}) = m_{ij} - \omega_{ii}^{-1}\sum_{s \in (1:q)\setminus i} \omega_{si}(y_{sj} - m_{sj}) - \lambda_{jj}^{-1}\sum_{t \in (1:p)\setminus j} \lambda_{tj}(y_{it} - m_{it}) - (\omega_{ii}\lambda_{jj})^{-1}\sum_{s \in (1:q)\setminus i}\,\sum_{t \in (1:p)\setminus j} \omega_{is}\lambda_{tj}(y_{st} - m_{st}),$$
$$V(y_{ij} \mid y_{-(ij)}) = (\omega_{ii}\lambda_{jj})^{-1}.$$
These equations show how zeros in the off-diagonal elements of the two precision matrices define conditional independencies. Zeros in the right precision matrix $\Lambda$ define, and are defined by, conditional independencies among columns: for $t \neq j$, $\lambda_{tj} = 0$ if and only if the complete conditional distribution of $y^j$ does not depend on $y^t$, that is, $y^j \perp y^t$ conditional on all $y^k$, $k \in (1:p)\setminus\{j, t\}$. A similar relationship exists between the left precision matrix $\Omega$ and the conditional independencies among rows. In terms of univariate elements, for $s \neq i$ and $t \neq j$, the elements $y_{ij}$ and $y_{st}$ may, conditional upon $y_{-(ij,st)}$, be dependent through either rows or columns; conditional independence is equivalent to: (a) at least one zero among $\lambda_{tj}$ and $\omega_{is}$ when $s \neq i$ and $j \neq t$; (b) $\omega_{is} = 0$ when $s \neq i$ and $j = t$; (c) $\lambda_{jt} = 0$ when $s = i$ and $j \neq t$. This structure is a key aspect of matrix normal models and, in particular, underlies our extension of Gaussian graphical models. Without loss of generality in the rest of this section, we assume the data are zero mean to develop the ideas.
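As a check on the scalar complete conditionals above, the following sketch (our illustration, with arbitrary test matrices) compares the three-sum formula against the standard Gaussian conditional computed from the joint precision $\Lambda \otimes \Omega$ of $\mathrm{vec}(Y)$ with columns stacked.

```python
import numpy as np

rng = np.random.default_rng(1)
q, p = 4, 3

# Positive-definite precision matrices Omega = U^{-1} and Lambda = V^{-1}.
A = rng.standard_normal((q, q)); Omega = A @ A.T + q * np.eye(q)
B = rng.standard_normal((p, p)); Lam = B @ B.T + p * np.eye(p)

# Joint precision of vec(Y), columns stacked: K = Lambda kron Omega.
K = np.kron(Lam, Omega)

# Gaussian complete conditional of element (i, j), M = 0: position k = j*q + i,
# variance 1/K_kk and mean -K_kk^{-1} K_{k,rest} y_rest.
i, j = 1, 2
k = j * q + i
y = rng.standard_normal((q, p))
rest = np.delete(np.arange(q * p), k)
mean_joint = -np.delete(K[k], k) @ y.T.ravel()[rest] / K[k, k]

# The same mean from the three-sum formula in the text.
s_, t_ = [s for s in range(q) if s != i], [t for t in range(p) if t != j]
mean_formula = (
    -sum(Omega[s, i] * y[s, j] for s in s_) / Omega[i, i]
    - sum(Lam[t, j] * y[i, t] for t in t_) / Lam[j, j]
    - sum(Omega[i, s] * Lam[t, j] * y[s, t] for s in s_ for t in t_)
      / (Omega[i, i] * Lam[j, j])
)
print(np.isclose(mean_joint, mean_formula))                      # True
print(np.isclose(1 / K[k, k], 1 / (Omega[i, i] * Lam[j, j])))    # True
```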

2.3 Graphical Model Structuring and Density Factorisations

Two graphical structures can be overlaid on a matrix normal model, one for each of the two precision matrices. Suppose two such undirected graphs are specified. Thus $\Lambda = V^{-1}$ has off-diagonal zeros, corresponding to conditional independencies of elements of $Y$ within any one row, that define a graph $G_V = (N_V, E_V)$. Here the node (or vertex) set $N_V = \{j = 1{:}p\}$ is just the column (within row) indicators, and the edge set $E_V$ contains only those pairs of column indices $(j, t)$ for which $\lambda_{jt} \neq 0$. Similarly, $G_U = (N_U, E_U)$ has node set $N_U = \{i = 1{:}q\}$ of row (within column) indicators, and the edge set $E_U$ contains only those pairs of row indices $(i, s)$ for which $\omega_{is} \neq 0$.

The theory, and aspects of methodology, of multivariate normal graphical models can now be overlaid. We focus here on decomposable models, so that each of the two graphs is assumed decomposable. Thus the row multivariate normal distribution is Markov with respect to the graph $G_U$, and the column multivariate normal is Markov with respect to the graph $G_V$. The Markovian factorisation translates directly to conditional factorisations of the matrix normal distribution. For example, over $G_V$ the density of $Y$ factorises as
$$p(Y \mid U, V, G_V) = \frac{\prod_{P_V \in \mathcal{P}_V} p(Y_{P_V} \mid U, V_{P_V})}{\prod_{S_V \in \mathcal{S}_V} p(Y_{S_V} \mid U, V_{S_V})}, \qquad (1)$$
where $\mathcal{P}_V$ is the set of complete prime components, or cliques, of the graph, and $\mathcal{S}_V$ is the set of separators. For each subgraph $g \in \{\mathcal{P}_V, \mathcal{S}_V\}$, $Y_g$ is the $q \times |g|$ matrix with variables from the $|g|$ columns of $Y$ defined by the subgraph, and $V_g$ is the corresponding sub-matrix of $V$. Each density term in the product of ratios of equation (1) is that of a matrix-variate normal
$$p(Y_g \mid U, V_g) = k(U, V_g)\,\mathrm{etr}\{-Y_g' U^{-1} Y_g V_g^{-1}/2\} \qquad (2)$$
with a full precision matrix; that is, $\Lambda_g = V_g^{-1}$ has no zero entries and no parametric constraints apart from positive-definiteness and symmetry. We can similarly represent the joint density in factorised form over the graph $G_U$.
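The factorisation (1) can be verified numerically on a small decomposable example. The sketch below is our illustration (the chain graph and numerical values are hypothetical): it evaluates the matrix normal log density directly and via the clique/separator terms of equation (1), which agree exactly.

```python
import numpy as np

def mn_logpdf(Y, U, V):
    """Log density of the zero-mean matrix normal N(0, U, V)."""
    q, p = Y.shape
    _, ldU = np.linalg.slogdet(U)
    _, ldV = np.linalg.slogdet(V)
    quad = np.trace(Y.T @ np.linalg.inv(U) @ Y @ np.linalg.inv(V))
    return -0.5 * (q * p * np.log(2 * np.pi) + p * ldU + q * ldV + quad)

rng = np.random.default_rng(2)
q, p = 5, 3
A = rng.standard_normal((q, q)); U = A @ A.T + q * np.eye(q)

# Right precision for the chain graph 1-2-3: lambda_13 = 0, so G_V has
# cliques {1,2}, {2,3} and separator {2}.
Lam = np.array([[2.0, 0.5, 0.0],
                [0.5, 2.0, 0.5],
                [0.0, 0.5, 2.0]])
V = np.linalg.inv(Lam)
Y = rng.standard_normal((q, p))    # any q x p matrix will do for the identity

cliques, seps = [[0, 1], [1, 2]], [[1]]
lhs = mn_logpdf(Y, U, V)
rhs = (sum(mn_logpdf(Y[:, g], U, V[np.ix_(g, g)]) for g in cliques)
       - sum(mn_logpdf(Y[:, g], U, V[np.ix_(g, g)]) for g in seps))
print(np.isclose(lhs, rhs))        # True: the factorisation (1) holds exactly
```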

2.4 Identification

Although the Kronecker product $V \otimes U$ is uniquely identified, $U$ and $V$ individually are not since, for any $c > 0$, $p(Y \mid U, V) = p(Y \mid cU, V/c)$. There are a number of potential approaches to imposing identification constraints, including constraints such as $\mathrm{tr}(V) = p$ (Theobald & Wuttke, 2006). However, the development of graphical model structuring complicates the matter, and leads us to develop model identification within the context of hyper-Markov priors over each of $U$ and $V$ on decomposable graphs. It then becomes natural to adopt a very simple mathematical constraint for identification, namely fixing the value of one diagonal entry in one of the variance matrices. We develop the analysis with $v_{11} = 1$.

3 Matrix Gaussian Graphical Model Form and Priors

For multivariate Gaussian graphical models, the general class of priors based on the hyper-Markov laws introduced in Dawid & Lauritzen (1993) has the desirable property that the laws are compatible and consistent across different graphs, as discussed with full details of notation in Jones et al. (2005) and Giudici & Green (1999). The resulting priors for variance matrices are hyper-inverse Wishart distributions, denoted by HIW. On decomposable graphs, the implied priors on sub-variance matrices on all components and separators are inverse Wishart.

To extend this class of priors to $U, V$ in the matrix normal graphical model, a first step is simply to adopt independent HIW priors for each of $U$ and $V$ separately. This maintains compatibility and consistency across graphs $G_U, G_V$. Then, to incorporate the identification constraint $v_{11} = 1$, we use a parameter expansion (PX) approach. The general PX idea involves expanding the parameter space by adding new nuisance parameters, and has been used simply algorithmically to accelerate Markov chain Monte Carlo samplers (Liu et al., 1998; Liu & Wu, 1999). However, as noted by Gelman (2004, 2006), PX can also usefully induce new families of prior distributions in which the added parameters are assigned informative priors.

Generalising the HIW priors for multivariate Gaussian graphical models, we thus define the matrix-variate graphical model as follows, assuming an initial random sampling context for $q \times p$ data matrices $Y_i$, $i = 1{:}n$.

We assume a matrix normal distribution and graphical models $G_U, G_V$ as above. Then
$$Y_i \mid U, V \sim \text{i.i.d. } N(0, U, V), \quad i = 1{:}n, \qquad (3)$$
$$U \sim \mathrm{HIW}_{G_U}(b, B), \qquad (4)$$
$$V^* \sim \mathrm{HIW}_{G_V}(d, D), \qquad (5)$$
$$V = V^*/v^*_{11}, \qquad (6)$$
with $U, V^*$ independent. Here we use the traditional HIW notation in which the degrees of freedom parameters $b, d$ are positive, while $B, D$ are variance matrices with the dimensions of $U, V$. The PX approach has introduced $V^*$ as an unconstrained right variance matrix, while $V$ satisfies $v_{11} = 1$ and $v^*_{11}$ represents the added parameter. The prior $p(U, V)$ is conditionally conjugate, in that the implied complete conditional posteriors for each of $U, V$ are HIW. This leads to straightforward Gibbs sampling as described in Section 4.1 below, based on the simple coupling of two independent HIW priors in an expanded format that integrates the identification constraint.

We can also interpret $v^*_{11}$: for each column $Y^j$ of $Y$, $V(Y^j) = v_{jj}U = (v^*_{jj}/v^*_{11})U$, so $v^*_{11}$ converts column scales to those relative to the scale of the first column. Note that the model maintains consistency and compatibility of the priors across graphs, since these are inherited directly from the coupled HIW priors in the expanded parameter space of $\{V^*, U\}$. The induced prior for $V$ maintains consistency over cliques, and as we move across graphs $G_V$ the priors $p(V \mid G_V)$ are compatible in the sense of having the same induced priors over correlation structures. That is, for two graphs $G_V$ and $G_V'$ with a common clique $C$, let $R_C$ be the correlation matrix of $V_C$; then the implied priors are equal, $p(R_C \mid G_V) = p(R_C \mid G_V')$. However, the induced laws are no longer in complete agreement for $V = V^*/v^*_{11}$, due to the different parameterisations and interpretation, and this is natural and appropriate. Each element of $\mathrm{diag}(V_C)$ now represents the scale of variance of that column relative to the variance of the first column. Therefore, if $G_V$ and $G_V'$ imply different conditional dependencies between the first column and the columns corresponding to $C$, then the induced priors over $V_C$ should indeed be different, to reflect such differences.

The prior density for $V$ on $G_V$ is obtained directly by transformation from that for $V^*$. Note that, on any graph $G_V$, $V$ is determined only by its free elements, i.e., those elements appearing in the sub-matrices corresponding to the cliques of the graph; the non-free elements of $V$ are deterministic functions of the free elements. Let $\nu$ be the number of free elements of $V$ on $G_V$. The transformation from $V^*$ to $(v^*_{11}, V)$, where $V = V^*/v^*_{11}$, then has Jacobian $J(V^* \to (v^*_{11}, V)) = (v^*_{11})^{\nu - 1}$, and so we obtain the joint density of the induced prior for $V$ and the PX parameter $v^*_{11}$ as
$$p(V, v^*_{11}) = \mathrm{HIW}_{G_V}(v^*_{11} V \mid d, D)\,(v^*_{11})^{\nu - 1}.$$
This, coupled with the HIW prior $p(U)$ on $G_U$, defines a class of conditionally conjugate priors, and the posterior analysis can be developed.

4 Posterior and Marginal Likelihood Computation

4.1 Gibbs Sampling

For specified graphs $G_U, G_V$, the full model of equations (3)-(6) with the parameter expansion yields the following set of complete conditional distributions:
$$(U \mid Y_{1:n}, V, v^*_{11}) \sim \mathrm{HIW}_{G_U}\Big(b + np,\; B + \sum_{i=1}^n Y_i V^{-1} Y_i'\Big),$$
$$(v^*_{11} \mid Y_{1:n}, V, U) \sim \mathrm{IG}\big(a/2 - \nu,\; \mathrm{tr}(DV^{-1})/2\big), \quad\text{where}\quad a = \sum_{P_V \in \mathcal{P}_V} |P_V|(2|P_V| + d) - \sum_{S_V \in \mathcal{S}_V} |S_V|(2|S_V| + d),$$
$$(V \mid Y_{1:n}, U, v^*_{11}) \sim \mathrm{HIW}_{G_V}\Big(d + nq,\; D/v^*_{11} + \sum_{i=1}^n Y_i' U^{-1} Y_i\Big)\,I(v_{11} = 1).$$
The last component is the HIW distribution conditioned on the $(1,1)$ element of the variance matrix set at unity. These form the basis of an efficient Gibbs sampler to generate from the full posterior $p(U, V, v^*_{11} \mid Y_{1:n})$.
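For concreteness, here is a minimal sketch of the inverse gamma draw for $v^*_{11}$, computing $a$ and $\nu$ from clique and separator lists. This is our Python illustration of the formulas above (the paper's own code is in Matlab), run on a hypothetical chain graph; the function name and settings are ours.

```python
import numpy as np
from scipy.stats import invgamma

def draw_v11(cliques, seps, d, D, V, rng):
    """Draw v*_11 from its complete conditional IG(a/2 - nu, tr(D V^{-1})/2)."""
    # nu = number of free elements of V on G_V (union of clique sub-matrices).
    nu = (sum(len(P) * (len(P) + 1) // 2 for P in cliques)
          - sum(len(S) * (len(S) + 1) // 2 for S in seps))
    a = (sum(len(P) * (2 * len(P) + d) for P in cliques)
         - sum(len(S) * (2 * len(S) + d) for S in seps))
    rate = np.trace(D @ np.linalg.inv(V)) / 2
    return invgamma.rvs(a / 2 - nu, scale=rate, random_state=rng)

# Chain graph on p = 3 columns, as in the Section 2.3 sketch.
rng = np.random.default_rng(3)
d, D = 3, np.eye(3)
V = np.linalg.inv(np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.5], [0.0, 0.5, 2.0]]))
V = V / V[0, 0]                       # impose the v_11 = 1 constraint
print(draw_v11([[0, 1], [1, 2]], [[1]], d, D, V, rng))
```

For this graph $\nu = 5$ and $a = 23$, so the inverse gamma shape $a/2 - \nu = 6.5$ is positive, as required.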

The Gibbs iterates involve sampling from the HIW, inverse gamma and new conditional HIW distributions as the defining parameters change. Simulation of the HIW is based on Carvalho et al. (2007). Sampling conditional HIW distributions of the form required involves a simple modification: following Lemma 2.18 of Lauritzen (1996), we can always find a perfect ordering of the nodes in $G_V$ such that node 1 lies in the first clique, say $C$, and then initialise the HIW sampler of Carvalho et al. (2007) with a simulation of the implied conditional inverse Wishart distribution for the variance matrix on that first clique. Sampling $V_C$ from an inverse Wishart distribution conditional on the first diagonal element set to unity is straightforward.

4.2 Marginal Likelihood

Exploration of uncertainty about graphical model structures involves, in part, consideration of the marginal likelihood function over graphs $G_U, G_V$. For any given pair of graphs, the value of the marginal likelihood function is
$$p(Y_{1:n}) \equiv p(Y_{1:n} \mid G_U, G_V) = \int\!\!\int p(Y_{1:n} \mid U, V)\,p(U)\,p(V)\,dV\,dU, \qquad (7)$$
where the priors in the integrand depend on the graphs, although we drop that in the notation for clarity. Graphical model search and comparison involves computing posterior probabilities over graphs, and the value of $p(Y_{1:n})$ as graphs are varied provides the corresponding likelihood function. In traditional multivariate models this can be evaluated in closed form on decomposable graphs, and is a computational cornerstone of applied use of such models (Giudici, 1996; Giudici & Green, 1999; Jones et al., 2005; Carvalho & West, 2007a,b). Here, however, the integral cannot be evaluated in closed form, nor easily numerically. We can, however, generate useful approximations via a novel use of the so-called Candidate's formula (Besag, 1989; Chib, 1995).

Having obtained MCMC posterior samples, we note that Candidate's formula

can be applied in a number of ways to generate approximations to (7). We capitalise on this observation to generate two alternative approximations, so that we can compare them to gain insight into approximation quality.

(A) Observe that, for any values $(U, V, v^*_{11})$,
$$p(Y) = \frac{p(Y, U, V, v^*_{11})}{p(U, V, v^*_{11} \mid Y)},$$
directly from Bayes' theorem. The idea is to estimate the components of this equation that have no closed form, then plug in chosen values of $(U, V, v^*_{11})$, such as the MCMC-based approximate posterior means used in our examples, to provide an estimate of $p(Y)$. The numerator
$$p(Y, U, V, v^*_{11}) = p(Y \mid U, V, v^*_{11})\,p(U)\,p(V \mid v^*_{11})\,p(v^*_{11})$$
can be directly and easily computed. The denominator is
$$p(U, V, v^*_{11} \mid Y) = p(V \mid v^*_{11}, U, Y)\,p(v^*_{11}, U \mid Y),$$
where the first term has an easily evaluated closed form, as in the Gibbs sampling steps. The second term may be approximated by
$$p(v^*_{11}, U \mid Y) = \int p(v^*_{11} \mid Y, V)\,p(U \mid Y, V, v^*_{11})\,p(V \mid Y)\,dV \approx \frac{1}{M}\sum_{j=1}^M p(v^*_{11} \mid Y, V^{(j)})\,p(U \mid Y, V^{(j)}, v^*_{11}),$$
where the sum is over MCMC posterior draws $V^{(j)}$; this is easy to compute as it is a sum of products of inverse gamma and hyper-inverse Wishart density evaluations.

(B) In parallel, an alternative representation, again directly from Bayes' theorem, is
$$p(Y) = \frac{p(Y, V)}{p(V \mid Y)}.$$

The numerator can be analytically evaluated as
$$p(V, Y) = \int\!\!\int p(Y, U, V, v^*_{11})\,dU\,dv^*_{11} = q_V\,(2\pi)^{-nqp/2}\,\frac{H\big(b + np,\; B + \sum_{i=1}^n Y_i V^{-1} Y_i',\; G_U\big)\; H\big(a,\; \mathrm{tr}(DV^{-1}),\; G_V\big)}{H(b, B, G_U)\; H(d, D, G_V)},$$
where $q_V$ is the constant
$$q_V = \prod_{P_V \in \mathcal{P}_V} |V_{P_V}|^{-(nq + d + 2|P_V|)/2} \Big/ \prod_{S_V \in \mathcal{S}_V} |V_{S_V}|^{-(nq + d + 2|S_V|)/2},$$
the $H(\cdot, \cdot, G_\cdot)$ terms are normalising constants of the corresponding HIW and inverse gamma distributions, and
$$a = \sum_{P_V \in \mathcal{P}_V} |P_V|(2|P_V| + d) - \sum_{S_V \in \mathcal{S}_V} |S_V|(2|S_V| + d) - 2\nu.$$
The density function in the denominator is approximated as
$$p(V \mid Y) = \int\!\!\int p(V \mid v^*_{11}, U, Y)\,p(v^*_{11}, U \mid Y)\,dv^*_{11}\,dU \approx \frac{1}{M}\sum_{j=1}^M p\big(V \mid Y, U^{(j)}, v^{*(j)}_{11}\big),$$
where the sum over posterior MCMC draws $(U^{(j)}, v^{*(j)}_{11})$ can be easily performed, with terms given by conditional hyper-inverse Wishart density evaluations.

By comparing the two resulting marginal likelihood estimates we may assess how well the approximation performs, as illustrated in Section 5.
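For reference, here is a short sketch of how the HIW normalising constants $H(\cdot, \cdot, G)$ can be evaluated on a decomposable graph as a product of clique inverse Wishart constants over separator constants. This is our illustration, assuming the Dawid-style parameterisation in which the $m$-dimensional inverse Wishart with parameters $(\delta, S)$ has density proportional to $|\Sigma|^{-(\delta + 2m)/2}\,\mathrm{etr}(-S\Sigma^{-1}/2)$; the function names are ours.

```python
import numpy as np
from scipy.special import multigammaln

def log_H_iw(delta, S):
    """log normalising constant of the inverse Wishart with density
    proportional to |Sigma|^{-(delta + 2m)/2} etr(-S Sigma^{-1}/2)."""
    m = S.shape[0]
    nu = delta + m - 1                 # map to the usual IW degrees of freedom
    _, ld = np.linalg.slogdet(S)
    return nu * m / 2 * np.log(2) + multigammaln(nu / 2, m) - nu / 2 * ld

def log_H_hiw(delta, S, cliques, seps):
    """log H(delta, S, G) on a decomposable graph: product of clique IW
    constants divided by the separator IW constants."""
    return (sum(log_H_iw(delta, S[np.ix_(P, P)]) for P in cliques)
            - sum(log_H_iw(delta, S[np.ix_(Sp, Sp)]) for Sp in seps))

# Example: the chain graph on three nodes with S = I_3.
print(log_H_hiw(3, np.eye(3), [[0, 1], [1, 2]], [[1]]))
```

The remaining $H(a, \mathrm{tr}(DV^{-1}), G_V)$ term above is the inverse gamma constant $\Gamma(a/2)\{\mathrm{tr}(DV^{-1})/2\}^{-a/2}$ arising from integrating out $v^*_{11}$.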

5 Example: A Simulated Random Sample

The first example fixes ideas and provides an illustration of the performance of the MCMC and marginal likelihood approximation. Here $n = 48$ observations were drawn from the $(q = 8) \times (p = 7)$ dimensional $N(0, U, V)$ distribution with graphical structure in rows and columns; the two precisions $\Lambda = V^{-1}$ and $\Omega = U^{-1}$ were specified as sparse matrices whose zero off-diagonal elements display the underlying graphical structure. The analysis was developed using priors with $b = d = 3$, $B = 5I_8$ and $D = 5I_7$, and a simulation sample size of 8,000 after an initial, discarded burn-in of 2,000 iterations. Figure 1 presents some trace plots of the Monte Carlo samples. Figure 2 compares images of the true underlying precision matrices with Monte Carlo posterior estimates, the sample means of the simulated precision matrices. A parallel check assessing the dual approximation of the marginal likelihood is presented in Figure 3. Beyond its use as an implementation check (the code is in Matlab and available to interested readers from the authors' web site), this illustrates the concordance of the two parallel marginal likelihood estimates, which are close together even for rather small MCMC sample sizes and which differ negligibly on the log probability scale.
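The logic of the Candidate's formula estimates of Section 4.2 is easy to exercise in a toy conjugate model where the marginal likelihood is available exactly. The sketch below is our Python analogue (not the authors' Matlab code): a method-(A)-style estimate for a normal model with unknown mean and variance, a two-block Gibbs sampler, and a comparison against the closed-form normal-inverse-gamma answer.

```python
import numpy as np
from scipy.stats import norm, invgamma
from scipy.special import gammaln

rng = np.random.default_rng(4)
y = rng.normal(1.0, 2.0, size=40)
n, ybar, ss = len(y), y.mean(), ((y - y.mean()) ** 2).sum()
m0, k0, a0, b0 = 0.0, 1.0, 2.0, 2.0          # conjugate N-IG prior

# Exact log marginal likelihood under the conjugate prior.
kn, an = k0 + n, a0 + n / 2
bn = b0 + 0.5 * (ss + k0 * n * (ybar - m0) ** 2 / kn)
exact = (-n / 2 * np.log(2 * np.pi) + 0.5 * np.log(k0 / kn)
         + gammaln(an) - gammaln(a0) + a0 * np.log(b0) - an * np.log(bn))

# Two-block Gibbs sampler for (mu, sig2).
M, sig2 = 5000, y.var()
mu_draws, sig2_draws = np.empty(M), np.empty(M)
mn = (k0 * m0 + n * ybar) / kn
for j in range(M):
    mu = rng.normal(mn, np.sqrt(sig2 / kn))
    rate = b0 + 0.5 * (((y - mu) ** 2).sum() + k0 * (mu - m0) ** 2)
    sig2 = invgamma.rvs(a0 + (n + 1) / 2, scale=rate, random_state=rng)
    mu_draws[j], sig2_draws[j] = mu, sig2

# Candidate's (Chib's) formula at the posterior means (mu*, sig2*):
# p(y) = p(y, mu*, sig2*) / {p(mu* | y) p(sig2* | y, mu*)}.
mu_s, s2_s = mu_draws.mean(), sig2_draws.mean()
log_joint = (norm.logpdf(y, mu_s, np.sqrt(s2_s)).sum()
             + norm.logpdf(mu_s, m0, np.sqrt(s2_s / k0))
             + invgamma.logpdf(s2_s, a0, scale=b0))
log_post_mu = np.log(np.mean(norm.pdf(mu_s, mn, np.sqrt(sig2_draws / kn))))
rate_s = b0 + 0.5 * (((y - mu_s) ** 2).sum() + k0 * (mu_s - m0) ** 2)
log_post_s2 = invgamma.logpdf(s2_s, a0 + (n + 1) / 2, scale=rate_s)
print(exact, log_joint - log_post_mu - log_post_s2)   # should nearly agree
```

Close agreement of the two printed values on the log scale is the toy analogue of the concordance displayed in Figure 3.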

6 Markov Random Fields from Matrix Graphical Models

A rather interesting class of matrix graphical structures arises under autoregressive (AR) correlation specifications for the two precision matrices. This generates a novel (to our knowledge) class of Markov random field models of potential interest in application areas such as texture image modelling. We use this construction here for a second, much higher-dimensional synthetic example.

Take $U$ and $V$ as covariance matrices of stationary AR processes. For the example here, we choose $q = p = 60$, taking $U$ as the variance matrix of an AR(5) model with AR parameters $(0.91, 0.44, 0.38, 0.31, 0.22)$ and marginal variance 0.55, and $V$ as the variance matrix of an AR(4) model with AR parameters $(0.47, 0.23, 0.14, 0.19)$ and a specified marginal variance. This model is used to repeatedly simulate 50 observations, and each draw from the model is a sampled Markov random field: the columns of each sample are correlated realisations from the underlying AR(5) model, and the rows are correlated realisations of the AR(4) model. Figure 4 images the two underlying precision matrices along with two representative samples.

To illustrate model fitting and evaluation, we use a prior specified with $d = b = 3$, $D = (d + 2)I_{60}$ and $B = 0.01(b + 2)I_{60}$. The MCMC analysis uses a burn-in of 1,000 iterations and then saves 2,000 samples, starting with initial value $V = I_{60}$. The MCMC was run repeatedly across a range of models differing in the orders of the underlying AR models for rows and columns, exploring all combinations of AR(1) to AR(9) structures for each of the precision matrices. Applying the marginal likelihood approximation to each model allows us to evaluate model orders. Table 1 shows the top five models as selected by the largest log-marginal likelihood. As can be seen, the true model orders lead to the largest marginal likelihood and, more importantly in terms of assessing the effectiveness of the methodology, the two parallel marginal likelihood assessments are in concordance and differ negligibly on the scale of interest.
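A sketch of this construction follows. It is our illustration and simplifies the paper's specification: it ignores the stationary edge corrections for the first few grid points and replaces the marginal-variance scaling with unit innovation variances, building the banded AR precisions directly and then simulating the resulting random fields.

```python
import numpy as np

def ar_precision(T, phi):
    """Banded precision of an AR(p) with coefficients phi on t = 1:T, from the
    innovations form x_t - sum_k phi_k x_{t-k}. Edge corrections are ignored
    and the innovation variance is set to one, so this is only an
    approximation to the stationary AR precision. L'L is positive definite
    for any phi, so the sampler below runs regardless."""
    L = np.eye(T)
    for t in range(T):
        for k, ph in enumerate(phi, start=1):
            if t - k >= 0:
                L[t, t - k] = -ph
    return L.T @ L

q = p = 60
Omega = ar_precision(q, [0.91, 0.44, 0.38, 0.31, 0.22])   # left precision, AR(5)
Lam = ar_precision(p, [0.47, 0.23, 0.14, 0.19])           # right precision, AR(4)
U, V = np.linalg.inv(Omega), np.linalg.inv(Lam)

# Each draw Y ~ N(0, U, V) is a sampled Markov random field on the 60 x 60
# grid: columns behave as AR(5) paths and rows as AR(4) paths, as in Figure 4.
rng = np.random.default_rng(5)
Lu, Lv = np.linalg.cholesky(U), np.linalg.cholesky(V)
Y = np.stack([Lu @ rng.standard_normal((q, p)) @ Lv.T for _ in range(50)])
print(Y.shape)                                            # (50, 60, 60)
```

The banded structure of `Omega` and `Lam` is exactly the band structure imaged in the upper row of Figure 4.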

7 Dynamic Matrix-Variate Graphical Models for Time Series

7.1 General Model Class

One of our interests is in the development of models for matrix time series data, extending prior methodology for multivariate graphical model structuring, as in Carvalho & West (2007a,b) and examples in Carvalho et al. (2007), to the matrix context. We develop a first such extension in the class of matrix-variate dynamic linear models, or exchangeable time series models (Quintana & West, 1987; Quintana, 1992; West & Harrison, 1997), that has for some years been a central model context for financial time series and provides building blocks for more elaborate and highly structured models (Quintana et al., 2003; Carvalho & West, 2007a,b).

Consider $q \times p$ univariate time series $Y_{t,ij}$ following the matrix dynamic linear model (DLM) defined by
$$Y_t = (I_q \otimes F_t')\Theta_t + \nu_t, \qquad \nu_t \sim N(0, U, V), \qquad (8)$$
$$\Theta_t = (I_q \otimes G_t)\Theta_{t-1} + \Upsilon_t, \qquad \Upsilon_t \sim N(0, U \otimes \Sigma_t, V), \qquad (9)$$
for $t = 1, 2, \ldots$, where:

- $Y_t = (Y_{t,ij})$ is the $q \times p$ matrix observation at time $t$;
- $\Theta_t = (\Theta_{t,ij})$ is the $qs \times p$ state matrix comprised of $q \times p$ state vectors $\Theta_{t,ij}$, each of dimension $s \times 1$;
- $\Upsilon_t = (\omega_{t,ij})$ is the $qs \times p$ matrix of state evolution innovations comprised of $q \times p$ innovation vectors $\omega_{t,ij}$, each of dimension $s \times 1$;
- $\nu_t = (\nu_{t,ij})$ is the $q \times p$ matrix of observational errors;
- $\Sigma_t$ is an $s \times s$ variance matrix related to the scale and structure of innovations at time $t$;

- the regression $s$-vector $F_t$ and $s \times s$ state evolution matrix $G_t$ are known for each $t$.

In the examples below $F_t = F$ and $G_t = G$, as is common in many practical models, but the model class is more general and includes dynamic regressions when $F_t$ involves predictor variables. The matrix normal distributions for the innovations and errors as specified here are such that $\Upsilon_t$ follows a matrix-variate normal distribution with mean 0, left covariance matrix $U \otimes \Sigma_t$ and right covariance matrix $V$ (Dawid, 1981; Carvalho & West, 2007a,b). In terms of the scalar time series, the model comprises $pq$ univariate DLMs with individual $s$-vector state parameters, namely
$$\text{Observation:}\quad Y_{t,ij} = F_t'\Theta_{t,ij} + \nu_{t,ij}, \qquad \nu_{t,ij} \sim N(0, u_{ii}v_{jj}),$$
$$\text{Evolution:}\quad \Theta_{t,ij} = G_t\Theta_{t-1,ij} + \omega_{t,ij}, \qquad \omega_{t,ij} \sim N(0, u_{ii}v_{jj}\Sigma_t),$$
for each $t$, so it is clear how the covariance structures $V$, $U$ and $\Sigma_t$ separately reflect dependencies among the rows and columns of the state parameter matrices and the innovations that drive changes over time, as well as the observations. Importantly, each of the scalar series shares the same $F_t$ and $G_t$ elements, and the reference to the model as one of exchangeable time series reflects these symmetries.

Suppose now that $U$ and $V$ are constrained by graphs $G_U$ and $G_V$. Complete the model specification, conditional on $U, V$, with the initial prior
$$(\Theta_0 \mid U, V, D_0) \sim N(m_0, U \otimes C_0, V). \qquad (10)$$
Conditional on $U, V$, the model and prior setup imply a complete conjugate structure for sequential learning as data are processed, as follows.

Theorem 1. Under the initial prior of equation (10), and with data observed sequentially to update the information set as $D_t = \{D_{t-1}, Y_t\}$, the sequential updating for the matrix normal DLM proceeds as follows:

(i) Posterior at $t - 1$: $(\Theta_{t-1} \mid D_{t-1}, U, V) \sim N(m_{t-1}, U \otimes C_{t-1}, V)$.

(ii) Prior at $t$: $(\Theta_t \mid D_{t-1}, U, V) \sim N(a_t, U \otimes R_t, V)$, where $a_t = (I_q \otimes G_t)m_{t-1}$ and $R_t = G_tC_{t-1}G_t' + \Sigma_t$.

(iii) One-step forecast at $t - 1$: $(Y_t \mid D_{t-1}, U, V) \sim N(f_t, q_tU, V)$, where $f_t = (I_q \otimes F_t'G_t)m_{t-1}$ and $q_t = F_t'R_tF_t + 1$.

(iv) Posterior at $t$: $(\Theta_t \mid D_t, U, V) \sim N(m_t, U \otimes C_t, V)$, with
$$m_t = a_t + (I_q \otimes A_t)e_t \quad\text{and}\quad C_t = R_t - A_tA_t'q_t,$$
where $A_t = R_tF_t/q_t$ and $e_t = Y_t - f_t$.

Proof. This is a direct extension of the theory of multivariate DLMs applied to the data vectors $\mathrm{vec}(Y_t)$; see West & Harrison (1997).

The main novelty here concerns the separability of the covariance structures. That is:

(a) For all $t$, the distributions for state matrices have separable covariance structures; for example, $(\Theta_t \mid D_t, U, V)$ is such that $\mathrm{cov}\{\mathrm{vec}(\Theta_t) \mid D_t, U, V\} = V \otimes U \otimes C_t$.

(b) The $q \times p$ matrix of one-step ahead prediction errors, $e_t$, does not depend on $U$ and $V$.

(c) The sequential updating equations for the $qs \times p$ state matrices are implemented in parallel, based on computations for the component univariate DLMs, each of them involving the same scalar $q_t$, $s$-vector $A_t$ and $s \times s$ matrices $R_t, C_t$ at time $t$. This reduction in required computation is a direct consequence of the exchangeable structure of the model for the set of series.

In practical modelling, the sequence of variance matrices $\Sigma_t$ is typically controlled and highly structured using discount factors (West & Harrison, 1997). The most parsimonious structure uses one discount factor $\delta$, with $0 < \delta < 1$, and sets $\Sigma_t = C_{t-1}(1 - \delta)/\delta$. The discount factor represents a loss of information, or increase in uncertainty, between time points, corresponding to the stochastic innovation at time $t$. Slightly more elaborate structures are used in models in which the state vectors have identifiable sub-vector components, such as groups of regression parameters, trend parameters, seasonal parameters, and so forth. In such cases, two or more discount factors are usually applied to define $\Sigma_t$; we give an example of this in the data analysis below.
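As a concrete rendering of Theorem 1 with single-discount evolution, the following sketch (our Python illustration; the function name and toy dimensions are hypothetical) runs the shared scalar/vector recursions once per time point and applies them across all $q \times p$ series simultaneously. Note that it discounts $P_t = G C_{t-1} G'$, the standard one-discount recipe, which is one common reading of the $\Sigma_t$ specification above.

```python
import numpy as np

def matrix_dlm_filter(Y, F, G, M0, C0, delta=0.9):
    """Forward filter for the matrix DLM (8)-(9) with one discount factor.
    All q*p univariate series share q_t, A_t, R_t, C_t (Theorem 1).
    Shapes: Y (n,q,p); F (s,); G (s,s); M0 (q,p,s); C0 (s,s).
    Returns the one-step errors e_t and scales q_t."""
    n, q, p = Y.shape
    Mm, C = M0.copy(), C0.copy()
    e, qs = np.empty((n, q, p)), np.empty(n)
    for t in range(n):
        At_ = Mm @ G.T                  # a_t cell-wise: G m_{t-1,ij}
        R = G @ C @ G.T / delta         # R_t = P_t / delta (discounted evolution)
        f = At_ @ F                     # f_t,ij = F' G m_{t-1,ij}
        qt = F @ R @ F + 1.0
        A = R @ F / qt
        e[t] = Y[t] - f
        Mm = At_ + e[t][..., None] * A  # m_t,ij = a_t,ij + A_t e_t,ij
        C = R - np.outer(A, A) * qt
        qs[t] = qt
    return e, qs

# Toy run: a linear-trend model with s = 2, on a 4 x 3 matrix series.
rng = np.random.default_rng(6)
F, G = np.array([1.0, 0.0]), np.array([[1.0, 1.0], [0.0, 1.0]])
Y = rng.standard_normal((30, 4, 3))
e, qt = matrix_dlm_filter(Y, F, G, np.zeros((4, 3, 2)), 100 * np.eye(2))
print(e.shape, qt[:3])
```

The single pass over $(q_t, A_t, R_t, C_t)$, reused for every $(i, j)$, is exactly the computational reduction noted in point (c).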

7.2 MCMC and Marginal Likelihoods in Dynamic Models

Consider now the question of inference on $U, V$ in this model context. First, note that with any set of observations $Y_{1:n}$ over times $t = 1{:}n$, the sequential updating analysis on any graphs $G_U$ and $G_V$ leads to the full joint density
$$p(Y_{1:n} \mid U, V) = p(Y_n \mid U, V, D_{n-1}) \cdots p(Y_1 \mid U, V, D_0) = \prod_{t=1}^n N(e_t \mid 0, q_tU, V),$$
marginalised with respect to the sequence of state matrices. Thus, effectively, the sequence of one-step forecast error matrices represents a conditionally independent random sample from matrix normal distributions. Apart from the scalar factors $q_t$, this is essentially the framework of Sections 3 and 4, and both the MCMC analysis and the approximate computation of marginal likelihood values are immediately accessible. The analysis details involve only a trivial change to insert the $q_t$ constants, but otherwise apply directly. As a result, we are able to analyse data directly under such models for any specified graphs $G_U, G_V$, under a specified prior
$$U \sim \mathrm{HIW}_{G_U}(b, B), \qquad V^* \sim \mathrm{HIW}_{G_V}(d, D), \qquad V = V^*/v^*_{11},$$
with $U$ and $V^*$ independent. We can then re-analyse the model and approximately evaluate marginal likelihoods $p(Y_{1:n})$ on any specified pair of graphs by marginalisation over $U, V$, using the two parallel assessments developed earlier.
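A sketch of the resulting likelihood evaluation from filter output follows; it is our illustration, with `mn_logpdf0` a hypothetical helper repeating the matrix normal density used earlier and simulated stand-ins for the filter output $(e_t, q_t)$.

```python
import numpy as np

def mn_logpdf0(Y, U, V):
    """Zero-mean matrix normal log density N(0, U, V), as in Section 2."""
    q, p = Y.shape
    _, ldU = np.linalg.slogdet(U)
    _, ldV = np.linalg.slogdet(V)
    quad = np.trace(np.linalg.solve(U, Y) @ np.linalg.solve(V, Y.T))
    return -0.5 * (q * p * np.log(2 * np.pi) + p * ldU + q * ldV + quad)

def dlm_matrix_loglik(e, qt, U, V):
    """log p(Y_{1:n} | U, V) = sum_t log N(e_t | 0, q_t U, V) from filter output."""
    return sum(mn_logpdf0(et, q * U, V) for et, q in zip(e, qt))

# With (e, qt) from the filtering sketch of Section 7.1 (simulated here):
rng = np.random.default_rng(7)
e, qt = rng.standard_normal((30, 4, 3)), 1.0 + rng.random(30)
print(dlm_matrix_loglik(e, qt, np.eye(4), np.eye(3)))
```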

7.3 A Macroeconomic Data Example

The model developed above is illustrated here. The interest is in analysing the conditional dependence structure, over a period of years, in monthly series of observations on changes in labour market employment statistics. The data series considered are Current Employment Statistics (CES) for the 8 US states New Jersey (NJ), New York (NY), Massachusetts (MA), Georgia (GA), North Carolina (NC), Virginia (VA), Illinois (IL) and Ohio (OH). We explore these data across 9 industrial sectors, namely construction (C), manufacturing (M), transportation & utilities (T&U), information (I), financial activities (FA), professional & business services (P&BS), education & health services (E&H), leisure & hospitality (L&H) and government (G). In our model framework, we have $q = 8$, $p = 9$ and monthly data over several years. Then $U$ characterises the residual conditional dependencies among states while $V$ does the same for industrial sectors, all in the context of an overall model that incorporates time-varying state parameters for underlying trend and annual seasonal structure in the series. The trend and seasonal elements are represented in standard form, the former as random walks and the latter as randomly varying seasonal effects.

The details, in the model notation above, are as follows. In month $t$, the monthly employment change in state $i$ and industrial sector $j$ is $Y_{t,ij}$, modelled via a first-order polynomial/seasonal effects model (West & Harrison, 1997) with the state vector comprising a local level parameter and 12 seasonal factors, so that the state dimension is $s = 13$. The elements of the univariate DLMs of Section 7.1 are as follows:

- $F_t \equiv F = (1, 1, 0, \ldots, 0)'$ for all $t$, with two leading ones followed by 11 zeros;
- $G_t \equiv G = \mathrm{block\,diag}(1, P)$, where $P$ is the $12 \times 12$ cyclic permutation matrix that rotates the seasonal factors each month;
- $\Theta_{t,ij} = (\mu_{t,ij}, \phi_{t,ij}')'$, where $\mu_{t,ij}$ is the local level of the series and $\phi_{t,ij} = (\phi_{t,ij,k}, \phi_{t,ij,k+1}, \ldots, \phi_{t,ij,11}, \phi_{t,ij,0}, \ldots, \phi_{t,ij,k-1})'$ is the current vector of seasonal factors, satisfying the constraint $\mathbf{1}'\phi_{t,ij} = 0$ for all $i, j$ and $t$;
- $\Sigma_t$ is the variance matrix of trend/seasonal effects, structured in block diagonal form as $\Sigma_t = \mathrm{block\,diag}(\Sigma_{t,\mu}, \Sigma_{t,\phi})$,

where the univariate entry $\Sigma_{t,\mu}$ and the block $\Sigma_{t,\phi}$ are defined via two discount factors $\delta_l$ (for level) and $\delta_s$ (for seasonality) and the corresponding block components of $C_t$, as $\Sigma_{t,\mu} = C_{t-1,\mu}(1 - \delta_l)/\delta_l$ and $\Sigma_{t,\phi} = PC_{t-1,\phi}P'(1 - \delta_s)/\delta_s$. The discount factor $\delta_l$ determines the rate at which the trend parameters $\mu_{t,ij}$ are expected to vary between months, with $100(\delta_l^{-1} - 1)\%$ of the information (as measured by precision) about these parameters decaying each month. The factor $\delta_s$ plays the same role for the seasonal parameters. Our analysis uses $\delta_l = 0.9$ and a fixed value of $\delta_s$. For each model analysis, the initial prior is very vague, being specified via $m_0 = 0$, the zero matrix, and $C_0 = 100I_{13}$. The constraint that $\mathbf{1}'\phi_{t,ij} = 0$ is imposed by transforming $m_0$ and $C_0$ as discussed in West & Harrison (1997).

Applying this model, we aim to detect and estimate sustained movements and changes in trend and seasonality, generating the on-line detrended and deseasonalised error matrix series $e_t$ whose row and column covariance patterns are defined by the parameters $U, V$. As an example, the (NC, FA) data and some aspects of the sequential model fit are graphed in Figure 5.

It is beyond the scope of the current paper to develop and discuss computational methods of graphical model search to automate the process of generating candidate graphs $(G_U, G_V)$. This is an open research area, and one we are currently exploring using ideas from MCMC and stochastic search (Dobra et al., 2004; Jones et al., 2005; Rich et al., 2005; Hans et al., 2007). For our purposes here, we generated a collection of potentially interesting and relevant pairs of graphs via a direct, ad-hoc method, sketched in code below. That is:

- For each US state $i$, say NC, sample the inverse Wishart reference posterior for $V$.
- Explore which entries in $\Lambda = V^{-1}$ might plausibly be small or zero, simply by inspecting the implied sets of complete conditional regression coefficients derived from the reference posterior mean of the precision matrix $\Lambda$. Use this informally to assess a range of candidate graphs $G_V$; Figure 6 shows the 8 candidate graphs $G_V$ so generated.
- Apply a similar strategy across industrial sectors $j$ to obtain a few candidate graphs $G_U$; the 9 selected candidate graphs appear in Figure 7.
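A sketch of the informal screening step in the second bullet follows. It is our illustration: the helper name and threshold are hypothetical, and the paper's procedure was an informal visual inspection rather than a fixed rule.

```python
import numpy as np

def edge_screen(Lam, thresh=0.1):
    """Informal screen for candidate graphs: the complete conditional
    regression of column j on column t has coefficient -lambda_tj/lambda_jj,
    so small coefficients (in both directions) flag edges to consider
    removing. Returns an adjacency matrix for a candidate graph."""
    p = Lam.shape[0]
    B = -Lam / np.diag(Lam)[None, :]       # B[t, j]: coefficient of y_t for y_j
    keep = np.zeros((p, p), dtype=bool)
    for j in range(p):
        for t in range(j + 1, p):
            keep[t, j] = keep[j, t] = max(abs(B[t, j]), abs(B[j, t])) > thresh
    return keep

# Toy use with a posterior-mean precision estimate Lam_hat (simulated here).
rng = np.random.default_rng(8)
A = rng.standard_normal((9, 9))
Lam_hat = A @ A.T + 9 * np.eye(9)
print(edge_screen(Lam_hat).astype(int))
```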

For each of the resulting 72 models defined by the possible pairs $(G_U, G_V)$, the above analysis was run to generate approximate marginal likelihoods. An overall summary of the likelihood over graphs is given in Table 2. Figures 8 and 9 show the graph combination with the highest likelihood, and we note that both the states graph and the industrial sectors graph seem to reflect relevant dependencies in the econometric context. In particular, the state graph groups the tightly related northeastern US states together into one clique; it connects as statistical neighbours the mid-Atlantic/southeastern physical and close economic neighbours VA and NC; and it links the non-Atlantic-seaboard northeastern neighbours OH and IL. The business sector graph connects the core manufacturing, commodity production/flow and industrial support sectors M, T&U, C, P&BS, L&H and E&H in an intuitive chain, with the remaining sectors free-standing and isolated from this chain.

8 Discussion

This paper has presented and illustrated novel matrix-data modelling that incorporates conditional independence structures, in terms of Gaussian graphical models, in the rows and columns of matrix data. The innovations include fully Bayesian analysis of the resulting models for random samples using MCMC methods, and a first computational methodology for marginal likelihood evaluation to provide an entrée into the realm of graphical model uncertainty assessment. A cornerstone of the theoretical model structure is a class of novel priors for matrix normal graphical models, using a parameter expansion (PX) approach.

The posterior computation using a Gibbs sampler shows clean convergence and mixing of the Gibbs chains, as well as stable and accurate MCMC evaluations of marginal likelihood functions under specified graphical models. Recent work of Hobert & Marchev (2008) and Roy & Hobert (2007) provides theoretical support for PX Gibbs samplers; in our model, PX not only yields good estimation due to its good mixing and convergence properties, but is also fundamental to the new model/prior framework, addressing identification issues while yielding tractable and computationally accessible posteriors.

The first synthetic data example is of modest size but serves as an illustration of the ability of Markov chain Monte Carlo methods to estimate covariance matrices and approximate marginal likelihoods in this framework. The second example illustrates the statistical and computational methodology in a higher-dimensional problem, while introducing a novel class of Markov random field models that emerge quite naturally from the matrix graphical model context. The final example builds on our extensions of the modelling theory and computations to a broad class of matrix-variate dynamic models with structured dependencies among both rows and columns of time series data matrices using graphical model forms. Current interest lies in developing these new spatial and time series graphical models, both methodologically and in terms of applications, and in extending the methodology to more formal and automated graphical model search.

Acknowledgement

The authors acknowledge the support of grants from the U.S. National Science Foundation and National Institutes of Health.

References

Besag, J. (1989). A candidate's formula: A curious result in Bayesian prediction. Biometrika 76.

Carvalho, C., Massam, H. & West, M. (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika 94.

Carvalho, C. M. & West, M. (2007a). Dynamic matrix-variate graphical models. Bayesian Analysis 2.

Carvalho, C. M. & West, M. (2007b). Dynamic matrix-variate graphical models - A synopsis. In Bayesian Statistics VIII, J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith & M. West, eds. Oxford University Press.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90.

Dawid, A. P. (1981). Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika 68.

Dawid, A. P. & Lauritzen, S. L. (1993). Hyper-Markov laws in the statistical analysis of decomposable graphical models. Annals of Statistics 21.

Dobra, A., Jones, B., Hans, C., Nevins, J. & West, M. (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90.

Dutilleul, P. (1999). The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation 64.

Finn, J. D. (1974). A General Model for Multivariate Analysis. New York: Holt, Rinehart and Winston.

Galecki, A. (1994). General class of covariance structures for two or more repeated factors in longitudinal data analysis. Communications in Statistics - Theory and Methods 23.

Gelman, A. (2004). Parameterization and Bayesian modeling. Journal of the American Statistical Association 99.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1.

Giudici, P. (1996). Learning in graphical Gaussian models. In Bayesian Statistics 5, J. M. Bernardo, J. O. Berger, A. P. Dawid & A. F. M. Smith, eds. Oxford University Press.

Giudici, P. & Green, P. J. (1999). Decomposable graphical Gaussian model determination. Biometrika 86.

Gupta, A. K. & Nagar, D. K. (2000). Matrix Variate Distributions, vol. 104 of Monographs and Surveys in Pure and Applied Mathematics. London: Chapman & Hall/CRC.

Hans, C., Dobra, A. & West, M. (2007). Shotgun stochastic search in regression with many predictors. Journal of the American Statistical Association 102.

Hobert, J. P. & Marchev, D. (2008). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. Annals of Statistics, to appear.

Huizenga, H. M., de Munck, J. C., Waldorp, L. J. & Grasman, R. (2002). Spatiotemporal EEG/MEG source analysis based on a parametric noise covariance model. IEEE Transactions on Biomedical Engineering 49.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. & West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science 20.

Lauritzen, S. L. (1996). Graphical Models. Oxford: Clarendon Press.

Liu, C., Rubin, D. B. & Wu, Y. N. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika.

Liu, J. S. & Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association 94.

Mardia, K. V. & Goodall, C. R. (1993). Spatial-temporal analysis of multivariate environmental monitoring data. In Multivariate Environmental Statistics, G. P. Patil & C. R. Rao, eds. Elsevier.

Mitchell, M. W., Genton, M. G. & Gumpertz, M. L. (2005). Testing for separability of space-time covariances. Environmetrics 16.

Mitchell, M. W., Genton, M. G. & Gumpertz, M. L. (2006). A likelihood ratio test for separability of covariances. Journal of Multivariate Analysis 97.

Naik, D. N. & Rao, S. S. (2001). Analysis of multivariate repeated measures data with a Kronecker product structured covariance matrix. Journal of Applied Statistics 29.

Quintana, J. (1992). Optimal portfolios of forward currency contracts. In Bayesian Statistics IV, J. Berger, J. Bernardo, A. Dawid & A. Smith, eds. Oxford University Press.

Quintana, J., Lourdes, V., Aguilar, O. & Liu, J. (2003). Global gambling. In Bayesian Statistics VII, J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith & M. West, eds. Oxford University Press.

Quintana, J. & West, M. (1987). Multivariate time series analysis: New techniques applied to international exchange rate data. The Statistician 36.

Rich, J., Hans, C., Jones, B., Iversen, E., McClendon, R., Rashed, A., Dobra, A., Dressman, H., Bigner, D., Nevins, J. & West, M. (2005). Gene expression and genetic markers in glioblastoma survival. Cancer Research 65.

Roy, V. & Hobert, J. P. (2007). Convergence rates and asymptotic standard errors for MCMC algorithms for Bayesian probit regression. Journal of the Royal Statistical Society, Series B 69.

Theobald, D. L. & Wuttke, D. S. (2006). Empirical Bayes hierarchical models for regularizing maximum likelihood estimation in the matrix Gaussian Procrustes problem. Proceedings of the National Academy of Sciences 103.

West, M. & Harrison, P. J. (1997). Bayesian Forecasting and Dynamic Models, 2nd ed. New York: Springer-Verlag.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Chichester: John Wiley and Sons.

Table 1: Relative log-marginal likelihoods of the top five models in the MRF graphical model example. Columns give the graph structure (AR orders for V and U) and the two log-marginal likelihood estimates (A) and (B); the surviving row labels pair V with U as (AR(4), AR(5)), (AR(5), AR(5)), (AR(6), AR(5)) and (AR(4), AR(6)). Each entry is the estimated log-marginal likelihood relative to that of the most likely model under Candidate's method (A).

Table 2: Relative log-marginal likelihoods of the pairs of graphs (G_U, G_V) in the matrix dynamic graphical model applied to the US states by industrial sectors time series. Each entry is the estimated log-marginal likelihood relative to that of the most likely model, computed as the average of the two approximate values generated from the two parallel versions (A) and (B) of Candidate's formula. The differences between the two estimates of marginal likelihood are, in all cases, in the second decimal place or smaller.

Figure 1: MCMC trace plots of diagonal elements of V in the analysis of the simulated random sample of Section 5, illustrating the stability and fast mixing of the MCMC that is consistent across all parameters in U and V.

Figure 2: Images of the two true precision matrices Λ and Ω (panels a and c) in the model generating the simulated random sample of Section 5, together with the corresponding MCMC-based posterior means (panels b and d).

Figure 3: Log-marginal likelihood values on the two true graphs in the simulation example of Section 5. The two estimates, from methods (A) and (B) of Section 4.2, were successively re-evaluated at differing MCMC sample sizes. The plot confirms the concordance of the two estimates even at low MCMC sample sizes, and suggests accuracy in terms of the differences between the two estimates on the log-likelihood scale.

Figure 4: Images displaying the band structure of the two precision matrices (upper row) used in the MRF matrix graphical model example of Section 6, together with images of two simulated draws from the model (lower row).

Figure 5: Aspects of one data time series in the econometric example of Section 7.3: monthly changes in (NC, FA) employment together with the one-step ahead forecasts (upper frame); standardised one-step ahead forecast errors $e_t/\sqrt{q_t}$ (middle frame); on-line estimated seasonal pattern (lower frame).

Figure 6: Candidate graphs for V in the econometric example of Section 7.3, represented as edge adjacency matrices displayed as dots (1) and missing dots (0).

Figure 7: Candidate graphs for U in the econometric example of Section 7.3, represented as edge adjacency matrices displayed as dots (1) and missing dots (0).

Figure 8: Econometric example of Section 7.3: highest marginal likelihood graph, showing conditional dependencies among states for the CES data.

Figure 9: Econometric example of Section 7.3: highest marginal likelihood graph, showing conditional dependencies among industrial sectors for the CES data.


More information

Dynamic linear models (aka state-space models) 1

Dynamic linear models (aka state-space models) 1 Dynamic linear models (aka state-space models) 1 Advanced Econometris: Time Series Hedibert Freitas Lopes INSPER 1 Part of this lecture is based on Gamerman and Lopes (2006) Markov Chain Monte Carlo: Stochastic

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Areal data models. Spatial smoothers. Brook s Lemma and Gibbs distribution. CAR models Gaussian case Non-Gaussian case

Areal data models. Spatial smoothers. Brook s Lemma and Gibbs distribution. CAR models Gaussian case Non-Gaussian case Areal data models Spatial smoothers Brook s Lemma and Gibbs distribution CAR models Gaussian case Non-Gaussian case SAR models Gaussian case Non-Gaussian case CAR vs. SAR STAR models Inference for areal

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

Posterior convergence rates for estimating large precision. matrices using graphical models

Posterior convergence rates for estimating large precision. matrices using graphical models Biometrika (2013), xx, x, pp. 1 27 C 2007 Biometrika Trust Printed in Great Britain Posterior convergence rates for estimating large precision matrices using graphical models BY SAYANTAN BANERJEE Department

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual

More information

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US

Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Online appendix to On the stability of the excess sensitivity of aggregate consumption growth in the US Gerdie Everaert 1, Lorenzo Pozzi 2, and Ruben Schoonackers 3 1 Ghent University & SHERPPA 2 Erasmus

More information

Inequalities on partial correlations in Gaussian graphical models

Inequalities on partial correlations in Gaussian graphical models Inequalities on partial correlations in Gaussian graphical models containing star shapes Edmund Jones and Vanessa Didelez, School of Mathematics, University of Bristol Abstract This short paper proves

More information

Precision Engineering

Precision Engineering Precision Engineering 38 (2014) 18 27 Contents lists available at ScienceDirect Precision Engineering j o ur nal homep age : www.elsevier.com/locate/precision Tool life prediction using Bayesian updating.

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Cross-sectional space-time modeling using ARNN(p, n) processes

Cross-sectional space-time modeling using ARNN(p, n) processes Cross-sectional space-time modeling using ARNN(p, n) processes W. Polasek K. Kakamu September, 006 Abstract We suggest a new class of cross-sectional space-time models based on local AR models and nearest

More information

1 Bayesian Linear Regression (BLR)

1 Bayesian Linear Regression (BLR) Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models STA 345: Multivariate Analysis Department of Statistical Science Duke University, Durham, NC, USA Robert L. Wolpert 1 Conditional Dependence Two real-valued or vector-valued

More information

SUPPLEMENT TO MARKET ENTRY COSTS, PRODUCER HETEROGENEITY, AND EXPORT DYNAMICS (Econometrica, Vol. 75, No. 3, May 2007, )

SUPPLEMENT TO MARKET ENTRY COSTS, PRODUCER HETEROGENEITY, AND EXPORT DYNAMICS (Econometrica, Vol. 75, No. 3, May 2007, ) Econometrica Supplementary Material SUPPLEMENT TO MARKET ENTRY COSTS, PRODUCER HETEROGENEITY, AND EXPORT DYNAMICS (Econometrica, Vol. 75, No. 3, May 2007, 653 710) BY SANGHAMITRA DAS, MARK ROBERTS, AND

More information

An EM algorithm for Gaussian Markov Random Fields

An EM algorithm for Gaussian Markov Random Fields An EM algorithm for Gaussian Markov Random Fields Will Penny, Wellcome Department of Imaging Neuroscience, University College, London WC1N 3BG. wpenny@fil.ion.ucl.ac.uk October 28, 2002 Abstract Lavine

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Efficient Posterior Inference and Prediction of Space-Time Processes Using Dynamic Process Convolutions

Efficient Posterior Inference and Prediction of Space-Time Processes Using Dynamic Process Convolutions Efficient Posterior Inference and Prediction of Space-Time Processes Using Dynamic Process Convolutions Catherine A. Calder Department of Statistics The Ohio State University 1958 Neil Avenue Columbus,

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Geometric ergodicity of the Bayesian lasso

Geometric ergodicity of the Bayesian lasso Geometric ergodicity of the Bayesian lasso Kshiti Khare and James P. Hobert Department of Statistics University of Florida June 3 Abstract Consider the standard linear model y = X +, where the components

More information

Assessing Regime Uncertainty Through Reversible Jump McMC

Assessing Regime Uncertainty Through Reversible Jump McMC Assessing Regime Uncertainty Through Reversible Jump McMC August 14, 2008 1 Introduction Background Research Question 2 The RJMcMC Method McMC RJMcMC Algorithm Dependent Proposals Independent Proposals

More information

arxiv: v1 [stat.me] 26 Jul 2011

arxiv: v1 [stat.me] 26 Jul 2011 AUTOREGRESSIVE MODELS FOR VARIANCE MATRICES: STATIONARY INVERSE WISHART PROCESSES BY EMILY B. FOX AND MIKE WEST Duke University, Durham NC, USA arxiv:1107.5239v1 [stat.me] 26 Jul 2011 We introduce and

More information

A new Hierarchical Bayes approach to ensemble-variational data assimilation

A new Hierarchical Bayes approach to ensemble-variational data assimilation A new Hierarchical Bayes approach to ensemble-variational data assimilation Michael Tsyrulnikov and Alexander Rakitko HydroMetCenter of Russia College Park, 20 Oct 2014 Michael Tsyrulnikov and Alexander

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Inference and estimation in probabilistic time series models

Inference and estimation in probabilistic time series models 1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

Factorization of Seperable and Patterned Covariance Matrices for Gibbs Sampling

Factorization of Seperable and Patterned Covariance Matrices for Gibbs Sampling Monte Carlo Methods Appl, Vol 6, No 3 (2000), pp 205 210 c VSP 2000 Factorization of Seperable and Patterned Covariance Matrices for Gibbs Sampling Daniel B Rowe H & SS, 228-77 California Institute of

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Simulation of truncated normal variables. Christian P. Robert LSTA, Université Pierre et Marie Curie, Paris

Simulation of truncated normal variables. Christian P. Robert LSTA, Université Pierre et Marie Curie, Paris Simulation of truncated normal variables Christian P. Robert LSTA, Université Pierre et Marie Curie, Paris Abstract arxiv:0907.4010v1 [stat.co] 23 Jul 2009 We provide in this paper simulation algorithms

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

Monetary and Exchange Rate Policy Under Remittance Fluctuations. Technical Appendix and Additional Results

Monetary and Exchange Rate Policy Under Remittance Fluctuations. Technical Appendix and Additional Results Monetary and Exchange Rate Policy Under Remittance Fluctuations Technical Appendix and Additional Results Federico Mandelman February In this appendix, I provide technical details on the Bayesian estimation.

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Generalized Autoregressive Score Models

Generalized Autoregressive Score Models Generalized Autoregressive Score Models by: Drew Creal, Siem Jan Koopman, André Lucas To capture the dynamic behavior of univariate and multivariate time series processes, we can allow parameters to be

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

PACKAGE LMest FOR LATENT MARKOV ANALYSIS

PACKAGE LMest FOR LATENT MARKOV ANALYSIS PACKAGE LMest FOR LATENT MARKOV ANALYSIS OF LONGITUDINAL CATEGORICAL DATA Francesco Bartolucci 1, Silvia Pandofi 1, and Fulvia Pennoni 2 1 Department of Economics, University of Perugia (e-mail: francesco.bartolucci@unipg.it,

More information

Gibbs Sampling in Endogenous Variables Models

Gibbs Sampling in Endogenous Variables Models Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take

More information

Bayesian time series classification

Bayesian time series classification Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

More information

Bayesian spatial hierarchical modeling for temperature extremes

Bayesian spatial hierarchical modeling for temperature extremes Bayesian spatial hierarchical modeling for temperature extremes Indriati Bisono Dr. Andrew Robinson Dr. Aloke Phatak Mathematics and Statistics Department The University of Melbourne Maths, Informatics

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle

data lam=36.9 lam=6.69 lam=4.18 lam=2.92 lam=2.21 time max wavelength modulus of max wavelength cycle AUTOREGRESSIVE LINEAR MODELS AR(1) MODELS The zero-mean AR(1) model x t = x t,1 + t is a linear regression of the current value of the time series on the previous value. For > 0 it generates positively

More information

On a multivariate implementation of the Gibbs sampler

On a multivariate implementation of the Gibbs sampler Note On a multivariate implementation of the Gibbs sampler LA García-Cortés, D Sorensen* National Institute of Animal Science, Research Center Foulum, PB 39, DK-8830 Tjele, Denmark (Received 2 August 1995;

More information

Hyperparameter estimation in Dirichlet process mixture models

Hyperparameter estimation in Dirichlet process mixture models Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and

More information

Bayes methods for categorical data. April 25, 2017

Bayes methods for categorical data. April 25, 2017 Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,

More information

Bayesian Inference in the Multivariate Probit Model

Bayesian Inference in the Multivariate Probit Model Bayesian Inference in the Multivariate Probit Model Estimation of the Correlation Matrix by Aline Tabet A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science

More information

Lecture Notes based on Koop (2003) Bayesian Econometrics

Lecture Notes based on Koop (2003) Bayesian Econometrics Lecture Notes based on Koop (2003) Bayesian Econometrics A.Colin Cameron University of California - Davis November 15, 2005 1. CH.1: Introduction The concepts below are the essential concepts used throughout

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

Coupled Hidden Markov Models: Computational Challenges

Coupled Hidden Markov Models: Computational Challenges .. Coupled Hidden Markov Models: Computational Challenges Louis J. M. Aslett and Chris C. Holmes i-like Research Group University of Oxford Warwick Algorithms Seminar 7 th March 2014 ... Hidden Markov

More information