A Recursive Model-Reduction Method for Approximate Inference in Gaussian Markov Random Fields


1 1 A Recursive Model-Reduction Method for Approximate Inference in Gaussian Markov Random Fields Jason K. Johnson and Alan S. Willsky Abstract This paper presents recursive cavity modeling a principled, tractable approach to approximate, near-optimal inference for large Gauss-Markov random fields. The main idea is to subdivide the random field into smaller subfields, constructing cavity models which approximate these subfields. Each cavity model is a concise yet faithful model for the surface of one subfield sufficient for near-optimal inference in adjacent subfields. This basic idea leads to a tree-structured algorithm which recursively builds a hierarchy of cavity models during an upward pass and then builds a complementary set of blanket models during a reverse downward pass. The marginal statistics of individual variables can then be approximated using their blanket models. Model thinning plays an important role, allowing us to develop thinned cavity and blanket models thereby providing tractable approximate inference. We develop a maximum-entropy approach that exploits certain tractable representations of Fisher information on thin chordal graphs. Given the resulting set of thinned cavity models, we also develop a fast preconditioner, which provides a simple iterative method to compute optimal estimates. Thus, our overall approach combines recursive inference, variational learning and iterative estimation. We demonstrate the accuracy and scalability of this approach in several challenging, large-scale remote sensing problems. I. INTRODUCTION Markov random fields (MRFs) play an important role for modeling and estimation in a wide variety of contexts including physics [1], [2], communication and coding [3], signal and image processing [4], [5], [6], [7], [8], [9], pattern recognition [10] remote sensing [11], [12], [13], sensor networks [14], and localization and mapping [15]. Their importance can be traced in some cases to underlying physics of the phenomenon being modeled, in others to the spatially distributed nature of the sensors and computational resources, and in essentially all cases to the expressiveness of this model class. MRFs are graphical models [16], [17], that is, collections of random variables, indexed by nodes of graphs, which satisfy certain graph-structured conditional independence relations: Conditioned on the values of the variables on any set of nodes that separate the graph into two or more disconnected components, the sets of values on those disconnected components are mutually independent. An implication of this Markov property thanks to the Hammersley-Clifford Theorem [18], [1] is that the joint distribution of the variables at all nodes can be compactly described in terms of local interactions among variables at small, completely connected subsets of nodes (the cliques of the graph). MRFs have another well recognized characteristic, namely that performing optimal inference on such models can be prohibitively complex because of the implicit coding of the global distribution in terms of many local interactions. For this reason, most applications of MRFs involve the use of suboptimal or approximate inference methods, and many such methods have been developed [19], [20], [21], [22]. 
In this paper we describe a new, systematic approach, to approximately optimal inference for MRFs that focuses explicitly on propagating local approximate models for subfields of the overall graphical model that are close (in a sense to be made precise) to the exact models for these subfields but are far simpler and, in fact allow computationally tractable exact inference with respect to these approximate models. The building blocks for our approach variable elimination, information projections, and inference on cycle-free graphs (that is, graphs that are trees) are well-known in the graphical model community. What is new here is their synthesis into a systematic procedure for computationally tractable inference that focuses on recursive reduced-order modeling (based on information-theoretic principles) and exact inference on the resulting set of approximate models. The resulting algorithms also have attractive structure that is of potential value for distributed implementations such as in sensor networks. To be sure our approach has connections with work of others perhaps most significantly with [23], [24], [25], [12], [26], [27], and we discuss these relationships as we proceed. The authors are with the Laboratory for Information and Decision Systems at the Massachusetts Institute of Technology, 77 Mass. Ave., Cambridge, MA ( {jasonj,willsky}@mit.edu). This research was funded by the Air Force Office of Scientific Research under Grant FA and by a grant from Shell International Exploration and Production, Inc.

While the principles for our approach apply to general MRFs, we focus our development on the important class of Gaussian MRFs (GMRFs). In the next section we introduce this class, discuss the challenges in solving estimation problems for such models, briefly review methods and literature relevant to these challenges and to our approach, and provide a conceptual overview of our approach that also explains its name: Recursive Cavity Modeling (RCM). In Section III we develop the model-reduction techniques required by RCM. In particular, we develop a tractable maximum-entropy method to compute information projections using convex optimization methods and tractable representations of Fisher information for models defined on chordal graphs. In Section IV we provide the details of the RCM methodology, which consists of a two-pass procedure for building cavity and blanket models and a corresponding hierarchical preconditioner for iterative estimation. Section V demonstrates the effectiveness of RCM with its application to several remote sensing problems. We conclude in Section VI with a discussion of RCM and further directions that it suggests.¹

II. PRELIMINARIES

A. Gaussian Markov Random Fields

Let G = (V, E) denote a graph with node (or vertex) set V and edge set E ⊆ V × V. Let x_v denote a random variable associated with node v ∈ V, and let x denote the vector of all of the x_v. If x is Gaussian with mean x̂ and invertible covariance P, its probability density can be written as

p(x) ∝ exp{-½ (x - x̂)^T P^{-1} (x - x̂)} ∝ exp{-½ x^T J x + h^T x}   (1)

with

J x̂ = h,   P = J^{-1}.   (2)

The form on the right-hand side of (1) is often referred to as the information form of the statistics of a Gaussian process, where J is the information matrix. The fill pattern of J provides the Markov structure [28]: x is Markov with respect to G if and only if J_{u,v} = 0 for all {u, v} ∉ E. In applications, x̂ and P typically specify posterior statistics of x after conditioning on some set of observations. The most common example is one in which we have an original GMRF with respect to G, together with measurements, corrupted by independent Gaussian noise, at some or all of the nodes of the graph. Since an independent measurement at node v simply modifies the values of h_v and J_{v,v}, the resulting field conditioned on all such measurements is also Markov with respect to G.

B. The Estimation Problem

Given J and h, we wish to compute x̂ and (at least) the diagonal elements of P, thus providing the marginal distributions for each of the x_v. However, solving the linear equations in (2) and inverting J can run into scalability problems. For example, methods that take no advantage of graphical structure require O(n^3) computations for graphs with n nodes. If the graph G has particularly nice structure, however, very efficient algorithms do exist. In particular, if G is a tree there are a variety of algorithms which compute x̂ and the diagonal of P with total complexity that is linear in n and that also allow distributed computation corresponding to messages being passed along edges of the graph. For example, if J is tri-diagonal, the variables x_v form a Markov chain, and efficient solution of (2) can be obtained by Gaussian elimination of variables from one end of the chain to the other followed by back-substitution, corresponding to a forward Kalman filtering sweep followed by a backward Rauch-Tung-Striebel smoothing sweep [14].
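As a concrete illustration of the information form in (1)-(2), the following minimal numpy sketch (ours, not the authors' C++ implementation; the chain and all parameter values are invented for illustration) builds a small tridiagonal J, solves for the means, and reads off the marginal variances.

```python
# Minimal sketch of (1)-(2) for a 4-node Gauss-Markov chain. For a chain, J is
# tridiagonal, and x_hat and diag(P) could equally be obtained by a forward
# elimination / backward substitution sweep (Kalman filter / RTS smoother).
import numpy as np

n = 4
J = np.diag(np.full(n, 2.0)) \
    + np.diag(np.full(n - 1, -0.8), 1) \
    + np.diag(np.full(n - 1, -0.8), -1)      # information matrix (tridiagonal: Markov chain)
h = np.array([1.0, 0.0, 0.5, -1.0])          # potential vector

x_hat = np.linalg.solve(J, h)                # J x_hat = h  -> posterior means
P = np.linalg.inv(J)                         # full covariance (only feasible for small n)
marginal_vars = np.diag(P)                   # marginal variances of each x_v
print(x_hat, marginal_vars)
```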
Because of the abundance of applications involving MRFs on graphs with cycles, there is considerable interest and a growing body of literature on computationally tractable inference algorithms. For example, the generalization of the Rauch-Tung-Striebel smoother to trees can in principle be applied to graphs with cycles by aggregating nodes of the original graph, using so-called junction tree algorithms [17], to form an equivalent model on a tree. However, the dimensions of variables at nodes in such a tree model depend on the so-called tree-width of the original graph [29], with overall inference complexity in GMRFs that grows as the cube of this tree-width.

Footnote 1: A C++ implementation of the algorithms described in this paper is available at

Thus, these algorithms are tractable only for graphs with small tree-width, precluding use for many graphs of practical importance, such as a 2D s × s lattice (with n = s² nodes) for which the tree-width is s (so that the cube of the tree-width is n^{3/2}, resulting in complexity that grows faster than linearly with graph size) or a 3D s × s × s lattice for which the tree-width is s² (so that the cube of the tree-width is n²). Since exact inference is only feasible for very particular graphs, there is great interest in algorithms that yield approximations to the correct means and covariances and that have tractable complexity. One well-known algorithm is loopy belief propagation (LBP) [30], [20], which takes the local message-passing rules which yield the exact solution on trees and applies them unchanged and iteratively to graphs with cycles. There has been recent progress [31], [20] in understanding how such algorithms behave, and for GMRFs it is now known [31], [22] that if LBP converges, it yields the correct value for x̂ but not the correct values for the P_{v,v}. Although some sufficient conditions for convergence are known [31], [32], LBP does not always converge and may converge slowly in large GMRFs. There are several other classes of approximate algorithms that are more closely related to our approach, and we discuss these connections in the next subsection. As we now describe, RCM can be viewed as a direct, recursive approximation of an exact (and hence intractable) inference algorithm created by aggregating nodes of the original graph into a tree. In particular, by employing an information-theoretic approach to reduced-order modeling, together with a particular strategy for aggregating nodes, we construct tractable, near-optimal algorithms that can be applied successfully to very large graphs.

C. The Basic Elements of RCM

As with exact methods based on junction trees, RCM makes use of separators, that is, sets of nodes which, if removed from the graph, result in two or more disconnected components. By Markovianity, the sets of variables in each of these disconnected components are mutually independent conditioned on the set of values on the separator. This suggests a divide-and-conquer approach to describing the overall statistics of the MRF on a hierarchically organized tree. Each node in this tree corresponds to a separator at a different scale in the field. For example, the root node of this tree might correspond to a separator that separates the entire graph into, say, k disconnected subgraphs. The root node then has k children, one corresponding to each of these disconnected subgraphs, and the node for each of these children would then correspond to a separator that further dissects that subgraph. This continues to some finest level at which exact inference computations on the subgraphs at that level are manageable. The problem with this approach, as suggested in Section II-B, is that the dimensionality associated with the larger separators in our hierarchical tree can be quite high, for instance, of order √n in square grids. This problem has led several researchers [24], [7], [33], [34], [14] to develop approaches for GMRFs based on dimensionality reduction, that is, replacing the high-dimensional vector of values along an entire separator by a lower-dimensional vector. While approaches such as [33], [34] use statistically-motivated criteria for choosing these approximations, there are significant limitations of this idea.
The first is that the use of low-dimensional approximations can lead to artifacts (that is, modeling errors which expose the underlying approximation), both across and along these separators. The second is that performing such a dimensionality reduction requires that we have available the exact mean and covariance for the vector of variables whose dimension we wish to reduce, which is precisely the intractable computation we wish to approximate! The third limitation is that these approaches are strictly top-down approaches that is, they require establishing the hierarchical decomposition from the root node on down to the leaf nodes a priori, an approach often referred to as nested dissection. We also employ nested dissection in our examples, but the RCM approach also offers the possibility of bottom-up organization of computations, beginning at nodes located close to each other and working outward a capability that is particularly appealing for distributed sensor networks. The key to RCM is the use of the implicit, information form, corresponding to models for the variables along separators, allowing us to consider model-order rather than dimensionality reduction. In this way, we still retain full dimensionality of the variables along each separator, overcoming the problem of artifacts. Of course, we still have to deal with the computational complexity of obtaining the information form of the statistics along each separator. Doing that in a computationally and statistically principled fashion is one of the major components of RCM. Consider a GMRF on the graph depicted in Fig. 4(a) so that the information matrix for this field has a sparsity pattern defined by this graph. Suppose now that we consider solving for ˆx and the diagonal of P from (2) by variable elimination. In particular, suppose that we eliminate all of the variables within the dashed region

4 4 in Fig. 4(a) except for those right at the boundary. Doing this in general will lead to fill in the information matrix for the set of variables that remain after variable elimination. As depicted in Fig. 4(b), this fill is completely concentrated within the dashed region that is, within this cavity. That is, if we ignore the connections outside the cavity, we have a model for the variables along the boundary that is generally very densely connected. This suggests approximating this high-order exact model for the boundary by a reduced or thinned model as in Fig. 4(c) with the sparsity suggested by this figure. Indeed, if we thin the model sufficiently, we can then continue the process of alternating variable elimination and model thinning in a computationally tractable manner. Suppose next that there are a number of disjoint cavities as in Fig. 4 in each of which we have performed alternating steps of variable elimination to enlarge the cavity followed by model thinning to maintain tractability. Eventually, two or more of these cavities will reach a point at which they are adjacent to each other, as in Fig. 5(a). At this point, the next step is one of merging these cavities into a larger one (Fig. 5(b)), eliminating the nodes that are interior to the new, larger cavity (Fig. 5(c)), and then thinning this new model. If each step does sufficient thinning, computational tractability can be maintained. Eventually, this outwards elimination ends, and a reverse inwards elimination procedure commences, again done in information form. This inwards procedure eliminates all of the variables outside of each subfield, except for the variables adjacent to the subfield, producing what is known as a blanket model. Eliminating these variables involves computations that recursively produce blanket models for smaller and smaller subfields, as illustrated in Fig. 6, which shows that, once again, model thinning (going from Fig. 6(c) to Fig. 6(d)) plays a central role. Finally, once this inward sweep has been completed, we have information forms for the marginal statistics for each of the subfields that were used to initialize the outwards elimination procedure. Inverting these many smaller, now-localized models to obtain means and variances is then, by construction, computationally tractable. RCM has some relationships to other work as well as some substantive differences. The general conceptual form we have outlined is closely related to the nested dissection approach [23], [12] to solving large linear systems. The approach to model thinning in [23], [12], however, is simply zeroing a set of elements (retaining just those elements which couple nearby nodes along the boundary). The statistical interpretation of this approach and its extensibility to less regular lattices and fields, however, are problematic. In particular, zeroing elements can lead to indefinite (and hence meaningless) information matrices, and even if this is not the case, such an operation in general will modify all of the elements of the covariance matrix (including the variances of individual variables). In contrast, we adopt a principled, statistical approach to model thinning, using so-called information projections, which guarantee that the means, variances and edgewise correlations in the thinned model are unchanged by the thinning process. 
Information-theoretic approaches to approximating graphical models have a significant literature [25], [35], [29], [36], most of which focuses on doing this for a single, overall graphical model and not in the context of a recursive procedure such as we develop. One effort that has considered a recursive approach is [26] which examines timerecursive inference for Dynamic Bayes Nets (DBNs) that is, for graphical models that evolve in time, so that we can view the overall graphical model as a set of coupled temporal stages. Causal recursive filtering then corresponds to propagating frontiers that is, a particular choice of what we would call cavity boundaries corresponding to the values at all nodes at a single point in time. The method in [26] projects each frontier model into a family of factored models so that the projection is given by a product of marginals on disjoint subsets of nodes. Such an operation can be viewed as a special case of the outward propagation of cavity models where nodes in the boundary are required to be mutually independent. In our approach, we instead adaptively thin the graphical model by identifying and removing edges that correspond to weak conditional dependencies so that the thinned models typically do not become disconnected. Also, the other two elements of our approach specifically, the hierarchical structure which requires merging operations as in Fig. 5 and the inward recursion for blanket models as in Fig. 6 don t arise in the consideration of DBNs [26]. This distinction is important for large-scale computation because the hierarchical, tree structure of RCM is highly favorable for parallel computing 2, whereas frontier propagation methods require serial computations. There are also parallels to the group renormalization method using decimation [37], [13], which constructs a multi-scale cascade of coarse-scale MRFs by a combination of node-elimination and edge-thinning, and estimates the most probable configuration at each scale using iterative methods. 2 In exact inference methods using junction trees, the benefits of parallel computing are limited by the predominant computations on the largest separators, a limitation that RCM avoids through the use of thinned boundary models.

Fig. 1. Illustration of information projection and the Pythagorean relation.

III. MODEL REDUCTION

In this section we focus on the problem of model reduction, the solution of which RCM employs in the recursive thinning of cavity and blanket models. In Section III-A, we pose model reduction as information projection to a family of GMRFs and develop a tractable maximum-entropy method to compute these projections. In Section III-C we present a greedy algorithm that uses conditional mutual information to select which edges to remove. In our development we assume that the model being thinned is tractable. This is consistent with RCM, in which we thin models, propagate them to larger ones that are still tractable, and then thin again to maintain tractability.

A. Information Projection and Maximum Entropy

Suppose that we wish to approximate a probability distribution p(x) by a GMRF defined on a graph G = (V, E). Over the family F of GMRFs on G we select q(x) to minimize the information divergence (relative entropy [38]) relative to p:

D(p ‖ q) = ∫ p(x) log [p(x)/q(x)] dx ≥ 0   (3)

As depicted in Fig. 1, minimizing (3) can be viewed as a projection of p onto F. Many researchers have adopted (3) as a natural measure of modeling error [25], [36], [29]. The problem of minimizing information divergence takes on an especially simple characterization when the approximating family F is an exponential family [39], [35], [40], that is, a family of the form p_θ(x) ∝ exp{θ · φ(x)}, where θ ∈ R^d are the exponential parameters and φ : R^n → R^d is a vector of linearly independent sufficient statistics. The family is defined by the set θ ∈ Θ for which ∫ exp{θ · φ(x)} dx < ∞. The vector of moments η = Λ(θ) ≜ E_θ{φ(x)} plays a central role in minimizing (3). In particular, it can be shown that θ ∈ Θ minimizes D(p ‖ p_θ) if and only if we have moment matching relative to F; that is, if and only if the expected values of the sufficient statistics that define F are the same under p_θ and the original density p. This optimizing element, which we refer to as the information projection of p to F and denote by p_F, is the unique member of the family F for which the following Pythagorean relation holds: D(p ‖ q) = D(p ‖ p_F) + D(p_F ‖ q) for any q ∈ F (see [35] and Fig. 1). Information projections also have a maximum entropy interpretation [41], [35] in that, among all densities q ∈ M(p) that match the moments of p relative to F, p_F is the one which maximizes the entropy h(q) = -∫ q(x) log q(x) dx. Moreover, the increase in entropy from p to p_F is precisely the value of the information loss: D(p ‖ p_F) = h(p_F) - h(p).

The family of GMRFs on a graph G is represented by an exponential family with sufficient statistics φ_G, exponential parameters θ_G and moment parameters η_G given by:

φ_G(x) = ( (x_v)_{v∈V}, (x_v²)_{v∈V}, (x_u x_v)_{{u,v}∈E} )
θ_G = ( (h_v)_{v∈V}, (-½ J_{v,v})_{v∈V}, (-J_{u,v})_{{u,v}∈E} )
η_G = ( (x̂_v)_{v∈V}, (P_{v,v} + x̂_v²)_{v∈V}, (P_{u,v} + x̂_u x̂_v)_{{u,v}∈E} )   (4)

Note that θ_G specifies h and the non-zero elements of J, while η_G specifies x̂ and the corresponding subset of elements in P. These parameters are related by a one-to-one map η_G = Λ(θ_G), defined by P = J^{-1} and x̂ = J^{-1} h, which is bijective on the image of realizable moments M(G). Given a distribution p(x), the projection to F(G) is given as follows. Using the distribution p, we compute the moments relative to G, or equivalently, the means x̂_v, variances P_{v,v} and edge-wise cross-covariances P_{u,v} on G.
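The moment map η_G = Λ(θ_G) in (4) is easy to state in code. The sketch below is our illustration (the 4-cycle and its parameters are arbitrary): it computes the means, node variances, and edgewise covariances that the information projection onto F(G) must match. Note that (4) records raw second moments, which are equivalent to variances and covariances once the means are known.

```python
# Sketch of the moment map Lambda: from (h, J), recover the moments relative to G
# (means, node variances, edgewise covariances) via P = J^{-1}, x_hat = P h.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a 4-cycle (illustrative graph)
J = np.diag(np.full(4, 2.0))
for u, v in edges:
    J[u, v] = J[v, u] = -0.5
h = np.array([0.2, -0.1, 0.0, 0.4])

P = np.linalg.inv(J)
x_hat = P @ h

eta = {
    "means": {v: x_hat[v] for v in range(4)},
    "variances": {v: P[v, v] for v in range(4)},
    "edge_covs": {(u, v): P[u, v] for u, v in edges},
}
print(eta)
```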

The information matrix J of the projection is then uniquely determined by the following complementary sets of constraints [42], [28]:

(J^{-1})_{u,v} = P_{u,v},   (u, v) ∈ Ē   (5)
J_{u,v} = 0,   (u, v) ∉ Ē   (6)

where Ē ≜ E ∪ {(v, v), v ∈ V}. Eq. (5) imposes covariance-matching conditions over G while (6) imposes Markovianity with respect to G. Also, by the maximum-entropy principle, an equivalent characterization of J is that P ≜ J^{-1} is the maximum entropy completion [43] of the partial covariance specification P_G = (P_{u,v}, (u, v) ∈ Ē). Given J, the remaining moment constraints are satisfied by setting h = J x̂. Then, (h, J) is the information form of the projection to G. Hence, projection to general GMRFs may be solved by a shifted projection to the zero-mean GMRFs, and we may focus on this zero-mean case without any loss of generality. The family of zero-mean GMRFs is described as in (4) but without the linear statistics x_v and the corresponding parameters h and moments x̂.

B. Maximum-Entropy Relative to a Chordal Super-Graph

We now develop a method to compute the projection to a graph by embedding this graph within a chordal super-graph and maximizing entropy of the chordal GMRF subject to moment constraints over the embedded subgraph. This approach allows us to exploit certain tractable calculations on chordal graphs to efficiently compute the projection to a non-chordal graph.

1) Chordal GMRFs: A graph is chordal if, for every cycle of four or more nodes, there exists an edge (a chord) connecting two non-consecutive nodes of the cycle. Let C(G) denote the set of cliques of G: the maximal subsets C ⊆ V for which the induced subgraph G_C is complete, that is, every pair of nodes in C is an edge of the graph. A useful result of graph theory states that a graph is chordal if and only if there exists a junction tree: a tree T = (Γ, E_Γ) whose nodes γ ∈ Γ are identified with cliques C_γ ∈ C(G) and where for every pair of nodes α, β ∈ Γ we have C_α ∩ C_β ⊆ C_γ for all γ along the path from α to β. Then, each edge (α, β) ∈ E_Γ determines a minimal separator S = C_α ∩ C_β of the graph. Moreover, any junction tree of a chordal graph G yields the same collection of edge-wise separators, which we denote by S(G). The importance of chordal graphs is shown by the following well-known result: Any strictly-positive probability distribution p_G(x) that is Markov on a chordal graph G can be represented in terms of its marginal distributions on the cliques C ∈ C(G) and separators S ∈ S(G) of the graph as

p_G(x) = Π_C p_C(x_C) / Π_S p_S(x_S).   (7)

In chordal GMRFs, this leads to the following formula for the sparse information matrix in terms of marginal covariances:

J = Σ_C [P_C^{-1}]_V - Σ_S [P_S^{-1}]_V.   (8)

Here, [·]_V denotes zero-padding to a |V| × |V| matrix indexed by V. In the exponential family, this provides an efficient method to compute θ_G = Λ^{-1}(η_G). Also, given the marginal covariances of an arbitrary distribution p(x), not necessarily Markov on G, (8) describes the projection of p to F(G). The complexity of this calculation is O(nw³), where w is the size of the largest clique.³

2) Entropy and Fisher Information in Chordal GMRFs: Based on (7), it follows that the entropy of a chordal MRF likewise decomposes in terms of marginal entropy on the cliques and separators of G. In the moment parameters of the GMRF, we have

h(η_G) = Σ_C h_C(η_{G_C}) - Σ_S h_S(η_{G_S}),   (9)

where h_C and h_S denote the marginal entropy of cliques and separators, computed using

h_U(η_{G_U}) = ½ (log det P_U(η_{G_U}) + |U| log 2πe).   (10)
For exponential families, it is well known that ∇h(η) = -Λ^{-1}(η), so that, for GMRFs, differentiating (9) reduces to performing the conversion (8). Thus, both h(η_G) and ∇h(η_G) can be computed with O(nw³) complexity.

Footnote 3: This complexity bound follows from the fact that, in chordal graphs, the number of maximal cliques is at most n - 1 and, in GMRFs, the computations we perform on each clique are cubic in the size of the clique.
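The decompositions (8) and (9)-(10) can be checked directly on a small chordal graph. The sketch below is ours, for illustration only: it projects an arbitrary 4-variable covariance onto a 4-node chain using (8), verifies that the projection reproduces the clique marginals, and evaluates its entropy by the clique/separator sum (9), which agrees with the closed-form Gaussian entropy.

```python
# Eq. (8) and (9)-(10) on a 4-node chain: cliques {0,1},{1,2},{2,3}, separators {1},{2}.
import numpy as np

n, cliques, seps = 4, [[0, 1], [1, 2], [2, 3]], [[1], [2]]

# Covariance of a density p that is NOT Markov on the chain (illustrative values).
A = np.array([[2.0, -0.6, 0.3, 0.1],
              [-0.6, 2.0, -0.6, 0.3],
              [0.3, -0.6, 2.0, -0.6],
              [0.1, 0.3, -0.6, 2.0]])
P = np.linalg.inv(A)

J_proj = np.zeros((n, n))                     # eq. (8): clique terms minus separator terms
for C in cliques:
    J_proj[np.ix_(C, C)] += np.linalg.inv(P[np.ix_(C, C)])
for S in seps:
    J_proj[np.ix_(S, S)] -= np.linalg.inv(P[np.ix_(S, S)])

P_proj = np.linalg.inv(J_proj)
for C in cliques:                             # projection matches p's clique marginals
    assert np.allclose(P_proj[np.ix_(C, C)], P[np.ix_(C, C)])

def h_marg(Pfull, idx):                       # eq. (10): marginal Gaussian entropy
    k = len(idx)
    return 0.5 * (np.log(np.linalg.det(Pfull[np.ix_(idx, idx)])) + k * np.log(2 * np.pi * np.e))

H = sum(h_marg(P_proj, C) for C in cliques) - sum(h_marg(P_proj, S) for S in seps)
H_exact = 0.5 * (np.log(np.linalg.det(P_proj)) + n * np.log(2 * np.pi * np.e))
assert np.isclose(H, H_exact)                 # eq. (9) reproduces the full entropy
```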

Next, we recall that the Fisher information with respect to the parameters η_G is defined by G(η_G) ≜ E_{η_G}{∇ log p(x; η_G) ∇ log p(x; η_G)^T} = -∇∇^T h(η_G), where ∇ denotes the gradient with respect to η_G and the expectation is with respect to the unique element p ∈ F(G) with moments η_G. Then, G(η_G) is a symmetric, positive-definite matrix and also describes the negative Hessian of entropy in exponential families. By twice differentiating (9), it follows that, in chordal GMRFs, the Fisher information matrix has a sparse representation in terms of marginal Fisher information defined on the cliques and separators of the graph:

G(η_G) = Σ_C [G_C(η_{G_C})]_G - Σ_S [G_S(η_{G_S})]_G,   (11)

where G_U(η_{G_U}) ≜ -∇∇^T h_U(η_{G_U}) is the marginal Fisher information on U and [·]_G denotes zero-padding to a matrix indexed by the nodes and edges of G. From (11), we observe that the fill pattern of G(η_G) defines another chordal graph with the same junction tree as G, but where each clique C ∈ C(G) maps to a larger clique with O(|C|²) nodes (corresponding to a full sub-matrix of G(η_G) indexed by the nodes and edges of G_C). For this reason, direct use of G(η_G), viewed simply as a sparse matrix, is undesirable if G contains larger cliques. However, we can specify implicit methods that exploit the special structure of G to implement multiplication by either G or G^{-1} with O(nw³) complexity. Observing that G(η_G) = ∂θ_G/∂η_G represents the Jacobian of the mapping from η_G to θ_G, we can compute matrix-vector products dθ_G = G dη_G for an arbitrary input dη_G (viewed as a change in moment coordinates). Differentiating (8) using d(P_U^{-1}) = -P_U^{-1} dP_U P_U^{-1}, we obtain:

dJ = -Σ_C [P_C^{-1} dP_C P_C^{-1}]_V + Σ_S [P_S^{-1} dP_S P_S^{-1}]_V.   (12)

Similarly, we can compute dη_G = G^{-1} dθ_G by differentiating η_G = Λ(θ_G). In Appendix A, we summarize a recursive inference algorithm, defined relative to a junction tree of G, that computes η_G given θ_G, and derive a corresponding differential form of the algorithm that computes dη_G given dθ_G. These methods are used to efficiently implement the variational method described next.

3) Maximum-Entropy Optimization: Given a GMRF p on G, we develop a maximum-entropy (ME) method to compute the projection to an arbitrary (non-chordal) sub-graph S. Let G' be a chordal super-graph of G and let R = E(G') \ E(S), such that η_{G'} = (η_S, η_R). We may compute η_{G'}(p) using recursive inference on a junction tree of G' (see Appendix A). To compute the projection to S, we maximize entropy in the chordal GMRF subject to moment constraints over the sub-graph S. This may be formulated as a convex optimization problem:

min_{η_R} f(η_R) ≜ -h(η_S, η_R)   s.t.   (η_S, η_R) ∈ M(G')   (13)

Here, η_{G'} ∈ M(G') are the realizable moments of the GMRF defined on G'.⁴ Starting from η_{G'}^{(0)} = η_{G'}(p), we compute a sequence η_{G'}^{(k)} = (η_S, η_R^{(k)}) using Newton's method. For each k, this requires solving the linear system

G_R^{(k)} Δη_R^{(k)} = -θ_R^{(k)}   (14)

where G_R^{(k)} = ∇∇^T f is the principal sub-matrix of G(η_{G'}^{(k)}) corresponding to R and θ_R^{(k)} = ∇f is the corresponding sub-vector of θ_{G'}^{(k)} = Λ^{-1}(η_{G'}^{(k)}) computed using (8). We then set η_R^{(k+1)} = η_R^{(k)} + λ Δη_R^{(k)}, where λ ∈ (0, 1] is determined by back-tracking line search to stay within M(G') and to ensure that entropy is increased. This procedure converges to the optimal η_{G'}^* = (η_S, η_R^*), for which the corresponding exponential parameters satisfy θ_R^* = 0. Then, θ_S^* is the information projection to S.
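The implicit matrix-vector product in (12) can be sketched at the matrix level as follows. This is our illustration only: it glosses the bookkeeping between the matrix pair (dP, dJ) and the vectorized coordinates dη_G, dθ_G, and it uses an arbitrary chain and random moments. The finite-difference check confirms that (12) is the differential of the map (8).

```python
# Jacobian-vector product of the map (8), eta -> theta, without forming G explicitly.
import numpy as np

n, cliques, seps = 4, [[0, 1], [1, 2], [2, 3]], [[1], [2]]
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)); P = A @ A.T + n * np.eye(n)     # some PD "moment" matrix

def theta_of_eta(P):
    """Eq. (8): assemble J from clique/separator marginal covariances."""
    J = np.zeros((n, n))
    for C in cliques: J[np.ix_(C, C)] += np.linalg.inv(P[np.ix_(C, C)])
    for S in seps:    J[np.ix_(S, S)] -= np.linalg.inv(P[np.ix_(S, S)])
    return J

def G_matvec(P, dP):
    """Eq. (12): dJ = -sum_C [P_C^-1 dP_C P_C^-1] + sum_S [P_S^-1 dP_S P_S^-1]."""
    dJ = np.zeros((n, n))
    for C in cliques:
        Pi = np.linalg.inv(P[np.ix_(C, C)]); dJ[np.ix_(C, C)] -= Pi @ dP[np.ix_(C, C)] @ Pi
    for S in seps:
        Pi = np.linalg.inv(P[np.ix_(S, S)]); dJ[np.ix_(S, S)] += Pi @ dP[np.ix_(S, S)] @ Pi
    return dJ

dP = rng.standard_normal((n, n)); dP = 0.5 * (dP + dP.T)         # symmetric perturbation
eps = 1e-6
fd = (theta_of_eta(P + eps * dP) - theta_of_eta(P)) / eps        # finite difference of (8)
assert np.allclose(fd, G_matvec(P, dP), atol=1e-4)
```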
Finally, we discuss an efficient method to compute the Newton step: If the width w of the chordal graph is very small, say w < 10, we could explicitly form the sparse matrix G_R and efficiently solve (14) using direct methods. However, this approach has O(nw⁶) complexity, which is undesirable for larger values of w. Instead, we use an inexact Newton step, obtained by approximate solution of (14) using standard iterative methods, for instance, preconditioned conjugate gradients (PCG).

Footnote 4: In the chordal graph G', the condition that η_{G'} is realizable is equivalent to P_C(η_{G'_C}) ≻ 0 for all C ∈ C(G'), which can be verified with complexity O(nw³). Thus, M(G') is convex because the set of positive-definite matrices on each clique is convex.

Such methods generally require an efficient method to compute the matrix-vector products G_R Δη_R, which we can provide using the implicit method, based on (12), for multiplication by G. Also, to obtain rapid convergence, it is important to provide an efficient preconditioner, which approximates (G_R)^{-1}. For our preconditioner, we use (G^{-1})_R,⁵ implemented using an implicit method for multiplication by G^{-1} described in Appendix A. In this way, we obtain iterative methods that have O(nw³) complexity per iteration. Using the PCG method, we find that a small number of iterations (typically, 3-12) is sufficient to obtain a good approximation to each Newton step, leading to rapid convergence in Newton's method, but with significantly less overall computation for larger values of w than is required using direct methods.

C. Greedy Model Thinning

In this section, we propose a simple greedy strategy for thinning a GMRF model. This entails selecting edges of the graph which correspond to weak statistical interactions between variables and pruning these edges from the GMRF by information projection. The quantity we use to measure the strength of interaction between x_u and x_v is the conditional mutual information [38],

I_{u,v}(p) ≜ E_p{ log [ p(x_u, x_v | x_{V \ {u,v}}) / ( p(x_u | x_{V \ {u,v}}) p(x_v | x_{V \ {u,v}}) ) ] } = -½ log( 1 - J_{u,v}² / (J_{u,u} J_{v,v}) ),

which is the average mutual information between x_u and x_v after conditioning on the other variables x_{V \ {u,v}}. In GMRFs, we can omit edge {u, v} from G, without any modeling error, if and only if x_u and x_v are conditionally independent given x_{V \ {u,v}}, that is, if and only if I_{u,v}(p) = 0. This suggests using the value of I_{u,v}(p), which is tractable to compute, to select edges {u, v} ∈ E to remove. To motivate this idea further, we note that I_{u,v}(p) is closely related to the information loss resulting from removing edge {u, v} from G by information projection. Let G \ {u, v} = (V, E \ {u, v}) denote the sub-graph of G with edge {u, v} removed and let K denote the complete graph on V. Then, observing that G \ {u, v} is a sub-graph of K \ {u, v}, we have, by the Pythagorean relation with respect to p_{K \ {u,v}},

D(p ‖ p_{G \ {u,v}}) = D(p ‖ p_{K \ {u,v}}) + D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) = I_{u,v}(p) + D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) ≥ I_{u,v}(p),

where we have used D(p ‖ p_{K \ {u,v}}) = I_{u,v}(p) and D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) ≥ 0. Thus, I_{u,v}(p) is a lower bound on the information loss D(p ‖ p_{G \ {u,v}}). Moreover, for p ∈ F(G) having a small value of I_{u,v}(p), we find that D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) tends to be small relative to I_{u,v}(p), so that I_{u,v}(p) then provides a good estimate of D(p ‖ p_{G \ {u,v}}). In other words, removing edges with small conditional mutual information is roughly equivalent to picking those edges to remove that result in the least modeling error.

We use the following greedy approach to thin a GMRF defined on G. Let δ > 0 specify the tolerance on conditional mutual information for removal of an edge. We compute I_{u,v}(p) for all edges {u, v} ∈ E and select a subset of edges R ⊆ E with I_{u,v}(p) < δ to remove. The information projection to the sub-graph S = (V, E \ R) is then computed using our ME method as described in Section III-B (relative to a chordal super-graph of G). Because the values of I_{u,v} in this information projection will generally differ from their prior values, we may continue to thin the GMRF until all the remaining edges have I_{u,v} > δ. Also, by limiting the number of edges removed at each step, it is possible to take into account the effect of removing the weakest edges before selecting which other edges to remove, which can help reduce the overall information loss.
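A minimal sketch of the edge score used in this greedy procedure follows (ours; the information matrix and the threshold δ are illustrative, and the subsequent information projection onto the thinned graph is not shown).

```python
# Conditional mutual information of each edge of a GMRF, I_uv = -0.5*log(1 - J_uv^2/(J_uu*J_vv)),
# and a threshold test selecting weak edges as candidates for removal.
import numpy as np

J = np.array([[2.0, -0.7, 0.0, -0.02],
              [-0.7, 2.0, -0.6, 0.0],
              [0.0, -0.6, 2.0, -0.05],
              [-0.02, 0.0, -0.05, 2.0]])
edges = [(u, v) for u in range(4) for v in range(u + 1, 4) if J[u, v] != 0.0]

def cond_mutual_info(J, u, v):
    r2 = J[u, v] ** 2 / (J[u, u] * J[v, v])   # squared partial correlation
    return -0.5 * np.log(1.0 - r2)

delta = 1e-3
scores = {(u, v): cond_mutual_info(J, u, v) for u, v in edges}
weak = [e for e, s in scores.items() if s < delta]
print(scores, weak)   # weak edges would be pruned by information projection
```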
IV. RECURSIVE CAVITY MODELING

We now flesh out the details of RCM. In Section IV-A, we specify the hierarchical tree representation of the GMRF that we use, and in Section IV-B, we define information forms and the three basic operations we use: composition, elimination and model reduction. These forms and operators are the components we use to build our two-pass, recursive, message-passing inference algorithm on the hierarchical tree.

Footnote 5: To motivate this preconditioner, we note that (G_R)^{-1} is given by the Schur complement H_R - H_{R,S} H_S^{-1} H_{S,R} with respect to H ≜ G^{-1} = cov{φ(x)}. Hence, our preconditioner H_R = (G^{-1})_R arises by neglecting the intractable term H_{R,S} H_S^{-1} H_{S,R}, which is a good approximation if the correlation H_{S,R} is weak relative to H_R and H_S.

Fig. 2. (a) The tree T = (Γ, E_Γ) based on (b) the collection U of nested subsets of vertices V of the underlying graph G.

First, as described in Section IV-C, we perform an upward pass on the tree which constructs cavity models. Next, as described in Section IV-D, we perform a downward pass on the tree which constructs blanket models and also estimates marginal variances and edge-wise covariances in the GMRF. Lastly, in Section IV-E, we describe a hierarchical preconditioner, using the cavity models computed by RCM, and an iterative estimation algorithm that computes the means for all vertices of the GMRF.

Before we proceed, we define some basic notation with respect to the graph G = (V, E) describing the Markov structure of x. Given U ⊆ V, let U^c ≜ V \ U denote the set complement of U in V and let ∂U ≜ {v ∈ U^c : (u, v) ∈ E for some u ∈ U} denote the blanket of U in G. Also, ∂(U^c) is the surface of U and U° ≜ U \ ∂(U^c) is its interior. These definitions are illustrated in Fig. 3(a), (b) and (c).

A. Hierarchical Tree Structure

We begin by requiring that the graphical model is recursively dissected into a hierarchy of nested subfields as indicated in Fig. 2. First, we describe a bottom-up construction. Let the set V be partitioned into a collection L of many small, disjoint subsets chosen so as to induce low-diameter, connected subgraphs in G over which exact inference is tractable. These small sets of vertices are recursively merged into larger and larger subfields until only the entire set V remains. Only adjacent subfields are merged, so as to induce connected subgraphs. Also, merging should (ideally) keep the diameter of these connected subgraphs as small as possible. To simplify presentation only, we assume that subfields are merged two at a time. This generates a collection U ⊆ 2^V containing the smallest sets in L as well as each of the merged sets up to and including V. Alternatively, such a dissection can be constructed in a top-down fashion by recursively splitting the graph, and resulting sub-graphs, into roughly equal parts chosen so as to minimize the number of cut edges at each step. For instance, in 2D lattices this is simply achieved by performing an alternating series of vertical and horizontal cuts. In any case, this recursive dissection of the graph defines a tree T = (Γ, E_Γ), in which each node γ ∈ Γ corresponds to a subset U_γ ∈ U and with directed edges E_Γ linking each dissection cell to its immediate sub-cells. We let π(γ) denote the parent of node γ in this tree. Also, the children of γ are denoted π^{-1}(γ) = {α(γ), β(γ)}, or more simply {α, β} where γ has been explicitly specified. The following vertex sets are defined for each U_γ ∈ U relative to the graph G:

B_γ ≜ ∂U_γ,   R_γ ≜ ∂(U_γ^c).   (15)

As seen in Figs. 3(a), (b) and (c), the blanket B_γ is the outer boundary of U_γ while the surface R_γ is its inner boundary, and either serves as a separator between the interior U_γ° and the rest of the field. Also, the following separators are used in RCM:

S^γ ≜ R_α ∪ R_β,   S_α ≜ B_γ ∪ R_β,   S_β ≜ B_γ ∪ R_α.   (16)

The separator S^γ, used in the RCM upward pass, is the union of the surfaces of the two children of a subfield (see Fig. 3(d) and (e)). The separators S_α and S_β, used in the RCM downward pass, are each the union of its parent's blanket and its sibling's surface (see Fig. 3(d) and (f)).
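A small sketch of the vertex-set notation just introduced, for a 4-connected grid (our code; the grid and subfield are arbitrary examples):

```python
# Blanket, surface, and interior of a subset U of a grid graph.
from itertools import product

def grid_graph(rows, cols):
    adj = {(r, c): set() for r, c in product(range(rows), range(cols))}
    for (r, c) in adj:
        for dr, dc in ((1, 0), (0, 1)):
            if (r + dr, c + dc) in adj:
                adj[(r, c)].add((r + dr, c + dc)); adj[(r + dr, c + dc)].add((r, c))
    return adj

def blanket(adj, U):      # dU: nodes outside U with a neighbour in U
    return {w for u in U for w in adj[u]} - U

def surface(adj, U):      # nodes of U with a neighbour outside U (= blanket of the complement)
    return blanket(adj, set(adj) - U)

def interior(adj, U):
    return U - surface(adj, U)

adj = grid_graph(5, 5)
U = {(r, c) for r in range(1, 4) for c in range(1, 4)}        # a 3x3 subfield
print(len(blanket(adj, U)), len(surface(adj, U)), len(interior(adj, U)))   # 12 8 1
```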
These separators define a Markov tree representation, with respect to T, of the original GMRF defined on G [24]: For each leaf γ of T define the state vector x_γ ≜ x_{U_γ}. For each non-leaf γ let x_γ ≜ x_{S^γ}. By construction, each S^γ is a separator of the graph, that is, the interiors of U_α and U_β and the complement U_γ^c are mutually separated by S^γ. Hence, all conditional independence relations required by the Markov tree are satisfied by the underlying GMRF. However, we are interested in the large class of models for which exact inference on such a Markov tree representation is not feasible because of the large size of some of the separators. As discussed in Section II-C, we instead perform reduced-order modeling of these variables, corresponding to a thinned, tractable graphical model on each separator.

Fig. 3. Illustrations of the graph G of a GMRF and of our notation used to indicate subfields: (a) the subfield U_γ and its complement U_γ^c; (b) the blanket B_γ = ∂U_γ; (c) the interior U_γ° and surface R_γ = ∂(U_γ^c); (d) partitioning of U_γ into sub-cells U_α and U_β; (e) separator S^γ = R_α ∪ R_β; (f) separator S_α = R_β ∪ B_γ.

B. Information Kernels

In the sequel, we let (h_U, J_U), where U ⊆ V, h_U ∈ R^U and J_U ∈ R^{U×U} is symmetric positive definite, represent the information kernel f_U : R^U → R_+ defined by:

f_U(x_U; h_U, J_U) = exp{-½ x_U^T J_U x_U + h_U^T x_U}   (17)

The subscript U indicates the support of the information kernel, and of the matrices h_U and J_U. Generally, f_U corresponds (after normalization) to a density over the variables x_U parameterized by h_U and J_U. In RCM, the set U is typically a separator of the graph, and h_U and J_U are approximations to the exact distribution in question so that J_U is sparse. We also use matrices J_{U,W}, where U, W ⊆ V and J_{U,W} ∈ R^{U×W}, to represent the function

f_{U,W}(x_U, x_W; J_{U,W}) = exp{-x_U^T J_{U,W} x_W},   (18)

which describes the interaction between subfields U, W. We adopt the following notations: Let h_U[W] denote the sub-vector of h_U indexed by W ⊆ U. Likewise, J_U[W_1, W_2] denotes the sub-matrix of J_U indexed by W_1 × W_2, and we write J_U[W] = J_U[W, W] to indicate a principal sub-matrix. Given two disjoint subfield models (h_{U_1}, J_{U_1}) and (h_{U_2}, J_{U_2}) and the interaction J_{U_1,U_2}, we let (h_U, J_U) = (h_{U_1}, J_{U_1}) ⊕ J_{U_1,U_2} ⊕ (h_{U_2}, J_{U_2}) denote the joint model on U = U_1 ∪ U_2 defined by

h_U = [h_{U_1}; h_{U_2}],   J_U = [J_{U_1}, J_{U_1,U_2}; J_{U_1,U_2}^T, J_{U_2}],

which corresponds to multiplication of information kernels or addition of their information forms. Given an information form (h_U, J_U) and D ⊆ U to be eliminated, we let (ĥ_S, Ĵ_S) = Π̂_S(h_U, J_U) ≡ Π̂_{\D}(h_U, J_U) denote⁶ the operation of Gaussian elimination (GE) defined by S = U \ D and

ĥ_S = h_U[S] - J_U[S, D] J_U[D]^{-1} h_U[D]   (19)
Ĵ_S = J_U[S] - J_U[S, D] J_U[D]^{-1} J_U[D, S]   (20)

The matrix Ĵ_S is the Schur complement of the sub-matrix J_U[D] in J_U. Straightforward manipulations lead to the following well-known result:

(Ĵ_S)^{-1} = (J_U^{-1})[S]   and   (Ĵ_S)^{-1} ĥ_S = (J_U^{-1} h_U)[S].   (21)

Thus, the information form (ĥ_S, Ĵ_S) corresponds to the marginal on S with respect to the model (h_U, J_U). Also, GE may be implemented recursively as follows: given an elimination order (d_1, ..., d_n) of the elements in D, compute (20) as Π̂_{\d_n} ∘ ··· ∘ Π̂_{\d_1}(h_U, J_U), that is, by eliminating one variable at a time. Note also that only those entries of h_U and J_U indexed by ∂D are modified by GE.

Footnote 6: Two notations are introduced, as in some cases S = U \ D is given explicitly, while in others it is only implicitly specified in terms of U and D.
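The two information-form operators just defined, composition and Gaussian elimination (19)-(20), are short to write down. The following sketch is ours, using dense numpy for clarity rather than the sparse representations RCM relies on; it also checks the marginalization identity (21) on a random example.

```python
# Composition and Gaussian elimination of information forms, with a check of (21).
import numpy as np

def compose(h1, J1, h2, J2, J12):
    """Joint model of two disjoint subfields with interaction J12."""
    return np.concatenate([h1, h2]), np.block([[J1, J12], [J12.T, J2]])

def eliminate(h, J, keep, drop):
    """Eqs. (19)-(20): return (h_S, J_S) with S = keep, after eliminating drop."""
    Jsd = J[np.ix_(keep, drop)]
    Jdd_inv = np.linalg.inv(J[np.ix_(drop, drop)])
    h_S = h[keep] - Jsd @ Jdd_inv @ h[drop]
    J_S = J[np.ix_(keep, keep)] - Jsd @ Jdd_inv @ J[np.ix_(drop, keep)]
    return h_S, J_S

rng = np.random.default_rng(1)
A1 = rng.standard_normal((3, 3)); J1 = A1 @ A1.T + 3 * np.eye(3)
A2 = rng.standard_normal((2, 2)); J2 = A2 @ A2.T + 2 * np.eye(2)
J12 = 0.1 * rng.standard_normal((3, 2))
h1, h2 = rng.standard_normal(3), rng.standard_normal(2)

h, J = compose(h1, J1, h2, J2, J12)
keep, drop = [0, 1, 2], [3, 4]
h_S, J_S = eliminate(h, J, keep, drop)

P = np.linalg.inv(J)                          # eq. (21): the eliminated form is the marginal on S
assert np.allclose(np.linalg.inv(J_S), P[np.ix_(keep, keep)])
assert np.allclose(np.linalg.solve(J_S, h_S), (P @ h)[keep])
```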

Fig. 4. Initialization of a cavity model for a small subfield U_γ ∈ L, corresponding to a leaf of T: (a) the initial subfield model J[U_γ], a sub-matrix of J; (b) the cavity model Ĵ_{R_γ} = Π̂ J[U_γ] after Gaussian elimination of the interior variables U_γ° = U_γ \ R_γ; (c) the final thinned cavity model J_{R_γ} = Π_δ Ĵ_{R_γ} defined on the surface R_γ of subfield U_γ.

Hence, GE is a localized operation within the graphical representation of the GMRF, as suggested by Figs. 4(a) and 4(b). However, eliminating D typically has the effect of causing Ĵ_S[∂D] to become full, as shown in Fig. 4(b). This creation of fill can spoil the graphical model so that recursive GE becomes intractable, with worst-case cubic complexity in dense graphs.

Given an information matrix J_U, we denote the result of model-order reduction by Π_δ J_U. The model reduction algorithm in Section III requires specifying a parameter δ which controls the tolerance on conditional mutual information for the removal of an edge. The procedure then determines which edges in the graph corresponding to J_U to remove and determines the projection to this thinned graph. This projection preserves variances and edgewise cross-covariances on the thinned graph, which is equivalent to Π̂_C(Π_δ J_U) = Π̂_C J_U for each clique C ⊆ U of the thinned graph.

In the following sections, we first develop our two-pass approximate inference procedure, focusing on calculation of just the information matrices, which are all independent of h. Then, we provide additional calculations involving h and x̂, presented as a separate two-pass procedure which then serves as a preconditioner in an iterative method.

C. Upward Pass: Cavity Model Propagation

In this first step, messages are passed from the leaves of the tree L up towards the root V. These upward messages take the form of cavity models, encoding conditional statistics of variables lying in the surfaces of given subfields. To be precise, each cavity model, represented by an information matrix J_{R_γ}, approximates a conditional density p(x_{R_γ} | x_{B_γ} = 0) so that J_{R_γ} is a tractable, thin matrix.

1) Leaf-Node Initialization: For each U_γ ∈ L we initialize a cavity model as follows: We begin with the local information matrix J[U_γ] as depicted in Fig. 4(a). This specifies the conditional density p(x_{U_γ} | x_{B_γ} = 0) ∝ f(x_{U_γ}; 0, J[U_γ]). We then eliminate all variables within the interior of U_γ by Gaussian elimination: Ĵ_{R_γ} = Π̂_{R_γ} J[U_γ]. This has the effect of deleting all nodes in the interior of U_γ and updating the matrix parameters on the surface. As indicated in Fig. 4(b), this also induces fill within the information matrix. To ensure tractable computations in later stages, we thin this model: J_{R_γ} = Π_δ Ĵ_{R_γ}, yielding a reduced-order cavity model, as shown in Fig. 4(c), for each subfield U_γ ∈ L. Then we are ready to proceed up the tree, growing larger cavity models from smaller ones.

2) Growing Cavity Models: Let U_γ ⊆ V be a subfield in U where we have already constructed the two cavity models for R_α and R_β as depicted in Fig. 5(a). Then, we construct the cavity model for U_γ as follows:

a) Join Cavity Models: First, we form the composition of the two sub-cavity models as indicated in Fig. 5(b): J_{S^γ} = J_{R_α} ⊕ J[R_α, R_β] ⊕ J_{R_β}. Note that S^γ = R_α ∪ R_β is a superset of R_γ.

b) Variable Elimination: Next, we must eliminate the extra variables D_γ ≜ S^γ \ R_γ, to obtain the marginal information matrix Ĵ_{R_γ} = Π̂_{R_γ} J_{S^γ}.
To ensure scalability, rather than eliminating all variables at once, we eliminate variables recursively beginning with those farthest from the surface and working our way towards the surface. This is an efficient computation thanks to model reductions performed previously in U α and U β. c) Model Thinning: This preceding elimination step induces fill across the cavity (Fig. 5(c)). Hence, to maintain tractability as we continue, we perform model-order reduction yielding J Rγ = Π δ Ĵ Rγ which is the desired reduced-order cavity model represented in Fig. 5(d). This projection step requires that we compute moments of the graphical model specified by ĴR γ. Thanks to model thinning in the subtree of T rooted at γ, these moments can be computed efficiently.

Fig. 5. Recursive construction of a cavity model: (a) cavity models J_{R_α}, J_{R_β} of sub-cells U_α, U_β; (b) joined cavity model J_{S^γ} = J_{R_α} ⊕ J[R_α, R_β] ⊕ J_{R_β} defined on separator S^γ = R_α ∪ R_β; (c) the cavity model Ĵ_{R_γ} = Π̂ J_{S^γ} after Gaussian elimination of variables S^γ \ R_γ; (d) the final thinned cavity model J_{R_γ} = Π_δ Ĵ_{R_γ} defined on the surface R_γ of subfield U_γ.

Fig. 6. Recursive construction of a blanket model: (a) the cavity model J_{R_β} of the sibling subfield U_β and the blanket model J_{B_γ} of the parent; (b) joined cavity/blanket model J_{S_α} = J_{R_β} ⊕ J[R_β, B_γ] ⊕ J_{B_γ} defined on the separator S_α = R_β ∪ B_γ; (c) the blanket model Ĵ_{B_α} = Π̂ J_{S_α} after Gaussian elimination of variables S_α \ B_α; (d) the final thinned blanket model J_{B_α} = Π_δ Ĵ_{B_α} defined on the blanket B_α of subfield U_α.

D. Downward Pass: Blanket Model Propagation

The next, downward pass on the tree T involves messages in the form of blanket models, that is, graphical models encoding the conditional statistics of variables lying in the blanket of some subfield. Each subfield's blanket model is a concise summary of the complement of that subfield sufficient for near-optimal inference within the subfield. Specifically, the blanket model J_{B_γ} is a tractable approximation of the conditional model p(x_{B_γ} | x_{R_γ} = 0).

1) Root-Node Initialization: Note that the blanket of V is the empty set, so that a blanket model is not required for the root of T. As we move down to the children U_{α(0)} and U_{β(0)}, we note that B_{α(0)} = R_{β(0)} and, hence, a blanket model for U_{α(0)} is given by the cavity model for U_{β(0)}, which was computed in the upward pass. Hence, we already have blanket models for U_{α(0)} and U_{β(0)} and are ready to build blanket models for their descendants.

2) Shrinking Blanket Models: Suppose that we already have the blanket model for U_γ as represented in Fig. 6(a). Then, we can construct the blanket model for the child U_α as follows:

a) Joining Blanket and Sub-Cavity Model: First, we form the composition of the blanket model defined on B_γ with the cavity model defined on R_β (from the sibling of α) as shown in Fig. 6(b): J_{S_α} = J_{B_γ} ⊕ J[B_γ, R_β] ⊕ J_{R_β}. Note that S_α = B_γ ∪ R_β is a superset of B_α.

b) Variable Elimination: Next, we eliminate all variables in D_γ ≜ S_α \ B_α, yielding Ĵ_{B_α} = Π̂_{B_α} J_{S_α}. To ensure scalable computations, we again perform variable elimination recursively, starting with vertices farthest from the blanket and working our way towards U_α. The result is depicted in Fig. 6(c).

c) Model Thinning: Lastly, we thin this resulting blanket model: J_{B_α} = Π_δ Ĵ_{B_α}, yielding our reduced-order blanket model for subfield U_α (Fig. 6(d)). The blanket model for U_β is computed in an identical manner with the roles of α and β reversed.

3) Leaf-Node Marginalization: Once we have constructed a blanket model for each of the smallest subfields U_γ ∈ L, we can join this model with the conditional model for the enclosed subfield (that is, the model used to seed the upwards pass), to obtain a graphical model approximation of the (zero-mean) marginal density p(x_{Ū_γ}) on Ū_γ ≜ U_γ ∪ B_γ, given in information form by J_{Ū_γ} = J[U_γ] ⊕ J[U_γ, B_γ] ⊕ J_{B_γ}. Inverting each of these localized models, that is, computing P_{Ū_γ} = (J_{Ū_γ})^{-1}, yields variances of all variables and covariances for each edge of G.
E. An RCM-Preconditioner for Iterative Estimation

In the preceding sections, we have described a recursive algorithm for constructing a hierarchical collection of cavity and blanket models, described by thin information matrices. In this section, we describe how to extend these computations to compute the estimates x̂ solving J x̂ = h. We begin by describing a two-pass algorithm, based on the cavity models computed previously, which computes an approximation of x̂, and then describe an iterative procedure, using the two-pass algorithm as a preconditioner, that iteratively refines the estimate.

1) Upward-Pass: We specify a recursive algorithm that works its way up the tree, computing a potential vector h_{R_γ} at each node γ ∈ Γ of the dissection tree. Let J_{R_γ} denote the cavity models computed previously by the RCM upward pass. For each leaf node, we solve J[U_γ] x_{U_γ} = h[U_γ] for x_{U_γ} and then compute h_{R_γ} = J_{R_γ} x_{U_γ}[R_γ]. At each non-leaf node, we compute h_{R_γ} as follows:

a) Join: Form the composite model (h_{S^γ}, J_{S^γ}) = (h_{R_α}, J_{R_α}) ⊕ J[R_α, R_β] ⊕ (h_{R_β}, J_{R_β}), where S^γ = R_α ∪ R_β, by joining the two cavity models from the children.

b) Sparse Solve: Given this joint model, we solve J_{S^γ} x_{S^γ} = h_{S^γ} using direct methods, which is tractable because J_{S^γ} is a thin, sparse matrix.⁷

c) Sparse Multiply: Finally, we compute the potential vector h_{R_γ} = J_{R_γ} x_{S^γ}[R_γ], which is a tractable computation because J_{R_γ} is sparse.

2) Downward-Pass (Back-Substitution): Once the root node is reached, we have the information form (h_{S^0}, J_{S^0}) = (h_{R_{α(0)}}, J_{R_{α(0)}}) ⊕ J[R_{α(0)}, R_{β(0)}] ⊕ (h_{R_{β(0)}}, J_{R_{β(0)}}) at the top-level separator of the dissection tree, which is an approximate model for the marginal distribution p(x_{S^0}). Hence, we can compute an approximation for the means x̂_{S^0} by solving J_{S^0} x̂_{S^0} = h_{S^0}. Conditioning on this estimate, we can then recurse back down the tree, filling in the missing values of x̂ along each separator, thereby propagating estimates down the tree. In this downward pass, each node below the root of the tree receives an estimate x̂_{R_γ} of the variables in the surface R_γ of the corresponding subfield. Again using the model (h_{S^γ}, J_{S^γ}), formed by the upward computations, we interpolate into the subfield, computing x̂_{D_γ}, where D_γ = S^γ \ R_γ, by solution of the linear system of equations

J_{D_γ} x̂_{D_γ} = h_{D_γ},   where   J_{D_γ} ≜ J_{S^γ}[D_γ],   h_{D_γ} ≜ h_{S^γ}[D_γ] - J_{S^γ}[D_γ, R_γ] x̂_{R_γ}.   (22)

The estimate x̂_{D_γ} is computed with respect to the approximation of p(x_{D_γ} | x̂_{R_γ}) ∝ f(x_{D_γ}; h_{D_γ}, J_{D_γ}) (after normalization), which is approximate because of the model thinning steps in RCM. Once the leaves of the tree are reached, the interior of each subfield is interpolated similarly, thus yielding a complete estimate x̂.

3) Richardson Iteration: The preceding two-pass algorithm may be used to compute an approximate solution of Jx = b for an arbitrary right-hand side b. The resulting estimate is linear in b and we denote this linear operator by M. Using M as a preconditioner,⁸ we compute a sequence of estimates {x̂^{(n)}} defined by x̂^{(0)} = 0 and

x̂^{(n+1)} = x̂^{(n)} + M (h - J x̂^{(n)}).   (23)

Let ρ denote the spectral radius of I - MJ. If ρ < 1, then x̂^{(n)} converges to x̂ ≜ J^{-1} h, with ‖x̂^{(n)} - x̂‖ on the order of ρ^n ‖x̂‖. For small δ, this condition is met and we achieve rapid convergence to the correct means.
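A minimal sketch of the Richardson iteration (23) follows (our illustration): M here is a simple block-solve stand-in for the RCM two-pass preconditioner, and the matrix and block partition are invented for the example.

```python
# Preconditioned Richardson iteration x <- x + M(h - J x).
import numpy as np

def richardson(J, h, M, iters=200, tol=1e-10):
    x = np.zeros_like(h)
    for _ in range(iters):
        r = h - J @ x                        # residual
        if np.linalg.norm(r) < tol * np.linalg.norm(h):
            break
        x = x + M(r)                         # preconditioned correction
    return x

n = 6
J = 2.0 * np.eye(n) + np.diag(np.full(n - 1, -0.9), 1) + np.diag(np.full(n - 1, -0.9), -1)
h = np.ones(n)

blocks = [list(range(0, 3)), list(range(3, 6))]
def M(r):                                    # stand-in preconditioner: exact solves on two blocks
    y = np.zeros_like(r)
    for b in blocks:
        y[b] = np.linalg.solve(J[np.ix_(b, b)], r[b])
    return y

x = richardson(J, h, M)
assert np.allclose(J @ x, h, atol=1e-8)
```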
V. APPLICATIONS IN REMOTE SENSING

In this section we develop two applications of RCM in remote sensing: (1) interpolation of satellite altimetry measurements of sea-surface height, and (2) estimation of the surface of a large salt-deposit beneath the Gulf of Mexico. The purpose of these examples is to demonstrate that RCM scales well to very large problems while yielding estimates and error covariances that are close to those that would have resulted if exact optimal estimation had been performed instead. Although the specific statistical models used in these examples are perhaps over-simplified, the results that follow (which include space-varying measurement densities and hence space-variant estimation errors) do serve to demonstrate the applicability of RCM to very large spatial estimation problems.

Footnote 7: We use a sparse Cholesky factorization of J_{S^γ} and back-substitution based on h_{S^γ}. Also, some computation can be saved if we use an elimination order beginning with S^γ \ R_γ, because we only need to compute x_{S^γ}[R_γ] in the back-substitution.

Footnote 8: To implement M b efficiently, we pre-compute and store calculations that do not depend on b. For instance, we compute a sparse Cholesky factorization for each J_{R_γ} using a low-fill elimination order. This leads to an extremely fast preconditioner because only back-substitution steps are required each time we apply M to a different b vector.

A. Model Specifications

Throughout this section, we consider GMRFs of the form

p(x, y) ∝ exp{ -½ ( ‖Dx‖²/σ_r² + ‖y - Cx‖²/σ_d² ) }   (24)

where x ∈ R^n represents the vector of field values at the vertices of a regular 2D lattice and y ∈ R^m is a vector of local, noisy measurements of the underlying field at an irregular set of points scattered throughout the field. Here, ‖Dx‖² represents our prior for x, which serves to regularize the field, and the data-fidelity term ‖y - Cx‖² represents our measurement model. We consider two prior models commonly used in image processing. The thin-membrane (TM) model is defined such that each row d_k corresponds to an edge {u, v} ∈ E, and has two non-zero components: d_{k,u} = +1 and d_{k,v} = -1. This gives a regularization term

‖Dx‖² = Σ_{{u,v}∈E} (x_u - x_v)²   (25)

that penalizes gradients, favoring level surfaces. The thin-plate (TP) model is defined such that each row d_v corresponds to a vertex v ∈ V and has non-zero components d_{v,v} = 1 and d_{v,u} = -1/|N(v)| for adjacent vertices u ∈ N(v). This gives a regularization term

‖Dx‖² = Σ_{v∈V} ( x_v - (1/|N(v)|) Σ_{u∈N(v)} x_u )²   (26)

that penalizes curvature, favoring flat surfaces. In general, the locations of the measurements y define an irregular pattern with respect to the grid defined for x. Moreover, the location of individual measurements may fall between these grid points. For this reason, each measurement y_t is modeled as the bilinear interpolation c_t x of the four nearest grid points to the actual measurement location, corrupted by zero-mean, white Gaussian noise: y_t = c_t x + v_t, where v_t ~ N(0, σ_d²). The posterior density p(x | y) may be expressed in information form with parameters

J = D^T D / σ_r² + C^T C / σ_d²,   h = C^T y / σ_d².

Thus, the fill-pattern of J (and hence the posterior Markov structure of x) is determined both by D^T D and C^T C. In the TM model, D^T D has non-zero off-diagonal entries only at those locations corresponding to nearest neighbors in the lattice. In the TP model, there are also additional connections between pairs of vertices that are two steps away in the square lattice, including diagonal edges. Finally, for each measurement y_k there is a contribution of c_k c_k^T to J, which creates edges between those four grid points closest to the location of measurement k. This results in a sparse J matrix where all edges are between nearby points in the lattice. Hence, we can apply RCM to the information model (h, J) to calculate approximations of the estimates x̂_v(y) = E{x_v | y} and error variances σ̂_v² = E{(x_v - x̂_v(y))² | y} for all vertices v ∈ V and error covariances E{(x_u - x̂_u(y))(x_v - x̂_v(y))} for all edges {u, v} ∈ E. In Appendix B, we also describe an iterative algorithm to estimate the model parameters σ_r and σ_d.

B. Sea-Surface Height Estimation

First, we consider the problem of performing near-optimal interpolation of satellite altimetry of sea-surface height anomaly (SSHA), measured relative to seasonal, space-variant mean-sea level.⁹ We model SSHA by the thin-membrane model, which seems an appropriate choice as it favors a level sea-surface. We estimate SSHA at the vertices of a lattice covering latitudes between ±60° and a full 360° of longitude, which yields a resolution of 1/5° in both latitude and longitude. The final world-wide estimates and associated error variances, obtained using RCM with δ = 10^{-4} and model parameters σ_r ≈ 1 cm and σ_d ≈ 3.5 cm, are displayed in Fig. 7.
In this example, RCM requires about three minutes to execute, including the run-time of both the cavity and blanket modeling procedures as well as the total run-time of the iterative procedure to compute the means. About 30 iterations are required to obtain a residual error $\|h - J\hat{x}^{(k)}\|$ less than $10^{-4}$, where each iteration takes 2-3 seconds.

9 This data was collected by the Jason-1 satellite over a ten-day period beginning 12/1/2004 and is available from the Jet Propulsion Laboratory.
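The stopping criterion quoted above can be illustrated with a simple preconditioned iteration for the means: apply a preconditioner $M$ (for example, the hypothetical `BlockPreconditioner.apply` sketched earlier) to the current residual until $\|h - J\hat{x}^{(k)}\| < 10^{-4}$. The loop below is a generic Richardson-style sketch, not the paper's particular iteration; it converges only when the preconditioned system is well conditioned, and in practice the same preconditioner could instead be used inside conjugate gradients.

```python
import numpy as np

def preconditioned_richardson(J, h, precond, tol=1e-4, max_iter=200):
    """Solve J x = h approximately, applying the preconditioner once per iteration.

    precond: callable mapping a residual vector to an approximate J^{-1}-product.
    Stops when the residual norm ||h - J x^(k)|| drops below tol.
    """
    x = np.zeros_like(h, dtype=float)
    for k in range(max_iter):
        r = h - J @ x                 # residual at iteration k
        if np.linalg.norm(r) < tol:
            return x, k
        x = x + precond(r)            # only substitution work inside precond
    return x, max_iter
```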

Fig. 7. Estimated sea-surface height anomaly (cm) and space-variant standard deviation of estimation error (cm), computed using RCM.

C. Salt-Top Estimation

Next, we consider the problem of estimating the salt-top, that is, the top surface of a large salt deposit located several kilometers beneath the sea-floor somewhere in the Gulf of Mexico. The data for this estimation problem, provided courtesy of Shell International Exploration, Inc., consists of a large set of picks chosen by analysts while viewing cross-sections of seismic sounding data. Hence, these picks fall along straight line segments in latitude and longitude, and it is our goal to interpolate between these points. For this problem, we use the thin-plate model for the surface of the salt deposit, which allows for the undulations typically seen in the salt-top, with a lattice at a resolution of 60 feet and with model parameters $\sigma_r \approx 12$ feet and $\sigma_d \approx 35$ feet. The final estimates and error variances are shown in Fig. 8. These results were obtained using RCM with a tolerance of $\delta = 10^{-4}$, which required about five minutes to run, including the total time required for iterative computation of the means. The run-times for the TP model are somewhat slower than for the TM model because the Markov blankets arising in the TP model are twice as wide as in the TM model, so the cavity and blanket models are more complex.

Fig. 8. Estimated salt-top and space-variant standard deviation of estimation error (feet), computed using RCM.

VI. CONCLUSION

We have presented a new, principled approach to approximate inference in very large GMRFs, employing a recursive model-reduction strategy based on information-theoretic principles, and have applied this method to perform near-optimal interpolation of sea-surface satellite altimetry. These results show the practical utility of the method for near-optimal, large-scale estimation. Several possible directions for further research are suggested by this work. First, the accuracy of RCM in applications such as those illustrated here provides considerable motivation for developing a better theoretical understanding of its accuracy and stability. For instance, if it were possible to compute and propagate upper bounds on the information divergence in RCM, this would be very useful and might lead to a robust formulation. Although we have focused on examples using Gaussian prior models, we expect RCM will also prove useful in non-linear edge-preserving methods such as [44]. Although these methods use a non-Gaussian prior, their solution generally involves solving a sequence of Gaussian problems with an adaptive, space-variant process noise. Hence, RCM could be used as a fast computational engine in these methods. We are also interested in applying RCM to higher-dimensional GMRFs, such as those arising in seismic and tomographic 3D estimation problems or for filtering of dynamic GMRFs. We anticipate that it will be important to take advantage of the inherently parallel nature of RCM to address these computationally intensive applications. Another direction to explore is based on the rich class of multi-scale models, such as models having multi-grid or pyramidal structure. For example, the work in [11] demonstrates the utility and drawbacks of using multi-resolution models defined on trees for estimation of ocean height from satellite data. Such models allow one to capture long-distance correlations much more efficiently than a single-resolution nearest-neighbor model, but the tree structure used in [11] leads to artifacts at tree boundaries, something that RCM is able to avoid. This suggests the idea of enhancing models as in [11] by including new edges that eliminate these artifacts but that introduce cycles into these multi-resolution graphical models. If such models can be developed, RCM offers a principled, scalable approximate inference algorithm well-suited to such hierarchical, multi-resolution models. Finally, while the specifics of this paper concern Gaussian models, the general framework we have outlined should apply more generally. This is especially pertinent for inference in discrete MRFs, where computation of either the marginal distributions or the mode grows exponentially with the width of the graph [6], [10], which suggests developing counterparts to RCM for these problems.

APPENDIX

A. Recursive Inference Algorithm

In this appendix, we summarize a recursive algorithm for computing the moments $\eta_G = \Lambda(\theta_G)$ of a zero-mean, chordal GMRF. Also, by differentiating each step of this procedure, we obtain an algorithm to compute the first-order change in moment parameters $d\eta_G$ due to a perturbation $d\theta_G$. The complexity of both algorithms is $O(nw^3)$, where $n$ is the number of variables and $w$ is the size of the largest clique. These algorithms are used as subroutines in the model-reduction procedure described in Section III. In RCM, these methods are only used for thin cavity and blanket models and are tractable in that context.

Let $T = (\Gamma, E_\Gamma)$ be a junction tree of $G$. We obtain a directed version of $T$ by selecting an arbitrary clique to be the root node and orienting the edges away from the root. For each non-root node $\gamma$, let $\pi(\gamma)$ denote its parent. We split each clique $C_\gamma$ into a separator $S_\gamma = C_\gamma \cap C_{\pi(\gamma)}$ and a residual set $R_\gamma = C_\gamma \setminus C_{\pi(\gamma)}$. At the root, these are defined as $S_\gamma = \emptyset$ and $R_\gamma = C_\gamma$.

Now, we specify our recursive inference procedure. The input to this procedure is the sparse matrix $J$, which is defined over a chordal graph and parameterized by $\theta_G$. The output is a sparse matrix $P$, defined on the same chordal graph, with elements specified by $\eta_G$. In the differential form of the algorithm, we also have a sparse input $dJ$ and a sparse output $dP$, corresponding to $d\theta_G$ and $d\eta_G$.

1) Upward Pass: For each node $\gamma \in \Gamma$ of the junction tree, starting from the leaves of the tree and working upwards, we perform the following computations in the order shown:
$$Q_\gamma = (J[R_\gamma])^{-1}$$
$$A_\gamma = -Q_\gamma\, J[R_\gamma, S_\gamma]$$
$$J[S_\gamma] \leftarrow J[S_\gamma] + J[S_\gamma, R_\gamma]\, A_\gamma$$
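The elimination step above can be sketched for a single clique using dense NumPy blocks; the minus sign in $A_\gamma$ follows the reconstruction given here, and the routine is only an illustration of the Schur-complement update, not the paper's sparse implementation.

```python
import numpy as np

def eliminate_clique(J, R, S):
    """One upward-pass step: eliminate residual set R against separator S.

    J is a dense symmetric information matrix (NumPy array); R and S are
    disjoint index arrays for the clique's residual and separator variables.
    Returns (Q, A) and updates the separator block of J in place.
    """
    JRR = J[np.ix_(R, R)]
    JRS = J[np.ix_(R, S)]
    Q = np.linalg.inv(JRR)            # Q_gamma = (J[R_gamma])^{-1}
    A = -Q @ JRS                      # A_gamma = -Q_gamma J[R_gamma, S_gamma]
    # J[S_gamma] <- J[S_gamma] + J[S_gamma, R_gamma] A_gamma   (Schur complement)
    J[np.ix_(S, S)] += JRS.T @ A      # J[S, R] = J[R, S]^T by symmetry
    return Q, A
```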
