A Recursive Model-Reduction Method for Approximate Inference in Gaussian Markov Random Fields


1 1 A Recursive Model-Reduction Method for Approximate Inference in Gaussian Markov Random Fields Jason K. Johnson and Alan S. Willsky Abstract This paper presents recursive cavity modeling a principled, tractable approach to approximate, near-optimal inference for large Gauss-Markov random fields. The main idea is to subdivide the random field into smaller subfields, constructing cavity models which approximate these subfields. Each cavity model is a concise yet faithful model for the surface of one subfield sufficient for near-optimal inference in adjacent subfields. This basic idea leads to a tree-structured algorithm which recursively builds a hierarchy of cavity models during an upward pass and then builds a complementary set of blanket models during a reverse downward pass. The marginal statistics of individual variables can then be approximated using their blanket models. Model thinning plays an important role, allowing us to develop thinned cavity and blanket models thereby providing tractable approximate inference. We develop a maximum-entropy approach that exploits certain tractable representations of Fisher information on thin chordal graphs. Given the resulting set of thinned cavity models, we also develop a fast preconditioner, which provides a simple iterative method to compute optimal estimates. Thus, our overall approach combines recursive inference, variational learning and iterative estimation. We demonstrate the accuracy and scalability of this approach in several challenging, large-scale remote sensing problems. I. INTRODUCTION Markov random fields (MRFs) play an important role for modeling and estimation in a wide variety of contexts including physics [1], [2], communication and coding [3], signal and image processing [4], [5], [6], [7], [8], [9], pattern recognition [10] remote sensing [11], [12], [13], sensor networks [14], and localization and mapping [15]. Their importance can be traced in some cases to underlying physics of the phenomenon being modeled, in others to the spatially distributed nature of the sensors and computational resources, and in essentially all cases to the expressiveness of this model class. MRFs are graphical models [16], [17], that is, collections of random variables, indexed by nodes of graphs, which satisfy certain graph-structured conditional independence relations: Conditioned on the values of the variables on any set of nodes that separate the graph into two or more disconnected components, the sets of values on those disconnected components are mutually independent. An implication of this Markov property thanks to the Hammersley-Clifford Theorem [18], [1] is that the joint distribution of the variables at all nodes can be compactly described in terms of local interactions among variables at small, completely connected subsets of nodes (the cliques of the graph). MRFs have another well recognized characteristic, namely that performing optimal inference on such models can be prohibitively complex because of the implicit coding of the global distribution in terms of many local interactions. For this reason, most applications of MRFs involve the use of suboptimal or approximate inference methods, and many such methods have been developed [19], [20], [21], [22]. 
In this paper we describe a new, systematic approach, to approximately optimal inference for MRFs that focuses explicitly on propagating local approximate models for subfields of the overall graphical model that are close (in a sense to be made precise) to the exact models for these subfields but are far simpler and, in fact allow computationally tractable exact inference with respect to these approximate models. The building blocks for our approach variable elimination, information projections, and inference on cycle-free graphs (that is, graphs that are trees) are well-known in the graphical model community. What is new here is their synthesis into a systematic procedure for computationally tractable inference that focuses on recursive reduced-order modeling (based on information-theoretic principles) and exact inference on the resulting set of approximate models. The resulting algorithms also have attractive structure that is of potential value for distributed implementations such as in sensor networks. To be sure our approach has connections with work of others perhaps most significantly with [23], [24], [25], [12], [26], [27], and we discuss these relationships as we proceed. The authors are with the Laboratory for Information and Decision Systems at the Massachusetts Institute of Technology, 77 Mass. Ave., Cambridge, MA ( {jasonj,willsky}@mit.edu). This research was funded by the Air Force Office of Scientific Research under Grant FA and by a grant from Shell International Exploration and Production, Inc.

While the principles for our approach apply to general MRFs, we focus our development on the important class of Gaussian MRFs (GMRFs). In the next section we introduce this class, discuss the challenges in solving estimation problems for such models, briefly review methods and literature relevant to these challenges and to our approach, and provide a conceptual overview of our approach that also explains its name: Recursive Cavity Modeling (RCM). In Section III we develop the model-reduction techniques required by RCM. In particular, we develop a tractable maximum-entropy method to compute information projections using convex optimization methods and tractable representations of Fisher information for models defined on chordal graphs. In Section IV we provide the details of the RCM methodology, which consists of a two-pass procedure for building cavity and blanket models and a corresponding hierarchical preconditioner for iterative estimation. Section V demonstrates the effectiveness of RCM with its application to several remote sensing problems. We conclude in Section VI with a discussion of RCM and further directions that it suggests.¹

II. PRELIMINARIES

A. Gaussian Markov Random Fields

Let G = (V, E) denote a graph with node (or vertex) set V and edge set E ⊆ V × V. Let x_v denote a random variable associated with node v ∈ V, and let x denote the vector of all of the x_v. If x is Gaussian with mean x̂ and invertible covariance P, its probability density can be written as

p(x) ∝ exp{-½ (x - x̂)^T P^{-1} (x - x̂)} ∝ exp{-½ x^T J x + h^T x}   (1)

with

J x̂ = h,   P = J^{-1}.   (2)

The form on the right-hand side of (1) is often referred to as the information form of the statistics of a Gaussian process, where J is the information matrix. The fill pattern of J provides the Markov structure [28]: x is Markov with respect to G if and only if J_{u,v} = 0 for all {u, v} ∉ E. In applications, x̂ and P typically specify posterior statistics of x after conditioning on some set of observations. The most common example is one in which we have an original GMRF with respect to G, together with measurements, corrupted by independent Gaussian noise, at some or all of the nodes of the graph. Since an independent measurement at node v simply modifies the values of h_v and J_{v,v}, the resulting field conditioned on all such measurements is also Markov with respect to G.

B. The Estimation Problem

Given J and h, we wish to compute x̂ and (at least) the diagonal elements of P, thus providing the marginal distributions for each of the x_v. However, solving the linear equations in (2) and inverting J can run into scalability problems. For example, methods that take no advantage of graphical structure require O(n^3) computations for graphs with n nodes. If the graph G has particularly nice structure, however, very efficient algorithms do exist. In particular, if G is a tree there are a variety of algorithms which compute x̂ and the diagonal of P with total complexity that is linear in n and that also allow distributed computation corresponding to messages being passed along edges of the graph. For example, if J is tri-diagonal, the variables x_v form a Markov chain, and efficient solution of (2) can be obtained by Gaussian elimination of variables from one end of the chain to the other followed by back-substitution, corresponding to a forward Kalman filtering sweep followed by a backward Rauch-Tung-Striebel smoothing sweep [14].
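As a concrete illustration of the information form in (1)-(2), the following minimal numpy sketch (ours, not the authors' C++ implementation; the chain and all parameter values are invented for illustration) builds a small tridiagonal J, solves for the means, and reads off the marginal variances.

```python
# Minimal sketch of (1)-(2) for a 4-node Gauss-Markov chain. For a chain, J is
# tridiagonal, and x_hat and diag(P) could equally be obtained by a forward
# elimination / backward substitution sweep (Kalman filter / RTS smoother).
import numpy as np

n = 4
J = np.diag(np.full(n, 2.0)) \
    + np.diag(np.full(n - 1, -0.8), 1) \
    + np.diag(np.full(n - 1, -0.8), -1)      # information matrix (tridiagonal: Markov chain)
h = np.array([1.0, 0.0, 0.5, -1.0])          # potential vector

x_hat = np.linalg.solve(J, h)                # J x_hat = h  -> posterior means
P = np.linalg.inv(J)                         # full covariance (only feasible for small n)
marginal_vars = np.diag(P)                   # marginal variances of each x_v
print(x_hat, marginal_vars)
```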
Because of the abundance of applications involving MRFs on graphs with cycles, there is considerable interest and a growing body of literature on computationally tractable inference algorithms. For example, the generalization of the Rauch-Tung-Striebel smoother to trees can in principle be applied to graphs with cycles by aggregating nodes of the original graph, using so-called junction tree algorithms [17], to form an equivalent model on a tree. However, the dimensions of variables at nodes in such a tree model depend on the so-called tree-width of the original graph [29], with overall inference complexity in GMRFs that grows as the cube of this tree-width.

Footnote 1: A C++ implementation of the algorithms described in this paper is available at

Thus, these algorithms are tractable only for graphs with small tree-width, precluding use for many graphs of practical importance, such as a 2D s × s lattice (with n = s² nodes) for which the tree-width is s (so that the cube of the tree-width is n^{3/2}, resulting in complexity that grows faster than linearly with graph size) or a 3D s × s × s lattice for which the tree-width is s² (so that the cube of the tree-width is n²). Since exact inference is only feasible for very particular graphs, there is great interest in algorithms that yield approximations to the correct means and covariances and that have tractable complexity. One well-known algorithm is loopy belief propagation (LBP) [30], [20], which takes the local message-passing rules which yield the exact solution on trees and applies them unchanged and iteratively to graphs with cycles. There has been recent progress [31], [20] in understanding how such algorithms behave, and for GMRFs it is now known [31], [22] that if LBP converges, it yields the correct value for x̂ but not the correct values for the P_{v,v}. Although some sufficient conditions for convergence are known [31], [32], LBP does not always converge and may converge slowly in large GMRFs. There are several other classes of approximate algorithms that are more closely related to our approach, and we discuss these connections in the next subsection. As we now describe, RCM can be viewed as a direct, recursive approximation of an exact (and hence intractable) inference algorithm created by aggregating nodes of the original graph into a tree. In particular, by employing an information-theoretic approach to reduced-order modeling, together with a particular strategy for aggregating nodes, we construct tractable, near-optimal algorithms that can be applied successfully to very large graphs.

C. The Basic Elements of RCM

As with exact methods based on junction trees, RCM makes use of separators, that is, sets of nodes which, if removed from the graph, result in two or more disconnected components. By Markovianity, the sets of variables in each of these disconnected components are mutually independent conditioned on the set of values on the separator. This suggests a divide-and-conquer approach to describing the overall statistics of the MRF on a hierarchically organized tree. Each node in this tree corresponds to a separator at a different scale in the field. For example, the root node of this tree might correspond to a separator that separates the entire graph into, say, k disconnected subgraphs. The root node then has k children, one corresponding to each of these disconnected subgraphs, and the node for each of these children would then correspond to a separator that further dissects that subgraph. This continues to some finest level at which exact inference computations on the subgraphs at that level are manageable. The problem with this approach, as suggested in Section II-B, is that the dimensionality associated with the larger separators in our hierarchical tree can be quite high, for instance, of order √n in square grids. This problem has led several researchers [24], [7], [33], [34], [14] to develop approaches for GMRFs based on dimensionality reduction, that is, replacing the high-dimensional vector of values along an entire separator by a lower-dimensional vector. While approaches such as [33], [34] use statistically-motivated criteria for choosing these approximations, there are significant limitations of this idea.
The first is that the use of low-dimensional approximations can lead to artifacts (that is, modeling errors which expose the underlying approximation), both across and along these separators. The second is that performing such a dimensionality reduction requires that we have available the exact mean and covariance for the vector of variables whose dimension we wish to reduce, which is precisely the intractable computation we wish to approximate! The third limitation is that these approaches are strictly top-down approaches that is, they require establishing the hierarchical decomposition from the root node on down to the leaf nodes a priori, an approach often referred to as nested dissection. We also employ nested dissection in our examples, but the RCM approach also offers the possibility of bottom-up organization of computations, beginning at nodes located close to each other and working outward a capability that is particularly appealing for distributed sensor networks. The key to RCM is the use of the implicit, information form, corresponding to models for the variables along separators, allowing us to consider model-order rather than dimensionality reduction. In this way, we still retain full dimensionality of the variables along each separator, overcoming the problem of artifacts. Of course, we still have to deal with the computational complexity of obtaining the information form of the statistics along each separator. Doing that in a computationally and statistically principled fashion is one of the major components of RCM. Consider a GMRF on the graph depicted in Fig. 4(a) so that the information matrix for this field has a sparsity pattern defined by this graph. Suppose now that we consider solving for ˆx and the diagonal of P from (2) by variable elimination. In particular, suppose that we eliminate all of the variables within the dashed region

4 4 in Fig. 4(a) except for those right at the boundary. Doing this in general will lead to fill in the information matrix for the set of variables that remain after variable elimination. As depicted in Fig. 4(b), this fill is completely concentrated within the dashed region that is, within this cavity. That is, if we ignore the connections outside the cavity, we have a model for the variables along the boundary that is generally very densely connected. This suggests approximating this high-order exact model for the boundary by a reduced or thinned model as in Fig. 4(c) with the sparsity suggested by this figure. Indeed, if we thin the model sufficiently, we can then continue the process of alternating variable elimination and model thinning in a computationally tractable manner. Suppose next that there are a number of disjoint cavities as in Fig. 4 in each of which we have performed alternating steps of variable elimination to enlarge the cavity followed by model thinning to maintain tractability. Eventually, two or more of these cavities will reach a point at which they are adjacent to each other, as in Fig. 5(a). At this point, the next step is one of merging these cavities into a larger one (Fig. 5(b)), eliminating the nodes that are interior to the new, larger cavity (Fig. 5(c)), and then thinning this new model. If each step does sufficient thinning, computational tractability can be maintained. Eventually, this outwards elimination ends, and a reverse inwards elimination procedure commences, again done in information form. This inwards procedure eliminates all of the variables outside of each subfield, except for the variables adjacent to the subfield, producing what is known as a blanket model. Eliminating these variables involves computations that recursively produce blanket models for smaller and smaller subfields, as illustrated in Fig. 6, which shows that, once again, model thinning (going from Fig. 6(c) to Fig. 6(d)) plays a central role. Finally, once this inward sweep has been completed, we have information forms for the marginal statistics for each of the subfields that were used to initialize the outwards elimination procedure. Inverting these many smaller, now-localized models to obtain means and variances is then, by construction, computationally tractable. RCM has some relationships to other work as well as some substantive differences. The general conceptual form we have outlined is closely related to the nested dissection approach [23], [12] to solving large linear systems. The approach to model thinning in [23], [12], however, is simply zeroing a set of elements (retaining just those elements which couple nearby nodes along the boundary). The statistical interpretation of this approach and its extensibility to less regular lattices and fields, however, are problematic. In particular, zeroing elements can lead to indefinite (and hence meaningless) information matrices, and even if this is not the case, such an operation in general will modify all of the elements of the covariance matrix (including the variances of individual variables). In contrast, we adopt a principled, statistical approach to model thinning, using so-called information projections, which guarantee that the means, variances and edgewise correlations in the thinned model are unchanged by the thinning process. 
Information-theoretic approaches to approximating graphical models have a significant literature [25], [35], [29], [36], most of which focuses on doing this for a single, overall graphical model and not in the context of a recursive procedure such as we develop. One effort that has considered a recursive approach is [26] which examines timerecursive inference for Dynamic Bayes Nets (DBNs) that is, for graphical models that evolve in time, so that we can view the overall graphical model as a set of coupled temporal stages. Causal recursive filtering then corresponds to propagating frontiers that is, a particular choice of what we would call cavity boundaries corresponding to the values at all nodes at a single point in time. The method in [26] projects each frontier model into a family of factored models so that the projection is given by a product of marginals on disjoint subsets of nodes. Such an operation can be viewed as a special case of the outward propagation of cavity models where nodes in the boundary are required to be mutually independent. In our approach, we instead adaptively thin the graphical model by identifying and removing edges that correspond to weak conditional dependencies so that the thinned models typically do not become disconnected. Also, the other two elements of our approach specifically, the hierarchical structure which requires merging operations as in Fig. 5 and the inward recursion for blanket models as in Fig. 6 don t arise in the consideration of DBNs [26]. This distinction is important for large-scale computation because the hierarchical, tree structure of RCM is highly favorable for parallel computing 2, whereas frontier propagation methods require serial computations. There are also parallels to the group renormalization method using decimation [37], [13], which constructs a multi-scale cascade of coarse-scale MRFs by a combination of node-elimination and edge-thinning, and estimates the most probable configuration at each scale using iterative methods. 2 In exact inference methods using junction trees, the benefits of parallel computing are limited by the predominant computations on the largest separators, a limitation that RCM avoids through the use of thinned boundary models.

Fig. 1. Illustration of information projection and the Pythagorean relation.

III. MODEL REDUCTION

In this section we focus on the problem of model reduction, the solution of which RCM employs in the recursive thinning of cavity and blanket models. In Section III-A, we pose model reduction as information projection to a family of GMRFs and develop a tractable maximum-entropy method to compute these projections. In Section III-C we present a greedy algorithm that uses conditional mutual information to select which edges to remove. In our development we assume that the model being thinned is tractable. This is consistent with RCM, in which we thin models, propagate them to larger ones that are still tractable, and then thin again to maintain tractability.

A. Information Projection and Maximum Entropy

Suppose that we wish to approximate a probability distribution p(x) by a GMRF defined on a graph G = (V, E). Over the family F of GMRFs on G we select q(x) to minimize the information divergence (relative entropy [38]) relative to p:

D(p ‖ q) = ∫ p(x) log [p(x)/q(x)] dx ≥ 0   (3)

As depicted in Fig. 1, minimizing (3) can be viewed as a projection of p onto F. Many researchers have adopted (3) as a natural measure of modeling error [25], [36], [29]. The problem of minimizing information divergence takes on an especially simple characterization when the approximating family F is an exponential family [39], [35], [40], that is, a family of the form p_θ(x) ∝ exp{θ · φ(x)}, where θ ∈ R^d are the exponential parameters and φ : R^n → R^d is a vector of linearly independent sufficient statistics. The family is defined by the set θ ∈ Θ for which ∫ exp{θ · φ(x)} dx < ∞. The vector of moments η = Λ(θ) ≜ E_θ{φ(x)} plays a central role in minimizing (3). In particular, it can be shown that θ ∈ Θ minimizes D(p ‖ p_θ) if and only if we have moment matching relative to F; that is, if and only if the expected values of the sufficient statistics that define F are the same under p_θ and the original density p. This optimizing element, which we refer to as the information projection of p to F and denote by p_F, is the unique member of the family F for which the following Pythagorean relation holds: D(p ‖ q) = D(p ‖ p_F) + D(p_F ‖ q) for any q ∈ F (see [35] and Fig. 1). Information projections also have a maximum entropy interpretation [41], [35] in that, among all densities q ∈ M(p) that match the moments of p relative to F, p_F is the one which maximizes the entropy h(q) = -∫ q(x) log q(x) dx. Moreover, the increase in entropy from p to p_F is precisely the value of the information loss: D(p ‖ p_F) = h(p_F) - h(p).

The family of GMRFs on a graph G is represented by an exponential family with sufficient statistics φ_G, exponential parameters θ_G and moment parameters η_G given by:

φ_G(x) = ( (x_v)_{v∈V}, (x_v²)_{v∈V}, (x_u x_v)_{{u,v}∈E} )
θ_G = ( (h_v)_{v∈V}, (-½ J_{v,v})_{v∈V}, (-J_{u,v})_{{u,v}∈E} )
η_G = ( (x̂_v)_{v∈V}, (P_{v,v} + x̂_v²)_{v∈V}, (P_{u,v} + x̂_u x̂_v)_{{u,v}∈E} )   (4)

Note that θ_G specifies h and the non-zero elements of J, while η_G specifies x̂ and the corresponding subset of elements in P. These parameters are related by a one-to-one map η_G = Λ(θ_G), defined by P = J^{-1} and x̂ = J^{-1} h, which is bijective on the image of realizable moments M(G). Given a distribution p(x), the projection to F(G) is given as follows. Using the distribution p, we compute the moments relative to G, or equivalently, the means x̂_v, variances P_{v,v} and edge-wise cross-covariances P_{u,v} on G.
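The moment map η_G = Λ(θ_G) in (4) is easy to state in code. The sketch below is our illustration (the 4-cycle and its parameters are arbitrary): it computes the means, node variances, and edgewise covariances that the information projection onto F(G) must match. Note that (4) records raw second moments, which are equivalent to variances and covariances once the means are known.

```python
# Sketch of the moment map Lambda: from (h, J), recover the moments relative to G
# (means, node variances, edgewise covariances) via P = J^{-1}, x_hat = P h.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a 4-cycle (illustrative graph)
J = np.diag(np.full(4, 2.0))
for u, v in edges:
    J[u, v] = J[v, u] = -0.5
h = np.array([0.2, -0.1, 0.0, 0.4])

P = np.linalg.inv(J)
x_hat = P @ h

eta = {
    "means": {v: x_hat[v] for v in range(4)},
    "variances": {v: P[v, v] for v in range(4)},
    "edge_covs": {(u, v): P[u, v] for u, v in edges},
}
print(eta)
```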

The information matrix J of the projection is then uniquely determined by the following complementary sets of constraints [42], [28]:

(J^{-1})_{u,v} = P_{u,v},   (u, v) ∈ Ē   (5)
J_{u,v} = 0,   (u, v) ∉ Ē   (6)

where Ē ≜ E ∪ {(v, v), v ∈ V}. Eq. (5) imposes covariance-matching conditions over G while (6) imposes Markovianity with respect to G. Also, by the maximum-entropy principle, an equivalent characterization of J is that P ≜ J^{-1} is the maximum entropy completion [43] of the partial covariance specification P_G = (P_{u,v}, (u, v) ∈ Ē). Given J, the remaining moment constraints are satisfied by setting h = J x̂. Then, (h, J) is the information form of the projection to G. Hence, projection to general GMRFs may be solved by a shifted projection to the zero-mean GMRFs, and we may focus on this zero-mean case without any loss of generality. The family of zero-mean GMRFs is described as in (4) but without the linear statistics x_v and the corresponding parameters h and moments x̂.

B. Maximum-Entropy Relative to a Chordal Super-Graph

We now develop a method to compute the projection to a graph by embedding this graph within a chordal super-graph and maximizing entropy of the chordal GMRF subject to moment constraints over the embedded subgraph. This approach allows us to exploit certain tractable calculations on chordal graphs to efficiently compute the projection to a non-chordal graph.

1) Chordal GMRFs: A graph is chordal if, for every cycle of four or more nodes, there exists an edge (a chord) connecting two non-consecutive nodes of the cycle. Let C(G) denote the set of cliques of G: the maximal subsets C ⊆ V for which the induced subgraph G_C is complete, that is, every pair of nodes in C is an edge of the graph. A useful result of graph theory states that a graph is chordal if and only if there exists a junction tree: a tree T = (Γ, E_Γ) whose nodes γ ∈ Γ are identified with cliques C_γ ∈ C(G) and where for every pair of nodes α, β ∈ Γ we have C_α ∩ C_β ⊆ C_γ for all γ along the path from α to β. Then, each edge (α, β) ∈ E_Γ determines a minimal separator S = C_α ∩ C_β of the graph. Moreover, any junction tree of a chordal graph G yields the same collection of edge-wise separators, which we denote by S(G). The importance of chordal graphs is shown by the following well-known result: Any strictly-positive probability distribution p_G(x) that is Markov on a chordal graph G can be represented in terms of its marginal distributions on the cliques C ∈ C(G) and separators S ∈ S(G) of the graph as

p_G(x) = Π_C p_C(x_C) / Π_S p_S(x_S).   (7)

In chordal GMRFs, this leads to the following formula for the sparse information matrix in terms of marginal covariances:

J = Σ_C [P_C^{-1}]_V - Σ_S [P_S^{-1}]_V.   (8)

Here, [·]_V denotes zero-padding to a |V| × |V| matrix indexed by V. In the exponential family, this provides an efficient method to compute θ_G = Λ^{-1}(η_G). Also, given the marginal covariances of an arbitrary distribution p(x), not necessarily Markov on G, (8) describes the projection of p to F(G). The complexity of this calculation is O(nw³), where w is the size of the largest clique.³

2) Entropy and Fisher Information in Chordal GMRFs: Based on (7), it follows that the entropy of a chordal MRF likewise decomposes in terms of marginal entropy on the cliques and separators of G. In the moment parameters of the GMRF, we have

h(η_G) = Σ_C h_C(η_{G_C}) - Σ_S h_S(η_{G_S}),   (9)

where h_C and h_S denote the marginal entropy of cliques and separators, computed using

h_U(η_{G_U}) = ½ (log det P_U(η_{G_U}) + |U| log 2πe).   (10)
For exponential families, it is well known that ∇h(η) = -Λ^{-1}(η), so that, for GMRFs, differentiating (9) reduces to performing the conversion (8). Thus, both h(η_G) and ∇h(η_G) can be computed with O(nw³) complexity.

Footnote 3: This complexity bound follows from the fact that, in chordal graphs, the number of maximal cliques is at most n - 1 and, in GMRFs, the computations we perform on each clique are cubic in the size of the clique.
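The decompositions (8) and (9)-(10) can be checked directly on a small chordal graph. The sketch below is ours, for illustration only: it projects an arbitrary 4-variable covariance onto a 4-node chain using (8), verifies that the projection reproduces the clique marginals, and evaluates its entropy by the clique/separator sum (9), which agrees with the closed-form Gaussian entropy.

```python
# Eq. (8) and (9)-(10) on a 4-node chain: cliques {0,1},{1,2},{2,3}, separators {1},{2}.
import numpy as np

n, cliques, seps = 4, [[0, 1], [1, 2], [2, 3]], [[1], [2]]

# Covariance of a density p that is NOT Markov on the chain (illustrative values).
A = np.array([[2.0, -0.6, 0.3, 0.1],
              [-0.6, 2.0, -0.6, 0.3],
              [0.3, -0.6, 2.0, -0.6],
              [0.1, 0.3, -0.6, 2.0]])
P = np.linalg.inv(A)

J_proj = np.zeros((n, n))                     # eq. (8): clique terms minus separator terms
for C in cliques:
    J_proj[np.ix_(C, C)] += np.linalg.inv(P[np.ix_(C, C)])
for S in seps:
    J_proj[np.ix_(S, S)] -= np.linalg.inv(P[np.ix_(S, S)])

P_proj = np.linalg.inv(J_proj)
for C in cliques:                             # projection matches p's clique marginals
    assert np.allclose(P_proj[np.ix_(C, C)], P[np.ix_(C, C)])

def h_marg(Pfull, idx):                       # eq. (10): marginal Gaussian entropy
    k = len(idx)
    return 0.5 * (np.log(np.linalg.det(Pfull[np.ix_(idx, idx)])) + k * np.log(2 * np.pi * np.e))

H = sum(h_marg(P_proj, C) for C in cliques) - sum(h_marg(P_proj, S) for S in seps)
H_exact = 0.5 * (np.log(np.linalg.det(P_proj)) + n * np.log(2 * np.pi * np.e))
assert np.isclose(H, H_exact)                 # eq. (9) reproduces the full entropy
```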

Next, we recall that the Fisher information with respect to the parameters η_G is defined by G(η_G) ≜ E_{η_G}{∇ log p(x; η_G) ∇ log p(x; η_G)^T} = -∇∇^T h(η_G), where ∇ denotes the gradient with respect to η_G and the expectation is with respect to the unique element p ∈ F(G) with moments η_G. Then, G(η_G) is a symmetric, positive-definite matrix and also describes the negative Hessian of entropy in exponential families. By twice differentiating (9), it follows that, in chordal GMRFs, the Fisher information matrix has a sparse representation in terms of marginal Fisher information defined on the cliques and separators of the graph:

G(η_G) = Σ_C [G_C(η_{G_C})]_G - Σ_S [G_S(η_{G_S})]_G,   (11)

where G_U(η_{G_U}) ≜ -∇∇^T h_U(η_{G_U}) is the marginal Fisher information on U and [·]_G denotes zero-padding to a matrix indexed by the nodes and edges of G. From (11), we observe that the fill pattern of G(η_G) defines another chordal graph with the same junction tree as G, but where each clique C ∈ C(G) maps to a larger clique with O(|C|²) nodes (corresponding to a full sub-matrix of G(η_G) indexed by the nodes and edges of G_C). For this reason, direct use of G(η_G), viewed simply as a sparse matrix, is undesirable if G contains larger cliques. However, we can specify implicit methods that exploit the special structure of G to implement multiplication by either G or G^{-1} with O(nw³) complexity. Observing that G(η_G) = ∂θ_G/∂η_G represents the Jacobian of the mapping from η_G to θ_G, we can compute matrix-vector products dθ_G = G dη_G for an arbitrary input dη_G (viewed as a change in moment coordinates). Differentiating (8) using d(P_U^{-1}) = -P_U^{-1} dP_U P_U^{-1}, we obtain:

dJ = -Σ_C [P_C^{-1} dP_C P_C^{-1}]_V + Σ_S [P_S^{-1} dP_S P_S^{-1}]_V.   (12)

Similarly, we can compute dη_G = G^{-1} dθ_G by differentiating η_G = Λ(θ_G). In Appendix A, we summarize a recursive inference algorithm, defined relative to a junction tree of G, that computes η_G given θ_G, and derive a corresponding differential form of the algorithm that computes dη_G given dθ_G. These methods are used to efficiently implement the variational method described next.

3) Maximum-Entropy Optimization: Given a GMRF p on G, we develop a maximum-entropy (ME) method to compute the projection to an arbitrary (non-chordal) sub-graph S. Let G' be a chordal super-graph of G and let R = E(G') \ E(S), such that η_{G'} = (η_S, η_R). We may compute η_{G'}(p) using recursive inference on a junction tree of G' (see Appendix A). To compute the projection to S, we maximize entropy in the chordal GMRF subject to moment constraints over the sub-graph S. This may be formulated as a convex optimization problem:

min_{η_R} f(η_R) ≜ -h(η_S, η_R)   s.t.   (η_S, η_R) ∈ M(G')   (13)

Here, η_{G'} ∈ M(G') are the realizable moments of the GMRF defined on G'.⁴ Starting from η_{G'}^{(0)} = η_{G'}(p), we compute a sequence η_{G'}^{(k)} = (η_S, η_R^{(k)}) using Newton's method. For each k, this requires solving the linear system

G_R^{(k)} Δη_R^{(k)} = -θ_R^{(k)}   (14)

where G_R^{(k)} = ∇∇^T f is the principal sub-matrix of G(η_{G'}^{(k)}) corresponding to R and θ_R^{(k)} = ∇f is the corresponding sub-vector of θ_{G'}^{(k)} = Λ^{-1}(η_{G'}^{(k)}) computed using (8). We then set η_R^{(k+1)} = η_R^{(k)} + λ Δη_R^{(k)}, where λ ∈ (0, 1] is determined by back-tracking line search to stay within M(G') and to ensure that entropy is increased. This procedure converges to the optimal η_{G'}^* = (η_S, η_R^*), for which the corresponding exponential parameters satisfy θ_R^* = 0. Then, θ_S^* is the information projection to S.
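The implicit matrix-vector product in (12) can be sketched at the matrix level as follows. This is our illustration only: it glosses the bookkeeping between the matrix pair (dP, dJ) and the vectorized coordinates dη_G, dθ_G, and it uses an arbitrary chain and random moments. The finite-difference check confirms that (12) is the differential of the map (8).

```python
# Jacobian-vector product of the map (8), eta -> theta, without forming G explicitly.
import numpy as np

n, cliques, seps = 4, [[0, 1], [1, 2], [2, 3]], [[1], [2]]
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)); P = A @ A.T + n * np.eye(n)     # some PD "moment" matrix

def theta_of_eta(P):
    """Eq. (8): assemble J from clique/separator marginal covariances."""
    J = np.zeros((n, n))
    for C in cliques: J[np.ix_(C, C)] += np.linalg.inv(P[np.ix_(C, C)])
    for S in seps:    J[np.ix_(S, S)] -= np.linalg.inv(P[np.ix_(S, S)])
    return J

def G_matvec(P, dP):
    """Eq. (12): dJ = -sum_C [P_C^-1 dP_C P_C^-1] + sum_S [P_S^-1 dP_S P_S^-1]."""
    dJ = np.zeros((n, n))
    for C in cliques:
        Pi = np.linalg.inv(P[np.ix_(C, C)]); dJ[np.ix_(C, C)] -= Pi @ dP[np.ix_(C, C)] @ Pi
    for S in seps:
        Pi = np.linalg.inv(P[np.ix_(S, S)]); dJ[np.ix_(S, S)] += Pi @ dP[np.ix_(S, S)] @ Pi
    return dJ

dP = rng.standard_normal((n, n)); dP = 0.5 * (dP + dP.T)         # symmetric perturbation
eps = 1e-6
fd = (theta_of_eta(P + eps * dP) - theta_of_eta(P)) / eps        # finite difference of (8)
assert np.allclose(fd, G_matvec(P, dP), atol=1e-4)
```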
Finally, we discuss an efficient method to compute the Newton step: If the width w of the chordal graph is very small, say w < 10, we could explicitly form the sparse matrix G_R and efficiently solve (14) using direct methods. However, this approach has O(nw⁶) complexity, which is undesirable for larger values of w. Instead, we use an inexact Newton step, obtained by approximate solution of (14) using standard iterative methods, for instance, preconditioned conjugate gradients (PCG).

Footnote 4: In the chordal graph G', the condition that η_{G'} is realizable is equivalent to P_C(η_{G'_C}) ≻ 0 for all C ∈ C(G'), which can be verified with complexity O(nw³). Thus, M(G') is convex because the set of positive-definite matrices on each clique is convex.

Such methods generally require an efficient method to compute the matrix-vector products G_R Δη_R, which we can provide using the implicit method, based on (12), for multiplication by G. Also, to obtain rapid convergence, it is important to provide an efficient preconditioner, which approximates (G_R)^{-1}. For our preconditioner, we use (G^{-1})_R,⁵ implemented using an implicit method for multiplication by G^{-1} described in Appendix A. In this way, we obtain iterative methods that have O(nw³) complexity per iteration. Using the PCG method, we find that a small number of iterations (typically, 3-12) is sufficient to obtain a good approximation to each Newton step, leading to rapid convergence in Newton's method, but with significantly less overall computation for larger values of w than is required using direct methods.

C. Greedy Model Thinning

In this section, we propose a simple greedy strategy for thinning a GMRF model. This entails selecting edges of the graph which correspond to weak statistical interactions between variables and pruning these edges from the GMRF by information projection. The quantity we use to measure the strength of interaction between x_u and x_v is the conditional mutual information [38],

I_{u,v}(p) ≜ E_p{ log [ p(x_u, x_v | x_{V \ {u,v}}) / ( p(x_u | x_{V \ {u,v}}) p(x_v | x_{V \ {u,v}}) ) ] } = -½ log( 1 - J_{u,v}² / (J_{u,u} J_{v,v}) ),

which is the average mutual information between x_u and x_v after conditioning on the other variables x_{V \ {u,v}}. In GMRFs, we can omit edge {u, v} from G, without any modeling error, if and only if x_u and x_v are conditionally independent given x_{V \ {u,v}}, that is, if and only if I_{u,v}(p) = 0. This suggests using the value of I_{u,v}(p), which is tractable to compute, to select edges {u, v} ∈ E to remove. To motivate this idea further, we note that I_{u,v}(p) is closely related to the information loss resulting from removing edge {u, v} from G by information projection. Let G \ {u, v} = (V, E \ {u, v}) denote the sub-graph of G with edge {u, v} removed and let K denote the complete graph on V. Then, observing that G \ {u, v} is a sub-graph of K \ {u, v}, we have, by the Pythagorean relation with respect to p_{K \ {u,v}},

D(p ‖ p_{G \ {u,v}}) = D(p ‖ p_{K \ {u,v}}) + D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) = I_{u,v}(p) + D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) ≥ I_{u,v}(p),

where we have used D(p ‖ p_{K \ {u,v}}) = I_{u,v}(p) and D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) ≥ 0. Thus, I_{u,v}(p) is a lower bound on the information loss D(p ‖ p_{G \ {u,v}}). Moreover, for p ∈ F(G) having a small value of I_{u,v}(p), we find that D(p_{K \ {u,v}} ‖ p_{G \ {u,v}}) tends to be small relative to I_{u,v}(p), so that I_{u,v}(p) then provides a good estimate of D(p ‖ p_{G \ {u,v}}). In other words, removing edges with small conditional mutual information is roughly equivalent to picking those edges to remove that result in the least modeling error.

We use the following greedy approach to thin a GMRF defined on G. Let δ > 0 specify the tolerance on conditional mutual information for removal of an edge. We compute I_{u,v}(p) for all edges {u, v} ∈ E and select a subset of edges R ⊆ E with I_{u,v}(p) < δ to remove. The information projection to the sub-graph S = (V, E \ R) is then computed using our ME method as described in Section III-B (relative to a chordal super-graph of G). Because the values of I_{u,v} in this information projection will generally differ from their prior values, we may continue to thin the GMRF until all the remaining edges have I_{u,v} > δ. Also, by limiting the number of edges removed at each step, it is possible to take into account the effect of removing the weakest edges before selecting which other edges to remove, which can help reduce the overall information loss.
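A minimal sketch of the edge score used in this greedy procedure follows (ours; the information matrix and the threshold δ are illustrative, and the subsequent information projection onto the thinned graph is not shown).

```python
# Conditional mutual information of each edge of a GMRF, I_uv = -0.5*log(1 - J_uv^2/(J_uu*J_vv)),
# and a threshold test selecting weak edges as candidates for removal.
import numpy as np

J = np.array([[2.0, -0.7, 0.0, -0.02],
              [-0.7, 2.0, -0.6, 0.0],
              [0.0, -0.6, 2.0, -0.05],
              [-0.02, 0.0, -0.05, 2.0]])
edges = [(u, v) for u in range(4) for v in range(u + 1, 4) if J[u, v] != 0.0]

def cond_mutual_info(J, u, v):
    r2 = J[u, v] ** 2 / (J[u, u] * J[v, v])   # squared partial correlation
    return -0.5 * np.log(1.0 - r2)

delta = 1e-3
scores = {(u, v): cond_mutual_info(J, u, v) for u, v in edges}
weak = [e for e, s in scores.items() if s < delta]
print(scores, weak)   # weak edges would be pruned by information projection
```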
IV. RECURSIVE CAVITY MODELING

We now flesh out the details of RCM. In Section IV-A, we specify the hierarchical tree representation of the GMRF that we use, and in Section IV-B, we define information forms and the three basic operations we use: composition, elimination and model reduction. These forms and operators are the components we use to build our two-pass, recursive, message-passing inference algorithm on the hierarchical tree.

Footnote 5: To motivate this preconditioner, we note that (G_R)^{-1} is given by the Schur complement H_R - H_{R,S} H_S^{-1} H_{S,R} with respect to H ≜ G^{-1} = cov{φ(x)}. Hence, our preconditioner H_R = (G^{-1})_R arises by neglecting the intractable term H_{R,S} H_S^{-1} H_{S,R}, which is a good approximation if the correlation H_{S,R} is weak relative to H_R and H_S.

Fig. 2. (a) The tree T = (Γ, E_Γ) based on (b) the collection U of nested subsets of vertices V of the underlying graph G.

First, as described in Section IV-C, we perform an upward pass on the tree which constructs cavity models. Next, as described in Section IV-D, we perform a downward pass on the tree which constructs blanket models and also estimates marginal variances and edge-wise covariances in the GMRF. Lastly, in Section IV-E, we describe a hierarchical preconditioner, using the cavity models computed by RCM, and an iterative estimation algorithm that computes the means for all vertices of the GMRF.

Before we proceed, we define some basic notation with respect to the graph G = (V, E) describing the Markov structure of x. Given U ⊆ V, let U^c ≜ V \ U denote the set complement of U in V and let ∂U ≜ {v ∈ U^c : (u, v) ∈ E for some u ∈ U} denote the blanket of U in G. Also, ∂(U^c) is the surface of U and U° ≜ U \ ∂(U^c) is its interior. These definitions are illustrated in Fig. 3(a), (b) and (c).

A. Hierarchical Tree Structure

We begin by requiring that the graphical model is recursively dissected into a hierarchy of nested subfields as indicated in Fig. 2. First, we describe a bottom-up construction. Let the set V be partitioned into a collection L of many small, disjoint subsets chosen so as to induce low-diameter, connected subgraphs in G over which exact inference is tractable. These small sets of vertices are recursively merged into larger and larger subfields until only the entire set V remains. Only adjacent subfields are merged, so as to induce connected subgraphs. Also, merging should (ideally) keep the diameter of these connected subgraphs as small as possible. To simplify presentation only, we assume that subfields are merged two at a time. This generates a collection U ⊆ 2^V containing the smallest sets in L as well as each of the merged sets up to and including V. Alternatively, such a dissection can be constructed in a top-down fashion by recursively splitting the graph, and resulting sub-graphs, into roughly equal parts chosen so as to minimize the number of cut edges at each step. For instance, in 2D lattices this is simply achieved by performing an alternating series of vertical and horizontal cuts. In any case, this recursive dissection of the graph defines a tree T = (Γ, E_Γ), in which each node γ ∈ Γ corresponds to a subset U_γ ∈ U and with directed edges E_Γ linking each dissection cell to its immediate sub-cells. We let π(γ) denote the parent of node γ in this tree. Also, the children of γ are denoted π^{-1}(γ) = {α(γ), β(γ)}, or more simply {α, β} where γ has been explicitly specified. The following vertex sets are defined for each U_γ ∈ U relative to the graph G:

B_γ ≜ ∂U_γ,   R_γ ≜ ∂(U_γ^c).   (15)

As seen in Figs. 3(a), (b) and (c), the blanket B_γ is the outer boundary of U_γ while the surface R_γ is its inner boundary, and either serves as a separator between the interior U_γ° and the rest of the field. Also, the following separators are used in RCM:

S^γ ≜ R_α ∪ R_β,   S_α ≜ B_γ ∪ R_β,   S_β ≜ B_γ ∪ R_α.   (16)

The separator S^γ, used in the RCM upward pass, is the union of the surfaces of the two children of a subfield (see Fig. 3(d) and (e)). The separators S_α and S_β, used in the RCM downward pass, are each the union of its parent's blanket and its sibling's surface (see Fig. 3(d) and (f)).
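A small sketch of the vertex-set notation just introduced, for a 4-connected grid (our code; the grid and subfield are arbitrary examples):

```python
# Blanket, surface, and interior of a subset U of a grid graph.
from itertools import product

def grid_graph(rows, cols):
    adj = {(r, c): set() for r, c in product(range(rows), range(cols))}
    for (r, c) in adj:
        for dr, dc in ((1, 0), (0, 1)):
            if (r + dr, c + dc) in adj:
                adj[(r, c)].add((r + dr, c + dc)); adj[(r + dr, c + dc)].add((r, c))
    return adj

def blanket(adj, U):      # dU: nodes outside U with a neighbour in U
    return {w for u in U for w in adj[u]} - U

def surface(adj, U):      # nodes of U with a neighbour outside U (= blanket of the complement)
    return blanket(adj, set(adj) - U)

def interior(adj, U):
    return U - surface(adj, U)

adj = grid_graph(5, 5)
U = {(r, c) for r in range(1, 4) for c in range(1, 4)}        # a 3x3 subfield
print(len(blanket(adj, U)), len(surface(adj, U)), len(interior(adj, U)))   # 12 8 1
```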
These separators define a Markov tree representation, with respect to T, of the original GMRF defined on G [24]: For each leaf γ of T define the state vector x_γ ≜ x_{U_γ}. For each non-leaf γ let x_γ ≜ x_{S^γ}. By construction, each S^γ is a separator of the graph, that is, the interiors of U_α and U_β and the complement U_γ^c are mutually separated by S^γ. Hence, all conditional independence relations required by the Markov tree are satisfied by the underlying GMRF. However, we are interested in the large class of models for which exact inference on such a Markov tree representation is not feasible because of the large size of some of the separators. As discussed in Section II-C, we instead perform reduced-order modeling of these variables, corresponding to a thinned, tractable graphical model on each separator.

Fig. 3. Illustrations of the graph G of a GMRF and of our notation used to indicate subfields: (a) the subfield U_γ and its complement U_γ^c; (b) the blanket B_γ = ∂U_γ; (c) the interior U_γ° and surface R_γ = ∂(U_γ^c); (d) partitioning of U_γ into sub-cells U_α and U_β; (e) separator S^γ = R_α ∪ R_β; (f) separator S_α = R_β ∪ B_γ.

B. Information Kernels

In the sequel, we let (h_U, J_U), where U ⊆ V, h_U ∈ R^U and J_U ∈ R^{U×U} is symmetric positive definite, represent the information kernel f_U : R^U → R_+ defined by:

f_U(x_U; h_U, J_U) = exp{-½ x_U^T J_U x_U + h_U^T x_U}   (17)

The subscript U indicates the support of the information kernel, and of the matrices h_U and J_U. Generally, f_U corresponds (after normalization) to a density over the variables x_U parameterized by h_U and J_U. In RCM, the set U is typically a separator of the graph, and h_U and J_U are approximations to the exact distribution in question so that J_U is sparse. We also use matrices J_{U,W}, where U, W ⊆ V and J_{U,W} ∈ R^{U×W}, to represent the function

f_{U,W}(x_U, x_W; J_{U,W}) = exp{-x_U^T J_{U,W} x_W},   (18)

which describes the interaction between subfields U, W. We adopt the following notations: Let h_U[W] denote the sub-vector of h_U indexed by W ⊆ U. Likewise, J_U[W_1, W_2] denotes the sub-matrix of J_U indexed by W_1 × W_2, and we write J_U[W] = J_U[W, W] to indicate a principal sub-matrix. Given two disjoint subfield models (h_{U_1}, J_{U_1}) and (h_{U_2}, J_{U_2}) and the interaction J_{U_1,U_2}, we let (h_U, J_U) = (h_{U_1}, J_{U_1}) ⊕ J_{U_1,U_2} ⊕ (h_{U_2}, J_{U_2}) denote the joint model on U = U_1 ∪ U_2 defined by

h_U = [h_{U_1}; h_{U_2}],   J_U = [J_{U_1}, J_{U_1,U_2}; J_{U_1,U_2}^T, J_{U_2}],

which corresponds to multiplication of information kernels or addition of their information forms. Given an information form (h_U, J_U) and D ⊆ U to be eliminated, we let (ĥ_S, Ĵ_S) = Π̂_S(h_U, J_U) ≡ Π̂_{\D}(h_U, J_U) denote⁶ the operation of Gaussian elimination (GE) defined by S = U \ D and

ĥ_S = h_U[S] - J_U[S, D] J_U[D]^{-1} h_U[D]   (19)
Ĵ_S = J_U[S] - J_U[S, D] J_U[D]^{-1} J_U[D, S]   (20)

The matrix Ĵ_S is the Schur complement of the sub-matrix J_U[D] in J_U. Straightforward manipulations lead to the following well-known result:

(Ĵ_S)^{-1} = (J_U^{-1})[S]   and   (Ĵ_S)^{-1} ĥ_S = (J_U^{-1} h_U)[S].   (21)

Thus, the information form (ĥ_S, Ĵ_S) corresponds to the marginal on S with respect to the model (h_U, J_U). Also, GE may be implemented recursively as follows: given an elimination order (d_1, ..., d_n) of the elements in D, compute (20) as Π̂_{\d_n} ∘ ··· ∘ Π̂_{\d_1}(h_U, J_U), that is, by eliminating one variable at a time. Note also that only those entries of h_U and J_U indexed by ∂D are modified by GE.

Footnote 6: Two notations are introduced, as in some cases S = U \ D is given explicitly, while in others it is only implicitly specified in terms of U and D.
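The two information-form operators just defined, composition and Gaussian elimination (19)-(20), are short to write down. The following sketch is ours, using dense numpy for clarity rather than the sparse representations RCM relies on; it also checks the marginalization identity (21) on a random example.

```python
# Composition and Gaussian elimination of information forms, with a check of (21).
import numpy as np

def compose(h1, J1, h2, J2, J12):
    """Joint model of two disjoint subfields with interaction J12."""
    return np.concatenate([h1, h2]), np.block([[J1, J12], [J12.T, J2]])

def eliminate(h, J, keep, drop):
    """Eqs. (19)-(20): return (h_S, J_S) with S = keep, after eliminating drop."""
    Jsd = J[np.ix_(keep, drop)]
    Jdd_inv = np.linalg.inv(J[np.ix_(drop, drop)])
    h_S = h[keep] - Jsd @ Jdd_inv @ h[drop]
    J_S = J[np.ix_(keep, keep)] - Jsd @ Jdd_inv @ J[np.ix_(drop, keep)]
    return h_S, J_S

rng = np.random.default_rng(1)
A1 = rng.standard_normal((3, 3)); J1 = A1 @ A1.T + 3 * np.eye(3)
A2 = rng.standard_normal((2, 2)); J2 = A2 @ A2.T + 2 * np.eye(2)
J12 = 0.1 * rng.standard_normal((3, 2))
h1, h2 = rng.standard_normal(3), rng.standard_normal(2)

h, J = compose(h1, J1, h2, J2, J12)
keep, drop = [0, 1, 2], [3, 4]
h_S, J_S = eliminate(h, J, keep, drop)

P = np.linalg.inv(J)                          # eq. (21): the eliminated form is the marginal on S
assert np.allclose(np.linalg.inv(J_S), P[np.ix_(keep, keep)])
assert np.allclose(np.linalg.solve(J_S, h_S), (P @ h)[keep])
```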

Fig. 4. Initialization of a cavity model for a small subfield U_γ ∈ L, corresponding to a leaf of T: (a) the initial subfield model J[U_γ], a sub-matrix of J; (b) the cavity model Ĵ_{R_γ} = Π̂ J[U_γ] after Gaussian elimination of the interior variables U_γ° = U_γ \ R_γ; (c) the final thinned cavity model J_{R_γ} = Π_δ Ĵ_{R_γ} defined on the surface R_γ of subfield U_γ.

Hence, GE is a localized operation within the graphical representation of the GMRF, as suggested by Figs. 4(a) and 4(b). However, eliminating D typically has the effect of causing Ĵ_S[∂D] to become full, as shown in Fig. 4(b). This creation of fill can spoil the graphical model so that recursive GE becomes intractable, with worst-case cubic complexity in dense graphs.

Given an information matrix J_U, we denote the result of model-order reduction by Π_δ J_U. The model reduction algorithm in Section III requires specifying a parameter δ which controls the tolerance on conditional mutual information for the removal of an edge. The procedure then determines which edges in the graph corresponding to J_U to remove and determines the projection to this thinned graph. This projection preserves variances and edgewise cross-covariances on the thinned graph, which is equivalent to Π̂_C(Π_δ J_U) = Π̂_C J_U for each clique C ⊆ U of the thinned graph.

In the following sections, we first develop our two-pass approximate inference procedure, focusing on calculation of just the information matrices, which are all independent of h. Then, we provide additional calculations involving h and x̂, presented as a separate two-pass procedure which then serves as a preconditioner in an iterative method.

C. Upward Pass: Cavity Model Propagation

In this first step, messages are passed from the leaves of the tree L up towards the root V. These upward messages take the form of cavity models, encoding conditional statistics of variables lying in the surfaces of given subfields. To be precise, each cavity model, represented by an information matrix J_{R_γ}, approximates a conditional density p(x_{R_γ} | x_{B_γ} = 0) so that J_{R_γ} is a tractable, thin matrix.

1) Leaf-Node Initialization: For each U_γ ∈ L we initialize a cavity model as follows: We begin with the local information matrix J[U_γ] as depicted in Fig. 4(a). This specifies the conditional density p(x_{U_γ} | x_{B_γ} = 0) ∝ f(x_{U_γ}; 0, J[U_γ]). We then eliminate all variables within the interior of U_γ by Gaussian elimination: Ĵ_{R_γ} = Π̂_{R_γ} J[U_γ]. This has the effect of deleting all nodes in the interior of U_γ and updating the matrix parameters on the surface. As indicated in Fig. 4(b), this also induces fill within the information matrix. To ensure tractable computations in later stages, we thin this model: J_{R_γ} = Π_δ Ĵ_{R_γ}, yielding a reduced-order cavity model, as shown in Fig. 4(c), for each subfield U_γ ∈ L. Then we are ready to proceed up the tree, growing larger cavity models from smaller ones.

2) Growing Cavity Models: Let U_γ ⊆ V be a subfield in U where we have already constructed the two cavity models for R_α and R_β as depicted in Fig. 5(a). Then, we construct the cavity model for U_γ as follows:

a) Join Cavity Models: First, we form the composition of the two sub-cavity models as indicated in Fig. 5(b): J_{S^γ} = J_{R_α} ⊕ J[R_α, R_β] ⊕ J_{R_β}. Note that S^γ = R_α ∪ R_β is a superset of R_γ.

b) Variable Elimination: Next, we must eliminate the extra variables D_γ ≜ S^γ \ R_γ, to obtain the marginal information matrix Ĵ_{R_γ} = Π̂_{R_γ} J_{S^γ}.
To ensure scalability, rather than eliminating all variables at once, we eliminate variables recursively beginning with those farthest from the surface and working our way towards the surface. This is an efficient computation thanks to model reductions performed previously in U α and U β. c) Model Thinning: This preceding elimination step induces fill across the cavity (Fig. 5(c)). Hence, to maintain tractability as we continue, we perform model-order reduction yielding J Rγ = Π δ Ĵ Rγ which is the desired reduced-order cavity model represented in Fig. 5(d). This projection step requires that we compute moments of the graphical model specified by ĴR γ. Thanks to model thinning in the subtree of T rooted at γ, these moments can be computed efficiently.

Fig. 5. Recursive construction of a cavity model: (a) cavity models J_{R_α}, J_{R_β} of sub-cells U_α, U_β; (b) joined cavity model J_{S^γ} = J_{R_α} ⊕ J[R_α, R_β] ⊕ J_{R_β} defined on separator S^γ = R_α ∪ R_β; (c) the cavity model Ĵ_{R_γ} = Π̂ J_{S^γ} after Gaussian elimination of variables S^γ \ R_γ; (d) the final thinned cavity model J_{R_γ} = Π_δ Ĵ_{R_γ} defined on the surface R_γ of subfield U_γ.

Fig. 6. Recursive construction of a blanket model: (a) the cavity model J_{R_β} of the sibling subfield U_β and the blanket model J_{B_γ} of the parent; (b) joined cavity/blanket model J_{S_α} = J_{R_β} ⊕ J[R_β, B_γ] ⊕ J_{B_γ} defined on the separator S_α = R_β ∪ B_γ; (c) the blanket model Ĵ_{B_α} = Π̂ J_{S_α} after Gaussian elimination of variables S_α \ B_α; (d) the final thinned blanket model J_{B_α} = Π_δ Ĵ_{B_α} defined on the blanket B_α of subfield U_α.

D. Downward Pass: Blanket Model Propagation

The next, downward pass on the tree T involves messages in the form of blanket models, that is, graphical models encoding the conditional statistics of variables lying in the blanket of some subfield. Each subfield's blanket model is a concise summary of the complement of that subfield sufficient for near-optimal inference within the subfield. Specifically, the blanket model J_{B_γ} is a tractable approximation of the conditional model p(x_{B_γ} | x_{R_γ} = 0).

1) Root-Node Initialization: Note that the blanket of V is the empty set, so that a blanket model is not required for the root of T. As we move down to the children U_{α(0)} and U_{β(0)}, we note that B_{α(0)} = R_{β(0)} and, hence, a blanket model for U_{α(0)} is given by the cavity model for U_{β(0)}, which was computed in the upward pass. Hence, we already have blanket models for U_{α(0)} and U_{β(0)} and are ready to build blanket models for their descendants.

2) Shrinking Blanket Models: Suppose that we already have the blanket model for U_γ as represented in Fig. 6(a). Then, we can construct the blanket model for the child U_α as follows:

a) Joining Blanket and Sub-Cavity Model: First, we form the composition of the blanket model defined on B_γ with the cavity model defined on R_β (from the sibling of α) as shown in Fig. 6(b): J_{S_α} = J_{B_γ} ⊕ J[B_γ, R_β] ⊕ J_{R_β}. Note that S_α = B_γ ∪ R_β is a superset of B_α.

b) Variable Elimination: Next, we eliminate all variables in D_γ ≜ S_α \ B_α, yielding Ĵ_{B_α} = Π̂_{B_α} J_{S_α}. To ensure scalable computations, we again perform variable elimination recursively, starting with vertices farthest from the blanket and working our way towards U_α. The result is depicted in Fig. 6(c).

c) Model Thinning: Lastly, we thin this resulting blanket model: J_{B_α} = Π_δ Ĵ_{B_α}, yielding our reduced-order blanket model for subfield U_α (Fig. 6(d)). The blanket model for U_β is computed in an identical manner with the roles of α and β reversed.

3) Leaf-Node Marginalization: Once we have constructed a blanket model for each of the smallest subfields U_γ ∈ L, we can join this model with the conditional model for the enclosed subfield (that is, the model used to seed the upwards pass), to obtain a graphical model approximation of the (zero-mean) marginal density p(x_{Ū_γ}) on Ū_γ ≜ U_γ ∪ B_γ, given in information form by J_{Ū_γ} = J[U_γ] ⊕ J[U_γ, B_γ] ⊕ J_{B_γ}. Inverting each of these localized models, that is, computing P_{Ū_γ} = (J_{Ū_γ})^{-1}, yields variances of all variables and covariances for each edge of G.
E. An RCM-Preconditioner for Iterative Estimation

In the preceding sections, we have described a recursive algorithm for constructing a hierarchical collection of cavity and blanket models, described by thin information matrices. In this section, we describe how to extend these computations to compute the estimates x̂ solving J x̂ = h. We begin by describing a two-pass algorithm, based on the cavity models computed previously, which computes an approximation of x̂, and then describe an iterative procedure, using the two-pass algorithm as a preconditioner, that iteratively refines the estimate.

1) Upward-Pass: We specify a recursive algorithm that works its way up the tree, computing a potential vector h_{R_γ} at each node γ ∈ Γ of the dissection tree. Let J_{R_γ} denote the cavity models computed previously by the RCM upward pass. For each leaf node, we solve J[U_γ] x_{U_γ} = h[U_γ] for x_{U_γ} and then compute h_{R_γ} = J_{R_γ} x_{U_γ}[R_γ]. At each non-leaf node, we compute h_{R_γ} as follows:

a) Join: Form the composite model (h_{S^γ}, J_{S^γ}) = (h_{R_α}, J_{R_α}) ⊕ J[R_α, R_β] ⊕ (h_{R_β}, J_{R_β}), where S^γ = R_α ∪ R_β, by joining the two cavity models from the children.

b) Sparse Solve: Given this joint model, we solve J_{S^γ} x_{S^γ} = h_{S^γ} using direct methods, which is tractable because J_{S^γ} is a thin, sparse matrix.⁷

c) Sparse Multiply: Finally, we compute the potential vector h_{R_γ} = J_{R_γ} x_{S^γ}[R_γ], which is a tractable computation because J_{R_γ} is sparse.

2) Downward-Pass (Back-Substitution): Once the root node is reached, we have the information form (h_{S^0}, J_{S^0}) = (h_{R_{α(0)}}, J_{R_{α(0)}}) ⊕ J[R_{α(0)}, R_{β(0)}] ⊕ (h_{R_{β(0)}}, J_{R_{β(0)}}) at the top-level separator of the dissection tree, which is an approximate model for the marginal distribution p(x_{S^0}). Hence, we can compute an approximation for the means x̂_{S^0} by solving J_{S^0} x̂_{S^0} = h_{S^0}. Conditioning on this estimate, we can then recurse back down the tree, filling in the missing values of x̂ along each separator, thereby propagating estimates down the tree. In this downward pass, each node below the root of the tree receives an estimate x̂_{R_γ} of the variables in the surface R_γ of the corresponding subfield. Again using the model (h_{S^γ}, J_{S^γ}), formed by the upward computations, we interpolate into the subfield, computing x̂_{D_γ}, where D_γ = S^γ \ R_γ, by solution of the linear system of equations

J_{D_γ} x̂_{D_γ} = h_{D_γ},   where   J_{D_γ} ≜ J_{S^γ}[D_γ],   h_{D_γ} ≜ h_{S^γ}[D_γ] - J_{S^γ}[D_γ, R_γ] x̂_{R_γ}.   (22)

The estimate x̂_{D_γ} is computed with respect to the approximation of p(x_{D_γ} | x̂_{R_γ}) ∝ f(x_{D_γ}; h_{D_γ}, J_{D_γ}) (after normalization), which is approximate because of the model thinning steps in RCM. Once the leaves of the tree are reached, the interior of each subfield is interpolated similarly, thus yielding a complete estimate x̂.

3) Richardson Iteration: The preceding two-pass algorithm may be used to compute an approximate solution of Jx = b for an arbitrary right-hand side b. The resulting estimate is linear in b and we denote this linear operator by M. Using M as a preconditioner,⁸ we compute a sequence of estimates {x̂^{(n)}} defined by x̂^{(0)} = 0 and

x̂^{(n+1)} = x̂^{(n)} + M (h - J x̂^{(n)}).   (23)

Let ρ denote the spectral radius of I - MJ. If ρ < 1, then x̂^{(n)} converges to x̂ ≜ J^{-1} h, with ‖x̂^{(n)} - x̂‖ on the order of ρ^n ‖x̂‖. For small δ, this condition is met and we achieve rapid convergence to the correct means.
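A minimal sketch of the Richardson iteration (23) follows (our illustration): M here is a simple block-solve stand-in for the RCM two-pass preconditioner, and the matrix and block partition are invented for the example.

```python
# Preconditioned Richardson iteration x <- x + M(h - J x).
import numpy as np

def richardson(J, h, M, iters=200, tol=1e-10):
    x = np.zeros_like(h)
    for _ in range(iters):
        r = h - J @ x                        # residual
        if np.linalg.norm(r) < tol * np.linalg.norm(h):
            break
        x = x + M(r)                         # preconditioned correction
    return x

n = 6
J = 2.0 * np.eye(n) + np.diag(np.full(n - 1, -0.9), 1) + np.diag(np.full(n - 1, -0.9), -1)
h = np.ones(n)

blocks = [list(range(0, 3)), list(range(3, 6))]
def M(r):                                    # stand-in preconditioner: exact solves on two blocks
    y = np.zeros_like(r)
    for b in blocks:
        y[b] = np.linalg.solve(J[np.ix_(b, b)], r[b])
    return y

x = richardson(J, h, M)
assert np.allclose(J @ x, h, atol=1e-8)
```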
V. APPLICATIONS IN REMOTE SENSING

In this section we develop two applications of RCM in remote sensing: (1) interpolation of satellite altimetry measurements of sea-surface height, and (2) estimation of the surface of a large salt-deposit beneath the Gulf of Mexico. The purpose of these examples is to demonstrate that RCM scales well to very large problems while yielding estimates and error covariances that are close to those that would have resulted if exact optimal estimation had been performed instead. Although the specific statistical models used in these examples are perhaps over-simplified, the results that follow (which include space-varying measurement densities and hence space-variant estimation errors) do serve to demonstrate the applicability of RCM to very large spatial estimation problems.

Footnote 7: We use a sparse Cholesky factorization of J_{S^γ} and back-substitution based on h_{S^γ}. Also, some computation can be saved if we use an elimination order beginning with S^γ \ R_γ, because we only need to compute x_{S^γ}[R_γ] in the back-substitution.

Footnote 8: To implement M b efficiently, we pre-compute and store calculations that do not depend on b. For instance, we compute a sparse Cholesky factorization for each J_{R_γ} using a low-fill elimination order. This leads to an extremely fast preconditioner because only back-substitution steps are required each time we apply M to a different b vector.

A. Model Specifications

Throughout this section, we consider GMRFs of the form

p(x, y) ∝ exp{ -½ ( ‖Dx‖²/σ_r² + ‖y - Cx‖²/σ_d² ) }   (24)

where x ∈ R^n represents the vector of field values at the vertices of a regular 2D lattice and y ∈ R^m is a vector of local, noisy measurements of the underlying field at an irregular set of points scattered throughout the field. Here, ‖Dx‖² represents our prior for x, which serves to regularize the field, and the data-fidelity term ‖y - Cx‖² represents our measurement model. We consider two prior models commonly used in image processing. The thin-membrane (TM) model is defined such that each row d_k corresponds to an edge {u, v} ∈ E, and has two non-zero components: d_{k,u} = +1 and d_{k,v} = -1. This gives a regularization term

‖Dx‖² = Σ_{{u,v}∈E} (x_u - x_v)²   (25)

that penalizes gradients, favoring level surfaces. The thin-plate (TP) model is defined such that each row d_v corresponds to a vertex v ∈ V and has non-zero components d_{v,v} = 1 and d_{v,u} = -1/|N(v)| for adjacent vertices u ∈ N(v). This gives a regularization term

‖Dx‖² = Σ_{v∈V} ( x_v - (1/|N(v)|) Σ_{u∈N(v)} x_u )²   (26)

that penalizes curvature, favoring flat surfaces. In general, the locations of the measurements y define an irregular pattern with respect to the grid defined for x. Moreover, the location of individual measurements may fall between these grid points. For this reason, each measurement y_t is modeled as the bilinear interpolation c_t x of the four nearest grid points to the actual measurement location, corrupted by zero-mean, white Gaussian noise: y_t = c_t x + v_t, where v_t ~ N(0, σ_d²). The posterior density p(x | y) may be expressed in information form with parameters

J = D^T D / σ_r² + C^T C / σ_d²,   h = C^T y / σ_d².

Thus, the fill-pattern of J (and hence the posterior Markov structure of x) is determined both by D^T D and C^T C. In the TM model, D^T D has non-zero off-diagonal entries only at those locations corresponding to nearest neighbors in the lattice. In the TP model, there are also additional connections between pairs of vertices that are two steps away in the square lattice, including diagonal edges. Finally, for each measurement y_k there is a contribution of c_k c_k^T to J, which creates edges between those four grid points closest to the location of measurement k. This results in a sparse J matrix where all edges are between nearby points in the lattice. Hence, we can apply RCM to the information model (h, J) to calculate approximations of the estimates x̂_v(y) = E{x_v | y} and error variances σ̂_v² = E{(x_v - x̂_v(y))² | y} for all vertices v ∈ V and error covariances E{(x_u - x̂_u(y))(x_v - x̂_v(y))} for all edges {u, v} ∈ E. In Appendix B, we also describe an iterative algorithm to estimate the model parameters σ_r and σ_d.

B. Sea-Surface Height Estimation

First, we consider the problem of performing near-optimal interpolation of satellite altimetry of sea-surface height anomaly (SSHA), measured relative to seasonal, space-variant mean-sea level.⁹ We model SSHA by the thin-membrane model, which seems an appropriate choice as it favors a level sea-surface. We estimate SSHA at the vertices of a lattice covering latitudes between ±60° and a full 360° of longitude, which yields a resolution of 1/5° in both latitude and longitude. The final world-wide estimates and associated error variances, obtained using RCM with δ = 10^{-4} and model parameters σ_r ≈ 1 cm and σ_d ≈ 3.5 cm, are displayed in Fig. 7.
In this example, RCM requires about three minutes to execute, including the run-time of both the cavity and blanket modeling procedures as well as the total run-time of the iterative procedure to compute the means. About 30 iterations are required to obtain a residual error $\|h - J\hat{x}^{(k)}\|$ less than $10^{-4}$, where each iteration takes 2-3 seconds.

9 This data was collected by the Jason-1 satellite over a ten-day period beginning 12/1/2004 and is available from the Jet Propulsion Laboratory.
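The stopping criterion quoted above can be illustrated with a simple preconditioned iteration for the means: apply a preconditioner $M$ (for example, the hypothetical `BlockPreconditioner.apply` sketched earlier) to the current residual until $\|h - J\hat{x}^{(k)}\| < 10^{-4}$. The loop below is a generic Richardson-style sketch, not the paper's particular iteration; it converges only when the preconditioned system is well conditioned, and in practice the same preconditioner could instead be used inside conjugate gradients.

```python
import numpy as np

def preconditioned_richardson(J, h, precond, tol=1e-4, max_iter=200):
    """Solve J x = h approximately, applying the preconditioner once per iteration.

    precond: callable mapping a residual vector to an approximate J^{-1}-product.
    Stops when the residual norm ||h - J x^(k)|| drops below tol.
    """
    x = np.zeros_like(h, dtype=float)
    for k in range(max_iter):
        r = h - J @ x                 # residual at iteration k
        if np.linalg.norm(r) < tol:
            return x, k
        x = x + precond(r)            # only substitution work inside precond
    return x, max_iter
```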

Fig. 7. Estimated sea-surface height anomaly (cm) and space-variant standard deviation of estimation error (cm), computed using RCM.

C. Salt-Top Estimation

Next, we consider the problem of estimating the salt-top, that is, the top surface of a large salt deposit located several kilometers beneath the sea-floor somewhere in the Gulf of Mexico. The data for this estimation problem, provided courtesy of Shell International Exploration, Inc., consists of a large set of picks chosen by analysts while viewing cross-sections of seismic sounding data. Hence, these picks fall along straight line segments in latitude and longitude, and it is our goal to interpolate between these points. For this problem, we use the thin-plate model for the surface of the salt deposit, which allows for the undulations typically seen in the salt-top, with a lattice at a resolution of 60 feet and with model parameters $\sigma_r \approx 12$ feet and $\sigma_d \approx 35$ feet. The final estimates and error variances are shown in Fig. 8. These results were obtained using RCM with a tolerance of $\delta = 10^{-4}$, which required about five minutes to run, including the total time required for iterative computation of the means. The run-times for the TP model are somewhat slower than for the TM model because the Markov blankets arising in the TP model are twice as wide as in the TM model, so the cavity and blanket models are more complex.

Fig. 8. Estimated salt-top and space-variant standard deviation of estimation error (feet), computed using RCM.

VI. CONCLUSION

We have presented a new, principled approach to approximate inference in very large GMRFs, employing a recursive model-reduction strategy based on information-theoretic principles, and have applied this method to perform near-optimal interpolation of sea-surface satellite altimetry. These results show the practical utility of the method for near-optimal, large-scale estimation. Several possible directions for further research are suggested by this work. First, the accuracy of RCM in applications such as those illustrated here provides considerable motivation for developing a better theoretical understanding of its accuracy and stability. For instance, if it were possible to compute and propagate upper bounds on the information divergence in RCM, this would be very useful and might lead to a robust formulation. Although we have focused on examples using Gaussian prior models, we expect RCM will also prove useful in non-linear edge-preserving methods such as [44]. Although these methods use a non-Gaussian prior, their solution generally involves solving a sequence of Gaussian problems with an adaptive, space-variant process noise. Hence, RCM could be used as a fast computational engine in these methods. We are also interested in applying RCM to higher-dimensional GMRFs, such as those arising in seismic and tomographic 3D estimation problems or for filtering of dynamic GMRFs. We anticipate that it will be important to take advantage of the inherently parallel nature of RCM to address these computationally intensive applications. Another direction to explore is based on the rich class of multi-scale models, such as models having multi-grid or pyramidal structure. For example, the work in [11] demonstrates the utility and drawbacks of using multi-resolution models defined on trees for estimation of ocean height from satellite data. Such models allow one to capture long-distance correlations much more efficiently than a single-resolution nearest-neighbor model, but the tree structure used in [11] leads to artifacts at tree boundaries, something that RCM is able to avoid. This suggests the idea of enhancing models as in [11] by including new edges that eliminate these artifacts but that introduce cycles into these multi-resolution graphical models. If such models can be developed, RCM offers a principled, scalable approximate inference algorithm well-suited to such hierarchical, multi-resolution models. Finally, while the specifics of this paper concern Gaussian models, the general framework we have outlined should apply more generally. This is especially pertinent for inference in discrete MRFs, where computation of either the marginal distributions or the mode grows exponentially with the width of the graph [6], [10], which suggests developing counterparts to RCM for these problems.

APPENDIX

A. Recursive Inference Algorithm

In this appendix, we summarize a recursive algorithm for computing the moments $\eta_G = \Lambda(\theta_G)$ of a zero-mean, chordal GMRF. Also, by differentiating each step of this procedure, we obtain an algorithm to compute the first-order change in moment parameters $d\eta_G$ due to a perturbation $d\theta_G$. The complexity of both algorithms is $O(nw^3)$, where $n$ is the number of variables and $w$ is the size of the largest clique. These algorithms are used as subroutines in the model-reduction procedure described in Section III. In RCM, these methods are only used for thin cavity and blanket models and are tractable in that context.

Let $T = (\Gamma, E_\Gamma)$ be a junction tree of $G$. We obtain a directed version of $T$ by selecting an arbitrary clique to be the root node and orienting the edges away from the root. For each non-root node $\gamma$, let $\pi(\gamma)$ denote its parent. We split each clique $C_\gamma$ into a separator $S_\gamma = C_\gamma \cap C_{\pi(\gamma)}$ and a residual set $R_\gamma = C_\gamma \setminus C_{\pi(\gamma)}$. At the root, these are defined as $S_\gamma = \emptyset$ and $R_\gamma = C_\gamma$.

Now, we specify our recursive inference procedure. The input to this procedure is the sparse matrix $J$, which is defined over a chordal graph and parameterized by $\theta_G$. The output is a sparse matrix $P$, defined on the same chordal graph, with elements specified by $\eta_G$. In the differential form of the algorithm, we also have a sparse input $dJ$ and a sparse output $dP$, corresponding to $d\theta_G$ and $d\eta_G$.

1) Upward Pass: For each node $\gamma \in \Gamma$ of the junction tree, starting from the leaves of the tree and working upwards, we perform the following computations in the order shown:
$$Q_\gamma = (J[R_\gamma])^{-1}$$
$$A_\gamma = -Q_\gamma\, J[R_\gamma, S_\gamma]$$
$$J[S_\gamma] \leftarrow J[S_\gamma] + J[S_\gamma, R_\gamma]\, A_\gamma$$
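The elimination step above can be sketched for a single clique using dense NumPy blocks; the minus sign in $A_\gamma$ follows the reconstruction given here, and the routine is only an illustration of the Schur-complement update, not the paper's sparse implementation.

```python
import numpy as np

def eliminate_clique(J, R, S):
    """One upward-pass step: eliminate residual set R against separator S.

    J is a dense symmetric information matrix (NumPy array); R and S are
    disjoint index arrays for the clique's residual and separator variables.
    Returns (Q, A) and updates the separator block of J in place.
    """
    JRR = J[np.ix_(R, R)]
    JRS = J[np.ix_(R, S)]
    Q = np.linalg.inv(JRR)            # Q_gamma = (J[R_gamma])^{-1}
    A = -Q @ JRS                      # A_gamma = -Q_gamma J[R_gamma, S_gamma]
    # J[S_gamma] <- J[S_gamma] + J[S_gamma, R_gamma] A_gamma   (Schur complement)
    J[np.ix_(S, S)] += JRS.T @ A      # J[S, R] = J[R, S]^T by symmetry
    return Q, A
```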
