Embedded Trees: Estimation of Gaussian Processes on Graphs with Cycles


Erik B. Sudderth, Martin J. Wainwright, and Alan S. Willsky

MIT LIDS TECHNICAL REPORT P-2562 (APRIL 2003), SUBMITTED TO IEEE TRANSACTIONS ON SIGNAL PROCESSING

Abstract: Graphical models provide a powerful general framework for encoding the structure of large-scale estimation problems. However, the graphs describing typical real-world phenomena contain many cycles, making direct estimation procedures prohibitively costly. In this paper, we develop an iterative inference algorithm for general Gaussian graphical models. It operates by exactly solving a series of modified estimation problems on spanning trees embedded within the original cyclic graph. When these subproblems are suitably chosen, the algorithm converges to the correct conditional means. Moreover, and in contrast to many other iterative methods, the tree-based procedures we propose can also be used to calculate exact error variances. Although the conditional mean iteration is effective for quite densely connected graphical models, the error variance computation is most efficient for sparser graphs. In this context, we present a modeling example which suggests that very sparsely connected graphs with cycles may provide significant advantages relative to their tree-structured counterparts, thanks both to the expressive power of these models and to the efficient inference algorithms developed herein. The convergence properties of the proposed tree-based iterations are characterized both analytically and experimentally. In addition, by using the basic tree-based iteration to precondition the conjugate gradient method, we develop an alternative, accelerated iteration which is finitely convergent. Simulation results are presented that demonstrate this algorithm's effectiveness on several inference problems, including a prototype distributed sensing application.

Portions of this work have appeared previously in the conference paper [1] and thesis [2]. E. B. Sudderth and A. S. Willsky are with the Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA USA (esuddert@mit.edu; willsky@mit.edu). M. J. Wainwright is with the Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA USA (martinw@eecs.berkeley.edu). This research was supported in part by ONR under Grant N, by AFOSR under Grant F, and by an ODDR&E MURI funded through ARO Grant DAAD. E. B. Sudderth was partially funded by an NDSEG fellowship.

I. INTRODUCTION

Gaussian processes play an important role in a wide range of practical, large-scale statistical estimation problems. For example, in such fields as computer vision [3, 4] and oceanography [5], Gaussian priors are commonly used to model the statistical dependencies among hundreds of thousands of random variables. Since direct linear algebraic methods are intractable for very large problems, families of structured statistical models, and associated classes of efficient estimation algorithms, must be developed.

For problems with temporal, Markovian structure, linear state space models [6] provide a popular and effective framework for encoding statistical dependencies. Temporal Markov models correspond to perhaps the simplest class of graphical models [7], in which nodes index a collection of random variables (or vectors), and edges encode the structure of their statistical relationships (as elucidated in Section II). Specifically, temporal state space models are associated with simple chain graphs (see G_ss in Figure 1). When a collection of random variables has more complex statistical structure, more complex graphs, often containing loops or cycles and many paths between pairs of nodes, are generally required.

For graphs without loops, including Markov chains and more general tree-structured graphs, very efficient optimal inference algorithms exist. For example, for Gaussian processes the Kalman filter and Rauch-Tung-Striebel (RTS) smoother for state space models [6], and their generalizations to arbitrary trees [8], produce both optimal estimates and error variances with constant complexity per graph node. However, for large graphs with cycles, exact (non-iterative) methods become prohibitively complex, leading to the study of iterative algorithms. Although one can apply standard linear algebraic methods (such as those discussed in Section II-B), there is also a considerable literature on iterative algorithms specialized for statistical inference on loopy graphs.

In this paper, we present a new class of iterative inference algorithms for arbitrarily structured Gaussian graphical models. As we illustrate, these algorithms have excellent convergence properties. Just as importantly, and in contrast to existing methods, our algorithms iteratively compute not only optimal estimates, but also exact error variances. Moreover, we show that our algorithms can be combined with classical linear algebraic methods (in particular conjugate gradient) to produce accelerated, preconditioned iterations with very attractive performance.

In some contexts, such as the sensor network problem presented in Section VI-C, the structure of the graph relating a set of variables may be determined a priori by physical constraints. In many other situations, however, the choice of the graph is also part of the signal processing problem. That is, in many cases the model used does not represent truth, but rather a tradeoff between the accuracy with which the model captures important features of the phenomenon of interest, and the tractability of the resulting signal processing algorithm. At one extreme are tree-structured graphs, which admit very efficient estimation algorithms but have comparatively limited modeling power.

The addition of edges, and the creation of loops, tends to increase modeling power, but also leads to more complex inference algorithms. The following example explores these tradeoffs in more detail, and demonstrates that for many statistical signal processing problems it is possible to construct graphical models which effectively balance the conflicting goals of model accuracy and algorithmic tractability. As we demonstrate in Section VI-A, the new algorithms developed in this paper are particularly effective for this example.

A. Graphical Modeling Using Chains, Trees, and Graphs with Cycles

Consider the following one-dimensional, isotropic covariance function, which has been used to model periodic "hole effect" dependencies arising in geophysical estimation problems [9, 10]:

    C(τ; ω) = sin(ωτ) / (ωτ),   ω > 0                                    (1)

Figure 2 shows the covariance matrix corresponding to 128 samples of a process with this covariance. In the same figure, we illustrate the approximate modeling of this process with a chain-structured graph G_ss. More precisely, we show the covariance of a temporal state space model with parameters chosen to provide an optimal¹ approximation to the exact covariance, subject to the constraint that each state variable has dimension 2. Notice that although the Markov chain covariance P_ss accurately captures short-range correlations, it is unable to model important long-range dependencies inherent in this covariance.

Multiscale autoregressive (MAR) models provide an alternative modeling framework that has been demonstrated to be powerful, efficient, and widely applicable [12]. MAR models define state space recursions on hierarchically organized trees, as illustrated by G_mar in Figure 1. Auxiliary or hidden variables are introduced at coarser scales of the tree in order to capture more accurately the statistical properties of the finest-scale process defined on the leaf nodes.² Since MAR models are defined on cycle-free graphs, they admit efficient optimal estimation algorithms, and thus are attractive alternatives for many problems. Multiscale methods are particularly attractive for two-dimensional processes, where quadtrees may be used to approximate notoriously difficult nearest-neighbor grids [4, 5].

Figure 2 shows a multiscale approximation P_mar to the hole effect covariance, where the dimension of each node is again constrained to be 2. The MAR model captures the long-range periodic correlations much better than the state space model. However, this example also reveals a key deficiency of MAR models: some spatially adjacent fine-scale nodes are widely separated in the tree structure. In such cases, the correlations between these nodes may be inadequately modeled, and blocky boundary artifacts are produced.

¹ For all modeling examples, optimality is measured by the Kullback-Leibler divergence [11]. We denote the KL divergence between two zero-mean Gaussians with covariances P and Q by D(P ‖ Q).
² In some applications, coarse-scale nodes provide an efficient mechanism for modeling non-local measurements [13], but in this example we are only interested in the covariance matrix induced at the finest scale.
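As a concrete illustration of the covariance in equation (1), the following sketch constructs the 128-sample hole effect covariance matrix. The sampling interval, the value of ω, and the small nugget added to keep the matrix numerically positive definite are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def hole_effect_cov(n=128, omega=2.0, dt=0.1, nugget=1e-6):
    """Covariance of n samples of the 'hole effect' process,
    C(tau; omega) = sin(omega*tau)/(omega*tau), with C(0) = 1.
    omega, dt, and nugget are illustrative choices."""
    tau = dt * np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    C = np.ones((n, n))
    nz = tau > 0
    C[nz] = np.sin(omega * tau[nz]) / (omega * tau[nz])
    return C + nugget * np.eye(n)   # nugget keeps C strictly positive definite

P_exact = hole_effect_cov()
print(P_exact.shape, np.all(np.linalg.eigvalsh(P_exact) > 0))
```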

Blockiness can be reduced by increasing the dimension of the coarse-scale nodes, but this often leads to an unacceptable increase in computational cost. One potential solution to the boundary artifact problem is to add edges between pairs of fine-scale nodes where discontinuities are likely to arise. Such edges should be able to account for short-range dependencies neglected by standard MAR models. To illustrate this idea, we have added three edges to the tree graph G_mar across the largest fine-scale boundaries, producing an augmented graph G_aug (see Figure 1). Figure 2 shows the resulting excellent approximation P_aug to the original covariance. This augmented multiscale model retains the accurate long-range correlations of the MAR model, while completely eliminating the worst boundary artifacts.

The previous example suggests that very sparsely connected graphs with cycles may offer significant modeling advantages relative to their tree-structured counterparts. Unfortunately, as the resulting graphs do have cycles, the extremely efficient inference algorithms which made tree-structured multiscale models so attractive are not available. The main goal of this paper is to develop inference techniques which allow both estimates and the associated error variances to be quickly calculated for the widest possible class of graphs. The algorithms we develop are particularly effective for graphs, like that presented in this example, which are nearly tree-structured.

B. Outline of Contributions

The primary contribution of this paper is to demonstrate that tree-based inference routines provide a natural basis for the design of estimation algorithms which apply to much broader classes of graphs. All of the algorithms depend on the fact that, within any graph with cycles, there are many embedded subgraphs for which optimal inference is tractable. Each embedded subgraph can be revealed by removing a different subset of the original graph's edges. We show that, by appropriately combining sequences of exact calculations on tractable subgraphs, it is possible to solve statistical inference problems defined on the original graph with cycles exactly and efficiently. Without question, the most tractable subgraphs are trees, and for simplicity of terminology we will refer to our methods as embedded trees (ET) algorithms. Similarly, we will often think of extracted subgraphs as being trees. However, all of the ideas developed in this paper carry over immediately to the more general case in which the extracted subgraph contains cycles but is still tractable, as illustrated in Section VI-C.

After presenting the necessary background regarding Gaussian graphical models and numerical linear algebra (Section II), we develop the details of our embedded trees algorithm for iterative, exact calculation of both means (Section III) and error variances (Section IV). The performance of this algorithm is analyzed both theoretically and experimentally, demonstrating the convergence rate improvements achieved through the use of multiple embedded trees.

In Section V, we then show how the basic ET algorithm may be used to precondition the conjugate gradient method, producing a much more rapidly convergent iteration. To emphasize the broad applicability of our methods, we provide an experimental evaluation on several different inference problems in Section VI. These include the augmented multiscale model of Figure 1, as well as a prototype distributed sensing application.

II. GAUSSIAN GRAPHICAL MODELS

A graph G = (V, E) consists of a node or vertex set V and a corresponding edge set E. In a graphical model [7], a random variable x_s is associated with each node s ∈ V. In this paper, we focus on the case where {x_s | s ∈ V} is a jointly Gaussian process, and the random vector x_s at each node has dimension d (assumed uniform for notational simplicity). Given any subset A ⊂ V, let x_A ≜ {x_s | s ∈ A} denote the set of random variables in A. If the N ≜ |V| nodes are indexed by the integers V = {1, 2, ..., N}, the Gaussian process defined on the overall graph is given by x ≜ [x_1ᵀ x_2ᵀ ... x_Nᵀ]ᵀ. Let x ~ N(µ, P) indicate that x is a Gaussian process with mean µ and covariance P. With graphical models, it is often useful to consider an alternative information parameterization x ~ N⁻¹(h, J), where J = P⁻¹ is the inverse covariance matrix and the mean is equal to µ = J⁻¹h.

Graphical models implicitly use edges to specify a set of conditional independencies. Each edge (s, t) ∈ E connects two nodes s, t ∈ V, where s ≠ t. In this paper, we exclusively employ undirected graphical models, for which the edges (s, t) and (t, s) are equivalent.³ Figure 3(a) shows an example of an undirected graph representing five different random variables. Such models are also known as Markov random fields (MRFs), or, for the special case of jointly Gaussian random variables, as covariance selection models in the statistics literature [16-18].

In undirected graphical models, conditional independence is associated with graph separation. Suppose that A, B, and C are subsets of V. Then B separates A and C if there are no paths between sets A and C which do not pass through B. The stochastic process x is said to be Markov with respect to G if x_A and x_C are independent conditioned on the random variables x_B in any separating set. For example, in Figure 3(a) the random variables x_1 and {x_4, x_5} are conditionally independent given {x_2, x_3}. If the neighborhood of a node s is defined to be Γ(s) ≜ {t | (s, t) ∈ E}, the set of all nodes which are directly connected to s, it follows immediately that

    p(x_s | x_{V\s}) = p(x_s | x_{Γ(s)})                                 (2)

³ There is another formalism for associating Markov properties with graphs which uses directed edges. Any directed graphical model may be converted into an equivalent undirected model, although some structure may be lost in the process [14, 15].

That is, conditioned on its immediate neighbors, the probability distribution of the random vector at any given node is independent of the rest of the process.

For general graphical models, the Hammersley-Clifford theorem [16] relates the Markov properties of G to a factorization of the probability distribution p(x) over cliques, or fully connected subsets, of G. For Gaussian models, this factorization takes a particular form which constrains the structure of the inverse covariance matrix [16, 18]. Given a Gaussian process x ~ N(µ, P), we partition the inverse covariance J ≜ P⁻¹ into an N × N grid of d × d submatrices {J_{s,t} | s, t ∈ V}. We then have the following result:

Theorem 1. Let x be a Gaussian stochastic process with inverse covariance J which is Markov with respect to G = (V, E). Assume that E is minimal, so that x is not Markov with respect to any G' = (V, E') such that E' ⊂ E. Then for any s, t ∈ V such that s ≠ t, J_{s,t} = J_{t,s}ᵀ will be nonzero if and only if (s, t) ∈ E.

Figure 3 illustrates Theorem 1 for a small sample graph. In most graphical models, each node is only connected to a small subset of the other nodes. Theorem 1 then shows that P⁻¹ will be a sparse matrix with a small (relative to N) number of nonzero entries in each row and column. Any Gaussian distribution satisfying Theorem 1 can be written as a product of positive pairwise potential functions involving adjacent nodes:

    p(x) = (1/Z) ∏_{(s,t)∈E} ψ_{s,t}(x_s, x_t)                           (3)

Here, Z is a normalization constant. For any inverse covariance matrix J, the pairwise potentials can be expressed in the form

    ψ_{s,t}(x_s, x_t) = exp{ -½ [x_sᵀ x_tᵀ] [ J_{s(t)}  J_{s,t} ; J_{t,s}  J_{t(s)} ] [x_s ; x_t] }    (4)

where the J_{s(t)} terms are chosen so that for all s ∈ V, Σ_{t∈Γ(s)} J_{s(t)} = J_{s,s}. Note that there are many different ways to decompose p(x) into pairwise potentials, each corresponding to a different partitioning of the block diagonal entries of J. However, all of the algorithms and results presented in this paper are invariant to the specific choice of decomposition.

A. Graph-Based Inference Algorithms

Graphical models may be used to represent the prior distributions underlying Bayesian inference problems. Let x ~ N(0, P) be an unobserved random vector which is Markov with respect to G = (V, E). We assume that the graphical prior is parameterized by the graph-structured inverse covariance matrix J = P⁻¹, or equivalently by pairwise potential functions as in equation (4).
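The sketch below makes the graph-structured prior and the sparsity guaranteed by Theorem 1 concrete for a hypothetical 5-node single-cycle graph with scalar variables; the specific edge weights are illustrative and are not taken from Figure 3. It verifies that the off-diagonal entries of J are nonzero exactly on the graph's edges, while the covariance P = J⁻¹ is generally dense.

```python
import numpy as np

# Hypothetical 5-node cycle with scalar (d = 1) variables; weights are illustrative.
N = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
J = 2.0 * np.eye(N)                 # diagonal entries J_{s,s}
for s, t in edges:                  # off-diagonal entries only on graph edges
    J[s, t] = J[t, s] = -0.6

assert np.all(np.linalg.eigvalsh(J) > 0)        # valid (positive definite) model
P = np.linalg.inv(J)

# Theorem 1: J_{s,t} is nonzero iff (s, t) is an edge; P itself has no such zeros.
for s in range(N):
    for t in range(s + 1, N):
        is_edge = (s, t) in edges or (t, s) in edges
        assert (abs(J[s, t]) > 0) == is_edge
print("P is dense off the diagonal:", np.all(np.abs(P) > 1e-12))
```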

Given a vector of noisy observations y = Cx + v, v ~ N(0, R), the conditional distribution p(x | y) ~ N(x̂, P̂) may be calculated from the information form of the normal equations:

    P̂⁻¹ x̂ = CᵀR⁻¹y                                                      (5)
    P̂ = (J + CᵀR⁻¹C)⁻¹                                                  (6)

The conditional mean x̂ provides both the Bayes least squares and maximum a posteriori (MAP) estimates of x. The error covariance matrix P̂ measures the expected deviation of x from the estimate x̂. We assume, without loss of generality,⁴ that the observation vector y decomposes into a set {y_s}_{s=1}^N of local observations of the individual variables {x_s}_{s=1}^N. In this case, C and R are block diagonal matrices. We would like to compute the marginal distributions p(x_s | y) ~ N(x̂_s, P̂_s) for all s ∈ V. Note that each x̂_s is a subvector of x̂, while each P̂_s is a block diagonal element of P̂. These marginal distributions could be directly calculated from equations (5, 6) by matrix inversion in O((Nd)³) operations. For large problems, however, this cost is prohibitively high, and the structure provided by G must be exploited.

When G is a Markov chain, efficient dynamic programming based recursions may be used to exactly compute p(x_s | y) in O(Nd³) operations. For example, when the potentials are specified by a state space model, a standard Kalman filter may be combined with a complementary reverse-time recursion [6]. These algorithms may be directly generalized to any graph which contains no cycles [8, 14, 19, 20]. We refer to such graphs as tree-structured. Tree-based inference algorithms use a series of local message passing operations to exchange statistical information between neighboring nodes. One of the most popular such methods is known as the sum-product [19] or belief propagation (BP) [14] algorithm. The junction tree algorithm [7, 16] extends tree-based inference procedures to general graphs by first clustering nodes to break cycles, and then running the BP algorithm on the tree of clusters. However, in order to ensure that the junction tree is probabilistically consistent, the dimension of the clustered nodes may often be quite large. In these cases, the resulting computational cost is generally comparable to direct matrix inversion.

The intractability of exact inference methods has motivated the development of alternative iterative algorithms. One of the most popular is known as loopy belief propagation [21]. Loopy BP iterates the local message passing updates underlying tree-based inference algorithms until they (hopefully) converge to a fixed point. For many graphs, especially those arising in error correcting codes [19], these fixed points very closely approximate the true marginal distributions [22]. For Gaussian graphical models, it has been shown that when loopy BP does converge, it always calculates the correct conditional means [21, 23]. However, the error variances are incorrect because the algorithm fails to account properly for the correlations induced by the graph's cycles.

⁴ Any observation involving multiple nodes may be represented by a set of pairwise potentials coupling those nodes, and handled identically to potentials arising from the prior.
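Returning to equations (5) and (6): for small problems they can be evaluated by direct linear algebra. The sketch below does so for the same hypothetical 5-node cycle, with one scalar local observation per node; the noise variance and the random observation vector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior: the hypothetical 5-node cycle used above (d = 1, illustrative weights).
N = 5
J = 2.0 * np.eye(N)
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    J[s, t] = J[t, s] = -0.6

# Local observations y = Cx + v with v ~ N(0, R); C and R block (here scalar) diagonal.
C = np.eye(N)
R = 0.5 * np.eye(N)                        # illustrative noise variances
y = rng.normal(size=N)                     # stand-in observation vector

J_hat = J + C.T @ np.linalg.inv(R) @ C     # inverse error covariance, eq. (6)
y_bar = C.T @ np.linalg.inv(R) @ y         # normalized observations
x_hat = np.linalg.solve(J_hat, y_bar)      # conditional mean, eq. (5)
P_hat_s = np.diag(np.linalg.inv(J_hat))    # the N desired marginal error variances
print(x_hat, P_hat_s)
```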

For more general graphical models, recently developed connections to the statistical physics literature have led to a deeper understanding of the approximations underlying loopy BP [15, 24, 25]. Recently, two independent extensions of the loopy BP algorithm have been proposed which allow exact computation of error variances, albeit with greater computational cost. Welling and Teh [26] have proposed propagation rules for computing linear response estimates of the joint probability of all pairs of nodes. For Gaussian models, their method iteratively computes the full error covariance matrix at a cost of O(N|E|) operations per iteration. However, for connected graphs |E| is at least O(N), and the resulting O(N²) cost is undesirable when only the N marginal variances are needed. Plarre and Kumar [27] have proposed a different extended message passing algorithm which exploits the correspondence between recursive inference and Gaussian elimination [12]. We discuss their method in more detail in Section V.

B. Linear Algebraic Inference Algorithms

As shown by equation (5), the conditional mean x̂ of a Gaussian inference problem can be viewed as the solution of a linear system of equations. Thus, any algorithm for solving sparse, positive definite linear systems may be used to calculate such estimates. Letting Ĵ ≜ P̂⁻¹ denote the inverse error covariance matrix and ȳ ≜ CᵀR⁻¹y the normalized observation vector, equation (5) may be rewritten as

    Ĵ x̂ = ȳ                                                             (7)

A wide range of iterative algorithms for solving linear systems may be derived using a matrix splitting Ĵ = M - K. Clearly, any solution x̂ of equation (7) is also a solution to

    (Ĵ + K) x̂ = K x̂ + ȳ                                                 (8)

Assuming M = Ĵ + K is invertible, equation (8) naturally suggests the generation of a sequence of iterates {x̂ⁿ}_{n=1}^∞ according to the recursion

    x̂ⁿ = M⁻¹ (K x̂ⁿ⁻¹ + ȳ)                                               (9)

The matrix M is known as a preconditioner, and equation (9) is an example of a preconditioned Richardson iteration [28, 29]. Many classic algorithms, including the Gauss-Jacobi and successive overrelaxation methods, are Richardson iterations generated by specific matrix splittings. The convergence of the Richardson iteration (9) is determined by the eigenvalues {λ_i(M⁻¹K)} of the matrix M⁻¹K. Letting ρ(M⁻¹K) ≜ max_i |λ_i(M⁻¹K)| denote the spectral radius, x̂ⁿ will converge to x̂, for arbitrary x̂⁰, if and only if ρ(M⁻¹K) < 1. The asymptotic convergence rate is

    ρ(M⁻¹K) = ρ(I - M⁻¹Ĵ)                                                (10)
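The preconditioned Richardson iteration of equation (9) is simple to state in code. The sketch below is a generic, illustrative implementation in which the preconditioner solve is an ordinary dense solve against M = Ĵ + K (the following sections replace this step with an exact tree solver); the spectral radius check of equation (10) is included as a guard.

```python
import numpy as np

def richardson(J_hat, K, y_bar, n_iter=100, x0=None):
    """Preconditioned Richardson iteration x^n = M^{-1}(K x^{n-1} + y_bar),
    with M = J_hat + K (eq. 9). Dense solves stand in for a structured solver."""
    M = J_hat + K
    rho = np.max(np.abs(np.linalg.eigvals(np.linalg.solve(M, K))))
    assert rho < 1, "iteration diverges unless rho(M^{-1} K) < 1 (eq. 10)"
    x = np.zeros_like(y_bar) if x0 is None else x0
    for _ in range(n_iter):
        x = np.linalg.solve(M, K @ x + y_bar)
    return x
```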

If M is chosen so that M⁻¹Ĵ ≈ I, the Richardson iteration will converge after a small number of iterations. However, at each iteration it is necessary to multiply K x̂ⁿ⁻¹ by M⁻¹, or equivalently to solve a linear system of the form M x = b. The challenge, then, is to determine a preconditioner M which well approximates the original system Ĵ, but whose solution is much simpler.

Although Richardson iterations are often quite effective, more sophisticated algorithms have been proposed. For positive definite systems, the conjugate gradient (CG) iteration is typically the method of choice [28, 30]. Each iteration of the CG algorithm chooses x̂ⁿ to minimize the weighted error metric ‖Ĵ x̂ⁿ - ȳ‖_{Ĵ⁻¹} over subspaces of increasing dimension. This minimization can be performed in O(Nd²) operations per iteration, requiring only a few matrix-vector products involving the matrix Ĵ and some inner products [28]. CG is guaranteed to converge (with exact arithmetic) in at most Nd iterations. However, as with Richardson iterations, any symmetric preconditioning matrix may be used to modify the spectral properties of Ĵ, thereby accelerating convergence.

The standard CG algorithm, like Richardson iterations, does not explicitly calculate any entries of Ĵ⁻¹ = P̂, and thus does not directly provide error variance information. In principle, it is possible to extend CG to iteratively compute error variances along with estimates [31]. However, these error variance formulas are very sensitive to finite precision effects, and for large or poorly conditioned problems they typically produce highly inaccurate results (see the simulations in Section VI).

With any iterative method, it is necessary to determine when to stop the iteration, i.e. to decide that x̂ⁿ is a sufficiently close approximation to Ĵ⁻¹ȳ. For all of the simulations in this paper, we follow the standard practice [30] of iterating until the residual rⁿ = ȳ - Ĵ x̂ⁿ satisfies

    ‖rⁿ‖₂ / ‖ȳ‖₂ ≤ ε                                                     (11)

where ε is a tolerance parameter. The final error is then upper bounded as ‖Ĵ⁻¹ȳ - x̂ⁿ‖₂ ≤ (ε / λ_min(Ĵ)) ‖ȳ‖₂.

III. CALCULATING CONDITIONAL MEANS USING EMBEDDED TREES

In this section, we develop the class of embedded trees (ET) algorithms for finding the conditional mean of Gaussian inference problems defined on graphs with cycles. Complementary ET algorithms for the calculation of error variances are discussed in Section IV. The method we introduce explicitly exploits graphical structure inherent in the problem to form a series of tractable approximations to the full model. For a given graphical model, we do not define a single iteration, but a family of non-stationary generalizations of the Richardson iteration introduced in Section II-B. Our theoretical and empirical results establish that this non-stationarity can substantially improve the ET algorithm's performance.

A. Graph Structure and Embedded Trees

As discussed in Section II-A, inference problems defined on tree-structured graphs may be efficiently solved by direct, recursive algorithms. Each iteration of the ET algorithm exploits this fact to perform exact computations on a tree embedded in the original graph. For a graph G = (V, E), an embedded tree G_T = (V, E_T) is defined to be a subgraph (E_T ⊂ E) which has no cycles. We use the term tree to include both spanning trees in which G_T is connected, as well as disconnected forests of trees. As Figure 4 illustrates, there are typically a large number of trees embedded within graphs with cycles. More generally, the ET algorithm can exploit any embedded subgraph for which exact inference is tractable. We provide an example of this generality in Section VI-C. For clarity, however, we frame our development in the context of tree-structured subgraphs.

For Gaussian graphical models, embedded trees are closely connected to the structural properties of the inverse covariance matrix. Consider a Gaussian process x ~ N⁻¹(0, J) which is Markov with respect to an undirected graph G = (V, E). By Theorem 1, for any s, t ∈ V such that s ≠ t, J_{s,t} will be nonzero if and only if (s, t) ∈ E. Thus, modifications of the edge set E are precisely equivalent to changes in the locations of the nonzero off-diagonal entries of J. In particular, consider a modified stochastic process x ~ N⁻¹(0, J_T) which is Markov with respect to an embedded tree G_T = (V, E_T). For any tree-structured inverse covariance J_T, there exists a symmetric matrix K_T such that

    J_T = J + K_T                                                        (12)

Because it acts to remove edges from the graph, K_T is called a cutting matrix. As Figure 4 illustrates, different cutting matrices may produce different trees embedded within the original graph. Note that the cutting matrix K_T also defines a matrix splitting J = J_T - K_T, as introduced in Section II-B.

Certain elements of the cutting matrix, such as the off-diagonal blocks corresponding to discarded edges, are uniquely defined by the choice of embedded tree G_T. However, other entries of K_T, such as the diagonal elements, are not constrained by graph structure. Consequently, there exist many different cutting matrices K_T, and associated inverse covariances J_T, corresponding to a given tree G_T. In later sections, it will be useful to define a restricted class of regular cutting matrices:

Definition 1. For a regular cutting matrix K_T corresponding to an embedded tree G_T, all off-diagonal entries not corresponding to cut edges must be zero. In addition, the block diagonal entries for nodes from which no edge is cut must be zero.

Thus, for a given G_T, elements of the corresponding family of regular cutting matrices may differ only in the block diagonal entries corresponding to nodes involved in at least one cut edge.
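The sketch below builds a regular cutting matrix for the illustrative 5-node cycle used earlier: removing the single edge (4, 0) zeroes the corresponding off-diagonal entries of J, leaves the diagonal untouched (the zero-diagonal convention), and yields a tree-structured J_T = J + K_T. The edge choice and model parameters are illustrative assumptions.

```python
import numpy as np

def regular_cutting_matrix(J, cut_edges, diag=0.0):
    """Regular cutting matrix K_T removing the listed scalar (d = 1) edges, so that
    J_T = J + K_T is Markov on the remaining (tree) edges. Illustrative sketch."""
    K = np.zeros_like(J)
    for s, t in cut_edges:
        K[s, t] = -J[s, t]          # cancel the off-diagonal entry for edge (s, t)
        K[t, s] = -J[t, s]
        K[s, s] += diag             # optional diagonal perturbation at the cut nodes
        K[t, t] += diag
    return K

# 5-node cycle prior (illustrative weights); cutting edge (4, 0) reveals a chain.
N = 5
J = 2.0 * np.eye(N)
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    J[s, t] = J[t, s] = -0.6
K_T = regular_cutting_matrix(J, [(4, 0)])
J_T = J + K_T                       # tree-structured (chain) inverse covariance
print("edge (4, 0) removed:", J_T[4, 0] == 0 and J_T[0, 4] == 0)
```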

As discussed in Section I, many potentially interesting classes of graphical models are nearly tree-structured. For such models, it is possible to reveal an embedded tree by removing a small (relative to N) number of edges. The number of discarded or cut edges is |E \ E_T|. Clearly, any regular cutting matrix K_T removing |E \ E_T| edges may have nonzero entries in at most 2|E \ E_T| d columns, implying that rank(K_T) is at most O(|E \ E_T| d). Thus, for sparsely connected graphical models where the number of cut edges is much smaller than N, cutting matrices may always be chosen to have low rank. This fact is exploited in later sections of this paper.

B. Tree-Based Stationary and Non-Stationary Richardson Iterations

Consider the graphical inference problem introduced in Section II-A. As before, let Ĵ ≜ P̂⁻¹ denote the inverse error covariance matrix, and ȳ ≜ CᵀR⁻¹y the normalized observation vector. As discussed in the previous section, any embedded tree G_T, and associated cutting matrix K_T, defines a matrix splitting Ĵ = Ĵ_T - K_T. The standard Richardson iteration (9) for this splitting is given by

    x̂ⁿ = Ĵ_T⁻¹ (K_T x̂ⁿ⁻¹ + ȳ)                                           (13)

By comparison to equation (7), we see that, assuming Ĵ_T is positive definite, this iteration corresponds to a tree-structured Gaussian inference problem, with a set of perturbed observations given by (K_T x̂ⁿ⁻¹ + ȳ). Thus, each iteration can be efficiently computed. More generally, it is sufficient to choose K_T so that Ĵ_T is invertible. Although the probabilistic interpretation of equation (13) is less clear in this case, standard tree-based inference recursions will still correctly solve this linear system. In Section III-D, we discuss conditions that guarantee invertibility of Ĵ_T.

Because there are many embedded trees within any graph with cycles, there is no reason that the iteration of equation (13) must use the same matrix splitting at every iteration. Let {G_{T_n}}_{n=1}^∞ be a sequence of trees embedded within G, and {K_{T_n}}_{n=1}^∞ a corresponding sequence of cutting matrices such that Ĵ_{T_n} = (Ĵ + K_{T_n}) is Markov with respect to G_{T_n}. Then, from some initial guess x̂⁰, we may generate a sequence of iterates {x̂ⁿ}_{n=1}^∞ using the recursion

    x̂ⁿ = Ĵ_{T_n}⁻¹ (K_{T_n} x̂ⁿ⁻¹ + ȳ)                                   (14)

We refer to this non-stationary generalization of the standard Richardson iteration as the embedded trees (ET) algorithm [1, 2]. The cost of computing x̂ⁿ from x̂ⁿ⁻¹ is O(Nd³ + |E \ E_{T_n}| d²). Typically the number of cut edges is at most O(N), so the overall cost of each iteration is O(Nd³).

Consider the evolution of the error eⁿ ≜ (x̂ⁿ - x̂) between the estimate x̂ⁿ at the n-th iteration and the solution x̂ of the original inference problem. Combining equations (7) and (14), we have

    eⁿ = Ĵ_{T_n}⁻¹ K_{T_n} eⁿ⁻¹                                          (15)
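A minimal sketch of the ET recursion (14) for the small cycle example follows; it alternates between the two spanning trees obtained by cutting edges (4, 0) and (2, 3). A dense solve against Ĵ_{T_n} stands in for the O(Nd³) tree-structured BP solver, and all model parameters and observations are illustrative.

```python
import numpy as np

# Loopy model: 5-node cycle prior plus local unit-variance observations (illustrative).
N = 5
J = 2.0 * np.eye(N)
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    J[s, t] = J[t, s] = -0.6
J_hat = J + np.eye(N)                           # eq. (6) with C = I, R = I
y_bar = np.array([1.0, -0.5, 0.3, 0.8, -1.2])   # stand-in normalized observations

def cutting_matrix(J_hat, s, t):                # zero-diagonal regular cutting matrix
    K = np.zeros_like(J_hat)
    K[s, t], K[t, s] = -J_hat[s, t], -J_hat[t, s]
    return K

cuts = [cutting_matrix(J_hat, 4, 0), cutting_matrix(J_hat, 2, 3)]
x = np.zeros(N)
for n in range(50):                             # ET iteration, eq. (14)
    K_T = cuts[n % 2]
    x = np.linalg.solve(J_hat + K_T, K_T @ x + y_bar)   # tree solve stand-in

print(np.allclose(x, np.linalg.solve(J_hat, y_bar)))    # converges to eq. (7)
```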

From the invertibility of Ĵ and Ĵ_{T_n}, it follows immediately that x̂, the conditional mean of the original inference problem (7), is the unique fixed point of the ET recursion. One natural implementation of the ET algorithm cycles through a fixed set of T embedded trees {G_{T_n}}_{n=1}^T in a periodic order, so that

    G_{T_{n+kT}} = G_{T_n},   k ∈ Z⁺                                     (16)

In this case, eⁿ evolves according to a linear periodically varying system whose convergence can be analyzed as follows:

Proposition 1. Suppose the ET mean recursion (14) is implemented by periodically cycling through T embedded trees, as in equation (16). Then the error eⁿ ≜ x̂ⁿ - x̂ evolves according to

    e^{n+T} = ( ∏_{j=1}^{T} Ĵ_{T_j}⁻¹ K_{T_j} ) eⁿ ≜ R eⁿ                (17)

If ρ(R) < 1, then for arbitrary x̂⁰, eⁿ → 0 as n → ∞ at an asymptotic rate of at most γ ≜ ρ(R)^{1/T}.

Thus, the convergence rate of the ET algorithm may be optimized by choosing the cutting matrices K_{T_n} such that ρ(R) is as small as possible. As discussed earlier, when the ET iteration uses the same cutting matrix K_T at every iteration (as in equation (13)), it is equivalent to a stationary preconditioned Richardson iteration. The following proposition shows that when the recursion is implemented by periodically cycling through T > 1 cutting matrices, we may still recover a stationary Richardson iteration by considering every T-th iterate.

Proposition 2. Suppose that the ET recursion (14) is implemented by periodically cycling through T cutting matrices {K_{T_n}}_{n=1}^T. Consider the subsampled sequence of estimates {x̂^{nT}}_{n=0}^∞ produced at every T-th iteration. The ET procedure generating these iterates is equivalent to a preconditioned Richardson iteration

    x̂^{nT} = (I - M_T⁻¹ Ĵ) x̂^{(n-1)T} + M_T⁻¹ CᵀR⁻¹y                    (18)

where the preconditioner M_T⁻¹ is defined according to the recursion

    M_n⁻¹ = (Ĵ + K_{T_n})⁻¹ K_{T_n} M_{n-1}⁻¹ + (Ĵ + K_{T_n})⁻¹          (19)

with initial condition M_1⁻¹ = (Ĵ + K_{T_1})⁻¹.

Proof. This result follows from an induction argument; see [2, Theorem 3.3] for details.

Several classic Richardson iterations can be seen as special cases of the ET algorithm. For example, if the cutting matrix removes every edge, producing a disconnected forest of single-node trees, the result is the well-known Gauss-Jacobi algorithm [28, 29].

For nearest-neighbor grids (as in Figure 4), one possible ET iteration alternates between two cutting matrices, the first removing all vertical edges and the second all horizontal ones. It can be shown that this iteration is equivalent to the Alternating Direction Implicit (ADI) method [29, 30, 32]. For more details on these and other connections, see [2, Sec. ].

C. Motivating Example: Single Cycle Inference

In this section, we explore the ET algorithm's behavior on a simple graph in order to motivate later theoretical results; more extensive simulations are presented in Section VI. Our results provide a dramatic demonstration of the importance of allowing for non-stationary Richardson iterations. Consider a 20-node single-cycle graph with randomly chosen inhomogeneous potentials. Although in practice single-cycle problems can be solved easily by direct methods [33], they provide a useful starting point for understanding iterative algorithms.

The cutting matrix K_T must only remove a single edge to reveal a spanning tree of a single-cycle graph. In this section, we consider only regular cutting matrices, for which all off-diagonal entries are zero except for the pair required to remove the chosen edge. However, we may freely choose the two diagonal entries (K_T)_{s,s}, (K_T)_{t,t} corresponding to the nodes from which the single edge (s, t) is cut. We consider three different possibilities, corresponding to positive semidefinite ((K_T)_{s,s} = (K_T)_{t,t} = |(K_T)_{s,t}|), zero diagonal ((K_T)_{s,s} = (K_T)_{t,t} = 0), and negative semidefinite ((K_T)_{s,s} = (K_T)_{t,t} = -|(K_T)_{s,t}|) cutting matrices.

When the ET iteration is implemented with a single cutting matrix K_T, Proposition 1 shows that the convergence rate is given by γ = ρ(Ĵ_T⁻¹ K_T). Figure 5(a) plots γ as a function of the magnitude |Ĵ_{s,t}| of the off-diagonal entry corresponding to the cut edge. Intuitively, convergence is fastest when the magnitude of the cut edge is small. For this example, zero-diagonal cutting matrices always lead to the fastest convergence rates.

When the ET algorithm is implemented by periodically cycling between two cutting matrices K_{T_1}, K_{T_2}, Proposition 1 shows that the convergence rate is given by γ = ρ(Ĵ_{T_2}⁻¹ K_{T_2} Ĵ_{T_1}⁻¹ K_{T_1})^{1/2}. Figures 5(b) and 5(c) plot these convergence rates when K_{T_1} is chosen to cut the cycle's weakest edge, and K_{T_2} is varied over all other edges. When plotted against the magnitude of the second edge cut, as in Figure 5(b), the γ values display little structure. Figure 5(c) shows these same γ values plotted against an index number showing the ordering of edges in the cycle. Edge 7 is the weak edge removed by K_{T_1}. Notice that for the zero-diagonal case, cutting the same edge at every iteration is the worst possible choice, despite the fact that every other edge in the graph is stronger and leads to slower single-tree iterations. The best performance is obtained by choosing the second cut edge to be as far from the first edge as possible.

In Figures 5(d,e), we examine the convergence behavior of the zero-diagonal two-tree iteration corresponding to edges 7 and 19 in more detail.

For both figures, the error at each iteration is measured using the normalized residual introduced in equation (11). Figure 5(d) shows that even though the single-tree iteration generated by edge 19 converges rather slowly relative to the edge 7 iteration, the composite iteration is orders of magnitude faster than either single-tree iteration. In Figure 5(e), we compare the performance of the parallel belief propagation (BP) and unpreconditioned conjugate gradient (CG) iterations, showing that for this problem the ET algorithm is much faster.

The previous plots suggest that a two-tree ET iteration may exhibit features quite different from those observed in the corresponding single-tree iterations. This is dramatically demonstrated by Figure 5(f), which considers the negative semidefinite cutting matrices corresponding to the two strongest edges in the graph. As predicted by Figure 5(a), the single-tree iterations corresponding to these edges are divergent. However, because these strong edges are widely separated in the original graph (indexes 1 and 12), they lead to a two-tree iteration which outperforms even the best single-tree iteration.

D. Convergence Criteria

When the ET mean recursion periodically cycles through a fixed set of cutting matrices, Proposition 1 shows that its convergence depends entirely on the eigenstructure of R = ∏_{n=1}^{T} Ĵ_{T_n}⁻¹ K_{T_n}. However, this matrix is never explicitly calculated by the ET recursion, and for large-scale problems direct determination of its eigenvalues is intractable. In this section we derive several conditions which allow a more tractable assessment of the ET algorithm's convergence.

Consider first the case where the ET algorithm uses the same cutting matrix K_T at every iteration. We then have a standard Richardson iteration, for which the following theorem, proved by Adams [34], provides a simple necessary and sufficient convergence condition:

Theorem 2. Let Ĵ be a symmetric positive definite matrix, and K_T be a symmetric cutting matrix such that (Ĵ + K_T) is nonsingular. Then

    ρ((Ĵ + K_T)⁻¹ K_T) < 1   if and only if   Ĵ + 2K_T ≻ 0

The following two corollaries provide simple procedures for satisfying the conditions of Theorem 2:

Corollary 1. If the embedded trees algorithm is implemented with a single positive semidefinite cutting matrix K_T, the resulting iteration will be convergent for any positive definite inverse error covariance Ĵ.

Proof. If Ĵ is positive definite and K_T is positive semidefinite, then (Ĵ + 2K_T) is positive definite.

Corollary 2. Suppose that Ĵ is diagonally dominant, so that

    Ĵ_{s,s} > Σ_{t∈Γ(s)} |Ĵ_{s,t}|                                       (20)

for all s ∈ V. Then any regular cutting matrix with non-negative diagonal entries will produce a convergent embedded trees iteration.

Proof. Regular cutting matrices only modify the off-diagonal entries of Ĵ by setting certain elements to zero. Therefore, the entries set to zero in (Ĵ + K_T) will simply have their signs flipped in (Ĵ + 2K_T), leaving the summation in equation (20) unchanged. Then, by the assumption that K_T has non-negative diagonal entries, we are assured that (Ĵ + 2K_T) is diagonally dominant, and hence positive definite.

Note that Corollary 2 ensures that if Ĵ is diagonally dominant, the zero-diagonal cutting matrices of Section III-C will produce convergent ET iterations.

Although Theorem 2 completely characterizes the conditions under which the single-tree ET iteration converges, it says nothing about the resulting convergence rate. The following theorem allows the convergence rate to be bounded in certain circumstances.

Theorem 3. When implemented with a single positive semidefinite cutting matrix K_T, the convergence rate of the embedded trees algorithm is bounded by

    λ_max(K_T) / (λ_max(K_T) + λ_max(Ĵ))  ≤  ρ(Ĵ_T⁻¹ K_T)  ≤  λ_max(K_T) / (λ_max(K_T) + λ_min(Ĵ))

Proof. This follows from [35, Theorem 2.2]; see [2, Theorem 3.14] for details.

Increasing the diagonal entries of K_T will also increase the upper bound on ρ(Ĵ_T⁻¹ K_T) provided by Theorem 3. This matches the observation made in Section III-C that positive semidefinite cutting matrices tend to produce slower convergence rates.

When the ET algorithm employs multiple trees, obtaining a precise characterization of its convergence behavior is more difficult. The following theorem provides a simple set of sufficient conditions for two-tree iterations:

Theorem 4. Consider the embedded trees iteration generated by a pair of cutting matrices {K_{T_1}, K_{T_2}}. Suppose that the following three matrices are positive definite:

    Ĵ + K_{T_1} + K_{T_2} ≻ 0      Ĵ + K_{T_1} - K_{T_2} ≻ 0      Ĵ - K_{T_1} + K_{T_2} ≻ 0

Then the resulting iteration is convergent (ρ(R) < 1).

Proof. See [2, Appendix C.4].
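The convergence conditions of Theorem 2 and Theorem 4 are easy to check numerically for small problems. The sketch below does so for the illustrative 5-node cycle, using zero-diagonal cutting matrices for edges (4, 0) and (2, 3); all matrices and edge choices are assumptions made for the example.

```python
import numpy as np

def is_pd(A):                       # positive definiteness via smallest eigenvalue
    return np.min(np.linalg.eigvalsh(A)) > 0

# 5-node cycle with unit-variance local observations (illustrative parameters).
N = 5
J_hat = 3.0 * np.eye(N)
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    J_hat[s, t] = J_hat[t, s] = -0.6

def zero_diag_cut(J_hat, s, t):
    K = np.zeros_like(J_hat)
    K[s, t] = K[t, s] = -J_hat[s, t]
    return K

K1, K2 = zero_diag_cut(J_hat, 4, 0), zero_diag_cut(J_hat, 2, 3)

# Theorem 2: single-tree iteration converges iff J_hat + 2 K_T is positive definite.
print("single-tree convergent:", is_pd(J_hat + 2 * K1))

# Theorem 4: sufficient conditions for the two-tree iteration.
print("two-tree sufficient conditions:",
      is_pd(J_hat + K1 + K2), is_pd(J_hat + K1 - K2), is_pd(J_hat - K1 + K2))

# Compare with the actual spectral radius rho(R) from Proposition 1.
R = np.linalg.solve(J_hat + K2, K2) @ np.linalg.solve(J_hat + K1, K1)
print("rho(R) =", np.max(np.abs(np.linalg.eigvals(R))))
```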

The conditions of this theorem show that in the multiple-tree case, there may be important interactions between the cutting matrices which affect the convergence of the composite iteration. These interactions were demonstrated by the single-cycle inference results of Section III-C, and are further explored in Section VI. Note that it is not necessary for the cutting matrices to be individually convergent, as characterized by Theorem 2, in order for the conditions of Theorem 4 to be satisfied. When more than two trees are used, a singular value analysis may be used to provide sufficient conditions for convergence; see [2, Sec. ] for details.

IV. EXACT ERROR VARIANCE CALCULATION

The embedded trees (ET) algorithm introduced in the preceding section used a series of exact computations on spanning trees to calculate the conditional mean x̂ of a loopy Gaussian inference problem. In this section, we examine the complementary problem of determining the marginal error variances⁵ {P̂_s | s ∈ V}. We develop a class of iterative algorithms for calculating error variances which are particularly efficient for very sparsely connected loopy graphs, like that of Section I-A.

⁵ When nodes represent Gaussian vectors (d > 1), P̂_s is actually a d-dimensional covariance matrix. However, to avoid confusion with the full error covariance matrix P̂, we always refer to {P̂_s | s ∈ V} as error variances.

Due to the linear algebraic structure underlying Gaussian inference problems, any procedure for calculating x̂ may be easily adapted to the calculation of error variances. In particular, suppose that the full error covariance matrix P̂ is partitioned into columns {p̂_i}_{i=1}^{Nd}. Then, letting e_i be an Nd-dimensional vector of zeros with a one in the i-th position, the i-th column of P̂ is equal to P̂ e_i, or equivalently

    Ĵ p̂_i = e_i                                                          (21)

By comparison to equation (7), we see that p̂_i is equal to the conditional mean of a particular inference problem defined by the synthetic observation vector e_i. Thus, given an inference procedure like the ET algorithm which calculates conditional means at a cost of O(Nd³) per iteration, we may calculate a series of approximations to P̂ using O(N²d⁴) operations per iteration.

While the procedure described in the previous paragraph is theoretically sound, the computational cost may be too large for many applications. We would prefer an algorithm which only calculates the N desired marginal error variances P̂_s, avoiding the O(N²) cost which any algorithm calculating all of P̂ must require. Consider the ET iteration generated by a single cutting matrix K_T chosen so that ρ(Ĵ_T⁻¹ K_T) < 1. When x̂⁰ = 0, a simple induction argument shows that subsequent iterates x̂ⁿ may be expressed as a series of linear functions of the normalized observation vector ȳ:

    x̂ⁿ = [Ĵ_T⁻¹ + F_n] ȳ                                                (22)

    F_n = Ĵ_T⁻¹ K_T [Ĵ_T⁻¹ + F_{n-1}],    F_1 = 0                        (23)

Since we have assumed that ρ(Ĵ_T⁻¹ K_T) < 1, Proposition 1 guarantees that x̂ⁿ → x̂ as n → ∞ for any ȳ. Therefore, the sequence of matrices defined by equations (22, 23) must converge to P̂:

    P̂ = lim_{n→∞} [Ĵ_T⁻¹ + F_n] = Σ_{n=0}^∞ (Ĵ_T⁻¹ K_T)ⁿ Ĵ_T⁻¹          (24)

In fact, these matrices correspond exactly to the series expansion of P̂ generated by the following fixed-point equation:

    P̂ = Ĵ_T⁻¹ + Ĵ_T⁻¹ K_T P̂                                             (25)

The matrix sequence of equation (24) may be derived by repeatedly using equation (25) to expand itself. Clearly, if we could somehow track the matrices composing the ET series expansion (24), we could recover the desired error covariance matrix P̂. In order to perform this tracking efficiently, we must exploit the fact that, as discussed in Section III, for many models the cutting matrix K_T is low rank. This allows us to focus our computation on a particular low-dimensional subspace, leading to computational gains when rank(K_T) < N. We begin in Section IV-A by presenting techniques for explicitly constructing rank-revealing decompositions of cutting matrices. Then, in Section IV-B we use these decompositions to calculate the desired error variances.

A. Low Rank Decompositions of Cutting Matrices

For any cutting matrix K_T, there exist many different additive decompositions into rank-one terms:

    K_T = Σ_i ω_i u_i u_iᵀ,    u_i ∈ R^{Nd}                              (26)

For example, because any symmetric matrix has an orthogonal set of eigenvectors, the eigendecomposition K_T = UDUᵀ is of this form. However, for large graphical models the O(N³d³) cost of direct eigenanalysis algorithms is intractable. In this section, we provide an efficient decomposition procedure which exploits the structure of K_T to reduce the number of needed terms. Consider a regular cutting matrix K_T that acts to remove the cut edges E \ E_T from the graph G = (V, E). The decomposition we construct depends on a set of key nodes defined as follows:

Definition 2. Consider a graph G = (V, E) with associated embedded tree G_T = (V, E_T). A subset W ⊂ V of the vertices forms a key node set if for any cut edge (s, t) ∈ E \ E_T, either s or t (or both) belong to W.

In other words, at least one end of every cut edge must be a key node. In most graphs, there will be many ways to choose W (see Figure 6). In such cases, the size of the resulting decomposition is minimized by minimizing the number of key nodes W ≜ |W|. Note that W may always be chosen so that W ≤ |E \ E_T|. Given a set of W key nodes of dimension d, we decompose K_T as

    K_T = Σ_{w∈W} H_w                                                    (27)

where H_w is chosen so that its only nonzero entries are in the d rows and columns corresponding to w. Because every cut edge adjoins a key node, this is always possible. The following proposition shows how each H_w may be further decomposed into rank-one terms:

Proposition 3. Let H be a symmetric matrix whose only nonzero entries lie in a set of d rows, and the corresponding columns. Then the rank of H is at most 2d.

Proof. See Appendix.

Thus, the rank of a cutting matrix with W corresponding key nodes can be at most 2Wd. The proof of Proposition 3 is constructive, providing an explicit procedure for determining a rank-one decomposition as in equation (26). The cost of this construction is at most O(NWd³) operations. However, in the common case where the size of each node's local neighborhood does not grow with N, this reduces to O(Wd³). Note that because at most one key node is needed for each cut edge, Proposition 3 implies that a cutting matrix K_T which cuts E edges may always be decomposed into at most 2Ed terms. When d = 1, the number of terms may be reduced to only E by appropriately choosing the diagonal entries of K_T (see [2, Sec. ] for details), a fact we exploit in the simulations of Section VI-B.

B. Fixed-Point Error Variance Iteration

In this section, we provide an algorithm for tracking the terms of the ET series expansion (24). Since the block diagonal entries of Ĵ_T⁻¹ may be determined in O(Nd³) operations by an exact recursive algorithm, one obvious solution would be to directly track the evolution of the F_n matrices. This possibility is explored in [2, Sec. ], where it is shown that low-rank decompositions of F_n may be recursively updated in O(NW²d⁵) operations per iteration. Here, however, we focus on a more efficient approach based upon a specially chosen set of synthetic inference problems. We begin by considering the fixed-point equation (25) characterizing the single-tree ET series expansion.

Given a tree-structured matrix splitting Ĵ = Ĵ_T - K_T, where the cutting matrix K_T has W key nodes:

1) Compute the block diagonal entries of Ĵ_T⁻¹ using the BP algorithm (O(Nd³) operations).
2) Determine a rank-one decomposition of K_T into O(Wd) vectors u_i, as in equation (26).
3) For each of the O(Wd) vectors u_i:
   a) Compute Ĵ_T⁻¹ u_i using the BP algorithm (O(Nd³) operations).
   b) Compute P̂ u_i using any convenient conditional mean estimation algorithm (typically O(Nd³) operations per iteration).
4) Using the results of the previous steps, calculate the block diagonal entries of P̂ as in equation (28) (O(NWd²) operations).

Alg. 1. Embedded trees fixed-point algorithm for computing error variances. The overall computational cost is O(NWd⁴) operations for each conditional mean iteration in step 3(b).

Using the low-rank decomposition of K_T given by equation (26), we have

    P̂ = Ĵ_T⁻¹ + Ĵ_T⁻¹ ( Σ_i ω_i u_i u_iᵀ ) P̂ = Ĵ_T⁻¹ + Σ_i ω_i (Ĵ_T⁻¹ u_i)(P̂ u_i)ᵀ    (28)

As discussed in Section IV-A, the decomposition of K_T may always be chosen to have at most O(Wd) vectors u_i. Each of the terms in equation (28) has a probabilistic interpretation. For example, Ĵ_T⁻¹ is the covariance matrix of a tree-structured graphical model. Similarly, Ĵ_T⁻¹ u_i and P̂ u_i are the conditional means of synthetic inference problems with observations u_i. Algorithm 1 shows how all of the terms in equation (28) may be efficiently computed using the inference algorithms developed earlier in this paper. The computational cost is dominated by the solution of the synthetic estimation problems on the graph with cycles (step 3(b)). Any inference algorithm can be used in this step, including the ET algorithm (using one or multiple trees), loopy BP, or conjugate gradient. Thus, this fixed-point algorithm can effectively transform any method for computing conditional means into a fast error variance iteration.

Although equation (28) was motivated by the ET algorithm, nothing about this equation requires that the cutting matrix produce a tree-structured graph. All that is necessary is that the remaining edges form a structure for which exact error variance calculation is tractable. In addition, note that equation (28) gives the correct perturbation of Ĵ_T⁻¹ for calculating the entire error covariance P̂, not just the block diagonals P̂_s. Thus, if the BP algorithm is extended to calculate a particular set of off-diagonal elements of Ĵ_T⁻¹, the exact values of the corresponding entries of P̂ may be found in the same manner.
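The sketch below follows the structure of Algorithm 1 for the small scalar (d = 1) cycle example, using dense solves in place of the tree-structured BP recursions and of the conditional mean iteration in step 3(b), and an eigendecomposition for the rank-one decomposition in step 2; all parameters are illustrative assumptions, so this is a structural sketch rather than the paper's implementation.

```python
import numpy as np

# Scalar (d = 1) 5-node cycle with local observations; illustrative parameters.
N = 5
J_hat = 3.0 * np.eye(N)
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    J_hat[s, t] = J_hat[t, s] = -0.6

# Cut edge (4, 0): zero-diagonal regular cutting matrix, J_hat_T = J_hat + K_T.
K_T = np.zeros((N, N))
K_T[4, 0] = K_T[0, 4] = -J_hat[4, 0]
J_hat_T = J_hat + K_T

# Step 2: rank-one decomposition K_T = sum_i w_i u_i u_i^T (here via eigendecomposition).
w, U = np.linalg.eigh(K_T)
keep = np.abs(w) > 1e-12                                # only the few nonzero terms
w, U = w[keep], U[:, keep]

# Steps 1, 3: tree solves (dense solves stand in for BP) and synthetic-problem solves.
tree_diag = np.diag(np.linalg.inv(J_hat_T))             # step 1
Jt_u = np.linalg.solve(J_hat_T, U)                      # step 3(a)
P_u = np.linalg.solve(J_hat, U)                         # step 3(b), solved exactly here

# Step 4: marginal error variances from the diagonal of eq. (28).
P_s = tree_diag + np.sum(w * Jt_u * P_u, axis=1)
print(np.allclose(P_s, np.diag(np.linalg.inv(J_hat))))  # matches direct inversion
```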

V. TREE-STRUCTURED PRECONDITIONERS

Preconditioners play an important role in accelerating the convergence of the conjugate gradient (CG) method. In Section III-B, we demonstrated that whenever the ET algorithm is implemented by periodically cycling through a fixed set of T cutting matrices, it is equivalent to a preconditioned Richardson iteration. An explicit formula for the implicitly generated preconditioning matrix is given in Proposition 2. By simply applying the standard ET iteration in equation (14) once for each of the T cutting matrices, we can compute the product of the ET preconditioning matrix with any vector in O(TNd³) operations. This is the only operation needed to use the ET iteration as a preconditioner for CG [28, 30]. In this section, we explore the theoretical properties of embedded tree-based preconditioners (see Section VI for simulation results).

For a preconditioning matrix to be used with CG, it must be symmetric. While any single-tree ET iteration leads to a symmetric preconditioner, the preconditioners associated with multiple-tree iterations are in general nonsymmetric. It is possible to choose multiple-tree cutting matrices so that symmetry is guaranteed. However, in dramatic contrast to the tree-based Richardson iterations of Section III, multiple-tree preconditioners typically perform slightly worse than their single-tree counterparts [2, Sec. 4.3]. Thus, in this paper we focus our attention on single-tree preconditioners. If a single spanning tree, generated by cutting matrix K_T, is used as a preconditioner, the system which CG effectively solves is given by

    (Ĵ + K_T)⁻¹ Ĵ x̂ = (Ĵ + K_T)⁻¹ ȳ                                      (29)

The convergence rate of CG is determined by how well the eigenspectrum of the preconditioned system (Ĵ + K_T)⁻¹ Ĵ can be fit by a polynomial [28]. Effective preconditioners produce smoother, flatter eigenspectra which are accurately fit by a lower-order polynomial.

Several authors have recently rediscovered [36], analyzed [37], and extended [38, 39] a technique called support graph theory, which provides powerful methods for constructing effective preconditioners from maximum weight spanning trees. Support graph theory is especially interesting because it provides a set of results guaranteeing the effectiveness of the resulting preconditioners. Preliminary experimental results [37] indicate that tree-based preconditioners can be very effective for many, but not all, of the canonical test problems in the numerical linear algebra literature. In general, the support graph literature has focused on tree-based preconditioners for relatively densely connected graphs, such as nearest-neighbor grids. However, for graphs which are nearly tree-structured (K_T is low rank), it is possible to make stronger statements about the convergence of preconditioned CG, as shown by the following theorem:

Theorem 5. Suppose that the conjugate gradient algorithm is used to solve the Nd-dimensional preconditioned linear system given by equation (29). Then if the cutting matrix K_T has rank(K_T) = m, the preconditioned CG method will converge to the exact solution in at most m + 1 iterations.

Proof. Let λ be any eigenvalue of (Ĵ + K_T)⁻¹ Ĵ, and v one of its corresponding eigenvectors. By definition, we have

    (Ĵ + K_T)⁻¹ Ĵ v = λ v      ⟺      (1 - λ) Ĵ v = λ K_T v              (30)

Since rank(K_T) = m, there exist Nd - m linearly independent eigenvectors in the null space of K_T. Each one of these eigenvectors satisfies equation (30) when λ = 1. Thus, λ = 1 is an eigenvalue of (Ĵ + K_T)⁻¹ Ĵ, and its multiplicity is at least Nd - m. Let {λ_i}_{i=1}^m denote the m eigenvalues of (Ĵ + K_T)⁻¹ Ĵ not constrained to equal one. Consider the (m+1)-th order polynomial p_{m+1}(λ) defined as

    p_{m+1}(λ) = (1 - λ) ∏_{i=1}^m (λ_i - λ) / λ_i                       (31)

By construction, p_{m+1}(0) = 1. Let A ≜ (Ĵ + K_T)⁻¹ Ĵ. Then, from [28, p. 313] we see that

    ‖r^{m+1}‖_{A⁻¹} ≤ ‖r⁰‖_{A⁻¹} max_{λ∈{λ_i(A)}} |p_{m+1}(λ)|           (32)

where r^{m+1} is the residual at iteration m + 1. Since p_{m+1}(λ) = 0 for all λ ∈ {λ_i(A)}, we must have r^{m+1} = 0. Therefore, CG must converge by iteration m + 1.

Thus, when the cutting matrix rank is smaller than the dimension of Ĵ, the preconditioned CG iteration is guaranteed to converge in strictly fewer iterations than the unpreconditioned method. When combined with the results of the previous sections, Theorem 5 has a number of interesting implications. In particular, from Section IV-A we know that a cutting matrix K_T with W associated key nodes can always be chosen so that rank(K_T) is at most O(Wd). Then, if this cutting matrix is used to precondition the CG iteration, we see that the conditional mean can be exactly (non-iteratively) calculated in O(NWd⁴) operations. Similarly, if single-tree preconditioned CG is used to solve the synthetic estimation problems (step 3(b)) of the fixed-point error variance algorithm (Section IV-B), error variances can be exactly calculated in O(NW²d⁵) operations. Note that in problems where W is reasonably large, we typically hope (and find) that iterative methods will converge in fewer than O(Wd) iterations. Nevertheless, it is useful to be able to provide such guarantees of worst-case computational cost. See the following section for examples of models where finite convergence dramatically improves performance.
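A minimal sketch of single-tree preconditioned CG follows, using SciPy's conjugate gradient solver with the preconditioner applied as M⁻¹v = (Ĵ + K_T)⁻¹v through a LinearOperator; in practice a tree solver would replace the explicit inverse. The model is the illustrative 5-node cycle used throughout, and the callback merely counts iterations so that the finite convergence of Theorem 5 (here at most rank(K_T) + 1 = 3 iterations) can be observed.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

# Illustrative 5-node cycle model with local observations.
N = 5
J_hat = 3.0 * np.eye(N)
for s, t in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    J_hat[s, t] = J_hat[t, s] = -0.6
y_bar = np.array([1.0, -0.5, 0.3, 0.8, -1.2])

# Single-tree preconditioner: cut edge (4, 0), so M = J_hat + K_T and rank(K_T) = 2.
K_T = np.zeros((N, N))
K_T[4, 0] = K_T[0, 4] = -J_hat[4, 0]
M_inv = np.linalg.inv(J_hat + K_T)          # a tree solver would be used in practice
M = LinearOperator((N, N), matvec=lambda v: M_inv @ v)

iters = []
x_hat, info = cg(J_hat, y_bar, M=M, callback=lambda xk: iters.append(1))

print("converged:", info == 0)
print("iterations:", len(iters), "<= rank(K_T) + 1 =", np.linalg.matrix_rank(K_T) + 1)
```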

Exact error variances may also be calculated by a recently proposed extended message passing algorithm [27], which defines additional messages to cancel the fill associated with Gaussian elimination on graphs with cycles [12]. Like the ET fixed-point iteration of Algorithm 1, the extended message passing procedure is based on accounting for perturbations from a spanning tree of the graph. The cost of extended message passing is O(NL^2), where L is the number of nodes adjacent to any cut edge. In contrast, Algorithm 1 requires O(NW^2) operations, where the number of key nodes W is strictly less than L (and sometimes much less, as in Section VI-C). More importantly, extended message passing is non-iterative and always requires the full computational cost, while Algorithm 1 produces a series of approximate error variances which are often accurate after far fewer than W iterations.

VI. SIMULATIONS

In this section, we present several computational examples which explore the empirical performance of the inference algorithms developed in this paper. When calculating conditional means, we measure convergence using the normalized residual of equation (11). For error variance simulations, we use the normalized error metric

    ( Σ_{s ∈ V} (P_s^n − P_s)^2 )^{1/2}  /  ( Σ_{s ∈ V} P_s^2 )^{1/2}        (33)

where P_s are the true error variances, and P_s^n the approximations at the nth iteration. Error variance errors are always plotted versus the number of equivalent BP iterations, so that we properly account for the extra cost of solving multiple synthetic inference problems in the ET fixed-point method (Algorithm 1, step 3(b)).

A. Multiscale Modeling

In Figure 7, we compare the performance of the inference methods developed in this paper on the augmented multiscale model G_aug (see Figures 1-2) constructed in Section I-A. To do this, we associate a two-dimensional observation vector y_s = x_s + v_s with each of the 64 finest-scale nodes, where v_s is independent noise with variance equal to the marginal prior variance of x_s. The resulting inverse error covariance matrix is not well conditioned^6 (κ(Ĵ) is very large), making this a challenging inference problem for iterative methods.

^6 For a symmetric positive definite matrix J, the condition number is defined as κ(J) ≜ λ_max(J) / λ_min(J).

We first consider the ET Richardson iteration of Section III. We use zero-diagonal regular cutting matrices to create two different spanning trees: the multiscale tree G_mar of Figure 1, and an alternative

tree where three of the four edges connecting the second and third coarsest scales are removed. These cutting matrices provide a pair of single-tree iterations (denoted by ET(1) and ET(2)), as well as a two-tree iteration (denoted by ET(1,2)) which alternates between trees. As shown in Figure 7(a), despite the large condition number, all three iterations converge at an asymptotically linear rate, as predicted by Proposition 1. However, as in Section III-C, the ET(1,2) iteration converges much faster than either single-tree method.

In Figure 7(b), we compare the ET(1,2) iteration to conjugate gradient (CG) and loopy belief propagation (BP), as well as to the single-tree preconditioned CG (PCG) algorithm of Section V. Due to this problem's large condition number, BP converges rather slowly, while standard CG behaves extremely erratically, requiring over 1000 iterations for convergence. In contrast, since the cutting matrices are rank 12 for this graph, Theorem 5 guarantees that (ignoring numerical issues) the ET PCG iteration will converge in at most 13 iterations. Since the preconditioned system is much better conditioned (κ((Ĵ + K_T)^{-1} Ĵ) ≈ 273), this finite convergence is indeed observed.

Finally, we compare the ET fixed-point error variance iteration (Section IV-B, Algorithm 1) to loopy BP's approximate error variances, as well as variances derived from the CG iteration as in [31]. We consider two options for solving the synthetic inference problems (step 3(b)): the ET(1,2) Richardson iteration, and single-tree PCG. As shown in Figure 7(c), this problem's poor conditioning causes the CG error variance formulas to produce very inaccurate results which never converge to the correct solution. Although loopy BP converges to somewhat more accurate variances, even a single iteration of the PCG fixed-point iteration produces superior estimates. We also see that the PCG method's advantages over ET(1,2) for conditional mean estimation translate directly to more rapid error variance convergence.

B. Sparse versus Dense Graphs with Cycles

This section examines the performance of the ET algorithms on graphs with randomly generated potentials. We consider two different graphs: the sparse augmented multiscale graph G_aug of Figure 1, and a more densely connected nearest-neighbor grid (analogous to the small grid in Figure 4). For grids, one must remove O(N) edges to reveal an embedded tree, so that the ET fixed-point error variance algorithm requires O(N^2) operations per iteration. This cost is intractable for large N, so in this section we focus solely on the calculation of conditional means.

For each graph, we assume all nodes represent scalar Gaussian variables, and consider two different potential function assignments. In the first case, we create a homogeneous model by assigning the same attractive potential

    ψ_{s,t}(x_s, x_t) = exp{ −(1/2) (x_s − x_t)^2 }                        (34)

to each edge. We also consider disordered models where each edge is randomly assigned a different potential of the form

    ψ_{s,t}(x_s, x_t) = exp{ −(1/2) w_{st} (x_s − a_{st} x_t)^2 }          (35)

Here, w_{st} is sampled from an exponential distribution with mean 1, while a_{st} is set to +1 or −1 with equal probability. These two model types provide extreme test cases, each revealing different qualitative features of the proposed algorithms. To create test inference problems, we assign a measurement of the form y_s = x_s + v_s, v_s ~ N(0, 10), to each node in the graph. We consider high noise levels because they lead to the most (numerically) challenging inference problems. For the augmented multiscale model, the ET algorithms use the same spanning trees as in Section VI-A, while the grid simulations use trees similar to the first two spanning trees in Figure 4.

Table I lists the number of iterations required for each algorithm to converge to a fixed normalized residual tolerance. For the disordered case, we generated 100 different random graphical priors, and we report the mean number of required iterations, plus or minus two standard deviations.

TABLE I. Conditional mean simulation results for the ET(1), ET(2), ET(1,2), ET(1) PCG, CG, and BP iterations on the augmented tree and nearest-neighbor grid models, with homogeneous and disordered potentials: number of iterations before convergence to a fixed normalized residual. For the disordered case, the average number of iterations across 100 random prior models is reported, plus or minus two standard deviations. (The numerical entries of the table are not reproduced here.)

For the augmented multiscale model, as in Section VI-A we find that the ET(1,2) and ET PCG methods both dramatically outperform their competitors. In particular, since the cutting matrix must remove only three edges and nodes represent scalar variables, preconditioned CG always finds the exact solution in four iterations. For the nearest-neighbor grid, the performance of the ET Richardson iterations is typically worse than that of the other methods. However, even on this densely connected graph we find that single-tree preconditioning improves CG's convergence rate, particularly for disordered potentials. This performance is not predicted by Theorem 5, but is most likely attributable to the spectral smoothing provided by the tree-based preconditioner.
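The potentials (34)-(35) and the scalar measurements above determine the sparse information-form model, J and h, on which all of the compared algorithms operate. The sketch below spells out this standard Gaussian graphical model bookkeeping; the edge list, random seed, and observation values are hypothetical stand-ins rather than the G_aug or grid models used in Table I, and dense linear algebra is used only because the example is small.

# Mapping the pairwise potentials (34)-(35) and measurements y_s = x_s + v_s,
# v_s ~ N(0, 10), to information form: an edge potential
# exp{-(1/2) w_st (x_s - a_st x_t)^2} adds w_st to J[s,s], a_st^2 w_st to
# J[t,t], and -a_st w_st to J[s,t] = J[t,s]; each measurement adds 1/sigma^2
# to J[s,s] and y_s/sigma^2 to h[s].  The pairwise terms alone give a
# singular (improper) prior; the measurement terms make J positive definite.
import numpy as np

rng = np.random.default_rng(1)
N = 64                                            # scalar nodes (illustrative)
edges = [(i, i + 1) for i in range(N - 1)] + [(0, 31), (16, 47), (32, 63)]
sigma2 = 10.0                                     # measurement noise variance

def build_model(disordered):
    J = np.zeros((N, N))                          # dense only because N is small
    for s, t in edges:
        if disordered:
            w = rng.exponential(1.0)              # w_st ~ Exp(mean 1), eq. (35)
            a = rng.choice([-1.0, 1.0])           # a_st = +/-1, equally likely
        else:
            w, a = 1.0, 1.0                       # homogeneous potential, eq. (34)
        J[s, s] += w
        J[t, t] += a * a * w
        J[s, t] -= a * w
        J[t, s] -= a * w
    J[np.diag_indices(N)] += 1.0 / sigma2         # measurement precisions
    y = rng.standard_normal(N)                    # stand-in observations
    h = y / sigma2                                # information (potential) vector
    return J, h

J, h = build_model(disordered=True)
x_hat = np.linalg.solve(J, h)                     # exact conditional mean J^{-1} h
print("condition number kappa(J) =", np.linalg.cond(J))

Iterative methods such as ET, CG, and BP are then compared by how quickly they drive the normalized residual for J x̂ = h below a fixed tolerance, as reported in Table I.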

C. Distributed Sensor Networks

In this section, we examine an inference problem of the type that arises in applications involving networks of distributed sensors. Figure 8(a) shows a network in which each of the 600 nodes is connected to its spatially nearest neighbors, except for the three central nodes, which produce long-range connections (colored gray). These long-range relationships might be caused by the presence of a few nodes with long-range communication abilities or spatial sensitivities, or could represent objects sensed in the environment. We assign potentials to this graph using the disordered distribution of Section VI-B. We assume that exact inference is possible for the graph consisting of only the black nodes (e.g., via local clustering), but difficult when the longer-range relationships are introduced. Thus, for this example the tree-based algorithms developed earlier in this paper will actually solve this (loopy) system of local interactions at each iteration.

Figure 8(b) compares several different methods for calculating conditional means on this graph. BP performs noticeably better than either unpreconditioned CG or the single-tree Richardson iteration. However, since the three central nodes form a key node set (see Section IV-A), the cutting matrix has rank 6, and preconditioned CG converges to the exact answer in only 7 iterations. For the error variance calculation (Figure 8(c)), the gains are more dramatic. BP quickly converges to a suboptimal answer, but after only one iteration of the six synthetic inference problems, the fixed-point covariance method finds a superior solution. The CG error variance formulas of [31] are again limited by finite precision effects.

VII. DISCUSSION

We have proposed and analyzed a new class of embedded trees (ET) algorithms for iterative, exact inference on Gaussian graphical models with cycles. Each ET iteration exploits the presence of a subgraph for which exact estimation is tractable. Analytic and experimental results demonstrate that the ET preconditioned conjugate gradient method rapidly computes conditional means even on very densely connected graphs. The complementary ET error variance method is most effective for sparser graphs. We provide two examples, drawn from the fields of multiscale modeling and distributed sensing, which show how such graphs may naturally arise.

Although we have developed inference algorithms based upon tractable embedded subgraphs, we have not provided a procedure for choosing these subgraphs. Our results indicate that there are important interactions among cut edges, suggesting that simple methods (e.g., maximal spanning trees; a baseline construction is sketched below) may not provide the best performance. Although support graph theory [36, 37] provides some guarantees for embedded trees, extending these methods to more general subgraphs is an important open problem.
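As a concrete illustration of the baseline choice mentioned above, the sketch below extracts a maximum-weight spanning tree from a small hypothetical loopy graph and lists the cut edges that any cutting matrix K_T for that tree must remove; the distinct endpoints of the cut edges give a simple upper bound on the number of key nodes W of Section IV-A. The graph, weights, and use of SciPy's generic spanning-tree routine are illustrative assumptions, not the paper's procedure.

# Baseline embedded-tree construction (illustrative, hypothetical graph):
# choose a maximum-weight spanning tree of |J_st| and list the cut edges.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(2)
N = 10
# A ring plus two chords, with random positive edge weights (e.g. |J_st|).
edges = [(i, (i + 1) % N) for i in range(N)] + [(0, 5), (2, 7)]
W_mat = np.zeros((N, N))
for s, t in edges:
    W_mat[s, t] = W_mat[t, s] = rng.uniform(0.1, 1.0)

# Maximum-weight spanning tree, via a minimum spanning tree on shifted weights.
shifted = np.where(W_mat > 0.0, W_mat.max() + 1.0 - W_mat, 0.0)
tree = minimum_spanning_tree(sp.csr_matrix(shifted)).toarray()
in_tree = (tree > 0.0) | (tree.T > 0.0)

cut_edges = [(s, t) for s, t in edges if not in_tree[s, t]]
key_node_bound = sorted({v for e in cut_edges for v in e})
print("cut edges:", cut_edges)
print("cut-edge endpoints (upper bound on key nodes):", key_node_bound)

Which edges end up cut depends entirely on the weights, which is one reason such a weight-greedy choice need not interact well with the error variance and preconditioning behavior studied here.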

The multiscale modeling example of Section I-A suggests that adding a small number of edges to tree-structured graphs may greatly increase their effectiveness, in particular alleviating the commonly encountered boundary artifact problem. The ET algorithms demonstrate that it is indeed possible to perform efficient, exact inference on such augmented multiscale models. However, our methods also indicate that this computational feasibility may depend on how quickly the number of extra edges must grow as a function of the process size. This edge allocation problem is one example of a more general modeling question: How should hidden graphical structures be chosen to best balance the goals of model accuracy and computational efficiency?

APPENDIX

We present the proof of Proposition 3. Without loss of generality, assume that the nonzero entries of H lie in the first d rows and columns. Let H be partitioned as

    H = [ A    B^T ]
        [ B    0   ]

where A is a d x d symmetric matrix. Similarly, partition the eigenvectors of H as v = [v_a; v_b], where v_a is of dimension d. The eigenvalues of H then satisfy

    A v_a + B^T v_b = λ v_a        (36)
    B v_a = λ v_b                  (37)

Suppose that λ ≠ 0. From equation (37), v_b is uniquely determined by λ and v_a. Plugging equation (37) into (36), we have

    λ^2 v_a − λ A v_a − B^T B v_a = 0        (38)

This d-dimensional symmetric quadratic eigenvalue problem has at most 2d distinct solutions. Thus, H can have at most 2d nonzero eigenvalues, and rank(H) ≤ 2d.

ACKNOWLEDGMENT

The authors would like to thank Dr. Michael Schneider for many helpful discussions.

REFERENCES

[1] M. J. Wainwright, E. B. Sudderth, and A. S. Willsky. Tree-based modeling and estimation of Gaussian processes on graphs with cycles. In Neural Information Processing Systems 13. MIT Press.
[2] E. B. Sudderth. Embedded trees: Estimation of Gaussian processes on graphs with cycles. Master's thesis, Massachusetts Institute of Technology, February.
[3] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. International Journal of Computer Vision, 5(3).
[4] M. R. Luettgen, W. C. Karl, and A. S. Willsky. Efficient multiscale regularization with applications to the computation of optical flow. IEEE Transactions on Image Processing, 3(1):41-64, January.
[5] P. W. Fieguth, W. C. Karl, A. S. Willsky, and C. Wunsch. Multiresolution optimal interpolation and statistical analysis of TOPEX/POSEIDON satellite altimetry. IEEE Transactions on Geoscience and Remote Sensing, 33(2), March.
[6] T. Kailath, A. H. Sayed, and B. Hassibi. Linear Estimation. Prentice Hall, New Jersey, 2000.

[7] M. Jordan. Graphical models. Statistical Science. To appear in Special Issue on Bayesian Statistics.
[8] K. C. Chou, A. S. Willsky, and A. Benveniste. Multiscale recursive estimation, data fusion, and regularization. IEEE Transactions on Automatic Control, 39(3), March.
[9] N. A. C. Cressie. Statistics for Spatial Data. John Wiley, New York.
[10] P. Abrahamsen. A review of Gaussian random fields and correlation functions. Technical Report 917, Norwegian Computing Center, April.
[11] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, New York.
[12] A. S. Willsky. Multiresolution Markov models for signal and image processing. Proceedings of the IEEE, 90(8), August.
[13] M. M. Daniel and A. S. Willsky. A multiresolution methodology for signal-level fusion and data assimilation with applications to remote sensing. Proceedings of the IEEE, 85(1), January.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo.
[15] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In International Joint Conference on Artificial Intelligence, August.
[16] S. L. Lauritzen. Graphical Models. Oxford University Press, Oxford.
[17] A. P. Dempster. Covariance selection. Biometrics, 28, March.
[18] T. P. Speed and H. T. Kiiveri. Gaussian Markov distributions over finite graphs. The Annals of Statistics, 14(1), March.
[19] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), February.
[20] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2), March.
[21] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13.
[22] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence 15. Morgan Kaufmann.
[23] P. Rusmevichientong and B. Van Roy. An analysis of belief propagation on the turbo decoding graph with Gaussian densities. IEEE Transactions on Information Theory, 47(2), February.
[24] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. Technical Report, MERL, August.
[25] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-based reparameterization analysis of sum-product and its generalizations. IEEE Transactions on Information Theory, 49(5), May. Appeared as MIT LIDS Technical Report P-2510, May.
[26] M. Welling and Y. W. Teh. Linear response algorithms for approximate inference. Submitted to Uncertainty in Artificial Intelligence.
[27] K. H. Plarre and P. R. Kumar. Extended message passing algorithm for inference in loopy Gaussian graphical models. Submitted to Computer Networks.
[28] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia.
[29] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York.
[30] R. Barrett et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia.
[31] J. G. Berryman. Analysis of approximate inverses in tomography II: Iterative inverses. Optimization and Engineering, 1.
[32] D. W. Peaceman and H. H. Rachford, Jr. The numerical solution of parabolic and elliptic differential equations. Journal of the SIAM, 3(1):28-41, March.
[33] R. Nikoukhah, A. S. Willsky, and B. C. Levy. Kalman filtering and Riccati equations for descriptor systems. IEEE Transactions on Automatic Control, 37(9), September.
[34] L. Adams. m-Step preconditioned conjugate gradient methods. SIAM Journal on Scientific and Statistical Computing, 6(2), April.
[35] O. Axelsson. Bounds of eigenvalues of preconditioned matrices. SIAM Journal on Matrix Analysis and Applications, 13(3), July.
[36] M. Bern, J. R. Gilbert, B. Hendrickson, N. Nguyen, and S. Toledo. Support-graph preconditioners. Submitted to SIAM Journal on Matrix Analysis and Applications, January.
[37] D. Chen and S. Toledo. Vaidya's preconditioners: Implementation and experimental study. Submitted to Electronic Transactions on Numerical Analysis, August.
[38] E. Boman and B. Hendrickson. Support theory for preconditioning. Submitted to SIAM Journal on Matrix Analysis and Applications.
[39] E. Boman, D. Chen, B. Hendrickson, and S. Toledo. Maximum weight basis preconditioners. To appear in Numerical Linear Algebra with Applications.

Fig. 1. Three different graphs for modeling 128 samples of a 1D process: a Markov chain (state space model) G_ss, a multiscale autoregressive (MAR) model G_mar, and an augmented multiscale model G_aug which adds three additional edges (below arrows) to the finest scale of G_mar. All nodes represent Gaussian vectors of dimension 2.

Fig. 2. Approximate modeling of the target covariance matrix P (sampled from C(τ; ω) = sin(ωτ)/(ωτ)) using each of the three graphs in Figure 1. Panel titles: P; P_ss with D(P ∥ P_ss) = 9.0; P_mar with D(P ∥ P_mar) = 13.6; and P_aug with D(P ∥ P_aug) = 6.7. The KL divergence of each approximating distribution is given below an intensity plot of the corresponding covariance matrix (largest values in white).
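The divergences quoted in Figure 2 compare zero-mean Gaussian densities, for which the KL divergence has a simple closed form. The snippet below evaluates that formula for generic SPD covariance matrices; the matrices shown are random stand-ins, since the actual P, P_ss, P_mar, and P_aug are not reproduced here.

# KL divergence between zero-mean Gaussians, as used for the D(P || P_model)
# values in Figure 2:
#   D(N(0,P) || N(0,Q)) = (1/2) [ tr(Q^{-1} P) - n + ln det Q - ln det P ].
import numpy as np

def gauss_kl(P, Q):
    """KL divergence D(N(0, P) || N(0, Q)) for SPD covariance matrices."""
    n = P.shape[0]
    q_inv_p = np.linalg.solve(Q, P)
    _, logdet_p = np.linalg.slogdet(P)
    _, logdet_q = np.linalg.slogdet(Q)
    return 0.5 * (np.trace(q_inv_p) - n + logdet_q - logdet_p)

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
P_target = A @ A.T + 8.0 * np.eye(8)      # stand-in target covariance
P_model = P_target + 0.1 * np.eye(8)      # stand-in approximating covariance
print("D(P_target || P_model) =", gauss_kl(P_target, P_model))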
