AND/OR Importance Sampling


Vibhav Gogate and Rina Dechter
Department of Information and Computer Science, University of California, Irvine, CA 92697
{vgogate,dechter}@ics.uci.edu

Abstract

The paper introduces AND/OR importance sampling for probabilistic graphical models. In contrast to importance sampling, AND/OR importance sampling caches samples in the AND/OR space and then extracts a new sample mean from the stored samples. We prove that AND/OR importance sampling may have lower variance than importance sampling, thereby providing a theoretical justification for preferring it over importance sampling. Our empirical evaluation demonstrates that AND/OR importance sampling is far more accurate than importance sampling in many cases.

1 Introduction

Many problems in graphical models, such as computing the probability of evidence in Bayesian networks, solution counting in constraint networks and computing the partition function in Markov random fields, are summation problems, defined as a sum of a function over a domain. Because these problems are NP-hard, sampling-based techniques are often used to approximate the sum. The focus of the current paper is on importance sampling.

The main idea in importance sampling [Geweke, 1989, Rubinstein, 1981] is to transform the summation problem into that of computing a weighted average over the domain by using a special distribution called the proposal (or importance) distribution. Importance sampling then generates samples from the proposal distribution and approximates the true average over the domain by an average over the samples, often referred to as the sample average. The sample average is simply the ratio of the sum of the sample weights to the number of samples, and it can be computed in a memory-less fashion since it requires keeping only these two quantities in memory.

The main idea in this paper is to equip importance sampling with memoization or caching in order to exploit conditional independencies that exist in the graphical model. Specifically, we cache the samples on an AND/OR tree or graph [Dechter and Mateescu, 2007] which respects the structure of the graphical model and then compute a new weighted average over that AND/OR structure, yielding, as we show, an unbiased estimator that has smaller variance than the importance sampling estimator. Similar to AND/OR search [Dechter and Mateescu, 2007], our new AND/OR importance sampling scheme recursively combines samples that are cached in independent components, yielding an increase in the effective sample size, which is part of the reason that its estimates have lower variance.

We present a detailed experimental evaluation comparing importance sampling with AND/OR importance sampling on Bayesian network benchmarks. We observe that the latter outperforms the former on most benchmarks, and in some cases quite significantly.

The rest of the paper is organized as follows. In the next section, we describe preliminaries on graphical models, importance sampling and AND/OR search spaces. In sections 3, 4 and 5 we formally describe AND/OR importance sampling and prove that its sample mean has lower variance than conventional importance sampling. Experimental results are described in section 6, and we conclude with a discussion of related work and a summary in section 7.

2 Preliminaries

We represent sets by bold capital letters and members of a set by capital letters. An assignment of a value to a variable is denoted by a small letter, while bold small letters indicate an assignment to a set of variables.

Definition 2.1 (belief networks).
A belief network (BN) is a graphical model R = (X, D, P), where X = {X_1, ..., X_n} is a set of random variables over multi-valued domains D = {D_1, ..., D_n}. Given a directed acyclic graph G over X, P = {P_i}, where P_i = P(X_i | pa(X_i)), are the conditional probability tables (CPTs) associated with each X_i, and pa(X_i) is the set of parents of the variable X_i in G. A belief network represents a probability distribution over X, $P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid pa(X_i))$.
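As a concrete illustration of Definition 2.1, the following minimal Python sketch (not from the paper) represents a two-variable toy belief network as CPT dictionaries and evaluates $P(\mathbf{x}) = \prod_i P(x_i \mid pa(x_i))$ for a complete assignment. The network, variable names and probabilities are invented for illustration.

```python
# Minimal sketch (not from the paper): a toy belief network over binary
# variables A -> B, represented as CPT dictionaries, and evaluation of the
# joint probability P(x) = prod_i P(x_i | pa(x_i)). Names and numbers are
# made up for illustration.
parents = {"A": [], "B": ["A"]}

# CPTs: map (value, parent-assignment tuple) -> probability
cpt = {
    "A": {(0, ()): 0.6, (1, ()): 0.4},
    "B": {(0, (0,)): 0.9, (1, (0,)): 0.1,
          (0, (1,)): 0.3, (1, (1,)): 0.7},
}

def joint_probability(assignment):
    """P(x) = prod_i P(x_i | pa(x_i)) for a complete assignment {var: value}."""
    prob = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[p] for p in pa)
        prob *= cpt[var][(assignment[var], pa_vals)]
    return prob

print(joint_probability({"A": 1, "B": 0}))   # 0.4 * 0.3 = 0.12
```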

[Figure 1: (a) Bayesian Network, (b) Pseudo-tree, (c) AND/OR tree, (d) AND/OR search graph]

An evidence set E = e is an instantiated subset of variables. The moral graph (or primal graph) of a belief network is the undirected graph obtained by connecting the parents of each node and removing direction.

Definition 2.2 (Probability of Evidence). Given a belief network R and evidence E = e, the probability of evidence P(E = e) is defined as:

$$P(\mathbf{e}) = \sum_{\mathbf{X} \setminus \mathbf{E}} \prod_{j=1}^{n} P(X_j \mid pa(X_j)) \Big|_{\mathbf{E}=\mathbf{e}} \qquad (1)$$

The notation $h(\mathbf{X})|_{\mathbf{E}=\mathbf{e}}$ stands for the function h over X \ E obtained by fixing the assignment E = e.

2.1 AND/OR search spaces

We can compute the probability of evidence by search, by accumulating probabilities over the search space of instantiated variables. In the simplest case, this process defines an OR search tree, whose nodes represent partial variable assignments. This search space does not capture the structure of the underlying graphical model. To remedy this problem, [Dechter and Mateescu, 2007] introduced the notion of AND/OR search space. Given a Bayesian network R = (X, D, P), its AND/OR search space is driven by a pseudo tree, defined below.

Definition 2.3 (Pseudo Tree). Given an undirected graph G = (V, E), a directed rooted tree T = (V, E') defined on all its nodes is called a pseudo tree if any arc of G which is not included in E' is a back-arc, namely it connects a node to an ancestor in T.

Definition 2.4 (Labeled AND/OR tree). Given a graphical model R = (X, D, P), its primal graph G and a backbone pseudo tree T of G, the associated AND/OR search tree has alternating levels of AND and OR nodes. The OR nodes are labeled X_i and correspond to the variables. The AND nodes are labeled <X_i, x_i> and correspond to the value assignments in the domains of the variables. The structure of the AND/OR search tree is based on the underlying backbone tree T. The root of the AND/OR search tree is an OR node labeled by the root of T. Each arc emanating from an OR node to an AND node is associated with a label which can be derived from the CPTs of the Bayesian network [Dechter and Mateescu, 2007]. Each OR node and AND node is also associated with a value that is used for computing the quantity of interest. Semantically, the OR states represent alternative assignments, whereas the AND states represent problem decomposition into independent subproblems, all of which need to be solved. When the pseudo-tree is a chain, the AND/OR search tree coincides with the regular OR search tree. The probability of evidence can be computed from a labeled AND/OR tree by recursively computing the value of all nodes from leaves to the root [Dechter and Mateescu, 2007].

Example 2.5. Figure 1(a) shows a Bayesian network over seven variables with domains {0,1}. F and G are evidence nodes. Figure 1(c) shows the AND/OR search tree for the Bayesian network based on the pseudo-tree in Figure 1(b). Note that because F and G are instantiated, the search space has only 5 variables.

2.2 Computing Probability of Evidence Using Importance Sampling

Importance sampling [Rubinstein, 1981] is a simulation technique commonly used to evaluate the sum $M = \sum_{\mathbf{x}} f(\mathbf{x})$ for some real function f. The idea is to generate samples $\mathbf{x}^1, \ldots, \mathbf{x}^N$ from a proposal distribution Q (satisfying $f(\mathbf{x}) > 0 \Rightarrow Q(\mathbf{x}) > 0$) and then estimate M as follows:

$$M = \sum_{\mathbf{x}} f(\mathbf{x}) = \sum_{\mathbf{x}} \frac{f(\mathbf{x})}{Q(\mathbf{x})}\, Q(\mathbf{x}) = E_Q\!\left[\frac{f(\mathbf{x})}{Q(\mathbf{x})}\right] \qquad (2)$$

$$\widehat{M} = \frac{1}{N} \sum_{i=1}^{N} w(\mathbf{x}^i), \quad \text{where } w(\mathbf{x}^i) = \frac{f(\mathbf{x}^i)}{Q(\mathbf{x}^i)} \qquad (3)$$

w is often referred to as the sample weight. It is known that the expected value $E(\widehat{M}) = M$ [Rubinstein, 1981].
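The following minimal Python sketch (not from the paper) spells out the estimator of Equations 2 and 3 on a toy problem: samples are drawn from an assumed proposal Q over a small discrete domain, and the estimate is just the running sum of sample weights divided by the number of samples, which is why it can be computed in a memory-less fashion. The function f and proposal Q are invented for illustration.

```python
# Minimal sketch (not from the paper) of the estimator in Equations 2-3:
# estimate M = sum_x f(x) by averaging the weights f(x)/Q(x) of samples
# drawn from a proposal Q. f and Q are illustrative toy tables over a
# small discrete domain.
import random

domain = [0, 1, 2, 3]
f = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}          # M = sum(f.values()) = 1.0
Q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}      # uniform proposal

def importance_sampling_estimate(num_samples, rng=random):
    total_weight = 0.0
    for _ in range(num_samples):
        x = rng.choices(domain, weights=[Q[v] for v in domain])[0]
        total_weight += f[x] / Q[x]            # sample weight w(x) = f(x)/Q(x)
    return total_weight / num_samples          # memory-less: two scalars suffice

print(importance_sampling_estimate(10000))     # close to 1.0
```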
To compute the probability of evidence by importance sampling, we use the substitution:

$$f(\mathbf{x}) = \prod_{j=1}^{n} P(X_j \mid pa(X_j)) \Big|_{\mathbf{E}=\mathbf{e}} \qquad (4)$$

Several choices are available for the proposal distribution Q(x), ranging from the prior distribution, as in likelihood weighting, to more sophisticated alternatives such as IJGP-Sampling [Gogate and Dechter, 2005] and EPIS-BN [Yuan and Druzdzel, 2006], where the output of belief propagation is used to compute the proposal distribution.

As in prior work [Cheng and Druzdzel, 2000], we assume that the proposal distribution is expressed in a factored product form: $Q(\mathbf{X}) = \prod_{i=1}^{n} Q_i(X_i \mid X_1, \ldots, X_{i-1}) = \prod_{i=1}^{n} Q_i(X_i \mid \mathbf{Y}_i)$, where $\mathbf{Y}_i \subseteq \{X_1, \ldots, X_{i-1}\}$, $Q_i(X_i \mid \mathbf{Y}_i) = Q(X_i \mid X_1, \ldots, X_{i-1})$ and $|\mathbf{Y}_i| < c$ for some constant c. We can generate a full sample from Q as follows. For i = 1 to n, sample $X_i = x_i$ from the conditional distribution $Q(X_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})$ and set $X_i = x_i$.

3 AND/OR importance sampling

We first discuss computing expectation by parts, which forms the backbone of AND/OR importance sampling. We then present the AND/OR importance sampling scheme formally and derive its properties.

3.1 Estimating Expectation by Parts

In Equation 2, the expectation of a multi-variable function is computed by summing over the entire domain. This method is clearly inefficient because it does not take into account the decomposition of the multi-variable function, as we illustrate below.

Consider the tree graphical model given in Figure 2(a). Let A = a and B = b be the evidence variables. Let Q(X, Y, Z) = Q(Z) Q(X | Z) Q(Y | Z) be the proposal distribution. For simplicity, let us assume that f(Z) = P(Z), f(X, Z) = P(X | Z) P(A = a | X) and f(Y, Z) = P(Y | Z) P(B = b | Y). We can express the probability of evidence P(a, b) as:

$$P(a,b) = \sum_{Z}\sum_{X}\sum_{Y} \frac{f(Z)\, f(X,Z)\, f(Y,Z)}{Q(Z)Q(X|Z)Q(Y|Z)}\, Q(Z)Q(X|Z)Q(Y|Z) = E_Q\!\left[\frac{f(Z)\, f(X,Z)\, f(Y,Z)}{Q(Z)Q(X|Z)Q(Y|Z)}\right] \qquad (5)$$

We can decompose the expectation in Equation 5 into smaller components as follows:

$$P(a,b) = \sum_{Z} \frac{f(Z)Q(Z)}{Q(Z)} \left(\sum_{X} \frac{f(X,Z)Q(X|Z)}{Q(X|Z)}\right) \left(\sum_{Y} \frac{f(Y,Z)Q(Y|Z)}{Q(Y|Z)}\right) \qquad (6)$$

The quantities in the two brackets in Equation 6 are, by definition, conditional expectations of a function over X and Y respectively given Z. Therefore, Equation 6 can be written as:

$$P(a,b) = \sum_{Z} \frac{f(Z)}{Q(Z)}\, E\!\left[\frac{f(X,Z)}{Q(X|Z)}\,\Big|\, Z\right] E\!\left[\frac{f(Y,Z)}{Q(Y|Z)}\,\Big|\, Z\right] Q(Z) \qquad (7)$$

By definition, Equation 7 can be written as:

$$P(a,b) = E\!\left[\frac{f(Z)}{Q(Z)}\, E\!\left[\frac{f(X,Z)}{Q(X|Z)}\,\Big|\, Z\right] E\!\left[\frac{f(Y,Z)}{Q(Y|Z)}\,\Big|\, Z\right]\right] \qquad (8)$$

We will refer to equations of the form 8 as expectation by parts, borrowing from similar terms such as integration and summation by parts. If the domain size of all variables is d = 3, for example, computing the expectation using Equation 5 would require summing over d^3 = 3^3 = 27 terms, while computing the same expectation by parts would require summing over d + d + d = 3 + 3 + 3 = 9 terms. Therefore, exactly computing expectation by parts is clearly more efficient.

Importance sampling ignores the decomposition of the expectation while approximating it by the sample average. Our new algorithm estimates the true expectation by decomposing it into several conditional expectations and then approximating each by an appropriate weighted average over the samples. Since computing expectation by parts is less complex than computing the expectation by summing over the domain, we expect that approximating it by parts will be easier as well.

We next illustrate how to estimate expectation by parts on our example Bayesian network given in Figure 2(a). Assume that we are given N samples (z^1, x^1, y^1), ..., (z^N, x^N, y^N) generated from Q, decomposed according to Figure 2(a). For simplicity, let {0,1} be the domain of Z and let Z = 0 and Z = 1 be sampled N_0 and N_1 times respectively. We can approximate $E\!\left[\frac{f(X,Z)}{Q(X|Z)}\,|\,Z\right]$ and $E\!\left[\frac{f(Y,Z)}{Q(Y|Z)}\,|\,Z\right]$ by $g_X(Z=j)$ and $g_Y(Z=j)$ defined below:

$$g_X(Z=j) = \frac{1}{N_j} \sum_{i=1}^{N} \frac{f(x^i, Z=j)\, I(x^i, Z=j)}{Q(x^i \mid Z=j)}, \qquad g_Y(Z=j) = \frac{1}{N_j} \sum_{i=1}^{N} \frac{f(y^i, Z=j)\, I(y^i, Z=j)}{Q(y^i \mid Z=j)} \qquad (9)$$

where $I(x^i, Z=j)$ (or $I(y^i, Z=j)$) is an indicator function which is 1 iff the tuple $(x^i, Z=j)$ (or $(y^i, Z=j)$) is generated in any of the samples and 0 otherwise.
From Equation 8, we can now derive the following unbiased estimator for P(a,b):

$$\widehat{P}(a,b) = \sum_{j=0}^{1} \frac{N_j}{N}\, \frac{f(Z=j)}{Q(Z=j)}\, g_X(Z=j)\, g_Y(Z=j) \qquad (10)$$

Importance sampling, on the other hand, would estimate P(a,b) as follows:

$$\widehat{P}(a,b) = \sum_{j=0}^{1} \frac{f(Z=j)}{Q(Z=j)}\, \frac{1}{N} \sum_{i=1}^{N} \frac{f(x^i, Z=j)\, f(y^i, Z=j)}{Q(x^i \mid Z=j)\, Q(y^i \mid Z=j)}\, I(x^i, y^i, Z=j) \qquad (11)$$

where $I(x^i, y^i, Z=j)$ is an indicator function which is 1 iff the tuple $(x^i, y^i, Z=j)$ is generated in any of the samples and 0 otherwise.
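To make the contrast between Equations 10 and 11 concrete, here is a short Python sketch (not from the paper). It draws samples for the toy model Z -> X, Z -> Y under a uniform proposal and computes both estimates; the tables standing in for f(Z), f(X,Z) and f(Y,Z) are hypothetical numbers, not the CPTs of Figure 2.

```python
# Minimal sketch (not the paper's code) contrasting the plain importance
# sampling estimate of Equation 11 with the estimate by parts of Equation 10
# on the toy model Z -> X, Z -> Y with evidence on the leaves. fZ, fXZ, fYZ
# play the roles of f(Z), f(X,Z) and f(Y,Z); their values are made up.
# The proposal Q is uniform over {0,1} for every variable.
import random
from collections import defaultdict

fZ  = {0: 0.6, 1: 0.4}
fXZ = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.7, (1, 1): 0.1}   # f(x, z)
fYZ = {(0, 0): 0.4, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.6}   # f(y, z)
Qz = Qx = Qy = 0.5                                           # uniform proposal

def draw_samples(n, rng=random):
    return [(rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 1))
            for _ in range(n)]                               # (z, x, y)

def plain_is(samples):
    # Equation 11: average the full-sample weights f(z)f(x,z)f(y,z)/Q(z,x,y).
    n = len(samples)
    return sum(fZ[z] * fXZ[(x, z)] * fYZ[(y, z)] / (Qz * Qx * Qy)
               for z, x, y in samples) / n

def by_parts(samples):
    # Equation 10: group samples by z, estimate g_X(z) and g_Y(z) separately
    # (Equation 9), then combine with the frequency N_z / N.
    n = len(samples)
    by_z = defaultdict(list)
    for z, x, y in samples:
        by_z[z].append((x, y))
    est = 0.0
    for z, group in by_z.items():
        n_z = len(group)
        g_x = sum(fXZ[(x, z)] / Qx for x, _ in group) / n_z
        g_y = sum(fYZ[(y, z)] / Qy for _, y in group) / n_z
        est += (n_z / n) * (fZ[z] / Qz) * g_x * g_y
    return est

samples = draw_samples(5000)
print(plain_is(samples), by_parts(samples))   # both estimate P(a, b)
```

Both functions are unbiased estimators of the same quantity; the by-parts version simply reuses every sampled x value against every sampled y value within each group, which is the source of the larger effective sample size discussed next.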

[Figure 2: (a) Bayesian network, its CPTs, proposal distribution and samples; (c) AND/OR sample tree, with the derivation of two arc weights]

Equation 10, which is an unbiased estimator of the expectation by parts given in Equation 8, provides another rationale for preferring it over the usual importance sampling estimator given by Equation 11. In particular, in Equation 10 we estimate two functions, defined over the random variables (X, Z) and (Y, Z) respectively, from the generated samples. In importance sampling, on the other hand, we estimate a function over the joint random variable (X, Y, Z) using the generated samples. Because the samples for (X, Z = z_j) and (Y, Z = z_j) are considered independently in Equation 10, the N_j samples drawn over the joint random variable (X, Y, Z = z_j) in Equation 11 correspond to a larger set of N_j x N_j = N_j^2 virtual samples. We know [Rubinstein, 1981] that the variance (and therefore the mean-squared error) of an unbiased estimator decreases with an increase in the effective sample size. Consequently, our new estimation technique will have lower error than the conventional approach.

In the following subsection, we discuss how the AND/OR structure can be used for estimating expectation by parts, yielding the AND/OR importance sampling scheme.

3.2 Computing the Sample Mean in AND/OR-space

In this subsection, we formalize the ideas of estimating expectation by parts on a general AND/OR tree, starting with some required definitions. We define the notion of an AND/OR sample tree, which is restricted to the generated samples and which will be used to compute the AND/OR sample mean. The labels on this AND/OR tree are set to account for the importance weights.

Definition 3.1 (Arc Labeled AND/OR Sample Tree). Given a graphical model R = (X, D, P), a pseudo-tree T(V, E), a proposal distribution $Q = \prod_{i=1}^{n} Q(X_i \mid Anc(X_i))$ such that Anc(X_i) is a subset of all ancestors of X_i in T, a sequence of assignments (samples) S and a complete AND/OR search tree $\phi_T$, an AND/OR sample tree $S_{AOT}$ is constructed from $\phi_T$ by removing all edges and corresponding nodes which are not in S, i.e. which are not sampled. The arc-label for an OR node X_i to an AND node <X_i = x_i> in $S_{AOT}$ is a pair <w, #> where: $w = \frac{P(X_i = x_i, anc(x_i))}{Q(X_i = x_i \mid anc(x_i))}$ is called the weight of the arc; $anc(x_i)$ is the assignment of values to all variables from the OR node X_i to the root node of $S_{AOT}$ and $P(X_i = x_i, anc(x_i))$ is the product of all functions in R that mention X_i but do not mention any variable ordered below it in T, evaluated at $(X_i = x_i, anc(x_i))$. # is the frequency of the arc, namely, it is equal to the number of times the assignment $(X_i = x_i, anc(x_i))$ is sampled.

Example 3.2. Consider again the Bayesian network given in Figure 2. Assume that the proposal distribution Q(X, Y, Z) is uniform. Figure 2 shows four hypothetical random samples drawn from Q. Figure 2(c) shows the AND/OR sample tree over the four samples. Each arc from an OR node to an AND node in the AND/OR sample tree is labeled with the appropriate frequencies and weights according to Definition 3.1. Figure 2(c) also shows the derivation of the arc-weights for two arcs.

The main virtue of arranging the samples on an AND/OR sample tree is that we can exploit the independencies to define the AND/OR sample mean.

Definition 3.3 (AND/OR Sample Mean). Given an AND/OR sample tree with arcs labeled according to Definition 3.1, the value of a node is defined recursively as follows.
The value of leaf AND nodes is 1 and the value of leaf OR nodes is 0. Let C(n) denote the child nodes of n and v(n) denote the value of node n. If n is an AND node then:

$$v(n) = \prod_{n' \in C(n)} v(n')$$

and if n is an OR node then:

$$v(n) = \frac{\sum_{n' \in C(n)} \#(n,n')\, w(n,n')\, v(n')}{\sum_{n' \in C(n)} \#(n,n')}$$

The AND/OR sample mean is the value of the root node.

We can show that the value of an OR node is equal to an unbiased estimate of the conditional expectation of the variable at the node given an assignment from the root to the parent of the node. Since all variables except the evidence variables are unassigned at the root node, the value of the root node equals the AND/OR sample mean, which is an unbiased estimate of the probability of evidence. Formally,

THEOREM 3.4. The AND/OR sample mean is an unbiased estimate of the probability of evidence.

Example 3.5. The calculations involved in computing the sample mean on the AND/OR sample tree for our example Bayesian network given in Figure 2 are shown in Figure 3. Each AND node and OR node in Figure 3 is marked with a value that is computed recursively using Definition 3.3. The value of the OR nodes X and Y given Z = j, j in {0,1}, is equal to g_X(Z = j) and g_Y(Z = j) respectively, as defined in Equation 9. The value of the root node is equal to the AND/OR sample mean, which is equal to the sample mean computed by parts in Equation 10.

[Figure 3: Computation of the values of OR and AND nodes in an AND/OR sample tree. The value of the root node is equal to the AND/OR sample mean]

Algorithm 1: AND/OR Importance Sampling
Input: an ordering O = (X_1, ..., X_n), a Bayesian network and a proposal distribution Q
Output: Estimate of the Probability of Evidence
1: Generate samples x^1, ..., x^N from Q along O.
2: Build an AND/OR sample tree S_AOT for the samples x^1, ..., x^N along the ordering O.
3: Initialize all labeling functions <w, #> on each arc from an OR-node n to an AND-node n' using Definition 3.1.
4: FOR all leaf nodes i of S_AOT do
5:   IF i is an AND-node, v(i) = 1, ELSE v(i) = 0
6: FOR every node n from the leaves to the root do
7:   Let C(n) denote the child nodes of node n
8:   IF n = <X, x> is an AND node, then v(n) = \prod_{n' \in C(n)} v(n')
9:   ELSE IF n = X is an OR node, then v(n) = [ \sum_{n' \in C(n)} \#(n,n') w(n,n') v(n') ] / [ \sum_{n' \in C(n)} \#(n,n') ]
10: Return v(root node)

We now have the necessary definitions to formally present the AND/OR importance sampling scheme (see Algorithm 1). In Steps 1-3, the algorithm generates samples from Q and stores them on an AND/OR sample tree. The algorithm then computes the AND/OR sample mean over the AND/OR sample tree recursively from the leaves to the root in Steps 4-9. We can show that the value v(n) of a node in the AND/OR sample tree stores the sample average of the subproblem rooted at n, subject to the current variable instantiation along the path from the root to n. If n is the root, then v(n) is the AND/OR sample mean, which is our AND/OR estimator of the probability of evidence.

Finally, we summarize the complexity of computing the AND/OR sample mean in the following theorem:

THEOREM 3.6. Given N samples and n variables (with constant domain size), the time complexity of computing the AND/OR sample mean is O(Nn) (the same as importance sampling) and its space complexity is O(Nn) (the space complexity of importance sampling is constant).
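As a rough illustration of Algorithm 1, the sketch below (not the authors' implementation) computes the AND/OR sample mean directly from the stored samples by applying the recursion of Definition 3.3; the pseudo-tree is passed as a children map, and arc_weight is an assumed callable supplied by the caller that returns the arc weight of Definition 3.1.

```python
# Minimal sketch (not the authors' code) of Algorithm 1 / Definition 3.3:
# OR value = frequency-weighted average of child AND values,
# AND value = product of child OR values over independent branches.
# `children` encodes the pseudo-tree; `arc_weight(var, value, path)` is a
# hypothetical user-supplied callable returning the <w> part of the arc label.

def aosample_mean(root, children, samples, arc_weight):
    """samples: list of dicts {variable: value}; returns the AND/OR sample mean."""

    def or_value(var, path, subset):
        # Group the samples reaching this OR node by their value of `var`.
        groups = {}
        for s in subset:
            groups.setdefault(s[var], []).append(s)
        total = sum(len(g) for g in groups.values())
        acc = 0.0
        for val, group in groups.items():
            w = arc_weight(var, val, path)            # <w, #> label of the arc
            new_path = {**path, var: val}
            and_val = 1.0                             # leaf AND nodes have value 1
            for child in children.get(var, []):       # AND node: product over
                and_val *= or_value(child, new_path, group)   # independent branches
            acc += len(group) * w * and_val
        return acc / total                            # frequency-weighted average

    return or_value(root, {}, samples)
```

For the toy model of Section 3.1, children = {'Z': ['X', 'Y'], 'X': [], 'Y': []}, and an arc_weight that returns f(Z=z)/Q(z) at the root and f(x,z)/Q(x|z), f(y,z)/Q(y|z) below it makes the returned value coincide with Equation 10.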
4 Variance Reduction

In this section, we prove that the AND/OR sample mean may have lower variance than the sample mean computed using importance sampling (Equation 3).

THEOREM 4.1 (Variance Reduction). The variance of the AND/OR sample mean is less than or equal to the variance of the importance sampling sample mean.

Proof. The details of the proof are quite involved and therefore we only provide the intuitions. As noted earlier, the guiding principle of the AND/OR sample mean is to take advantage of conditional independence in the graphical model. Let us assume that we have three random variables X, Y and Z with the following relationship: X and Y are independent of each other given Z (similar to our example Bayesian network). The expression for the variance derived here can be used in an induction step (the induction is carried out over the nodes of the pseudo tree) to prove the theorem. In this case, importance sampling generates N samples ((x^1, y^1, z^1), ..., (x^N, y^N, z^N)) along the order Z, X, Y and estimates the mean as follows:

$$\widehat{\mu}_{IS}(N) = \frac{1}{N} \sum_{i=1}^{N} x^i\, y^i\, z^i \qquad (12)$$

Without loss of generality, let {z_1, z_2} be the domain of Z and let these values be sampled N_1 and N_2 times respectively. We can rewrite Equation 12 as follows:

$$\widehat{\mu}_{IS}(N) = \frac{1}{N} \sum_{j=1}^{2} N_j\, z_j\, \frac{\sum_{i=1}^{N} x^i\, y^i\, I(z_j, x^i, y^i)}{N_j} \qquad (13)$$

where $I(z_j, x^i, y^i)$ is an indicator function which is 1 iff the partial assignment $(z_j, x^i, y^i)$ is generated in any of the samples and 0 otherwise. The AND/OR sample mean is defined as:

$$\widehat{\mu}_{AO}(N) = \frac{1}{N} \sum_{j=1}^{2} N_j\, z_j \left(\frac{\sum_{i=1}^{N} x^i\, I(z_j, x^i)}{N_j}\right) \left(\frac{\sum_{i=1}^{N} y^i\, I(z_j, y^i)}{N_j}\right) \qquad (14)$$

where $I(z_j, x^i)$ (and similarly $I(z_j, y^i)$) is an indicator function which equals 1 when one of the samples contains the tuple $(z_j, x^i)$ (similarly $(z_j, y^i)$) and is 0 otherwise. By simple algebraic manipulations, we can prove that the variance of the estimator $\widehat{\mu}_{IS}(N)$ is given by:

$$Var(\widehat{\mu}_{IS}(N)) = \frac{1}{N} \left( \sum_{j=1}^{2} z_j^2\, Q(z_j) \Big( \mu(X|z_j)^2\, V(Y|z_j) + \mu(Y|z_j)^2\, V(X|z_j) + V(X|z_j)\, V(Y|z_j) \Big) \right) - \frac{\mu^2}{N} \qquad (15)$$

Similarly, the variance of the AND/OR sample mean is given by:

$$Var(\widehat{\mu}_{AO}(N)) = \frac{1}{N} \left( \sum_{j=1}^{2} z_j^2\, Q(z_j) \Big( \mu(X|z_j)^2\, V(Y|z_j) + \mu(Y|z_j)^2\, V(X|z_j) + \frac{V(X|z_j)\, V(Y|z_j)}{N_j} \Big) \right) - \frac{\mu^2}{N} \qquad (16)$$

where $\mu(X|z_j)$ and $V(X|z_j)$ are the conditional mean and variance respectively of X given Z = z_j. Similarly, $\mu(Y|z_j)$ and $V(Y|z_j)$ are the conditional mean and variance respectively of Y given Z = z_j. From Equations 15 and 16, if N_j = 1 for all j, then we can see that $Var(\widehat{\mu}_{AO}(N)) = Var(\widehat{\mu}_{IS}(N))$. However, if N_j > 1, then $Var(\widehat{\mu}_{AO}(N)) < Var(\widehat{\mu}_{IS}(N))$. This proves that the variance of the AND/OR sample mean is less than or equal to the variance of the conventional sample mean in this special case. As noted earlier, using this case in an induction over the nodes of a general pseudo-tree completes the proof.

5 Estimation in AND/OR graphs

Next, we describe a more powerful algorithm for estimating the mean in AND/OR-space by moving from AND/OR trees to AND/OR graphs, as presented in [Dechter and Mateescu, 2007]. An AND/OR tree may contain nodes that root identical subtrees. When such unifiable nodes are merged, the tree becomes a graph and its size becomes smaller. Some unifiable nodes can be identified using contexts, defined below.

Definition 5.1 (Context). Given a belief network and the corresponding AND/OR search tree $S_{AOT}$ relative to a pseudo-tree T, the context of any AND node <X_i, x_i> in $S_{AOT}$, denoted by context(X_i), is defined as the set of ancestors of X_i in T that are connected to X_i and to descendants of X_i.

The context-minimal AND/OR graph is obtained by merging all the context-unifiable AND nodes. The size of the largest context is bounded by the treewidth w* of the pseudo-tree [Dechter and Mateescu, 2007]. Therefore, the time and space complexity of a search algorithm traversing the context-minimal AND/OR graph is O(exp(w*)).

Example 5.2. For illustration, consider the context-minimal graph in Figure 1(d), based on the pseudo-tree in Figure 1(b). Its size is far smaller than that of the AND/OR tree in Figure 1(c) (30 nodes vs. 38 nodes). The contexts of the nodes can be read from the pseudo-tree in Figure 1 as follows: context(A) = {A}, context(B) = {B,A}, context(C) = {C,B,A}, context(D) = {D,C,B} and context(E) = {E,B,A}.

The main idea in AND/OR-graph estimation is to store all samples on an AND/OR graph instead of an AND/OR tree. Similar to an AND/OR sample tree, we can define an identical notion of an AND/OR sample graph.

Definition 5.3 (Arc labeled AND/OR sample graph). Given a complete AND/OR graph $\phi_G$ and a set of samples S, an AND/OR sample graph $S_{AOG}$ is obtained by removing all nodes and arcs not in S from $\phi_G$. The labels on $S_{AOG}$ are set similarly to those of an AND/OR sample tree (see Definition 3.1).

Example 5.4. The bold edges and nodes in Figure 1(c) define an AND/OR sample tree. The bold edges and nodes in Figure 1(d) define an AND/OR sample graph corresponding to the same samples that define the AND/OR sample tree in Figure 1(c).

The algorithm for computing the sample mean on AND/OR sample graphs is identical to the algorithm for AND/OR trees (Steps 4-10 of Algorithm 1). The main reason for moving from trees to graphs is that the variance of the sample mean computed on an AND/OR sample graph can be even smaller than that computed on an AND/OR sample tree. More formally,

THEOREM 5.5. Let $V(\mu_{AOG})$, $V(\mu_{AOT})$ and $V(\mu_{IS})$ be the variance of the AND/OR sample mean on an AND/OR sample graph, the variance of the AND/OR sample mean on an AND/OR sample tree and the variance of the sample mean of importance sampling respectively. Then, given the same set of input samples: $V(\mu_{AOG}) \leq V(\mu_{AOT}) \leq V(\mu_{IS})$.

We omit the proof due to lack of space.
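The sketch below (again not the authors' code, and making the assumption that arc weights depend only on a node's context variables) indicates how the tree recursion can be turned into the graph variant of this section: OR-node values are memoized on (variable, context assignment), so samples that agree on the context are pooled across merged nodes. Here contexts[var] is assumed to hold the ancestor part of the context of Definition 5.1, and children, arc_weight and the sample format are the same hypothetical interfaces as in the earlier tree sketch.

```python
# Minimal sketch (not the authors' code) of AND/OR-graph estimation:
# cache OR-node values by (variable, context assignment) so that samples
# agreeing on the context are pooled, mirroring the merging of unifiable
# nodes. arc_weight is assumed to depend only on context variables.

def aograph_sample_mean(root, children, contexts, samples, arc_weight):
    cache = {}

    def or_value(var, path):
        ctx = tuple(sorted((a, path[a]) for a in contexts[var] if a in path))
        key = (var, ctx)
        if key in cache:
            return cache[key]                       # merged (unified) OR node
        # Pool every sample that agrees with the path on the context variables.
        subset = [s for s in samples if all(s[a] == v for a, v in ctx)]
        groups = {}
        for s in subset:
            groups.setdefault(s[var], []).append(s)
        total = sum(len(g) for g in groups.values())
        acc = 0.0
        for val, group in groups.items():
            w = arc_weight(var, val, path)          # assumed context-determined
            new_path = {**path, var: val}
            and_val = 1.0
            for child in children.get(var, []):
                and_val *= or_value(child, new_path)
            acc += len(group) * w * and_val
        value = acc / total
        cache[key] = value
        return value

    return or_value(root, {})
```

Because each merged OR node sees all samples compatible with its context rather than only those compatible with one full path, the per-node sample counts can only grow, which is the intuition behind Theorem 5.5.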
THEOREM 5.6 (Complexity of computing the AND/OR graph sample mean). Given a graphical model with n variables, a pseudo-tree with treewidth w* and N samples, the time complexity of AND/OR graph sampling is O(N n w*) while its space complexity is O(N n).

6 Experimental Evaluation

6.1 Competing Algorithms

The performance of importance sampling based algorithms is highly dependent on the proposal distribution [Cheng and Druzdzel, 2000]. It was shown that computing the proposal distribution from the output of the Generalized Belief Propagation scheme of Iterative Join Graph Propagation (IJGP) yields better empirical performance than other available choices [Gogate and Dechter, 2005]. Therefore, we use the output of IJGP to compute the proposal distribution Q. The complexity of IJGP is time and space exponential in its i-bound, a parameter that bounds cluster sizes. We use an i-bound of 5 in all our experiments.

We experimented with three sampling algorithms for benchmarks which do not have determinism: (a) pure IJGP-sampling, (b) AND/OR-tree IJGP-sampling and (c) AND/OR-graph IJGP-sampling. Note that the underlying scheme for generating the samples is identical in all the methods. What changes is the method of accumulating the samples and deriving the estimates.

On benchmarks which have zero probabilities or determinism, we use the SampleSearch scheme introduced by [Gogate and Dechter, 2007] to overcome the rejection problem. We experiment with the following versions of SampleSearch on deterministic networks: (a) pure SampleSearch, (b) AND/OR-tree SampleSearch and (c) AND/OR-graph SampleSearch.

6.2 Results

We experimented with three sets of benchmark belief networks: (a) Random networks, (b) Linkage networks and (c) Grid networks. Note that only the linkage and grid networks have zero probabilities, and on these we use SampleSearch. The exact P(e) for most instances is available from the UAI 2006 competition web-site. Our results are presented in Figures 4-6. Each figure shows the approximate probability of evidence as a function of time. The bold line in each figure indicates the exact probability of evidence. The reader can visualize the error from the distance between the approximate curves and the exact line. For lack of space, we show only part of our results. Each figure shows the number of variables n, the maximum domain size d and the number of evidence nodes |E| for the respective benchmark.

Random Networks. From Figures 4(a) and 4(b), we see that AND/OR-graph sampling is better than AND/OR-tree sampling, which in turn is better than pure IJGP-sampling. However, there is not much difference in the error because the proposal distribution seems to be a very good approximation of the posterior.

Grid Networks. All Grid instances have binary nodes and a small number of evidence nodes. From Figures 5(a) and 5(b), we can see that AND/OR-graph SampleSearch and AND/OR-tree SampleSearch are substantially better than pure SampleSearch.

Linkage Networks. The linkage instances are generated by converting a pedigree to a Bayesian network [Fishelson and Geiger, 2003]. These networks are large, with a maximum domain size of 36. Note that it is hard to compute the exact probability of evidence in these networks [Fishelson and Geiger, 2003]. We observe from Figures 6(a)-(d) that AND/OR-graph SampleSearch is substantially more accurate than AND/OR-tree SampleSearch, which in turn is substantially more accurate than pure SampleSearch. Notice the log scale in Figures 6(a)-(d), which means that there is an order-of-magnitude difference between the errors.

Our results suggest that the AND/OR-graph and AND/OR-tree estimators yield far better performance than conventional estimators, especially on problems in which the proposal distribution is a bad approximation of the posterior distribution.

[Figure 4: Random Networks - P(e) as a function of time for IJGP-Sampling, AND/OR-Graph-IJGP-Sampling and AND/OR-Tree-IJGP-Sampling on two random instances]

[Figure 5: Grid Networks - P(e) as a function of time for SampleSearch, AND/OR-Graph-SampleSearch and AND/OR-Tree-SampleSearch on two grid instances]

7 Related Work and Summary

The work presented in this paper is related to the work by [Hernández and Moral, 1995, Kjærulff, 1995, Dawid et al., 1994], who perform sampling-based inference on a junction tree. The main idea in these papers is to perform message passing on a junction tree by substituting messages which are too hard to compute exactly by their sampling-based approximations.

[Figure 6: Linkage Bayesian Networks - log P(e) as a function of time for SampleSearch, AND/OR-Graph-SampleSearch and AND/OR-Tree-SampleSearch on four linkage instances, panels (a)-(d)]

[Kjærulff, 1995, Dawid et al., 1994] use Gibbs sampling, while [Hernández and Moral, 1995] use importance sampling to approximate the messages. Similar to recent work on Rao-Blackwellised sampling such as [Bidyuk and Dechter, 2003, Paskin, 2004, Gogate and Dechter, 2005], variance reduction is achieved in these junction-tree based sampling schemes because of some exact computations, as dictated by the Rao-Blackwell theorem. AND/OR estimation, however, does not require exact computations to achieve variance reduction. In fact, the variance reduction due to Rao-Blackwellisation is orthogonal to the variance reduction achieved by AND/OR estimation, and therefore the two could be combined to achieve more variance reduction. Also, unlike our work, which focuses on the probability of evidence, the focus of these aforementioned papers was on belief updating.

To summarize, the paper introduces a new sampling-based estimation technique called AND/OR importance sampling. The main idea of our new scheme is to derive statistics on the generated samples by using an AND/OR tree or graph that takes advantage of the independencies present in the graphical model. We proved that the sample mean computed on an AND/OR tree or graph may have smaller variance than the sample mean computed using the conventional approach. Our experimental evaluation is preliminary but quite promising, showing that on most instances the AND/OR sample mean has lower error than importance sampling, sometimes by significant margins.

Acknowledgements

This work was supported in part by the NSF (IIS awards) and by an NIH R01-HG grant.

References

[Bidyuk and Dechter, 2003] Bidyuk, B. and Dechter, R. (2003). An empirical study of w-cutset sampling for Bayesian networks. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03).

[Cheng and Druzdzel, 2000] Cheng, J. and Druzdzel, M. J. (2000). AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13.

[Dawid et al., 1994] Dawid, A. P., Kjaerulff, U., and Lauritzen, S. L. (1994). Hybrid propagation in junction trees. In IPMU 1994.

[Dechter and Mateescu, 2007] Dechter, R. and Mateescu, R. (2007). AND/OR search spaces for graphical models. Artificial Intelligence, 171(2-3):73-106.

[Fishelson and Geiger, 2003] Fishelson, M. and Geiger, D. (2003). Optimizing exact genetic linkage computations. In RECOMB 2003.

[Geweke, 1989] Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6).

[Gogate and Dechter, 2005] Gogate, V. and Dechter, R. (2005). Approximate inference algorithms for hybrid Bayesian networks with discrete constraints. In UAI 2005.

[Gogate and Dechter, 2007] Gogate, V. and Dechter, R. (2007). SampleSearch: A scheme that searches for consistent samples. In AISTATS 2007.

[Hernández and Moral, 1995] Hernández, L. D. and Moral, S. (1995). Mixing exact and importance sampling propagation algorithms in dependence graphs. International Journal of Approximate Reasoning.

[Kjærulff, 1995] Kjærulff, U. (1995). HUGS: Combining exact inference and Gibbs sampling in junction trees. In UAI 1995.

[Paskin, 2004] Paskin, M. A. (2004). Sample propagation.
In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

[Rubinstein, 1981] Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. John Wiley & Sons, Inc., New York, NY, USA.

[Yuan and Druzdzel, 2006] Yuan, C. and Druzdzel, M. J. (2006). Importance sampling algorithms for Bayesian networks: Principles and performance. Mathematical and Computer Modelling.


More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

Hidden Markov Models (recap BNs)

Hidden Markov Models (recap BNs) Probabilistic reasoning over time - Hidden Markov Models (recap BNs) Applied artificial intelligence (EDA132) Lecture 10 2016-02-17 Elin A. Topp Material based on course book, chapter 15 1 A robot s view

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Likelihood Weighting and Importance Sampling

Likelihood Weighting and Importance Sampling Likelihood Weighting and Importance Sampling Sargur Srihari srihari@cedar.buffalo.edu 1 Topics Likelihood Weighting Intuition Importance Sampling Unnormalized Importance Sampling Normalized Importance

More information

Bayesian Networks: Independencies and Inference

Bayesian Networks: Independencies and Inference Bayesian Networks: Independencies and Inference Scott Davies and Andrew Moore Note to other teachers and users of these slides. Andrew and Scott would be delighted if you found this source material useful

More information

EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS

EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005 Instructor: Professor Jeff A. Bilmes Uncertainty & Bayesian Networks

More information

Variable Elimination: Algorithm

Variable Elimination: Algorithm Variable Elimination: Algorithm Sargur srihari@cedar.buffalo.edu 1 Topics 1. Types of Inference Algorithms 2. Variable Elimination: the Basic ideas 3. Variable Elimination Sum-Product VE Algorithm Sum-Product

More information

Decision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks

Decision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks Decision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks Alessandro Antonucci Marco Zaffalon IDSIA, Istituto Dalle Molle di Studi sull

More information

On finding minimal w-cutset problem

On finding minimal w-cutset problem On finding minimal w-cutset problem Bozhena Bidyuk and Rina Dechter Information and Computer Science University Of California Irvine Irvine, CA 92697-3425 Abstract The complexity of a reasoning task over

More information

9 Forward-backward algorithm, sum-product on factor graphs

9 Forward-backward algorithm, sum-product on factor graphs Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 9 Forward-backward algorithm, sum-product on factor graphs The previous

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

A Differential Approach to Inference in Bayesian Networks

A Differential Approach to Inference in Bayesian Networks A Differential Approach to Inference in Bayesian Networks Adnan Darwiche Computer Science Department University of California Los Angeles, Ca 90095 darwiche@cs.ucla.edu We present a new approach to inference

More information

Graph structure in polynomial systems: chordal networks

Graph structure in polynomial systems: chordal networks Graph structure in polynomial systems: chordal networks Pablo A. Parrilo Laboratory for Information and Decision Systems Electrical Engineering and Computer Science Massachusetts Institute of Technology

More information

Mixed deterministic and probabilistic networks

Mixed deterministic and probabilistic networks Noname manuscript No. (will be inserted by the editor) Mixed deterministic and probabilistic networks Robert Mateescu Rina echter Received: date / ccepted: date bstract The paper introduces mixed networks,

More information