Integrating Correlated Bayesian Networks Using Maximum Entropy


Applied Mathematical Sciences, Vol. 5, 2011, no. 48, 2361-2371

Integrating Correlated Bayesian Networks Using Maximum Entropy

Kenneth D. Jarman
Pacific Northwest National Laboratory
PO Box 999, MSIN K7-90, Richland, WA 99352, USA

Paul D. Whitney
Pacific Northwest National Laboratory
PO Box 999, MSIN K7-90, Richland, WA 99352, USA

Abstract

We consider the problem of generating a joint distribution for a pair of Bayesian networks that preserves the multivariate marginal distribution of each network and satisfies prescribed correlation between pairs of nodes taken from both networks. We derive the maximum entropy distribution for any pair of multivariate random vectors and prescribed correlations and demonstrate numerical results for an example integration of Bayesian networks.

Mathematics Subject Classification: 60E05; 62E10; 62B10; 62H10; 94A17

Keywords: Bayesian networks, maximum entropy, model integration

1 Introduction

Bayesian networks provide a compact representation of relationships between variables in terms of conditional independence of events and associated conditional probabilities [3]. They were originally used to model causal relationships [9], and their use has extended to a wide range of applications. Graphically, a Bayesian network consists of a set of nodes, representing individual random variables, and directed arcs from parent nodes to child nodes indicating conditional dependence. Cycles are not allowed, so that the resulting structure

is an acyclic directed graph. The complete joint distribution of the set of nodes is defined by a conditional probability distribution for each node relative to its parent nodes. Bayes' theorem can be used to make inferences about the underlying (unobserved) state of the modeled system from observations. This structure provides a means to decompose and rapidly compute conditional and marginal probability distributions, generating likelihoods of specific events or conditions.

Often only partial information is available on the dependence structure and conditional probability distributions of a Bayesian network. In this case some method for completing the model is needed. Schramm and Fronhöfer [11], for example, consider the problem of completing a Bayesian network given an incomplete set of conditional probabilities, such that the dependence structure is preserved. Additionally, distinct, complete Bayesian networks may be developed to model different aspects of a single process or system, and it may be desirable to link the models together to create an integrated model. If the models are consistent (that is, they do not have conflicting assumptions in their structure or parameters), then they may be combined by determining dependence between pairs of nodes from each network (a sort of cross-network dependence) and estimating the conditional probabilities associated with that dependence. In general, model building in this manner can be time-consuming due to the challenge of obtaining the needed conditional probability parameters for the resulting combined model. Additionally, if the component models were developed by separate teams, there may not be a straightforward way to arrive at a combined model that is both a Bayesian network and consistent with the component models.

We consider integration of Bayesian networks into a single, complete distributional model when only linear correlation values for pairs of cross-network nodes (or, more generally, moment conditions) are provided, without requiring that the result itself be representable as a Bayesian network. We impose the constraints that the resulting model match the (joint) marginal distributions given by the original networks as well as the prescribed correlations. Figure 1 illustrates this concept of integration for a simple example.

Figure 1: Bayes Net diagrams for example integration.
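For concreteness, the joint distributions encoded by the two example networks factor according to their graphs; assuming the parent-child structure implied by the conditional probability tables in Table 1 below ($X_2$ and $X_3$ are children of $X_1$, $Y_2$ is a child of $Y_1$, and $Y_3$ is a child of $Y_2$), the factorizations are

$$ P(X_1, X_2, X_3) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1), \qquad P(Y_1, Y_2, Y_3) = P(Y_1)\, P(Y_2 \mid Y_1)\, P(Y_3 \mid Y_2). $$

These two joint distributions play the role of the multivariate marginals $g(x)$ and $h(y)$ that the integrated model is required to preserve.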

The general problem of constructing joint distributions with known marginals and prescribed moments has been studied for some time (see, for example, [1], [5], and the reviews in [7] and [10]). Maximum entropy and related methods provide natural approaches to such problems and have broad interest and applicability. Maximum entropy methods aim to derive the joint distribution that satisfies supposed a priori knowledge about the distribution while expressing the greatest possible ignorance or uncertainty about what is not known. Closely related, minimum cross-entropy methods aim to produce the distribution that is closest to a reference distribution while satisfying the imposed knowledge about the distribution. Most applications of these methods impose either marginal (almost always univariate) distributions or moments as constraints, but rarely combine both constraints as desired here. Csiszár [1] produced a set of foundational theorems for very general application of minimum cross-entropy for this purpose, with maximum entropy as a special case, for either marginal or moment constraints. He argues that the approach extends to multivariate marginal constraints. The author also generalized an iterative method for calculating the minimum cross-entropy solution. Miller and Liu [7] summarized existing theory and, following Csiszár, extended the applicability of minimum cross-entropy to include multivariate moment constraints, using the product of univariate marginals as the reference distribution. Pasha and Mansoury [8] considered the maximum entropy approach using simultaneous constraints on univariate marginals and moments for multivariate distributions. They propose directly solving the set of nonlinear equations obtained by imposing the constraints on the form of the maximum entropy distribution, in order to calculate the distribution parameters.

To solve the problem of combining distinct, complete Bayesian networks given some knowledge of the dependence between the networks, we provide the extension of the maximum entropy method to constraints in the form of multivariate marginal distributions and moments. The resulting distribution for Bayesian networks with discrete conditional distributions is implicitly defined by a set of nonlinear equations. We demonstrate a method of numerical solution of those equations for a simple example to produce the joint distribution, and conclude with a discussion of the applicability and computational challenges of implementing this approach to model integration.

2 Maximum entropy solutions

Consider random vectors $X = (X_1, X_2, \ldots, X_m)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$ with probability density (or mass) functions $g(x)$ and $h(y)$, respectively. We will ultimately apply the maximum entropy solution to the problem of integrating Bayesian networks, where the elements of $X$ and $Y$ correspond to nodes in each network. First, a more general case is studied.

We wish to find a joint probability density (mass) function $f(z)$ for $Z = (X_1, X_2, \ldots, X_m, Y_1, Y_2, \ldots, Y_n)$ defined on $\Omega \subseteq \mathbb{R}^{m+n}$, subject to the marginal constraints

$$ f_X(x) = g(x), \qquad f_Y(y) = h(y), \qquad (1) $$

and a vector of constraints on moments

$$ E_f\, \gamma(z) = \bar{\gamma}, \qquad (2) $$

where each element of the vector $\gamma$ is real-valued and has finite expectation. Let $\mathcal{F}$ denote the class of all $(m+n)$-variate probability density (mass) functions $f(z)$ taking on positive values on $\Omega$ and satisfying the above constraints:

$$ \mathcal{F}(g, h, \bar{\gamma}) = \Big\{ f(z) \;:\; \int_{\Omega_x} f(z)\, dy = g(x), \ \int_{\Omega_y} f(z)\, dx = h(y), \ E_f\, \gamma(z) = \bar{\gamma} \Big\}, $$

where $\Omega_x = \{ y \in \mathbb{R}^n : z \in \Omega \}$ and $\Omega_y = \{ x \in \mathbb{R}^m : z \in \Omega \}$, and $dx$ and $dy$ are shorthand for $dx_1\, dx_2 \cdots dx_m$ and $dy_1\, dy_2 \cdots dy_n$, respectively. The entropy of the distribution $f(z)$ is defined by

$$ H(f) = -E_f[\log f(Z)] = -\int_{\Omega} f(z) \log f(z)\, dz. $$

A maximum entropy solution is a member of $\mathcal{F}$ that maximizes $H(f)$. The form of the maximum entropy distribution is easily derived, or assumed based on similar maximum entropy problems. A proof can follow directly from arguments in Csiszár [1] or Ebrahimi et al. [2]; a proof is given here for completeness. Note that the maximum entropy solution may not exist; conditions for existence are discussed by Csiszár and others.

Theorem 2.1 If $\mathcal{F} \neq \emptyset$, then the maximum entropy solution is uniquely given by

$$ f^*(z) = f^*_X(x)\, f^*_Y(y)\, \exp[\nu^{\top} \gamma(z)] \qquad (3) $$

for some functions $f^*_X(x)$ and $f^*_Y(y)$ and parameter vector $\nu$.

Proof: The solution is found by the calculus of variations. The Lagrangian is

$$ L(f) = -\int_{\Omega} f \log f\, dz + c \left( \int_{\Omega} f\, dz - 1 \right) + \int_{\Omega_y} \lambda(x) \left[ \int_{\Omega_x} f(x, y)\, dy - g(x) \right] dx + \int_{\Omega_x} \mu(y) \left[ \int_{\Omega_y} f(x, y)\, dx - h(y) \right] dy + \nu^{\top} \int_{\Omega} [\gamma(z) - \bar{\gamma}]\, f\, dz, $$

where $\lambda(x)$ and $\mu(y)$ are Lagrange multipliers corresponding to the marginal constraints, and $\nu$ is a vector of Lagrange multipliers corresponding to the moment constraints. The extremal function $f^*$ of $L$ is a unique maximum due to the concavity of $L$ in $f$, a consequence of the concavity of entropy in $f$. Setting the variational derivative with respect to $f$ equal to zero at $f^*$ gives

$$ -\log f^* + c - 1 + \lambda(x) + \mu(y) + \nu^{\top} \gamma(z) = 0 $$

(equivalently, the Euler-Lagrange equations apply). Solving for $f^*$ gives

$$ f^*(z) = \exp[\lambda_0 + \lambda(x) + \mu(y) + \nu^{\top} \gamma(z)] = f^*_X(x)\, f^*_Y(y)\, \exp[\nu^{\top} \gamma(z)], $$

for some functions $f^*_X(x)$ and $f^*_Y(y)$ and parameter vector $\nu$, where $\lambda_0 = c - 1$.

For any function $f$ in $\mathcal{F}$ and $f^*$ as in Equation (3), the relative entropy (or Kullback-Leibler divergence) of $f$ with respect to $f^*$ is given by $D(f \,\|\, f^*) = E_f \log(f / f^*)$. The relative entropy is non-negative, and is zero if and only if $f = f^*$, by Gibbs' inequality. The entropy of $f$ may be written as

$$ H(f) = -D(f \,\|\, f^*) - E_f \log f^*(Z), $$

so that

$$ H(f) \le -E_f \log f^*(Z) = -E_f \log\big( \exp[\lambda_0 + \lambda(X) + \mu(Y) + \nu^{\top} \gamma(Z)] \big) = -\lambda_0 - E_f\, \lambda(X) - E_f\, \mu(Y) - \nu^{\top} \bar{\gamma} = -\lambda_0 - E_{f^*}\, \lambda(X) - E_{f^*}\, \mu(Y) - \nu^{\top} \bar{\gamma} = H(f^*), $$

where the expectations over $f$ may be replaced by expectations over $f^*$ because $f$ and $f^*$ satisfy the same marginal and moment constraints. Equality holds if and only if $f = f^*$ almost everywhere on $\Omega$.

The maximum entropy solution can be explicitly obtained in a direct manner by solving the nonlinear system of equations obtained by substituting Equation (3) into the constraint equations (1) and (2). While we have focused on the combination of two multivariate marginal densities along with moment constraints, the theory easily extends to any number of multivariate marginal densities. In particular, letting $X_i \in \mathbb{R}^{d_i}$ be random vectors with probability density (or mass) functions $g_i(x_i)$ for $i = 1, 2, \ldots, n$, where $d_i \in \mathbb{Z}^+$, and letting $Z = (X_1, X_2, \ldots, X_n)$, the maximum entropy solution $f^*(z)$ that satisfies marginal constraints equivalent to Equation (1) and moment constraints equivalent to Equation (2) is given by

$$ f^*(z) = \prod_{i=1}^{n} f^*_{X_i}(x_i)\, \exp[\nu^{\top} \gamma(z)] $$

for some functions $f^*_{X_i}(x_i)$, $i = 1, 2, \ldots, n$, and parameter vector $\nu$.
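To connect Theorem 2.1 to the application in the next section, consider the special case (stated here only as an illustration) in which the moment functions are the cross products $\gamma_{ij}(z) = x_i y_j$ for index pairs $(i, j) \in I$. Writing the corresponding components of $\nu$ as $\lambda_{ij}$, Equation (3) takes the form

$$ f^*(z) = f^*_X(x)\, f^*_Y(y)\, \exp\Big[ \sum_{(i,j) \in I} \lambda_{ij}\, x_i y_j \Big], $$

which is exactly the expression substituted into the constraints to obtain the system (4) below.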

3 Application to Bayesian networks

We use the maximum entropy solution to demonstrate the calculation of a complete distribution for a pair of simple Bayesian networks with discrete state space, given their multivariate marginal distributions and a subset of the $(x, y)$ correlations:

$$ E[X_i Y_j] = \rho_{ij}\, \sigma_i \tau_j + \mu_i \nu_j \qquad \text{for } (i, j) \in I, $$

for prescribed values of $\rho_{ij}$, where $\sigma_i^2 = \operatorname{Var} X_i$, $\tau_j^2 = \operatorname{Var} Y_j$, $\mu_i = E X_i$, $\nu_j = E Y_j$, and $I$ is the set of index pairs $(i, j)$ for which a correlation is imposed. The nonlinear system obtained by applying the constraint equations (1) and (2) to the formal maximum entropy distribution in Equation (3) is

$$ f^*_X(x) \int_{\Omega_x} f^*_Y(y)\, \exp\Big[ \sum_{(i,j) \in I} \lambda_{ij}\, x_i y_j \Big]\, dy = g(x), $$
$$ f^*_Y(y) \int_{\Omega_y} f^*_X(x)\, \exp\Big[ \sum_{(i,j) \in I} \lambda_{ij}\, x_i y_j \Big]\, dx = h(y), \qquad (4) $$
$$ \int_{\Omega} x_k y_l\, f^*_X(x)\, f^*_Y(y)\, \exp\Big[ \sum_{(i,j) \in I} \lambda_{ij}\, x_i y_j \Big]\, dx\, dy = \rho_{kl}\, \sigma_k \tau_l + \mu_k \nu_l, $$

for $(k, l) \in I$.

Consider the example shown earlier in Figure 1, with the conditional probabilities given in Table 1. Let the probability of success of $X_1$ be 0.6, and likewise for $Y_1$. We impose the single correlation constraint $\rho(X_3, Y_2) = \rho_{32} = 0.036$. The system of equations was implemented in Matlab [6], and a complete numerical solution is given in Table 2 (an illustrative numerical sketch of one possible solution scheme is given after Table 3 below). The required positive correlation between $X_3$ and $Y_2$ is reflected in the table by comparing the results to the joint probabilities in Table 3, obtained when $X$ and $Y$ are assumed to be independent.
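As a small worked step (added here for illustration; the values follow from Table 1 and the root probabilities $P(X_1 = 1) = P(Y_1 = 1) = 0.6$), the marginal quantities entering the correlation constraint are

$$ \mu_3 = E X_3 = (0.4)(0.6) + (0.6)(0.7) = 0.66, \qquad \nu_2 = E Y_2 = 0.66, $$
$$ \sigma_3^2 = \tau_2^2 = 0.66\,(1 - 0.66) = 0.2244, $$

so the single moment constraint takes the form $E[X_3 Y_2] = \rho_{32}\, \sigma_3 \tau_2 + \mu_3 \nu_2 = 0.2244\, \rho_{32} + 0.4356$.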

Table 1: Conditional probabilities for example integration

X1  X2  P(X2|X1)    X1  X3  P(X3|X1)    Y1  Y2  P(Y2|Y1)    Y2  Y3  P(Y3|Y2)
 0   0    0.4        0   0    0.4        0   0    0.4        0   0    0.4
 1   0    0.3        1   0    0.3        1   0    0.3        1   0    0.3
 0   1    0.6        0   1    0.6        0   1    0.6        0   1    0.6
 1   1    0.7        1   1    0.7        1   1    0.7        1   1    0.7

Table 2: Joint probability table enforcing correlation between X3 and Y2

            Y1:    0       1       0       1       0       1       0       1
            Y2:    0       0       1       1       0       0       1       1
            Y3:    0       0       0       0       1       1       1       1
X1 X2 X3
 0  0  0        0.0054  0.0061  0.0039  0.0068  0.0080  0.0090  0.0090  0.0158
 1  0  0        0.0045  0.0051  0.0033  0.0057  0.0068  0.0076  0.0076  0.0133
 0  1  0        0.0081  0.0091  0.0058  0.0102  0.0121  0.0136  0.0135  0.0236
 1  1  0        0.0106  0.0119  0.0077  0.0134  0.0158  0.0178  0.0178  0.0310
 0  0  1        0.0051  0.0058  0.0075  0.0131  0.0077  0.0087  0.0175  0.0306
 1  0  1        0.0068  0.0076  0.0098  0.0172  0.0102  0.0114  0.0229  0.0401
 0  1  1        0.0077  0.0087  0.0112  0.0196  0.0116  0.0131  0.0262  0.0459
 1  1  1        0.0158  0.0177  0.0229  0.0401  0.0237  0.0267  0.0535  0.0937

Table 3: Joint probability table assuming independent X, Y

            Y1:    0       1       0       1       0       1       0       1
            Y2:    0       0       1       1       0       0       1       1
            Y3:    0       0       0       0       1       1       1       1
X1 X2 X3
 0  0  0        0.0041  0.0046  0.0046  0.0081  0.0061  0.0069  0.0108  0.0188
 1  0  0        0.0035  0.0039  0.0039  0.0068  0.0052  0.0058  0.0091  0.0159
 0  1  0        0.0061  0.0069  0.0069  0.0121  0.0092  0.0104  0.0161  0.0282
 1  1  0        0.0081  0.0091  0.0091  0.0159  0.0121  0.0136  0.0212  0.0370
 0  0  1        0.0061  0.0069  0.0069  0.0121  0.0092  0.0104  0.0161  0.0282
 1  0  1        0.0081  0.0091  0.0091  0.0159  0.0121  0.0136  0.0212  0.0370
 0  1  1        0.0092  0.0104  0.0104  0.0181  0.0138  0.0156  0.0242  0.0423
 1  1  1        0.0188  0.0212  0.0212  0.0370  0.0282  0.0318  0.0494  0.0864
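The following is a minimal numerical sketch of the integration step above, written in Python with NumPy and SciPy (the paper reports a Matlab implementation [6]; this sketch is an illustrative reimplementation, not the authors' code). Rather than solving the full nonlinear system (4) directly, it combines iterative proportional fitting for the two marginal constraints, in the spirit of the iterative methods cited in [1] and [4], with a one-dimensional root find for the single multiplier lambda_32; the target moment E[X3 Y2] = rho_32 sigma_3 tau_2 + mu_3 nu_2, the root-finding bracket, and the iteration count are assumptions of this sketch.

# Illustrative sketch only; not the authors' Matlab code.
import itertools
import numpy as np
from scipy.optimize import brentq

def cond(child, parent):
    # P(child | parent) from Table 1: P(child = 1 | parent = 0) = 0.6, P(child = 1 | parent = 1) = 0.7
    p1 = 0.7 if parent == 1 else 0.6
    return p1 if child == 1 else 1.0 - p1

states = np.array(list(itertools.product([0, 1], repeat=3)))   # the 8 states of a binary triple

def g(x):
    # marginal of the X network: X2 and X3 are both children of X1 (per Table 1)
    p_root = 0.6 if x[0] == 1 else 0.4                          # P(X1 = 1) = 0.6
    return p_root * cond(x[1], x[0]) * cond(x[2], x[0])

def h(y):
    # marginal of the Y network: Y2 is a child of Y1, Y3 is a child of Y2 (per Table 1)
    p_root = 0.6 if y[0] == 1 else 0.4                          # P(Y1 = 1) = 0.6
    return p_root * cond(y[1], y[0]) * cond(y[2], y[1])

gx = np.array([g(x) for x in states])    # g(x) over the 8 x-states
hy = np.array([h(y) for y in states])    # h(y) over the 8 y-states
x3 = states[:, 2].astype(float)          # value of X3 in each x-state
y2 = states[:, 1].astype(float)          # value of Y2 in each y-state

def ipf_joint(lam, n_iter=500):
    """Fit f(x, y) proportional to exp(lam * x3 * y2) to the marginals g, h by
    iterative proportional fitting (row/column rescaling)."""
    f = np.exp(lam * np.outer(x3, y2))
    f /= f.sum()
    for _ in range(n_iter):
        f *= (gx / f.sum(axis=1))[:, None]   # enforce the X-marginal
        f *= (hy / f.sum(axis=0))[None, :]   # enforce the Y-marginal
    return f

def cross_moment(lam):
    return float(x3 @ ipf_joint(lam) @ y2)   # E_f[X3 * Y2]

# Target moment from the constraint E[X3 Y2] = rho * sigma3 * tau2 + mu3 * nu2, with rho = 0.036
mu3, nu2 = float(x3 @ gx), float(y2 @ hy)
target = 0.036 * np.sqrt(mu3 * (1 - mu3) * nu2 * (1 - nu2)) + mu3 * nu2

# Solve for the single multiplier lambda_32 (bracket chosen generously; an assumption of this sketch)
lam_star = brentq(lambda lam: cross_moment(lam) - target, -20.0, 20.0)
f_star = ipf_joint(lam_star)

print("lambda_32 =", lam_star)
print("achieved E[X3 Y2] =", cross_moment(lam_star), "target =", target)

Because iterative proportional fitting only rescales rows and columns, the fitted joint retains the product form f*_X(x) f*_Y(y) exp(lambda_32 x_3 y_2) of Theorem 2.1, so the outer root find adjusts only the single moment multiplier.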

4 Discussion

The maximum entropy solution presented here for deriving a joint probability distribution subject to both multivariate marginal distribution constraints and moment constraints may have general applicability beyond the present goal of integrating Bayesian network models. The primary challenge to its application is the computational complexity that arises from solving a nonlinear system with a number of parameters that increases rapidly with the dimension of the event space. Cost reduction should be studied through approximations that reduce the dimension of the parameter space, as well as optimization techniques that take maximum advantage of the structure of this particular problem, including iterative, indirect methods [1], [4].

We considered a special case in which the marginals and moments are required to be matched exactly by the combined distribution, thus assuming that these constraints are all consistent. However, Bayesian networks, and more generally any set of distributions to be integrated, can have conflicting model assumptions such that not all of the constraints can be met exactly. Miller and Liu [7] point to a solution when the marginals may be inconsistent with the correlations, formulating the problem as minimizing the cross-entropy with the product of marginals as the reference distribution. While they considered only univariate marginals, the concept is easily extended to incorporate multivariate marginals. Yet this still requires that the moments be met exactly, and a generalization that allows for approximate matching of the moments as well as the marginals is needed to balance the competing objectives.

ACKNOWLEDGEMENTS

This project was supported by Laboratory Directed Research and Development at Pacific Northwest National Laboratory (PNNL). PNNL is operated by Battelle Memorial Institute for the U.S. DOE under Contract DE-AC06-76RLO 1830. The work described in this paper was developed within the context of the Technosocial Predictive Analytics Initiative (http://predictiveanalytics.pnl.gov) at the Pacific Northwest National Laboratory.

References

[1] I. Csiszár, I-divergence geometry of probability distributions and minimization problems, Annals of Probability, 3(1) (1975), 146-158.

[2] N. Ebrahimi, E.S. Soofi, and R. Soyer, Multivariate maximum entropy identification, transformation, and dependence, Journal of Multivariate Analysis, 99(6) (2008), 1217-1231.

[3] F.V. Jensen and T.D. Nielsen, Bayesian Networks and Decision Graphs, Springer, 2nd edition, 2007.

[4] R. Jirousek and S. Preucil, On the effective implementation of the iterative proportional fitting procedure, Computational Statistics & Data Analysis, 19(2) (1995), 177-189.

[5] S. Kullback, Probability densities with given marginals, Annals of Mathematical Statistics, 39(4) (1968), 1236-1243.

[6] MATLAB, version 7.8.0, The MathWorks, Inc., Natick, MA (2009).

[7] D.J. Miller and W.H. Liu, On the recovery of joint distributions from limited information, Journal of Econometrics, 107(1-2) (2002), 259-274.

[8] E. Pasha and S. Mansoury, Determination of maximum entropy multivariate probability distributions under some constraints, Applied Mathematical Sciences, 2(57) (2008), 2843-2849.

[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, Inc., 2nd edition, 1988.

[10] J.M. Sarabia and E. Gómez-Déniz, Construction of multivariate distributions: A review of some recent results, SORT-Statistics and Operations Research Transactions, 32(1) (2008), 3-36.

[11] M. Schramm and B. Fronhöfer, Completing incomplete Bayesian networks, in G. Kern-Isberner, W. Rödder, and F. Kulmann, editors, Conditionals, Information, and Inference (2005), 200-218, International Workshop on Conditionals, Information, and Inference, Hagen, Germany, May 13-15, 2002.

Received: September, 2010