On the Relationship between Sum-Product Networks and Bayesian Networks
Han Zhao, Mazen Melibari, Pascal Poupart (University of Waterloo, Waterloo, ON, Canada)
International Conference on Machine Learning, 2015
Presented by: Kyle Ulrich, 06 November 2015
Introduction
Graphical models represent distributions compactly as normalized products of factors:
P(X = x) = (1/Z) ∏_k φ_k(x_{k})
where
x ∈ X is d-dimensional
φ_k is a potential function over a subset {k} of the variables
Z is the partition function
In most useful models, the partition function is an intractable integral/sum.
Introduction
Naively, the partition function is a sum of products over exponentially many states:
Z = ∑_{x ∈ X} ∏_k φ_k(x_{k})
In many useful models, Z can be represented compactly using a deep architecture.
Sum-product networks (Poon and Domingos, 2011) use a deep architecture with tractable inference:
Sum nodes correspond to mixtures over subsets of variables
Product nodes correspond to features or mixture components
Network Polynomial
Definition (Network Polynomial)
Let f(·) ≥ 0 be an unnormalized probability distribution over a Boolean random vector X_{1:N}. The network polynomial of f(·) is the multilinear function
∑_x f(x) ∏_{n=1}^{N} I_{x_n}
Example
The network polynomial for the Bayesian network X_1 → X_2 is
Pr(x_1)Pr(x_2|x_1) I_{x_1} I_{x_2} + Pr(x_1)Pr(x̄_2|x_1) I_{x_1} I_{x̄_2} + Pr(x̄_1)Pr(x_2|x̄_1) I_{x̄_1} I_{x_2} + Pr(x̄_1)Pr(x̄_2|x̄_1) I_{x̄_1} I_{x̄_2}
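The example above can be evaluated directly. This is a minimal sketch (the CPT values are hypothetical, not from the paper) that enumerates the four terms of the network polynomial for X_1 → X_2:

```python
# Hypothetical parameters for the BN X1 -> X2 (illustrative only).
p_x1 = 0.6                      # Pr(x1)
p_x2_given = {1: 0.7, 0: 0.2}   # Pr(x2 | x1)

def network_polynomial(i_x1, i_nx1, i_x2, i_nx2):
    """Sum over all four states of Pr(state) times its indicator product."""
    total = 0.0
    for x1 in (1, 0):
        for x2 in (1, 0):
            pr = (p_x1 if x1 else 1 - p_x1) * \
                 (p_x2_given[x1] if x2 else 1 - p_x2_given[x1])
            ind = (i_x1 if x1 else i_nx1) * (i_x2 if x2 else i_nx2)
            total += pr * ind
    return total

# All indicators set to 1 sums over every state: a normalized BN gives 1.
print(network_polynomial(1, 1, 1, 1))
# Indicators matching a single state recover Pr(x1) * Pr(x2 | x1).
print(network_polynomial(1, 0, 1, 0))
```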
Sum-Product Network
Definition (Sum-Product Network (Poon & Domingos, 2011))
A Sum-Product Network (SPN) S over Boolean variables X_{1:N} is a rooted DAG whose leaves are the indicators I_{x_1}, ..., I_{x_N} and I_{x̄_1}, ..., I_{x̄_N} and whose internal nodes are sums and products.
Value of a product node v_i: the product of the values of its children
Value of a sum node v_i: ∑_{v_j ∈ Ch(v_i)} w_ij val(v_j)
The value of the root node is the network polynomial S(x) (Gens et al., 2012)
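The definition above can be sketched with three small node classes (an illustrative toy, not the paper's implementation; the structure and weights are made up):

```python
# Minimal SPN sketch: indicator leaves, sum and product internal nodes.

class Indicator:
    def __init__(self, var, value):
        self.var, self.value = var, value
    def eval(self, assignment):
        # assignment maps variable name -> 0/1; an unset variable
        # (marginalized out) makes both of its indicators evaluate to 1.
        v = assignment.get(self.var)
        return 1.0 if v is None or v == self.value else 0.0

class Product:
    def __init__(self, children):
        self.children = children
    def eval(self, assignment):
        out = 1.0
        for c in self.children:
            out *= c.eval(assignment)
        return out

class Sum:
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (w, child)
    def eval(self, assignment):
        return sum(w * c.eval(assignment) for w, c in self.weighted_children)

# A toy complete and decomposable SPN over X1, X2:
# a mixture of two product distributions.
x1, nx1 = Indicator("X1", 1), Indicator("X1", 0)
x2, nx2 = Indicator("X2", 1), Indicator("X2", 0)
root = Sum([(0.3, Product([x1, x2])), (0.7, Product([nx1, nx2]))])

print(root.eval({"X1": 1, "X2": 1}))  # 0.3
print(root.eval({}))                  # all indicators 1: 1.0
```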
Example SPN
A uniform distribution over the states of five variables that contain an even number of 1's can be represented either by a shallow SPN of exponential size or by a compact deep SPN (Poon and Domingos, 2011).
Validity
An SPN is valid if it defines an (unnormalized) probability distribution (generative model). Sufficient conditions: complete and consistent.
Definition (Complete)
An SPN is complete iff each sum node has children with the same scope.
Definition (Consistent)
An SPN is consistent iff no variable appears negated in one child of a product node and non-negated in another.
Definition (Decomposable)
An SPN is decomposable iff for every product node v, scope(v_i) ∩ scope(v_j) = ∅ for all v_i, v_j ∈ Ch(v), i ≠ j.
Computations in SPNs
1. Partition function: set all indicators to 1 and evaluate the network polynomial,
Z_S = ∑_{x ∈ X} S(x) = S(1, ..., 1)
2. State probability: normalize the network polynomial at state x (for each X_i, either I_{x_i} = 1 and I_{x̄_i} = 0, or I_{x_i} = 0 and I_{x̄_i} = 1),
P(x) = S(x) / Z_S
3. Marginal probability: for every unobserved X_i set both I_{x_i} = 1 and I_{x̄_i} = 1 to define the evidence e,
P(e) = S(e) / Z_S
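The three queries above can be demonstrated on a tiny unnormalized SPN written directly as its network polynomial (the weights 2 and 6 are hypothetical):

```python
# Tiny unnormalized SPN over X1, X2: a root sum with weights 2 and 6
# over two decomposable product nodes; indicators are passed in directly.

def S(ix1, inx1, ix2, inx2):
    return 2.0 * (ix1 * ix2) + 6.0 * (inx1 * inx2)

# 1. Partition function: all indicators set to 1.
Z = S(1, 1, 1, 1)                # 8.0

# 2. State probability of x = (X1=1, X2=1).
p_state = S(1, 0, 1, 0) / Z      # 2/8 = 0.25

# 3. Marginal P(X1=1): X2 is unobserved, so both its indicators are 1.
p_marg = S(1, 0, 1, 1) / Z       # 2/8 = 0.25

print(Z, p_state, p_marg)
```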
Extension to Continuous Variables
Instead of sum nodes over indicator leaves, we can consider continuous variables (multinomial variables with an infinite number of values).
The weighted sum becomes the integral ∫ p(x) dx, where p(x) is the p.d.f. of X.
The value of an integral node n is either p_n(x) (observed) or 1 (marginalized out).
Computation of evidence proceeds as usual.
Learning in SPNs
1. First, evaluate all S_i(x) in an upward pass.
2. On a downward pass:
Compute the likelihood gradient through backpropagation:
∂S(x)/∂S_i(x) = ∑_{k ∈ Pa_i} w_ki ∂S(x)/∂S_k(x)   (sum-node parents)
∂S(x)/∂S_i(x) = ∑_{k ∈ Pa_i} ∂S(x)/∂S_k(x) ∏_{l ∈ Ch_k, l ≠ i} S_l(x)   (product-node parents)
Compute the gradient on the weights:
∂S(x)/∂w_ij = (∂S(x)/∂S_i(x)) S_j(x)
3. Compute marginals:
For a latent variable Y_k representing sum node n_k with child n_i: P(Y_k = i | e) ∝ w_ki ∂S(e)/∂S_k(e)
For an indicator I_{x_i}: P(X_i = 1 | e) ∝ ∂S(e)/∂S_i(e)
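The upward/downward passes can be worked out by hand on the two-node mixture from earlier (illustrative weights and evidence, chosen for this sketch):

```python
# Backprop sketch on the tiny SPN S = w1*(I_x1 * I_x2) + w2*(I_nx1 * I_nx2).
# For a child of a sum node, dS/dS_i is the edge weight times the parent's
# gradient; dS/dw_ij is dS/dS_i times the child's value S_j.

w1, w2 = 0.3, 0.7
ix1, inx1, ix2, inx2 = 1.0, 0.0, 1.0, 0.0   # evidence X1=1, X2=1

# Upward pass: node values.
p1 = ix1 * ix2        # product node 1
p2 = inx1 * inx2      # product node 2
S = w1 * p1 + w2 * p2

# Downward pass: gradients.
dS_dS = 1.0                   # root
dS_dp1 = w1 * dS_dS           # through the sum node
dS_dp2 = w2 * dS_dS
dS_dw1 = dS_dS * p1           # gradients on the weights
dS_dw2 = dS_dS * p2
# Through a product node: multiply by the values of the other children.
dS_dix1 = dS_dp1 * ix2

print(S, dS_dw1, dS_dw2, dS_dix1)   # 0.3 1.0 0.0 0.3
```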
Gradient Diffusion
Unfortunately, deep SPNs suffer from gradient diffusion, i.e., the gradient becomes nearly uniform with depth.
The most probable explanation (MPE) may be used instead, to define hard EM:
1. In the upward pass, replace each weighted sum with the maximum weighted value.
2. In the downward pass, follow only the highest-valued child of each sum node (E-step).
3. Increment a count for each chosen child node.
4. Renormalize the counts to obtain the weights (M-step).
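One hard-EM update for a single sum node can be sketched as follows (an assumed setup, not the paper's code; child values and smoothed counts are made up):

```python
# Hard-EM sketch for one sum node: MPE replaces the weighted sum with a
# max, the winning child's count is incremented, and counts are
# renormalized to give the new weights.

counts = {"c1": 1.0, "c2": 1.0}   # smoothed initial counts
weights = {k: v / sum(counts.values()) for k, v in counts.items()}

def mpe_upward(child_vals):
    """Max instead of sum at the sum node; returns (value, winning child)."""
    winner = max(child_vals, key=lambda k: weights[k] * child_vals[k])
    return weights[winner] * child_vals[winner], winner

# One update for a data case where child c1 explains the data best:
_, winner = mpe_upward({"c1": 0.9, "c2": 0.4})
counts[winner] += 1.0             # increment the chosen child's count
total = sum(counts.values())
weights = {k: v / total for k, v in counts.items()}
print(winner, weights)
```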
Experiment: Face Completion
[Figure: restoration of half-occluded faces, comparing the original image with completions by SPN, DBM, DBN, PCA, and nearest neighbor]
Contributions of Paper
This paper establishes three results:
1. Any valid SPN can be represented as a normal SPN.
2. Any normal SPN can be converted to a Bayesian network whose CPDs are represented by algebraic decision diagrams.
3. The generated BN can recover the original SPN's probability distribution.
Normal SPN
Definition (Normal SPN)
An SPN is said to be normal if
1. It is complete and decomposable.
2. For each sum node, the weights of the edges emanating from it are nonnegative and sum to 1.
3. Every terminal node is a univariate distribution over a Boolean variable, and the size of the scope of each sum node is at least 2.
Theorem (Convert SPN to Normal SPN)
For any complete and consistent SPN S, there exists a normal SPN S' such that Pr_S(·) = Pr_{S'}(·) and |S'| = O(|S|²).
Normal SPN: Consistent to Decomposable
The authors give an algorithm (with proof) that converts any valid SPN into a decomposable SPN.
Definition (Decomposable)
An SPN is decomposable iff for every product node v, scope(v_i) ∩ scope(v_j) = ∅ for all v_i, v_j ∈ Ch(v), i ≠ j.
Normal SPN: Normalize Weights
Given a complete and decomposable SPN, the weights associated with each sum node may then be normalized without changing the distribution.
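The normalization step itself is simple: divide each sum node's weights by their total, and absorb the constant into the edge above so the overall polynomial is unchanged. A minimal sketch with hypothetical weights:

```python
# Normalize one sum node's weights; the returned constant z is what
# would be pushed up to the node's incoming edge to keep the network
# polynomial unchanged.

def normalize(weights):
    z = sum(weights)
    return [w / z for w in weights], z

ws, z = normalize([2.0, 6.0])
print(ws, z)    # [0.25, 0.75] 8.0
```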
SPN to BN
Theorem (SPN to BN)
There exists an algorithm that converts any complete and decomposable SPN S over Boolean variables X_{1:N} into a BN B with CPDs represented by ADDs in time O(N|S|). Furthermore, S and B represent the same distribution, and |B| = O(N|S|).
SPN to BN: Structure of BN
1. Create an observable variable X in B for each terminal node.
2. Create a hidden variable H_v in place of each sum node v.
3. Add directed edges from each hidden variable to the observable variables in the scope of its subtree.
SPN to BN: Algebraic Decision Diagrams
Algebraic decision diagrams (ADDs) are used to represent the full conditional probability distributions.
Definition (Algebraic Decision Diagram)
An ADD is a DAG representing a function f : X_1 × ··· × X_N → R, where X_n is the domain of variable X_n and |X_n| is the number of values X_n takes.
BN to SPN
Theorem (BN to SPN)
Given the BN B with ADD representation of CPDs generated from a complete and decomposable SPN S over Boolean variables X_{1:N}, the original SPN S can be recovered by applying the variable elimination algorithm to B in O(N|S|).
The authors prove that the generated BN recovers an SPN with a distribution identical to that of the original SPN.
BN to SPN: Variable Elimination
Use variable elimination (VE) to
1. Multiply two factors
2. Sum out hidden variables
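The two VE operations can be sketched on a tiny BN H → X (an assumed table-based factor representation, not the paper's ADD-based one; the CPT values are hypothetical):

```python
from itertools import product as cartesian

# Factors are dicts mapping assignment tuples (over a sorted variable
# ordering) to values.

def multiply(f1, vars1, f2, vars2):
    """Pointwise product of two factors over the union of their scopes."""
    scope = sorted(set(vars1) | set(vars2))
    out = {}
    for assign in cartesian((0, 1), repeat=len(scope)):
        a = dict(zip(scope, assign))
        v1 = f1[tuple(a[v] for v in vars1)]
        v2 = f2[tuple(a[v] for v in vars2)]
        out[assign] = v1 * v2
    return out, scope

def sum_out(f, scope, var):
    """Marginalize a hidden variable out of a factor."""
    i = scope.index(var)
    rest = scope[:i] + scope[i + 1:]
    out = {}
    for assign, val in f.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return out, rest

# P(H) and P(X | H) for the tiny BN H -> X:
pH = {(0,): 0.4, (1,): 0.6}
pXgH = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
joint, scope = multiply(pH, ["H"], pXgH, ["H", "X"])
pX, scope = sum_out(joint, scope, "H")
print(pX)   # marginal over X after summing out the hidden variable
```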