On the Relationship between Sum-Product Networks and Bayesian Networks

On the Relationship between Sum-Product Networks and Bayesian Networks
International Conference on Machine Learning, 2015
Han Zhao, Mazen Melibari, Pascal Poupart
University of Waterloo, Waterloo, ON, Canada
Presented by: Kyle Ulrich, 06 November 2015

Introduction
Graphical models represent distributions compactly as normalized products of factors:
$P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$
where $x \in \mathcal{X}$ is $d$-dimensional, $\phi_k$ is a potential function over a subset $\{k\}$ of the variables, and $Z$ is the partition function. For most useful models, the partition function is an intractable integral/sum.

Introduction
The partition function is a sum of products,
$Z = \sum_{x \in \mathcal{X}} \prod_k \phi_k(x_{\{k\}})$
and the goal is to represent it with only a polynomial number of sums and products. In many useful models, $Z$ can be represented compactly using a deep architecture. Sum-product networks (Poon and Domingos, 2011) use a deep architecture with tractable inference:
- Sum nodes correspond to mixtures over subsets of variables
- Product nodes correspond to features or mixture components
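To make the cost of the naive computation concrete, here is a small sketch (not from the paper; the factor tables are made up for illustration) that evaluates $Z = \sum_{x} \prod_k \phi_k(x_{\{k\}})$ by brute force over all Boolean assignments. The sum has $2^N$ terms, which is exactly what a compact SPN avoids.

```python
# Brute-force partition function Z = sum_x prod_k phi_k(x_{scope_k})
# for Boolean variables. Factor values are invented for this example.
from itertools import product

# Each factor is (scope, table): the table maps an assignment of the
# scoped variables (a tuple of 0/1) to a nonnegative potential value.
factors = [
    ((0,),   {(0,): 1.0, (1,): 2.0}),          # phi_1(x0)
    ((0, 1), {(0, 0): 1.0, (0, 1): 3.0,
              (1, 0): 2.0, (1, 1): 1.0}),      # phi_2(x0, x1)
]

def partition_function(factors, num_vars):
    """Sum the product of all factors over every Boolean assignment."""
    Z = 0.0
    for x in product([0, 1], repeat=num_vars):
        val = 1.0
        for scope, table in factors:
            val *= table[tuple(x[i] for i in scope)]
        Z += val
    return Z

print(partition_function(factors, num_vars=2))  # enumerates 2^N terms
```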

Network Polynomial
Definition (Network Polynomial): Let $f(\cdot) \ge 0$ be an unnormalized probability distribution over a Boolean random vector $X_{1:N}$. The network polynomial of $f(\cdot)$ is the multilinear function
$\sum_{x} f(x) \prod_{n=1}^{N} I_{x_n}$
Example: the network polynomial of the Bayesian network $X_1 \to X_2$ is
$\Pr(x_1)\Pr(x_2 \mid x_1) I_{x_1} I_{x_2} + \Pr(x_1)\Pr(\bar{x}_2 \mid x_1) I_{x_1} I_{\bar{x}_2} + \Pr(\bar{x}_1)\Pr(x_2 \mid \bar{x}_1) I_{\bar{x}_1} I_{x_2} + \Pr(\bar{x}_1)\Pr(\bar{x}_2 \mid \bar{x}_1) I_{\bar{x}_1} I_{\bar{x}_2}$
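As an illustration (with assumed parameter values, not from the paper), the following sketch evaluates the network polynomial of the $X_1 \to X_2$ example directly from the definition; setting a single pair of indicators recovers a joint probability, and setting all indicators to 1 sums the distribution to 1.

```python
# Toy network polynomial for the BN X1 -> X2 with invented parameters.
p_x1 = 0.6                      # Pr(X1 = 1)
p_x2_given = {1: 0.7, 0: 0.2}   # Pr(X2 = 1 | X1)

def network_polynomial(ix1, inx1, ix2, inx2):
    """Multilinear function of the indicators I_x1, I_~x1, I_x2, I_~x2."""
    total = 0.0
    for x1 in (0, 1):
        for x2 in (0, 1):
            prob = (p_x1 if x1 else 1 - p_x1) * \
                   (p_x2_given[x1] if x2 else 1 - p_x2_given[x1])
            ind = (ix1 if x1 else inx1) * (ix2 if x2 else inx2)
            total += prob * ind
    return total

print(network_polynomial(1, 0, 1, 0))   # Pr(X1=1, X2=1) = 0.42
print(network_polynomial(1, 1, 1, 1))   # all indicators set to 1 -> 1.0
```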

Sum-Product Network
Definition (Sum-Product Network (Poon & Domingos, 2011)): A sum-product network (SPN) $S$ over Boolean variables $X_{1:N}$ is a rooted DAG whose leaves are the indicators $I_{x_1}, \dots, I_{x_N}$ and $I_{\bar{x}_1}, \dots, I_{\bar{x}_N}$ and whose internal nodes are sums and products.
- Value of a product node $v_i$: the product of the values of its children
- Value of a sum node $v_i$: $\sum_{v_j \in Ch(v_i)} w_{ij} \, val(v_j)$
The value of the root node is the network polynomial $S(x)$ (Gens et al., 2012).
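A minimal sketch (illustrative only; the class names and node encoding are invented here) of bottom-up SPN evaluation following the definition: leaves return indicator values, product nodes multiply child values, and sum nodes compute the weighted sum $\sum_{v_j \in Ch(v_i)} w_{ij} \, val(v_j)$.

```python
# Bottom-up evaluation of a toy SPN over Boolean variables.
class Leaf:
    def __init__(self, var, negated=False):
        self.var, self.negated = var, negated
    def value(self, x):                 # x maps variable -> (I_x, I_~x)
        ix, inx = x[self.var]
        return inx if self.negated else ix

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, x):
        v = 1.0
        for c in self.children:
            v *= c.value(x)
        return v

class Sum:
    def __init__(self, weighted_children):   # list of (weight, child)
        self.weighted_children = weighted_children
    def value(self, x):
        return sum(w * c.value(x) for w, c in self.weighted_children)

# Tiny SPN over X1: 0.3 * I_x1 + 0.7 * I_~x1
root = Sum([(0.3, Leaf(1)), (0.7, Leaf(1, negated=True))])
print(root.value({1: (1, 0)}))   # Pr(X1 = 1) = 0.3
print(root.value({1: (1, 1)}))   # both indicators 1 -> partition function
```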

Example SPN
A uniform distribution over the states of five Boolean variables that contain an even number of 1s. It can be represented either by a shallow SPN of exponential size or by a compact deep SPN (Poon and Domingos, 2011).

Validity
An SPN is valid if it defines an (unnormalized) probability distribution (generative model). Completeness and consistency are sufficient conditions for validity.
Definition (Complete): An SPN is complete iff each sum node has children with the same scope.
Definition (Consistent): An SPN is consistent iff no variable appears negated in one child of a product node and non-negated in another.
Definition (Decomposable): An SPN is decomposable iff for every product node $v$, $scope(v_i) \cap scope(v_j) = \emptyset$ for all $v_i, v_j \in Ch(v)$ with $i \neq j$.
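Since completeness and decomposability are purely structural conditions on node scopes, they can be checked mechanically. The following sketch (illustrative only; the dictionary-based node encoding is an assumption of this example) computes scopes bottom-up and verifies both properties.

```python
# Structural checks for a toy SPN encoded as nested dictionaries.
def scope(node):
    """Set of variables a node depends on."""
    if node["type"] == "leaf":
        return {node["var"]}
    return set().union(*(scope(c) for c in node["children"]))

def is_complete(node):
    """Every sum node's children share the same scope."""
    if node["type"] == "leaf":
        return True
    ok = all(is_complete(c) for c in node["children"])
    if node["type"] == "sum":
        scopes = [scope(c) for c in node["children"]]
        ok = ok and all(s == scopes[0] for s in scopes)
    return ok

def is_decomposable(node):
    """Every product node's children have pairwise disjoint scopes."""
    if node["type"] == "leaf":
        return True
    ok = all(is_decomposable(c) for c in node["children"])
    if node["type"] == "product":
        seen = set()
        for c in node["children"]:
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return ok

leaf = lambda v: {"type": "leaf", "var": v}
spn = {"type": "product", "children": [leaf(1), leaf(2)]}
print(is_complete(spn), is_decomposable(spn))   # True True
```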

Computations in SPNs
1. Partition function: set all indicators to 1 and evaluate the network polynomial, $Z_S = \sum_{x \in \mathcal{X}} S(x) = S(1, \dots, 1)$
2. State probability: normalize the network polynomial at state $x$ (for each $X_i$, either $I_{x_i} = 1$ and $I_{\bar{x}_i} = 0$, or $I_{x_i} = 0$ and $I_{\bar{x}_i} = 1$): $P(x) = S(x) / Z_S$
3. Marginal probability: for every unobserved $X_i$, set both $I_{x_i} = 1$ and $I_{\bar{x}_i} = 1$ to define the evidence $e$: $P(e) = S(e) / Z_S$
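The three queries above differ only in how the indicators are set before a single evaluation of the network polynomial. A small sketch (with a made-up two-variable SPN) makes this explicit:

```python
# Three basic SPN queries on a toy network polynomial S with invented weights.
def S(ix1, inx1, ix2, inx2):
    # toy valid SPN: (0.4*I_x1 + 0.6*I_~x1) * (0.8*I_x2 + 0.2*I_~x2)
    return (0.4 * ix1 + 0.6 * inx1) * (0.8 * ix2 + 0.2 * inx2)

# 1. Partition function: set every indicator to 1.
Z = S(1, 1, 1, 1)

# 2. Probability of a full state, e.g. X1 = 1, X2 = 0.
p_state = S(1, 0, 0, 1) / Z

# 3. Marginal with X2 unobserved: set both of its indicators to 1.
p_marginal = S(1, 0, 1, 1) / Z     # Pr(X1 = 1)

print(Z, p_state, p_marginal)       # 1.0, 0.08, 0.4
```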

Extension to Continuous Variables
Instead of sum nodes over indicator leaves (multinomial distributions), we can consider variables with an infinite number of values: the weighted sum becomes the integral $\int p(x) \, dx$, where $p(x)$ is the p.d.f. of $X$. The value of an integral node $n$ is either $p_n(x)$ (when $X = x$ is observed) or 1 (when $X$ is unobserved). Computation of evidence proceeds as usual.

Learning in SPNs
1. First, evaluate all node values $S_i(x)$ in an upward pass.
2. In a downward pass, compute the likelihood gradient through backpropagation,
$\frac{\partial S(x)}{\partial S_i(x)} = \sum_{\text{sum nodes } k \in Pa_i} w_{ki} \frac{\partial S(x)}{\partial S_k(x)} + \sum_{\text{product nodes } k \in Pa_i} \frac{\partial S(x)}{\partial S_k(x)} \prod_{l \in Ch(k) \setminus \{i\}} S_l(x)$
and the gradient with respect to the weights,
$\frac{\partial S(x)}{\partial w_{ij}} = \frac{\partial S(x)}{\partial S_i(x)} S_j(x)$
3. Compute marginals: for the latent variable of a sum node $n_k$ with child $n_i$, $P(Y_k = i \mid e) \propto w_{ki} \frac{\partial S(e)}{\partial S_k(e)}$; for an indicator $I_{x_i}$, $P(X_i = 1 \mid e) \propto \frac{\partial S(e)}{\partial I_{x_i}}$
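A toy sketch of the two passes (not the authors' code; the node encoding and the example SPN are invented for illustration): the upward pass fills in the node values $S_i(x)$, and the downward pass accumulates $\partial S(x)/\partial S_i(x)$ using the sum-node and product-node rules above.

```python
# Upward value pass and downward derivative pass on a tiny SPN.
# Nodes are listed in topological order (children before parents).
nodes = [
    {"type": "leaf"},                                             # 0: I_x1
    {"type": "leaf"},                                             # 1: I_~x1
    {"type": "sum", "children": [0, 1], "weights": [0.3, 0.7]},   # 2: root
]

def upward(nodes, leaf_values):
    for i, n in enumerate(nodes):
        if n["type"] == "leaf":
            n["val"] = leaf_values[i]
        elif n["type"] == "sum":
            n["val"] = sum(w * nodes[c]["val"]
                           for w, c in zip(n["weights"], n["children"]))
        else:  # product node
            v = 1.0
            for c in n["children"]:
                v *= nodes[c]["val"]
            n["val"] = v

def downward(nodes):
    """Compute dS(root)/dS_i for every node i by backpropagation."""
    for n in nodes:
        n["grad"] = 0.0
    nodes[-1]["grad"] = 1.0                      # root
    for i in reversed(range(len(nodes))):
        n = nodes[i]
        if n["type"] == "sum":
            for w, c in zip(n["weights"], n["children"]):
                nodes[c]["grad"] += w * n["grad"]
        elif n["type"] == "product":
            for c in n["children"]:
                others = 1.0
                for c2 in n["children"]:
                    if c2 != c:
                        others *= nodes[c2]["val"]
                nodes[c]["grad"] += n["grad"] * others

# Weight gradient for sum node i and child j: dS/dw_ij = grad_i * val_j.
upward(nodes, {0: 1.0, 1: 0.0})   # evidence X1 = 1
downward(nodes)
print(nodes[2]["val"], nodes[0]["grad"])   # S(x) = 0.3, dS/dI_x1 = 0.3
```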

Gradient Diffusion
Unfortunately, deep SPNs suffer from gradient diffusion: the gradient signal rapidly dilutes as it propagates through many layers. Most probable explanation (MPE) inference may instead be used to define a hard EM scheme:
1. In the upward pass, replace each weighted sum with the maximum weighted child value
2. In the downward pass, follow only the highest-valued child of each sum node (E-step)
3. Increment a count for each chosen child
4. Renormalize the counts to obtain the weights (M-step)
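A small sketch of the hard-EM style passes (illustrative only, assuming a tree-structured SPN and an invented node encoding): sums are replaced by weighted maxima on the way up, and only the winning child of each sum node is followed and counted on the way down.

```python
# MPE-style upward pass (max instead of sum) and downward count pass.
def mpe_value(node):
    """Upward pass: sum nodes take the maximum weighted child value."""
    if node["type"] == "leaf":
        return node["val"]
    child_vals = [mpe_value(c) for c in node["children"]]
    if node["type"] == "product":
        v = 1.0
        for cv in child_vals:
            v *= cv
        return v
    # sum node: remember which child attains the maximum
    weighted = [w * cv for w, cv in zip(node["weights"], child_vals)]
    node["best"] = max(range(len(weighted)), key=weighted.__getitem__)
    return weighted[node["best"]]

def mpe_descend(node, counts):
    """Downward pass: follow only the winning child of each sum node."""
    if node["type"] == "leaf":
        return
    if node["type"] == "sum":
        b = node["best"]
        node_counts = counts.setdefault(node["name"], {})
        node_counts[b] = node_counts.get(b, 0) + 1
        mpe_descend(node["children"][b], counts)
    else:  # product node: descend into every child
        for c in node["children"]:
            mpe_descend(c, counts)

leaf = lambda v: {"type": "leaf", "val": v}
root = {"type": "sum", "name": "root", "weights": [0.3, 0.7],
        "children": [leaf(1.0), leaf(0.0)]}
counts = {}
mpe_value(root)
mpe_descend(root, counts)
print(counts)   # {'root': {0: 1}} -- the first child wins for this evidence
# New weights come from renormalizing these counts across each node's children.
```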

Experiment: Face Completion
Restoration of half-occluded faces. [Figure: original images alongside completions by the SPN, DBM, DBN, PCA, and nearest-neighbor methods.]

Contributions of Paper
The paper establishes three results:
1. Any valid SPN can be represented as a normal SPN
2. Any normal SPN can be converted to a Bayesian network whose CPDs are represented by algebraic decision diagrams (ADDs)
3. The generated BN can recover the original SPN's probability distribution

Normal SPN
Definition (Normal SPN): An SPN is said to be normal if
1. it is complete and decomposable;
2. for each sum node, the weights of the edges emanating from it are nonnegative and sum to 1;
3. every terminal node is a univariate distribution over a Boolean variable, and the size of the scope of every sum node is at least 2.
Theorem (Convert SPN to Normal SPN): For any complete and consistent SPN $S$, there exists a normal SPN $S'$ such that $\Pr_S(\cdot) = \Pr_{S'}(\cdot)$ and $|S'| = O(|S|^2)$.

Normal SPN: Consistent to Decomposable
The authors give an algorithm (with proof) that converts any valid SPN into a decomposable SPN.
Definition (Decomposable): An SPN is decomposable iff for every product node $v$, $scope(v_i) \cap scope(v_j) = \emptyset$ for all $v_i, v_j \in Ch(v)$ with $i \neq j$.

Normal SPN: Normalize Weights The weights associated with sum nodes may then be normalized for a complete and decomposable SPN
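A minimal sketch of the normalization step for a single sum node (illustrative only; in the full conversion the normalization constant must be compensated for elsewhere in the network, e.g. at the node's parents, so that the represented distribution is unchanged):

```python
# Renormalize the outgoing weights of one sum node so they sum to 1.
def normalize_sum_weights(weights):
    total = sum(weights)
    if total == 0:
        raise ValueError("sum node has all-zero weights")
    return [w / total for w in weights]

print(normalize_sum_weights([2.0, 3.0, 5.0]))   # [0.2, 0.3, 0.5]
```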

SPN to BN
Theorem (SPN to BN): There exists an algorithm that converts any complete and decomposable SPN $S$ over Boolean variables $X_{1:N}$ into a BN $\mathcal{B}$ with CPDs represented by ADDs in time $O(N|S|)$. Furthermore, $S$ and $\mathcal{B}$ represent the same distribution, and $|\mathcal{B}| = O(N|S|)$.

SPN to BN: Structure of BN
1. Create an observable variable in $\mathcal{B}$ for each variable appearing in the terminal nodes
2. Create a hidden variable $H_v$ in place of each sum node $v$
3. Build directed edges from each hidden variable to the observable variables in the scope of the sub-SPN rooted at the corresponding sum node
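A minimal sketch of this structural construction (illustrative only; the dictionary-based SPN encoding and the variable naming are assumptions of this example, and a full implementation would also need to handle sub-SPNs shared across the DAG):

```python
# Build the hidden-to-observable edges of the BN from a toy SPN.
def scope(node):
    """Observable variables appearing below a node."""
    if node["type"] == "leaf":
        return {node["var"]}
    return set().union(*(scope(c) for c in node["children"]))

def bn_edges(spn_root):
    """Directed edges (hidden variable -> observable variable) of the BN."""
    edges, next_id = [], [0]
    def visit(node):
        if node["type"] == "sum":
            h = f"H{next_id[0]}"
            next_id[0] += 1
            edges.extend((h, f"X{v}") for v in sorted(scope(node)))
        if node["type"] != "leaf":
            for c in node["children"]:
                visit(c)   # a real DAG traversal would avoid revisits
    visit(spn_root)
    return edges

leaf = lambda v: {"type": "leaf", "var": v}
spn = {"type": "sum", "children": [
    {"type": "product", "children": [leaf(1), leaf(2)]},
    {"type": "product", "children": [leaf(1), leaf(2)]},
]}
print(bn_edges(spn))   # [('H0', 'X1'), ('H0', 'X2')]
```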

SPN to BN: Algebraic Decision Diagrams
Algebraic decision diagrams (ADDs) are used to represent the full conditional probability distributions.
Definition (Algebraic Decision Diagram): An ADD is a DAG representing a function $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_N \to \mathbb{R}$, where $\mathcal{X}_n$ is the domain of variable $X_n$ and $|\mathcal{X}_n|$ is the number of values $X_n$ can take.
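For intuition, here is a tiny sketch (illustrative only, not the paper's representation) of an ADD as a DAG: internal nodes branch on a variable's value and terminal nodes store real numbers, so evaluating the diagram on an assignment returns $f(x_1, \dots, x_N)$.

```python
# Minimal ADD-like structure: branch nodes and real-valued terminals.
class Terminal:
    def __init__(self, value):
        self.value = value
    def eval(self, assignment):
        return self.value

class Branch:
    def __init__(self, var, children):   # children[k] followed when var == k
        self.var, self.children = var, children
    def eval(self, assignment):
        return self.children[assignment[self.var]].eval(assignment)

# f(X1) = 0.3 if X1 = 1 else 0.7
add = Branch("X1", {0: Terminal(0.7), 1: Terminal(0.3)})
print(add.eval({"X1": 1}))   # 0.3
```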

BN to SPN
Theorem (BN to SPN): Given the BN $\mathcal{B}$ with ADD representation of CPDs generated from a complete and decomposable SPN $S$ over Boolean variables $X_{1:N}$, the original SPN $S$ can be recovered by applying the variable elimination algorithm to $\mathcal{B}$ in time $O(N|S|)$.
The authors prove that the generated BN can recover an SPN with a distribution identical to that of the original SPN.

BN to SPN: Variable Elimination
The recovery proceeds by variable elimination (VE), which alternates two operations:
1. Multiply two factors
2. Sum out the hidden variables
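A minimal sketch (illustrative, with made-up factors) of the two VE operations on tabular factors over Boolean variables: multiplying $\Pr(H)$ by $\Pr(X \mid H)$ and then summing out the hidden variable $H$ yields the marginal over $X$.

```python
# Factor multiplication and summing-out for tabular factors over 0/1 variables.
from itertools import product

def multiply(f, g):
    """Factors are (vars, table); tables are keyed by assignments to vars."""
    fv, ft = f
    gv, gt = g
    out_vars = fv + [v for v in gv if v not in fv]
    table = {}
    for vals in product([0, 1], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out_vars, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fv, ft = f
    out_vars = [v for v in fv if v != var]
    table = {}
    for vals, p in ft.items():
        key = tuple(v for v, name in zip(vals, fv) if name != var)
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

f = (["H"],      {(0,): 0.4, (1,): 0.6})            # Pr(H)
g = (["H", "X"], {(0, 0): 0.9, (0, 1): 0.1,
                  (1, 0): 0.2, (1, 1): 0.8})        # Pr(X | H)
print(sum_out(multiply(f, g), "H"))   # marginal Pr(X): {(0,): 0.48, (1,): 0.52}
```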