Collapsed Variational Inference for Sum-Product Networks


Collapsed Variational Inference for Sum-Product Networks
Han Zhao (1), Tameem Adel (2), Geoff Gordon (1), Brandon Amos (1)
Presented by: Han Zhao
(1) Carnegie Mellon University, (2) University of Amsterdam
June 20th, 2016

Outline
Background
  Sum-Product Networks
  Variational Inference
Collapsed Variational Inference
  Motivations and Challenges
  Efficient Marginalization
  Logarithmic Transformation
Experiments
Summary

Sum-Product Networks: Definition
A Sum-Product Network (SPN) is a rooted directed acyclic graph of univariate distributions, sum nodes and product nodes.
The value of a product node is the product of the values of its children.
The value of a sum node is the weighted sum of the values of its children, where the weights are nonnegative.
The value of the network is the value at the root.
[Figure: a small SPN with a sum node (weights w_1, w_2, w_3) over product nodes of the indicators I_{x_1}, I_{\bar{x}_1}, I_{x_2}, I_{\bar{x}_2}.]
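A minimal sketch, in Python, of the recursive bottom-up evaluation this definition describes; the node classes and the small example network are hypothetical illustrations, not the authors' code.

```python
# Minimal sketch of bottom-up SPN evaluation (hypothetical classes, not from the slides).

class Leaf:
    """Univariate distribution; here a simple Bernoulli leaf on one variable."""
    def __init__(self, var, p):
        self.var, self.p = var, p

    def value(self, x):
        return self.p if x[self.var] == 1 else 1.0 - self.p

class Product:
    def __init__(self, children):
        self.children = children

    def value(self, x):
        v = 1.0
        for c in self.children:          # product node: multiply children's values
            v *= c.value(x)
        return v

class Sum:
    def __init__(self, children, weights):
        assert all(w >= 0 for w in weights)   # nonnegative weights
        self.children, self.weights = children, weights

    def value(self, x):
        # sum node: weighted sum of children's values
        return sum(w * c.value(x) for w, c in zip(self.weights, self.children))

# The value of the network is the value at the root.
root = Sum(
    children=[Product([Leaf(0, 0.9), Leaf(1, 0.2)]),
              Product([Leaf(0, 0.9), Leaf(1, 0.8)]),
              Product([Leaf(0, 0.1), Leaf(1, 0.8)])],
    weights=[0.3, 0.5, 0.2],
)
print(root.value({0: 1, 1: 0}))
```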

Sum-Product Networks: Mixture of Trees
Each SPN can be decomposed as a mixture of trees:
[Figure: the example SPN rewritten as a weighted mixture, w_1 T_1 + w_2 T_2 + w_3 T_3, of induced trees over the indicators.]
Each tree is a product of univariate distributions.
The number of mixture components is Ω(2^{Depth}).
Each network computes a positive polynomial (posynomial) function of the model parameters:

    V_{\mathrm{root}}(x \mid w) = \sum_{t=1}^{\tau_S} \prod_{(k,j) \in T_{tE}} w_{kj} \prod_{i=1}^{n} p_t(X_i = x_i)
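To make the decomposition concrete, here is an illustrative instance for the toy SPN above; the specific pairing of indicators with trees is assumed from the figure, not stated in the transcript.

```latex
% Illustrative only: assuming the three product children of the root pair the
% indicators as shown, the network value is already a mixture of three induced
% trees, i.e. a posynomial in (w_1, w_2, w_3):
\[
V_{\mathrm{root}}(x \mid w)
  = w_1\, I_{x_1} I_{x_2}
  + w_2\, I_{x_1} I_{\bar{x}_2}
  + w_3\, I_{\bar{x}_1} I_{\bar{x}_2}.
\]
```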

Sum-Product Networks: Bayesian Network
Alternatively, each SPN S is equivalent to a Bayesian network B with a bipartite structure.
[Figure: the example SPN with sum nodes H, H_1, ..., H_4 over X_1, ..., X_4, and the equivalent bipartite BN in which the hidden variables H, H_1, ..., H_4 are parents of the observables X_1, ..., X_4.]
Number of sum nodes in S = number of hidden variables in B = Θ(|S|), and |B| = O(n|S|).
Number of observable variables in B = number of variables modeled by S.
Typically the number of hidden variables is much larger than the number of observable variables.

Variational Inference: Brief Introduction
Bayesian inference:

    \underbrace{p(w \mid x)}_{\text{posterior}} \propto \underbrace{p(w)}_{\text{prior}} \; \underbrace{p(x \mid w)}_{\text{likelihood}}

Often intractable: no analytical solution, and numerical integration is expensive.
General idea: find the best approximation in a tractable family of distributions Q:

    \min_{q \in Q} \mathrm{KL}[\, q(w) \,\|\, p(w \mid x) \,]

Typical choices of approximation family: mean-field, structured mean-field, etc.

Variational Inference: Brief Introduction (cont.)
Variational method: an optimization-based, deterministic approach to approximate Bayesian inference.

    \inf_{q \in Q} \mathrm{KL}[\, q(w) \,\|\, p(w \mid x) \,] \;\iff\; \sup_{q \in Q} \mathbb{E}_q[\log p(w, x)] + H[q]

Evidence Lower Bound L:

    \log p(x) \ge \sup_{q \in Q} \mathbb{E}_q[\log p(w, x)] + H[q] =: L
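For completeness, here is the standard one-step derivation behind both statements (not spelled out on the slide):

```latex
% Standard identity behind the ELBO: for any q in Q,
\begin{align*}
\mathrm{KL}[q(w) \,\|\, p(w \mid x)]
  &= \mathbb{E}_q[\log q(w)] - \mathbb{E}_q[\log p(w, x)] + \log p(x) \\
  &= \log p(x) - \big(\mathbb{E}_q[\log p(w, x)] + H[q]\big).
\end{align*}
% Since the KL divergence is nonnegative and log p(x) does not depend on q,
% minimizing the KL is equivalent to maximizing the bracketed term, and
% log p(x) >= E_q[log p(w, x)] + H[q] for every q, which is the ELBO.
```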

Motivations and Challenges
Bayesian inference algorithms for SPNs are attractive because they are:
Flexible at incorporating prior knowledge about the structure of SPNs.
More robust to overfitting.

Motivations and Challenges (cont.)
[Figure: plate diagram with global parameters W_1, ..., W_m, local hidden variables H_1, ..., H_m and observables X_1, ..., X_n, replicated over D instances.]
W: model parameters, the global hidden variables.
H: assignments of sum nodes, the local hidden variables.
X: observable variables.
D: number of instances.
Challenges for standard VB:
Large number of local hidden variables: number of local hidden variables = number of sum nodes = Θ(|S|).
Memory overhead: space complexity O(D|S|).
Time complexity: O(nD|S|).
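To get a feel for the scale, a small worked comparison; the numbers below are assumed for illustration, not from the slides.

```latex
% Illustrative scale, assuming D = 10^5 training instances and |S| = 10^4 nodes:
\[
\underbrace{O(D\,|S|) \approx 10^{9}}_{\text{variational parameters, standard VB}}
\qquad\text{vs.}\qquad
\underbrace{O(|S|) \approx 10^{4}}_{\text{collapsed VI (next slides)}}.
\]
```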

Contributions
Our contributions:
We obtain a better ELBO \tilde{L} to optimize than L, the one obtained by mean-field.
Reduced space complexity: O(D|S|) → O(|S|); the space complexity is independent of the training set size.
Reduced time complexity: O(nD|S|) → O(D|S|), removing the explicit dependence on the dimension n.

Efficient Marginalization
Recall the ELBO in standard VI:

    L := \mathbb{E}_{q(w,h)}[\log p(w, h, x)] + H[q(w, h)]

Consider the new ELBO in collapsed VI:

    \tilde{L} := \mathbb{E}_{q(w)}[\log p(w, x)] + H[q(w)]
               = \mathbb{E}_{q(w)}[\log \textstyle\sum_h p(w, h, x)] + H[q(w)]

We can establish the following inequality:

    \log p(x) \ge \tilde{L} \ge L

The new ELBO in collapsed VI is a tighter lower bound than the one used in standard VI.
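A short justification of the second inequality (a standard argument for collapsed variational bounds, added here for completeness):

```latex
% For any mean-field q(w, h) = q(w) q(h), Jensen's inequality gives
\begin{align*}
\tilde{L}
  &= \mathbb{E}_{q(w)}\Big[\log \textstyle\sum_h q(h)\,\tfrac{p(w, h, x)}{q(h)}\Big] + H[q(w)] \\
  &\ge \mathbb{E}_{q(w) q(h)}[\log p(w, h, x)] + H[q(h)] + H[q(w)] \;=\; L .
\end{align*}
% The first inequality, log p(x) >= \tilde{L}, is the usual ELBO argument
% applied to q(w) and p(w, x) after h has been summed out exactly.
```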

Comparisons
Standard variational inference:
Mean-field assumption: q(w, h) = \prod_i q(w_i) \prod_j q(h_j).
ELBO: L := \mathbb{E}_{q(w,h)}[\log p(w, h, x)] + H[q(w, h)].
Collapsed variational inference for LDA, HDP:
Collapse out the global hidden variables: q(h) = \int q(w, h) \, dw.
ELBO: L_h := \mathbb{E}_{q(h)}[\log p(h, x)] + H[q(h)].
Better lower bound: L_h \ge L.
Collapsed variational inference for SPNs:
Collapse out the local hidden variables: q(w) = \sum_h q(w, h).
ELBO: L_w := \mathbb{E}_{q(w)}[\log p(w, x)] + H[q(w)].
Better lower bound: L_w \ge L.

Efficient Marginalization (cont.)
Time complexity of the exact marginalization incurred in computing \sum_h p(w, h, x):
Marginalization in a general graphical model G: O(D \cdot 2^{tw(G)}), where tw(G) is the treewidth.
Exact marginalization in the BN B with algebraic decision diagrams as local factors: O(D|B|) = O(nD|S|).
Exact marginalization in the SPN S: O(D|S|).
Space complexity reduction:
No posterior over h to approximate anymore.
No variational variables over h needed: O(D|S|) → O(|S|).
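The O(D|S|) figure comes from the fact that summing out all local hidden variables h reduces to one bottom-up evaluation of the SPN per data instance. A rough sketch, reusing the hypothetical Leaf/Product/Sum classes and the `root` network from the earlier snippet, and assuming locally normalized sum-node weights:

```python
# Rough sketch (reuses the hypothetical classes and `root` from the earlier
# snippet; assumes normalized sum-node weights): each term sum_h p(h, x_d | w)
# is just the SPN value V_root(x_d | w), so marginalizing h over D instances
# costs O(D * |S|).
import math

def log_marginal_over_h(root, data):
    """Return sum_d log sum_h p(h, x_d | w) = sum_d log V_root(x_d | w)."""
    return sum(math.log(root.value(x)) for x in data)

data = [{0: 1, 1: 0}, {0: 0, 1: 1}]
print(log_marginal_over_h(root, data))
```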

Logarithmic Transformation
New optimization objective:

    \max_{q \in Q} \; \mathbb{E}_{q(w)}[\log \textstyle\sum_h p(w, h, x)] + H[q(w)]

which is equivalent to

    \min_{q \in Q} \; \mathrm{KL}[\, q(w) \,\|\, p(w) \,] - \mathbb{E}_{q(w)}[\log p(x \mid w)]

p(w): prior distribution over w, a product of Dirichlets.
q(w): variational posterior over w, a product of Dirichlets.
p(x \mid w): likelihood, not multinomial anymore after marginalization.
q(w) and p(x \mid w) are non-conjugate, so there is no analytical solution for \mathbb{E}_{q(w)}[\log p(x \mid w)].

Logarithmic Transformation (cont.)
Key observation:

    p(x \mid w) = V_{\mathrm{root}}(x \mid w) = \sum_{t=1}^{\tau_S} \prod_{(k,j) \in T_{tE}} w_{kj} \prod_{i=1}^{n} p_t(X_i = x_i)

is a posynomial function of w.
Make a bijective mapping (change of variables) w' = \log(w); this dates back to the geometric programming literature.
The new objective after the transformation is convex in w':

    \log p(x \mid w') = \log \sum_{t=1}^{\tau_S} \exp\Big( c_t + \sum_{(k,j) \in T_{tE}} w'_{kj} \Big)

Apply Jensen's inequality to obtain a further lower bound.
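Spelling out the constant and the Jensen step (a standard argument; c_t is defined here from the surrounding formula):

```latex
% Here c_t := sum_{i=1}^{n} log p_t(X_i = x_i), so log p(x | w') is a
% log-sum-exp of affine functions of w' and therefore convex in w'.
% Jensen's inequality for a convex function, E[f(w')] >= f(E[w']), yields
% the further lower bound used on the next slide:
\[
\mathbb{E}_{q(w')}\big[\log p(x \mid w')\big]
  \;\ge\; \log p\big(x \mid \mathbb{E}_{q(w')}[w']\big).
\]
```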

Logarithmic Transformation (cont.)
Further lower bound:

    \mathbb{E}_{q(w)}[\log p(x \mid w)] = \mathbb{E}_{q(w')}[\log p(x \mid w')] \ge \log p(x \mid \mathbb{E}_{q(w')}[w'])

Relaxed objective:

    \min_{q \in Q} \; \underbrace{\mathrm{KL}[\, q(w) \,\|\, p(w) \,]}_{\text{regularity}} - \underbrace{\log p(x \mid \mathbb{E}_{q(w')}[w'])}_{\text{data fitting}}

Roughly, \log p(x \mid \mathbb{E}_{q(w')}[w']) corresponds to the log-likelihood obtained by setting the weights of the SPN to the posterior mean of q(w).
Optimized by projected gradient descent.
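Since the slide only names the optimizer, here is a minimal, generic sketch of projected gradient descent with per-sum-node projections onto the probability simplex; `objective_grad` is a hypothetical callable standing in for the gradient of the relaxed objective, so this illustrates the optimization step, not the authors' implementation. The projection routine is the standard Euclidean projection onto the simplex.

```python
# Minimal sketch of projected gradient descent over per-sum-node weight vectors
# constrained to the probability simplex (illustrative, not the authors' code).
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def projected_gd(weights, objective_grad, lr=0.1, steps=100):
    """weights: dict mapping each sum node id to its weight vector;
    objective_grad: hypothetical callable returning a matching dict of gradients."""
    for _ in range(steps):
        grads = objective_grad(weights)
        for k in weights:
            # gradient step followed by projection back onto the simplex
            weights[k] = project_to_simplex(weights[k] - lr * grads[k])
    return weights
```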

Algorithm
[Algorithm listing (CVB-SPN) not reproduced in this transcript.]
Lines 4-8 are easily parallelizable, giving a distributed version.
Sampling a minibatch in lines 4-8 gives a stochastic version.
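Because the listing itself is missing, the following is only a guessed outline of such a training loop, consistent with the slide's remarks (per-instance work is parallelizable and can be restricted to a minibatch) and with the projected gradient step of the previous slide; every callable is a hypothetical placeholder, and `project_to_simplex` is reused from the previous sketch.

```python
# Guessed outline only: the algorithm listing is not in the transcript.
# `kl_grad` and `per_instance_grad` are hypothetical placeholders for the
# gradients of the regularity and data-fitting terms of the relaxed objective.
import random

def cvb_spn_training_outline(weights, data, kl_grad, per_instance_grad,
                             lr=0.1, epochs=10, minibatch_size=None):
    for _ in range(epochs):
        batch = (random.sample(data, minibatch_size)
                 if minibatch_size else data)          # stochastic version
        grads = kl_grad(weights)                       # regularity (KL) term
        # "lines 4-8": accumulate data-fitting gradients over instances;
        # this loop is the parallelizable / minibatched part
        for x in batch:
            g = per_instance_grad(weights, x)
            for k in weights:
                grads[k] = grads[k] - g[k]             # data-fitting term enters with a minus sign
        for k in weights:                              # projected update keeps weights on the simplex
            weights[k] = project_to_simplex(weights[k] - lr * grads[k])
    return weights
```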

Experiments
Experiments on 20 data sets; we report average log-likelihoods and a Wilcoxon signed-rank test.
Compared with (O)MLE-SPN and OBMM-SPN.
[Scatter plots: MLE-SPN vs CVB-SPN and OMLE-SPN vs OCVB-SPN, average log-likelihoods; axes MLE-Projected GD vs CVB-Projected GD and OMLE-Projected GD vs OCVB-Projected GD.]

Experiments (cont.)
[Scatter plot: OBMM vs OCVB, average log-likelihoods; axes OBMM-Moment Matching vs OCVB-Projected GD.]

Summary
Thanks! Q & A