Sum-Product Networks
STAT946 Deep Learning Guest Lecture by Pascal Poupart, University of Waterloo, October 17, 2017

Outline
- Introduction
  - What is a Sum-Product Network?
  - Inference
  - Applications
- In more depth
  - Relationship to Bayesian networks
  - Parameter estimation
  - Online and distributed estimation
  - Structure estimation

What is a Sum-Product Network? (Poon and Domingos, UAI 2011)
- An acyclic directed graph of sums and products
- Leaves can be indicator variables or univariate distributions

Two Views
- A deep neural network with clear semantics
- A tractable probabilistic graphical model

Deep Neural Network View
A specific type of neural network in which each unit has a simple closed form:
- Sum node: $\log\left(\sum_i w_i e^{\text{input}_i}\right)$ (a weighted log-sum-exp unit)
- Product node: $\exp\left(\sum_i \text{input}_i\right)$
Advantages: clear semantics, generative model, efficient training, structure estimation.
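In practice these units are evaluated in log space for numerical stability. A minimal sketch in Python (my illustration, not code from the lecture) of the two unit types:

```python
import numpy as np
from scipy.special import logsumexp

# Sketch of the two SPN unit types in log space (illustrative only).
# child_log_vals holds the log-values computed by a node's children.

def sum_node(child_log_vals, weights):
    # log( sum_i w_i * exp(input_i) ): a weighted log-sum-exp unit.
    return logsumexp(child_log_vals, b=weights)

def product_node(child_log_vals):
    # log( prod_i exp(input_i) ) is simply the sum of the children's
    # log-values; the slide's exp(...) recovers the linear-domain product.
    return np.sum(child_log_vals)
```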

Probabilistic Graphical Models
- Bayesian network: graphical view of direct dependencies; inference is #P (intractable)
- Markov network: graphical view of correlations; inference is #P (intractable)
- Sum-Product Network: graphical view of computation; inference is P (tractable)

Probabilistic Inference
An SPN represents a joint distribution over a set of random variables.
Example: $\Pr(X_1 = \text{true}, X_2 = \text{false})$ is obtained by setting the corresponding leaf indicators and evaluating the network bottom-up; the root evaluates to the unnormalized value 34.8.
[Figure: the example SPN evaluated with leaf indicators 0, 1, 1, 0; intermediate node values 8, 9, 4, then 32 and 36; root value 34.8.]

Probabilistic Inference
The normalization constant is computed the same way: set all leaf indicators to 1 and evaluate bottom-up; the root evaluates to 100. Hence $\Pr(X_1 = \text{true}, X_2 = \text{false}) = 34.8/100$.
Each query requires a single bottom-up pass: linear complexity in the size of the network!
[Figure: the same SPN with all leaf indicators set to 1; intermediate node values 10, 10, 10; root value 100.]
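The two-pass evaluation can be made concrete with a short sketch. The classes below (Indicator, Sum, Product, probability) are hypothetical names, not an implementation from the lecture; unobserved variables are marginalized by setting their indicators to 1:

```python
# Minimal SPN evaluation sketch; the classes are hypothetical, not from the lecture.

class Indicator:
    """Leaf indicator for one assignment of one variable."""
    def __init__(self, var, value):
        self.var, self.value = var, value

    def eval(self, evidence):
        # Unobserved variables get indicator 1, which sums them out.
        if self.var not in evidence:
            return 1.0
        return 1.0 if evidence[self.var] == self.value else 0.0

class Sum:
    """Weighted sum over children (a mixture)."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

    def eval(self, evidence):
        return sum(w * c.eval(evidence)
                   for w, c in zip(self.weights, self.children))

class Product:
    """Product over children with disjoint scopes."""
    def __init__(self, children):
        self.children = children

    def eval(self, evidence):
        result = 1.0
        for c in self.children:
            result *= c.eval(evidence)
        return result

def probability(root, evidence):
    # Two bottom-up passes: one with the evidence indicators, one with
    # all indicators set to 1 for the normalization constant f(1 | w).
    return root.eval(evidence) / root.eval({})
```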

Semantics
Each node computes a probability over its scope.
- Scope of a node: the set of variables in the sub-SPN rooted at that node
- Decomposable product node: children have disjoint scopes
- Complete (smooth) sum node: children have identical scopes
Decomposability + completeness imply a valid distribution and linear-time inference.
[Figure: the example SPN with normalized node values 34.8/100, 32/100, 36/100 and 8/10, 9/10, 4/10.]

Queries
Most neural nets compute outputs = f(inputs) for a fixed partition of inputs and outputs. In an SPN, the leaves are the inputs and any conditional query can be answered. Example:
$\Pr(X_2 = F \mid X_1 = T) = \Pr(X_2 = F, X_1 = T) \,/\, \Pr(X_1 = T)$
The numerator is the unnormalized network value 34.8 computed earlier.
[Figure: the SPN evaluated with leaf indicators 0, 1, 1, 0; intermediate values 8, 9, 4, then 32 and 36; root value 34.8.]

Queries
For the denominator, set both indicators of $X_2$ to 1 (summing it out) and evaluate: the root gives 87. Hence $\Pr(X_2 = F \mid X_1 = T) = 34.8/87$.
[Figure: the SPN evaluated with leaf indicators 0, 1, 1, 1; intermediate values 8, 9, 10, then 80 and 90; root value 87.]
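Using the hypothetical classes from the inference sketch above (and assuming `root` is an SPN built from them), a conditional query is just two evaluations; the all-ones normalization cancels in the ratio:

```python
# Conditional query with the sketch above: Pr(X2 = F | X1 = T).
# The normalization constant cancels, leaving f(X1=T, X2=F) / f(X1=T),
# i.e. 34.8 / 87 in the lecture's running example.
joint = probability(root, {"X1": True, "X2": False})
marginal = probability(root, {"X1": True})   # X2 summed out via indicator 1
conditional = joint / marginal
```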

Relationship with other PGMs
Any SPN can be converted into a Bayes net without any exponential blow-up (Zhao, Melibari, Poupart, ICML-15).
[Figure: a naive Bayes model with hidden variable H over X, Y, Z drawn as both an SPN and a BN; a product of naive Bayes models with hidden variables H1, H2 drawn as both an SPN and a BN.]

Relationship with other PGMs
- Compact: representation size is polynomial in the number of variables
- Tractable: inference time is polynomial in the number of variables
As classes of probability distributions, SPNs = BNs (MNs). Compact SPNs = tractable SPNs, and these contain the tractable BNs (MNs) while being contained in the compact BNs (MNs).
[Figure: Venn diagram of these classes within the space of all probability distributions.]

Parameter Estimation
Given data (a matrix of instances × attributes, possibly with missing entries), estimate the weights.
- Maximum likelihood: stochastic gradient descent (SGD) (Poon & Domingos, 2011), expectation maximization (EM) (Peharz, 2015), signomial programming (Zhao & Poupart, 2016)
- Bayesian learning: Bayesian Moment Matching (BMM) (Rashwan et al., 2016), Collapsed Variational Inference (Zhao et al., 2016)

Applications
- Image completion (Poon, Domingos; 2011)
- Activity recognition (Amer, Todorovic; 2012)
- Language modeling (Cheng et al.; 2014)
- Speech modeling (Peharz et al.; 2014)
- Mobile robotics (Pronobis, Rao; 2016)

Language Model
- An SPN-based n-gram model
- Fixed structure
- Discriminative weight estimation by gradient descent

Results
[Figure: language-modeling results from Cheng et al. (2014).]

Maximum Log-Likelihood
Objective: $w^* = \arg\max_w \log \Pr(\text{data} \mid w) = \arg\max_w \sum_x \log \Pr(x \mid w)$,
where $\Pr(x \mid w) = \frac{f(x \mid w)}{f(\mathbf{1} \mid w)}$ and $f$ is the unnormalized value of the root, a posynomial in the weights: $f(x \mid w) = \sum_{t \in \text{trees}} \prod_{(i,j) \in t} w_{ij} \prod_{\ell \in t} p_\ell(x)$.
This yields a non-convex program:
$\max_w \sum_x \left[ \log f(x \mid w) - \log f(\mathbf{1} \mid w) \right]$ s.t. $w_{ij} \ge 0 \;\; \forall ij$.
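With the evaluation sketch from earlier, the objective for one data vector is exactly $\log f(x \mid w) - \log f(\mathbf{1} \mid w)$ (illustrative, reusing the hypothetical `root.eval`):

```python
import math

# Log-likelihood of one data vector under the sketch's SPN (illustrative).
def log_likelihood(root, x):
    return math.log(root.eval(x)) - math.log(root.eval({}))
```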

Summary of parameter-learning updates:
- PGD: variable $w$, additive update, linear approximation, with projection onto the feasible set:
  $w_{ij}^{(t+1)} \leftarrow P\!\left[ w_{ij}^{(t)} + \gamma \left( \frac{\partial \log f(x \mid w)}{\partial w_{ij}} - \frac{\partial \log f(\mathbf{1} \mid w)}{\partial w_{ij}} \right) \right]$
- EG: variable $w$, multiplicative update, linear approximation:
  $w_{ij}^{(t+1)} \leftarrow w_{ij}^{(t)} \exp\left[ \gamma \left( \frac{\partial \log f(x \mid w)}{\partial w_{ij}} - \frac{\partial \log f(\mathbf{1} \mid w)}{\partial w_{ij}} \right) \right]$
- SMA: variable $\log w$, multiplicative update, monomial approximation:
  $w_{ij}^{(t+1)} \leftarrow w_{ij}^{(t)} \exp\left[ \gamma \left( \frac{\partial \log f(x \mid w)}{\partial \log w_{ij}} - \frac{\partial \log f(\mathbf{1} \mid w)}{\partial \log w_{ij}} \right) \right]$
- CCCP (EM): variable $\log w$, multiplicative update, concave lower bound:
  $w_{ij}^{(t+1)} \propto w_{ij}^{(t)} \cdot \frac{f_j(x \mid w^{(t)})}{f(x \mid w^{(t)})} \cdot \frac{\partial f(x \mid w^{(t)})}{\partial f_i(x \mid w^{(t)})}$
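As an illustration of the first two rows, a hedged sketch (my notation, not the paper's code), given arrays of the two gradients for every edge:

```python
import numpy as np

# Sketch of the PGD and EG updates from the table (illustrative only).
# grad_x and grad_1 hold d log f(x|w) / d w_ij and d log f(1|w) / d w_ij
# for every edge (i, j), stacked into arrays of the same shape as w.

def pgd_update(w, grad_x, grad_1, lr):
    # Additive update, then projection onto the nonnegative orthant.
    return np.maximum(w + lr * (grad_x - grad_1), 1e-12)

def eg_update(w, grad_x, grad_1, lr):
    # Multiplicative (exponentiated-gradient) update; positivity is automatic.
    return w * np.exp(lr * (grad_x - grad_1))
```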

Results
[Figure: results from Zhao, Poupart et al. (NIPS 2016).]

Streaming Data
Examples: traffic classification, app recommendation.
Challenge: update the model after each data vector.
Solution: online learning for SPNs.

Scalability
- Online: process data sequentially, once only
- Distributed: process subsets of data on different computers
Mini-batch methods: SGD, online EG, online EM.
Problems: loss of information due to mini-batches; how to adjust the learning rate?
Can we do better?

Thomas Bayes

Bayesian Learning
Bayes' theorem (1764): $\Pr(\theta \mid X_{1:n}) \propto \Pr(\theta) \Pr(X_1 \mid \theta) \Pr(X_2 \mid \theta) \cdots \Pr(X_n \mid \theta)$
Broderick et al. (2013): this factorization facilitates
- Online learning (streaming data): absorb one likelihood factor at a time, $\Pr(\theta \mid X_{1:n}) \propto \Pr(\theta \mid X_{1:n-1}) \Pr(X_n \mid \theta)$
- Distributed computation: partition the likelihood factors across cores, e.g. core #1 computes $\Pr(X_1 \mid \theta)\Pr(X_2 \mid \theta)$, core #2 computes $\Pr(X_3 \mid \theta)\Pr(X_4 \mid \theta)$, core #3 the remaining factors, and the partial products are multiplied into the prior.
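A toy illustration (mine, not the lecture's) with a conjugate Beta-Bernoulli model shows why the factorization supports both modes: sequential absorption of factors and per-core partial updates give the same posterior.

```python
# Toy Beta-Bernoulli illustration of online vs. distributed Bayesian updates
# (my example; the lecture applies the same idea to SPN weights).

prior = (1.0, 1.0)                        # Beta(a, b) uniform prior
data = [1, 0, 1, 1]                       # streaming coin flips

# Online: absorb one likelihood factor at a time.
a, b = prior
for x in data:                            # Pr(theta|X_1:n) ∝ Pr(theta|X_1:n-1) Pr(X_n|theta)
    a, b = a + x, b + (1 - x)

# Distributed: each core processes a shard; sufficient statistics combine.
shards = [data[:2], data[2:]]             # core #1, core #2
a_d, b_d = prior
for shard in shards:
    a_d += sum(shard)
    b_d += len(shard) - sum(shard)

assert (a, b) == (a_d, b_d)               # identical posterior either way
```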

Exact Bayesian Learning
Assume a normal SPN where the weights $w_i$ of each sum node $i$ form a discrete distribution.
- Prior: $\Pr(w) = \prod_i \text{Dir}(w_i \mid \alpha_i)$, where $\text{Dir}(w_i \mid \alpha_i) \propto \prod_j w_{ij}^{\alpha_{ij} - 1}$
- Likelihood: $\Pr(x \mid w) = f(x \mid w) = \sum_{t \in \text{trees}} \prod_{(i,j) \in t} w_{ij} \prod_{\ell \in t} p_\ell(x)$
- Posterior: $\Pr(w \mid x) \propto \sum_t c_t \prod_i \text{Dir}(w_i \mid \alpha_i^{(t)})$, an exponentially large mixture of products of Dirichlets.

Karl Pearson

Method of Moments (1894)
Estimate model parameters by matching a subset of moments (e.g., mean and variance).
- Performance guarantees
- Breakthrough: first provably consistent estimation algorithms for several mixture models
  - HMMs: Hsu, Kakade, Zhang (2008)
  - MoGs: Moitra, Valiant (2010); Belkin, Sinha (2010)
  - LDA: Anandkumar, Foster, Hsu, Kakade, Liu (2012)

Bayesian Moment Matching for Sum-Product Networks
- Bayesian learning + method of moments
- An online, distributed and tractable algorithm for SPNs
- Approximate the mixture of products of Dirichlets by a single product of Dirichlets that matches the first- and second-order moments

Bayesian Moment Matching
Alternate two steps, as sketched below: a Bayes update maps the current product of Dirichlets to a mixture of products of Dirichlets; a moment-matching projection maps it back to a single product of Dirichlets.
[Diagram: product of Dirichlets → (Bayes update) → mixture of products of Dirichlets → (projection) → product of Dirichlets.]
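For a single sum node the projection has a closed form. A sketch, derived from the standard Dirichlet moment formulas (parameter names are mine; the paper's exact choice of matched moments may differ):

```python
import numpy as np

# Moment-matching projection for one sum node (illustrative): collapse a
# mixture sum_k c_k Dir(alpha_k) to the single Dirichlet whose first and
# second moments match. Uses E[w_j] = a_j/s and
# E[w_j^2] = a_j(a_j+1)/(s(s+1)) with s = sum(a).

def match_dirichlet(cs, alphas):
    cs = np.asarray(cs, float)                 # mixture coefficients, sum to 1
    alphas = np.asarray(alphas, float)         # shape (K, d)
    a0 = alphas.sum(axis=1, keepdims=True)
    m1 = (cs[:, None] * alphas / a0).sum(axis=0)            # mixture E[w_j]
    m2 = (cs[:, None] * alphas * (alphas + 1)
          / (a0 * (a0 + 1))).sum(axis=0)                    # mixture E[w_j^2]
    # Solving E[w_j^2] = m1_j (m1_j s + 1) / (s + 1) for s, coordinate-wise,
    # then averaging (assumes a non-degenerate mixture, m2 > m1^2):
    s = np.mean((m1 - m2) / (m2 - m1 ** 2))
    return m1 * s                              # matched Dirichlet parameters
```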

Results (benchmarks)
[Table: results from Rashwan, Zhao, Poupart (AISTATS 2016).]

Results (large datasets), from Rashwan, Zhao, Poupart (AISTATS 2016):
- Log-likelihood: oBMM and oDMM outperform the other algorithms on 3 (out of 4) problems
- Time (minutes): oDMM is significantly faster
[Tables omitted.]

Structure Estimation
What is the most popular technique to estimate the structure of a deep neural network?
- Parameter estimation: gradient descent
- Structure estimation: graduate student descent
State of the art: evolutionary techniques, hyperparameter search.

Structure Estimation in SPNs
LearnSPN (Gens & Domingos, 2013) recursively splits the data matrix (instances × attributes), alternating between (see the sketch below):
- Data clustering: sum nodes
- Variable partition (independence testing): product nodes
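A high-level sketch of the recursion, reusing the Sum and Product classes from the inference sketch; `make_leaf`, `partition_variables`, and `cluster_instances` are hypothetical helpers standing in for LearnSPN's independence tests and clustering subroutines:

```python
# Simplified LearnSPN-style recursion (illustrative; not the authors' code).
# make_leaf, partition_variables and cluster_instances are hypothetical helpers.

def learn_spn(data, variables):
    if len(variables) == 1:
        return make_leaf(data, variables[0])          # univariate distribution
    groups = partition_variables(data, variables)     # independence testing
    if len(groups) > 1:
        # Approximately independent groups of variables -> product node.
        return Product([learn_spn(data, group) for group in groups])
    clusters = cluster_instances(data)                # e.g. EM / k-means
    weights = [len(c) / len(data) for c in clusters]
    # Similar groups of instances -> sum node with empirical mixture weights.
    return Sum([learn_spn(c, variables) for c in clusters], weights)
```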

Improved Structure Estimation
Prometheus (Jaini, Ghose et al., 2017) alternates between:
- Data clustering: sum nodes
- Multiple variable partitions: product nodes

Results (log-likelihood)
[Figure: results from Jaini, Ghose and Poupart (2017) on the MNIST dataset.]

Conclusion
Sum-Product Networks:
- Deep architecture with clear semantics
- Tractable probabilistic graphical model
Related work:
- Decision SPNs (Melibari et al., AAAI-2016)
- Dynamic (recurrent) SPNs (Melibari et al., PGM-2016)
Future work:
- PyTorch library for SPNs
- SPNs for conversational agents