Online Algorithms for Sum-Product Networks with Continuous Variables


Online Algorithms for Sum-Product Networks with Continuous Variables. Priyank Jaini, Ph.D. Seminar.

[Outline diagram] Learning mixture models: consistent/robust methods (tensor decomposition and spectral learning) are offline; online Bayesian methods (ADF, EP, SGD, oEM) suffer from non-convergence and local optima; exact Bayesian learning can be distributed but has practical problems. This talk: the Bayesian Moment Matching algorithm, its extension to continuous SPNs, and a comparison with other deep networks.

Streaming Data: activity recognition, recommendation. Challenge: update the model after each observation.

Algorithm's characteristics
- Online: processes one observation x_t at a time
- Distributed: the data x_{a:d} can be split into x_{a:b} and x_{c:d} and processed separately
- Consistent: estimates based on x_{1:t} converge to the true parameters Θ*

How can we learn mixture models robustly from streaming data?

Learning Algorithms
- Robust: Tensor Decomposition (Anandkumar et al., 2014), Spectral Learning (Hsu et al., 2012; Parikh and Xing, 2011); offline
- Online: Assumed Density Filtering (Maybeck 1982; Lauritzen 1992; Opper & Winther 1999); not robust
- Expectation Propagation (Minka 2001); does not converge
- Stochastic Gradient Descent (Zhang 2004) and online Expectation Maximization (Cappé 2012); SGD and oEM get stuck in local optima and cannot be distributed

Learning Algorithms
- Exact Bayesian Learning: Dirichlet Mixtures (Ghosal et al., 1999), Gaussian Mixtures (Lijoi et al., 2005), Non-parametric Problems (Barron et al., 1999; Freedman, 1999)
- Exact Bayesian learning is online, distributed and consistent in theory, but has practical problems!

Thomas Bayes (c. 1701-1761)

Bayesian Learning uses Bayes' theorem: P(Θ | x) = P(Θ) P(x | Θ) / P(x)
[Diagram] The prior belief P(Θ) is combined with new information x through Bayes' theorem to give the updated belief P(Θ | x) ∝ P(Θ) P(x | Θ).

Bayesian Learning for Mixture Models
Data: x_{1:n}, where x_i ~ Σ_{j=1}^{M} w_j N(x_i; μ_j, Σ_j)
Posterior update: P_n(Θ) = Pr(Θ | x_{1:n}) ∝ P_{n-1}(Θ) Pr(x_n | Θ) ∝ P_{n-1}(Θ) Σ_{j=1}^{M} w_j N(x_n; μ_j, Σ_j)
Intractable: each update multiplies the number of posterior terms by M, so after n observations the posterior is a mixture of M^n terms.
Solution: the Bayesian Moment Matching algorithm.
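A short worked expansion (assuming M = 2 components and a prior P_0 consisting of a single term) makes the blow-up explicit:

```latex
\begin{align*}
P_1(\Theta) &\propto P_0(\Theta)\,\bigl[w_1\,\mathcal{N}(x_1;\mu_1,\Sigma_1) + w_2\,\mathcal{N}(x_1;\mu_2,\Sigma_2)\bigr] \\
P_2(\Theta) &\propto P_1(\Theta)\,\bigl[w_1\,\mathcal{N}(x_2;\mu_1,\Sigma_1) + w_2\,\mathcal{N}(x_2;\mu_2,\Sigma_2)\bigr] \\
            &\propto P_0(\Theta)\sum_{j_1=1}^{2}\sum_{j_2=1}^{2}
               w_{j_1} w_{j_2}\,\mathcal{N}(x_1;\mu_{j_1},\Sigma_{j_1})\,\mathcal{N}(x_2;\mu_{j_2},\Sigma_{j_2})
\end{align*}
% After n observations: 2^n (in general M^n) weighted terms.
```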

Karl Pearson (1857-1936)

Method of Moments
- Probability distributions are defined by a set of parameters
- Parameters can be estimated from a set of moments
Example: X ~ N(X; μ, σ²), with E[X] = μ and E[(X - μ)²] = σ²
Karl Pearson (1857-1936)
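A minimal sketch of the idea for a single Gaussian (plain frequentist moment estimation, not yet the Bayesian version used later; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # samples from N(mu = 2, sigma = 3)

# Method of moments: match E[X] and E[(X - mu)^2] against their empirical counterparts.
mu_hat = x.mean()                          # estimate of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()    # estimate of sigma^2

print(mu_hat, np.sqrt(sigma2_hat))         # roughly 2.0 and 3.0
```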

Make Bayesian Learning Great Again: Bayesian Learning + Method of Moments

Gaussian Mixture Models: x_i ~ Σ_{j=1}^{M} w_j N(x_i; μ_j, Σ_j)
Parameters: weights, means and precisions (inverse covariance matrices)

Bayesian Moment Matching for Gaussian Mixture Models
Bayes' theorem: P(Θ | x) = P(Θ) P(x | Θ) / P(x)
Parameters: weights, means and precisions (inverse covariance matrices)
Prior: P(w, μ, Λ), a product of Dirichlets and Normal-Wisharts
Likelihood: P(x; w, μ, Λ) = Σ_{j=1}^{M} w_j N(x; μ_j, Λ_j^{-1})

Bayesian Moment Matching Algorithm
[Diagram] An exact Bayesian update on observation x_t turns a product of Dirichlets and Normal-Wisharts into a mixture of products of Dirichlets and Normal-Wisharts; a moment-matching projection maps it back to a single product of Dirichlets and Normal-Wisharts.
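A sketch of why the exact update produces that mixture, assuming an M-component Gaussian mixture likelihood and a prior that is a single product of a Dirichlet and Normal-Wisharts (the primed hyperparameters are the standard Normal-Wishart conjugate updates for observation x_t, and c_j collects the constants):

```latex
\begin{align*}
P(w,\mu,\Lambda \mid x_t)
 &\propto \mathrm{Dir}(w;\alpha)\prod_{k=1}^{M}\mathrm{NW}(\mu_k,\Lambda_k;\mu_{0k},\kappa_k,W_k,v_k)
    \;\sum_{j=1}^{M} w_j\,\mathcal{N}\!\bigl(x_t;\mu_j,\Lambda_j^{-1}\bigr) \\
 &= \sum_{j=1}^{M} c_j\;\mathrm{Dir}(w;\alpha+e_j)\;
    \mathrm{NW}\!\bigl(\mu_j,\Lambda_j;\mu_{0j}',\kappa_j',W_j',v_j'\bigr)
    \prod_{k\neq j}\mathrm{NW}(\mu_k,\Lambda_k;\mu_{0k},\kappa_k,W_k,v_k)
\end{align*}
```

Each observation multiplies the number of mixture terms by M, which is exactly why the projection step is needed to keep the representation compact.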

Sufficient Moments
Dirichlet: Dir(w_1, ..., w_M; α_1, ..., α_M)
  E[w_i] = α_i / Σ_j α_j ;  E[w_i²] = α_i (α_i + 1) / [(Σ_j α_j)(1 + Σ_j α_j)]
Normal-Wishart: NW(μ, Λ; μ_0, κ, W, v), i.e. Λ ~ Wi(W, v) and μ | Λ ~ N_d(μ_0, (κΛ)^{-1})
  E[μ] = μ_0
  E[(μ - μ_0)(μ - μ_0)^T] = (κ + 1) / (κ(v - d - 1)) W^{-1}
  E[Λ] = vW
  Var[Λ_ij] = v(W_ij² + W_ii W_jj)
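A small sketch of the projection for the Dirichlet part only (the Normal-Wishart part is analogous); it matches E[w_i] and E[w_i²] and solves for the α's. Function names are illustrative, not from the paper's code:

```python
import numpy as np

def dirichlet_moments(alpha):
    """Sufficient moments of Dir(alpha): E[w_i] and E[w_i^2]."""
    s = alpha.sum()
    m1 = alpha / s
    m2 = alpha * (alpha + 1) / (s * (s + 1))
    return m1, m2

def dirichlet_from_moments(m1, m2):
    """Projection: find the Dir(alpha) whose first two moments match (m1, m2).

    From E[w_i] = alpha_i / S and E[w_i^2] = alpha_i (alpha_i + 1) / (S (S + 1)),
    solving the pair for each i gives S = (m1_i - m2_i) / (m2_i - m1_i^2);
    averaging over i adds numerical robustness. Then alpha_i = m1_i * S.
    """
    s = np.mean((m1 - m2) / (m2 - m1 ** 2))
    return m1 * s

alpha = np.array([2.0, 5.0, 3.0])
m1, m2 = dirichlet_moments(alpha)
print(dirichlet_from_moments(m1, m2))  # recovers [2. 5. 3.]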

Overall Algorithm
- Bayesian step: compute the posterior P_t(Θ) based on observation x_t
- Sufficient moments: compute the set of sufficient moments S of P_t(Θ)
- Moment matching: solve a system of linear equations; linear complexity in the number of components
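Putting the three steps together for the mixture weights only, a minimal sketch (it assumes the component means and variances are fixed and known, which the full algorithm does not, and it reuses dirichlet_moments and dirichlet_from_moments from the previous sketch):

```python
import numpy as np
from scipy.stats import norm

def bmm_weight_update(alpha, x, mus, sigmas):
    """One oBMM step for the weights of a 1-D Gaussian mixture with known components.

    Exact posterior: sum_j c_j Dir(alpha + e_j), where c_j is proportional to
    (alpha_j / S) * N(x; mu_j, sigma_j), because w_j Dir(w; alpha) =
    (alpha_j / S) Dir(w; alpha + e_j). The mixture is then projected back to a
    single Dirichlet by matching E[w_i] and E[w_i^2].
    """
    S = alpha.sum()
    c = (alpha / S) * norm.pdf(x, loc=mus, scale=sigmas)
    c /= c.sum()

    m1 = np.zeros_like(alpha)
    m2 = np.zeros_like(alpha)
    for j, c_j in enumerate(c):
        a_j = alpha.copy()
        a_j[j] += 1.0
        m1_j, m2_j = dirichlet_moments(a_j)   # helper from the previous sketch
        m1 += c_j * m1_j
        m2 += c_j * m2_j

    return dirichlet_from_moments(m1, m2)     # helper from the previous sketch

rng = np.random.default_rng(1)
alpha = np.ones(2)                            # uniform prior over two components
mus, sigmas = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
for x in rng.normal(2.0, 1.0, size=200):      # stream drawn from the second component
    alpha = bmm_weight_update(alpha, x, mus, sigmas)
print(alpha / alpha.sum())                    # posterior mean weights favour component 2
```

The posterior update touches each observation once, so the cost per observation is constant with respect to the length of the stream.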

Make Bayesian Learning Great Again
Bayesian Moment Matching Algorithm
- Uses Bayes' theorem + method of moments
- Analytic solutions to moment matching (unlike EP, ADF)
- One pass over the data
Bayesian Moment Matching is online and distributed. Consistent?

Experiments

Data Set      Instances   oEM        oBMM
Abalone       4177        -2.65      -1.82
Banknote      1372        -9.74      -9.65
Airfoil       1503        -15.86     -16.53
Arabic        8800        -15.83     -14.99
Transfusion   748         -13.26     -13.09
CCPP          9568        -16.53     -16.51
Comp. Ac      8192        -132.04    -118.82
Kinematics    8192        -10.37     -10.32
Northridge    2929        -18.31     -17.97
Plastic       1650        -9.46      -9.01

Experiments

Average log-likelihood:
Data (Features)      Instances   oEM      oBMM     oDMM
Heterogeneity (16)   3,930,257   -176.2   -174.3   -180.7
Magic (10)           19,000      -33.4    -32.1    -35.4
Year MSD (91)        515,345     -513.7   -506.5   -513.8
Miniboone (50)       130,064     -58.1    -54.7    -60.3

Running time:
Data (Features)      Instances   oEM     oBMM    oDMM
Heterogeneity (16)   3,930,257   77.3    81.7    17.5
Magic (10)           19,000      7.3     6.8     1.4
Year MSD (91)        515,345     336.5   108.2   21.2
Miniboone (50)       130,064     48.6    12.1    2.3

Bayesian Moment Matching
- Discrete data: Omar (2015, PhD thesis) for Dirichlets; Rashwan, Zhao & Poupart (AISTATS 2016) for SPNs; Hsu & Poupart (NIPS 2016) for topic modelling
- Continuous data: Jaini & Poupart (2016, arXiv); Jaini, Rashwan et al. (PGM 2016) for SPNs; Poupart, Chen, Jaini et al. (NetworksML 2016)
- Sequence data and transfer learning: Jaini, Poupart et al. (submitted to ICLR 2017)

[Outline diagram, revisited] Mixture models: consistent/robust methods (tensor decomposition and spectral learning) are offline; online methods (ADF, EP, SGD, oEM) suffer from non-convergence and local optima; exact Bayesian learning can be distributed but has practical problems; the Bayesian Moment Matching algorithm is online and distributed (consistent?). Next: Bayesian Moment Matching for continuous SPNs and a comparison with other deep networks.

What is a Sum-Product Network?
- Proposed by Poon and Domingos (UAI 2011); equivalent to Arithmetic Circuits (Darwiche 2003)
- Deep architecture with clear semantics
- Tractable probabilistic graphical models

What is a Sum-Product Network?
Network polynomial over indicator variables, e.g. for two binary variables:
P(X_1, X_2) = P(x_1, x_2) x_1 x_2 + P(x_1, ¬x_2) x_1 ¬x_2 + P(¬x_1, x_2) ¬x_1 x_2 + P(¬x_1, ¬x_2) ¬x_1 ¬x_2
[Figure: a small SPN whose root sum node has weights 0.4 and 0.6 over product nodes, with sum nodes weighted 0.5/0.5 over the indicators x_1, ¬x_1, x_2, ¬x_2]
SPNs are directed acyclic graphs of sum and product nodes. A valid SPN is
- Complete: each sum node has children with the same scope
- Decomposable: each product node has children with disjoint scopes
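A small sketch of what the two conditions mean operationally, under an illustrative tuple-based node representation (not the notation of the slides): compute each node's scope bottom-up and check it at every internal node.

```python
# Minimal validity check for an SPN given as nested tuples:
# ("leaf", var), ("sum", [children]), or ("product", [children]).
def scope(node):
    if node[0] == "leaf":
        return frozenset([node[1]])
    return frozenset().union(*(scope(c) for c in node[1]))

def is_valid(node):
    if node[0] == "leaf":
        return True
    children = node[1]
    if not all(is_valid(c) for c in children):
        return False
    scopes = [scope(c) for c in children]
    if node[0] == "sum":       # completeness: identical scopes
        return all(s == scopes[0] for s in scopes)
    if node[0] == "product":   # decomposability: pairwise disjoint scopes
        return sum(len(s) for s in scopes) == len(frozenset().union(*scopes))
    return False

leaf = lambda v: ("leaf", v)
spn = ("sum", [("product", [leaf("x1"), leaf("x2")]),
               ("product", [leaf("x1"), leaf("x2")])])
print(is_valid(spn))  # True: sums are complete and products are decomposable
```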

Probabilistic Inference in an SPN
An SPN represents a joint distribution over a set of random variables.
Example query: Pr(X_1 = 1, X_2 = 0).
- First bottom-up pass with the evidence indicators set: the root evaluates to 34.8 (unnormalized).
- Second bottom-up pass with all indicators set to 1: the root evaluates to the normalization constant, 100.
- Pr(X_1 = 1, X_2 = 0) = 34.8 / 100.
Linear complexity: two bottom-up passes answer any query.
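A sketch of the two bottom-up passes on a small SPN over two binary variables; the structure and weights are illustrative (the 34.8/100 numbers on the slide come from a network with unnormalized weights, so the values here differ):

```python
import math

# Nodes: ("ind", var, val) is an indicator, ("sum", [(w, child), ...]) a weighted sum,
# ("prod", [children]) a product. Evaluation is a single bottom-up pass.
def evaluate(node, evidence):
    kind = node[0]
    if kind == "ind":
        _, var, val = node
        if var not in evidence:              # marginalized variable: indicator set to 1
            return 1.0
        return 1.0 if evidence[var] == val else 0.0
    if kind == "sum":
        return sum(w * evaluate(c, evidence) for w, c in node[1])
    return math.prod(evaluate(c, evidence) for c in node[1])

# Illustrative SPN over X1 and X2 (weights loosely follow the figure).
s1 = ("sum", [(0.5, ("ind", "X1", 1)), (0.5, ("ind", "X1", 0))])
s2 = ("sum", [(0.5, ("ind", "X2", 1)), (0.5, ("ind", "X2", 0))])
root = ("sum", [(0.4, ("prod", [s1, s2])), (0.6, ("prod", [s1, s2]))])

value = evaluate(root, {"X1": 1, "X2": 0})   # pass 1: evidence indicators set
Z = evaluate(root, {})                        # pass 2: all indicators set to 1
print(value / Z)                              # Pr(X1 = 1, X2 = 0)
```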

Learning Discrete SPNs
Parameter Learning:
- Maximum likelihood: SGD (Poon & Domingos, 2011), slow convergence, inaccurate; EM (Peharz, 2015), inaccurate; Signomial Programming (Zhao & Poupart, 2016)
- Bayesian learning: BMM (Rashwan et al., 2016), accurate; Collapsed Variational Inference (Zhao et al., 2016), accurate
Online parameter learning for continuous SPNs (Jaini et al., 2016): extend oBMM to Gaussian SPNs

Continuous SPNs
Leaves are Gaussian distributions N_1(X_1), N_2(X_1), N_3(X_2), N_4(X_2); the SPN represents an (unnormalized) hierarchical mixture of Gaussians.

Bayesian Learning of SPNs
[Figure: an SPN with sum-node weights w_1, ..., w_8 and Gaussian leaves N(X_1; μ_1, Λ_1^{-1}), N(X_1; μ_2, Λ_2^{-1}), N(X_2; μ_3, Λ_3^{-1}), N(X_2; μ_4, Λ_4^{-1})]
Parameters: weights, means and precisions (inverse covariance matrices)
Prior: P(w, μ, Λ), a product of Dirichlets and Normal-Wisharts
Likelihood: P(x; w, μ, Λ) = SPN(x; w, μ, Λ)
Posterior: P(w, μ, Λ | data) ∝ P(w, μ, Λ) ∏_n SPN(x_n; w, μ, Λ)

Bayesian Moment Matching Algorithm
[Diagram, as before] An exact Bayesian update on observation x_t turns a product of Dirichlets and Normal-Wisharts into a mixture of products of Dirichlets and Normal-Wisharts; a moment-matching projection maps it back.

Overall Algorithm
- Recursive computation of all constants: linear complexity in the size of the SPN
- Moment matching: a system of linear equations; linear complexity in the size of the SPN
- Streaming data: the posterior update takes constant time with respect to the amount of data

Empirical Results
[Table: average log-likelihood and standard error based on 10-fold cross-validation for oBMM and oEM, each with random and GMM initialization, on Flow Size, Quake, Banknote, Abalone, Kinematics, CA, and Sensorless Drive (3 to 48 attributes)]
Takeaway: oBMM performs better than oEM.

Empirical Results
[Table: average log-likelihood and standard error based on 10-fold cross-validation for oBMM (random and GMM initializations), SRBM, and GenMMN on the same datasets]
Takeaway: oBMM is competitive with SRBM and GenMMN.

[Outline diagram, revisited] Consistent/robust methods (tensor decomposition and spectral learning) are offline; online methods (ADF, EP, SGD, oEM) suffer from non-convergence and local optima; exact Bayesian learning can be distributed but has practical problems; the Bayesian Moment Matching algorithm is online and distributed (consistent?). For continuous SPNs, oBMM performs better than oEM and is comparable to other techniques.

Conclusion and Future Work
Contributions:
- Online and distributed Bayesian Moment Matching algorithm
- Performs better than oEM in both time and accuracy
- Extended to continuous SPNs
- Comparative analysis with other deep learning methods
Future work:
- Theoretical properties of BMM: is it consistent?
- Generalize oBMM to the exponential family
- Extension to sequential data and transfer learning
- Online structure learning of SPNs

Thank you!