Online Algorithms for Sum-Product
1 Online Algorithms for Sum-Product Networks with Continuous Variables Priyank Jaini Ph.D. Seminar
2 Overview: Tensor Decomposition and Spectral Learning (consistent/robust, but offline); ADF, EP, SGD, oem (online, but non-convergence and local optima); Bayesian Learning (can be distributed, but practical problems); Bayesian Moment Matching algorithm for Mixture Models and SPNs: Bayesian Moment Matching for continuous SPNs, comparison with other deep networks
3 Streaming Data: Activity Recognition, Recommendation
4 Streaming Data: Activity Recognition, Recommendation. Challenge: update the model after each observation
5 Algorithm's characteristics
6 Algorithm's characteristics: Online (x_t)
7 Algorithm's characteristics: Online (x_t); Distributed (x_{a:d} split as x_{a:b} and x_{c:d})
8 Algorithm's characteristics: Online (x_t); Distributed (x_{a:d} split as x_{a:b} and x_{c:d}); Consistent (x_{1:t} → Θ*)
9 How can we learn mixture models robustly from streaming data?
10 Learning Algorithms
11 Learning Algorithms Robust: Tensor Decomposition (Anandkumar et al., 2014), Spectral Learning (Hsu et al., 2012; Parikh and Xing, 2011); offline
12 Learning Algorithms Robust: Tensor Decomposition (Anandkumar et al., 2014), Spectral Learning (Hsu et al., 2012; Parikh and Xing, 2011); offline Online: Assumed Density Filtering (Maybeck 1982; Lauritzen 1992; Opper & Winther 1999); not robust
13 Learning Algorithms Robust: Tensor Decomposition (Anandkumar et al., 2014), Spectral Learning (Hsu et al., 2012; Parikh and Xing, 2011); offline Online: Assumed Density Filtering (Maybeck 1982; Lauritzen 1992; Opper & Winther 1999); not robust Expectation Propagation (Minka 2001); does not converge
14 Learning Algorithms Robust: Tensor Decomposition (Anandkumar et al., 2014), Spectral Learning (Hsu et al., 2012; Parikh and Xing, 2011); offline Online: Assumed Density Filtering (Maybeck 1982; Lauritzen 1992; Opper & Winther 1999); not robust Expectation Propagation (Minka 2001); does not converge Stochastic Gradient Descent (Zhang 2004), online Expectation Maximization (Cappe 2012); SGD and oem: local optima and cannot be distributed
15 Learning Algorithms Exact Bayesian Learning: Dirichlet Mixtures (Ghosal et al., 1999), Gaussian Mixtures (Lijoi et al., 2005), Non-parametric Problems (Barron et al., 1999; Freedman, 1999)
16 Learning Algorithms Exact Bayesian Learning: Dirichlet Mixtures (Ghosal et al., 1999), Gaussian Mixtures (Lijoi et al., 2005), Non-parametric Problems (Barron et al., 1999; Freedman, 1999) Exact Bayesian Learning: Online
17 Learning Algorithms Exact Bayesian Learning: Dirichlet Mixtures (Ghosal et al., 1999), Gaussian Mixtures (Lijoi et al., 2005), Non-parametric Problems (Barron et al., 1999; Freedman, 1999) Exact Bayesian Learning: Online, Distributed
18 Learning Algorithms Exact Bayesian Learning: Dirichlet Mixtures (Ghosal et al., 1999), Gaussian Mixtures (Lijoi et al., 2005), Non-parametric Problems (Barron et al., 1999; Freedman, 1999) Exact Bayesian Learning: Online, Distributed, Consistent
19 Learning Algorithms Exact Bayesian Learning: Dirichlet Mixtures (Ghosal et al., 1999), Gaussian Mixtures (Lijoi et al., 2005), Non-parametric Problems (Barron et al., 1999; Freedman, 1999) Exact Bayesian Learning: Online, Distributed, Consistent. In theory; practical problems!
20
21 Thomas Bayes (c. 1701-1761)
22 Bayesian Learning Uses Bayes' Theorem: P(Θ|x) = P(Θ)P(x|Θ) / P(x). Thomas Bayes (c. 1701-1761). Prior Belief P(Θ) + New Information x, combined via Bayes' Theorem P(Θ)P(x|Θ) / P(x), gives the Updated Belief P(Θ|x)
23 Bayesian Learning Mixture models
24 Bayesian Learning Mixture models Data: x_{1:n} where x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j)
25 Bayesian Learning Mixture models Data: x_{1:n} where x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j) P_n(Θ) = Pr(Θ | x_{1:n})
26 Bayesian Learning Mixture models Data: x_{1:n} where x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j) P_n(Θ) = Pr(Θ | x_{1:n}) ∝ P_{n-1}(Θ) Pr(x_n | Θ)
27 Bayesian Learning Mixture models Data: x_{1:n} where x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j) P_n(Θ) = Pr(Θ | x_{1:n}) ∝ P_{n-1}(Θ) Pr(x_n | Θ) ∝ P_{n-1}(Θ) Σ_{j=1}^{M} w_j N(x_n; µ_j, Σ_j)
28 Bayesian Learning Mixture models Data: x_{1:n} where x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j) P_n(Θ) = Pr(Θ | x_{1:n}) ∝ P_{n-1}(Θ) Pr(x_n | Θ) ∝ P_{n-1}(Θ) Σ_{j=1}^{M} w_j N(x_n; µ_j, Σ_j) Intractable!!!
29 Bayesian Learning Mixture models Data: x_{1:n} where x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j) P_n(Θ) = Pr(Θ | x_{1:n}) ∝ P_{n-1}(Θ) Pr(x_n | Θ) ∝ P_{n-1}(Θ) Σ_{j=1}^{M} w_j N(x_n; µ_j, Σ_j) Intractable!!! Solution: Bayesian Moment Matching Algorithm
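A quick way to see why the exact posterior is intractable: each Bayes update multiplies the current posterior by a sum of M likelihood terms, so every term of the posterior splits into M new ones. A minimal sketch of the term count:

```python
# Sketch: the exact posterior of a mixture model after n observations is a
# mixture with M**n terms, since each update multiplies by a sum of M terms.
M, n = 3, 20
terms = 1
for _ in range(n):
    terms *= M  # every existing posterior term splits into M new ones
print(terms)  # 3**20 = 3486784401 terms after only 20 observations
```

This exponential growth is what the moment-matching projection step avoids.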
30
31 Karl Pearson (1857-1936)
32 Method of Moments Probability distributions are defined by a set of parameters; the parameters can be estimated from a set of moments. Karl Pearson (1857-1936). X ~ N(X; µ, σ²): E[X] = µ, E[(X - µ)²] = σ²
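As a minimal illustration of the idea (the sample values below are simulated, not from the talk): a single Gaussian's parameters are recovered from its first two empirical moments.

```python
import random

# Method-of-moments sketch for a single Gaussian: mu and sigma^2 are
# recovered from estimates of E[X] and E[(X - mu)^2].
random.seed(0)
mu_true, sigma_true = 2.0, 0.5
xs = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]

mu_hat = sum(xs) / len(xs)                               # estimate of E[X]
var_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)   # estimate of E[(X - mu)^2]
print(round(mu_hat, 2), round(var_hat, 2))  # close to 2.0 and 0.25
```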
33 Make Bayesian Learning Great Again
34 Make Bayesian Learning Great Again Bayesian Learning And Method of Moments
35 Make Bayesian Learning Great Again Bayesian Learning And Method of Moments
36 Gaussian Mixture Models x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j)
37 Gaussian Mixture Models x_i ~ Σ_{j=1}^{M} w_j N(x_i; µ_j, Σ_j) Parameters: weights, means and precisions (inverse covariance matrices)
38 Bayesian Moment Matching for Gaussian Mixture Models
39 Bayesian Moment Matching for Gaussian Mixture Models P(Θ|x) = P(Θ)P(x|Θ) / P(x). Parameters: weights, means and precisions (inverse covariance matrices)
40 Bayesian Moment Matching for Gaussian Mixture Models P(Θ|x) = P(Θ)P(x|Θ) / P(x). Parameters: weights, means and precisions (inverse covariance matrices). Prior: P(w, µ, Λ); a product of Dirichlets and Normal-Wisharts
41 Bayesian Moment Matching for Gaussian Mixture Models P(Θ|x) = P(Θ)P(x|Θ) / P(x). Parameters: weights, means and precisions (inverse covariance matrices). Prior: P(w, µ, Λ); a product of Dirichlets and Normal-Wisharts. Likelihood: P(x; w, µ, Λ) = Σ_{j=1}^{M} w_j N(x; µ_j, Λ_j⁻¹)
42 Bayesian Moment Matching Algorithm
43 Bayesian Moment Matching Algorithm Exact Bayesian Update on x_t: Product of Dirichlets and Normal-Wisharts → Mixture of Products of Dirichlets and Normal-Wisharts
44 Bayesian Moment Matching Algorithm Exact Bayesian Update on x_t: Product of Dirichlets and Normal-Wisharts → Mixture of Products of Dirichlets and Normal-Wisharts; Projection via Moment Matching → back to a single Product of Dirichlets and Normal-Wisharts
45 Sufficient Moments
46 Sufficient Moments Dirichlet: Dir(w_1, w_2, …, w_M; α_1, α_2, …, α_M)
47 Sufficient Moments Dirichlet: Dir(w_1, w_2, …, w_M; α_1, α_2, …, α_M) E[w_i] = α_i / Σ_j α_j ; E[w_i²] = α_i(α_i + 1) / ((Σ_j α_j)(1 + Σ_j α_j))
48 Sufficient Moments Dirichlet: Dir(w_1, w_2, …, w_M; α_1, α_2, …, α_M) E[w_i] = α_i / Σ_j α_j ; E[w_i²] = α_i(α_i + 1) / ((Σ_j α_j)(1 + Σ_j α_j)) Normal-Wishart: NW(µ, Λ; µ_0, κ, W, v), where Λ ~ Wi(W, v) and µ | Λ ~ N_d(µ_0, (κΛ)⁻¹)
49 Sufficient Moments Dirichlet: Dir(w_1, w_2, …, w_M; α_1, α_2, …, α_M) E[w_i] = α_i / Σ_j α_j ; E[w_i²] = α_i(α_i + 1) / ((Σ_j α_j)(1 + Σ_j α_j)) Normal-Wishart: NW(µ, Λ; µ_0, κ, W, v), where Λ ~ Wi(W, v) and µ | Λ ~ N_d(µ_0, (κΛ)⁻¹) E[µ] = µ_0 ; E[(µ - µ_0)(µ - µ_0)^T] = ((κ + 1) / (κ(v - d - 1))) W⁻¹ ; E[Λ] = vW ; Var(Λ_ij) = v(W_ij² + W_ii W_jj)
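The Dirichlet sufficient moments can also be inverted in closed form, which is exactly what the projection step needs: from E[w_i] and E[w_i²] one recovers α_0 = (m1 - m2)/(m2 - m1²) from any single component, then α_i = m1_i·α_0. A sketch (function names are my own):

```python
# Forward map: sufficient moments of Dir(alpha); inverse map: recover alpha
# from those moments (the moment-matching step for the Dirichlet factor).

def dirichlet_moments(alpha):
    a0 = sum(alpha)
    m1 = [a / a0 for a in alpha]                             # E[w_i]
    m2 = [a * (a + 1) / (a0 * (a0 + 1)) for a in alpha]      # E[w_i^2]
    return m1, m2

def match_dirichlet(m1, m2):
    # alpha_0 from the first component, then alpha_i = m1_i * alpha_0
    a0 = (m1[0] - m2[0]) / (m2[0] - m1[0] ** 2)
    return [m * a0 for m in m1]

alpha = [2.0, 3.0, 5.0]
m1, m2 = dirichlet_moments(alpha)
recovered = match_dirichlet(m1, m2)
print([round(a, 6) for a in recovered])  # recovers [2.0, 3.0, 5.0]
```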
50 Overall Algorithm Bayesian Step: compute the posterior P_t(Θ) based on observation x_t. Sufficient Moments: compute the set of sufficient moments S for P_t(Θ). Moment Matching: a system of linear equations; linear complexity in the number of components
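The overall loop can be sketched end to end for a toy case. The sketch below (all names are my own) learns only the mixture weights of a 1-D two-component Gaussian mixture with known, fixed components; the full algorithm also updates means and precisions through the Normal-Wishart part of the prior. Per observation, the exact Bayes step turns Dir(α) into a mixture of M Dirichlets Dir(α + e_j), and the projection step matches E[w_i] and E[w_i²] back to a single Dirichlet.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bmm_weight_update(alpha, x, mus, sigmas):
    M, a0 = len(alpha), sum(alpha)
    # mixture coefficients of the exact posterior (one Dirichlet per component)
    c = [(alpha[j] / a0) * normal_pdf(x, mus[j], sigmas[j]) for j in range(M)]
    z = sum(c)
    c = [cj / z for cj in c]
    # sufficient moments E[w_i], E[w_i^2] of the mixture of Dirichlets
    m1, m2 = [], []
    for i in range(M):
        m1.append(sum(c[j] * (alpha[i] + (i == j)) / (a0 + 1) for j in range(M)))
        m2.append(sum(c[j] * (alpha[i] + (i == j)) * (alpha[i] + (i == j) + 1)
                      / ((a0 + 1) * (a0 + 2)) for j in range(M)))
    # projection: the single Dirichlet with the same first two moments
    new_a0 = (m1[0] - m2[0]) / (m2[0] - m1[0] ** 2)
    return [mi * new_a0 for mi in m1]

random.seed(1)
mus, sigmas, w_true = [-2.0, 2.0], [1.0, 1.0], [0.3, 0.7]
alpha = [1.0, 1.0]  # uniform prior over the two weights
for _ in range(5000):
    j = 0 if random.random() < w_true[0] else 1
    alpha = bmm_weight_update(alpha, random.gauss(mus[j], sigmas[j]), mus, sigmas)
w_hat = [a / sum(alpha) for a in alpha]
print([round(w, 2) for w in w_hat])  # close to the true weights [0.3, 0.7]
```

One pass over the stream, constant work per observation, and no iteration over past data.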
51 Make Bayesian Learning Great Again Bayesian Moment Matching Algorithm - Uses Bayes Theorem + Method of Moments - Analytic solutions to Moment matching (unlike EP, ADF) - One pass over data
52 Make Bayesian Learning Great Again Bayesian Moment Matching Algorithm - Uses Bayes Theorem + Method of Moments - Analytic solutions to moment matching (unlike EP, ADF) - One pass over data. Bayesian Moment Matching: Online, Distributed, Consistent?
53 Experiments [Table: average log-likelihood of oem vs. obmm on Abalone, Banknote, Airfoil, Arabic, Transfusion, CCPP, Comp. Act., Kinematics, Northridge and Plastic; the numeric values did not survive transcription]
54 Experiments [Tables: average log-likelihood and running time of oem, obmm and odmm on Heterogeneity (16 features), Magic (10), Year MSD (91) and Miniboone (50); the instance counts and numeric values did not survive transcription]
55 Bayesian Moment Matching Discrete Data: Omar (2015, PhD thesis) for Dirichlets; Rashwan, Zhao & Poupart (AISTATS 2016) for SPNs; Hsu & Poupart (NIPS 2016) for Topic Modelling. Continuous Data: Jaini & Poupart, 2016 (arXiv); Jaini, Rashwan et al. (PGM 2016) for SPNs; Poupart, Chen, Jaini et al. (NetworksML 2016). Sequence Data and Transfer Learning: Jaini, Poupart et al. (submitted to ICLR 2017)
56 Recap: Tensor Decomposition and Spectral Learning (consistent/robust, but offline); ADF, EP, SGD, oem (online, but non-convergence and local optima); Bayesian Learning (can be distributed, but practical problems); Bayesian Moment Matching algorithm for Mixture Models (consistent?). Next, SPNs: Bayesian Moment Matching for continuous SPNs, comparison with other deep networks
57 What is a Sum-Product Network? Proposed by Poon and Domingos (UAI 2011); equivalent to Arithmetic Circuits (Darwiche 2003). A deep architecture with clear semantics. Tractable Probabilistic Graphical Models
58 What is a Sum-Product Network? Network polynomial: P(X1, X2) = P(x1, x2)·x1·x2 + P(x1, x̄2)·x1·x̄2 + P(x̄1, x2)·x̄1·x2 + P(x̄1, x̄2)·x̄1·x̄2, with indicator leaves x1, x̄1, x2, x̄2
59 What is a Sum-Product Network? Network polynomial: P(X1, X2) = P(x1, x2)·x1·x2 + P(x1, x̄2)·x1·x̄2 + P(x̄1, x2)·x̄1·x2 + P(x̄1, x̄2)·x̄1·x̄2, with indicator leaves x1, x̄1, x2, x̄2. SPNs are Directed Acyclic Graphs
60 What is a Sum-Product Network? Network polynomial: P(X1, X2) = P(x1, x2)·x1·x2 + P(x1, x̄2)·x1·x̄2 + P(x̄1, x2)·x̄1·x2 + P(x̄1, x̄2)·x̄1·x̄2, with indicator leaves x1, x̄1, x2, x̄2. SPNs are Directed Acyclic Graphs. A valid SPN is - Complete: each sum node has children with the same scope
61 What is a Sum-Product Network? Network polynomial: P(X1, X2) = P(x1, x2)·x1·x2 + P(x1, x̄2)·x1·x̄2 + P(x̄1, x2)·x̄1·x2 + P(x̄1, x̄2)·x̄1·x̄2, with indicator leaves x1, x̄1, x2, x̄2. SPNs are Directed Acyclic Graphs. A valid SPN is - Complete: each sum node has children with the same scope - Decomposable: each product node has children with disjoint scopes
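The two validity conditions can be checked mechanically on the DAG. A minimal sketch, assuming a toy tuple encoding of nodes (my own, not the paper's representation):

```python
# Nodes: ('leaf', var_name), ('sum', children) or ('product', children).

def scope(node):
    if node[0] == 'leaf':
        return frozenset([node[1]])
    return frozenset().union(*(scope(ch) for ch in node[1]))

def is_valid(node):
    if node[0] == 'leaf':
        return True
    children = node[1]
    if not all(is_valid(ch) for ch in children):
        return False
    if node[0] == 'sum':
        # complete: all children of a sum node share the same scope
        return len({scope(ch) for ch in children}) == 1
    # decomposable: children of a product node have pairwise disjoint scopes
    scopes = [scope(ch) for ch in children]
    return len(frozenset().union(*scopes)) == sum(len(s) for s in scopes)

x1, x2 = ('leaf', 'x1'), ('leaf', 'x2')
good = ('sum', [('product', [x1, x2]), ('product', [x1, x2])])
bad = ('sum', [x1, x2])   # children have different scopes -> not complete
print(is_valid(good), is_valid(bad))  # True False
```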
62 Probabilistic Inference - SPN An SPN represents a joint distribution over a set of random variables. Example query: Pr(X1 = 1, X2 = 0) = 34.8 / normalization constant
63 Probabilistic Inference - SPN An SPN represents a joint distribution over a set of random variables. Example query: Pr(X1 = 1, X2 = 0) = 34.8 / normalization constant. Linear complexity: two bottom-up passes suffice for any query
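A single bottom-up pass evaluates such a query, visiting each node once. The sketch below uses a toy node encoding and illustrative weights (my own assumptions, so it does not reproduce the slide's 34.8 figure):

```python
# Nodes: ('leaf', indicator_name), ('product', [children]),
# or ('sum', [(weight, child), ...]).

def evaluate(node, indicators):
    if node[0] == 'leaf':
        return indicators[node[1]]
    if node[0] == 'product':
        val = 1.0
        for ch in node[1]:
            val *= evaluate(ch, indicators)
        return val
    # weighted sum node
    return sum(w * evaluate(ch, indicators) for w, ch in node[1])

x1, nx1 = ('leaf', 'x1'), ('leaf', 'not_x1')
x2, nx2 = ('leaf', 'x2'), ('leaf', 'not_x2')
spn = ('sum', [
    (0.5, ('product', [('sum', [(0.6, x1), (0.4, nx1)]),
                       ('sum', [(0.3, x2), (0.7, nx2)])])),
    (0.5, ('product', [('sum', [(0.9, x1), (0.1, nx1)]),
                       ('sum', [(0.2, x2), (0.8, nx2)])])),
])

# Query Pr(X1 = 1, X2 = 0): set x1 = 1, not_x1 = 0, x2 = 0, not_x2 = 1
p = evaluate(spn, {'x1': 1, 'not_x1': 0, 'x2': 0, 'not_x2': 1})
print(round(p, 3))  # 0.5*(0.6*0.7) + 0.5*(0.9*0.8) = 0.57
```

A second pass with all indicators set to 1 yields the normalization constant, hence "two bottom-up passes for any query."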
64 Learning SPNs (Discrete) Parameter Learning: - Maximum Likelihood: SGD (Poon & Domingos, 2011): slow convergence, inaccurate; EM (Peharz, 2015): inaccurate; Signomial Programming (Zhao & Poupart, 2016) - Bayesian Learning: BMM (Rashwan et al., 2016): accurate; Collapsed Variational Inference (Zhao et al., 2016): accurate. Online parameter learning for continuous SPNs (Jaini et al., 2016): extend obmm to Gaussian SPNs
65 Continuous SPNs Gaussian leaves: N1(X1), N2(X1), N3(X2), N4(X2)
66 Continuous SPNs Gaussian leaves: N1(X1), N2(X1), N3(X2), N4(X2): a mixture of Gaussians (unnormalised)
67 Continuous SPNs A hierarchical mixture of Gaussians (unnormalised) over the leaves N1(X1), N2(X1), N3(X2), N4(X2)
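The hierarchical form and the flat mixture are the same density: distributing a product node over its sum-node children expands the SPN into an (unnormalised) mixture of Gaussians. A sketch with illustrative weights and leaf parameters (my own assumptions, not the slide's values):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def hierarchical(x1, x2):
    # product of two sum nodes over Gaussian leaves N1, N2 (on X1) and N3, N4 (on X2)
    s1 = 0.6 * normal_pdf(x1, 0.0, 1.0) + 0.4 * normal_pdf(x1, 2.0, 1.0)
    s2 = 0.3 * normal_pdf(x2, -1.0, 0.5) + 0.7 * normal_pdf(x2, 1.0, 0.5)
    return s1 * s2

def flat(x1, x2):
    # the same density written as a 4-component mixture (weights multiplied out)
    comps = [(0.6 * 0.3, 0.0, -1.0), (0.6 * 0.7, 0.0, 1.0),
             (0.4 * 0.3, 2.0, -1.0), (0.4 * 0.7, 2.0, 1.0)]
    return sum(w * normal_pdf(x1, m1, 1.0) * normal_pdf(x2, m2, 0.5)
               for w, m1, m2 in comps)

diff = abs(hierarchical(0.3, -0.2) - flat(0.3, -0.2))
print(diff < 1e-12)  # True: the two forms agree
```

The hierarchical form is exponentially more compact in general, which is the point of the deep architecture.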
68 Bayesian Learning of SPNs [Figure: SPN with sum-node weights w_1, …, w_7 over Gaussian leaves N(X1; µ_1, Λ_1⁻¹), N(X1; µ_2, Λ_2⁻¹), N(X2; µ_3, Λ_3⁻¹), N(X2; µ_4, Λ_4⁻¹)] Parameters: weights, means and precisions (inverse covariance matrices). Prior: P(w, µ, Λ); a product of Dirichlets and Normal-Wisharts. Likelihood: P(x; w, µ, Λ) = SPN(x; w, µ, Λ). Posterior: P(w, µ, Λ | data) ∝ P(w, µ, Λ) Π_n SPN(x_n; w, µ, Λ)
69 Bayesian Moment Matching Algorithm Exact Bayesian Update on x_t: Product of Dirichlets and Normal-Wisharts → Mixture of Products of Dirichlets and Normal-Wisharts; Projection via Moment Matching → back to a single Product of Dirichlets and Normal-Wisharts
70 Overall Algorithm Recursive computation of all constants: linear complexity in the size of the SPN. Moment Matching: a system of linear equations; linear complexity in the size of the SPN. Streaming Data: posterior update in constant time w.r.t. the amount of data
71 Empirical Results [Table: average log-likelihood and standard error, 10-fold cross-validation, for obmm (random), oem (random), obmm (GMM) and oem (GMM) on Flow Size, Quake, Banknote, Abalone, Kinematics, CA and Sensorless Drive; the numeric values did not survive transcription] obmm performs better than oem
72 Empirical Results [Table: average log-likelihood and standard error, 10-fold cross-validation, for obmm (random), obmm (GMM), SRBM and GenMMN on the same datasets; the numeric values did not survive transcription] obmm is competitive with SRBM and GenMMN
73 Recap: Tensor Decomposition and Spectral Learning (consistent/robust, but offline); ADF, EP, SGD, oem (online, but non-convergence and local optima); Bayesian Learning (can be distributed, but practical problems); Bayesian Moment Matching algorithm for Mixture Models and continuous SPNs (consistent?); comparison with other deep networks: obmm performs better than oem and is comparable to other techniques
74 Conclusion and Future Work Contributions: - Online and distributed Bayesian Moment Matching algorithm - Performs better than oem w.r.t. time and accuracy - Extended it to continuous SPNs - Comparative analysis with other deep learning methods. Future Work: - Theoretical properties of BMM: consistent? - Generalize obmm to the exponential family - Extension to sequential data and transfer learning - Online structure learning of SPNs
75 Thank you!
More informationProbabilistic Time Series Classification
Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign
More informationAPPROXIMATION COMPLEXITY OF MAP INFERENCE IN SUM-PRODUCT NETWORKS
APPROXIMATION COMPLEXITY OF MAP INFERENCE IN SUM-PRODUCT NETWORKS Diarmaid Conaty Queen s University Belfast, UK Denis D. Mauá Universidade de São Paulo, Brazil Cassio P. de Campos Queen s University Belfast,
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More information1 Bayesian Linear Regression (BLR)
Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,
More informationIntroduction to Probabilistic Graphical Models: Exercises
Introduction to Probabilistic Graphical Models: Exercises Cédric Archambeau Xerox Research Centre Europe cedric.archambeau@xrce.xerox.com Pascal Bootcamp Marseille, France, July 2010 Exercise 1: basics
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationOn the Relationship between Sum-Product Networks and Bayesian Networks
On the Relationship between Sum-Product Networks and Bayesian Networks by Han Zhao A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationAn Introduction to Bayesian Machine Learning
1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems
More informationVariational Learning : From exponential families to multilinear systems
Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning
More informationEstimating Latent Variable Graphical Models with Moments and Likelihoods
Estimating Latent Variable Graphical Models with Moments and Likelihoods Arun Tejasvi Chaganty Percy Liang Stanford University June 18, 2014 Chaganty, Liang (Stanford University) Moments and Likelihoods
More informationNeutron inverse kinetics via Gaussian Processes
Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques
More informationEM & Variational Bayes
EM & Variational Bayes Hanxiao Liu September 9, 2014 1 / 19 Outline 1. EM Algorithm 1.1 Introduction 1.2 Example: Mixture of vmfs 2. Variational Bayes 2.1 Introduction 2.2 Example: Bayesian Mixture of
More informationIntroduction to Bayesian inference
Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Yuriy Sverchkov Intelligent Systems Program University of Pittsburgh October 6, 2011 Outline Latent Semantic Analysis (LSA) A quick review Probabilistic LSA (plsa)
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More information