A Unified Approach for Learning the Parameters of Sum-Product Networks
Han Zhao (Machine Learning Dept., Carnegie Mellon University), Pascal Poupart (School of Computer Science, University of Waterloo), Geoff Gordon (Machine Learning Dept., Carnegie Mellon University)

Abstract

We present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.

1 Introduction

Sum-product networks (SPNs) are new deep graphical model architectures that admit exact probabilistic inference in linear time in the size of the network [14]. Similar to traditional graphical models, there are two main problems when learning SPNs: structure learning and parameter learning. Parameter learning is interesting even if we know the ground-truth structure ahead of time; structure learning depends on parameter learning, so better parameter learning can often lead to better structure learning. Poon and Domingos [14] and Gens and Domingos [6] proposed both generative and discriminative learning algorithms for parameters in SPNs. At a high level, these approaches view SPNs as deep architectures and apply projected gradient descent (PGD) to optimize the data log-likelihood.
There are several drawbacks associated with PGD. For example, the projection step in PGD hurts the convergence of the algorithm and it will often lead to solutions on the boundary of the feasible region. Also, PGD contains an additional arbitrary parameter, the projection margin, which can be hard to set well in practice. In [14, 6], the authors also mentioned the possibility of applying EM algorithms to train SPNs by viewing sum nodes in SPNs as hidden variables. They presented an EM update formula without details. However, the update formula for EM given in [14, 6] is incorrect, as first pointed out and corrected by [12]. In this paper we take a different perspective and present a unified framework, which treats [14, 6] as special cases, for learning the parameters of SPNs. We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we can precisely characterize the functional form of the objective function based on the network structure. We show that the optimization problem associated with learning the parameters of SPNs based on the MLE principle can be formulated as a signomial program (SP), where both PGD and exponentiated gradient (EG) can be viewed as first-order approximations of the signomial program after suitable transformations of the objective function.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We also show that the signomial program formulation can be equivalently transformed into a difference of convex functions (DCP) formulation, where the objective function of the program can be naturally expressed as a difference of two convex functions. The DCP formulation allows us to develop two efficient optimization algorithms for learning the parameters of SPNs based on sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. Both proposed approaches naturally admit multiplicative updates, hence effectively deal with the positivity constraints of the optimization. Furthermore, under our unified framework, we also show that CCCP leads to the same algorithm as EM despite that these two approaches are different from each other in general. Although we mainly focus on MLE-based parameter learning, the mixture model interpretation of SPNs also helps to develop a Bayesian learning method for SPNs [21]. PGD, EG, SMA and CCCP can all be viewed as different levels of convex relaxation of the original SP. Hence the framework also provides an intuitive way to compare all four approaches. We conduct extensive experiments on 20 benchmark data sets to compare the empirical performance of PGD, EG, SMA and CCCP. Experimental results validate our theoretical analysis that CCCP is the best among all four approaches, showing that it converges consistently faster and with more stability than the other three methods. Furthermore, we use CCCP to boost the performance of LearnSPN [7], showing that it can achieve results comparable to state-of-the-art structure learning algorithms using SPNs with much smaller network sizes.

2 Background

2.1 Sum-Product Networks

To simplify the discussion of the main idea of our unified framework, we focus our attention on SPNs over Boolean random variables. However, the framework presented here is general and can be easily extended to other discrete and continuous random variables. We first define the notion of network polynomial.
We use $\mathbb{I}_x$ to denote an indicator variable that returns 1 when $X = x$ and 0 otherwise.

Definition 1 (Network Polynomial [4]). Let $f(\cdot) \ge 0$ be an unnormalized probability distribution over a Boolean random vector $X_{1:N}$. The network polynomial of $f(\cdot)$ is a multilinear function $\sum_x f(x) \prod_{n=1}^N \mathbb{I}_{x_n}$ of indicator variables, where the summation is over all possible instantiations of the Boolean random vector $X_{1:N}$.

A Sum-Product Network (SPN) over Boolean variables $X_{1:N}$ is a rooted DAG that computes the network polynomial over $X_{1:N}$. The leaves are univariate indicators of Boolean variables and internal nodes are either sum or product. Each sum node computes a weighted sum of its children and each product node computes the product of its children. The scope of a node in an SPN is defined as the set of variables that have indicators among the node's descendants: for any node $v$ in an SPN, if $v$ is a terminal node, say, an indicator variable over $X$, then $\mathrm{scope}(v) = \{X\}$, else $\mathrm{scope}(v) = \bigcup_{\tilde{v} \in \mathrm{Ch}(v)} \mathrm{scope}(\tilde{v})$. An SPN is complete iff each sum node has children with the same scope. An SPN is decomposable iff for every product node $v$, $\mathrm{scope}(v_i) \cap \mathrm{scope}(v_j) = \emptyset$ where $v_i, v_j \in \mathrm{Ch}(v), i \ne j$. The scope of the root node is $\{X_1, \ldots, X_N\}$. In this paper, we focus on complete and decomposable SPNs. For a complete and decomposable SPN $S$, each node $v$ in $S$ defines a network polynomial $f_v(\cdot)$ which corresponds to the sub-SPN (subgraph) rooted at $v$. The network polynomial of $S$, denoted by $f_S$, is the network polynomial defined by the root of $S$, which can be computed recursively from its children. The probability distribution induced by an SPN $S$ is defined as $\Pr_S(x) \triangleq \frac{f_S(x)}{\sum_x f_S(x)}$. The normalization constant $\sum_x f_S(x)$ can be computed in $O(|S|)$ by setting the values of all the leaf nodes to 1, i.e., $\sum_x f_S(x) = f_S(\mathbf{1})$ [14]. This leads to efficient joint/marginal/conditional inference in SPNs.

2.2 Signomial Programming (SP)

Before introducing SP, we first introduce geometric programming (GP), which is a strict subclass of SP.
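The definitions of §2.1 can be made concrete with a short sketch (hypothetical class names, not the authors' implementation): evaluating a complete and decomposable SPN bottom-up computes the network polynomial, and setting an input to `None` sets its indicators to 1, which marginalizes that variable and yields the normalization constant $f_S(\mathbf{1})$.

```python
class Leaf:
    """Indicator I[X_var = value]; x[var] = None marginalizes the variable."""
    def __init__(self, var, value):
        self.var, self.value = var, value
    def eval(self, x):
        return 1.0 if x[self.var] is None or x[self.var] == self.value else 0.0

class Sum:
    """Weighted sum of children (completeness: children share one scope)."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights
    def eval(self, x):
        return sum(w * c.eval(x) for w, c in zip(self.weights, self.children))

class Product:
    """Product of children (decomposability: pairwise disjoint scopes)."""
    def __init__(self, children):
        self.children = children
    def eval(self, x):
        v = 1.0
        for c in self.children:
            v *= c.eval(x)
        return v

# A small SPN over two Boolean variables X0, X1.
spn = Sum([Product([Leaf(0, 1), Leaf(1, 1)]),
           Product([Leaf(0, 0), Leaf(1, 1)]),
           Product([Leaf(0, 0), Leaf(1, 0)])],
          [0.2, 0.3, 0.5])

z = spn.eval([None, None])    # normalization constant f_S(1) = 0.2 + 0.3 + 0.5
prob = spn.eval([0, 1]) / z   # Pr(X0 = 0, X1 = 1)
```

Both queries cost a single bottom-up pass, which is the linear-time inference property mentioned above.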
A monomial is defined as a function $h : \mathbb{R}^n_{++} \to \mathbb{R}$ with $h(x) = c\, x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$, where the domain is restricted to the positive orthant ($\mathbb{R}^n_{++}$), the coefficient $c$ is positive and the exponents satisfy $a_i \in \mathbb{R}, \forall i$. A posynomial is a sum of monomials: $g(x) = \sum_{k=1}^K c_k\, x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}}$. One of the key properties of posynomials is positivity, which allows us to transform any posynomial into the log-domain. A GP in standard form is defined to be an optimization problem where both the objective function and the inequality constraints are posynomials and the equality constraints are monomials. There is also an implicit constraint that $x \in \mathbb{R}^n_{++}$. A GP in its standard form is not a convex program since posynomials are not convex functions in general. However, we can effectively transform it into a convex problem by using the logarithmic transformation trick on $x$, the multiplicative coefficients of each monomial and also each objective/constraint function [3, 1]. An SP has the same form as a GP except that the multiplicative constant inside each monomial is not restricted to be positive, i.e., $c$ can take any real value. Although the difference seems small, there is a huge difference between GP and SP from the computational perspective. The negative multiplicative constants in monomials invalidate the logarithmic transformation trick frequently used in GP. As a result, SPs cannot be reduced to convex programs and are believed to be hard to solve in general [1].

3 Unified Approach for Learning

In this section we will show that the parameter learning problem of SPNs based on the MLE principle can be formulated as an SP. We will use a sequence of optimal monomial approximations combined with backtracking line search and the concave-convex procedure to tackle the SP. Due to space constraints, we refer interested readers to the supplementary material for all the proof details.

3.1 Sum-Product Networks as a Mixture of Trees

We introduce the notion of induced trees from SPNs and use it to show that every complete and decomposable SPN can be interpreted as a mixture of induced trees, where each induced tree corresponds to a product of univariate distributions. From this perspective, an SPN can be understood as a huge mixture model where the effective number of components in the mixture is determined by its network structure.
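The logarithmic transformation trick for posynomials can be checked numerically. In the sketch below (the posynomial $g$ is an arbitrary example, not taken from the paper), substituting $x = \exp(y)$ and taking logs turns $g$ into a log-sum-exp of affine functions of $y$, hence a convex function, which a midpoint-convexity check illustrates:

```python
import math

def g(x1, x2):
    # An example posynomial: 2*x1*x2^0.5 + 3*x1^-1, defined on x1, x2 > 0.
    return 2 * x1 * math.sqrt(x2) + 3 / x1

def F(y1, y2):
    # F(y) = log g(exp(y)): a log-sum-exp of affine functions, convex in y.
    return math.log(g(math.exp(y1), math.exp(y2)))

# Midpoint-convexity check at two sample points in the transformed space.
a, b = (0.0, 0.0), (1.0, -1.0)
mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
assert F(*mid) <= (F(*a) + F(*b)) / 2 + 1e-12
```

With a negative coefficient (a signomial), $g(\exp(y))$ can be negative, so its log is not even defined everywhere; this is exactly why the trick fails for SPs.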
The method we describe here is not the first for interpreting an SPN (or the related arithmetic circuit) as a mixture distribution [20, 5, 2], but the new method can result in an exponentially smaller mixture; see the end of this section for more details.

Definition 2 (Induced SPN). Given a complete and decomposable SPN $S$ over $X_{1:N}$, let $\mathcal{T} = (\mathcal{T}_V, \mathcal{T}_E)$ be a subgraph of $S$. $\mathcal{T}$ is called an induced SPN from $S$ if

1. $\mathrm{Root}(S) \in \mathcal{T}_V$.
2. If $v \in \mathcal{T}_V$ is a sum node, then exactly one child of $v$ in $S$ is in $\mathcal{T}_V$, and the corresponding edge is in $\mathcal{T}_E$.
3. If $v \in \mathcal{T}_V$ is a product node, then all the children of $v$ in $S$ are in $\mathcal{T}_V$, and the corresponding edges are in $\mathcal{T}_E$.

Theorem 1. If $\mathcal{T}$ is an induced SPN from a complete and decomposable SPN $S$, then $\mathcal{T}$ is a tree that is complete and decomposable.

As a result of Thm. 1, we will use the terms induced SPN and induced tree interchangeably. With some abuse of notation, we use $\mathcal{T}(x)$ to mean the value of the network polynomial of $\mathcal{T}$ with input vector $x$.

Theorem 2. If $\mathcal{T}$ is an induced tree from $S$ over $X_{1:N}$, then $\mathcal{T}(x) = \prod_{(v_i,v_j)\in\mathcal{T}_E} w_{ij} \prod_{n=1}^N \mathbb{I}_{x_n}$, where $w_{ij}$ is the edge weight of $(v_i, v_j)$ if $v_i$ is a sum node and $w_{ij} = 1$ if $v_i$ is a product node.

Remark. Although we focus our attention on Boolean random variables for simplicity of discussion and illustration, Thm. 2 can be extended to the case where the univariate distributions at the leaf nodes are continuous or discrete distributions with countably infinitely many values, e.g., Gaussian or Poisson distributions. We can simply replace the product of univariate distributions term, $\prod_{n=1}^N \mathbb{I}_{x_n}$, in Thm. 2 with the general form $\prod_{n=1}^N p_n(x_n)$, where $p_n(X_n)$ is a univariate distribution over $X_n$. Also note that it is possible for two unique induced trees to share the same product of univariate distributions, but in this case their weight terms $\prod_{(v_i,v_j)\in\mathcal{T}_E} w_{ij}$ are guaranteed to be different. As we will see shortly, Thm. 2 implies that the joint distribution over $\{X_n\}_{n=1}^N$ represented by an SPN is essentially a mixture model with potentially exponentially many components.
Definition 3 (Network cardinality). The network cardinality $\tau_S$ of an SPN $S$ is the number of unique induced trees.

Theorem 3. $\tau_S = f_S(\mathbf{1} \mid \mathbf{1})$, where $f_S(\mathbf{1} \mid \mathbf{1})$ is the value of the network polynomial of $S$ with input vector $\mathbf{1}$ and all edge weights set to 1.

Theorem 4. $S(x) = \sum_{t=1}^{\tau_S} \mathcal{T}_t(x)$, where $\mathcal{T}_t$ is the $t$th unique induced tree of $S$.

Remark. The above four theorems prove that an SPN $S$ is an ensemble or mixture of trees, where each tree computes an unnormalized distribution over $X_{1:N}$. The total number of unique trees in $S$ is the network cardinality $\tau_S$, which depends only on the structure of $S$. Each component is a simple product of univariate distributions. We illustrate the theorems above with a simple example in Fig. 1.

Figure 1: A complete and decomposable SPN is a mixture of induced trees: a root sum node with weights $w_1, w_2, w_3$ decomposes as $S = w_1 \mathcal{T}_1 + w_2 \mathcal{T}_2 + w_3 \mathcal{T}_3$. Double circles indicate univariate distributions over $X_1$ and $X_2$. Different colors are used to highlight unique induced trees; each induced tree is a product of univariate distributions over $X_1$ and $X_2$.

Zhao et al. [20] show that every complete and decomposable SPN is equivalent to a bipartite Bayesian network with a layer of hidden variables and a layer of observable random variables. The number of hidden variables in the bipartite Bayesian network equals the number of sum nodes in $S$. A naive expansion of such a Bayesian network into a mixture model leads to a huge mixture with $2^{O(M)}$ components, where $M$ is the number of sum nodes in $S$. Here we complement their theory and show that each complete and decomposable SPN is essentially a mixture of trees whose effective number of unique induced trees is given by $\tau_S$. Note that $\tau_S = f_S(\mathbf{1}\mid\mathbf{1})$ depends only on the network structure, and can often be much smaller than $2^{O(M)}$. Without loss of generality, assuming that in $S$ layers of sum nodes alternate with layers of product nodes, $f_S(\mathbf{1}\mid\mathbf{1}) = \Omega(2^h)$, where $h$ is the height of $S$.
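Thm. 3 gives a direct way to count induced trees: evaluate the network with every input indicator and every edge weight set to 1, so sum nodes add their children's counts and product nodes multiply them. A minimal sketch (hypothetical class names, not the authors' code):

```python
class Leaf:
    def count(self):
        return 1

class Sum:
    def __init__(self, children):
        self.children = children
    def count(self):
        # All weights set to 1: a sum node adds its children's tree counts.
        return sum(c.count() for c in self.children)

class Product:
    def __init__(self, children):
        self.children = children
    def count(self):
        # A product node multiplies its children's tree counts.
        p = 1
        for c in self.children:
            p *= c.count()
        return p

# Alternating sum/product layers: counts multiply through product nodes,
# so the number of induced trees grows exponentially with the height.
inner = lambda: Sum([Leaf(), Leaf()])
root = Sum([Product([inner(), inner()]), Product([inner(), inner()])])
assert root.count() == 2 * 2 + 2 * 2   # 8 unique induced trees
```

Adding one more sum/product layer squares the per-branch count, matching the $\Omega(2^h)$ growth stated above, while the DAG itself stays small.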
However, the exponentially many trees are recursively merged and combined in $S$ such that the overall network size is still tractable.

3.2 Maximum Likelihood Estimation as SP

Let's consider the likelihood function computed by an SPN $S$ over $N$ binary random variables with model parameters $w$ and input vector $x \in \{0,1\}^N$. Here the model parameters in $S$ are the edge weights from every sum node, and we collect them together into a long vector $w \in \mathbb{R}^D_{++}$, where $D$ corresponds to the number of edges emanating from sum nodes in $S$. By definition, the probability distribution induced by $S$ can be computed by $\Pr_S(x \mid w) \triangleq \frac{f_S(x \mid w)}{\sum_x f_S(x \mid w)} = \frac{f_S(x \mid w)}{f_S(\mathbf{1} \mid w)}$.

Corollary 5. Let $S$ be an SPN with weights $w \in \mathbb{R}^D_{++}$ over input vector $x \in \{0,1\}^N$. The network polynomial $f_S(x \mid w)$ is a posynomial: $f_S(x \mid w) = \sum_{t=1}^{f_S(\mathbf{1}\mid\mathbf{1})} \prod_{n=1}^N \mathbb{I}^{(t)}_{x_n} \prod_{d=1}^D w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}$, where $\mathbb{I}[w_d \in \mathcal{T}_t]$ indicates whether $w_d$ is in the $t$-th induced tree $\mathcal{T}_t$ or not. Each monomial corresponds exactly to a unique induced tree SPN from $S$.

The above statement is a direct corollary of Thm. 2, Thm. 3 and Thm. 4. From the definition of network polynomial, we know that $f_S$ is a multilinear function of the indicator variables. Corollary 5 works as a complement to characterize the functional form of a network polynomial in terms of $w$. It follows that the likelihood function $L_S(w) \triangleq \Pr_S(x \mid w)$ can be expressed as the ratio of two posynomial functions. We now show that the optimization problem based on MLE is an SP. Using the definition of $\Pr_S(x \mid w)$ and Corollary 5, and letting $\tau = f_S(\mathbf{1}\mid\mathbf{1})$, the MLE problem can be rewritten as

$$\underset{w}{\text{maximize}} \quad \frac{f_S(x \mid w)}{f_S(\mathbf{1} \mid w)} = \frac{\sum_{t=1}^{\tau} \prod_{n=1}^N \mathbb{I}^{(t)}_{x_n} \prod_{d=1}^D w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}}{\sum_{t=1}^{\tau} \prod_{d=1}^D w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}} \qquad \text{subject to} \quad w \in \mathbb{R}^D_{++} \tag{1}$$
Proposition 6. The MLE problem for SPNs is a signomial program.

Being non-convex in general, SP is essentially hard to solve from a computational perspective [1, 3]. However, despite the hardness of SP in general, the objective function in the MLE formulation of SPNs has a special structure, i.e., it is the ratio of two posynomials, which makes the design of efficient optimization algorithms possible.

3.3 Difference of Convex Functions

Both PGD and EG are first-order methods and they can be viewed as approximating the SP after applying a logarithmic transformation to the objective function only. Although (1) is a signomial program, its objective function is expressed as the ratio of two posynomials. Hence, we can still apply the logarithmic transformation trick used in geometric programming to its objective function and to the variables to be optimized. More concretely, let $w_d = \exp(y_d), \forall d$, and take the log of the objective function; the problem becomes equivalent to maximizing the following new objective without any constraint on $y$:

$$\underset{y}{\text{maximize}} \quad \log\!\left(\sum_{t=1}^{\tau(x)} \exp\!\left(\sum_{d=1}^D y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\right)\right) - \log\!\left(\sum_{t=1}^{\tau} \exp\!\left(\sum_{d=1}^D y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\right)\right) \tag{2}$$

Note that in the first term of Eq. 2 the upper index $\tau(x) \le \tau \triangleq f_S(\mathbf{1}\mid\mathbf{1})$ depends on the current input $x$. By transforming into log-space, we naturally guarantee the positivity of the solution at each iteration, hence turning a constrained optimization problem into an unconstrained one without any sacrifice. Both terms in Eq. 2 are convex functions in $y$ after the transformation. Hence, the transformed objective function is now expressed as the difference of two convex functions, which is called a DC function [9]. This helps us to design two efficient algorithms to solve the problem based on the general idea of sequential convex approximations for nonlinear programming.

3.3.1 Sequential Monomial Approximation

Let's consider the linearization of both terms in Eq. 2 in order to apply first-order methods in the transformed space.
To compute the gradient with respect to different components of $y$, we view each node of an SPN as an intermediate function of the network polynomial and apply the chain rule to back-propagate the gradient. The differentiation of $f_S(x \mid w)$ with respect to the root node of the network is set to 1. The differentiation of the network polynomial with respect to a partial function at each node can then be computed in two passes of the network: the bottom-up pass evaluates the values of all partial functions given the current input $x$, and the top-down pass differentiates the network polynomial with respect to each partial function. Following the evaluation-differentiation passes, the gradient of the objective function in (2) can be computed in $O(|S|)$. Furthermore, although the computation is conducted in $y$, the results are fully expressed in terms of $w$, which suggests that in practice we do not need to explicitly construct $y$ from $w$. Let $f(y) = \log f_S(x \mid \exp(y)) - \log f_S(\mathbf{1} \mid \exp(y))$. It follows that approximating $f(y)$ with the best linear function is equivalent to using the best monomial approximation of the signomial program (1). This leads to a sequential monomial approximation of the original SP formulation: at each iteration $y^{(k)}$, we linearize both terms in Eq. 2 and form the optimal monomial function in terms of $w^{(k)}$. The additive update of $y^{(k)}$ leads to a multiplicative update of $w^{(k)}$ since $w^{(k)} = \exp(y^{(k)})$, and we use a backtracking line search to determine the step size of the update in each iteration.

3.3.2 Concave-convex Procedure

Sequential monomial approximation fails to use the structure of the problem when learning SPNs. Here we propose another approach based on the concave-convex procedure (CCCP) [18] to exploit the fact that the objective function is expressed as the difference of two convex functions. At a high level, CCCP solves a sequence of concave surrogate optimizations until convergence.
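The two-pass evaluation-differentiation scheme can be sketched as follows (a minimal sketch with assumed data structures, not the authors' C++ implementation); the bottom-up pass computes every node value $f_v$, and the top-down pass accumulates $\partial f_S / \partial f_v$, from which the gradient with respect to an edge weight $w_{ij}$ is $\partial f_S / \partial f_{v_i} \cdot f_{v_j}$:

```python
class Node:
    _next = 0
    def __init__(self, kind, children=(), weights=()):
        self.kind, self.children, self.weights = kind, list(children), list(weights)
        self.id = Node._next
        Node._next += 1

def forward(order, leaf_vals):
    """Bottom-up pass; `order` lists nodes children-before-parents."""
    val = dict(leaf_vals)                       # node id -> value f_v
    for n in order:
        if n.kind == 'sum':
            val[n.id] = sum(w * val[c.id] for w, c in zip(n.weights, n.children))
        elif n.kind == 'prod':
            v = 1.0
            for c in n.children:
                v *= val[c.id]
            val[n.id] = v
    return val

def backward(order, val, root):
    """Top-down pass; d[v] = d f_S / d f_v, with d[root] = 1."""
    d = {n.id: 0.0 for n in order}
    d[root.id] = 1.0
    for n in reversed(order):
        if n.kind == 'sum':
            for w, c in zip(n.weights, n.children):
                d[c.id] += w * d[n.id]
        elif n.kind == 'prod':
            for c in n.children:
                other = 1.0
                for o in n.children:
                    if o is not c:
                        other *= val[o.id]
                d[c.id] += d[n.id] * other
    return d

# Tiny example: a product of two sums over indicator leaves, all inputs 1.
a, b, c, e = (Node('leaf') for _ in range(4))
s1, s2 = Node('sum', [a, b], [0.3, 0.7]), Node('sum', [c, e], [0.5, 0.5])
root = Node('prod', [s1, s2])
order = [a, b, c, e, s1, s2, root]
val = forward(order, {n.id: 1.0 for n in (a, b, c, e)})
d = backward(order, val, root)
grad_w = d[s1.id] * val[a.id]   # gradient of f_S w.r.t. the weight on s1 -> a
```

Both passes touch each edge a constant number of times, giving the $O(|S|)$ cost stated above.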
In many cases, the maximum of a concave surrogate function can only be found using other convex solvers, and as a result the efficiency of CCCP depends heavily on the choice of convex solver. However, we show that by a suitable transformation of the network we can compute the maximum of the concave surrogate in closed form in time linear in the network size, which leads to a very efficient algorithm for learning the parameters of SPNs. We also prove the convergence properties of our algorithm. Consider the objective function to be maximized in the DCP: $f(y) = \log f_S(x \mid \exp(y)) - \log f_S(\mathbf{1} \mid \exp(y)) \triangleq f_1(y) + f_2(y)$, where $f_1(y) \triangleq \log f_S(x \mid \exp(y))$ is a convex function and $f_2(y) \triangleq -\log f_S(\mathbf{1} \mid \exp(y))$ is a concave function. We can linearize only the convex part $f_1(y)$ to obtain a surrogate function

$$\hat{f}(y, z) = f_1(z) + \nabla_z f_1(z)^T (y - z) + f_2(y) \tag{3}$$

for $\forall y, z \in \mathbb{R}^D$. Now $\hat{f}(y, z)$ is a concave function in $y$. Due to the convexity of $f_1(y)$ we have $f_1(y) \ge f_1(z) + \nabla_z f_1(z)^T (y - z), \forall y, z$, and as a result the following two properties always hold for $\forall y, z$: $\hat{f}(y, z) \le f(y)$ and $\hat{f}(y, y) = f(y)$. CCCP updates $y$ at each iteration $k$ by solving $y^{(k)} \in \arg\max_y \hat{f}(y, y^{(k-1)})$ unless we already have $y^{(k-1)} \in \arg\max_y \hat{f}(y, y^{(k-1)})$, in which case a generalized fixed point $y^{(k-1)}$ has been found and the algorithm stops.

It is easy to show that at each iteration of CCCP we always have $f(y^{(k)}) \ge f(y^{(k-1)})$. Note also that $f(y)$ computes the log-likelihood of input $x$ and is therefore bounded above by 0. By the monotone convergence theorem, $\lim_{k\to\infty} f(y^{(k)})$ exists and the sequence $\{f(y^{(k)})\}$ converges. We now discuss how to compute a closed-form solution for the maximization of the concave surrogate $\hat{f}(y, y^{(k-1)})$. Since $\hat{f}(y, y^{(k-1)})$ is differentiable and concave for any fixed $y^{(k-1)}$, a sufficient and necessary condition for its maximum is

$$\nabla_y \hat{f}(y, y^{(k-1)}) = \nabla_{y^{(k-1)}} f_1(y^{(k-1)}) + \nabla_y f_2(y) = 0 \tag{4}$$

In the above equation, if we consider only the partial derivative with respect to $y_{ij}$ ($w_{ij}$), we obtain

$$\frac{w^{(k-1)}_{ij} f_{v_j}(x \mid w^{(k-1)})}{f_S(x \mid w^{(k-1)})} \frac{\partial f_S(x \mid w^{(k-1)})}{\partial f_{v_i}(x \mid w^{(k-1)})} = \frac{w_{ij} f_{v_j}(\mathbf{1} \mid w)}{f_S(\mathbf{1} \mid w)} \frac{\partial f_S(\mathbf{1} \mid w)}{\partial f_{v_i}(\mathbf{1} \mid w)} \tag{5}$$

Eq. 5 leads to a system of $D$ nonlinear equations, which is hard to solve in closed form. However, if we do a change of variables by considering locally normalized weights $w'_{ij}$ (i.e., $w'_{ij} \ge 0$ and $\sum_j w'_{ij} = 1, \forall i$), then a solution can be easily computed.
As described in [13, 20], any SPN can be transformed into an equivalent normal SPN with locally normalized weights in a bottom-up pass as follows:

$$w'_{ij} = \frac{w_{ij} f_{v_j}(\mathbf{1} \mid w)}{\sum_j w_{ij} f_{v_j}(\mathbf{1} \mid w)} \tag{6}$$

We can then replace $w_{ij} f_{v_j}(\mathbf{1} \mid w)$ in the above equation by the expression it is equal to in Eq. 5 to obtain a closed-form solution:

$$w'_{ij} \propto w^{(k-1)}_{ij} \frac{f_{v_j}(x \mid w^{(k-1)})}{f_S(x \mid w^{(k-1)})} \frac{\partial f_S(x \mid w^{(k-1)})}{\partial f_{v_i}(x \mid w^{(k-1)})} \tag{7}$$

Note that in the above derivation both $f_{v_i}(\mathbf{1} \mid w)/f_S(\mathbf{1} \mid w)$ and $\partial f_S(\mathbf{1} \mid w)/\partial f_{v_i}(\mathbf{1} \mid w)$ can be treated as constants and hence absorbed, since $w'_{ij}, \forall j$ are constrained to be locally normalized. To obtain a solution to Eq. 5, for each edge weight $w_{ij}$ the sufficient statistics include only three terms, i.e., the evaluation value at $v_j$, the differentiation value at $v_i$ and the previous edge weight $w^{(k-1)}_{ij}$, all of which can be obtained in two passes of the network for each input $x$. Thus the computational complexity to obtain a maximum of the concave surrogate is $O(|S|)$. Interestingly, Eq. 7 leads to the same update formula as the EM algorithm [12], despite the fact that CCCP and EM start from different perspectives. We show that all the limit points of the sequence $\{w^{(k)}\}_{k=1}^\infty$ are guaranteed to be stationary points of the DCP in (2).

Theorem 7. Let $\{w^{(k)}\}_{k=1}^\infty$ be any sequence generated using Eq. 7 from any positive initial point; then all the limiting points of $\{w^{(k)}\}_{k=1}^\infty$ are stationary points of the DCP in (2). In addition, $\lim_{k\to\infty} f(y^{(k)}) = f(y^{*})$, where $y^{*}$ is some stationary point of (2).

We summarize all four algorithms and highlight their connections and differences in Table 1. Although we mainly discuss the batch version of these algorithms, all four can be easily adapted to work in stochastic and/or parallel settings.
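The closed-form surrogate maximization of Eq. 7 amounts to a per-sum-node multiplicative update. A minimal sketch with hypothetical inputs (a real implementation would obtain `child_vals` and `node_deriv` from the two network passes and average the statistics over a batch of inputs):

```python
def cccp_update(w_prev, child_vals, node_deriv, f_root):
    """One CCCP/EM step (Eq. 7) for the weights of a single sum node v_i.

    w_prev     : current weights w_ij^(k-1) on edges (v_i, v_j)
    child_vals : f_{v_j}(x | w^(k-1)) for each child v_j
    node_deriv : df_S / df_{v_i} evaluated at (x, w^(k-1))
    f_root     : f_S(x | w^(k-1))
    The common factor node_deriv / f_root cancels under local
    normalization; it is kept here only to mirror Eq. 7.
    """
    unnorm = [w * fv * node_deriv / f_root for w, fv in zip(w_prev, child_vals)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

w_new = cccp_update([0.2, 0.8], [0.5, 0.25], 1.0, 0.3)
# The update is multiplicative: positive weights stay positive and come
# out locally normalized, so no projection step is ever needed.
```

This is why CCCP needs neither a learning rate nor a line search: each step is the exact maximizer of the concave surrogate.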
Table 1: Summary of PGD, EG, SMA and CCCP. "Var." denotes the optimization variables.

- PGD (Var. $w$, additive update): $w^{(k+1)} \leftarrow P_{\mathbb{R}_{++}^{\epsilon}}\big[w^{(k)} + \gamma(\nabla_w f_1(w^{(k)}) - \nabla_w f_2(w^{(k)}))\big]$
- EG (Var. $w$, multiplicative update): $w^{(k+1)} \leftarrow w^{(k)} \odot \exp\{\gamma(\nabla_w f_1(w^{(k)}) - \nabla_w f_2(w^{(k)}))\}$
- SMA (Var. $\log w$, multiplicative update): $w^{(k+1)} \leftarrow w^{(k)} \odot \exp\{\gamma\, w^{(k)} \odot (\nabla_w f_1(w^{(k)}) - \nabla_w f_2(w^{(k)}))\}$
- CCCP (Var. $\log w$, multiplicative update): $w^{(k+1)}_{ij} \propto w^{(k)}_{ij}\, \nabla_{v_i} f_S(w^{(k)})\, f_{v_j}(w^{(k)})$

4 Experiments

4.1 Experimental Setting

We conduct experiments on 20 benchmark data sets from various domains to compare and evaluate the convergence performance of the four algorithms: PGD, EG, SMA and CCCP (EM). These 20 data sets are widely used in [7, 15] to assess different SPNs for the task of density estimation. All the features in the 20 data sets are binary. All the SPNs used for comparing PGD, EG, SMA and CCCP are trained using LearnSPN [7]. We discard the weights returned by LearnSPN and use random weights as initial model parameters; the random weights are determined by the same random seed in all four algorithms. Detailed information about these 20 data sets and the SPNs used in the experiments is provided in the supplementary material.

4.2 Parameter Learning

We implement all four algorithms in C++. For each algorithm, we set the maximum number of iterations to 50. If the absolute difference in the training log-likelihood at two consecutive steps is less than 0.001, the algorithm is stopped. For PGD, EG and SMA, we combine each of them with backtracking line search and use a weight-shrinking coefficient of 0.8. The learning rates are initialized to 1.0 for all three methods. For PGD, we set the projection margin to a small positive constant. There is no learning rate and no backtracking line search in CCCP. We set a small smoothing parameter in CCCP to avoid numerical issues. We show in Fig. 2 the average log-likelihood scores on the 20 training data sets to evaluate the convergence speed and stability of PGD, EG, SMA and CCCP.
Clearly, CCCP wins by a large margin over PGD, EG and SMA, both in convergence speed and in solution quality. Furthermore, among the four algorithms, CCCP is the most stable one due to its guarantee that the log-likelihood (on training data) will not decrease after each iteration. As shown in Fig. 2, the training curves of CCCP are smoother than those of the other three methods in almost all cases. These 20 experiments also clearly show that CCCP often converges in a few iterations. On the other hand, PGD, EG and SMA are on par with each other since they are all first-order methods. SMA is more stable than PGD and EG and often achieves better solutions than both; on large data sets, SMA also converges faster than PGD and EG. Surprisingly, EG performs worse than PGD in some cases and is quite unstable despite the fact that it admits multiplicative updates. The hook-shaped curves of PGD on some data sets, e.g. Kosarak and KDD, are due to the projection operations.

Table 2: Average log-likelihoods on test data for CCCP, LearnSPN and ID-SPN on the 20 data sets (NLTCS, MSNBC, KDD 2k, Plants, Audio, Jester, Netflix, Accidents, Retail, Pumsb-star, DNA, Kosarak, MSWeb, Book, EachMovie, WebKB, Reuters, Newsgrp, BBC, Ad). Highest log-likelihoods are highlighted in bold; ↑ marks log-likelihoods statistically better than CCCP and ↓ statistically worse than CCCP, as measured by the Wilcoxon signed-rank test. [The numeric entries of the table were lost in extraction.]
Figure 2: Negative log-likelihood values versus number of iterations for PGD, EG, SMA and CCCP on the 20 data sets.

The computational complexity per update is $O(|S|)$ for all four algorithms. CCCP often takes less time than the other three algorithms because it takes fewer iterations to converge. We list detailed running-time statistics for all four algorithms on the 20 data sets in the supplementary material.

4.3 Fine Tuning

We combine CCCP as a fine-tuning procedure with the structure learning algorithm LearnSPN and compare it to the state-of-the-art structure learning algorithm ID-SPN [15]. More concretely, we keep the model parameters learned by LearnSPN and use them to initialize CCCP. We then update the model parameters globally using CCCP as a fine-tuning technique. This normally helps to obtain a better generative model, since the original parameters are learned greedily and locally during the structure learning algorithm. We use the validation-set log-likelihood score to avoid overfitting; the algorithm returns the set of parameters that achieves the best validation-set log-likelihood. Experimental results are reported in Table 2. As shown there, applying CCCP after LearnSPN always improves model performance. By optimizing model parameters on these 20 data sets, we boost LearnSPN to achieve better results than the state-of-the-art ID-SPN on 7 data sets, whereas the original LearnSPN outperforms ID-SPN on only 1 data set. Note that the sizes of the SPNs returned by LearnSPN are much smaller than those produced by ID-SPN. Hence, it is remarkable that by fine-tuning the parameters with CCCP we can achieve better performance despite the models being smaller. For a fair comparison, we also list the sizes of the SPNs returned by ID-SPN in the supplementary material.

5 Conclusion

We show that the network polynomial of an SPN is a posynomial function of the model parameters, and that parameter learning yields a signomial program. We propose two convex relaxations to solve the SP.
We analyze the convergence properties of CCCP for learning SPNs. Extensive experiments are conducted to evaluate the proposed approaches and current methods. We also recommend combining CCCP with structure learning algorithms to boost modeling accuracy.

Acknowledgments

HZ and GG gratefully acknowledge support from an ONR contract. HZ also thanks Ryan Tibshirani for the helpful discussion about CCCP.
References

[1] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8(1):67-127, 2007.
[2] H. Chan and A. Darwiche. On the robustness of most probable explanations. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, 2006.
[3] M. Chiang. Geometric Programming for Communication Systems. Now Publishers Inc, 2005.
[4] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280-305, 2003.
[5] A. Dennis and D. Ventura. Greedy structure search for sum-product networks. In International Joint Conference on Artificial Intelligence, volume 24, 2015.
[6] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, 2012.
[7] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[8] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6, 2005.
[9] P. Hartman et al. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3), 1959.
[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.
[11] G. R. Lanckriet and B. K. Sriperumbudur. On the convergence of the concave-convex procedure. In Advances in Neural Information Processing Systems, 2009.
[12] R. Peharz. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Graz University of Technology, 2015.
[13] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In AISTATS, 2015.
[14] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 2011.
[15] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In ICML, 2014.
[16] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. In UAI, 2003.
[17] C. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 1983.
[18] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems, 2002.
[19] W. I. Zangwill. Nonlinear Programming: A Unified Approach, volume 196. Prentice-Hall, Englewood Cliffs, NJ, 1969.
[20] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In ICML, 2015.
[21] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In ICML, 2016.
A Proof of SPNs as Mixture of Trees

Theorem 1. If T is an induced SPN from a complete and decomposable SPN S, then T is a tree that is complete and decomposable.

Proof. Argue by contradiction that T is not a
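Theorem 1's mixture-of-trees view can be checked numerically on a toy case. The sketch below is our own illustrative code, not the paper's: the structure (a product root over two sum nodes), the weights, and the Bernoulli leaf parameters are all assumptions chosen for the example. It verifies that ordinary bottom-up SPN evaluation agrees with an explicit sum over induced trees, where each tree is a product of univariate distributions weighted by the product of its chosen sum-edge weights.

```python
import itertools

def bernoulli(p):
    """Univariate leaf distribution over a binary variable."""
    return lambda x: p if x == 1 else 1.0 - p

# Toy complete and decomposable SPN over binary X1, X2:
#   S(x1, x2) = S1(x1) * S2(x2), where each Si is a weighted sum of
#   two Bernoulli leaves over its own scope (weights sum to 1).
w1, leaves1 = [0.3, 0.7], [bernoulli(0.8), bernoulli(0.1)]  # scope {X1}
w2, leaves2 = [0.6, 0.4], [bernoulli(0.2), bernoulli(0.9)]  # scope {X2}

def spn_value(x1, x2):
    """Ordinary bottom-up evaluation of the SPN."""
    s1 = sum(w * f(x1) for w, f in zip(w1, leaves1))
    s2 = sum(w * f(x2) for w, f in zip(w2, leaves2))
    return s1 * s2

def mixture_of_trees_value(x1, x2):
    """An induced tree keeps every child of a product node but exactly
    one child of each sum node. Its value is the product of the chosen
    sum-edge weights and leaf values; here there are 2 x 2 = 4 trees."""
    total = 0.0
    for (wa, fa), (wb, fb) in itertools.product(zip(w1, leaves1),
                                                zip(w2, leaves2)):
        total += (wa * wb) * fa(x1) * fb(x2)  # tree weight * leaf product
    return total

# The two evaluations agree on every complete assignment, and the SPN
# defines a normalized distribution over the four joint states.
for a, b in itertools.product([0, 1], repeat=2):
    assert abs(spn_value(a, b) - mixture_of_trees_value(a, b)) < 1e-12
```

Even in this small example the number of induced trees (4) is multiplicative in the number of sum-node children along independent scopes, which is why the compact SPN encodes a mixture with exponentially many components.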
More information