
A Proof of SPNs as Mixture of Trees

Theorem 1. If $\mathcal{T}$ is an induced SPN from a complete and decomposable SPN $\mathcal{S}$, then $\mathcal{T}$ is a tree that is complete and decomposable.

Proof. Argue by contradiction that $\mathcal{T}$ is not a tree; then there must exist a node $v \in \mathcal{T}$ such that $v$ has more than one parent in $\mathcal{T}$. This means that there exist at least two paths $R, p_1, \ldots, v$ and $R, q_1, \ldots, v$ that connect the root of $\mathcal{S}$ (equivalently of $\mathcal{T}$), which we denote by $R$, and $v$. Let $t$ be the last node in $R, p_1, \ldots, v$ and $R, q_1, \ldots, v$ such that $R, \ldots, t$ is the common prefix of these two paths. By construction such a $t$ must exist, since the two paths start from the same root node $R$ ($R$ itself is one candidate for $t$). Also, we claim that $t \neq v$; otherwise the two paths would overlap with each other, which contradicts the assumption that $v$ has multiple parents. Hence these two paths can be represented as $R, \ldots, t, p, \ldots, v$ and $R, \ldots, t, q, \ldots, v$, where $R, \ldots, t$ is the common prefix shared by the two paths and $p \neq q$ since $t$ is the last common node. From the construction process defined in Def. 2, we know that both $p$ and $q$ are children of $t$ in $\mathcal{S}$. Recall that for each sum node in $\mathcal{S}$, Def. 2 takes at most one child; hence $t$ must be a product node, since both $p$ and $q$ are children of $t$. Then the paths $t \to p \rightsquigarrow v$ and $t \to q \rightsquigarrow v$ indicate that $\operatorname{scope}(v) \subseteq \operatorname{scope}(p) \subseteq \operatorname{scope}(t)$ and $\operatorname{scope}(v) \subseteq \operatorname{scope}(q) \subseteq \operatorname{scope}(t)$, leading to $\emptyset \neq \operatorname{scope}(v) \subseteq \operatorname{scope}(p) \cap \operatorname{scope}(q)$, which contradicts the decomposability of the product node $t$. Hence as long as $\mathcal{S}$ is complete and decomposable, $\mathcal{T}$ must be a tree. The completeness of $\mathcal{T}$ is trivially satisfied because each sum node has only one child in $\mathcal{T}$. It is also straightforward to verify that $\mathcal{T}$ satisfies decomposability, as $\mathcal{T}$ is an induced subgraph of $\mathcal{S}$, which is decomposable.

Theorem 2. If $\mathcal{T}$ is an induced tree from $\mathcal{S}$ over $X_{1:N}$, then $\mathcal{T}(x) = \prod_{(v_i, v_j) \in \mathcal{T}_E} w_{ij} \prod_{n=1}^{N} \mathbb{I}_{x_n}$, where $w_{ij}$ is the edge weight of $(v_i, v_j)$ if $v_i$ is a sum node, and $w_{ij} = 1$ if $v_i$ is a product node.

Proof. First, the scope of $\mathcal{T}$ is the same as the scope of $\mathcal{S}$ because the root of $\mathcal{S}$ is also the root of $\mathcal{T}$. This shows that for each $X_i$ there is at least one indicator $\mathbb{I}_{x_i}$ in the leaves; otherwise the scope of the root node of $\mathcal{T}$ would be a strict subset of the scope of the root node of $\mathcal{S}$. Furthermore, for each variable $X_i$ there is at most one indicator $\mathbb{I}_{x_i}$ in the leaves. This follows from the fact that at most one child is collected from each sum node into $\mathcal{T}$: if $\mathbb{I}_{x_i}$ and $\mathbb{I}_{\bar{x}_i}$ appeared simultaneously in the leaves, then their least common ancestor would have to be a product node (the least common ancestor of $\mathbb{I}_{x_i}$ and $\mathbb{I}_{\bar{x}_i}$ is guaranteed to exist because of the tree structure of $\mathcal{T}$), which contradicts the fact that $\mathcal{S}$ is decomposable. As a result, there is exactly one indicator $\mathbb{I}_{x_i}$ for each variable $X_i$ in $\mathcal{T}$. Hence the multiplicative constant of the monomial admits the form $\prod_{n=1}^{N} \mathbb{I}_{x_n}$, which is a product of univariate distributions; more specifically, it is a product of indicator variables in the case of Boolean input variables. We have already shown that $\mathcal{T}$ is a tree and that only product nodes in $\mathcal{T}$ can have multiple children. It follows that the functional form of $f_{\mathcal{T}}(x)$ must be a monomial, and only edge weights that are in $\mathcal{T}$ contribute to the monomial. Combining all of the above, we get $f_{\mathcal{T}}(x) = \prod_{(v_i, v_j) \in \mathcal{T}_E} w_{ij} \prod_{n=1}^{N} \mathbb{I}_{x_n}$.
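The completeness and decomposability conditions invoked throughout these proofs are mechanical to check on a concrete network. The following minimal sketch is illustrative only; the `Node` class, the helper names `scope`, `is_complete`, `is_decomposable`, and the toy example are our own assumptions, not code from the paper:

```python
class Node:
    def __init__(self, kind, children=(), weights=(), var=None):
        self.kind = kind                  # 'sum', 'prod', or 'leaf'
        self.children = list(children)
        self.weights = list(weights)      # edge weights, sum nodes only
        self.var = var                    # variable index, leaves only

def scope(node):
    """Scope of a node: the set of variables its sub-SPN ranges over."""
    if node.kind == 'leaf':
        return frozenset([node.var])
    return frozenset().union(*(scope(c) for c in node.children))

def is_complete(node):
    """Completeness: children of every sum node share the same scope."""
    ok = all(is_complete(c) for c in node.children)
    if node.kind == 'sum':
        ok = ok and len({scope(c) for c in node.children}) == 1
    return ok

def is_decomposable(node):
    """Decomposability: children of every product node have disjoint scopes."""
    ok = all(is_decomposable(c) for c in node.children)
    if node.kind == 'prod':
        ok = ok and sum(len(scope(c)) for c in node.children) == len(scope(node))
    return ok

# Toy SPN over Boolean X0, X1: a sum of two products of indicator leaves.
leaves = [Node('leaf', var=v) for v in (0, 0, 1, 1)]
p1 = Node('prod', children=[leaves[0], leaves[2]])
p2 = Node('prod', children=[leaves[1], leaves[3]])
root = Node('sum', children=[p1, p2], weights=[0.4, 0.6])
assert is_complete(root) and is_decomposable(root)
```

The disjointness test uses the fact that pairwise disjoint child scopes union to a set whose size is the sum of the child scope sizes.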
Theorem 3. $\tau_{\mathcal{S}} = f_{\mathcal{S}}(1|1)$, where $\tau_{\mathcal{S}}$ denotes the number of unique induced trees of $\mathcal{S}$ and $f_{\mathcal{S}}(1|1)$ is the value of the network polynomial of $\mathcal{S}$ with input vector $\mathbf{1}$ and all edge weights set to 1.

Theorem 4. $\mathcal{S}(x) = \sum_{t=1}^{\tau_{\mathcal{S}}} \mathcal{T}_t(x)$, where $\mathcal{T}_t$ is the $t$th unique induced tree of $\mathcal{S}$.

Proof. We prove by induction on the height of $\mathcal{S}$. If the height of $\mathcal{S}$ is 2, then depending on the type of the root node, we have two cases:

1. If the root is a sum node with $K$ children, then there are $\binom{K}{1} = K$ different subgraphs that satisfy Def. 2, which is exactly the value of the network obtained by setting all the indicators and edge weights from the root to 1.

2. If the root is a product node, then there is only one subgraph, namely the graph itself. Again, this equals the value of $\mathcal{S}$ obtained by setting all indicators to 1.

Assume the theorem is true for SPNs with height $\leq h$. Consider an SPN $\mathcal{S}$ with height $h+1$. Again, depending on the type of the root node, we need to discuss two cases:

1. If the root is a sum node with $K$ children, where the $k$th sub-SPN has $f_{\mathcal{S}_k}(1|1)$ unique induced trees, then by Def. 2 the total number of unique induced trees of $\mathcal{S}$ is $\sum_{k=1}^{K} f_{\mathcal{S}_k}(1|1) = \sum_{k=1}^{K} 1 \cdot f_{\mathcal{S}_k}(1|1) = f_{\mathcal{S}}(1|1)$.

2. If the root is a product node with $K$ children, then the total number of unique induced trees of $\mathcal{S}$ can be computed as $\prod_{k=1}^{K} f_{\mathcal{S}_k}(1|1) = f_{\mathcal{S}}(1|1)$.

The second part of the theorem follows by using the distributive law between multiplication and addition to combine unique trees that share the same prefix in bottom-up order.
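Both the counting identity of Theorem 3 and the mixture decomposition of Theorem 4 are easy to check numerically on a small network. Below is a hedged sketch (the tuple encoding, the helper names `value` and `trees`, and the example structure and weights are illustrative assumptions, not the paper's code): `trees` enumerates induced trees per Def. 2, and the assertions check $\tau_{\mathcal{S}} = f_{\mathcal{S}}(1|1)$ and $\mathcal{S}(x) = \sum_t \mathcal{T}_t(x)$:

```python
import math
from itertools import product as cartesian

# A node is ('leaf', var, val), ('sum', [(w, child), ...]), or ('prod', [child, ...]).
def value(node, x, ones=False):
    """Bottom-up network polynomial; ones=True sets all indicators and
    edge weights to 1, so value(root, None, ones=True) equals f_S(1|1)."""
    if node[0] == 'leaf':
        return 1.0 if ones or x[node[1]] == node[2] else 0.0
    if node[0] == 'sum':
        return sum((1.0 if ones else w) * value(c, x, ones) for w, c in node[1])
    return math.prod(value(c, x, ones) for c in node[1])

def trees(node):
    """Induced trees per Def. 2, as (edge-weight product, list of leaves)."""
    if node[0] == 'leaf':
        yield 1.0, [node]
    elif node[0] == 'sum':                # keep exactly one child per sum node
        for w, c in node[1]:
            for wp, ls in trees(c):
                yield w * wp, ls
    else:                                 # keep all children of a product node
        for combo in cartesian(*[list(trees(c)) for c in node[1]]):
            yield math.prod(w for w, _ in combo), [l for _, ls in combo for l in ls]

x1, nx1 = ('leaf', 0, 1), ('leaf', 0, 0)
x2, nx2 = ('leaf', 1, 1), ('leaf', 1, 0)
s1 = ('sum', [(0.3, x1), (0.7, nx1)])
s2 = ('sum', [(0.8, x2), (0.2, nx2)])
root = ('prod', [s1, s2])

all_trees = list(trees(root))
assert len(all_trees) == value(root, None, ones=True) == 4        # Theorem 3
for x in cartesian([0, 1], repeat=2):                             # Theorem 4
    mix = sum(wp * math.prod(value(l, x) for l in ls) for wp, ls in all_trees)
    assert abs(mix - value(root, x)) < 1e-12
```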

B MLE as Signomial Programming

Proposition 6. The MLE problem for SPNs is a signomial program.

Proof. Using the definition of $\Pr(x|w)$ and Corollary 5, let $\tau = f_{\mathcal{S}}(1|1)$; the MLE problem can be rewritten as

$$\underset{w}{\text{maximize}} \quad \frac{f_{\mathcal{S}}(x|w)}{f_{\mathcal{S}}(1|w)} = \frac{\sum_{t=1}^{\tau} \prod_{n=1}^{N} \mathbb{I}^{(t)}_{x_n} \prod_{d=1}^{D} w_d^{\mathbb{I}[w_d \in \mathcal{T}_t]}}{\sum_{t=1}^{\tau} \prod_{d=1}^{D} w_d^{\mathbb{I}[w_d \in \mathcal{T}_t]}} \qquad (8)$$
$$\text{subject to} \quad w \in \mathbb{R}^{D}_{++}$$

which we claim is equivalent to:

$$\underset{w,\,z}{\text{minimize}} \quad -z \qquad (9)$$
$$\text{subject to} \quad z \sum_{t=1}^{\tau} \prod_{d=1}^{D} w_d^{\mathbb{I}[w_d \in \mathcal{T}_t]} - \sum_{t=1}^{\tau} \prod_{n=1}^{N} \mathbb{I}^{(t)}_{x_n} \prod_{d=1}^{D} w_d^{\mathbb{I}[w_d \in \mathcal{T}_t]} \leq 0, \quad w \in \mathbb{R}^{D}_{++}, \; z > 0$$

It is easy to check that both the objective function and the constraint function in (9) are signomials. To see the equivalence of (8) and (9), let $p^*$ be the optimal value of (8), achieved at $w^*$. Choose $z = p^*$ and $w = w^*$ in (9); then $z$ is also the optimal solution of (9), since otherwise we could find a feasible $(z', w')$ in (9) with $z' > z$. Combined with the constraint function in (9), this gives $p^* = z < z' \leq f_{\mathcal{S}}(x|w') / f_{\mathcal{S}}(1|w')$, which contradicts the optimality of $p^*$. In the other direction, let $(z^*, w^*)$ be the solution that achieves the optimal value of (9); we claim that $z^*$ is also the optimal value of (8), since otherwise there would exist a feasible $w$ in (8) such that $z \triangleq f_{\mathcal{S}}(x|w) / f_{\mathcal{S}}(1|w) > z^*$. Since $(w, z)$ is also feasible in (9) with $z > z^*$, this contradicts the optimality of $z^*$.

The transformation from (8) to (9) does not make the problem any easier to solve. Rather, it destroys the structure of (8), namely that the objective function of (8) is the ratio of two posynomials. However, the equivalent transformation does reveal some insights about the intrinsic complexity of the optimization problem, indicating that it is hard to solve (8) efficiently with a guarantee of achieving a globally optimal solution.
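To make the posynomial/signomial structure above concrete, here is a toy instance (our own illustrative example, not one from the paper): an SPN consisting of a single sum node with weights $w_1, w_2$ over the two indicators of one Boolean variable $X_1$. Instantiating (8) and (9) for one observation $x$ gives

$$\underset{w \in \mathbb{R}^2_{++}}{\text{maximize}} \;\; \frac{w_1 \mathbb{I}_{x_1} + w_2 \mathbb{I}_{\bar{x}_1}}{w_1 + w_2} \qquad \text{and} \qquad \underset{w,\,z}{\text{minimize}} \; -z \;\; \text{s.t.} \;\; z (w_1 + w_2) - \left(w_1 \mathbb{I}_{x_1} + w_2 \mathbb{I}_{\bar{x}_1}\right) \leq 0, \;\; w \in \mathbb{R}^2_{++}, \; z > 0.$$

The numerator and denominator on the left are both posynomials (all coefficients positive), while the constraint on the right mixes positive and negative terms, i.e., it is a signomial; this is exactly the pattern of (8) and (9).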
C Convergence of CCCP for SPNs

We discussed before that the sequence of function values $\{f(y^{(k)})\}$ converges to a limiting point. However, this fact alone does not necessarily indicate that $\{f(y^{(k)})\}$ converges to $f(y^*)$ where $y^*$ is a stationary point of $f(\cdot)$, nor does it imply that the sequence $\{y^{(k)}\}$ converges as $k \to \infty$. Zangwill's global convergence theory [19] has been successfully applied to study the convergence properties of many iterative algorithms frequently used in machine learning, including EM [17], generalized alternating minimization [8], and also CCCP [11]. Here we also apply Zangwill's theory, combined with the analysis from [11], to show the following theorem:

Theorem 7. Let $\{w^{(k)}\}_{k=1}^{\infty}$ be any sequence generated using Eq. 7 from any positive initial point. Then all the limiting points of $\{w^{(k)}\}_{k=1}^{\infty}$ are stationary points of the DCP in (2). In addition, $\lim_{k \to \infty} f(y^{(k)}) = f(y^*)$, where $y^*$ is some stationary point of (2).

Proof. We will use Zangwill's global convergence theory for iterative algorithms [19] to show the convergence in our case. Before giving the proof, we first introduce the notion of a point-to-set mapping, where the output of the mapping is defined to be a set. More formally, a point-to-set map $\Phi$ from a set $\mathcal{X}$ to $\mathcal{Y}$ is defined as $\Phi : \mathcal{X} \mapsto \mathcal{P}(\mathcal{Y})$, where $\mathcal{P}(\mathcal{Y})$ is the power set of $\mathcal{Y}$. Suppose $\mathcal{X}$ and $\mathcal{Y}$ are equipped with the norms $\|\cdot\|_{\mathcal{X}}$ and $\|\cdot\|_{\mathcal{Y}}$, respectively. A point-to-set map $\Phi$ is said to be closed at $x^* \in \mathcal{X}$ if $x_k \in \mathcal{X}$, $\{x_k\}_{k=1}^{\infty} \to x^*$ and $y_k \in \mathcal{Y}$, $\{y_k\}_{k=1}^{\infty} \to y^*$, $y_k \in \Phi(x_k)$ imply that $y^* \in \Phi(x^*)$. A point-to-set map $\Phi$ is said to be closed on $S \subseteq \mathcal{X}$ if $\Phi$ is closed at every point in $S$. The concept of closedness in the point-to-set map setting reduces to continuity if we restrict the output of $\Phi$ to be a singleton set for every possible input, i.e., when $\Phi$ is a point-to-point mapping.

Theorem 8 (Global Convergence Theorem [19]). Let the sequence $\{x_k\}_{k=1}^{\infty}$ be generated by $x_{k+1} \in \Phi(x_k)$, where $\Phi(\cdot)$ is a point-to-set map from $\mathcal{X}$ to $\mathcal{X}$. Let a solution set $\Gamma \subseteq \mathcal{X}$ be given, and suppose that:

1. all points $x_k$ are contained in a compact set $S \subseteq \mathcal{X}$;
2. $\Phi$ is closed over the complement of $\Gamma$;
3. there is a continuous function $\varphi$ on $\mathcal{X}$ such that:
   (a) if $x \notin \Gamma$, then $\varphi(x') > \varphi(x)$ for all $x' \in \Phi(x)$;
   (b) if $x \in \Gamma$, then $\varphi(x') \geq \varphi(x)$ for all $x' \in \Phi(x)$.

Then all the limit points of $\{x_k\}_{k=1}^{\infty}$ are in the solution set $\Gamma$, and $\varphi(x_k)$ converges monotonically to $\varphi(x^*)$ for some $x^* \in \Gamma$.

Let $w \in \mathbb{R}^{D}_{++}$. Let $\Phi(w^{(k-1)}) = \exp(\arg\max_{y} \hat{f}(y, y^{(k-1)}))$ and let $\varphi(w) = f(\log w) = f(y) = \log f_{\mathcal{S}}(x|\exp(y)) - \log f_{\mathcal{S}}(1|\exp(y))$. Here we use $w$ and $y$ interchangeably, as $w = \exp(y)$ is a one-to-one mapping in each component. Note that since the $\arg\max_{y} \hat{f}(y, y^{(k-1)})$ given $y^{(k-1)}$ is achievable, $\Phi(\cdot)$ is a well-defined point-to-set map for $w \in \mathbb{R}^{D}_{++}$. Specifically, in our case, given $w^{(k-1)}$, at each iteration of Eq. 7 we have

$$w^{(k)}_{ij} \propto w^{(k-1)}_{ij}\, \frac{f_{v_j}(x|w^{(k-1)})}{f_{\mathcal{S}}(x|w^{(k-1)})}\, \frac{\partial f_{\mathcal{S}}(x|w^{(k-1)})}{\partial f_{v_i}(x|w^{(k-1)})},$$

i.e., the point-to-set mapping is given by

$$\Phi_{ij}(w^{(k-1)}) = \frac{w^{(k-1)}_{ij}\, f_{v_j}(x|w^{(k-1)})\, \dfrac{\partial f_{\mathcal{S}}(x|w^{(k-1)})}{\partial f_{v_i}(x|w^{(k-1)})}}{\sum_{j'} w^{(k-1)}_{ij'}\, f_{v_{j'}}(x|w^{(k-1)})\, \dfrac{\partial f_{\mathcal{S}}(x|w^{(k-1)})}{\partial f_{v_i}(x|w^{(k-1)})}}.$$

Let $S = [0, 1]^D$, the $D$-dimensional hypercube. The above update formula shows that $\Phi(w^{(k-1)}) \in S$. Furthermore, if we assume $w^{(1)} \in S$, which can be obtained by local normalization before any update, we can guarantee that $\{w^{(k)}\}_{k=1}^{\infty} \subseteq S$, which is a compact set in $\mathbb{R}^{D}_{+}$.

The solution to $\max_y \hat{f}(y, y^{(k-1)})$ is not unique; in fact, there are infinitely many solutions to these nonlinear equations. However, as defined above, $\Phi(w^{(k-1)})$ returns one solution to this convex program in the $D$-dimensional hypercube. Hence in our case $\Phi(\cdot)$ reduces to a point-to-point map, and the closedness of a point-to-set map reduces to the continuity of a point-to-point map. Define $\Gamma = \{w : w \text{ is a stationary point of } \varphi(\cdot)\}$. Hence we only need to verify the continuity of $\Phi(w)$ for $w \in S$. To show this, we first characterize the functional form of $\partial f_{\mathcal{S}}(x|w) / \partial f_{v_i}(x|w)$, as it is used inside $\Phi(\cdot)$. We claim that for each node $v_i$, $\partial f_{\mathcal{S}}(x|w) / \partial f_{v_i}(x|w)$ is again a posynomial function of $w$. A graphical illustration is given in Fig. 3 to explain the process. This can also be derived from the sum and product rules used during top-down differentiation.

More specifically, if $v_i$ is a product node, let $v_j$, $j = 1, \ldots, J$, be its parents in the network, which are assumed to be sum nodes. The differentiation of $f_{\mathcal{S}}$ with respect to $f_{v_i}$ is given by

$$\frac{\partial f_{\mathcal{S}}(x|w)}{\partial f_{v_i}(x|w)} = \sum_{j=1}^{J} \frac{\partial f_{\mathcal{S}}(x|w)}{\partial f_{v_j}(x|w)} \frac{\partial f_{v_j}(x|w)}{\partial f_{v_i}(x|w)}. \qquad (10)$$

We reach

$$\frac{\partial f_{\mathcal{S}}(x|w)}{\partial f_{v_i}(x|w)} = \sum_{j=1}^{J} w_{ij}\, \frac{\partial f_{\mathcal{S}}(x|w)}{\partial f_{v_j}(x|w)}.$$

Figure 3: Graphical illustration of $\partial f_{\mathcal{S}}(x|w) / \partial f_{v_i}(x|w)$. The partial derivative of $f_{\mathcal{S}}$ with respect to $f_{v_i}$ (in red) is a posynomial that is a product of edge weights lying on the path from the root to $v_i$ and network polynomials from nodes that are children of product nodes on the path (highlighted in blue).

Similarly, if $v_i$ is a sum node and its parents $v_j$, $j = 1, \ldots, J$, are assumed to be product nodes, we have

$$\frac{\partial f_{\mathcal{S}}(x|w)}{\partial f_{v_i}(x|w)} = \sum_{j=1}^{J} \frac{\partial f_{\mathcal{S}}(x|w)}{\partial f_{v_j}(x|w)} \frac{f_{v_j}(x|w)}{f_{v_i}(x|w)}. \qquad (11)$$

Since $v_j$ is a product node and $v_j$ is a parent of $v_i$, the last term in Eq. 11 can be equivalently expressed as

$$\frac{f_{v_j}(x|w)}{f_{v_i}(x|w)} = \prod_{h \neq i} f_{v_h}(x|w),$$

where the index $h$ ranges over all the children of $v_j$ except $v_i$. Combining the fact that the partial derivative of $f_{\mathcal{S}}$ with respect to the root node is 1 with the fact that each $f_v$ is a posynomial function, it follows by induction in top-down order that $\partial f_{\mathcal{S}}(x|w) / \partial f_{v_i}(x|w)$ is also a posynomial function of $w$.

We have shown that both the numerator and the denominator of $\Phi(\cdot)$ are posynomial functions of $w$. Because posynomial functions are continuous, in order to show that $\Phi(\cdot)$ is also continuous on $S \backslash \Gamma$ we need to guarantee that the denominator is not a degenerate posynomial function, i.e., that the denominator of $\Phi(w)$ is nonzero for all possible input vectors $x$. Recall that $\Gamma = \{w : w \text{ is a stationary point of } \varphi(\cdot)\}$; hence every $w \in S \backslash \Gamma$ does not lie on the boundary of the $D$-dimensional hypercube $S$. Hence $w \in S \backslash \Gamma$ implies $w \in \operatorname{int} S$, i.e., $w > 0$ in each component. This immediately leads to $f_v(x|w) > 0$ for every $v$. As a result, $\Phi(w)$ is continuous on $S \backslash \Gamma$, since it is the ratio of two strictly positive posynomial functions.

We now verify the third property in Zangwill's global convergence theory. At each iteration of CCCP, we have the following two cases to consider:

1. If $w^{(k-1)} \notin \Gamma$, i.e., $w^{(k-1)}$ is not a stationary point of $\varphi(w)$, then $y^{(k-1)} \notin \arg\max_y \hat{f}(y, y^{(k-1)})$, so we have $\varphi(w^{(k)}) = f(y^{(k)}) \geq \hat{f}(y^{(k)}, y^{(k-1)}) > \hat{f}(y^{(k-1)}, y^{(k-1)}) = f(y^{(k-1)}) = \varphi(w^{(k-1)})$.

2. If $w^{(k-1)} \in \Gamma$, i.e., $w^{(k-1)}$ is a stationary point of $\varphi(w)$, then $y^{(k-1)} \in \arg\max_y \hat{f}(y, y^{(k-1)})$, so we have $\varphi(w^{(k)}) = f(y^{(k)}) \geq \hat{f}(y^{(k)}, y^{(k-1)}) = \hat{f}(y^{(k-1)}, y^{(k-1)}) = f(y^{(k-1)}) = \varphi(w^{(k-1)})$.

By Zangwill's global convergence theory, we now conclude that all the limit points of $\{w^{(k)}\}_{k=1}^{\infty}$ are in $\Gamma$ and that $\varphi(w^{(k)})$ converges monotonically to $\varphi(w^*)$ for some stationary point $w^* \in \Gamma$.
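For concreteness, the sketch below implements one CCCP iteration on a toy SPN: a bottom-up pass for the node values $f_v$, the top-down differentiation of Eqs. (10) and (11) for $\partial f_{\mathcal{S}} / \partial f_v$, and the multiplicative update with renormalization discussed above. The dict-based encoding, node ids, and the `eps` smoothing constant (cf. Remark 1 below) are illustrative assumptions, not the paper's implementation:

```python
import math

kind = {0: 'sum', 1: 'prod', 2: 'prod', 3: 'leaf', 4: 'leaf', 5: 'leaf', 6: 'leaf'}
children = {0: [1, 2], 1: [3, 5], 2: [4, 6], 3: [], 4: [], 5: [], 6: []}
weights = {(0, 1): 0.5, (0, 2): 0.5}
leaf_vals = {3: 1.0, 5: 1.0, 4: 0.0, 6: 0.0}     # indicator values at one observed x

def topo(root):
    """Parents-before-children order of the DAG (reverse DFS post-order)."""
    seen, post = set(), []
    def dfs(i):
        if i not in seen:
            seen.add(i)
            for j in children[i]:
                dfs(j)
            post.append(i)
    dfs(root)
    return post[::-1]

def forward(root):
    """Bottom-up evaluation of f_v(x|w) for every node v."""
    val = {}
    for i in reversed(topo(root)):                # children before parents
        if kind[i] == 'leaf':
            val[i] = leaf_vals[i]
        elif kind[i] == 'sum':
            val[i] = sum(weights[i, j] * val[j] for j in children[i])
        else:
            val[i] = math.prod(val[j] for j in children[i])
    return val

def backward(root, val):
    """Top-down differentiation: d[i] = df_S/df_{v_i}, via Eqs. (10)-(11)."""
    d = {i: 0.0 for i in kind}
    d[root] = 1.0
    for i in topo(root):
        for j in children[i]:
            if kind[i] == 'sum':
                d[j] += weights[i, j] * d[i]
            else:                                 # product node parent
                d[j] += d[i] * math.prod(val[h] for h in children[i] if h != j)
    return d

def cccp_step(root, eps=1e-12):
    """One application of the map Phi: multiplicative update + renormalization."""
    val = forward(root)
    fS = val[root]
    d = backward(root, val)
    for i in (n for n in kind if kind[n] == 'sum'):
        raw = {j: weights[i, j] * val[j] * d[i] / fS + eps for j in children[i]}
        Z = sum(raw.values())
        for j in children[i]:
            weights[i, j] = raw[j] / Z
    return math.log(fS)                           # log-likelihood before the step

print(cccp_step(0))                               # increases monotonically over steps
```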

Figure 4: A counterexample SPN over two binary random variables where the weights $w_1, w_2, w_3$ are symmetric and indistinguishable.

Remark 1. Technically we need to choose $w^{(0)} \in \operatorname{int} S$ to ensure the continuity of $\Phi(\cdot)$. This initial condition, combined with the fact that inside each iteration of CCCP the algorithm only applies positive multiplicative updates and renormalization, ensures that after any finite number of steps $k$, $w^{(k)} \in \operatorname{int} S$. Theoretically, only in the limit is it possible that some components of $w^{\infty}$ become 0. However, in practice, due to the finite numerical precision of floating-point numbers, it is possible that some of the components of $w^{(k)}$ become 0 after a finite number of update steps. So in practical implementations we recommend using a small positive number $\epsilon$ to smooth out such 0 components in $w^{(k)}$ during the iterations of CCCP. Such smoothing may hurt the monotonicity of CCCP, but this can only happen when $w^{(k)}$ is close to $w^*$, and we can use early stopping to obtain a solution in the interior of $S$.

Remark 2. Thm. 7 only implies that any limiting point of the sequence $\{w^{(k)}\}_{k=1}^{\infty}$ ($\{y^{(k)}\}_{k=1}^{\infty}$) must be a stationary point of the log-likelihood function and that $\{f(y^{(k)})\}_{k=1}^{\infty}$ must converge to some $f(y^*)$ where $y^*$ is a stationary point. Thm. 7 does not imply that the sequence $\{w^{(k)}\}_{k=1}^{\infty}$ ($\{y^{(k)}\}_{k=1}^{\infty}$) is guaranteed to converge. [11] studies the convergence properties of the general CCCP procedure. Under stronger conditions, i.e., the strict concavity of the surrogate function or $\Phi(\cdot)$ being a contraction mapping, it is possible to show that the sequence $\{w^{(k)}\}_{k=1}^{\infty}$ ($\{y^{(k)}\}_{k=1}^{\infty}$) also converges. However, none of these conditions hold in our case. In fact, in general there are infinitely many fixed points of $\Phi(\cdot)$, i.e., the equation $\Phi(w) = w$ has infinitely many solutions in $S$. Also, for a fixed value $t$, if $\Phi(w) = t$ has at least one solution, then there are infinitely many solutions. Such properties of SPNs make it generally very hard to guarantee the convergence of the sequence $\{w^{(k)}\}_{k=1}^{\infty}$ ($\{y^{(k)}\}_{k=1}^{\infty}$). We give a very simple example, shown in Fig. 4, to illustrate this hardness (see also the sketch below). Consider applying the CCCP procedure to learn the parameters of the SPN given in Fig. 4 on the three instances $\{(0, 1), (1, 0), (1, 1)\}$. If we choose the initial parameters $w^{(0)}$ such that the weights over the indicator variables are set as shown in Fig. 4, then any assignment of $(w_1, w_2, w_3)$ in the probability simplex will be equally optimal in terms of the likelihood on these inputs. In this example there are uncountably many equally good solutions, which invalidates the finite-solution-set requirement given in [11] for showing the convergence of $\{w^{(k)}\}_{k=1}^{\infty}$. However, we emphasize that the convergence of the sequence $\{w^{(k)}\}_{k=1}^{\infty}$ is not as important as the convergence of $\{\varphi(w^{(k)})\}_{k=1}^{\infty}$ to desired locations on the log-likelihood surface, since in practice any $w^*$ with equally good log-likelihood may suffice for the inference/prediction task.

It is worth pointing out that the above theorem does not imply the convergence of the sequence $\{w^{(k)}\}_{k=1}^{\infty}$. Thm. 7 only indicates that all the limiting points of $\{w^{(k)}\}_{k=1}^{\infty}$, i.e., the limits of subsequences of $\{w^{(k)}\}_{k=1}^{\infty}$, are stationary points of the DCP in (2). We also present a negative example in Fig. 4 that invalidates the application of Zangwill's global convergence theory to the analysis of sequence convergence in this case.
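The degeneracy behind Fig. 4 can be reproduced in a few lines. The exact leaf weights of the figure are not recoverable here, so the sketch below assumes uniform 1/2-1/2 indicator mixtures under each product node; this assumption already exhibits the phenomenon, namely that every point $(w_1, w_2, w_3)$ of the probability simplex attains the same likelihood on the three instances, so the set of optima is uncountable:

```python
import math
import random

def spn(x, w):
    # Each mixture component is a product of two uniform indicator mixtures,
    # so it evaluates to 0.5 * 0.5 for every complete instantiation x.
    # (Assumed weights; the figure's exact values are not recoverable.)
    component = 0.5 * 0.5
    return sum(wi * component for wi in w)

data = [(0, 1), (1, 0), (1, 1)]
lls = []
for _ in range(5):
    w = [random.random() for _ in range(3)]
    w = [wi / sum(w) for wi in w]                # random point on the simplex
    lls.append(sum(math.log(spn(x, w)) for x in data))
assert max(lls) - min(lls) < 1e-12               # identical log-likelihoods
```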
The convergence rate of general CCCP is still an open problem [11]. [16] studied the convergence rate of unconstrained bound optimization algorithms with differentiable objective functions, of which our problem is a special case. The conclusion is that, depending on the curvature of $f_1$ and $f_2$ (which are functions of the training data), CCCP will exhibit either a quasi-Newton behavior with superlinear convergence or first-order convergence. We show in experiments that CCCP normally exhibits a fast, superlinear convergence rate compared with PGD, EG, and SMA.

Both CCCP and EM are special cases of a more general framework known as Majorization-Maximization. We show that in the case of SPNs these two algorithms coincide with each other, i.e., they lead to the same update formulas despite the fact that they start from totally different perspectives.

D Experiment Details

D.1 Methods

We briefly review the current approach for training SPNs using projected gradient descent (PGD). Another related approach is to use exponentiated gradient (EG) [10] to optimize (8). PGD optimizes the log-likelihood by projecting the intermediate solution back onto the positive orthant after each gradient update. Since the constraint in (8) is an open set, we need to manually create a closed set on which the projection operation is well defined. One feasible choice is to project onto $\mathbb{R}^{D}_{\geq \epsilon} \triangleq \{w \in \mathbb{R}^{D}_{++} \mid w_d \geq \epsilon, \forall d\}$, where $\epsilon > 0$ is assumed to be very small. To avoid the projection, one direct solution is to use the exponentiated gradient (EG) method [10], which was first applied in an online setting and later successfully extended to batch settings when training with convex models. EG admits a multiplicative update at each iteration and hence avoids the need for projection in PGD. However, EG is mostly applied in the convex setting, and it is not clear whether its convergence guarantees still hold in the nonconvex setting. Both updates are sketched below.
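The two baseline updates just described can be summarized in a few lines; the step size `eta`, the threshold `eps`, and the example arrays below are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def pgd_step(w, grad, eta=0.1, eps=1e-6):
    """Gradient ascent step followed by projection onto {w : w_d >= eps}."""
    return np.maximum(w + eta * grad, eps)

def eg_step(w, grad, eta=0.1):
    """Exponentiated-gradient step: multiplicative, so w stays positive."""
    return w * np.exp(eta * grad)

w = np.array([0.5, 0.5])
grad = np.array([1.0, -1.0])     # a made-up log-likelihood gradient
print(pgd_step(w, grad), eg_step(w, grad))
```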

D.2 Experimental Setup

The sizes of the SPNs produced by LearnSPN and ID-SPN are shown in Table 3.

Table 3: Sizes of SPNs produced by LearnSPN and ID-SPN.

Data set      LearnSPN     ID-SPN
NLTCS         1,7          24,690
MSNBC         54,89        579,64
KDD 2k        48,279       1,286,657
Plants        12,959       2,06,708
Audio         79,525       2,64,948
Jester        14,01        4,225,471
Netflix       161,655      7,958,088
Accidents     204,501      2,27,186
Retail        56,91        60,961
Pumsb-star    140,9        1,751,092
DNA           108,021      3,228,616
Kosarak       20,21        1,272,981
MSWeb         68,85        1,886,777
Book          190,625      1,445,501
EachMovie     522,75       2,440,864
WebKB         1,49,751     2,605,141
Reuters-52    2,210,25     4,56,861
20 Newsgrp    14,561,965   3,485,029
BBC           1,879,921    2,426,602
Ad            4,1,421      2,087,25

We list the detailed statistics of the 20 data sets used in the experiments in Table 4. Table 5 shows the detailed running time of PGD, EG, SMA, and CCCP on the 20 data sets, measured in seconds.

Table 4: Statistics of data sets and models. $N$ is the number of variables modeled by the network, $|\mathcal{S}|$ is the size of the network, and $D$ is the number of parameters to be estimated in the network. $N \cdot V / D$ is the ratio of the number of training instances times the number of variables to the number of parameters.

Data set      N      |S|          D           Train     Valid    Test     N·V/D
NLTCS         16     1,7          1,716       16,181    2,157    3,236    150.871
MSNBC         17     54,89        24,452      291,326   38,843   58,265   202.541
KDD 2k        64     48,279       14,292      180,092   19,907   34,955   806.457
Plants        69     12,959       58,853      17,412    2,321    3,482    20.414
Audio         100    79,525       196,103     15,000    2,000    3,000    7.649
Jester        100    14,01        180,750     9,000     1,000    4,116    4.979
Netflix       100    161,655      51,601      15,000    2,000    3,000    29.069
Accidents     111    204,501      74,804      12,758    1,700    2,551    18.931
Retail        135    56,91        22,113      22,041    2,938    4,408    134.560
Pumsb-star    163    140,9        63,173      12,262    1,635    2,452    31.638
DNA           180    108,021      52,121      1,600     400      1,186    5.526
Kosarak       190    20,21        53,204      33,375    4,450    6,675    119.187
MSWeb         294    68,85        20,346      29,441    3,270    5,000    425.423
Book          500    190,625      41,122      8,700     1,159    1,739    105.783
EachMovie     500    522,75       188,387     4,524     1,002    591      12.007
WebKB         839    1,49,751     879,893     2,803     558      838      2.673
Reuters-52    889    2,210,25     1,453,390   6,532     1,028    1,540    3.995
20 Newsgrp    910    14,561,965   8,295,407   11,293    3,764    3,764    1.239
BBC           1058   1,879,921    1,222,536   1,670     225      330      1.445
Ad            1556   4,1,421      1,380,676   2,461     327      491      2.774

Table 5: Running time of the 4 algorithms on 20 data sets, measured in seconds.

Data set      PGD         EG         SMA        CCCP
NLTCS         48.5        718.98     458.99     206.10
MSNBC         2720.7      2917.72    8078.41    2008.07
KDD 2k        4688.60     22154.10   27101.50   29541.20
Plants        12595.60    10752.10   7604.09    1049.80
Audio         19647.90    40.69      12801.70   1407.0
Jester        6099.44     6272.80    4082.65    191.41
Netflix       2957.10     2791.50    15080.50   8400.20
Accidents     14266.50    41.82      5776.00    2045.90
Retail        28669.50    7729.89    9866.94    5200.20
Pumsb-star    115.58      1872.80    4864.72    277.54
DNA           599.9       199.6      727.56     180.6
Kosarak       122204.00   11227.00   49120.50   42809.0
MSWeb         16524.00    1478.10    65221.20   4512.0
Book          19098.00    6487.84    6970.50    2076.40
EachMovie     0071.60     279.60     17751.10   60184.00
WebKB         12088.00    50290.90   44004.50   168142.00
Reuters-52    1092.10     548.5      2060.70    1194.1
20 Newsgrp    15124.00    96025.80   17921.00   1101.80
BBC           20920.60    18065.00   6952.20    440.7
Ad            12246.40    218.08     1246.70    71.48