A Unified Approach for Learning the Parameters of Sum-Product Networks
Han Zhao (Machine Learning Dept., Carnegie Mellon University), Pascal Poupart (School of Computer Science, University of Waterloo), Geoff Gordon (Machine Learning Dept., Carnegie Mellon University)

Abstract

We present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.

1 Introduction

Sum-product networks (SPNs) are new deep graphical model architectures that admit exact probabilistic inference in linear time in the size of the network [14]. Similar to traditional graphical models, there are two main problems when learning SPNs: structure learning and parameter learning. Parameter learning is interesting even if we know the ground-truth structure ahead of time; structure learning depends on parameter learning, so better parameter learning can often lead to better structure learning. Poon and Domingos [14] and Gens and Domingos [6] proposed both generative and discriminative learning algorithms for parameters in SPNs. At a high level, these approaches view SPNs as deep architectures and apply projected gradient descent (PGD) to optimize the data log-likelihood.
There are several drawbacks associated with PGD. For example, the projection step in PGD hurts the convergence of the algorithm and it will often lead to solutions on the boundary of the feasible region. Also, PGD contains an additional arbitrary parameter, the projection margin, which can be hard to set well in practice. In [14, 6], the authors also mentioned the possibility of applying EM algorithms to train SPNs by viewing sum nodes in SPNs as hidden variables. They presented an EM update formula without details. However, the update formula for EM given in [14, 6] is incorrect, as first pointed out and corrected by [12]. In this paper we take a different perspective and present a unified framework, which treats [14, 6] as special cases, for learning the parameters of SPNs. We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we can precisely characterize the functional form of the objective function based on the network structure. We show that the optimization problem associated with learning the parameters of SPNs based on the MLE principle can be formulated as a signomial program (SP), where both PGD and exponentiated gradient (EG) can be viewed as first-order approximations of the signomial program after suitable transformations of the objective function.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We also show that the signomial program formulation can be equivalently transformed into a difference of convex functions (DCP) formulation, where the objective function of the program can be naturally expressed as a difference of two convex functions. The DCP formulation allows us to develop two efficient optimization algorithms for learning the parameters of SPNs based on sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. Both proposed approaches naturally admit multiplicative updates, hence effectively deal with the positivity constraints of the optimization. Furthermore, under our unified framework, we also show that CCCP leads to the same algorithm as EM despite that these two approaches are different from each other in general. Although we mainly focus on MLE-based parameter learning, the mixture model interpretation of SPNs also helps to develop a Bayesian learning method for SPNs [21]. PGD, EG, SMA and CCCP can all be viewed as different levels of convex relaxation of the original SP. Hence the framework also provides an intuitive way to compare all four approaches. We conduct extensive experiments on 20 benchmark data sets to compare the empirical performance of PGD, EG, SMA and CCCP. Experimental results validate our theoretical analysis that CCCP is the best among all four approaches, showing that it converges consistently faster and with more stability than the other three methods. Furthermore, we use CCCP to boost the performance of LearnSPN [7], showing that it can achieve results comparable to state-of-the-art structure learning algorithms using SPNs with much smaller network sizes.

2 Background

2.1 Sum-Product Networks

To simplify the discussion of the main idea of our unified framework, we focus our attention on SPNs over Boolean random variables. However, the framework presented here is general and can be easily extended to other discrete and continuous random variables. We first define the notion of network polynomial.
We use $\mathbb{I}_x$ to denote an indicator variable that returns 1 when $X = x$ and 0 otherwise.

Definition 1 (Network Polynomial [4]). Let $f(\cdot) \ge 0$ be an unnormalized probability distribution over a Boolean random vector $X_{1:N}$. The network polynomial of $f(\cdot)$ is a multilinear function $\sum_x f(x) \prod_{n=1}^N \mathbb{I}_{x_n}$ of indicator variables, where the summation is over all possible instantiations of the Boolean random vector $X_{1:N}$.

A Sum-Product Network (SPN) over Boolean variables $X_{1:N}$ is a rooted DAG that computes the network polynomial over $X_{1:N}$. The leaves are univariate indicators of Boolean variables and internal nodes are either sum or product. Each sum node computes a weighted sum of its children and each product node computes the product of its children. The scope of a node in an SPN is defined as the set of variables that have indicators among the node's descendants: for any node $v$ in an SPN, if $v$ is a terminal node, say, an indicator variable over $X$, then $\mathrm{scope}(v) = \{X\}$, else $\mathrm{scope}(v) = \bigcup_{\tilde{v} \in \mathrm{Ch}(v)} \mathrm{scope}(\tilde{v})$. An SPN is complete iff each sum node has children with the same scope. An SPN is decomposable iff for every product node $v$, $\mathrm{scope}(v_i) \cap \mathrm{scope}(v_j) = \emptyset$ where $v_i, v_j \in \mathrm{Ch}(v), i \ne j$. The scope of the root node is $\{X_1, \ldots, X_N\}$. In this paper, we focus on complete and decomposable SPNs. For a complete and decomposable SPN $S$, each node $v$ in $S$ defines a network polynomial $f_v(\cdot)$ which corresponds to the sub-SPN (subgraph) rooted at $v$. The network polynomial of $S$, denoted by $f_S$, is the network polynomial defined by the root of $S$, which can be computed recursively from its children. The probability distribution induced by an SPN $S$ is defined as $\Pr_S(x) \triangleq \frac{f_S(x)}{\sum_x f_S(x)}$. The normalization constant $\sum_x f_S(x)$ can be computed in $O(|S|)$ by setting the values of all the leaf nodes to 1, i.e., $\sum_x f_S(x) = f_S(\mathbf{1})$ [14]. This leads to efficient joint/marginal/conditional inference in SPNs.

2.2 Signomial Programming (SP)

Before introducing SP, we first introduce geometric programming (GP), which is a strict subclass of SP.
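The definitions of §2.1 can be made concrete with a short sketch (hypothetical class names, not the authors' implementation): evaluating a complete and decomposable SPN bottom-up computes the network polynomial, and setting an input to `None` sets its indicators to 1, which marginalizes that variable and yields the normalization constant $f_S(\mathbf{1})$.

```python
class Leaf:
    """Indicator I[X_var = value]; x[var] = None marginalizes the variable."""
    def __init__(self, var, value):
        self.var, self.value = var, value
    def eval(self, x):
        return 1.0 if x[self.var] is None or x[self.var] == self.value else 0.0

class Sum:
    """Weighted sum of children (completeness: children share one scope)."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights
    def eval(self, x):
        return sum(w * c.eval(x) for w, c in zip(self.weights, self.children))

class Product:
    """Product of children (decomposability: pairwise disjoint scopes)."""
    def __init__(self, children):
        self.children = children
    def eval(self, x):
        v = 1.0
        for c in self.children:
            v *= c.eval(x)
        return v

# A small SPN over two Boolean variables X0, X1.
spn = Sum([Product([Leaf(0, 1), Leaf(1, 1)]),
           Product([Leaf(0, 0), Leaf(1, 1)]),
           Product([Leaf(0, 0), Leaf(1, 0)])],
          [0.2, 0.3, 0.5])

z = spn.eval([None, None])    # normalization constant f_S(1) = 0.2 + 0.3 + 0.5
prob = spn.eval([0, 1]) / z   # Pr(X0 = 0, X1 = 1)
```

Both queries cost a single bottom-up pass, which is the linear-time inference property mentioned above.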
A monomial is defined as a function $h : \mathbb{R}^n_{++} \to \mathbb{R}$ with $h(x) = c\, x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$, where the domain is restricted to the positive orthant ($\mathbb{R}^n_{++}$), the coefficient $c$ is positive and the exponents satisfy $a_i \in \mathbb{R}, \forall i$. A posynomial is a sum of monomials: $g(x) = \sum_{k=1}^K c_k\, x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}}$. One of the key properties of posynomials is positivity, which allows us to transform any posynomial into the log-domain. A GP in standard form is defined to be an optimization problem where both the objective function and the inequality constraints are posynomials and the equality constraints are monomials. There is also an implicit constraint that $x \in \mathbb{R}^n_{++}$. A GP in its standard form is not a convex program since posynomials are not convex functions in general. However, we can effectively transform it into a convex problem by using the logarithmic transformation trick on $x$, the multiplicative coefficients of each monomial and also each objective/constraint function [3, 1]. An SP has the same form as a GP except that the multiplicative constant inside each monomial is not restricted to be positive, i.e., $c$ can take any real value. Although the difference seems small, there is a huge difference between GP and SP from the computational perspective. The negative multiplicative constants in monomials invalidate the logarithmic transformation trick frequently used in GP. As a result, SPs cannot be reduced to convex programs and are believed to be hard to solve in general [1].

3 Unified Approach for Learning

In this section we will show that the parameter learning problem of SPNs based on the MLE principle can be formulated as an SP. We will use a sequence of optimal monomial approximations combined with backtracking line search and the concave-convex procedure to tackle the SP. Due to space constraints, we refer interested readers to the supplementary material for all the proof details.

3.1 Sum-Product Networks as a Mixture of Trees

We introduce the notion of induced trees from SPNs and use it to show that every complete and decomposable SPN can be interpreted as a mixture of induced trees, where each induced tree corresponds to a product of univariate distributions. From this perspective, an SPN can be understood as a huge mixture model where the effective number of components in the mixture is determined by its network structure.
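The logarithmic transformation trick for posynomials can be checked numerically. In the sketch below (the posynomial $g$ is an arbitrary example, not taken from the paper), substituting $x = \exp(y)$ and taking logs turns $g$ into a log-sum-exp of affine functions of $y$, hence a convex function, which a midpoint-convexity check illustrates:

```python
import math

def g(x1, x2):
    # An example posynomial: 2*x1*x2^0.5 + 3*x1^-1, defined on x1, x2 > 0.
    return 2 * x1 * math.sqrt(x2) + 3 / x1

def F(y1, y2):
    # F(y) = log g(exp(y)): a log-sum-exp of affine functions, convex in y.
    return math.log(g(math.exp(y1), math.exp(y2)))

# Midpoint-convexity check at two sample points in the transformed space.
a, b = (0.0, 0.0), (1.0, -1.0)
mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
assert F(*mid) <= (F(*a) + F(*b)) / 2 + 1e-12
```

With a negative coefficient (a signomial), $g(\exp(y))$ can be negative, so its log is not even defined everywhere; this is exactly why the trick fails for SPs.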
The method we describe here is not the first for interpreting an SPN (or the related arithmetic circuit) as a mixture distribution [20, 5, 2], but the new method can result in an exponentially smaller mixture; see the end of this section for more details.

Definition 2 (Induced SPN). Given a complete and decomposable SPN $S$ over $X_{1:N}$, let $\mathcal{T} = (\mathcal{T}_V, \mathcal{T}_E)$ be a subgraph of $S$. $\mathcal{T}$ is called an induced SPN from $S$ if

1. $\mathrm{Root}(S) \in \mathcal{T}_V$.
2. If $v \in \mathcal{T}_V$ is a sum node, then exactly one child of $v$ in $S$ is in $\mathcal{T}_V$, and the corresponding edge is in $\mathcal{T}_E$.
3. If $v \in \mathcal{T}_V$ is a product node, then all the children of $v$ in $S$ are in $\mathcal{T}_V$, and the corresponding edges are in $\mathcal{T}_E$.

Theorem 1. If $\mathcal{T}$ is an induced SPN from a complete and decomposable SPN $S$, then $\mathcal{T}$ is a tree that is complete and decomposable.

As a result of Thm. 1, we will use the terms induced SPN and induced tree interchangeably. With some abuse of notation, we use $\mathcal{T}(x)$ to mean the value of the network polynomial of $\mathcal{T}$ with input vector $x$.

Theorem 2. If $\mathcal{T}$ is an induced tree from $S$ over $X_{1:N}$, then $\mathcal{T}(x) = \prod_{(v_i,v_j)\in\mathcal{T}_E} w_{ij} \prod_{n=1}^N \mathbb{I}_{x_n}$, where $w_{ij}$ is the edge weight of $(v_i, v_j)$ if $v_i$ is a sum node and $w_{ij} = 1$ if $v_i$ is a product node.

Remark. Although we focus our attention on Boolean random variables for simplicity of discussion and illustration, Thm. 2 can be extended to the case where the univariate distributions at the leaf nodes are continuous or discrete distributions with countably infinitely many values, e.g., Gaussian or Poisson distributions. We can simply replace the product of univariate distributions term, $\prod_{n=1}^N \mathbb{I}_{x_n}$, in Thm. 2 with the general form $\prod_{n=1}^N p_n(x_n)$, where $p_n(X_n)$ is a univariate distribution over $X_n$. Also note that it is possible for two unique induced trees to share the same product of univariate distributions, but in this case their weight terms $\prod_{(v_i,v_j)\in\mathcal{T}_E} w_{ij}$ are guaranteed to be different. As we will see shortly, Thm. 2 implies that the joint distribution over $\{X_n\}_{n=1}^N$ represented by an SPN is essentially a mixture model with potentially exponentially many components.
Definition 3 (Network cardinality). The network cardinality $\tau_S$ of an SPN $S$ is the number of unique induced trees.

Theorem 3. $\tau_S = f_S(\mathbf{1} \mid \mathbf{1})$, where $f_S(\mathbf{1} \mid \mathbf{1})$ is the value of the network polynomial of $S$ with input vector $\mathbf{1}$ and all edge weights set to 1.

Theorem 4. $S(x) = \sum_{t=1}^{\tau_S} \mathcal{T}_t(x)$, where $\mathcal{T}_t$ is the $t$th unique induced tree of $S$.

Remark. The above four theorems prove that an SPN $S$ is an ensemble or mixture of trees, where each tree computes an unnormalized distribution over $X_{1:N}$. The total number of unique trees in $S$ is the network cardinality $\tau_S$, which depends only on the structure of $S$. Each component is a simple product of univariate distributions. We illustrate the theorems above with a simple example in Fig. 1.

Figure 1: A complete and decomposable SPN is a mixture of induced trees: a root sum node with weights $w_1, w_2, w_3$ decomposes as $S = w_1 \mathcal{T}_1 + w_2 \mathcal{T}_2 + w_3 \mathcal{T}_3$. Double circles indicate univariate distributions over $X_1$ and $X_2$. Different colors are used to highlight unique induced trees; each induced tree is a product of univariate distributions over $X_1$ and $X_2$.

Zhao et al. [20] show that every complete and decomposable SPN is equivalent to a bipartite Bayesian network with a layer of hidden variables and a layer of observable random variables. The number of hidden variables in the bipartite Bayesian network equals the number of sum nodes in $S$. A naive expansion of such a Bayesian network into a mixture model leads to a huge mixture with $2^{O(M)}$ components, where $M$ is the number of sum nodes in $S$. Here we complement their theory and show that each complete and decomposable SPN is essentially a mixture of trees whose effective number of unique induced trees is given by $\tau_S$. Note that $\tau_S = f_S(\mathbf{1}\mid\mathbf{1})$ depends only on the network structure, and can often be much smaller than $2^{O(M)}$. Without loss of generality, assuming that in $S$ layers of sum nodes alternate with layers of product nodes, $f_S(\mathbf{1}\mid\mathbf{1}) = \Omega(2^h)$, where $h$ is the height of $S$.
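Thm. 3 gives a direct way to count induced trees: evaluate the network with every input indicator and every edge weight set to 1, so sum nodes add their children's counts and product nodes multiply them. A minimal sketch (hypothetical class names, not the authors' code):

```python
class Leaf:
    def count(self):
        return 1

class Sum:
    def __init__(self, children):
        self.children = children
    def count(self):
        # All weights set to 1: a sum node adds its children's tree counts.
        return sum(c.count() for c in self.children)

class Product:
    def __init__(self, children):
        self.children = children
    def count(self):
        # A product node multiplies its children's tree counts.
        p = 1
        for c in self.children:
            p *= c.count()
        return p

# Alternating sum/product layers: counts multiply through product nodes,
# so the number of induced trees grows exponentially with the height.
inner = lambda: Sum([Leaf(), Leaf()])
root = Sum([Product([inner(), inner()]), Product([inner(), inner()])])
assert root.count() == 2 * 2 + 2 * 2   # 8 unique induced trees
```

Adding one more sum/product layer squares the per-branch count, matching the $\Omega(2^h)$ growth stated above, while the DAG itself stays small.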
However, the exponentially many trees are recursively merged and combined in $S$ such that the overall network size is still tractable.

3.2 Maximum Likelihood Estimation as SP

Let's consider the likelihood function computed by an SPN $S$ over $N$ binary random variables with model parameters $w$ and input vector $x \in \{0,1\}^N$. Here the model parameters in $S$ are the edge weights from every sum node, and we collect them together into a long vector $w \in \mathbb{R}^D_{++}$, where $D$ corresponds to the number of edges emanating from sum nodes in $S$. By definition, the probability distribution induced by $S$ can be computed by $\Pr_S(x \mid w) \triangleq \frac{f_S(x \mid w)}{\sum_x f_S(x \mid w)} = \frac{f_S(x \mid w)}{f_S(\mathbf{1} \mid w)}$.

Corollary 5. Let $S$ be an SPN with weights $w \in \mathbb{R}^D_{++}$ over input vector $x \in \{0,1\}^N$. The network polynomial $f_S(x \mid w)$ is a posynomial: $f_S(x \mid w) = \sum_{t=1}^{f_S(\mathbf{1}\mid\mathbf{1})} \prod_{n=1}^N \mathbb{I}^{(t)}_{x_n} \prod_{d=1}^D w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}$, where $\mathbb{I}[w_d \in \mathcal{T}_t]$ indicates whether $w_d$ is in the $t$-th induced tree $\mathcal{T}_t$ or not. Each monomial corresponds exactly to a unique induced tree SPN from $S$.

The above statement is a direct corollary of Thm. 2, Thm. 3 and Thm. 4. From the definition of network polynomial, we know that $f_S$ is a multilinear function of the indicator variables. Corollary 5 works as a complement to characterize the functional form of a network polynomial in terms of $w$. It follows that the likelihood function $L_S(w) \triangleq \Pr_S(x \mid w)$ can be expressed as the ratio of two posynomial functions. We now show that the optimization problem based on MLE is an SP. Using the definition of $\Pr_S(x \mid w)$ and Corollary 5, and letting $\tau = f_S(\mathbf{1}\mid\mathbf{1})$, the MLE problem can be rewritten as

$$\underset{w}{\text{maximize}} \quad \frac{f_S(x \mid w)}{f_S(\mathbf{1} \mid w)} = \frac{\sum_{t=1}^{\tau} \prod_{n=1}^N \mathbb{I}^{(t)}_{x_n} \prod_{d=1}^D w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}}{\sum_{t=1}^{\tau} \prod_{d=1}^D w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}} \qquad \text{subject to} \quad w \in \mathbb{R}^D_{++} \tag{1}$$
Proposition 6. The MLE problem for SPNs is a signomial program.

Being non-convex in general, SP is essentially hard to solve from a computational perspective [1, 3]. However, despite the hardness of SP in general, the objective function in the MLE formulation of SPNs has a special structure, i.e., it is the ratio of two posynomials, which makes the design of efficient optimization algorithms possible.

3.3 Difference of Convex Functions

Both PGD and EG are first-order methods and they can be viewed as approximating the SP after applying a logarithmic transformation to the objective function only. Although (1) is a signomial program, its objective function is expressed as the ratio of two posynomials. Hence, we can still apply the logarithmic transformation trick used in geometric programming to its objective function and to the variables to be optimized. More concretely, let $w_d = \exp(y_d), \forall d$, and take the log of the objective function; the problem becomes equivalent to maximizing the following new objective without any constraint on $y$:

$$\underset{y}{\text{maximize}} \quad \log\!\left(\sum_{t=1}^{\tau(x)} \exp\!\left(\sum_{d=1}^D y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\right)\right) - \log\!\left(\sum_{t=1}^{\tau} \exp\!\left(\sum_{d=1}^D y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\right)\right) \tag{2}$$

Note that in the first term of Eq. 2 the upper index $\tau(x) \le \tau \triangleq f_S(\mathbf{1}\mid\mathbf{1})$ depends on the current input $x$. By transforming into log-space, we naturally guarantee the positivity of the solution at each iteration, hence turning a constrained optimization problem into an unconstrained one without any sacrifice. Both terms in Eq. 2 are convex functions in $y$ after the transformation. Hence, the transformed objective function is now expressed as the difference of two convex functions, which is called a DC function [9]. This helps us to design two efficient algorithms to solve the problem based on the general idea of sequential convex approximations for nonlinear programming.

3.3.1 Sequential Monomial Approximation

Let's consider the linearization of both terms in Eq. 2 in order to apply first-order methods in the transformed space.
To compute the gradient with respect to different components of $y$, we view each node of an SPN as an intermediate function of the network polynomial and apply the chain rule to back-propagate the gradient. The differentiation of $f_S(x \mid w)$ with respect to the root node of the network is set to 1. The differentiation of the network polynomial with respect to a partial function at each node can then be computed in two passes of the network: the bottom-up pass evaluates the values of all partial functions given the current input $x$, and the top-down pass differentiates the network polynomial with respect to each partial function. Following the evaluation-differentiation passes, the gradient of the objective function in (2) can be computed in $O(|S|)$. Furthermore, although the computation is conducted in $y$, the results are fully expressed in terms of $w$, which suggests that in practice we do not need to explicitly construct $y$ from $w$. Let $f(y) = \log f_S(x \mid \exp(y)) - \log f_S(\mathbf{1} \mid \exp(y))$. It follows that approximating $f(y)$ with the best linear function is equivalent to using the best monomial approximation of the signomial program (1). This leads to a sequential monomial approximation of the original SP formulation: at each iteration $y^{(k)}$, we linearize both terms in Eq. 2 and form the optimal monomial function in terms of $w^{(k)}$. The additive update of $y^{(k)}$ leads to a multiplicative update of $w^{(k)}$ since $w^{(k)} = \exp(y^{(k)})$, and we use a backtracking line search to determine the step size of the update in each iteration.

3.3.2 Concave-convex Procedure

Sequential monomial approximation fails to use the structure of the problem when learning SPNs. Here we propose another approach based on the concave-convex procedure (CCCP) [18] to exploit the fact that the objective function is expressed as the difference of two convex functions. At a high level, CCCP solves a sequence of concave surrogate optimizations until convergence.
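The two-pass evaluation-differentiation scheme can be sketched as follows (a minimal sketch with assumed data structures, not the authors' C++ implementation); the bottom-up pass computes every node value $f_v$, and the top-down pass accumulates $\partial f_S / \partial f_v$, from which the gradient with respect to an edge weight $w_{ij}$ is $\partial f_S / \partial f_{v_i} \cdot f_{v_j}$:

```python
class Node:
    _next = 0
    def __init__(self, kind, children=(), weights=()):
        self.kind, self.children, self.weights = kind, list(children), list(weights)
        self.id = Node._next
        Node._next += 1

def forward(order, leaf_vals):
    """Bottom-up pass; `order` lists nodes children-before-parents."""
    val = dict(leaf_vals)                       # node id -> value f_v
    for n in order:
        if n.kind == 'sum':
            val[n.id] = sum(w * val[c.id] for w, c in zip(n.weights, n.children))
        elif n.kind == 'prod':
            v = 1.0
            for c in n.children:
                v *= val[c.id]
            val[n.id] = v
    return val

def backward(order, val, root):
    """Top-down pass; d[v] = d f_S / d f_v, with d[root] = 1."""
    d = {n.id: 0.0 for n in order}
    d[root.id] = 1.0
    for n in reversed(order):
        if n.kind == 'sum':
            for w, c in zip(n.weights, n.children):
                d[c.id] += w * d[n.id]
        elif n.kind == 'prod':
            for c in n.children:
                other = 1.0
                for o in n.children:
                    if o is not c:
                        other *= val[o.id]
                d[c.id] += d[n.id] * other
    return d

# Tiny example: a product of two sums over indicator leaves, all inputs 1.
a, b, c, e = (Node('leaf') for _ in range(4))
s1, s2 = Node('sum', [a, b], [0.3, 0.7]), Node('sum', [c, e], [0.5, 0.5])
root = Node('prod', [s1, s2])
order = [a, b, c, e, s1, s2, root]
val = forward(order, {n.id: 1.0 for n in (a, b, c, e)})
d = backward(order, val, root)
grad_w = d[s1.id] * val[a.id]   # gradient of f_S w.r.t. the weight on s1 -> a
```

Both passes touch each edge a constant number of times, giving the $O(|S|)$ cost stated above.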
In many cases, the maximum of a concave surrogate function can only be found using other convex solvers, and as a result the efficiency of CCCP depends heavily on the choice of convex solver. However, we show that by a suitable transformation of the network we can compute the maximum of the concave surrogate in closed form in time linear in the network size, which leads to a very efficient algorithm for learning the parameters of SPNs. We also prove the convergence properties of our algorithm. Consider the objective function to be maximized in the DCP: $f(y) = \log f_S(x \mid \exp(y)) - \log f_S(\mathbf{1} \mid \exp(y)) \triangleq f_1(y) + f_2(y)$, where $f_1(y) \triangleq \log f_S(x \mid \exp(y))$ is a convex function and $f_2(y) \triangleq -\log f_S(\mathbf{1} \mid \exp(y))$ is a concave function. We can linearize only the convex part $f_1(y)$ to obtain a surrogate function

$$\hat{f}(y, z) = f_1(z) + \nabla_z f_1(z)^T (y - z) + f_2(y) \tag{3}$$

for $\forall y, z \in \mathbb{R}^D$. Now $\hat{f}(y, z)$ is a concave function in $y$. Due to the convexity of $f_1(y)$ we have $f_1(y) \ge f_1(z) + \nabla_z f_1(z)^T (y - z), \forall y, z$, and as a result the following two properties always hold for $\forall y, z$: $\hat{f}(y, z) \le f(y)$ and $\hat{f}(y, y) = f(y)$. CCCP updates $y$ at each iteration $k$ by solving $y^{(k)} \in \arg\max_y \hat{f}(y, y^{(k-1)})$ unless we already have $y^{(k-1)} \in \arg\max_y \hat{f}(y, y^{(k-1)})$, in which case a generalized fixed point $y^{(k-1)}$ has been found and the algorithm stops.

It is easy to show that at each iteration of CCCP we always have $f(y^{(k)}) \ge f(y^{(k-1)})$. Note also that $f(y)$ computes the log-likelihood of input $x$ and is therefore bounded above by 0. By the monotone convergence theorem, $\lim_{k\to\infty} f(y^{(k)})$ exists and the sequence $\{f(y^{(k)})\}$ converges. We now discuss how to compute a closed-form solution for the maximization of the concave surrogate $\hat{f}(y, y^{(k-1)})$. Since $\hat{f}(y, y^{(k-1)})$ is differentiable and concave for any fixed $y^{(k-1)}$, a sufficient and necessary condition for its maximum is

$$\nabla_y \hat{f}(y, y^{(k-1)}) = \nabla_{y^{(k-1)}} f_1(y^{(k-1)}) + \nabla_y f_2(y) = 0 \tag{4}$$

In the above equation, if we consider only the partial derivative with respect to $y_{ij}$ ($w_{ij}$), we obtain

$$\frac{w^{(k-1)}_{ij} f_{v_j}(x \mid w^{(k-1)})}{f_S(x \mid w^{(k-1)})} \frac{\partial f_S(x \mid w^{(k-1)})}{\partial f_{v_i}(x \mid w^{(k-1)})} = \frac{w_{ij} f_{v_j}(\mathbf{1} \mid w)}{f_S(\mathbf{1} \mid w)} \frac{\partial f_S(\mathbf{1} \mid w)}{\partial f_{v_i}(\mathbf{1} \mid w)} \tag{5}$$

Eq. 5 leads to a system of $D$ nonlinear equations, which is hard to solve in closed form. However, if we do a change of variables by considering locally normalized weights $w'_{ij}$ (i.e., $w'_{ij} \ge 0$ and $\sum_j w'_{ij} = 1, \forall i$), then a solution can be easily computed.
As described in [13, 20], any SPN can be transformed into an equivalent normal SPN with locally normalized weights in a bottom-up pass as follows:

$$w'_{ij} = \frac{w_{ij} f_{v_j}(\mathbf{1} \mid w)}{\sum_j w_{ij} f_{v_j}(\mathbf{1} \mid w)} \tag{6}$$

We can then replace $w_{ij} f_{v_j}(\mathbf{1} \mid w)$ in the above equation by the expression it is equal to in Eq. 5 to obtain a closed-form solution:

$$w'_{ij} \propto w^{(k-1)}_{ij} \frac{f_{v_j}(x \mid w^{(k-1)})}{f_S(x \mid w^{(k-1)})} \frac{\partial f_S(x \mid w^{(k-1)})}{\partial f_{v_i}(x \mid w^{(k-1)})} \tag{7}$$

Note that in the above derivation both $f_{v_i}(\mathbf{1} \mid w)/f_S(\mathbf{1} \mid w)$ and $\partial f_S(\mathbf{1} \mid w)/\partial f_{v_i}(\mathbf{1} \mid w)$ can be treated as constants and hence absorbed, since $w'_{ij}, \forall j$ are constrained to be locally normalized. To obtain a solution to Eq. 5, for each edge weight $w_{ij}$ the sufficient statistics include only three terms, i.e., the evaluation value at $v_j$, the differentiation value at $v_i$ and the previous edge weight $w^{(k-1)}_{ij}$, all of which can be obtained in two passes of the network for each input $x$. Thus the computational complexity to obtain a maximum of the concave surrogate is $O(|S|)$. Interestingly, Eq. 7 leads to the same update formula as the EM algorithm [12], despite the fact that CCCP and EM start from different perspectives. We show that all the limit points of the sequence $\{w^{(k)}\}_{k=1}^\infty$ are guaranteed to be stationary points of the DCP in (2).

Theorem 7. Let $\{w^{(k)}\}_{k=1}^\infty$ be any sequence generated using Eq. 7 from any positive initial point; then all the limiting points of $\{w^{(k)}\}_{k=1}^\infty$ are stationary points of the DCP in (2). In addition, $\lim_{k\to\infty} f(y^{(k)}) = f(y^{*})$, where $y^{*}$ is some stationary point of (2).

We summarize all four algorithms and highlight their connections and differences in Table 1. Although we mainly discuss the batch version of these algorithms, all four can be easily adapted to work in stochastic and/or parallel settings.
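The closed-form surrogate maximization of Eq. 7 amounts to a per-sum-node multiplicative update. A minimal sketch with hypothetical inputs (a real implementation would obtain `child_vals` and `node_deriv` from the two network passes and average the statistics over a batch of inputs):

```python
def cccp_update(w_prev, child_vals, node_deriv, f_root):
    """One CCCP/EM step (Eq. 7) for the weights of a single sum node v_i.

    w_prev     : current weights w_ij^(k-1) on edges (v_i, v_j)
    child_vals : f_{v_j}(x | w^(k-1)) for each child v_j
    node_deriv : df_S / df_{v_i} evaluated at (x, w^(k-1))
    f_root     : f_S(x | w^(k-1))
    The common factor node_deriv / f_root cancels under local
    normalization; it is kept here only to mirror Eq. 7.
    """
    unnorm = [w * fv * node_deriv / f_root for w, fv in zip(w_prev, child_vals)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

w_new = cccp_update([0.2, 0.8], [0.5, 0.25], 1.0, 0.3)
# The update is multiplicative: positive weights stay positive and come
# out locally normalized, so no projection step is ever needed.
```

This is why CCCP needs neither a learning rate nor a line search: each step is the exact maximizer of the concave surrogate.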
Table 1: Summary of PGD, EG, SMA and CCCP. "Var." denotes the optimization variables.

- PGD (Var. $w$, additive update): $w^{(k+1)} \leftarrow P_{\mathbb{R}_{++}^{\epsilon}}\big[w^{(k)} + \gamma(\nabla_w f_1(w^{(k)}) - \nabla_w f_2(w^{(k)}))\big]$
- EG (Var. $w$, multiplicative update): $w^{(k+1)} \leftarrow w^{(k)} \odot \exp\{\gamma(\nabla_w f_1(w^{(k)}) - \nabla_w f_2(w^{(k)}))\}$
- SMA (Var. $\log w$, multiplicative update): $w^{(k+1)} \leftarrow w^{(k)} \odot \exp\{\gamma\, w^{(k)} \odot (\nabla_w f_1(w^{(k)}) - \nabla_w f_2(w^{(k)}))\}$
- CCCP (Var. $\log w$, multiplicative update): $w^{(k+1)}_{ij} \propto w^{(k)}_{ij}\, \nabla_{v_i} f_S(w^{(k)})\, f_{v_j}(w^{(k)})$

4 Experiments

4.1 Experimental Setting

We conduct experiments on 20 benchmark data sets from various domains to compare and evaluate the convergence performance of the four algorithms: PGD, EG, SMA and CCCP (EM). These 20 data sets are widely used in [7, 15] to assess different SPNs for the task of density estimation. All the features in the 20 data sets are binary. All the SPNs used for comparing PGD, EG, SMA and CCCP are trained using LearnSPN [7]. We discard the weights returned by LearnSPN and use random weights as initial model parameters; the random weights are determined by the same random seed in all four algorithms. Detailed information about these 20 data sets and the SPNs used in the experiments is provided in the supplementary material.

4.2 Parameter Learning

We implement all four algorithms in C++. For each algorithm, we set the maximum number of iterations to 50. If the absolute difference in the training log-likelihood at two consecutive steps is less than 0.001, the algorithm is stopped. For PGD, EG and SMA, we combine each of them with backtracking line search and use a weight-shrinking coefficient of 0.8. The learning rates are initialized to 1.0 for all three methods. For PGD, we set the projection margin to a small positive constant. There is no learning rate and no backtracking line search in CCCP. We set a small smoothing parameter in CCCP to avoid numerical issues. We show in Fig. 2 the average log-likelihood scores on the 20 training data sets to evaluate the convergence speed and stability of PGD, EG, SMA and CCCP.
Clearly, CCCP wins by a large margin over PGD, EG and SMA, both in convergence speed and in solution quality. Furthermore, among the four algorithms, CCCP is the most stable one due to its guarantee that the log-likelihood (on training data) will not decrease after each iteration. As shown in Fig. 2, the training curves of CCCP are smoother than those of the other three methods in almost all cases. These 20 experiments also clearly show that CCCP often converges in a few iterations. On the other hand, PGD, EG and SMA are on par with each other since they are all first-order methods. SMA is more stable than PGD and EG and often achieves better solutions than both; on large data sets, SMA also converges faster than PGD and EG. Surprisingly, EG performs worse than PGD in some cases and is quite unstable despite the fact that it admits multiplicative updates. The hook-shaped curves of PGD on some data sets, e.g. Kosarak and KDD, are due to the projection operations.

Table 2: Average log-likelihoods on test data for CCCP, LearnSPN and ID-SPN on the 20 data sets (NLTCS, MSNBC, KDD 2k, Plants, Audio, Jester, Netflix, Accidents, Retail, Pumsb-star, DNA, Kosarak, MSWeb, Book, EachMovie, WebKB, Reuters, Newsgrp, BBC, Ad). Highest log-likelihoods are highlighted in bold; ↑ marks log-likelihoods statistically better than CCCP and ↓ statistically worse than CCCP, as measured by the Wilcoxon signed-rank test. [The numeric entries of the table were lost in extraction.]
Figure 2: Negative log-likelihood values versus number of iterations for PGD, EG, SMA and CCCP on the 20 data sets.

The computational complexity per update is $O(|S|)$ for all four algorithms. CCCP often takes less time than the other three algorithms because it takes fewer iterations to converge. We list detailed running-time statistics for all four algorithms on the 20 data sets in the supplementary material.

4.3 Fine Tuning

We combine CCCP as a fine-tuning procedure with the structure learning algorithm LearnSPN and compare it to the state-of-the-art structure learning algorithm ID-SPN [15]. More concretely, we keep the model parameters learned by LearnSPN and use them to initialize CCCP. We then update the model parameters globally using CCCP as a fine-tuning technique. This normally helps to obtain a better generative model, since the original parameters are learned greedily and locally during the structure learning algorithm. We use the validation-set log-likelihood score to avoid overfitting; the algorithm returns the set of parameters that achieves the best validation-set log-likelihood. Experimental results are reported in Table 2. As shown there, applying CCCP after LearnSPN always improves model performance. By optimizing model parameters on these 20 data sets, we boost LearnSPN to achieve better results than the state-of-the-art ID-SPN on 7 data sets, whereas the original LearnSPN outperforms ID-SPN on only 1 data set. Note that the sizes of the SPNs returned by LearnSPN are much smaller than those produced by ID-SPN. Hence, it is remarkable that by fine-tuning the parameters with CCCP we can achieve better performance despite the models being smaller. For a fair comparison, we also list the sizes of the SPNs returned by ID-SPN in the supplementary material.

5 Conclusion

We show that the network polynomial of an SPN is a posynomial function of the model parameters, and that parameter learning yields a signomial program. We propose two convex relaxations to solve the SP.
We analyze the convergence properties of CCCP for learning SPNs. Extensive experiments are conducted to evaluate the proposed approaches and current methods. We also recommend combining CCCP with structure learning algorithms to boost modeling accuracy.

Acknowledgments

HZ and GG gratefully acknowledge support from an ONR contract. HZ also thanks Ryan Tibshirani for the helpful discussion about CCCP.
References

[1] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8(1):67-127, 2007.
[2] H. Chan and A. Darwiche. On the robustness of most probable explanations. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, 2006.
[3] M. Chiang. Geometric Programming for Communication Systems. Now Publishers Inc, 2005.
[4] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280-305, 2003.
[5] A. Dennis and D. Ventura. Greedy structure search for sum-product networks. In International Joint Conference on Artificial Intelligence, volume 24, 2015.
[6] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, 2012.
[7] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[8] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6, 2005.
[9] P. Hartman et al. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3), 1959.
[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.
[11] G. R. Lanckriet and B. K. Sriperumbudur. On the convergence of the concave-convex procedure. In Advances in Neural Information Processing Systems, 2009.
[12] R. Peharz. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Graz University of Technology, 2015.
[13] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In AISTATS, 2015.
[14] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 2011.
[15] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In ICML, 2014.
[16] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. In UAI, 2003.
[17] C. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 1983.
[18] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems, 2002.
[19] W. I. Zangwill. Nonlinear Programming: A Unified Approach, volume 196. Prentice-Hall, Englewood Cliffs, NJ, 1969.
[20] H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In ICML, 2015.
[21] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In ICML, 2016.
A Proof of SPNs as Mixture of Trees

Theorem 1. If T is an induced SPN from a complete and decomposable SPN S, then T is a tree that is complete and decomposable.

Proof. Argue by contradiction that T is not a
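Theorem 1's mixture-of-trees view can be checked numerically on a toy case. The sketch below is our own illustrative code, not the paper's: the structure (a product root over two sum nodes), the weights, and the Bernoulli leaf parameters are all assumptions chosen for the example. It verifies that ordinary bottom-up SPN evaluation agrees with an explicit sum over induced trees, where each tree is a product of univariate distributions weighted by the product of its chosen sum-edge weights.

```python
import itertools

def bernoulli(p):
    """Univariate leaf distribution over a binary variable."""
    return lambda x: p if x == 1 else 1.0 - p

# Toy complete and decomposable SPN over binary X1, X2:
#   S(x1, x2) = S1(x1) * S2(x2), where each Si is a weighted sum of
#   two Bernoulli leaves over its own scope (weights sum to 1).
w1, leaves1 = [0.3, 0.7], [bernoulli(0.8), bernoulli(0.1)]  # scope {X1}
w2, leaves2 = [0.6, 0.4], [bernoulli(0.2), bernoulli(0.9)]  # scope {X2}

def spn_value(x1, x2):
    """Ordinary bottom-up evaluation of the SPN."""
    s1 = sum(w * f(x1) for w, f in zip(w1, leaves1))
    s2 = sum(w * f(x2) for w, f in zip(w2, leaves2))
    return s1 * s2

def mixture_of_trees_value(x1, x2):
    """An induced tree keeps every child of a product node but exactly
    one child of each sum node. Its value is the product of the chosen
    sum-edge weights and leaf values; here there are 2 x 2 = 4 trees."""
    total = 0.0
    for (wa, fa), (wb, fb) in itertools.product(zip(w1, leaves1),
                                                zip(w2, leaves2)):
        total += (wa * wb) * fa(x1) * fb(x2)  # tree weight * leaf product
    return total

# The two evaluations agree on every complete assignment, and the SPN
# defines a normalized distribution over the four joint states.
for a, b in itertools.product([0, 1], repeat=2):
    assert abs(spn_value(a, b) - mixture_of_trees_value(a, b)) < 1e-12
```

Even in this small example the number of induced trees (4) is multiplicative in the number of sum-node children along independent scopes, which is why the compact SPN encodes a mixture with exponentially many components.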
More information