Mean Field Variational Approximation for Continuous-Time Bayesian Networks


Ido Cohn, Tal El-Hay, Nir Friedman
School of Computer Science, The Hebrew University

Raz Kupferman
Institute of Mathematics, The Hebrew University

Abstract

Continuous-time Bayesian networks is a natural structured representation language for multi-component stochastic processes that evolve continuously over time. Despite the compact representation, inference in such models is intractable even in relatively simple structured networks. Here we introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. We provide the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equation (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem.

1 Introduction

Many real-life processes can be naturally thought of as evolving continuously in time. Examples cover a diverse range, including server availability, changes in socioeconomic status, and genetic sequence evolution. To realistically model such processes, we need to reason about systems that are composed of multiple components (e.g., many servers in a server farm, multiple residues in a protein sequence) and evolve in continuous time. Continuous-time Bayesian networks (CTBNs) provide a representation language for such processes, which allows us to naturally exploit sparse patterns of interactions to compactly represent the dynamics of such processes [9].

Inference in multi-component temporal models is a notoriously hard problem [1]. As in discrete-time processes, inference is exponential in the number of components, even in a CTBN with sparse interactions [9]. Thus, we have to resort to approximate inference methods. The recent literature has adapted several strategies from discrete graphical models to CTBNs. These include sampling-based approaches: Fan and Shelton [5] introduced a likelihood-weighted sampling scheme, and more recently we [4] introduced a Gibbs-sampling procedure. Such sampling-based approaches yield more accurate answers with the investment of additional computation. However, it is hard to bound the required time in advance, tune the stopping criteria, or estimate the error of the approximation. An alternative class of approximations is based on variational principles. Recently, Nodelman et al. [11] introduced an Expectation Propagation approach, which can be roughly described as a local message-passing scheme, where each message describes the dynamics of a single component over an interval. This message-passing procedure can automatically refine the number of intervals according to the complexity of the underlying system [14]. Nonetheless, it does suffer from several caveats. On the formal level, the approximation has no convergence guarantees. Second, upon convergence, the computed marginals do not necessarily form a globally consistent distribution. Third, it is restricted to approximations in the form of piecewise-homogeneous messages on each interval, so the refinement of the number of intervals depends on the fit of such homogeneous approximations to the target process.
Finally, the approximation of Nodelman et al. does not provide a provable approximation of the likelihood of the observations, a crucial component in learning procedures.

Here, we develop an alternative variational approximation, which provides a different trade-off. We use the strategy of structured variational approximations in graphical models [8], and specifically the variational approach of Opper and Sanguinetti [12] for approximate inference in Markov jump processes, a related class of models (see below). The resulting procedure approximates the posterior distribution of the CTBN as a product of independent components, each of which is an inhomogeneous continuous-time Markov process. As we show, by using a natural representation of these processes, we derive a variational procedure that is efficient and provides a good approximation both for the likelihood of the evidence and for the expected sufficient statistics. In particular, the approximation provides a lower bound on the likelihood, and thus is attractive for use in learning.

2 Continuous-Time Bayesian Networks

Consider a $D$-component Markov process $X^{(t)} = (X_1^{(t)}, X_2^{(t)}, \ldots, X_D^{(t)})$ with state space $S = S_1 \times S_2 \times \cdots \times S_D$. A notational convention: vectors are denoted by boldface symbols, e.g., $x$, and matrices are denoted by blackboard-style characters, e.g., $\mathbb{Q}$. The states in $S$ are denoted by vectors of indexes, $x = (x_1, \ldots, x_D)$. We use indexes $1 \le i, j \le D$ for enumerating components, and $X^{(t)}$ and $X_i^{(t)}$ to denote the random variables describing the state of the process and of its $i$'th component at time $t$.

The dynamics of a time-homogeneous continuous-time Markov process are fully determined by the Markov transition function,
$$p_{x,y}(t) = \Pr(X^{(t+s)} = y \mid X^{(s)} = x),$$
where time-homogeneity implies that the right-hand side does not depend on $s$. These dynamics are fully captured by a matrix $\mathbb{Q}$, the rate matrix, with non-negative off-diagonal entries $q_{x,y}$ and diagonal entries $q_{x,x} = -\sum_{y \ne x} q_{x,y}$. This rate matrix defines the transition probabilities
$$p_{x,y}(h) = \delta_{x,y} + q_{x,y}\,h + o(h),$$
where $\delta_{x,y}$ is a multivariate Kronecker delta and $o(\cdot)$ means decay to zero faster than its argument. Using the rate matrix $\mathbb{Q}$, we can express the Markov transition function as $p_{x,y}(t) = [\exp(t\mathbb{Q})]_{x,y}$, where $\exp(t\mathbb{Q})$ is a matrix exponential [2, 7].

A continuous-time Bayesian network is defined by assigning each component $i$ a set of components $\mathrm{Pa}_i \subseteq \{1,\ldots,D\} \setminus \{i\}$, which are its parents in the network [9]. With each component $i$ we then associate a set of conditional rate matrices $\mathbb{Q}^{i|\mathrm{Pa}_i}_{\cdot|u_i}$, one for each state $u_i$ of $\mathrm{Pa}_i$. The off-diagonal entries $q^{i|\mathrm{Pa}_i}_{x_i,y_i|u_i}$ represent the rate at which $X_i$ transitions from state $x_i$ to state $y_i$ given that its parents are in state $u_i$. The dynamics of $X^{(t)}$ are defined by a rate matrix $\mathbb{Q}$ with entries $q_{x,y}$, which amalgamates the conditional rate matrices as follows:
$$q_{x,y} = \begin{cases} q^{i|\mathrm{Pa}_i}_{x_i,y_i|u_i} & \delta(x,y) = \{i\} \\ \sum_i q^{i|\mathrm{Pa}_i}_{x_i,x_i|u_i} & x = y \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where $\delta(x,y) = \{i \mid x_i \ne y_i\}$. This definition implies that changes occur one component at a time.

Given a continuous-time Bayesian network, we would like to evaluate the likelihood of evidence, to compute the probability of various events given the evidence (e.g., that the state of the system at time $t$ is $x$), and to compute conditional expectations (e.g., the expected amount of time $X_i$ was in state $x_i$). Direct computations of these quantities involve matrix exponentials of the rate matrix $\mathbb{Q}$, whose size is exponential in the number of components, making this approach infeasible beyond a modest number of components. We therefore have to resort to approximations.
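To make Eq. (1) concrete, here is a small sketch (our illustration, not code from the paper) that amalgamates the conditional rate matrices of a toy two-component network X1 -> X2 into the joint rate matrix; all numeric rates are invented:

```python
import numpy as np
from itertools import product

# Q1[x1, y1]: rates of X1, which has no parents.
Q1 = np.array([[-1.0, 1.0],
               [ 2.0, -2.0]])

# Q2[u1][x2, y2]: conditional rates of X2 given its parent X1 = u1.
Q2 = np.array([[[-0.5, 0.5],
                [ 3.0, -3.0]],
               [[-4.0, 4.0],
                [ 0.1, -0.1]]])

states = list(product(range(2), range(2)))   # joint states x = (x1, x2)
Q = np.zeros((4, 4))
for a, (x1, x2) in enumerate(states):
    for b, (y1, y2) in enumerate(states):
        diff = [i for i in range(2) if (x1, x2)[i] != (y1, y2)[i]]
        if diff == [0]:            # only X1 changes
            Q[a, b] = Q1[x1, y1]
        elif diff == [1]:          # only X2 changes, given parent x1
            Q[a, b] = Q2[x1][x2, y2]
        # simultaneous changes of both components keep rate 0

# Diagonal: q_{x,x} amalgamates the conditional diagonals, Eq. (1) middle case.
for a, (x1, x2) in enumerate(states):
    Q[a, a] = Q1[x1, x1] + Q2[x1][x2, x2]

assert np.allclose(Q.sum(axis=1), 0.0)   # rows of a rate matrix sum to zero
```

Note how the joint matrix is $4 \times 4$ here but grows exponentially with the number of components, which is exactly why direct computation with $\exp(t\mathbb{Q})$ becomes infeasible.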
3 Variational Principle for Continuous-Time Markov Processes

We start by defining a variational approximation principle in terms of a general continuous-time Markov process (that is, without assuming any network structure). For convenience we restrict our treatment to a time interval $[0, T]$ with end-point evidence $X^{(0)} = e_0$ and $X^{(T)} = e_T$. We discuss more general types of evidence below. Here we aim to define a lower bound on $\ln P_{\mathbb{Q}}(e_T \mid e_0)$ as well as to approximate the posterior probability $P_{\mathbb{Q}}(\cdot \mid e_0, e_T)$.

Marginal Density Representation  Variational approximations cast inference as an optimization problem of a functional which approximates the log probability of the evidence by introducing an auxiliary set of variational parameters. Here we define the optimization problem over a set of mean parameters [15], representing possible values of expected sufficient statistics. As discussed above, the prior distribution of the process can be characterized by a time-independent rate matrix $\mathbb{Q}$. It is easy to show that if the prior is a Markov process, then the posterior is also a Markov process, albeit not necessarily a homogeneous one. Such a process can be represented by a time-dependent rate matrix that describes the instantaneous transition rates. Here, rather than representing the target distribution by a time-dependent rate matrix, we consider a representation that is more natural for variational approximations. Let $\Pr$ be the distribution of a Markov process. We define a family of functions:
$$\mu_x(t) = \Pr(X^{(t)} = x),$$
$$\gamma_{x,y}(t) = \lim_{h \downarrow 0} \frac{\Pr(X^{(t)} = x, X^{(t+h)} = y)}{h}, \quad y \ne x, \qquad (2)$$
$$\gamma_{x,x}(t) = -\sum_{y \ne x} \gamma_{x,y}(t).$$
The function $\mu_x(t)$ is the probability that $X^{(t)} = x$. The function $\gamma_{x,y}(t)$ is the probability density that $X$ transitions from state $x$ to $y$ at time $t$. Note that this parameter is not a transition rate, but rather a product of a point-wise probability with the point-wise transition rate of the approximating distribution; that is, $\gamma_{x,y}(t)/\mu_x(t)$ is the $(x,y)$ entry of the time-dependent rate matrix. Hence, unlike the (inhomogeneous) rate matrix at time $t$, $\gamma_{x,y}(t)$ takes into account the probability of being in state $x$, and not only the rate of transitions. This definition implies that
$$\Pr(X^{(t)} = x, X^{(t+h)} = y) = \mu_x(t)\,\delta_{x,y} + \gamma_{x,y}(t)\,h + o(h).$$

We aim to use the family of functions $\mu$ and $\gamma$ as a representation of a Markov process. To do so, we need to characterize the set of constraints that these functions should satisfy.

Definition 3.1: A family $\eta = \{\mu_x(t), \gamma_{x,y}(t) : 0 \le t \le T\}$ of continuous functions is a Markov-consistent density set if the following constraints are fulfilled:
$$\mu_x(t) \ge 0, \qquad \sum_x \mu_x(0) = 1,$$
$$\gamma_{x,y}(t) \ge 0 \ \ \forall y \ne x, \qquad \gamma_{x,x}(t) = -\sum_{y \ne x} \gamma_{x,y}(t),$$
$$\frac{d}{dt}\mu_x(t) = \sum_y \gamma_{y,x}(t).$$
Let $\mathcal{M}$ be the set of all Markov-consistent density sets. Using standard arguments we can show that there exists a correspondence between (generally inhomogeneous) Markov processes and density sets $\eta$. Specifically:

Lemma 3.2: Let $\eta = \{\mu_x(t), \gamma_{x,y}(t)\}$. If $\eta \in \mathcal{M}$, then there exists a continuous-time Markov process $P_\eta$ for which $\mu_x$ and $\gamma_{x,y}$ satisfy (2).
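For a homogeneous process, the density set of the prior can be computed in closed form, which gives a quick numerical check (ours, for illustration) of Eq. (2) and of the last constraint in Definition 3.1:

```python
import numpy as np
from scipy.linalg import expm

# For a homogeneous chain with rate matrix Q and initial distribution mu0:
# mu(t) = mu0 expm(tQ), and gamma_{x,y}(t) = mu_x(t) q_{x,y}. These satisfy
# the master-equation constraint d/dt mu_x(t) = sum_y gamma_{y,x}(t).
Q = np.array([[-1.0, 0.7, 0.3],
              [ 0.2, -0.5, 0.3],
              [ 0.4, 0.6, -1.0]])
mu0 = np.array([1.0, 0.0, 0.0])

def mu(t):
    return mu0 @ expm(t * Q)

def gamma(t):
    # gamma[y, x] is read as gamma_{y,x}(t): row = source, column = target
    return mu(t)[:, None] * Q

t, h = 0.8, 1e-6
dmu_dt = (mu(t + h) - mu(t)) / h       # numerical time derivative
master = gamma(t).sum(axis=0)          # sum over sources, diagonal included
assert np.allclose(dmu_dt, master, atol=1e-4)
```

The diagonal entries $\gamma_{x,x}(t) = \mu_x(t) q_{x,x}$ are exactly $-\sum_{y \ne x}\gamma_{x,y}(t)$, matching the convention in Definition 3.1.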
The processes we are interested in, however, have additional structure, as they correspond to the posterior distribution of a time-homogeneous process with end-point evidence. This additional structure implies that we should only consider a subset of $\mathcal{M}$:

Lemma 3.3: Let $\mathbb{Q}$ be a rate matrix, and $e_0, e_T$ be states of $X$. Then the representation $\eta$ corresponding to the posterior distribution $P_{\mathbb{Q}}(\cdot \mid e_0, e_T)$ is in the set $\mathcal{M}_e \subseteq \mathcal{M}$ that contains Markov-consistent density sets satisfying $\mu_x(0) = \delta_{x,e_0}$ and $\mu_x(T) = \delta_{x,e_T}$.

Thus, from now on we can restrict our attention to density sets from $\mathcal{M}_e$. The constraints on $\mu_x(0)$ and $\mu_x(T)$ also have consequences for $\gamma_{x,y}$ at these points.

Lemma 3.4: If $\eta \in \mathcal{M}_e$ then $\gamma_{x,y}(0) = 0$ for all $x \ne e_0$, and $\gamma_{x,y}(T) = 0$ for all $y \ne e_T$.

Variational Principle  We can now state the variational principle for continuous processes, which closely tracks similar principles for discrete processes. We define a free energy functional, $\mathcal{F}(\eta; \mathbb{Q}) = E(\eta; \mathbb{Q}) + H(\eta)$, which, as we will see, measures the quality of $\eta$ as an approximation of $P_{\mathbb{Q}}(\cdot \mid e)$. (For succinctness, we will assume that the evidence $e$ is clear from the context.) The two terms in the continuous functional correspond to an entropy,
$$H(\eta) = \int_0^T \sum_x \sum_{y \ne x} \gamma_{x,y}(t)\big[1 + \ln \mu_x(t) - \ln \gamma_{x,y}(t)\big]\, dt,$$
and an energy,
$$E(\eta; \mathbb{Q}) = \int_0^T \Big[\sum_x \mu_x(t)\, q_{x,x} + \sum_x \sum_{y \ne x} \gamma_{x,y}(t) \ln q_{x,y}\Big]\, dt.$$

Theorem 3.5: Let $\mathbb{Q}$ be a rate matrix, $e = (e_0, e_T)$ be states of $X$, and $\eta \in \mathcal{M}_e$. Then
$$\mathcal{F}(\eta; \mathbb{Q}) = \ln P_{\mathbb{Q}}(e_T \mid e_0) - \mathrm{D}\big(P_\eta(\cdot) \,\|\, P_{\mathbb{Q}}(\cdot \mid e)\big),$$
where $\mathrm{D}(P_\eta(\cdot) \| P_{\mathbb{Q}}(\cdot \mid e))$ is the KL divergence between the two processes. We conclude that $\mathcal{F}(\eta; \mathbb{Q})$ is a lower bound on the log-likelihood of the evidence, and that the closer the approximation is to the target posterior, the tighter the bound.

Proof Outline  The basic idea is to consider discrete approximations of the functional. Let $K$ be an integer. We define the $K$-sieve $X^K$ to be the set of random variables $X^{(t_0)}, X^{(t_1)}, \ldots, X^{(t_K)}$, where $t_k = k\frac{T}{K}$. We can use the variational principle [8] on the marginal distributions $P_{\mathbb{Q}}(X^K \mid e)$ and $P_\eta(X^K)$. More precisely, define
$$\mathcal{F}_K(\eta; \mathbb{Q}) = \mathbb{E}_{P_\eta}\left[\ln \frac{P_{\mathbb{Q}}(X^K, e_T \mid e_0)}{P_\eta(X^K)}\right],$$
which can, using simple arithmetic manipulations, be recast as
$$\mathcal{F}_K(\eta; \mathbb{Q}) = \ln P_{\mathbb{Q}}(e_T \mid e_0) - \mathrm{D}\big(P_\eta(X^K) \,\|\, P_{\mathbb{Q}}(X^K \mid e)\big).$$
We get the desired result by letting $K \to \infty$: by definition, $\lim_{K\to\infty} \mathrm{D}(P_\eta(X^K) \| P_{\mathbb{Q}}(X^K \mid e))$ is $\mathrm{D}(P_\eta(\cdot) \| P_{\mathbb{Q}}(\cdot \mid e))$. The crux of the proof is in proving the following lemma.

Lemma 3.6: $\mathcal{F}(\eta; \mathbb{Q}) = \lim_{K\to\infty} \mathcal{F}_K(\eta; \mathbb{Q})$.

Proof: Since both $P_{\mathbb{Q}}$ and $P_\eta$ are Markov processes,
$$\mathcal{F}_K(\eta; \mathbb{Q}) = \sum_{k=0}^{K-1} \mathbb{E}_{P_\eta}\big[\ln P_{\mathbb{Q}}(X^{(t_{k+1})} \mid X^{(t_k)})\big] - \sum_{k=0}^{K-1} \mathbb{E}_{P_\eta}\big[\ln P_\eta(X^{(t_k)}, X^{(t_{k+1})})\big] + \sum_{k=1}^{K-1} \mathbb{E}_{P_\eta}\big[\ln P_\eta(X^{(t_k)})\big].$$
We now express these terms as functions of $\mu_x(t)$, $\gamma_{x,y}(t)$ and $q_{x,y}$. By definition, $P_\eta(X^{(t_k)} = x) = \mu_x(t_k)$. Each of the expectations either depends on this term, or on the joint distribution $P_\eta(X^{(t_k)}, X^{(t_{k+1})})$. Using the continuity of $\gamma_{x,y}(t)$ we write
$$P_\eta(X^{(t_k)} = x, X^{(t_{k+1})} = y) = \delta_{x,y}\,\mu_x(t_k) + \Delta_K\, \gamma_{x,y}(t_k) + o(\Delta_K),$$
where $\Delta_K = T/K$. Similarly, we can also write
$$P_{\mathbb{Q}}(X^{(t_{k+1})} = y \mid X^{(t_k)} = x) = \delta_{x,y} + \Delta_K\, q_{x,y} + o(\Delta_K).$$
Finally, using properties of logarithms we have that $\ln(1 + \Delta_K z + o(\Delta_K)) = \Delta_K z + o(\Delta_K)$. Using these relations, we can rewrite, after tedious yet straightforward manipulations,
$$\mathcal{F}_K(\eta; \mathbb{Q}) = E_K(\eta; \mathbb{Q}) + H_K(\eta), \qquad E_K(\eta; \mathbb{Q}) = \Delta_K \sum_{k=0}^{K-1} e_K(t_k), \qquad H_K(\eta) = \Delta_K \sum_{k=0}^{K-1} h_K(t_k),$$
where
$$e_K(t) = \sum_x \Big[\mu_x(t)\, q_{x,x} + \sum_{y \ne x} \gamma_{x,y}(t) \ln q_{x,y}\Big] + o(\Delta_K),$$
$$h_K(t) = \sum_x \sum_{y \ne x} \gamma_{x,y}(t)\big[1 + \ln \mu_x(t) - \ln \gamma_{x,y}(t)\big] + o(\Delta_K).$$
Letting $K \to \infty$ we have that $\Delta_K \sum_k [f(t_k) + o(\Delta_K)] \to \int_0^T f(t)\, dt$, hence $E_K(\eta; \mathbb{Q})$ and $H_K(\eta)$ converge to $E(\eta; \mathbb{Q})$ and $H(\eta)$, respectively.
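Theorem 3.5 can be checked numerically in the single-process case, where the exact posterior density set is available through matrix exponentials. The following sketch (our illustration, not the paper's code) plugs the exact posterior into $\mathcal{F} = E + H$ and recovers the log-likelihood, since the KL term vanishes:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

Q = np.array([[-1.0, 1.0],
              [ 2.0, -2.0]])          # arbitrary two-state rates
T, e0, eT = 1.0, 0, 1
P = lambda t: expm(t * Q)              # transition function p_{x,y}(t)
Z = P(T)[e0, eT]                       # P_Q(e_T | e_0)

def mu(t):
    # posterior marginal: mu_x(t) = p_{e0,x}(t) p_{x,eT}(T-t) / Z
    return np.array([P(t)[e0, x] * P(T - t)[x, eT] for x in range(2)]) / Z

def gamma(t):
    # posterior transition density: gamma_{x,y}(t) = p_{e0,x}(t) q_{x,y} p_{y,eT}(T-t) / Z
    g = np.array([[P(t)[e0, x] * Q[x, y] * P(T - t)[y, eT] / Z
                   for y in range(2)] for x in range(2)])
    np.fill_diagonal(g, 0.0)
    return g

def integrand(t):
    m, g = mu(t), gamma(t)
    val = m @ np.diag(Q)                       # sum_x mu_x q_{x,x}
    for x in range(2):
        for y in range(2):
            if y != x and g[x, y] > 0:
                val += g[x, y] * np.log(Q[x, y])                       # energy
                val += g[x, y] * (1 + np.log(m[x]) - np.log(g[x, y]))  # entropy
    return val

F, _ = quad(integrand, 0.0, T, limit=200)
print(F, np.log(Z))   # the two values agree up to quadrature error
```

The integrand has a mild (integrable, logarithmic) singularity at the evidence end points, which the adaptive quadrature absorbs; this is the same end-point behavior discussed in Section 5.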
4 Factored Approximation

The variational principle we discussed is based on a representation that is as complex as the original process: the number of functions $\gamma_{x,y}(t)$ we consider is equal to the size of the original rate matrix $\mathbb{Q}$. To get a tractable inference procedure we make additional simplifying assumptions on the approximating distribution.

Given a $D$-component process we consider approximations that factor into products of independent processes. More precisely, we define $\mathcal{M}^i_e$ to be the continuous Markov-consistent density sets over the component $X_i$ that are consistent with the evidence on $X_i$ at times $0$ and $T$. Given a collection of density sets $\eta^1, \ldots, \eta^D$ for the different components, the product density set $\eta = \eta^1 \times \cdots \times \eta^D$ is defined as
$$\mu_x(t) = \prod_i \mu^i_{x_i}(t),$$
$$\gamma_{x,y}(t) = \begin{cases} \gamma^i_{x_i,y_i}(t)\,\mu^{\setminus i}_x(t) & \delta(x,y) = \{i\} \\ \sum_i \gamma^i_{x_i,x_i}(t)\,\mu^{\setminus i}_x(t) & x = y \\ 0 & \text{otherwise,} \end{cases}$$
where $\mu^{\setminus i}_x(t) = \prod_{j \ne i} \mu^j_{x_j}(t)$ is the joint distribution at time $t$ of all the components other than the $i$'th. (It is not hard to see that if $\eta^i \in \mathcal{M}^i_e$ for all $i$, then $\eta \in \mathcal{M}_e$.) We define the set $\mathcal{M}^F_e$ to contain all factored density sets. From now on we assume that $\eta = \eta^1 \times \cdots \times \eta^D \in \mathcal{M}^F_e$.

Assuming that $\mathbb{Q}$ is defined by a CTBN, and that $\eta$ is a factored density set, we can rewrite
$$E(\eta; \mathbb{Q}) = \sum_i \int_0^T \sum_{x_i} \mu^i_{x_i}(t)\, \mathbb{E}_{\mu^{\setminus i}(t)}\big[q_{x_i,x_i|U_i}\big]\, dt + \sum_i \int_0^T \sum_{x_i} \sum_{y_i \ne x_i} \gamma^i_{x_i,y_i}(t)\, \mathbb{E}_{\mu^{\setminus i}(t)}\big[\ln q_{x_i,y_i|U_i}\big]\, dt,$$
and
$$H(\eta) = \sum_i H(\eta^i).$$
This decomposition involves only local terms that either include the $i$'th component, or include the $i$'th component and its parents in the CTBN defining $\mathbb{Q}$. Note that terms such as $\mathbb{E}_{\mu^{\setminus i}(t)}[q_{x_i,x_i|U_i}]$ involve only $\mu^j(t)$ for $j \in \mathrm{Pa}_i$. To make the factored nature of the approximation explicit in the notation, we write henceforth $\mathcal{F}(\eta; \mathbb{Q}) = \mathcal{F}^F(\eta^1, \ldots, \eta^D; \mathbb{Q})$.

Fixed Point Characterization  We can now pose the optimization problem we wish to solve: Fixing $i$, and given $\eta^1, \ldots, \eta^{i-1}, \eta^{i+1}, \ldots, \eta^D$ in $\mathcal{M}^1_e, \ldots, \mathcal{M}^{i-1}_e, \mathcal{M}^{i+1}_e, \ldots, \mathcal{M}^D_e$, respectively, find
$$\arg\max_{\eta^i \in \mathcal{M}^i_e} \mathcal{F}^F(\eta^1, \ldots, \eta^D; \mathbb{Q}).$$
If for all $i$ we have an $\eta^i \in \mathcal{M}^i_e$ which solves this optimization problem with respect to each component, then we have a (local) stationary point of the energy functional within $\mathcal{M}^F_e$.

To solve this optimization problem, we define a Lagrangian, which includes the constraints in the form of Def. 3.1. The Lagrangian is a functional of the functions $\mu^i_{x_i}(t)$ and $\gamma^i_{x_i,y_i}(t)$ and of Lagrange multipliers (which are functions of $t$ as well). The stationary point of the functional satisfies the Euler-Lagrange equations, namely the functional derivatives of $\mathcal{L}$ vanish. Writing these equations in explicit form we get a fixed-point characterization of the solution in terms of the following set of ODEs:
$$\frac{d}{dt}\mu^i_{x_i}(t) = \sum_{y_i \ne x_i} \big(\gamma^i_{y_i,x_i}(t) - \gamma^i_{x_i,y_i}(t)\big),$$
$$\frac{d}{dt}\rho^i_{x_i}(t) = -\rho^i_{x_i}(t)\big(\bar{q}^{\,i}_{x_i,x_i}(t) + \psi^i_{x_i}(t)\big) - \sum_{y_i \ne x_i} \rho^i_{y_i}(t)\, \tilde{q}^{\,i}_{x_i,y_i}(t), \qquad (3)$$
where the $\rho^i$ are the exponents of the Lagrange multipliers.

In addition we have the following algebraic constraint:
$$\rho^i_{x_i}(t)\,\gamma^i_{x_i,y_i}(t) = \mu^i_{x_i}(t)\,\tilde{q}^{\,i}_{x_i,y_i}(t)\,\rho^i_{y_i}(t), \qquad x_i \ne y_i. \qquad (4)$$
In these equations we use the following shorthand notations for the averaged rates,
$$\bar{q}^{\,i}_{x_i,y_i}(t) = \mathbb{E}_{\mu^{\setminus i}(t)}\big[q^{i|\mathrm{Pa}_i}_{x_i,y_i|U_i}\big], \qquad \bar{q}^{\,i}_{x_i,y_i|x_j}(t) = \mathbb{E}_{\mu^{\setminus i}(t)}\big[q^{i|\mathrm{Pa}_i}_{x_i,y_i|U_i} \,\big|\, x_j\big].$$
Similarly, we have the following shorthand notations for the geometrically-averaged rates,
$$\tilde{q}^{\,i}_{x_i,y_i}(t) = \exp\Big\{\mathbb{E}_{\mu^{\setminus i}(t)}\big[\ln q^{i|\mathrm{Pa}_i}_{x_i,y_i|U_i}\big]\Big\}, \qquad \tilde{q}^{\,i}_{x_i,y_i|x_j}(t) = \exp\Big\{\mathbb{E}_{\mu^{\setminus i}(t)}\big[\ln q^{i|\mathrm{Pa}_i}_{x_i,y_i|U_i} \,\big|\, x_j\big]\Big\}.$$
The last auxiliary term is
$$\psi^i_{x_i}(t) = \sum_{j \in \mathrm{Children}_i} \sum_{x_j} \mu^j_{x_j}(t)\, \bar{q}^{\,j}_{x_j,x_j|x_i}(t) + \sum_{j \in \mathrm{Children}_i} \sum_{x_j} \sum_{y_j \ne x_j} \gamma^j_{x_j,y_j}(t)\, \ln \tilde{q}^{\,j}_{x_j,y_j|x_i}(t).$$

The two differential equations (3) for $\mu^i_{x_i}(t)$ and $\rho^i_{x_i}(t)$ describe, respectively, the progression of $\mu^i_{x_i}$ forward and the progression of $\rho^i_{x_i}$ backward. To uniquely solve these equations we need to set the boundary conditions. The boundary condition for $\mu^i_{x_i}$ is defined explicitly in $\mathcal{M}^F_e$ as
$$\mu^i_{x_i}(0) = \delta_{x_i, e_{i,0}}. \qquad (5)$$
The boundary condition at $T$ is slightly more involved. The constraints in $\mathcal{M}^F_e$ imply that $\mu^i_{x_i}(T) = \delta_{x_i, e_{i,T}}$. As stated by Lemma 3.4, we have that $\gamma^i_{e_{i,T}, x_i}(T) = 0$ when $x_i \ne e_{i,T}$. Plugging these values into (4), and assuming that $\mathbb{Q}$ is irreducible, we get that $\rho^i_{x_i}(T) = 0$ for all $x_i \ne e_{i,T}$. In addition, we notice that $\rho^i_{e_{i,T}}(T) \ne 0$, for otherwise the whole system of equations for $\rho$ would collapse to 0. Finally, notice that the solution of (3) for $\mu^i$ and $\gamma^i$ is insensitive to multiplication of $\rho^i$ by a constant. Thus, we can arbitrarily set $\rho^i_{e_{i,T}}(T) = 1$ and get the boundary condition
$$\rho^i_{x_i}(T) = \delta_{x_i, e_{i,T}}. \qquad (6)$$

Theorem 4.1: $\eta^i \in \mathcal{M}^i_e$ is a stationary point (e.g., a local maximum) of $\mathcal{F}^F(\eta^1, \ldots, \eta^D; \mathbb{Q})$ subject to the constraints of Def. 3.1 if and only if it satisfies (3)-(6).

It is straightforward to extend this result to show that at a maximum with respect to all the component densities, this fixed-point characterization must hold for all components simultaneously.
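The averaged and geometrically-averaged rates are simple expectations under the mean-field marginals of the parents. A minimal sketch (ours, for a component with a single binary parent; the numeric values are invented), which also shows the expectations appearing in the factored energy above:

```python
import numpy as np

def averaged_rates(cond_q, parent_marginal):
    """cond_q[u] is the rate matrix of X_i given parent state u;
    parent_marginal[u] = mu^j_u(t). Returns (qbar, qtilde)."""
    # arithmetic average: qbar = sum_u mu_u * q(.|u)
    qbar = sum(p * q for p, q in zip(parent_marginal, cond_q))
    # geometric average applies to the (positive) off-diagonal entries;
    # the diagonal is replaced by 1 inside the log and zeroed afterwards
    logs = sum(p * np.log(np.where(q > 0, q, 1.0))
               for p, q in zip(parent_marginal, cond_q))
    qtilde = np.exp(logs)
    np.fill_diagonal(qtilde, 0.0)   # diagonal terms enter via qbar instead
    return qbar, qtilde

cond_q = [np.array([[-0.5, 0.5], [3.0, -3.0]]),
          np.array([[-4.0, 4.0], [0.1, -0.1]])]
qbar, qtilde = averaged_rates(cond_q, parent_marginal=[0.7, 0.3])
```

Because the parents' marginals $\mu^j(t)$ change over time, both averages are genuinely time-dependent, which is what makes the approximating processes inhomogeneous.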
Example 4.2: Consider the case of a single component, for which our procedure should be exact, as no simplifying assumptions are made on the density set. In this case, the averaged rates $\bar{q}^{\,i}$ and the geometrically-averaged rates $\tilde{q}^{\,i}$ both reduce to the unaveraged rates $q$, and $\psi \equiv 0$. Thus, the system of equations to be solved is
$$\frac{d}{dt}\mu_x(t) = \sum_{y \ne x}\big(\gamma_{y,x}(t) - \gamma_{x,y}(t)\big), \qquad \frac{d}{dt}\rho_x(t) = -\sum_y q_{x,y}\,\rho_y(t),$$
along with the algebraic equation
$$\rho_x(t)\,\gamma_{x,y}(t) = q_{x,y}\,\mu_x(t)\,\rho_y(t), \qquad y \ne x.$$
In this case, it is straightforward to show that the backward propagation rule for $\rho_x$ implies that $\rho_x(t) = \Pr(e_T \mid X^{(t)} = x)$. This system of ODEs is similar to forward-backward propagation, except that unlike classical forward propagation (which would use a function such as $\alpha_x(t) = \Pr(X^{(t)} = x \mid e_0)$), here the forward propagation already takes the backward messages into account, to directly compute the posterior. Given this interpretation, it is clear that integrating $\rho_x(t)$ from $T$ to $0$, followed by integrating $\mu_x(t)$ from $0$ to $T$, computes the exact posterior of the process.

This interpretation of $\rho_x(t)$ also allows us to understand the role of $\gamma_{x,y}(t)$. Recall that $\gamma_{x,y}(t)/\mu_x(t)$ is the instantaneous rate of transition from $x$ to $y$ at time $t$. Thus,
$$\frac{\gamma_{x,y}(t)}{\mu_x(t)} = q_{x,y}\,\frac{\rho_y(t)}{\rho_x(t)}.$$
That is, the instantaneous rate combines the original rate with the relative likelihood of the evidence at $T$ given $y$ and $x$. If $y$ is much more likely to lead to the final state, then the rates are biased toward $y$. Conversely, if $y$ is unlikely to lead to the evidence, the rate of transitions into it is lower. This observation also explains why the forward propagation of $\mu_x$ will reach the observed $\mu_x(T)$ even though we did not impose it explicitly.
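The following runnable sketch (ours, with arbitrary rates and evidence) carries out the backward-forward pass of Example 4.2 with an adaptive Runge-Kutta solver, as the implementation strategy described later suggests:

```python
import numpy as np
from scipy.integrate import solve_ivp

Q = np.array([[-1.0, 1.0],
              [ 2.0, -2.0]])   # two-state rate matrix, illustrative values
T, e0, eT = 1.0, 0, 1

# Backward pass: d/dt rho(t) = -Q rho(t), rho(T) = delta_{eT}.
# The substitution s = T - t turns this into dr/ds = Q r, integrated forward.
back = solve_ivp(lambda s, r: Q @ r, (0.0, T), np.eye(2)[eT],
                 dense_output=True, rtol=1e-8, atol=1e-10)
rho = lambda t: back.sol(T - t)   # rho_x(t) = Pr(e_T | X^(t) = x)

def dmu(t, mu_t):
    # Forward pass: d/dt mu_x = sum_{y != x}(gamma_{y,x} - gamma_{x,y}),
    # with gamma_{x,y}(t) = q_{x,y} mu_x(t) rho_y(t) / rho_x(t).
    r = rho(t)
    g = Q * mu_t[:, None] * r[None, :] / r[:, None]
    np.fill_diagonal(g, 0.0)
    return g.sum(axis=0) - g.sum(axis=1)

# Stop just short of T: the instantaneous rates diverge at the end point
# (see the discussion of rho above), although mu itself stays bounded.
fwd = solve_ivp(dmu, (0.0, T - 1e-6), np.eye(2)[e0], dense_output=True,
                rtol=1e-8, atol=1e-10)
print(fwd.sol(T - 1e-6))   # posterior marginal; approximately delta_{eT}
```

In this single-component setting the result is exact; in the multi-component case the same per-component pass is run with the averaged rates $\bar{q}^{\,i}$, $\tilde{q}^{\,i}$ in place of $q$.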

Example 4.3: We define an Ising chain to be a CTBN $X_1 - X_2 - \cdots - X_D$ such that each binary component prefers to be in the same state as its neighbors. These models are governed by two parameters: a coupling parameter $\beta$, which determines the strength of the coupling between two neighboring components, and a rate parameter $\tau$, which determines the propensity of each component to change its state. More formally, we define the conditional rate matrices as
$$q^{i|\mathrm{Pa}_i}_{x_i,y_i|u_i} = \tau\Big(1 + e^{-2 y_i \beta \sum_{j \in \mathrm{Pa}_i} x_j}\Big)^{-1},$$
where $x_j \in \{-1, 1\}$.

As an example, we consider a two-component Ising chain with initial state $X^{(0)}_1 = -1$ and $X^{(0)}_2 = 1$, and a reversed state at the final time, $X^{(T)}_1 = 1$ and $X^{(T)}_2 = -1$. For a large value of $\beta$, this evidence is unlikely, as at both end points the components are in undesired configurations.

Figure 1: Numerical results for the two-component Ising chain described in Example 4.3, where the first component starts in state $-1$ and ends at time $T = 1$ in state $1$. The second component has the opposite behavior. (top) Two likely trajectories depicting the two modes of the model. (middle) Exact (solid) and approximate (dashed/dotted) marginals $\mu^i_1(t)$. (bottom) The log ratio $\log \rho^i_1(t)/\rho^i_0(t)$.

The exact posterior assigns higher probabilities to trajectories where one of the components switches relatively fast to match the other, and then, toward the end of the interval, they separate to match the evidence. Since the model is symmetric, these trajectories are either ones in which both components are most of the time in state $-1$, or ones where both are most of the time in state $1$ (Fig. 1, top). Due to symmetry, the marginal probability of each component is around 0.5 throughout most of the interval (Fig. 1, middle). The variational approximation cannot capture the dependency between the two components, and thus converges to one of two local maxima, corresponding to the two potential subsets of trajectories. Examining the values of $\rho^i$, we see that close to the end of the interval they bias the instantaneous rates significantly (Fig. 1, bottom).

This example also allows us to examine the implications of modeling the posterior by inhomogeneous Markov processes. In principle, we might have used as an approximation Markov processes with homogeneous rates, conditioned on the evidence. To examine whether our approximation behaves in this manner, we notice that in the single-component case we have
$$q_{x,y} = \frac{\rho_x(t)\,\gamma_{x,y}(t)}{\rho_y(t)\,\mu_x(t)},$$
which should be constant. Consider the analogous quantity in the multi-component case: $\tilde{q}^{\,i}_{x_i,y_i}(t)$, the geometric average of the rate of $X_i$, given the probability of the parents' state. Not surprisingly, this is exactly a mean field approximation, where the influence of interacting components is approximated by their average influence. Since the distribution of the parents (in the two-component system, the other component) changes in time, these rates change continuously, especially near the ends of the time interval. This suggests that a piecewise-homogeneous approximation cannot capture the dynamics without a loss in accuracy.
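For reference, the Ising-chain rate of Example 4.3 in code form (our sketch; the $\beta$ and $\tau$ values are arbitrary):

```python
import numpy as np

def ising_rate(y_i, neighbour_states, beta=2.0, tau=1.0):
    """Rate of flipping into state y_i, given the neighbours' states in {-1,+1}:
    tau * (1 + exp(-2 * y_i * beta * sum_j x_j))^(-1)."""
    return tau / (1.0 + np.exp(-2.0 * y_i * beta * sum(neighbour_states)))

# Agreement is preferred: flipping toward the neighbour's state is fast,
# flipping away from it is slow, and the contrast grows with beta.
print(ising_rate(+1, [+1]))   # ~0.98 * tau: moving to match the neighbour
print(ising_rate(-1, [+1]))   # ~0.02 * tau: moving to disagree
```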
Optimization Procedure  If $\mathbb{Q}$ is irreducible, then $\rho^i_{x_i}$ and $\mu^i_{x_i}$ are non-zero throughout the open interval $(0, T)$. As a result, we can solve (4) to express $\gamma^i_{x_i,y_i}$ as a function of $\mu^i$ and $\rho^i$, thus eliminating it from (3) to get evolution equations solely in terms of $\mu^i$ and $\rho^i$. Abstracting the details, we obtain a set of ODEs of the form
$$\frac{d}{dt}\mu^i(t) = \alpha\big(\mu^i(t), \rho^i(t), \mu^{\setminus i}(t)\big), \qquad \mu^i(0) \text{ given},$$
$$\frac{d}{dt}\rho^i(t) = \beta\big(\rho^i(t), \mu^{\setminus i}(t)\big), \qquad \rho^i(T) \text{ given},$$
where $\alpha$ and $\beta$ can be inferred from (3) and (4). Since the evolution of $\rho^i$ does not depend on $\mu^i$, we can integrate backward from time $T$ to solve for $\rho^i$. Then, integrating forward from time $0$, we compute $\mu^i$. After performing a single iteration of backward-forward integration, we obtain a solution that satisfies the fixed-point equation (3) for the $i$'th component. (This is not surprising once we have identified our procedure to be a variation of a standard forward-backward algorithm for a single component.) Such a solution will be a local maximum of the functional with respect to $\eta^i$ (reaching a local minimum or a saddle point requires very specific initialization points).

This suggests that we can use the standard procedure of asynchronous updates, where we update each component in a round-robin fashion (a sketch of such a sweep appears below). Since each of these single-component updates converges in one backward-forward step, and since it reaches a local maximum, each step improves the value of the free energy over the previous one. Since the free energy functional is bounded by the probability of the evidence, this procedure will always converge.

Another issue is the initialization of this procedure. Since the iteration over the $i$'th component depends on $\mu^{\setminus i}$, we need to initialize $\mu$ by some legal assignment. To do so, we create a fictional rate matrix $\tilde{\mathbb{Q}}^i$ for each component and initialize $\mu^i$ to be the posterior of the process given the evidence $e_{i,0}$ and $e_{i,T}$. As a reasonable initial guess, we choose at random one of the conditional rates in $\mathbb{Q}$ to determine the fictional rate matrix.
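A high-level sketch of the asynchronous sweep (our paraphrase; the callables init_marginals, update_component and free_energy are hypothetical stand-ins for the machinery described above):

```python
def mean_field_sweep(num_components, init_marginals, update_component,
                     free_energy, tol=1e-6, max_sweeps=100):
    """Round-robin coordinate ascent on the factored free energy."""
    mus = init_marginals()               # e.g., posteriors of fictional rate matrices
    prev = -float('inf')
    for _ in range(max_sweeps):
        for i in range(num_components):  # round-robin over components
            # one backward integration of rho_i, then one forward
            # integration of mu_i, holding mu_{\i} fixed
            mus[i] = update_component(i, mus)
        f = free_energy(mus)             # lower bound on ln P(e); never decreases
        if f - prev < tol:               # monotone and bounded, hence convergent
            return mus
        prev = f
    return mus
```

Each inner update is exactly the backward-forward pass sketched after Example 4.2, with the averaged rates recomputed from the current neighbours' marginals.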

Figure 2: (a) Relative error as a function of the coupling parameter $\beta$ (x-axis) and the transition rate $\tau$ (y-axis) for an 8-component Ising chain. (b) Comparison of true vs. estimated likelihood as a function of the rate parameter $\tau$. (c) Comparison of true vs. estimated likelihood as a function of the coupling parameter $\beta$.

The continuous-time update equations allow us to use standard ODE methods with an adaptive step size (here we use the Runge-Kutta-Fehlberg (4,5) method). At the price of some overhead, these procedures automatically tune the trade-off between error and time granularity.

5 Perspective & Related Works

Variational approximations for different types of continuous-time processes have recently been proposed [12, 13]. Our approach is motivated by the results of Opper and Sanguinetti [12], who developed a variational principle for a related model. Their model, which they call a Markov jump process, is similar to an HMM, in which the hidden chain is a continuous-time Markov process and there are (noisy) observations at discrete points along the process. They describe a variational principle and discuss the form of the functional when the approximation is a product of independent processes.

There are two main differences between the setting of Opper and Sanguinetti and ours. First, we show how to exploit the structure of the target CTBN to reduce the complexity of the approximation. These simplifications imply that the update of the $i$'th process depends only on its Markov blanket in the CTBN, allowing us to develop efficient approximations for large models. Second, and more importantly, the structure of the evidence in our setting is quite different, as we assume deterministic evidence at the ends of intervals. This setting typically leads to a posterior Markov process in which the instantaneous rates used by Opper and Sanguinetti diverge toward the end point (the rates of transition into the observed state go to infinity), leading to numerical problems at the end points. We circumvent this problem by using the marginal density representation, which is much more stable numerically.

Taking the general perspective of Wainwright and Jordan [15], the representation of the distribution uses the natural sufficient statistics. In the case of a continuous-time Markov process, the sufficient statistics are $T_x$, the time spent in state $x$, and $M_{x,y}$, the number of transitions from state $x$ to $y$. In a discrete-time model, we can capture the statistics for every random variable. In a continuous-time model, however, we need to consider the time derivatives of the statistics. Indeed, it is not hard to show that $\frac{d}{dt}\mathbb{E}[T_x(t)] = \mu_x(t)$ and $\frac{d}{dt}\mathbb{E}[M_{x,y}(t)] = \gamma_{x,y}(t)$. Thus, our marginal density sets $\eta$ provide what we consider a natural formulation for variational approaches to continuous-time Markov processes.

Our presentation focused on evidence at the two ends of an interval. Our formulation easily extends to deal with more elaborate types of evidence: (1) If we do not observe the initial state of the $i$'th component, we can set $\mu^i_x(0)$ to be the prior probability of $X^{(0)}_i = x$. Similarly, if we do not observe $X_i$ at time $T$, we set $\rho^i_x(T) = 1$ as initial data for the backward step. (2) In a CTBN where one (or more) components are fully observed, we simply set $\mu^i$ for these components to be a distribution that assigns all the probability mass to the observed trajectory. Similarly, if we observe different components at different times, we may update each component on a different time interval.
Consequently, maintaining for each component a marginal distribution $\mu^i$ throughout the interval of interest, we can update the other components using their own evidence patterns.

6 Experimental Evaluation

To gain better insight into the quality of our procedure, we performed numerical tests on models that challenge the approximation. Specifically, we used Ising chains in which we explore regimes defined by the degree of coupling between the components (the parameter $\beta$) and the rate of transitions (the parameter $\tau$). We evaluate the error in two ways. The first is the difference between the true log-likelihood and our estimate. The second is the average relative error in the estimates of the different expected sufficient statistics, defined by $\frac{1}{N}\sum_j |\hat{\theta}_j - \theta_j| / \theta_j$, where $\theta_j$ is the exact value of the $j$'th expected sufficient statistic, $\hat{\theta}_j$ is the approximation, and $N$ is the number of statistics.

Applying our procedure to an Ising chain with 8 components, for which we can still perform exact inference, we evaluated the relative error for different choices of $\beta$ and $\tau$. The evidence in this experiment is $e_0 = \{+,+,+,+,+,+,-,-\}$, $T = 0.64$, and $e_T = \{-,-,-,+,+,+,+,+\}$. As shown in Fig. 2a, the error is larger when $\tau$ and $\beta$ are large. In the case of weak coupling (small $\beta$), the posterior is almost independent, and our approximation is accurate. In models with few transitions (small $\tau$), most of the mass of the posterior is concentrated on a few canonical types of trajectories that can be captured by the approximation (as in Example 4.3). At high transition rates, the components tend to transition often, and in a coordinated manner, which leads to a posterior that is hard to approximate by a product distribution. Moreover, the resulting free energy landscape is rough, with many local maxima. Examining the error in likelihood estimates (Fig. 2b,c), we see a similar trend.

Next, we examined the run time of our approximation when using a fairly standard ODE solver with few optimizations and tunings. The run time is dominated by the time needed to perform the backward-forward integration when updating a single component, and by the number of such updates until convergence. Examining the run time for different choices of $\beta$ and $\tau$ (Fig. 3), we see that the run time of our procedure scales linearly with the number of components in the chain. Moreover, the run time is generally insensitive to the difficulty of the problem in terms of $\beta$. It does depend to some extent on the rate $\tau$, suggesting that processes with more transitions require more iterations to converge. Indeed, the number of iterations required to achieve convergence in the largest chains under consideration is only mildly affected by the parameter choices. The scalability of the run time stands in contrast to the Gibbs sampling procedure [4], which scales roughly with the number of transitions in the sampled trajectories.

Figure 3: Evaluation of the run time of the approximation versus the run time of exact inference as a function of the number of components.

Comparing our method to the Gibbs sampling procedure, we see (Fig. 4) that the faster Mean Field approach dominates the Gibbs procedure over short run times. However, as opposed to Mean Field, the Gibbs procedure is asymptotically unbiased, and with longer run times it ultimately prevails. This evaluation also shows that the adaptive integration procedure in our method strikes a better trade-off than using a fixed time-granularity integration.

7 Inference on Trees

The above experimental results indicate that our approximation is accurate when reasoning about weakly-coupled components, or about time intervals involving few transitions (low transition rates). Unfortunately, in many domains we face strongly-coupled components. For example, we are interested in modeling the evolution of biological sequences (DNA, RNA, and proteins). In such systems, we have a phylogenetic tree that represents the branching process that leads to present-day sequences (see Fig. 5a). It is common in sequence evolution to model this process as a continuous-time Markov process over a tree [6]. More precisely, the evolution along each branch is a standard continuous-time Markov process, and branching is modeled by a replication, after which each replica evolves independently along its sub-branch. Common applications are forced to assume that each character in the sequence evolves independently of the others.
In some situations, assuming independent evolution of each character is highly unreasonable. Consider the evolution of an RNA sequence that folds onto itself to form a functional structure. This folding is mediated by complementary base-pairing (A-U, C-G, etc.) that stabilizes the structure. During evolution, we expect to see compensatory mutations: if an A changes into a C, then its base-paired U will change into a G soon thereafter. To capture such coordinated changes, we need to consider the joint evolution of the different characters. In the case of RNA structure, the stability of the structure is determined by stacking potentials that measure the stability of two adjacent pairs of interacting nucleotides. Thus, if we consider a factor network to represent the energy of a fold, it will have the structure shown in Fig. 5b. We can convert this factor graph into a CTBN using procedures that consider the energy function as a fitness criterion in evolution [3, 16]. Unfortunately, inference in such models suffers from computational blowup, and so the few studies that deal with it explicitly resort to sampling procedures [16].

To consider trees, we need to extend our framework to deal with branching processes. In a linear-time model, we view the process as a map from $[0, T]$ into random variables $X^{(t)}$. In the case of a tree, we view the process as a map from a point $t = \langle b, t \rangle$ on a tree $\mathcal{T}$ (defined by the branch $b$ and the time $t$ within it) into a random variable $X^{(t)}$. Similarly, we generalize the definition of the Markov-consistent density set $\eta$ to include functions on trees. We define continuity of functions on trees in the obvious manner. The variational approximation on trees is thus similar to the one on intervals.

Figure 4: Evaluation of the run time vs. accuracy trade-off for several choices of parameters for Mean Field and Gibbs sampling on the branching process of Fig. 5(a).

Figure 5: (a) An example of a phylogenetic tree. Branch lengths denote time intervals between events, with the interval used for the comparison in Fig. 6a highlighted. (b) The form of the energy function for encoding RNA folding, superimposed on a fragment of a folded structure; each gray box denotes a term that involves four nucleotides. (c) Illustration of the ODE updates on a directed tree.

Figure 6: (a) Comparison of exact vs. approximate inference along the branch from C to D in the tree of Fig. 5(a), with and without additional evidence at the other leaves. Exact marginals are shown as solid lines, approximate marginals as dashed lines. The two panels show two different components. (b) Evaluation of the relative error in expected sufficient statistics for an Ising chain in branching time; compare to Fig. 2(a). (c) Evaluation of the estimated likelihood on a tree; compare to Fig. 2(b).

Within each branch, we deal with the same update formulas as in linear time. We denote by $\mu^i_{x_i}(b, t)$ and $\rho^i_{x_i}(b, t)$ the messages computed on branch $b$ at time $t$. The only changes occur at vertices. Suppose we have a branch $b_1$ of length $T_1$ incoming into vertex $v$, and two outgoing branches $b_2$ and $b_3$ (see Fig. 5c). Then we use the following updates for $\mu^i_{x_i}$ and $\rho^i_{x_i}$:
$$\mu^i_{x_i}(b_k, 0) = \mu^i_{x_i}(b_1, T_1), \quad k = 2, 3,$$
$$\rho^i_{x_i}(b_1, T_1) = \rho^i_{x_i}(b_2, 0)\,\rho^i_{x_i}(b_3, 0).$$
The forward propagation of $\mu^i$ simply uses the value at the end of the incoming branch as the initial value for the outgoing branches. In the backward propagation of $\rho^i$, the value at the end of $b_1$ is the product of the values at the start of the two outgoing branches. This is the natural operation when we recall the interpretation of $\rho^i$ as the probability of the downstream evidence given the current state. (A code sketch of these vertex updates appears at the end of this section.)

When switching to trees, we increase the amount of evidence about intermediate states. Consider, for example, the tree of Fig. 5a. We can view the span from C to D as an interval with evidence at its ends. When we add evidence at the tips of other branches, we gain more information about intermediate points between C and D. To evaluate the impact of these changes on our approximation, we considered the tree of Fig. 5a, and compared it to inference on the backbone between C and D (Fig. 2). Comparing the true marginals to the approximate ones along the main backbone (see Fig. 6a), we see a major difference in the quality of the approximation. The evidence in the tree leads to a much tighter approximation of the marginal distribution. A more systematic comparison (Fig. 6b,c) demonstrates that the additional evidence reduces the magnitude of the error throughout the parameter space.

As a more demanding test, we applied our inference procedure to the model introduced by Yu and Thorne [16] for a stem of 18 interacting RNA nucleotides in 8 species in the phylogeny of Fig. 5a. We compared our estimates of the expected sufficient statistics of this model to those obtained by the Gibbs sampling procedure.
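To make the vertex operations concrete, here is a minimal sketch (ours; the message containers are hypothetical numpy arrays) of the two updates at a branching vertex:

```python
import numpy as np

def vertex_update_mu(mu_end_b1, outgoing_branches):
    """Forward: mu_i(b_k, 0) = mu_i(b_1, T_1) for each outgoing branch b_k,
    i.e., the marginal at the branch point is simply replicated."""
    return {b: mu_end_b1.copy() for b in outgoing_branches}

def vertex_update_rho(rho_start_b2, rho_start_b3):
    """Backward: rho_i(b_1, T_1) = rho_i(b_2, 0) * rho_i(b_3, 0), the
    probability of all downstream evidence given the state at the vertex."""
    return rho_start_b2 * rho_start_b3

mu_out = vertex_update_mu(np.array([0.3, 0.7]), outgoing_branches=['b2', 'b3'])
rho_in = vertex_update_rho(np.array([0.9, 0.1]), np.array([0.5, 0.5]))
```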

Figure 7: Comparison of estimates of expected sufficient statistics in the evolution of 18 interacting nucleotides, using a realistic model of RNA evolution. Each point is an expected statistic value; the x-axis is the estimate by the variational procedure, whereas the y-axis is the estimate by Gibbs sampling.

The results, shown in Fig. 7, demonstrate that overall the two approximate inference procedures are in good agreement about the values of the expected sufficient statistics.

8 Discussion

In this paper we formulated a general variational principle for continuous-time Markov processes (by reformulating and extending the one proposed by Opper and Sanguinetti [12]), and used it to derive an efficient procedure for inference in CTBNs. In this mean field-type approximation, we use a product of independent inhomogeneous processes to approximate the multi-component posterior. Our procedure enjoys the same benefits encountered in the discrete-time mean field procedure [8]: it provides a lower bound on the likelihood of the evidence, and its run time scales linearly with the number of components. Using asynchronous updates it is guaranteed to converge, and the approximation represents a consistent joint distribution. It also suffers from the expected shortcomings: there are multiple local maxima, and it cannot capture certain complex interactions in the posterior. By using a time-inhomogeneous representation, our approximation does capture complex patterns in the temporal progression of the marginal distribution of each component. Importantly, the continuous-time parametrization enables a straightforward implementation using standard ODE integration packages that automatically tune the trade-off between time granularity and approximation quality. We showed how to extend the method to perform inference on phylogenetic trees, and showed that it provides fairly accurate answers in the context of a real application.

One of the key developments here is the shift from (piecewise) homogeneous parametric representations to continuously inhomogeneous representations based on marginal density sets. This shift increases the flexibility of the approximation and, somewhat surprisingly, also significantly simplifies the resulting formulation. A possible extension of the ideas set out here is to use our variational procedure to generate an initial distribution for Gibbs sampling, skipping the initial burn-in phase and producing accurate samples. Another attractive aspect of this new variational approximation is its potential use for learning model parameters from data. It can easily be combined with the EM procedure for CTBNs [10] to obtain a Variational-EM procedure for CTBNs, which monotonically increases the likelihood by alternating between steps that improve the approximation $\eta$ (the updates discussed here) and steps that improve the model parameters $\theta$.

Acknowledgments

We thank the anonymous reviewers for helpful remarks on previous versions of the manuscript. This research was supported in part by a grant from the Israel Science Foundation. Tal El-Hay is supported by the Eshkol fellowship from the Israeli Ministry of Science.

References

[1] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.
[2] K. L. Chung. Markov Chains with Stationary Transition Probabilities. Springer, 1960.
[3] T. El-Hay, N. Friedman, D. Koller, and R. Kupferman. Continuous time Markov networks. In UAI, 2006.
[4] T. El-Hay, N. Friedman, and R. Kupferman. Gibbs sampling in factorized continuous-time Markov processes. In UAI, 2008.
[5] Y. Fan and C. R. Shelton. Sampling for approximate inference in continuous time Bayesian networks. In AI and Math, 2008.
[6] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2004.
[7] C. W. Gardiner. Handbook of Stochastic Methods. Springer.
[8] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models, 1998.
[9] U. Nodelman, C. R. Shelton, and D. Koller. Continuous time Bayesian networks. In UAI, 2002.
[10] U. Nodelman, C. R. Shelton, and D. Koller. Expectation maximization and complex duration distributions for continuous time Bayesian networks. In UAI, 2005.
[11] U. Nodelman, C. R. Shelton, and D. Koller. Expectation propagation for continuous time Bayesian networks. In UAI, 2005.
[12] M. Opper and G. Sanguinetti. Variational inference for Markov jump processes. In NIPS, 2007.
[13] C. Archambeau, M. Opper, Y. Shen, D. Cornford, and J. Shawe-Taylor. Variational inference for diffusion processes. In NIPS, 2007.
[14] S. Saria, U. Nodelman, and D. Koller. Reasoning at the right time granularity. In UAI, 2007.
[15] M. J. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1:1-305, 2008.
[16] J. Yu and J. L. Thorne. Dependence among sites in RNA evolution. Mol. Biol. Evol., 23, 2006.


More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Designing Information Devices and Systems II Fall 2017 Note Theorem: Existence and Uniqueness of Solutions to Differential Equations

Designing Information Devices and Systems II Fall 2017 Note Theorem: Existence and Uniqueness of Solutions to Differential Equations EECS 6B Designing Information Devices an Systems II Fall 07 Note 3 Secon Orer Differential Equations Secon orer ifferential equations appear everywhere in the real worl. In this note, we will walk through

More information

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13) Slie10 Haykin Chapter 14: Neuroynamics (3r E. Chapter 13) CPSC 636-600 Instructor: Yoonsuck Choe Spring 2012 Neural Networks with Temporal Behavior Inclusion of feeback gives temporal characteristics to

More information

Permanent vs. Determinant

Permanent vs. Determinant Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an

More information

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations Optimize Schwarz Methos with the Yin-Yang Gri for Shallow Water Equations Abessama Qaouri Recherche en prévision numérique, Atmospheric Science an Technology Directorate, Environment Canaa, Dorval, Québec,

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

Lagrangian and Hamiltonian Mechanics

Lagrangian and Hamiltonian Mechanics Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical

More information

θ x = f ( x,t) could be written as

θ x = f ( x,t) could be written as 9. Higher orer PDEs as systems of first-orer PDEs. Hyperbolic systems. For PDEs, as for ODEs, we may reuce the orer by efining new epenent variables. For example, in the case of the wave equation, (1)

More information

Perturbation Analysis and Optimization of Stochastic Flow Networks

Perturbation Analysis and Optimization of Stochastic Flow Networks IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. XX, NO. Y, MMM 2004 1 Perturbation Analysis an Optimization of Stochastic Flow Networks Gang Sun, Christos G. Cassanras, Yorai Wari, Christos G. Panayiotou,

More information

Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing

Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing Course Project for CDS 05 - Geometric Mechanics John M. Carson III California Institute of Technology June

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Cascaded redundancy reduction

Cascaded redundancy reduction Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,

More information

A. Incorrect! The letter t does not appear in the expression of the given integral

A. Incorrect! The letter t does not appear in the expression of the given integral AP Physics C - Problem Drill 1: The Funamental Theorem of Calculus Question No. 1 of 1 Instruction: (1) Rea the problem statement an answer choices carefully () Work the problems on paper as neee (3) Question

More information

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information

Nonlinear Adaptive Ship Course Tracking Control Based on Backstepping and Nussbaum Gain

Nonlinear Adaptive Ship Course Tracking Control Based on Backstepping and Nussbaum Gain Nonlinear Aaptive Ship Course Tracking Control Base on Backstepping an Nussbaum Gain Jialu Du, Chen Guo Abstract A nonlinear aaptive controller combining aaptive Backstepping algorithm with Nussbaum gain

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

7.1 Support Vector Machine

7.1 Support Vector Machine 67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Calculus Class Notes for the Combined Calculus and Physics Course Semester I

Calculus Class Notes for the Combined Calculus and Physics Course Semester I Calculus Class Notes for the Combine Calculus an Physics Course Semester I Kelly Black December 14, 2001 Support provie by the National Science Founation - NSF-DUE-9752485 1 Section 0 2 Contents 1 Average

More information

Stable and compact finite difference schemes

Stable and compact finite difference schemes Center for Turbulence Research Annual Research Briefs 2006 2 Stable an compact finite ifference schemes By K. Mattsson, M. Svär AND M. Shoeybi. Motivation an objectives Compact secon erivatives have long

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Generalized Tractability for Multivariate Problems

Generalized Tractability for Multivariate Problems Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,

More information

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi

BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS. Mauro Boccadoro Magnus Egerstedt Paolo Valigi Yorai Wardi BEYOND THE CONSTRUCTION OF OPTIMAL SWITCHING SURFACES FOR AUTONOMOUS HYBRID SYSTEMS Mauro Boccaoro Magnus Egerstet Paolo Valigi Yorai Wari {boccaoro,valigi}@iei.unipg.it Dipartimento i Ingegneria Elettronica

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy, NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which

More information

Generalization of the persistent random walk to dimensions greater than 1

Generalization of the persistent random walk to dimensions greater than 1 PHYSICAL REVIEW E VOLUME 58, NUMBER 6 DECEMBER 1998 Generalization of the persistent ranom walk to imensions greater than 1 Marián Boguñá, Josep M. Porrà, an Jaume Masoliver Departament e Física Fonamental,

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Exponential asymptotic property of a parallel repairable system with warm standby under common-cause failure

Exponential asymptotic property of a parallel repairable system with warm standby under common-cause failure J. Math. Anal. Appl. 341 (28) 457 466 www.elsevier.com/locate/jmaa Exponential asymptotic property of a parallel repairable system with warm stanby uner common-cause failure Zifei Shen, Xiaoxiao Hu, Weifeng

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10 Some vector algebra an the generalize chain rule Ross Bannister Data Assimilation Research Centre University of Reaing UK Last upate 10/06/10 1. Introuction an notation As we shall see in these notes the

More information

THE ACCURATE ELEMENT METHOD: A NEW PARADIGM FOR NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS

THE ACCURATE ELEMENT METHOD: A NEW PARADIGM FOR NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS THE PUBISHING HOUSE PROCEEDINGS O THE ROMANIAN ACADEMY, Series A, O THE ROMANIAN ACADEMY Volume, Number /, pp. 6 THE ACCURATE EEMENT METHOD: A NEW PARADIGM OR NUMERICA SOUTION O ORDINARY DIERENTIA EQUATIONS

More information

Differentiability, Computing Derivatives, Trig Review

Differentiability, Computing Derivatives, Trig Review Unit #3 : Differentiability, Computing Derivatives, Trig Review Goals: Determine when a function is ifferentiable at a point Relate the erivative graph to the the graph of an original function Compute

More information

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE Journal of Soun an Vibration (1996) 191(3), 397 414 THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE E. M. WEINSTEIN Galaxy Scientific Corporation, 2500 English Creek

More information

Dissipative numerical methods for the Hunter-Saxton equation

Dissipative numerical methods for the Hunter-Saxton equation Dissipative numerical methos for the Hunter-Saton equation Yan Xu an Chi-Wang Shu Abstract In this paper, we present further evelopment of the local iscontinuous Galerkin (LDG) metho esigne in [] an a

More information

ELEC3114 Control Systems 1

ELEC3114 Control Systems 1 ELEC34 Control Systems Linear Systems - Moelling - Some Issues Session 2, 2007 Introuction Linear systems may be represente in a number of ifferent ways. Figure shows the relationship between various representations.

More information

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation

Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation Relative Entropy an Score Function: New Information Estimation Relationships through Arbitrary Aitive Perturbation Dongning Guo Department of Electrical Engineering & Computer Science Northwestern University

More information

Calculus of Variations

Calculus of Variations 16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t

More information

Differentiability, Computing Derivatives, Trig Review. Goals:

Differentiability, Computing Derivatives, Trig Review. Goals: Secants vs. Derivatives - Unit #3 : Goals: Differentiability, Computing Derivatives, Trig Review Determine when a function is ifferentiable at a point Relate the erivative graph to the the graph of an

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information

Fractional Geometric Calculus: Toward A Unified Mathematical Language for Physics and Engineering

Fractional Geometric Calculus: Toward A Unified Mathematical Language for Physics and Engineering Fractional Geometric Calculus: Towar A Unifie Mathematical Language for Physics an Engineering Xiong Wang Center of Chaos an Complex Network, Department of Electronic Engineering, City University of Hong

More information

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers International Journal of Statistics an Probability; Vol 6, No 5; September 207 ISSN 927-7032 E-ISSN 927-7040 Publishe by Canaian Center of Science an Eucation Improving Estimation Accuracy in Nonranomize

More information

Range and speed of rotor walks on trees

Range and speed of rotor walks on trees Range an spee of rotor wals on trees Wilfrie Huss an Ecaterina Sava-Huss May 15, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial configuration on regular trees

More information