Auxiliary Gibbs Sampling for Inference in Piecewise-Constant Conditional Intensity Models


Zhen Qin, University of California, Riverside
Christian R. Shelton, University of California, Riverside

Abstract

A piecewise-constant conditional intensity model (PCIM) is a non-Markovian model of temporal stochastic dependencies in continuous-time event streams. It allows efficient learning and forecasting given complete trajectories. However, no general inference algorithm has been developed for PCIMs. We propose an effective and efficient auxiliary Gibbs sampler for inference in PCIMs, based on the idea of thinning for inhomogeneous Poisson processes. The sampler alternates between sampling a finite set of auxiliary virtual events with adaptive rates, and performing an efficient forward-backward pass at discrete times to generate samples. We show that our sampler can successfully perform inference tasks in both Markovian and non-Markovian models, and can be employed in Expectation-Maximization PCIM parameter estimation and structural learning with partially observed data.

1 Introduction

Modeling temporal dependencies in event streams has wide applications. For example, users' behaviors in online shopping and web searches, social network activities, and machines' responses in datacenter management can each be viewed as a stream of events over time. Models that can successfully learn the complex dependencies among events (both label and timing) allow targeted online advertising, automatic policy selection in datacenter management, user behavior modeling, or event prediction and dependency understanding in general.

[Gunawardana et al., 2011] proposed the piecewise-constant conditional intensity model (PCIM), which captures the dependencies among the types of events through a set of piecewise-constant conditional intensity functions. A PCIM is represented as a set of decision trees, which allow for efficient model selection. Forecasting via forward sampling is also simple, by iteratively sampling the next events based on the current history. However, currently model selection and forecasting for PCIMs are only effective given complete data. When there are missing data, an inference method is needed to answer general queries or to be employed in expectation-maximization (EM) algorithms for model selection and parameter learning. Currently, no inference algorithm has been proposed for PCIMs that can condition on general evidence.

In this work, we propose the first general inference algorithm for PCIMs, based on the idea of thinning for inhomogeneous Poisson processes [Lewis and Shedler, 1979]. This results in an auxiliary Gibbs sampler that alternates between sampling a finite set of virtual event times given the current trajectory, and then sampling a new trajectory given the set of evidence and event times (virtual and actual). Our method is convergent, does not involve approximations like fixed time-discretization, and the samples generated can answer any type of query. We propose an efficient state-vector representation to maintain only the necessary information for diverging trajectories, reducing the exponentially increasing sampling complexity to linear in most cases. We show empirically that our inference algorithm converges to the true distribution, permits effective query answering, and aids model selection with incomplete data for PCIM models with both Markovian and complex non-Markovian dynamics. We also show the connection between PCIMs and continuous-time Bayesian networks (CTBNs), and compare our method with an existing method on such models.
2 Previous Work

A dynamic Bayesian network (DBN) [Dean and Kanazawa, 1988] models temporal dependencies between variables in discrete time. For systems that evolve asynchronously without a global clock, it is

often not clear how timestamps should be discretized. Health records, computer server logs, and social networks are examples of asynchronous event data streams. For such systems, too slow a sampling rate would poorly represent the data, while too fast a sampling rate makes learning and inference more costly.

Continuous-time models have drawn attention recently in applications ranging from social networks [Du et al., 2013, Saito et al., 2009, Linderman and Adams, 2014] to genetics [Cohn et al., 2009] to biochemical networks [Golightly and Wilkinson, 2011]. Continuous Time Bayesian Networks (CTBNs) [Nodelman et al., 2002] are homogeneous Markovian models of the joint trajectories of discrete finite variables, analogous to DBNs. Non-Markovian continuous-time models allow the rate of an event to be a function of the process's history. Poisson Networks [Rajaram et al., 2005] constrain this function to depend only on the counts of the number of events during a finite time window. Poisson cascades [Simma and Jordan, 2010] define the rate function to be the sum of a kernel applied to each historic event, and require the modeler to choose a parametric form for temporal dependencies. PCIM defines the intensity functions with decision trees, with internal nodes' tests mapping time and history to leaves. Each leaf is associated with a constant rate. PCIM is able to model non-Markovian temporal dependencies, and is an order of magnitude faster to learn than Poisson networks. Applications include modeling supercomputer event logs and forecasting future interests of web search users. While PCIMs have been extended in a number of ways [Parikh et al., 2012, Weiss and Page, 2013], there is no general inference algorithm.

Inference algorithms developed for continuous-time systems are mainly for Markovian models or are specifically designed for a particular application. For CTBNs, there are variational approaches such as expectation propagation [El-Hay et al., 2010] and mean field [Cohn et al., 2009], which do not converge to the true value as computation time increases. Sampling-based approaches include importance sampling [Fan et al., 2010] and Gibbs sampling [Rao and Teh, 2011, Rao and Teh, 2013], which do converge to the true value. The latter is the current state-of-the-art method designed for general Markov Jump Processes (MJPs) and their extensions (including CTBNs). It uses the idea of uniformization [Grassmann, 1977] for Markov models, similar to thinning [Lewis and Shedler, 1979] for inhomogeneous Poisson processes. We note that our inference method generalizes theirs to non-Markovian models.

3 PCIM Background

Assume events are drawn from a finite label set L. An event then can be represented by a time-stamp t and a label ℓ. An event sequence is x = {(t_i, ℓ_i)}_{i=1}^{n}, where 0 < t_1 < ... < t_n. We use h_i = {(t_j, ℓ_j) | (t_j, ℓ_j) ∈ x, t_j < t_i} for the history of event i, when it is clear from context which x is meant. We define the ending time t(y) of an event sequence y as the time of the last event in y, so that t(h_i) = t_{i−1}.

Figure 1: Decision trees representing S and θ for events of labels A and B. Note the dependency among event labels (the rate of one label depends on events of the other). [Gunawardana et al., 2011]

A conditional intensity model (CIM) is a set of non-negative conditional intensity functions indexed by label, {λ_ℓ(t | x; θ)}_{ℓ ∈ L}. The data likelihood is

p(x | θ) = ∏_{ℓ ∈ L} ∏_{i=1}^{n} λ_ℓ(t_i | h_i; θ)^{1_ℓ(ℓ_i)} e^{−Λ_ℓ(t_i | h_i; θ)}   (1)

where Λ_ℓ(t | h; θ) = ∫_{t(h)}^{t} λ_ℓ(τ | h; θ) dτ.
The indicator function 1_ℓ(ℓ_i) is one if ℓ_i = ℓ and zero otherwise. λ_ℓ(t | h; θ) is the expected rate of event ℓ at time t given history h and model parameters θ. Conditioning on the entire history causes the process to be non-Markovian. The modeling assumptions for a CIM are quite weak, as any distribution for x in which the timestamps are continuous random variables can be written in this form. Despite the weak assumptions, the per-label conditional factorization allows the modeling of label-specific dependence on past events.

PCIM is a particular class of CIM that restricts λ_ℓ(t | h) to be piecewise-constant (as a function of time) for any history, so the integral for Λ_ℓ breaks down into a finite number of components and forward sampling becomes feasible. PCIM represents the conditional intensity functions with decision trees. Each internal node in a tree is a binary test of the history, and each leaf contains an intensity. If the tests are piecewise-constant functions of time for any event history, the resulting function λ_ℓ(t | h) is piecewise-constant. Examples of admissible tests include: "Was the most recent event of a given label?" "Is the time of day between 6am and 9am?" "Did an event with a given label happen at least n times between 5 seconds ago and 2 seconds ago?" "Were the last two events of the same label?" Note some tests are non-Markovian in that they require knowledge of more than just which event was most recent. See Fig. 1 for an example of a PCIM model.
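To make the representation concrete, here is a minimal sketch (not the authors' implementation; the class names, test, and rates are made up) of such a decision-tree intensity: internal nodes hold history tests, leaves hold constant rates, and λ(t | h) is obtained by walking the tree.

```python
# Hypothetical sketch of a PCIM-style decision-tree intensity function.
class Leaf:
    def __init__(self, rate):
        self.rate = rate

class Node:
    def __init__(self, test, yes, no):
        self.test = test              # test(t, history) -> bool
        self.yes, self.no = yes, no

def count_in_window(label, lag1, lag2):
    """Test: at least one `label` event in [t - lag1, t - lag2)."""
    def test(t, history):
        return any(lbl == label and t - lag1 <= s < t - lag2
                   for (s, lbl) in history)
    return test

def rate(tree, t, history):
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.test(t, history) else node.no
    return node.rate

# Made-up two-label example in the spirit of Fig. 1:
tree_B = Node(count_in_window("A", 1.0, 0.0),
              Leaf(2.0),     # a recent A event raises B's rate
              Leaf(0.01))
print(rate(tree_B, 3.0, [(2.5, "A")]))  # 2.0
print(rate(tree_B, 3.0, [(0.5, "A")]))  # 0.01
```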

The decision tree for label ℓ maps the time and history to a leaf s ∈ Σ_ℓ, where Σ_ℓ is the set of leaves for ℓ. The resulting data likelihood can be simplified:

p(x | S, θ) = ∏_{ℓ ∈ L} ∏_{s ∈ Σ_ℓ} λ_s^{c_s(x)} e^{−λ_s d_s(x)} .   (2)

S is the PCIM structure represented by the decision trees; the model parameters θ are the rates at the leaves. c_s(x) is the number of times label ℓ occurs in x and is mapped to leaf s. d_s(x) is the total duration during which the event trajectory for ℓ is mapped to s. c and d are the sufficient statistics for calculating the data likelihood. [Gunawardana et al., 2011] showed that given the structure S, by using a product of Gamma distributions as a conjugate prior for θ, the marginal likelihood of the data can be given in closed form, and thus parameter estimation can be done in closed form. Furthermore, imposing a structural prior allows a closed-form Bayesian score to be used for greedy tree learning.

4 Auxiliary Gibbs Sampling for PCIM

In this section we introduce our new inference algorithm for PCIM, called ThinnedGibbs, based on the idea of thinning for inhomogeneous Poisson processes. We handle incomplete data in which there are intervals of time during which events for particular label(s) are not observed.

4.1 Why Inference in PCIM is Difficult

Filling in partially observed trajectories for PCIM is hard due to the complex dependencies between unobserved events and both past and future events. See Fig. 2 for an example. While the history (the event at t) says it is likely that there should be events in the unobserved area (with an expected rate of 2), future evidence (no events in R) is contradictory: if there were indeed events in the unobserved area, those events should stimulate events happening in R. Such a phenomenon might suggest existing algorithms such as the forward-filtering-backward-sampling (FFBS) algorithm for discrete-time Markov chains. However, there are two subtleties here: first, we are dealing with non-Markovian models; second, we are dealing with continuous-time systems, so the number of time steps over which to propagate is infinite.

Figure 2: A simple PCIM with a partially observed trajectory. The vertical solid arrow indicates an evidence event. Areas between parentheses are unobserved. History alone indicates there should be events filled in, while the future (no events in R) provides contradictory evidence.

4.2 Thinning

Thinning [Lewis and Shedler, 1979] can be used to turn a continuous-time process into a discrete-time one, without using a fixed time-slice granularity. We select a rate λ* greater than any rate in the inhomogeneous Poisson process and sample from a homogeneous process with this rate. To get a sample from the original inhomogeneous process, an event at time t is thinned (dropped) with probability 1 − λ(t)/λ*.

This process can also be reversed. Given the set of thinned event times (samples from the inhomogeneous process), extra events can be added to recover a sample from the original constant-rate process by sampling from a Poisson process with rate λ* − λ(t). The cycle can then repeat by thinning the new total set of times (ignoring how they were generated). At each cycle, the times (after thinning) are drawn from the original inhomogeneous process. It is this type of cycle we will employ in our sampler.

The difficulty is that a PCIM is not an inhomogeneous Poisson process. Its intensity depends on the entire history of events, not just the current time. For thinning, this means that we cannot independently sample whether each event is to be thinned. Furthermore, we wish to sample from the posterior process, conditioned on evidence. All evidence (both past and future) affects the probability of a specific thinning configuration.
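Before conditioning on evidence, the basic thinning cycle for a single inhomogeneous Poisson process can be sketched as follows (a minimal illustration; the example rate function, dominating rate, and horizon are made up):

```python
import random

def sample_homogeneous(lam_star, T):
    """Sample event times of a homogeneous Poisson process with rate lam_star on [0, T)."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(lam_star)
        if t >= T:
            return times
        times.append(t)

def thin(times, lam, lam_star):
    """Keep each event at time t with probability lam(t) / lam_star."""
    return [t for t in times if random.random() < lam(t) / lam_star]

def add_virtual(times_real, lam, lam_star, T):
    """Reverse step: fill back dropped ("virtual") events at rate lam_star - lam(t)."""
    virtual = thin(sample_homogeneous(lam_star, T),
                   lambda t: lam_star - lam(t), lam_star)
    return sorted(times_real + virtual)

# Example: rate 2 on [0, 1), rate 0.5 on [1, 3); lam_star must dominate lam.
lam = lambda t: 2.0 if t < 1.0 else 0.5
lam_star = 2.0
sample = thin(sample_homogeneous(lam_star, 3.0), lam, lam_star)
print(sample)
```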
4.3 Overview of Our Method

To overcome both of these problems, we extend thinning to an auxiliary Gibbs sampler in the same way that [Rao and Teh, 2011, Rao and Teh, 2013] extended Markovian-model uniformization [Grassmann, 1977] (a specific example of thinning in a Markov process) to a Gibbs sampler. To do this we introduce auxiliary variables representing the events that were dropped. We call these events virtual events.

As a standard Gibbs sampler, our method cycles through each variable in turn. In our case, a variable corresponds to an event label. For event label ℓ, let x_ℓ be the sampled event sequence for this label. Let e denote all evidence (for ℓ and other labels) together with the (currently fixed) samples for other labels. Our goal is to sample from p(x_ℓ | e). Let v be the virtual events (the auxiliary variable) associated with ℓ and z = x_ℓ ∪ v (all event times, virtual and non-virtual). Our method first samples from p(v | x_ℓ, e) and then samples from p(x_ℓ | z, e). The first step adds virtual events given that the non-virtual events are correct. The second step treats all events as potential events and drops or keeps events. The dropped events are removed completely. The kept events, x_ℓ, remain as the new sampled trajectory.

The proof of correctness follows analogously to that of [Rao and Teh, 2013] for Markovian systems. However, the details for sampling from p(v | x_ℓ, e) and p(x_ℓ | z, e) differ. We describe them next.

4.4 Sampling Auxiliary Virtual Events with Adaptive Rates

Sampling from p(v | x_ℓ, e) amounts to adding just the virtual (dropped) events. As the full trajectory (x_ℓ for all ℓ) is known, the rate at any time for a virtual event is independent of any other virtual events. Therefore, the process is an inhomogeneous Poisson process for which the rate at t is equal to λ*_ℓ − λ_ℓ(t | h), where h is fully determined by x and the evidence. Recall that λ_ℓ(t | h) is piecewise-constant in time, so sampling from such an inhomogeneous Poisson process is simple.

The auxiliary rate, λ*_ℓ, must be strictly greater than the maximum rate possible, for irreducibility. We use an auxiliary rate of λ*_ℓ = 2 max(λ_ℓ(t | h)) to sample virtual events in the unobserved intervals. This choice balances mixing time (better with higher λ*_ℓ) and computational complexity (better with lower λ*_ℓ). A naïve way to pick λ*_ℓ is to find λ_max, the maximum rate in the leaves of the PCIM, and use 2λ_max. However, there could be unobserved time intervals with a possible maximum rate much smaller than λ_max. Using λ_max in those regions would generate too many virtual events, most of which will be dropped in the next step, leading to computational inefficiency. We therefore use an adaptive strategy.

Our adaptive λ*_ℓ(t | h) cannot depend on x_ℓ (this would break the simplicity of sampling mentioned above). Therefore, we determine λ*_ℓ(t | h) by passing (t, h) down the PCIM tree for λ_ℓ. At each internal node, if the branch does not depend on x_ℓ, we can directly take one branch. Otherwise, the test is related to the sampled events, and we take the maximum rate over both branches. This method results in λ*_ℓ(t | h) being a piecewise-constant function of time (for the same reasons that λ_ℓ(t | h) is piecewise-constant). Consider Fig. 3 as an example. When sampling events of one label on the interval [1, 5), we would not take the left branch at the root (no matter what events for that label have been sampled), but must maximize over the other two leaves (as different values of the sampled events would result in different leaves). This results in λ* = 4 over this interval, which is smaller than 2λ_max = 6.

Figure 3: Adaptive auxiliary rate example. When sampling one label, the branch to take at the root does not depend on that label's unobserved events. If a test is related to the sampled event, we take the maximum rate from both branches. The red arrows indicate the branches taken between times [1, 5], and λ* = 2 × 2 = 4 in that interval, instead of 6.
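Continuing the hypothetical Leaf/Node classes from the earlier sketch, the adaptive bound can be computed by a tree walk that maximizes over both branches whenever a test depends on the label currently being resampled (an illustration, not the authors' code):

```python
def adaptive_rate_bound(node, t, history, depends_on):
    """Upper bound on lambda(t | h) over all ways of filling in events of the
    label being resampled; depends_on(test) says whether a test reads that label."""
    if isinstance(node, Leaf):
        return node.rate
    if depends_on(node.test):
        # Test outcome could change with the resampled events: bound by the max.
        return max(adaptive_rate_bound(node.yes, t, history, depends_on),
                   adaptive_rate_bound(node.no, t, history, depends_on))
    # Test is answered by evidence and other labels alone: follow one branch.
    branch = node.yes if node.test(t, history) else node.no
    return adaptive_rate_bound(branch, t, history, depends_on)

# The adaptive auxiliary rate would then be, e.g.,
# lam_star = 2 * adaptive_rate_bound(tree, t, observed_history, depends_on).
```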
4.5 The Naïve FFBS Algorithm

Once these virtual events are added back in, we take z_ℓ (the union of virtual events and real sampled events) as a sample from the Poisson process with rate λ*_ℓ and ignore which events were originally virtual and which were originally real. We then thin this set to get a sample from the conditional marginal over x_ℓ. The restriction to consider events only at times in z_ℓ transforms the continuous-time problem into a discrete one. Given z_ℓ with m possible event times (z_{ℓ,1}, z_{ℓ,2}, ..., z_{ℓ,m}), let b = {b_i}_{i=1}^{m} be a set of binary variables, one per event, where b_i = 1 if event i is included in x_ℓ (otherwise b_i = 0 and the event is not included in x_ℓ). Thus sampling b is equivalent to sampling x_ℓ (z_ℓ is known), as it specifies which events in z_ℓ are in x_ℓ.

Let e_{i:j} be the portion of e between times z_{ℓ,i} and z_{ℓ,j}, and b_{i:j} = {b_k | i ≤ k ≤ j}. We wish to sample b (and thereby x_ℓ) from

p(b | e) ∝ ( ∏_i p(e_{i−1:i}, b_i | b_{1:i−1}, e_{1:i−1}) ) p(e_{m:} | b)   (3)

where the final e_{m:} signifies all of the evidence after the last virtual event time z_{ℓ,m} and can be handled similarly to the other terms. The most straightforward method for such sampling considers each possible assignment to b (of which there are 2^m). For each interval, we multiply in terms from Eq. 3 of the form

p(e_{i−1:i}, b_i | b_{1:i−1}, e_{1:i−1}) = p(e_{i−1:i} | b_{1:i−1}, e_{1:i−1}) p(b_i | b_{1:i−1}, e_{1:i})   (4)

where the first term is the likelihood of the trajectory interval from z_{ℓ,i−1} to z_{ℓ,i} and the second term is the probability of the event being kept or thinned, given the past history. The first can be computed by tallying the sufficient statistics (counts and durations) and applying Eq. 2. Note that these sufficient statistics take into account b_{1:i−1}, which specifies events for ℓ during the unobserved region(s), and the likelihood must also be calculated for labels for which λ(t | h) depends on events from ℓ. The second term is equal to λ_ℓ(t | h) / λ*_ℓ(t) if b_i = 1 (and 1 − λ_ℓ(t | h)/λ*_ℓ(t) if b_i = 0). The numerator's dependence on the full history similarly dictates a dependence on b_{1:i−1}.
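For one inter-event interval, the two factors of Eq. 4 might be tallied as in the following sketch (hypothetical argument names; the sufficient statistics are assumed to have been accumulated for this interval given b_{1:i−1}):

```python
import math

def interval_log_weight(b_i, rates, counts, durations, lam_here, lam_star_here):
    """rates[s]: leaf rate; counts[s], durations[s]: sufficient statistics for
    this interval (they depend on the kept events so far, b_{1:i-1});
    lam_here, lam_star_here: lambda_l and lambda*_l at the candidate time."""
    # p(e_{i-1:i} | b_{1:i-1}, e_{1:i-1}): per Eq. 2, a product over leaves of
    # rates[s]^counts[s] * exp(-rates[s] * durations[s]).
    log_lik = sum((counts[s] * math.log(rates[s]) if counts[s] else 0.0)
                  - rates[s] * durations[s]
                  for s in rates)
    # p(b_i | b_{1:i-1}, e_{1:i}): keep with probability lambda_l / lambda*_l.
    p_keep = lam_here / lam_star_here
    return log_lik + math.log(p_keep if b_i else 1.0 - p_keep)
```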

This might be formulated as a naïve FFBS algorithm: to generate one sample, we propagate possible trajectories forward in time, multiplying in Eq. 4 at each inter-event interval to account for the evidence. Every time we see a virtual event, each possible trajectory diverges into two (depending on whether the virtual event is to be thinned or not). By the end, we have all 2^m possible trajectories, each with its probability (Eq. 3). We sample one trajectory as the output, in proportion to the calculated likelihoods. As we explicitly keep all possible trajectories, the sampled trajectory immediately tells us which virtual events are kept, so no actual backward pass is needed.

4.6 An Efficient State-Vector Representation

The naïve FFBS algorithm is clearly not practical, as the number of possible trajectories grows exponentially with the number of auxiliary virtual events (m). We propose a more efficient state-vector representation to keep only the necessary information for each possible trajectory. The idea takes advantage of the structure of the PCIM and leads to state merges, similar to what happens in FFBS for hidden Markov models (HMMs).

The terms in Eq. 4 depend on b_{1:i−1} only through the tests in the internal nodes of the PCIM trees. Therefore, we do not have to keep track of all of b_{1:i−1} to calculate these likelihoods, but only the current state of those tests that depend on events with label ℓ. For example, a test that asks "Is the last event of label ℓ?" only needs to maintain a bit as the indicator. The test "Are there more than 3 events of label q in the last 5 seconds?" for q ≠ ℓ has no state, as b_{1:i−1} does not affect its outcome. By contrast, a test such as "Is the last event of label q?" does depend on b, even if q ≠ ℓ. As we propagate forward, we merge b_{1:i} sequences that result in the same set of states for all internal tests inside the PCIM. See Fig. 4 for a simple example. Though there are 8 possible trajectories, they merge to only 2 states that we can sample from. Similar to FFBS for HMMs, we need to maintain the transition probabilities in the forward pass and use them in a backward sampling pass to recover the full trajectory, but such information is also linear. Note that this conversion to a Markov system for sampling is not possible in the original continuous-time system. Thinning allows it by randomly selecting a few discrete time points, thereby restricting the possible state space to be finite.

Figure 4: Dotted events are the virtual events that we sample as binary variables (b_i is 1 if event i is kept). The state diagram below the trajectory indicates the state of the test as we diverge (keep or drop a virtual event). Though there are 2^3 possible configurations, state merges can reduce the exponentially increasing complexity to linear in this case.

The state space depends on the actual tests in the PCIM model. See Table 1 for the tests we currently support and their state representations. The LastStateTest and StateTest are used to support discrete finite variable systems such as CTBNs, as we will use in Sec. 5 and in the experiments. Note the EventCountTest was the only supported test in the original PCIM paper. We can see that for tests that only depend on the current time (i.e., TimeTest), the diverging history does not affect them, so no state is needed. For Markovian tests (LastEventTest and LastStateTest), we only need a Boolean variable.
For the non-markovian test (Event- CountTest), the number of possibe states does grow exponentiay with the number of virtua events maintained in the queue. This is the best we can do and sti be exact. It is much better than growing with the number of a virtua events. However, note that commony ag2 = 0 and n is a sma number. In this case, the state space size at any point is bounded as ( ) m n, where m is the maximum number of samped events in any time interva of duration ag1 (which is upper bounded by m). If n is 1, this is inear in the number of sampes generated in during ag1 time units. s noted above, if the test is not reated to the samped event (for exampe, in samping event = with test are there 3 events in the ast 5 seconds ), the state of the test is nu. This is because the evidence and samped vaues for (not the current variabe for Gibbs samping) can answer this test without reference to sampes of. See g. 1 for the agorithm description for resamping event. The compete agorithm iterates this procedure for each event abe to get a new sampe. The heper function UpdateState(s,b,t) returns the new state given the od state (s), the new time (t), and whether an event occurs at t (b). SampProbMap(M) takes a mapping from objects to positive vaues (M) and randomy returns one of the objects with probabiity proportiona to the associate vaue. ddtoprobmap(m,o,p) checks to see if o is in M. If so, it adds p to the associated probabiity. Otherwise, it adds the mapping o p to M. 4.7 Extended Exampe Fig. 5 shows an exampe of resamping the events for abe on the unobserved interva [0.8, 3.5). On the far eft is

Table 1: Tests and their corresponding state representations.
- TimeTest ("Is the time between 6am and 9am?"): state is null; independent of b.
- LastEventTest ("Is the last event ℓ?"): state is a Boolean; Markovian.
- EventCountTest ("Are there >= n events of ℓ in [t − lag1, t − lag2]?"): state is a queue maintaining all the times of ℓ events in [t − lag2, t], and the most recent n events in [t − lag1, t − lag2]; non-Markovian.
- LastStateTest ("Is the last sublabel of variable X = 0?"): state is a Boolean; Markovian.
- StateTest ("Is the current sublabel of variable X = 0?"): state is null; independent of b.

Figure 5: Extended example; see Section 4.7.

4.7 Extended Example

Fig. 5 shows an example of resampling the events for one label on the unobserved interval [0.8, 3.5). On the far left is the PCIM rate tree for that event label. Box (a) shows the sample from the previous iteration (a single event at 2.3). Dashed lines and λ values show the piecewise-constant intensity function given the sample. Box (b) shows the sampling of virtual events. For this case λ* = 3 for all time, and λ* − λ is the rate for virtual events. The algorithm samples from this process, resulting in two virtual events (dashed). In box (c) all events become potential events. The state of the root test is a queue of recent events. The state of the other test is a Boolean (whether the other label's event is more recent). On the bottom is the lattice of joint states over time. Solid arrows indicate b_i = 1 (the event is kept). Dashed arrows indicate b_i = 0 (the event is dropped). Each arrow's weight is as per Eq. 4. The probability of a node is the sum over all paths to the node of the product of the weights on the path (calculated by dynamic programming). In box (d) a single path is sampled with backward sampling, shown in bold. This path corresponds to keeping the first and last virtual events and dropping the middle one.

5 Representing CTBNs as PCIMs

A non-Markovian PCIM is more general than the Markovian CTBN model. We can, therefore, represent a CTBN using a PCIM. In this way, we can extend PCIMs and we can compare our PCIM method with existing methods for CTBNs.

Figure 6: Explicit conversion from a CTBN to a PCIM by using specific tests. Only the rates for one variable are shown. The colored arrows and boxes show the one-to-one correspondence between a path in the tree and an entry in the rate matrix of the CTBN. Diagonal elements in the CTBN are redundant and do not need to be represented in the PCIM.

We associate a PCIM label with each CTBN variable. We also augment the notion of a PCIM label to include a sublabel. For each CTBN variable, its PCIM label has one sublabel for each state of the CTBN variable.

Algorithm 1 Resampling event ℓ
input: the previous trajectory (x, e)
output: the newly sampled x_ℓ
1: for each unobserved interval for ℓ do
2:   find the piecewise-constant λ*_ℓ(t | h)
3:   find the piecewise-constant λ_ℓ(t | h) using x and e
4:   sample virtual events v with rate λ*_ℓ(t | h) − λ_ℓ(t | h)
5: let z = x_ℓ ∪ v, m = |z|, and s_0 be the initial state
6: AddToProbMap(S_0, s_0, 1.0)
7: for i = 1 to m do
8:   for each {(s_{i−1}, ·) → p} in S_{i−1} do
9:     p_keep = p(e_{i−1:i}, b_i = 1 | s_{i−1}, e_{1:i−1})
10:    p_drop = p(e_{i−1:i}, b_i = 0 | s_{i−1}, e_{1:i−1})
11:    s_i^keep ← UpdateState(s_{i−1}, true, z_{ℓ,i})
12:    s_i^drop ← UpdateState(s_{i−1}, false, z_{ℓ,i})
13:    AddToProbMap(S_i, (s_i^keep, z_{ℓ,i}), p · p_keep)
14:    AddToProbMap(S_i, (s_i^drop, ∅), p · p_drop)
15:    AddToProbMap(T_i(s_i^keep), (s_{i−1}, z_{ℓ,i}), p · p_keep)
16:    AddToProbMap(T_i(s_i^drop), (s_{i−1}, ∅), p · p_drop)
17: update S_m by propagating until the ending time
18: x_ℓ ← ∅ and (s_m, t) ← SampleProbMap(S_m)
19: if t ≠ ∅ then x_ℓ ← x_ℓ ∪ {t}
20: for i = m − 1 down to 1 do
21:   (s_i, t) ← SampleProbMap(T_{i+1}(s_{i+1}))
22:   if t ≠ ∅ then x_ℓ ← x_ℓ ∪ {t}
23: return x_ℓ

Therefore, a PCIM event with label X and sublabel x corresponds to a transition of the CTBN variable X from its previous value to the value x. The PCIM trees' tests can also check the sublabel associated with the possible event. We augment the auxiliary Gibbs sampler to sample not only which virtual events are kept, but also which sublabel is associated with each. This involves modifying the b_i variables from the previous section to be multi-valued. Otherwise, the algorithm proceeds the same way. The last two tests in Table 1 are explicitly for this type of sublabeled event model. We can use them to turn a conditional intensity matrix from the CTBN into a PCIM tree. Fig. 6 shows the conversion for the two-node model.

6 Experiments

We implement our method as part of an open-source code base, and all the code and data will be publicly available. We perform two sets of experiments to validate our method. First, we perform inference with our method on both Markovian and non-Markovian models, and compare the results with the ground-truth statistics. For both we show our results converge to the correct values. Ours is the first method that can successfully perform inference tasks on non-Markovian PCIMs. For the second set of experiments, we use ThinnedGibbs in EM for both parameter estimation and structural learning for a non-Markovian PCIM. Our inference algorithm can indeed help produce models that achieve higher data likelihood on holdout test data than several baseline methods.

6.1 Verification on the Ising Model

We first evaluate our method, ThinnedGibbs, on a network with Ising model dynamics. The Ising model is a well-known interaction model with applications in many fields including statistical mechanics, genetics, and neuroscience. This is a Markovian model and has been tested by several existing inference methods designed for CTBNs. Using this model, we generate a directed toroid network structure with cycles following [El-Hay et al., 2010]. Nodes can take values −1 and 1, and follow their parents' states according to a coupling strength parameter (β). A rate parameter (τ) determines how fast nodes toggle between states. We test with β = 0.5 and τ = 2. The net-

work and the evidence patterns are shown in Fig. 7. The nodes are not observed between t = 0 and t = 1. We query the marginal distribution of nodes at t = 0.5 and measure the sum of the KL-divergences of all marginals against the ground truth. We compare with the state-of-the-art CTBN Auxiliary Gibbs method [Rao and Teh, 2013]. Other existing methods either produce similar or worse results [Celikkaya and Shelton, 2014]. We vary the sample size between 50 and 5000, and set the burn-in period to be 10% of this value. We run the experiments 100 times, and plot the means and standard deviations. Results in Fig. 8 verify that our inference method indeed produces results that converge to the true distribution. Our method reduces to that of [Rao and Teh, 2013] in this Markovian model. Differences between the two lines are due to slightly different initializations of the Gibbs Markov chain and are not significant.

Figure 7: The toroid network and observed patterns [El-Hay et al., 2010].

Figure 8: Number of samples versus KL divergence for the toroid network. Both axes are on a log scale.

6.2 Verification on a Non-Markovian Model

We further verify our method on a much more challenging non-Markovian PCIM (Fig. 9). This model contains several non-Markovian EventCountTests. We have observations for event A at t = 0.4, 0.6, 1.8, 4.7 and for event B at t = 0.1, 0.2, 3.4, 3.6, 3.7. Event A is not observed on [2.0, 4.0) and event B is not observed on [1.0, 3.0).

Figure 9: Non-Markovian PCIM and evidence. The ending time is 5.

To produce ground truth, we discretized time and converted the system to a Markovian system. Note that because the time since the last event is part of the state, as the discretization becomes finer, the state space increases. For this small example, this approach is just barely feasible. We continued to refine the discretization until the answer stabilized. The ground-truth expected total number of A events between [0, 5] is  and the expected total number of B events is . That is, there are about  A events and 6.62 B events in the unobserved areas. Note that if the evidence is changed to have no events, these numbers drop to  and  respectively, and if the evidence after the unobserved intervals is ignored, the expectations are  and  respectively. Therefore the evidence (both before and after the unobserved intervals) is important to incorporate in inference.

We compare our inference method to the exact values, again varying the sample size between 50 and 5000 and setting the burn-in period to be 10% of this value. We ran the experiments 100 times and report the mean and standard deviation of the two expectations. Our sampler has very small bias and therefore the average values match the true values almost exactly. The variance decreases as expected, demonstrating the consistent nature of our method. See Fig. 10. We are not aware of existing methods that can perform inference on this type of model to which we could compare.

Figure 10: Number of samples versus the inferred expected number of events. The horizontal axis is on a log scale.

6.3 Parameter Estimation and Structural Learning

We further test ThinnedGibbs by using it in EM, for both parameter estimation (given the tree structure, estimate the rates in the leaves) and structural learning (learn both the structure and the rates). We use Monte Carlo EM that iterates between two steps. First, given a model, we generate samples conditioned on evidence with ThinnedGibbs.
Second, given the samples, we treat them as complete trajectories and perform parameter estimation and structural learning, which is efficient for PCIM. We initialize the model from the partial trajectories, assuming no events occur in the unobserved intervals. EM terminates when the parameters of the PCIMs in two consecutive iterations are stable (all rates change less than 10% from the previous ones), or the number of iterations surpasses 10.
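A structural sketch of this Monte Carlo EM loop (the function names here are placeholders, not the authors' code):

```python
def monte_carlo_em(partial_data, max_iters=10, tol=0.10, burn_in=10):
    # Initialize from the partial trajectories, assuming empty unobserved intervals.
    model = learn_pcim(assume_empty_unobserved(partial_data))
    for _ in range(max_iters):
        # E-step (sampling): fill in unobserved intervals with ThinnedGibbs.
        completed = [thinned_gibbs_fill(model, traj, burn_in=burn_in)
                     for traj in partial_data]
        # M-step: closed-form PCIM parameter/structure learning on complete data.
        new_model = learn_pcim(completed)
        if max_relative_rate_change(model, new_model) < tol:
            return new_model
        model = new_model
    return model
```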

For structural learning, the structure needs to be the same between iterations. We use the model in Fig. 1 and generate complete trajectories for the time range [0, 10). We vary the number of training samples (5, 10, 15, and 20) and use a fixed set of 100 trajectories as the testing data. For each training size, we use the same training data for all algorithms and runs. We randomly generate an unobserved interval with length 0.6 T for both event labels. For each training sample, ThinnedGibbs fills it in to generate a new sample after burning in 10 steps. For each configuration, we run ThinnedGibbs 5 times. We measure the data likelihood of the holdout testing data on the learned models.

For parameter estimation, we compare with the model that generated the data, the model learned with only partial data in which we assume no events happened during unseen intervals (Partial Data), and a model learned with complete training data (Complete Data). The results are summarized in Fig. 11. We can see that the model learned by the EM algorithm using ThinnedGibbs can indeed produce significantly higher testing likelihood than using only partial data. Of course, we do not do as well as if none of the data had been hidden (Complete Data).

Figure 11: Parameter estimation. Testing log-likelihood as a function of the number of training samples.

If learning the structure, there is one other possibility: we could use the original fast PCIM learning method, but indicate (by new event labels) when an unobserved interval starts and stops. We augment the bank of possible decisions to include testing whether each such pseudo-event has occurred most recently. In this way, the PCIM directly models the process that obscures the data. Of course, at test time, branches modeling such unobserved times are not used. Such a model should serve as a better baseline than learning from partially observed data, because it can potentially learn unobserved patterns and use only the dependencies in the observed intervals for a better model. We call this model EMUP (explicit modeling of unobserved patterns).

For structure learning, we fix the bank of possible PCIM tests as EventCountTests with (ℓ, n, lag1, lag2) ∈ {A, B} × {1, 2} × {2, 3, 4, 5, 6} × {0, 1, 2} (omitting tests for which lag1 ≤ lag2). For EMUP we also allow testing whether we are currently in an unobserved interval. The results are summarized in Fig. 12. We can see that EMUP does outperform models using only partial data. However, structural EM with ThinnedGibbs still performs better. The performance gain is less than that in the parameter estimation task, probably because there are more local optima for structural EM, especially with fewer training examples.

Figure 12: Structure and parameter estimation. Testing log-likelihood as a function of the number of training examples.

7 Discussion and Future Work

We proposed the first effective inference algorithm, ThinnedGibbs, for PCIMs. Our auxiliary Gibbs sampling method effectively transforms a continuous-time problem into a discrete one. Our state-vector representation of diverging trajectories takes advantage of state merges and reduces the complexity from exponential to linear in most cases. We build the connection between PCIMs and CTBNs, and show our method generalizes the state-of-the-art inference method for CTBN models. In experiments we validate our idea on non-Markovian PCIMs, being the first method to do so.
Our method converges to the exact conditional distribution. If the state of the model indeed grows exponentially, the complexity of ThinnedGibbs follows. We believe this technique could also be applied to other non-Markovian processes. The challenge lies in computing the forward-pass likelihoods when the rate function is not piecewise-constant.

Acknowledgement

This work was supported by DARPA (F ).

References

[Celikkaya and Shelton, 2014] Celikkaya, E. B. and Shelton, C. R. (2014). Deterministic anytime inference for stochastic continuous-time Markov processes. In ICML.

[Cohn et al., 2009] Cohn, I., El-Hay, T., Kupferman, R., and Friedman, N. (2009). Mean field variational approximation for continuous-time Bayesian networks. In UAI.

[Dean and Kanazawa, 1988] Dean, T. and Kanazawa, K. (1988). Probabilistic temporal reasoning. In AAAI.

[Du et al., 2013] Du, N., Song, L., Gomez-Rodriguez, M., and Zha, H. (2013). Scalable influence estimation in continuous-time diffusion networks. In NIPS.

[El-Hay et al., 2010] El-Hay, T., Cohn, I., Friedman, N., and Kupferman, R. (2010). Continuous-time belief propagation. In ICML.

[Fan et al., 2010] Fan, Y., Xu, J., and Shelton, C. R. (2010). Importance sampling for continuous time Bayesian networks. Journal of Machine Learning Research, 11(Aug).

[Golightly and Wilkinson, 2011] Golightly, A. and Wilkinson, D. J. (2011). Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus.

[Grassmann, 1977] Grassmann, W. K. (1977). Transient solutions in Markovian queueing systems. Computers & Operations Research, 4(1).

[Gunawardana et al., 2011] Gunawardana, A., Meek, C., and Xu, P. (2011). A model for temporal dependencies in event streams. In NIPS.

[Lewis and Shedler, 1979] Lewis, P. A. W. and Shedler, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 26(3).

[Linderman and Adams, 2014] Linderman, S. W. and Adams, R. P. (2014). Discovering latent network structure in point process data. In ICML.

[Nodelman et al., 2002] Nodelman, U., Shelton, C. R., and Koller, D. (2002). Continuous time Bayesian networks. In UAI.

[Parikh et al., 2012] Parikh, A., Gunawardana, A., and Meek, C. (2012). Conjoint modeling of temporal dependencies in event streams. In UAI Workshops.

[Rajaram et al., 2005] Rajaram, S., Graepel, T., and Herbrich, R. (2005). Poisson-networks: A model for structured point processes. In AIStats.

[Rao and Teh, 2011] Rao, V. and Teh, Y. W. (2011). Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In UAI.

[Rao and Teh, 2013] Rao, V. and Teh, Y. W. (2013). Fast MCMC sampling for Markov jump processes and extensions. Journal of Machine Learning Research, 14.

[Saito et al., 2009] Saito, K., Kimura, M., Ohara, K., and Motoda, H. (2009). Learning continuous-time information diffusion model for social behavioral data analysis. In ACML.

[Simma and Jordan, 2010] Simma, A. and Jordan, M. (2010). Modeling events with cascades of Poisson processes. In UAI.

[Weiss and Page, 2013] Weiss, J. and Page, D. (2013). Forest-based point processes for event prediction from electronic health records. In ECML-PKDD.


More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017 In-pane shear stiffness of bare stee deck through she finite eement modes G. Bian, B.W. Schafer June 7 COLD-FORMED STEEL RESEARCH CONSORTIUM REPORT SERIES CFSRC R-7- SDII Stee Diaphragm Innovation Initiative

More information

1D Heat Propagation Problems

1D Heat Propagation Problems Chapter 1 1D Heat Propagation Probems If the ambient space of the heat conduction has ony one dimension, the Fourier equation reduces to the foowing for an homogeneous body cρ T t = T λ 2 + Q, 1.1) x2

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

Throughput Optimal Scheduling for Wireless Downlinks with Reconfiguration Delay

Throughput Optimal Scheduling for Wireless Downlinks with Reconfiguration Delay Throughput Optima Scheduing for Wireess Downinks with Reconfiguration Deay Vineeth Baa Sukumaran vineethbs@gmai.com Department of Avionics Indian Institute of Space Science and Technoogy. Abstract We consider

More information

2M2. Fourier Series Prof Bill Lionheart

2M2. Fourier Series Prof Bill Lionheart M. Fourier Series Prof Bi Lionheart 1. The Fourier series of the periodic function f(x) with period has the form f(x) = a 0 + ( a n cos πnx + b n sin πnx ). Here the rea numbers a n, b n are caed the Fourier

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS

FORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS FORECASTING TEECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODES Niesh Subhash naawade a, Mrs. Meenakshi Pawar b a SVERI's Coege of Engineering, Pandharpur. nieshsubhash15@gmai.com

More information

The EM Algorithm applied to determining new limit points of Mahler measures

The EM Algorithm applied to determining new limit points of Mahler measures Contro and Cybernetics vo. 39 (2010) No. 4 The EM Agorithm appied to determining new imit points of Maher measures by Souad E Otmani, Georges Rhin and Jean-Marc Sac-Épée Université Pau Veraine-Metz, LMAM,

More information

The influence of temperature of photovoltaic modules on performance of solar power plant

The influence of temperature of photovoltaic modules on performance of solar power plant IOSR Journa of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vo. 05, Issue 04 (Apri. 2015), V1 PP 09-15 www.iosrjen.org The infuence of temperature of photovotaic modues on performance

More information

4 1-D Boundary Value Problems Heat Equation

4 1-D Boundary Value Problems Heat Equation 4 -D Boundary Vaue Probems Heat Equation The main purpose of this chapter is to study boundary vaue probems for the heat equation on a finite rod a x b. u t (x, t = ku xx (x, t, a < x < b, t > u(x, = ϕ(x

More information

arxiv:hep-ph/ v1 15 Jan 2001

arxiv:hep-ph/ v1 15 Jan 2001 BOSE-EINSTEIN CORRELATIONS IN CASCADE PROCESSES AND NON-EXTENSIVE STATISTICS O.V.UTYUZH AND G.WILK The Andrzej So tan Institute for Nucear Studies; Hoża 69; 00-689 Warsaw, Poand E-mai: utyuzh@fuw.edu.p

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Symbolic models for nonlinear control systems using approximate bisimulation

Symbolic models for nonlinear control systems using approximate bisimulation Symboic modes for noninear contro systems using approximate bisimuation Giordano Poa, Antoine Girard and Pauo Tabuada Abstract Contro systems are usuay modeed by differentia equations describing how physica

More information

Paragraph Topic Classification

Paragraph Topic Classification Paragraph Topic Cassification Eugene Nho Graduate Schoo of Business Stanford University Stanford, CA 94305 enho@stanford.edu Edward Ng Department of Eectrica Engineering Stanford University Stanford, CA

More information

How the backpropagation algorithm works Srikumar Ramalingam School of Computing University of Utah

How the backpropagation algorithm works Srikumar Ramalingam School of Computing University of Utah How the backpropagation agorithm works Srikumar Ramaingam Schoo of Computing University of Utah Reference Most of the sides are taken from the second chapter of the onine book by Michae Nieson: neuranetworksanddeepearning.com

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea- Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process Management,

More information

arxiv: v1 [math.co] 17 Dec 2018

arxiv: v1 [math.co] 17 Dec 2018 On the Extrema Maximum Agreement Subtree Probem arxiv:1812.06951v1 [math.o] 17 Dec 2018 Aexey Markin Department of omputer Science, Iowa State University, USA amarkin@iastate.edu Abstract Given two phyogenetic

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

Lecture 6: Moderately Large Deflection Theory of Beams

Lecture 6: Moderately Large Deflection Theory of Beams Structura Mechanics 2.8 Lecture 6 Semester Yr Lecture 6: Moderatey Large Defection Theory of Beams 6.1 Genera Formuation Compare to the cassica theory of beams with infinitesima deformation, the moderatey

More information

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers Ant Coony Agorithms for Constructing Bayesian Muti-net Cassifiers Khaid M. Saama and Aex A. Freitas Schoo of Computing, University of Kent, Canterbury, UK. {kms39,a.a.freitas}@kent.ac.uk December 5, 2013

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Discrete Applied Mathematics

Discrete Applied Mathematics Discrete Appied Mathematics 159 (2011) 812 825 Contents ists avaiabe at ScienceDirect Discrete Appied Mathematics journa homepage: www.esevier.com/ocate/dam A direct barter mode for course add/drop process

More information

From Margins to Probabilities in Multiclass Learning Problems

From Margins to Probabilities in Multiclass Learning Problems From Margins to Probabiities in Muticass Learning Probems Andrea Passerini and Massimiiano Ponti 2 and Paoo Frasconi 3 Abstract. We study the probem of muticass cassification within the framework of error

More information

Integrating Factor Methods as Exponential Integrators

Integrating Factor Methods as Exponential Integrators Integrating Factor Methods as Exponentia Integrators Borisav V. Minchev Department of Mathematica Science, NTNU, 7491 Trondheim, Norway Borko.Minchev@ii.uib.no Abstract. Recenty a ot of effort has been

More information

AALBORG UNIVERSITY. The distribution of communication cost for a mobile service scenario. Jesper Møller and Man Lung Yiu. R June 2009

AALBORG UNIVERSITY. The distribution of communication cost for a mobile service scenario. Jesper Møller and Man Lung Yiu. R June 2009 AALBORG UNIVERSITY The distribution of communication cost for a mobie service scenario by Jesper Møer and Man Lung Yiu R-29-11 June 29 Department of Mathematica Sciences Aaborg University Fredrik Bajers

More information

Physics 235 Chapter 8. Chapter 8 Central-Force Motion

Physics 235 Chapter 8. Chapter 8 Central-Force Motion Physics 35 Chapter 8 Chapter 8 Centra-Force Motion In this Chapter we wi use the theory we have discussed in Chapter 6 and 7 and appy it to very important probems in physics, in which we study the motion

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

Uniformly Reweighted Belief Propagation: A Factor Graph Approach

Uniformly Reweighted Belief Propagation: A Factor Graph Approach Uniformy Reweighted Beief Propagation: A Factor Graph Approach Henk Wymeersch Chamers University of Technoogy Gothenburg, Sweden henkw@chamers.se Federico Penna Poitecnico di Torino Torino, Itay federico.penna@poito.it

More information

Math 124B January 17, 2012

Math 124B January 17, 2012 Math 124B January 17, 212 Viktor Grigoryan 3 Fu Fourier series We saw in previous ectures how the Dirichet and Neumann boundary conditions ead to respectivey sine and cosine Fourier series of the initia

More information

Reliability: Theory & Applications No.3, September 2006

Reliability: Theory & Applications No.3, September 2006 REDUNDANCY AND RENEWAL OF SERVERS IN OPENED QUEUING NETWORKS G. Sh. Tsitsiashvii M.A. Osipova Vadivosto, Russia 1 An opened queuing networ with a redundancy and a renewa of servers is considered. To cacuate

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance Efficient Simiarity Search across Top-k Lists under the Kenda s Tau Distance Koninika Pa TU Kaisersautern Kaisersautern, Germany pa@cs.uni-k.de Sebastian Miche TU Kaisersautern Kaisersautern, Germany smiche@cs.uni-k.de

More information

6 Wave Equation on an Interval: Separation of Variables

6 Wave Equation on an Interval: Separation of Variables 6 Wave Equation on an Interva: Separation of Variabes 6.1 Dirichet Boundary Conditions Ref: Strauss, Chapter 4 We now use the separation of variabes technique to study the wave equation on a finite interva.

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets

A Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets A Covariance Function for Exact Gaussian Process Inference in Large Datasets Arman ekumyan Austraian Centre for Fied Robotics The University of Sydney NSW 26, Austraia a.mekumyan@acfr.usyd.edu.au Fabio

More information