Learning to Reject Sequential Importance Steps for Continuous-Time Bayesian Networks
Jeremy C. Weiss, University of Wisconsin-Madison, Madison, WI, US
Sriraam Natarajan, Indiana University, Bloomington, IN, US
C. David Page, University of Wisconsin-Madison, Madison, WI, US

Abstract

Applications of graphical models often require the use of approximate inference, such as sequential importance sampling (SIS), for estimation of the model distribution given partial evidence, i.e., the target distribution. However, when SIS proposal and target distributions are dissimilar, such procedures lead to biased estimates or require a prohibitive number of samples. We introduce ReBaSIS, a method that better approximates the target distribution by sampling variable by variable from existing importance samplers and accepting or rejecting each proposed assignment in the sequence: a choice made based on anticipating upcoming evidence. We relate the per-variable proposal and model distributions by expected weight ratios of sequence completions and show that we can learn accurate models of optimal acceptance probabilities from local samples. In a continuous-time domain, our method improves upon previous importance samplers by transforming an SIS problem into a machine learning one.

1 Introduction

Sequential importance sampling (SIS) is a method for approximating an intractable target distribution by sampling from a proposal distribution and weighting the sample by the ratio of target to proposal distribution probabilities at each step of the sequence. It provides the basis for many distribution approximations with applications including robotic environment mapping and speech recognition (Montemerlo et al. 2003; Wölfel and Faubel 2007). The characteristic shortcoming of importance sampling stems from the potentially high weight variance that results from large differences in the target and proposal densities.
SIS compounds this problem by iteratively sampling over the steps of the sequence, resulting in sequence weights equal to the product of step weights. The sequence weight distribution is exponential, so only the high-weight tail of the sample distribution contributes substantially to the approximation. Two approaches to mitigate this problem are filtering, e.g., (Doucet, Godsill, and Andrieu 2000; Fan, Xu, and Shelton 2010), and adaptive importance sampling, e.g., (Cornebise, Moulines, and Olsson 2008; Yuan and Druzdzel 2003; 2007a).

One drawback of filtering is that it does not efficiently account for future evidence, and in cases of severe proposal-evidence mismatch, many resampling steps are required, leading to sample impoverishment. Akin to adaptive importance sampling, our method addresses proposal-evidence mismatch by developing foresight, i.e., adaptation to approaching evidence, to guide its proposals. It develops foresight by learning a binary classifier, dependent on approaching evidence, to accept or reject each proposal. Our procedure can be viewed as the construction of a second proposal distribution, learned to account for evidence and to better approximate the target distribution, and is a new form of adaptive importance sampling.

In greater detail, our task is to recover a target distribution f*, which can be factored variable by variable into component conditional distributions f*_i for i in 1...k. The SIS framework provides a suboptimal surrogate distribution g, which likewise can be factored into a set of conditional distributions g_i. We propose a second surrogate distribution h closer to f* based on learning conditional acceptance probabilities a_i of rejection samplers relating f*_i and g_i. That is, to sample from h, we iteratively (re-)sample from proposals g_i and accept with probability a_i.

(Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)
Our key idea is to relate the proposal g_i and target f*_i distributions by the ratio of expected weights of sequence completions, i.e., a setting for each variable from i to k, given acceptance and rejection of the sample from g_i. Given the expected weight ratio, we can recover the optimal acceptance probability a*_i and thus f*_i. Unfortunately, the calculation of the expected weights, and thus the ratio, is typically intractable because of the exponential number of sequence completions. Instead, we can approximate it using machine learning. First, we show that the expected weight ratio equals the odds of accepting a proposal from g_i under the f*_i distribution. Then, transforming the odds to a probability, we can learn a binary classifier for the probability of acceptance under f*_i given the sample proposal from g_i. Finally, we show how to generate examples to train a classifier to make the optimal accept/reject decision. We specifically examine the application of our rejection-based SIS (ReBaSIS) algorithm to continuous-time Bayesian networks (CTBNs) (Nodelman, Shelton, and Koller 2002), where inference (even approximate) is NP-hard (Sturlaugson and Sheppard 2014).
We proceed as follows. First we present an example and related work, and we outline the problem of SIS and the identification of good proposal distributions. Then, we define rejection sampling within SIS and show how to approximate the target distribution via binary classification. Finally, we extend our analysis to CTBNs and describe experiments that show the empirical advantages of our method over previous CTBN importance samplers.

1.1 An Illustrative Example

Figure 1 describes our method in the simplest relevant example: a binary-state Markov chain. For our example, let k = 3; then we have evidence that z_3 = 1. One possible sample procedure could be: S, accept z_1^2, reject z_2^2, accept z_2^1, reject z_3^2, accept z_3^1, T, giving us the path: S, z_1^2, z_2^1, z_3^1, T. Note that if the proposal g_3: z_2^1 → z_3^1 were very improbable under g but not f (i.e., proposal-evidence mismatch), all samples running through z_2^1 would have very large weight. By introducing the possibility of rejection at each step, our procedure can learn to reject samples to z_3^2, reducing the importance sampling weight, and learn to enter states z_2^1 and z_2^2 proportionally to f(·|e), i.e., develop foresight.

1.2 Related Work

As mentioned above, batch resampling techniques based on rejection control (Liu, Chen, and Wong 1998; Yuan and Druzdzel 2007b) or sequential Monte Carlo (SMC) (Doucet, Godsill, and Andrieu 2000; Fan, Xu, and Shelton 2010), i.e., particle filtering, can mitigate the SIS weight variance problem, but they can lead to reduced particle diversity, especially when many resampling iterations are required. Particle smoothing (Fan, Xu, and Shelton 2010) combats particle impoverishment, but the exponentially-large state spaces used in CTBNs limit its ability to find alternative, probable sample histories.
Previous adaptive importance sampling methods rely on structural knowledge and other inference methods, e.g., (Cheng and Druzdzel 2000; Yuan and Druzdzel 2003), to develop improved proposals, whereas our method learns a classifier to help guide samples through regions of proposal-evidence mismatch. One interesting idea combining work in filtering and adaptive importance sampling is the SMC² algorithm (Chopin et al. 2011), which maintains a sample distribution over both particles and parameters determining the proposal distribution, resampling along either dimension as necessary. The method does not anticipate future evidence, so it may complement our work, which can similarly be used in the SMC framework. Other MCMC (Rao and Teh 2011) or particle MCMC (Andrieu, Doucet, and Holenstein 2010) methods may have trouble in large state spaces (e.g., CTBNs) with multiple modes and low density regions in between, especially if there is proposal-evidence mismatch.

2 Background

We begin with a review of importance sampling and then introduce our surrogate distribution. Let f be a p.d.f. defined on an ordered set of random variables Z = {Z_1, ..., Z_k} over an event space Ω. We are interested in the conditional distribution f(z|e), where evidence e is a set of observations of values {Z_i = z_i} for a subset of indices i ∈ κ. For fixed e, we define our target p.d.f. f*(z) = f(z|e).

Figure 1: A source-to-sink representation of a binary-state Markov chain with evidence at z_k (red). Distributions f and g are defined over paths from source S to sink T and are composed of element-wise distributions f_i and g_i. For a sample at state z_1^2 (dark blue), an assignment to z_2^2 is proposed (light blue) according to g_2. To mimic sampling from f*_2 = f_2(·|e), the proposed assignment is accepted with probability proportional to the ratio of expected weights of path completions from z_2^2 and z_1^2 to T.
Let g(z) be a surrogate distribution from which we can sample, such that if f*(z) > 0 then g(z) > 0. Then for any subset Z' ⊆ Ω, we can approximate f* with n weighted samples from g:

∫_{Z'} f*(z) dz = ∫_{Z'} (f*(z)/g(z)) g(z) dz
              ≈ (1/n) Σ_{i=1}^n 1[z^i ∈ Z'] f*(z^i)/g(z^i)
              = (1/n) Σ_{i=1}^n 1[z^i ∈ Z'] w^i

where 1[z^i ∈ Z'] is the indicator function with value 1 if z^i ∈ Z' and 0 otherwise, and w^i is the importance sample weight. We design a second surrogate h(z) with the density corresponding to accepting a sample from g:

h(z) = g(z) a(z) ( ∫_Ω g(ζ) a(ζ) dζ )^(-1)    (1)

where a(z) is the sample acceptance probability from g. The last term in Equation 1 is a normalizing functional (of g and a) to ensure that h(z) is a density. Procedurally, we sample from h by (re-)sampling from g and accepting with probability a. The approximation of f* with h is given by:

∫_{Z'} f*(z) dz = ∫_{Z'} (f*(z)/g(z)) (g(z)/h(z)) h(z) dz
              ≈ (1/n) Σ_{i=1}^n 1[z^i ∈ Z'] w_f^i w_g^i

with weights w_f^i = f*(z^i)/g(z^i) and w_g^i = g(z^i)/h(z^i). To ensure that h has the support of f*, we require that both a and g are non-zero everywhere f* is non-zero. Now we can define our optimal resampling density h*(z) using the optimal choice of acceptance probability a*(z) = min(1, f*(z)/(α g(z))), where α is a constant determining the familiar rejection sampler envelope: α g(z). The density h*(z) is optimal in the sense that, for appropriate choice of α such that f*(z) ≤ α g(z) for all z, h*(z) = f*(z). When h(z) = f*(z), the importance weights are exactly 1, and the effective sample size is n. In many applications the direct calculation of f*(z) is intractable or impossible, and thus we cannot directly recover a*(z) or h*(z). However, we can still use these ideas to find h(z) through sequential importance sampling (Liu, Chen, and Wong 1998), which we describe next.

2.1 Sequential Importance Sampling

Sequential importance sampling (SIS) is used when estimating the distribution f over the factorization of Z. In time-series models, Z_i is a random variable over the joint state corresponding to a time step; in continuous-time models, Z_i is the random variable corresponding to an interval. Defining z_{j:i} = {z_j, z_{j+1}, ..., z_i} for i, j ∈ {1, ..., k} and j ≤ i, we have the decomposition:

f*(z) = p(z_1, ..., z_k | e)
      = g_1(z_1 | e) w_1(z_1 | e) ∏_{i=2}^k g_i(z_i | z_{1:i-1}, e) w_i(z_i | z_{1:i-1}, e)    (2)

where p(·) is the probability distribution under f. Equation 2 substitutes p with p.d.f. g by defining functions g_i and w_i(·) = p(·)/g_i(·) and requiring g_i to have the same support as p. Then we define g by the composition of the g_i: g(z|e) = g_1(z_1|e) ∏_{i=2}^k g_i(z_i | z_{1:i-1}, e), and likewise for w. To generate a sample z^j from proposal distribution g(z|e), SIS samples each z_i in order from 1 to k.

3 Methods

In this section we relate the target and proposal decompositions. Recall that we are interested in sampling directly from the conditional distribution f*(z) = p(z|e) for fixed e. We define interval distributions f*_1(z_1) and f*_i(z_i | z_{1:i-1}) such that f*(z) can be factored into the interval distributions: f*(z) = f*_1(z_1) ∏_{i=2}^k f*_i(z_i | z_{1:i-1}). We define the interval distributions:

f*_i(z_i | z_{1:i-1}) = (p(e | z_{1:i}) / p(e | z_{1:i-1})) p(z_i | z_{1:i-1})

for i > 1 and p(e | z_{1:i-1}) > 0, and f*_1(z_1) = p(e | z_1) p(z_1) / p(e) for i = 1 and p(e) > 0.
Then, by the law of total probability, we have:

f*_i(z_i | z_{1:i-1}) = (E_f[1[e, z] | z_{1:i}] / E_f[1[e, z] | z_{1:i-1}]) p(z_i | z_{1:i-1}).    (3)

The indicator function 1[e, z] is shorthand for 1[⋀_{l∈κ} {z_l = e_l} | z] and takes value 1 if the evidence matches z and 0 otherwise. Note that in the sampling framework, Equation 3 corresponds to sampling from the unconditioned proposal distribution p(z_i | z_{1:i-1}) and calculating the expected weight of sample completions z_{i+1:k} and z_{i:k}, given by the indicator functions. This procedure describes the expected outcome obtained by forward sampling with rejection. However, when f and f* are highly dissimilar, the vast majority of samples from f will be rejected, i.e., 1[e, z] = 0 for most z. It may be better to sample from a proposal g with weight function w = f/g so that sampling leads to fewer rejections. Substituting g in Equation 3, we get:

f*_i(z_i | z_{1:i-1}) = (E_g[w_{i:k}^a] / E_g[w_{i:k}^r]) g_i(z_i | z_{1:i-1}).    (4)

The terms E_g[w_{i:k}^a] and E_g[w_{i:k}^r] are the expected forward importance sampling weights of z_{i:k} given acceptance (a) or rejection (r) of proposed assignment z_i. The derivation of Equation 4 is provided in the Appendix. Equation 4 provides the relationship between f*_i and g_i, given by the ratio of expected weights of completion of sample z under acceptance or rejection of z_i. This allows us to further improve g and gives us the sampling distribution h.

3.1 Rejection Sampling to Recover f*_i

Because Equation 4 relates the two distributions, we can generate samples from f*_i by conducting rejection sampling from g_i. Selecting constant α such that f*_i ≤ α g_i, i.e., α g_i is the rejection envelope, we define the optimal interval acceptance probability a*_i by:

a*_i(z_i | z_{1:i-1}) = f*_i(z_i | z_{1:i-1}) / (α g_i(z_i | z_{1:i-1})) = E_g[w_{i:k}^a] / (α E_g[w_{i:k}^r]).    (5)

By defining a*_i for all i, we can generate an unweighted sample from f*(z) in O(k max_i (f*_i(·)/g_i(·))) steps given the weight ratio expectations and appropriate choice of α.
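The per-step propose/accept loop implied by Equation 5 can be sketched in a few lines. The following is a minimal, self-contained Python illustration, not the paper's implementation: the binary chain, the proposal, and the acceptance function are made-up stand-ins for g_i and a*_i, chosen so that acceptance favors ending in state 1 (mimicking evidence z_k = 1). Weight bookkeeping is omitted for brevity.

```python
import random

def propose_g(prev):
    # Illustrative proposal g_i: stay in the current state with prob 0.5.
    return prev if random.random() < 0.5 else 1 - prev

def accept_prob(state, step, k):
    # Stand-in for a_i = f*_i / (alpha * g_i): near the final step,
    # strongly prefer state 1 to anticipate the evidence z_k = 1.
    return 0.9 if (state == 1 or step < k - 1) else 0.1

def rebasis_step(prev, step, k):
    """Resample from g_i until a proposal is accepted."""
    while True:
        z = propose_g(prev)
        if random.random() < accept_prob(z, step, k):
            return z

def sample_sequence(k, z0=0):
    z, seq = z0, []
    for i in range(k):
        z = rebasis_step(z, i, k)
        seq.append(z)
    return seq

random.seed(0)
seqs = [sample_sequence(3) for _ in range(2000)]
# With acceptance biased toward z_3 = 1, most sequences end in state 1.
frac_end_1 = sum(s[-1] == 1 for s in seqs) / len(seqs)
```

With a 50/50 proposal and acceptance probabilities 0.9 vs. 0.1 at the final step, the accepted final state is 1 about 90% of the time, which is the foresight behavior the method aims to learn.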
Thus, if we can recover a*_i for all i, we get h = f* as desired and our procedure generates unweighted samples.

3.2 Estimating the Weight Ratio

The procedure of sampling intervals to completion depends on the expected weight ratio in Equation 4. Unfortunately, exact calculation of the ratio is impractical because the expectations involved require summing over an exponential number of terms. We could resort to estimating it from weighted importance samples: completions of z given z_{1:i} and z given z_{1:i-1}. While possible, this is inefficient because (1) it would require weight estimations for every z_i given z_{1:i-1}, and (2) the estimation of the expected weights itself relies on importance sampling. However, we can cast the estimation of the weight ratio as a machine learning problem of binary classification. We recognize that similar situations, in terms of state z_{1:i}, evidence e, model (providing f), and proposal g, result in similar values of a*_i. Thus, we can learn a binary classifier Φ_i(z_{1:i}, e, f, g) to represent the probability of {acceptance, rejection} = {φ_i(·), 1 − φ_i(·)} as a function of the situation. In particular, the expected weight ratio in Equation 4 is proportional to the odds under f* of accepting the z_i sampled from g_i. The binary classifier provides an estimate of the probability of acceptance φ_i(z_{1:i}, e, f, g), from which we can derive the odds of acceptance. Substituting into Equation 5, we have:

a*_i(z_i | z_{1:i-1}) ≈ (1/α) (φ_i(z_{1:i}, e, f, g) / (1 − φ_i(z_{1:i}, e, f, g))) = â_i(z_i | z_{1:i-1}),

denoting the approximations as â_i for all i. Then our empirical proposal density is:

ĥ(z) = ĥ_1(z_1) ∏_{i=2}^k ĥ_i(z_i)
     = g_1(z_1) â_1(z_1) c_1[g_1, â_1] ∏_{i=2}^k g_i(z_i | z_{1:i-1}) â_i(z_i | z_{1:i-1}) c_i[g_i, â_i],

where the c_i are the normalizing functionals as in Equation 1. We provide pseudocode for the rejection-based SIS (ReBaSIS) procedure in Algorithm 1.

3.3 Training the Classifier Φ_i

Conceptually, generating examples for the classifier Φ_i is straightforward. Given some z_{1:i-1}, we sample z_i from g_i, accept with probability ρ = 1/2, and sample to completion using g to get the importance weight. Then, an example is (y, x, w): y ∈ {accept, reject}, x is a set of features encoding the situation, and w is the importance weight. The training procedure works because the mean weight of the positive examples estimates E_g[w_{i:k}^a], and likewise the mean weight of the negative examples estimates E_g[w_{i:k}^r]. By sampling with training acceptance probability ρ, a calibrated classifier Φ_i^ρ (i.e., one that minimizes the L2 loss) estimates the probability:

ρ E_g[w_{i:k}^a] / (ρ E_g[w_{i:k}^a] + (1 − ρ) E_g[w_{i:k}^r]).

The estimated probability can then be used to recover the expected weight ratio:

E_g[w_{i:k}^a] / E_g[w_{i:k}^r] ≈ ((1 − ρ)/ρ) (φ_i^ρ(z_{1:i}, e, f, g) / (1 − φ_i^ρ(z_{1:i}, e, f, g))),

and thus, the optimal classifier can be used to recover the rejection-based acceptance probability a*_i. We get our particular estimator φ_i/(1 − φ_i) by setting ρ = 1/2, though in principle we could use other values of ρ. In practice, we sample trajectories alternating acceptance and rejection of samples z_i to get a proportion ρ = 1/2. Then, we continue sampling the same trajectory to produce 2k training examples for a sequence of length k. We adopt this procedure for efficiency at the cost of generating related training examples.
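The recovery of the weight ratio from a calibrated classifier can be checked numerically. In the sketch below (illustrative only: the completion-weight distributions and their means are invented), the "classifier" sees a single constant situation, so its calibrated output reduces to the weighted fraction of positive examples; the odds transform then recovers E_g[w^a]/E_g[w^r].

```python
import random

random.seed(1)
rho = 0.5
# Hypothetical completion-weight means: accepting leads to mean weight
# 2.0, rejecting to mean weight 0.5 (made-up numbers for the demo).
E_wa, E_wr = 2.0, 0.5

examples = []
for _ in range(50000):
    accept = random.random() < rho            # training acceptance prob rho
    mean_w = E_wa if accept else E_wr
    w = random.expovariate(1.0 / mean_w)      # noisy completion weight
    examples.append((accept, w))

# A calibrated classifier over a single constant feature reduces to the
# weighted fraction of positives:
w_pos = sum(w for y, w in examples if y)
w_neg = sum(w for y, w in examples if not y)
phi = w_pos / (w_pos + w_neg)
# phi estimates rho*E[w^a] / (rho*E[w^a] + (1-rho)*E[w^r]) = 0.8 here.

# Recover the expected weight ratio via the odds transform:
ratio = ((1 - rho) / rho) * phi / (1 - phi)   # should be near E_wa/E_wr = 4
```

Here the true ratio is 2.0/0.5 = 4, and the odds of the calibrated probability recover it up to Monte Carlo noise, matching the estimator φ/(1 − φ) obtained at ρ = 1/2.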
Pseudocode for the generation of examples to train the classifier is provided in the Supplement. Inevitably, there is some cost to pay to construct ĥ, including time for feature construction, classifier training, and classifier use. However, learning only needs to be done once, while inference may be performed many times. Also, in many challenging problems g will not produce an ESS of any appreciable size, in our case because of a mismatch with f, and in MCMC because of mode-hopping difficulties. Our method adopts an approach complementary to particle methods to help tackle such problems, and we show its utility in the CTBN application.

Algorithm 1 Rejection-based SIS (ReBaSIS)
Input: conditional distributions {f_j} and {g_j}, evidence e; constants α, k; i = 1, z = {}, w = 1; classifiers {φ_j}
Output: sample z with weight w
1:  while i ≤ k do
2:    accept = false
3:    while not accept do
4:      Sample z_i ~ g_i, r ~ U[0, 1]
5:      a = (1/α) φ_i(z_i, z, e, f, g) / (1 − φ_i(z_i, z, e, f, g))
6:      if r < a then
7:        accept = true
8:      end if
9:    end while
10:   z = {z_i, z}, w = w · f_i(z_i) c_i[g_i, a] / (g_i(z_i) a(z_i))
11:   i = i + 1
12: end while
13: return (z, w)

4 Continuous-Time Bayesian Networks

We extend our analysis to a continuous-time model: the continuous-time Bayesian network (CTBN) (Nodelman, Shelton, and Koller 2002), which has applications, for example, in anomaly detection (Xu and Shelton 2010) and medicine (Weiss, Natarajan, and Page 2012). We review the CTBN model and sampling methods and extend our analysis. CTBNs are a model of discrete random variables X_1, X_2, ..., X_d = X over time. The model specifies a directed graph over X that determines the parents of each variable. The parents setting u_X is the joint state of the parents of X. Then, CTBNs assume that the probability of transition for variable X out of state x at time t is given by the exponential distribution q_{x|u} e^{-q_{x|u} t} with rate parameter (intensity) q_{x|u} for parents setting u.
The variable X transitions from state x to x′ with probability Θ_{xx′|u}, where Θ_{xx′|u} is an entry in a state transition matrix Θ_u. A complete CTBN model is described by two components: a distribution B over the initial joint state X, typically represented by a Bayesian network, and a directed graph over nodes representing X with corresponding conditional intensity matrices (CIMs). The CIMs hold the intensities q_{x|u} and state transition probability matrices Θ_u. The likelihood of a CTBN model given data is computed as follows. A trajectory is a sequence of intervals of fixed state. For each interval [t_0, t_1), the duration Δt = t_1 − t_0 passes, and a variable X transitions at t_1 from state x to x′. During the interval all other variables X_i ≠ X remain in their current states x_i. The interval likelihood is given by:

( q_{x|u} e^{-q_{x|u} Δt} ) ( Θ_{xx′|u} ) ∏_{x_i : X_i ≠ X} e^{-q_{x_i|u} Δt}    (6)

where the first factor corresponds to X transitioning, the second to the transition landing in state x′, and the product to the remaining variables X_i resting. Then, the sequence likelihood is given by the product of interval likelihoods:

∏_X ∏_{u ∈ U_X} ∏_x q_{x|u}^{M_{x|u}} e^{-q_{x|u} T_{x|u}} ∏_{x′ ≠ x} Θ_{xx′|u}^{M_{xx′|u}}

where the M_{x|u} (and M_{xx′|u}) are the numbers of transitions out of state x (to state x′), and where the T_{x|u} are the amounts of time spent in x given parents setting u.

4.1 Sampling in CTBNs

Evidence provided in data is typically incomplete, i.e., the joint state is partially or fully unobserved over time. Thus, inference is performed to probabilistically complete the unobserved regions. CTBNs are generative models and provide a sampling framework to complete such regions. Let a trajectory z be a sequence of (state, time) pairs (z_i = {x_{1i}, x_{2i}, ..., x_{di}}, t_i) for i ∈ {0, ..., k}, where x_{ji} is the jth CTBN variable at the ith time, such that the sequence of t_i are in [t_start, t_end). Given an initial state z_0 = {x_{10}, x_{20}, ..., x_{d0}}, transition times are sampled for each variable X according to q_{x|u} e^{-q_{x|u} t}, where x is the active state of X. The variable X_i that transitions in the interval is selected based on the shortest sampled transition time. The state to which X_i transitions is sampled from Θ_{x_i x_i′|u}. Then the transition times are resampled as necessary according to intensities q_{x|u}, noting that these intensities may be different because of potential changes in the parents setting u. The trajectory terminates when all sampled transition times exceed a specified ending time. Previous work by Fan et al. describes a framework for importance sampling, particle filtering, and particle smoothing in CTBNs (Fan, Xu, and Shelton 2010). By sampling from truncated exponentials, they incur importance sample downweights on their samples. Recall that Equation 6 is broken into three components.
The weights f_i/g_i are decomposed likewise: (1) a downweight for the variable x that transitions, according to the ratio of the exponential to the truncated exponential: (1 − e^{-q_{x|u} τ}); (2) a downweight corresponding to a lookahead point-estimate of Θ_u assuming no other variables change until the evidence (we leave this unmodified in our implementation); and (3) a downweight for each resting variable x_i given by the ratio of not sampling a transition in time Δt from an exponential and a truncated exponential: (1 − e^{-q_{x_i|u} τ_i}) / (1 − e^{-q_{x_i|u} (τ_i − Δt)}). Finally, the product of all interval downweights provides the trajectory importance sample weight.

4.2 Rejection-Based Importance Sampling in CTBNs

There is an equivalence between a fixed continuous-time trajectory and the discrete sequences described above. In particular, the rejection-based importance sampling method requires that the number of intervals k be fixed, while CTBNs produce trajectories with varying numbers of intervals. Nevertheless, for any set of trajectories, we can define ε-width intervals small enough that at most one transition occurs per interval and that such transitions occur at the end of the interval. Then for any set of trajectories over the duration [t_start, t_end), we set k = (t_end − t_start)/ε. Using the memoryless property of exponential distributions, the density of a single, one-transition interval is equal to the density of the product of a sequence of ε-width, zero-transition intervals and one ε-width, one-transition interval. In practice, it is simpler to use the CTBN sampling framework so that each interval is of appreciable size. We denote the evidence e as a sequence of tuples of type (state, start time, duration): (e_i, t_{i,0}, τ_i), allowing for point evidence with zero duration, τ_i = 0, and 1[e, z] checks whether z agrees with each e_i over its duration. Unlike discrete time, where we can enumerate the states z_i, in continuous time the calculation of w_g can be time-consuming because of the normalizing integrals c_i[g_i, â_i].
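The CTBN forward-sampling scheme of Section 4.1 (competing exponential transition times, with clocks refreshed after a parent changes) can be sketched for a toy two-variable binary CTBN. This is an illustrative sketch, not the paper's code: the graph (X1 is X2's parent) and the intensity values are invented.

```python
import random

# Intensities q_{x|u} for a toy model: X1 has no parents; X2's rate of
# leaving its state depends on its parent X1 (values are made up).
Q1 = {0: 1.0, 1: 1.0}
Q2 = {(0, 0): 0.2, (1, 0): 0.2,   # keyed by (x2, x1)
      (0, 1): 2.0, (1, 1): 2.0}

def forward_sample(t_end):
    """Forward-sample a trajectory of (time, x1, x2) tuples up to t_end."""
    x1, x2, t = 0, 0, 0.0
    traj = [(t, x1, x2)]
    while True:
        # Competing exponential clocks; the shortest time transitions first.
        dt1 = random.expovariate(Q1[x1])
        dt2 = random.expovariate(Q2[(x2, x1)])
        dt = min(dt1, dt2)
        if t + dt >= t_end:
            return traj
        t += dt
        if dt1 < dt2:
            x1 = 1 - x1   # binary states, so a transition flips the state
        else:
            x2 = 1 - x2
        traj.append((t, x1, x2))
        # Resampling both clocks on the next iteration is valid by the
        # memoryless property, and handles X2's rate change when X1 flips.

random.seed(2)
traj = forward_sample(10.0)
```

Resampling every clock each iteration is a simplification of "resampled as necessary" in the text; by memorylessness it draws from the same distribution while automatically picking up the new parents setting u.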
From Equation, we ave, omitting te conditioning: g i (z i ) i (z i ) = Ω i a i (ζ)g i (ζ)dζ. a i (z i ) wic can eiter be approximated by ( φ i ( ))/φ i ( ), or calculated piecewise. Te approximation metod assumes te learned acceptance probability produces a proper conditional probability distribution for eac situation. In our experiments, we adopt te approximation metod and compare te results to sow it does not introduce significant bias. We also compare using a restricted feature set were an exact, piecewise computation is possible. Several properties of CTBNs make our approac appealing. First, CTBNs possess te Markov property; namely, te next state is independent of previous states given te current one. Second, CTBNs are omogeneous processes, so te model rate parameters are sared across intervals. We leverage tese facts wen learning eac acceptance probability a i. Te Markov property simplifies te learned probability of acceptance φ i (z i z (i ), e, f, g) to φ i (z i z (i ), e, f, g). Homogeneity simplifies te learning process because, if z (i ) = z (j ) and t i,0 = t j,0 for j i, ten φ i (z i z (i ), e, f, g) = φ i (z j z (j ), e, f, g). Te degeneracy of tese two cases indicates tat te probability of acceptance is situation-dependent and interval indexindependent, so a single classifier can be learned in place of k classifiers. 5 Experiments We compare our learning-based rejection metod (setting α = 2) wit te Fan et al. sampler (200). We learn a logistic regression (LR) model using online, stocastic gradient descent for eac CTBN variable state. An LR data example is (y, x, w), were y is one of {accept, reject}, x is a set of features, and w is te sequence completion weigt. For our experiments we use per-variable features encoding te state (as indicator variables), te time from te current time to next evidence, te time from te proposed sample to next evidence, and te time from te proposed sample to next matcing evidence. 
Except in the restricted feature set, the times are mapped to intervals [0, 1] by using a 1 − e^{-t/λ} transformation, with λ ∈ {10^-2, 10^-1, 10^0, 10^1, 10^2} to capture effects at different time scales. The restricted feature set uses linear times truncated at a maximum time of 10. This allows for piecewise calculation of the normalizing functional. Because of the high variance in weights for samples of full sequence completions, we instead choose a local weight: the weight of the sequence through the next m = 10 evidence times. This biases the learner to have low-variance weights within the local window, but it does not bias the proposal.
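The multi-scale time mapping is simple to write down. A minimal sketch (the function name is ours; the λ values follow the text):

```python
import math

# Time-to-evidence features mapped into [0, 1] with 1 - exp(-t/lam) at
# several scales, following the lambda set used in the experiments.
LAMBDAS = [1e-2, 1e-1, 1e0, 1e1, 1e2]

def time_features(t):
    """Map a nonnegative time-to-evidence t to five bounded features."""
    return [1.0 - math.exp(-t / lam) for lam in LAMBDAS]

feats = time_features(1.0)
# Small lambdas saturate toward 1 quickly; large lambdas stay near 0,
# so together the features resolve effects at different time scales.
```

Each feature is bounded in [0, 1], which keeps the logistic regression inputs well-scaled regardless of how far away the next evidence lies.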
Figure 2: Approximate transition densities (top) of f* (target), f (target without evidence), g (surrogate), and ĥ (learned rejection surrogate) in a one-variable, binary-state CTBN with uniform transition rates of 0.1 and matching evidence at t = 5. The learned distribution ĥ closely mimics f*, the target distribution with evidence, while g was constructed to mimic f (exactly, in this situation). All methods recover the weighted transition densities (bottom); for 20 evidence points with one at t = 5 and 19 after t = 20, ĥ recovers the target distribution more precisely than g per 10^6 samples.

We analyze the performance of the rejection-based sampler by inspection of learned transition probability densities and the effective sample size (ESS) (Kong, Liu, and Wong 1994). ESS is an indicator of the quality of the samples, and a larger value is better: ESS = 1 / Σ_{i=1}^n (W^i)^2, where W^i = w^i / Σ_{j=1}^n w^j. We test our method in several models: one-, two-, and three-variable, strong-cycle, binary-state CTBNs, and the part-binary, 8-variable drug model presented in the original CTBN paper (Nodelman, Shelton, and Koller 2002). The strong-cycle models encode an intensity path for particular joint states by shifting bit registers and filling in an empty register with a 0 or 1, as in the following example. For the 3-variable strong cycle, intensities involved in the path are 1, and intensities are 0.1 for all other transitions. We generate sequences from ground truth models and censor each to retain only 100 point evidences with times t_i drawn randomly, uniformly over the duration [0, 20). The code is provided at http://cs.wisc.edu/~jcweiss/aaai15.

Figure 2 (top) illustrates the ability of ĥ to mimic f*, the target distribution, in a one-node binary-state CTBN with matching evidence at t = 5. The Fan et al. proposal g matches the target density in the absence of evidence, f. However, when approaching evidence (at t = 5), the transition probability given evidence goes to 0, as the next transition must also occur before t = 5 to be a viable sequence. Only f* and ĥ exhibit this behavior. Figure 2 (bottom) shows the density approximations after weighting the samples, given a trajectory with evidence at t = 5 and 19 evidence points after t = 20. Each method recovers the target distribution, but ĥ does so more precisely than g, given a fixed number of samples (one million). The proposal acceptance rate for ĥ was measured to be 45 percent. The proposal based on the restricted feature set for unbiased inference is provided in the Supplement.

Table 1: Geometric mean of effective sample size (ESS) over 100 sequences, each with 100 observations; ESS is per 10^5 samples. The proposal ĥ was learned with 1000 sequences.

Model               Fan et al. (g)   Rejection SIS (ĥ)
Strong cycle, n=1         …                 …
Strong cycle, n=2         …                 …
Strong cycle, n=3         …                 …
Drug                      …                 …

Table 1 shows that the learned, rejection-based proposal ĥ outperforms the other CTBN importance sampler g across all 4 models, resulting in an ESS 2 to 10 times larger. Generally, as the number of variables increases, the ESS decreases because of the increasing mismatch between f and g. With an average ESS of only 29 in an 8-variable model, as we increase model sizes, we expect that g would fail to produce a meaningful sample distribution more quickly than would ĥ.

Figure 3 shows that the weight distribution from ĥ is narrower than that from g on the log scale, using the drug model and 10 evidence points. The interpretation is that a larger fraction of examples from ĥ contribute substantially to the total weight, resulting in a lower-variance sample distribution. For example, any sample with weight below e^{-10} has negligible relative weight and does not substantially affect the sample distribution. There are many fewer such samples generated from ĥ than from g.

Figure 3: Distribution of log weights. For sample completions of a trajectory with 10 evidence points in the drug model, the distribution of log weights using ĥ is much narrower than the distribution of log weights using g (bottom).
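The ESS formula used in the evaluation is direct to compute from raw weights; a minimal sketch:

```python
# Effective sample size from importance weights (Kong, Liu, and Wong 1994):
# ESS = 1 / sum_i (W_i)^2, with normalized weights W_i = w_i / sum_j w_j.
def effective_sample_size(weights):
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)

# Uniform weights give ESS = n; one dominant weight drives ESS toward 1,
# which is the low-effective-sample regime that motivates ReBaSIS.
ess_uniform = effective_sample_size([1.0] * 100)
ess_skewed = effective_sample_size([100.0] + [1e-6] * 99)
```

The skewed case mirrors the heavy-tailed weight distributions of Figure 3: a single dominant sample makes the other 99 contribute almost nothing.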
6 Conclusion

Our work has demonstrated that machine learning can be used to improve sequential importance sampling via a rejection sampling framework. We showed that the proposal and target distributions are related by an expected weight ratio, and that the weight ratio can be estimated by the probabilistic output of a binary classifier learned from weighted, local importance samples. We extended the algorithm to CTBNs, where we found experimentally that using our learning algorithm produces a sampling distribution closer to the target and generates more effective samples. Continued investigations are warranted, including the use of non-parametric learning algorithms, a procedure for expanding local approximations, and extensions to other graphical models, where conjugacy of the acceptance probability with the proposal distribution could lead to improved performance.

7 Acknowledgments

We gratefully acknowledge the CIBM Training Program grant 5T15LM007359, the NIGMS grant R0GM09768, the NLM grant R0LM0028, and the NSF grant IIS.

A Appendix

To derive Equation 4, we relate the target densities $f_i(z_i \mid z_{(i-1)})$ with the proposal densities $g_i(z_i \mid z_{(i-1)})$ via the (standard) derivation of Equation 3 using Bayes' theorem, the law of total probability, and substitution:

$$
\begin{aligned}
f_i(z_i \mid z_{(i-1)})
&= \frac{p(e \mid z_{(i)})}{p(e \mid z_{(i-1)})}\, p(z_i \mid z_{(i-1)}) \\
&= \frac{\sum_{z_{i+1}^{k}} p(e \mid z_{i+1}^{k}, z_{(i)})\, p(z_{i+1}^{k} \mid z_{(i)})}{\sum_{z_{i}^{k}} p(e \mid z_{i}^{k}, z_{(i-1)})\, p(z_{i}^{k} \mid z_{(i-1)})}\, p(z_i \mid z_{(i-1)}) \\
&= \frac{\sum_{z_{i+1}^{k}} \big[\prod_{l} \mathbb{1}\{z_l = e_l\}\big]\, p(z_{i+1}^{k} \mid z_{(i)})}{\sum_{z_{i}^{k}} \big[\prod_{l} \mathbb{1}\{z_l = e_l\}\big]\, p(z_{i}^{k} \mid z_{(i-1)})}\, p(z_i \mid z_{(i-1)}) \\
&= \frac{\mathbb{E}_g\!\left[[e, z] \prod_{j=i+1}^{k} w_j(z_j \mid z_{(j-1)}) \,\middle|\, z_{(i)}\right]}{\mathbb{E}_g\!\left[[e, z] \prod_{j=i}^{k} w_j(z_j \mid z_{(j-1)}) \,\middle|\, z_{(i-1)}\right]}\, p(z_i \mid z_{(i-1)}) \\
&= \frac{\mathbb{E}_g\!\left[[e, z] \prod_{j=i}^{k} w_j(z_j \mid z_{(j-1)}) \,\middle|\, z_{(i)}\right]}{\mathbb{E}_g\!\left[[e, z] \prod_{j=i}^{k} w_j(z_j \mid z_{(j-1)}) \,\middle|\, z_{(i-1)}\right]}\, g_i(z_i \mid z_{(i-1)}) \\
&= \frac{\mathbb{E}_g[w^{a}]}{\mathbb{E}_g[w^{r}]}\, g_i(z_i \mid z_{(i-1)}),
\end{aligned}
$$

where $[e, z] = \prod_{l} \mathbb{1}\{z_l = e_l\}$ indicates agreement of the completed sequence $z$ with the evidence $e$, the weights are $w_j(z_j \mid z_{(j-1)}) = p(z_j \mid z_{(j-1)}) / g_j(z_j \mid z_{(j-1)})$, the last substitution uses $p(z_i \mid z_{(i-1)}) = w_i(z_i \mid z_{(i-1)})\, g_i(z_i \mid z_{(i-1)})$, and $w^{a}$ and $w^{r}$ denote the completion weights of accepted and rejected sequences through step $k$.

References

Andrieu, C.; Doucet, A.; and Holenstein, R. 2010. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(3).

Cheng, J., and Druzdzel, M. J. 2000. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research 13(1).

Chopin, N.; Jacob, P.; Papaspiliopoulos, O.; et al. SMC²: A sequential Monte Carlo algorithm with particle Markov chain Monte Carlo updates. J. R. Stat. Soc. B (2012, to appear).

Cornebise, J.; Moulines, E.; and Olsson, J. Adaptive methods for sequential importance sampling with application to state space models. Statistics and Computing 18(4).

Doucet, A.; Godsill, S.; and Andrieu, C. 2000. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3).

Fan, Y.; Xu, J.; and Shelton, C. R. Importance sampling for continuous time Bayesian networks. The Journal of Machine Learning Research 11.

Kong, A.; Liu, J. S.; and Wong, W. H. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89(425).

Liu, J. S.; Chen, R.; and Wong, W. H. Rejection control and sequential importance sampling. Journal of the American Statistical Association 93(443).

Montemerlo, M.; Thrun, S.; Koller, D.; and Wegbreit, B. 2003. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In International Joint Conference on Artificial Intelligence, volume 18. Lawrence Erlbaum Associates Ltd.

Nodelman, U.; Shelton, C.; and Koller, D. Continuous time Bayesian networks. In Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.

Rao, V., and Teh, Y. W. 2011. Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In Uncertainty in Artificial Intelligence.

Sturlaugson, L., and Sheppard, J. W. Inference complexity in continuous time Bayesian networks. In Uncertainty in Artificial Intelligence.

Weiss, J.; Natarajan, S.; and Page, D. Multiplicative forests for continuous-time processes. In Advances in Neural Information Processing Systems 25.

Wolfel, M., and Faubel, F. 2007. Considering uncertainty by particle filter enhanced speech features in large vocabulary continuous speech recognition. In Acoustics, Speech and Signal Processing, ICASSP 2007, IEEE International Conference on, volume 4. IEEE.

Xu, J., and Shelton, C. R. Intrusion detection using continuous time Bayesian networks. Journal of Artificial Intelligence Research 39(1).

Yuan, C., and Druzdzel, M. J. An importance sampling algorithm based on evidence pre-propagation. In Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.

Yuan, C., and Druzdzel, M. J. 2007a. Importance sampling for general hybrid Bayesian networks. In Artificial Intelligence and Statistics.

Yuan, C., and Druzdzel, M. J. 2007b. Improving importance sampling by adaptive split-rejection control in Bayesian networks. In Advances in Artificial Intelligence. Springer.
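The accept/reject relationship derived in the appendix can be illustrated with a minimal sketch. This is not the paper's CTBN implementation: the toy model, its probabilities, and all function names are invented for illustration. The idea it demonstrates is the core one: set the per-variable acceptance probability proportional to an estimated expected completion weight (here, the probability that forward sampling reaches the evidence), and rejection control then reshapes the proposal toward the target posterior.

```python
import random

random.seed(0)

# Toy model (hypothetical): z1 ~ Bernoulli(0.5),
# z2 | z1 ~ Bernoulli(0.9 if z1 == 1 else 0.2).  Evidence: z2 = 1.
# Target distribution: p(z1 | z2 = 1).
P_Z1 = 0.5
P_Z2_GIVEN = {1: 0.9, 0: 0.2}

def completion_weight(z1, n=2000):
    # Estimate E_g[[e, z] | z1] from local forward samples: the fraction
    # of completions that agree with the evidence z2 = 1.
    hits = sum(random.random() < P_Z2_GIVEN[z1] for _ in range(n))
    return hits / n

# Acceptance probabilities proportional to expected completion weights,
# normalized so the largest is 1 (a valid rejection-control scheme).
w = {v: completion_weight(v) for v in (0, 1)}
accept = {v: w[v] / max(w.values()) for v in (0, 1)}

def sample_z1():
    # Rejection-controlled proposal: propose from the prior, then accept
    # with probability proportional to the anticipated evidence weight.
    while True:
        v = 1 if random.random() < P_Z1 else 0
        if random.random() < accept[v]:
            return v

draws = [sample_z1() for _ in range(20000)]
est = sum(draws) / len(draws)
true_post = 0.9 * 0.5 / (0.9 * 0.5 + 0.2 * 0.5)  # exact p(z1=1 | z2=1)
print(est, true_post)
```

Because acceptance is proportional to the expected completion weight, the accepted samples are distributed (up to Monte Carlo error) according to the posterior given the upcoming evidence, which is exactly the target-to-proposal relationship the appendix derives.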