Bayesian analysis of massive datasets via particle filters


Greg Ridgeway, RAND, PO Box 2138, Santa Monica, CA
David Madigan, Department of Statistics, 477 Hill Center, Rutgers University, Piscataway, NJ

ABSTRACT

Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need to produce statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating their candidacy as feasible data mining techniques. In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the rejuvenation step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses. To show proof-of-concept, we demonstrate the method on a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while offering a 95% reduction in data accesses.

1. INTRODUCTION

The need for rigorous statistical analysis has not gone unnoticed in the data mining community. Statistical concepts such as latent variables, spurious correlation, and problems involving model search and selection have appeared in widely noted data mining literature [6, 12]. However, algorithms and model fitting methods that actually work on massive datasets have been slow to appear. Bayesian analysis is a widely accepted paradigm for estimating unknown parameters from data. In applications with small to medium size datasets, Bayesian methods have found great success in statistical practice. In particular, applied statistical work has seen a surge in the use of Bayesian hierarchical models for modeling multilevel or relational data [3] in a variety of fields including health and education. Spatial models in agriculture, image analysis, and remote sensing often utilize Bayesian methods and invariably require heavy computation (see [1] for an overview). As shown in [7], kernel methods also have a convenient Bayesian formulation, producing posterior distributions over a general class of prediction models. The power of Bayesian analysis comes from the transparent inclusion of prior knowledge, a more natural probabilistic interpretation of parameter estimates, and greater flexibility in model specification.
While Bayesian models and the computational tools behind them have revolutionized the field, they continue to rely on algorithms that perform thousands, even millions, of laps through the dataset in order to produce estimates of the posterior distribution of the model parameters. For massive datasets, Bayesian methods still begin with a "load data into memory" step, make compromising assumptions, or resort to subsampling to skirt the issue. If the most severe penalty comes when requesting data, algorithms might exist that only use a small, manageable portion of the dataset at any one time. This paper proposes such an algorithm. It performs a rigorous Bayesian computation on a small, manageable portion of the dataset and adapts those calculations with the remaining observations. The adaptation attempts to minimize the number of times the algorithm loads each observation into memory.

2. TECHNIQUES FOR BAYESIAN COMPUTATION

Except for the simplest of models and regardless of the style of inference, estimation algorithms almost always require repeated scans of the dataset. We know that for well-behaved likelihoods and priors, the posterior distribution converges to a multivariate normal [4, 15]. For large but finite samples this approximation works rather well on marginal distributions and lower dimensional conditional distributions but does not always provide an accurate approximation to the full joint distribution [8]. The normal approximation also assumes that one has the maximum likelihood estimate for the parameter and the observed or expected information matrix.

Even normal posterior approximations and maximum likelihood calculations can require heavy computation. Newton-Raphson type algorithms for maximum likelihood estimation require several scans of the dataset, at least one for each iteration. When some observations also have missing data, the algorithms (EM, for example) likely will demand even more scans. For some models, dataset sizes, and applications these approximation methods may work and be preferable to a full Bayesian analysis. This will not always be the case and so the need exists for improved techniques to learn accurately from massive datasets.

Summaries of results from Bayesian data analyses often are in the form of expectations such as the marginal mean, variance, and covariance of the parameters of interest. We compute the expected value of the quantity of interest, h(θ), using

    E(h(θ) | x_1, ..., x_N) = ∫ h(θ) f(θ | x_1, ..., x_N) dθ    (1)

where f(θ | x) is the posterior distribution of the parameters given the observed data. Computation of these expectations requires calculating integrals that, for all but the simplest examples, are difficult to compute in closed form. Monte Carlo integration methods sample from the posterior, f(θ | x), and appeal to the law of large numbers to estimate the integrals,

    lim_{M→∞} (1/M) Σ_{i=1}^{M} h(θ_i) = ∫ h(θ) f(θ | x_1, ..., x_N) dθ    (2)

where the θ_i compose a sample from f(θ | x). The ability to compute these expectations efficiently is equivalent to being able to sample efficiently from f(θ | x). Sampling schemes are often difficult enough without the burden of large datasets. The additional complexity of massive datasets usually causes each iteration of the Monte Carlo sampler to be slower. When the number of iterations already needs to be large, efficient procedures within each iteration are essential to timely delivery of results.

2.1 Importance sampling

Importance sampling is a general Monte Carlo method for computing integrals. As previously mentioned, Monte Carlo methods approximate integrals of the form (1). The approximation in (2) depends on the ability to sample from f(θ | x). When a sampling mechanism is not readily available for the target distribution, f(θ | x), but one is available for another sampling distribution, g(θ), we can use importance sampling. Note that for (1) we can write

    ∫ h(θ) f(θ | x_1, ..., x_N) dθ = ∫ h(θ) [f(θ | x) / g(θ)] g(θ) dθ    (3)
                                   = lim_{M→∞} (1/M) Σ_{i=1}^{M} w_i h(θ_i)    (4)

where θ_i is a draw from g(θ) and w_i = f(θ_i | x) / g(θ_i). Note that the expected value of w_i under g(θ) is 1. Therefore, if we are able to compute the importance sampling weights, w_i, only up to a constant of proportionality, we can normalize the weights to compute the integral.

    ∫ h(θ) f(θ | x_1, ..., x_N) dθ = lim_{M→∞} [Σ_{i=1}^{M} w_i h(θ_i)] / [Σ_{i=1}^{M} w_i]    (5)

Naturally, in order for the sampling distribution to be useful, drawing from g(θ) must be easy. We also want our sampling distribution to be such that the limit converges quickly to the value of the integral. If the tails of g(θ) decay faster than f(θ | x) the weights will be numerically unstable. If the tails of g(θ) decay much more slowly than f(θ | x) we will frequently sample from regions where the weight will be close to zero, wasting computation time. Second to sampling directly from f(θ | x), we would like a sampling distribution slightly fatter than f(θ | x). In section 2.3 we show that when we set the sampling density to be f(θ | x_1, ..., x_n), where n ≪ N so that we condition on a manageable subset of the entire dataset, the importance weights for each sample θ_i require only one sequential scan of the remaining observations.
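To make (3)-(5) concrete, the following is a minimal Python sketch of self-normalized importance sampling; the particular target and sampling densities are illustrative stand-ins rather than anything from the analysis above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy target f(theta|x): a N(1, 0.5^2) "posterior"; sampling distribution g: N(0, 1).
    # Both densities are placeholders; only their ratio enters the weights.
    def log_f(theta):
        return -0.5 * ((theta - 1.0) / 0.5) ** 2 - np.log(0.5 * np.sqrt(2 * np.pi))

    def log_g(theta):
        return -0.5 * theta ** 2 - np.log(np.sqrt(2 * np.pi))

    M = 10000
    theta = rng.normal(0.0, 1.0, size=M)     # draws from g
    log_w = log_f(theta) - log_g(theta)      # log importance weights
    w = np.exp(log_w - log_w.max())          # stabilize; normalization happens below

    # Self-normalized estimate of E[h(theta)|x], equation (5), with h(theta) = theta
    estimate = np.sum(w * theta) / np.sum(w)
    print(estimate)                          # close to the target mean of 1.0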
Before beginning that discussion, the next section introduces the most popular computational method for Bayesian analysis of complex models.

2.2 Markov chain Monte Carlo

While a large group of statisticians have long felt that Bayesian analysis is appropriate for a wide class of problems, practical estimation methods were not available until Markov chain Monte Carlo (MCMC) techniques became available. Importance sampling is a useful tool, but for complex models crafting a reasonable sampling distribution can be extremely difficult. The excellent collection [11] contains a more detailed introduction to MCMC along with a variety of interesting examples and applications. As with importance sampling, the goal is to generate a set of draws from the posterior distribution f(θ | x). Rather than create independent draws and reweight, MCMC methods build a Markov chain, a sample of dependent draws, θ_1, ..., θ_M, that have stationary distribution f(θ | x). It turns out that it is often easy to create such a Markov chain with a few basic strategies. However, there is still a bit of art involved in creating an efficient chain and assessing the chain's convergence.

Figure 1 shows the Metropolis-Hastings algorithm [13, 17], a very general MCMC algorithm. Assume that we have a single draw θ_1 from f(θ | x) and a proposal distribution for a new draw, q(θ | θ_1). If we follow step 2 of the MCMC algorithm then the distribution of θ_2 will also be f(θ | x). This is one of the key properties of the algorithm. Iterating this algorithm we will obtain a sequence θ_1, ..., θ_M that has f(θ | x) as its stationary distribution.

MCMC methods have two main advantages that make them so useful for Bayesian analysis. First, we can choose q's from which sampling is easy. Any q that does not deterministically propose values and is capable of eventually visiting any value of θ will make the algorithm sample from the desired distribution. Special choices for q, which may or may not depend on the data, simplify the algorithm. If q is symmetric, for example a Gaussian centered on θ_{i−1}, then the proposal distributions cancel out entirely in (6). If we choose a q that proposes values that are very close to θ_{i−1} then it will almost always accept the proposal, but the chain will move very slowly and take a long time to converge to the stationary distribution. If q proposes new draws that are far from θ_{i−1} and outside the region with most of the posterior mass, the proposals will almost always be rejected and again the chain will converge slowly. With a little tuning the proposal distribution can usually be adjusted so that proposals are not rejected or accepted too frequently.

The second advantage is that there is no need to compute the normalization constant of f(θ | x) since it cancels out in (6).

The Gibbs sampler [9] is a special case of the Metropolis-Hastings algorithm and is especially popular. If θ is a multidimensional parameter, the Gibbs sampler sequentially updates each of the components of θ from the full conditional distribution of that component given fixed values of all the other components and the data. For many models used in common practice, even the ones that yield a complex posterior distribution, sampling from the posterior's full conditionals is often a relatively simple task. Conveniently, the acceptance probability (6) always equals 1 and yet the chains often converge relatively quickly. The example in section 4 utilizes a Gibbs sampler and goes into further detail of the example's full conditionals.

Figure 1: The Metropolis-Hastings algorithm

    1. Initialize the parameter θ_1
    2. For i in 2, ..., M do (step (a) and/or (b) requires a scan of the dataset)
       (a) Draw a proposal θ* from q(θ | θ_{i−1})
       (b) Loop through the dataset to compute the acceptance probability
           α(θ*, θ_{i−1}) = min( 1, [f(θ* | x) q(θ_{i−1} | θ*)] / [f(θ_{i−1} | x) q(θ* | θ_{i−1})] )    (6)
       (c) With probability α(θ*, θ_{i−1}) set θ_i = θ*. Otherwise set θ_i = θ_{i−1}

MCMC as specified, however, is computationally infeasible for massive datasets. Except for the most trivial examples, computing the acceptance probability (6) requires a complete scan of the dataset. Although the Gibbs sampler avoids the acceptance probability calculation, precalculations for simulating from the full conditionals of f(θ | x) require a full scan of the dataset, sometimes a full scan for each component! Since MCMC algorithms produce dependent draws from the posterior, M usually has to be very large to reduce the amount of Monte Carlo variation in the posterior estimates. While MCMC makes fully Bayesian analysis practical, it seems dead on arrival for massive dataset applications.

Although this section has not given great detail about the MCMC methods, the important ideas for the purpose of this paper are that

1. MCMC methods make Bayesian analysis practical,
2. MCMC often requires an enormous number of laps through the dataset, and
3. given a θ drawn from f(θ | x) we can use MCMC to draw another value, θ*, from the same distribution.

The last point will be the key to implementing a particle filter solution that allows us to apply MCMC methods to massive datasets. We will use this technique to switch the inner and outer loops in figure 1. The scan of the dataset will become the outer loop and the scan of the draws from f(θ | x) will become the inner loop.
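A minimal random-walk Metropolis sketch of the recipe in figure 1; the unnormalized log posterior here is a placeholder for whatever model is being fit, and because the Gaussian proposal is symmetric the q terms in (6) cancel.

    import numpy as np

    rng = np.random.default_rng(1)

    def log_post(theta, data):
        # Placeholder unnormalized log posterior: normal likelihood with unit
        # variance and a flat prior, so log f(theta|x) = -0.5 * n * (theta - mean)^2 + const.
        return -0.5 * len(data) * (theta - data.mean()) ** 2

    def metropolis_hastings(data, M=5000, step=0.5, theta0=0.0):
        draws = np.empty(M)
        theta = theta0
        lp = log_post(theta, data)
        for i in range(M):
            prop = theta + rng.normal(0.0, step)      # symmetric proposal q
            lp_prop = log_post(prop, data)            # this evaluation is the pass over the data
            if np.log(rng.uniform()) < lp_prop - lp:  # acceptance probability (6)
                theta, lp = prop, lp_prop
            draws[i] = theta
        return draws

    data = rng.normal(2.0, 1.0, size=1000)
    print(metropolis_hastings(data).mean())           # roughly the posterior mean, near 2.0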
2.3 Importance sampling for analysis of massive datasets

So far we have two tools, MCMC and importance sampling, to draw samples from an arbitrary posterior distribution. In this section we discuss a particular form of importance sampling that will help perform Bayesian analysis for massive datasets. Ideally we would like to sample efficiently and take advantage of all the information available in the dataset. A factorization of the integrand of the right hand side of (3) shows that this is possible when the observations, x_i, are independent given the parameters, θ. Such conditional independence is often satisfied, as in the class of hierarchical models, even when the observations are marginally dependent.

Let D_1 and D_2 be a partition of the dataset so that every observation is in either D_1 or D_2. As noted for (1) we would like to sample from the posterior conditioned on all of the data, f(θ | D_1, D_2). Since sampling from f(θ | D_1, D_2) is difficult due to the size of the dataset, we consider setting g(θ) = f(θ | D_1) for use as our sampling distribution and using importance sampling to adjust the draws. If θ_i, i = 1, ..., M, are draws from f(θ | D_1) then we can estimate the posterior expectation (1) as

    Ê(h(θ) | D_1, D_2) = [Σ_{i=1}^{M} w_i h(θ_i)] / [Σ_{i=1}^{M} w_i]    (7)

where the w_i's are the importance sampling weights

    w_i = f(θ_i | D_1, D_2) / f(θ_i | D_1).    (8)

Although these weights still involve f(θ_i | D_1, D_2), they greatly simplify.

    w_i = [f(D_1, D_2 | θ_i) f(θ_i) / f(D_1, D_2)] × [f(D_1) / (f(D_1 | θ_i) f(θ_i))]    (9)
        = [f(D_1 | θ_i) f(D_2 | θ_i) f(D_1)] / [f(D_1 | θ_i) f(D_1, D_2)] = f(D_2 | θ_i) / f(D_2 | D_1)    (10)
        ∝ f(D_2 | θ_i) = ∏_{x_j ∈ D_2} f(x_j | θ_i)    (11)

Line (9) follows from applying Bayes' theorem to the numerator and denominator. Equation (10) follows from (9) since the observations in the dataset partition D_1 are conditionally independent from those in D_2 given θ. Conveniently, (11) is just the likelihood of the observations in D_2 evaluated at the sampled value of θ. Figure 2 summarizes this result as an algorithm. The algorithm maintains the weights on the log scale for numerical stability.

So rather than sample from the posterior conditioned on all of the data, D_1 and D_2, which slows the sampling procedure, we need only to sample from the posterior conditioned on D_1. The remaining data, D_2, simply adjust the sampled parameter values by reweighting. The for loops in step 5 of figure 2 are interchangeable. The trick here is to have the inner loop scan through the draws so that the outer loop only needs to scan D_2 once to update the weights.

Although the same computations take place, in practice physically scanning a massive dataset is far more expensive than scanning a parameter list. However, massive models as well as massive datasets exist, and in those cases scanning the dataset may be cheaper than scanning the sampled parameter vectors. We will continue to assume that scanning the dataset is the main impediment to the data analysis. We certainly can sample from f(θ | D_1) more efficiently than from f(θ | D_1, D_2) since simulating from f(θ | D_1) will require a scan of a much smaller portion of the dataset. We also assume that, for a given value of θ, the likelihood is readily computable up to a constant, which is almost always the case.

When some data are missing, the processing of an observation in D_2 will require integrating out the missing information. Since the algorithm handles each observation case by case, computing the observed likelihood as an importance weight will be much more efficient than if it was embedded and repeatedly computed in a Metropolis-Hastings rejection probability computation. Placing observations with missing values in D_2 greatly reduces the number of times this integration step needs to occur, easing likelihood computations.

Figure 2: Importance sampling for massive datasets

    1. Load as much data into memory as possible to form D_1, taking into account space requirements for the Monte Carlo algorithm
    2. Draw M times from f(θ | D_1) via Monte Carlo or Markov chain Monte Carlo
    3. Purge the memory of D_1
    4. Create a vector of length M to store the logarithm of the weights and initialize them to 0
    5. Iterate through the remaining observations. For each observation, x_j, update the log-weights on all of the draws from f(θ | D_1)
       for x_j in the partition D_2 do {
           for i in 1, ..., M do {
               log w_i ← log w_i + log f(x_j | θ_i)
           }
       }
    6. Rescale to compute the weights: w_i ← exp(log w_i − max(log w_i))
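The steps in figure 2 translate almost line for line into code. The sketch below assumes a user-supplied log-likelihood function and a set of draws already obtained from f(θ | D_1); both are placeholders rather than anything prescribed above.

    import numpy as np

    def reweight_by_scan(draws, d2_stream, loglik):
        """Steps 4-6 of figure 2: one sequential scan of D2 updates every draw's log-weight.

        draws      : M parameter values sampled from f(theta | D1)
        d2_stream  : an iterable over the observations in D2 (read once, in order)
        loglik     : function (x, theta) -> log f(x | theta)
        """
        log_w = np.zeros(len(draws))                  # step 4
        for x in d2_stream:                           # step 5: outer loop over D2
            for i, theta in enumerate(draws):         #         inner loop over the draws
                log_w[i] += loglik(x, theta)
        return np.exp(log_w - log_w.max())            # step 6: rescale for stability

    # Example use with a unit-variance normal likelihood (purely illustrative):
    rng = np.random.default_rng(2)
    draws = rng.normal(0.0, 0.1, size=200)            # pretend draws from f(theta | D1)
    d2 = rng.normal(0.05, 1.0, size=5000)             # remaining observations
    w = reweight_by_scan(draws, d2, lambda x, t: -0.5 * (x - t) ** 2)
    posterior_mean = np.sum(w * draws) / np.sum(w)    # equation (7)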
2.4 Efficiency and the effective sample size

The algorithm shown in figure 2 does have some drawbacks. While it makes great gains in reducing the number of times the data need to be accessed, the Monte Carlo variance of the importance sampling estimates grows quickly. The problem is easily demonstrated graphically as shown in figure 3. The wide histogram represents the sampling distribution f(θ | D_1) that generates the initial posterior draws. However, the target distribution, f(θ | D_1, D_2), shown as the density plot, is shifted and narrower. About half of the draws from f(θ | D_1) will be wasted. Those that come from the right half will have importance weight near zero.

Figure 3: Comparison of f(θ | D_1, D_2) and f(θ | D_1)

Since all of the terms are positive in the familiar variance relationship

    Var(θ | D_1) = E(Var(θ | D_1, D_2)) + Var(E(θ | D_1, D_2)),    (13)

the posterior variance with the additional observations in D_2 on average will be smaller than the posterior variance conditioned only on D_1. The addition of D_2 can increase the variance (see [22] for an example) but usually D_2 is large enough so that the averaging effect dominates. Therefore, although the location of the sampling distribution should be close to the target distribution, its spread will most likely be wider than that of the target. As additional observations become available, f(θ | D_1, D_2) becomes much narrower than f(θ | D_1). The result of this narrowing is that the weights of many of the original draws from the sampling distribution approach 0 and so we have few effective draws from the target density.

As in [14], the effective sample size (ESS) is the number of observations from a simple random sample needed to obtain an estimate with Monte Carlo variation equal to the Monte Carlo variation obtained with the M weighted draws of θ_i.

    Var( (1/ESS) Σ_{i=1}^{ESS} θ_i ) = Var( [Σ_{i=1}^{M} w_i θ_i] / [Σ_{i=1}^{M} w_i] )    (14)

    (1/ESS) Var(θ) = Var(θ) [Σ_{i=1}^{M} w_i²] / [Σ_{i=1}^{M} w_i]²    (15)

Therefore,

    ESS = (Σ w_i)² / Σ w_i²    (16)
        = M / (1 + Var(w)).    (17)

Hölder's inequality implies that ESS is always less than or equal to M. With a little algebra, the ESS is also expressible in terms of the sample variance of the w_i as shown in (17), which facilitates the study of its properties in Theorem 1. If the θ_i are dependent from the start, as will be the case for MCMC draws, the effective sample size will further decrease in addition to reductions due to unequal importance weights. When the MCMC algorithm mixes well so that the set of θ_i are not too dependent, this is not too much of a problem.
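Equations (16)-(17) are simple to compute from the weights themselves; a small sketch:

    import numpy as np

    def effective_sample_size(w):
        """ESS = (sum w)^2 / sum(w^2), equation (16); equals M when all weights are equal."""
        w = np.asarray(w, dtype=float)
        return w.sum() ** 2 / np.sum(w ** 2)

    print(effective_sample_size(np.ones(1000)))             # 1000.0: no weight degeneracy
    print(effective_sample_size([1.0] * 10 + [0.0] * 990))  # 10.0: weight piled on 10 draws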

Figure 4 shows the decay of the effective sample size for a simulated example. The data come from a three-dimensional Gaussian with mean 0 and covariance equal to the identity matrix. The posterior therefore concerns the three mean and the six covariance parameters. We sample M = 1000 times from the posterior conditioned on n = 100 observations. After 300 additional data points the ESS has dropped to 10, a 99% loss in estimation efficiency from the initial Monte Carlo sample of M = 1000. At this point 65 of the initial 1000 draws account for 99% of the total weight. Figure 4 also overlays the ESS curve assuming a known covariance and the expected ESS curve derived next.

Figure 4: The reduction in effective sample size with the addition of 1,000 observations. The top jagged curve assumes a known covariance while the bottom jagged curve also estimates the covariance. The smooth curve is the expected ESS with a known covariance.

The following theorem concerning the variance of the importance sampling weights can help us gauge the effect of these problems in practice. The theorem assumes that we observe a finite set of multivariate normal data, x_i. As before we will partition the x_i's into two groups, D_1 and D_2. To get accurate estimates of the mean, µ, we will be concerned about the variance of the importance sampling weights, φ(µ | D_1, D_2, Σ)/φ(µ | D_1, Σ), where φ(·) is the normal density function. The theorem gives the variance of these importance sampling weights averaged over all possible datasets with a flat prior for µ.

Theorem 1. If, for j = 1, ..., N,
1. x_j ~ N(µ, Σ) with known covariance Σ,
2. D_1 = {x_1, ..., x_n} and D_2 = {x_{n+1}, ..., x_N}, and
3. µ ~ N(µ_0, Λ_0),
then

    lim_{Λ_0^{-1} → 0} E_{D_2} E_{D_1} Var_{µ | D_1, Σ} ( φ(µ | D_1, D_2, Σ) / φ(µ | D_1, Σ) )    (18)
        = (N/n)^d − 1,    (19)

where d is the dimension of µ.

Proof: The most straightforward proof of the theorem involves simply computing the big multivariate Gaussian integral in (18).

Theorem 1 basically says that in the multivariate normal case with a flat prior the variance of the importance sampling weights is on average (19). These results may hold approximately in the non-normal case if the posterior distributions and the likelihood are approximately normal. As we should expect, when n = N the variance of the weights is 0. As N increases relative to n the variance increases quickly. This is unfortunate in our case since we would like to use this method for large values of N and high dimensional problems. Looking at this result from the effective sample size point of view we see that

    ESS ≈ (n/N)^d M.    (20)

If we draw M times from the sampling distribution when the size of the second partition D_2 is equal to the size of the first partition D_1, the effective sample size is decreased by a factor of 2^d. Although things are looking grim for this method, recent advances in particle filters sidestep this problem by a simple rejuvenation step.
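Read as a rough normal-theory approximation, equation (20) gives a quick way to anticipate this decay before any data are scanned; the helper below is only a sketch of that back-of-the-envelope calculation, and the numbers passed to it are illustrative.

    def expected_ess(M, n, N, d):
        """Approximate ESS from equation (20): M draws conditioned on |D1| = n observations,
        reweighted by the remaining observations so that |D1| + |D2| = N, for dimension d."""
        return M * (n / N) ** d

    # e.g. M = 1000 draws conditioned on n = 100 observations, doubling the data with d = 3
    print(expected_ess(M=1000, n=100, N=200, d=3))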
3. PARTICLE FILTERING FOR MASSIVE DATASETS

The efficiency of the importance sampling scheme described in the previous section deteriorates when the importance weights accumulate on a small fraction of the initial draws. Those θ_i with the largest weights are the parameter values that have the greatest posterior mass given the data absorbed so far. The remaining draws are simply wasting space. Sequential Monte Carlo methods [5] aim to adapt estimates of posterior distributions as new data arrive. Particle filtering is the term often used to describe methods that use importance sampling to filter out those particles, the θ_i, that have the least posterior mass after incorporating the additional data. All of the methods struggle with maintaining a large effective Monte Carlo sample size while maintaining computational efficiency. The resample-move or rejuvenation step developed in [10] greatly increases the sampling efficiency of particle filters in a clever fashion.

We can iterate step 5's outer loop shown in figure 2 until the ESS has deteriorated below some tolerance limit, perhaps 10% of M. Assume that this occurs after absorbing the next n_1 observations. At that point we have an importance sample from the posterior conditioned on the first n + n_1 data points. Then resample M times with replacement from the θ_i where the probability that θ_i is selected is proportional to w_i. Note that these draws still represent a sample, albeit a dependent sample, from the posterior conditioned on the first n + n_1 data points. Several of the θ_i will be represented multiple times in this new sample. For the most part this refreshed sample will be devoid of those θ_i not supported by the data.

Remember that the basic idea behind MCMC was that given a draw from f(θ | x_1, ..., x_{n+n_1}) we can generate another observation from the same distribution by a single Metropolis-Hastings step. Although this new draw will still be dependent, it will have less dependence than leaving it so that it has duplicates in the set of draws. Additional MCMC steps will decrease that dependence and increase the ESS, but also increase the number of times the algorithm accesses the first n + n_1 observations. Therefore, to rejuvenate the sample, for each of these new θ_i's we can perform a single Metropolis-Hastings step centered around θ_i where the acceptance probability is based on all n + n_1 data points. Our rejuvenated θ_i's now represent a more diverse set of parameter values with an effective sample size closer to M again.
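A sketch of the resample-move step just described; mh_step stands in for a single Metropolis-Hastings (or Gibbs) update conditioned on all observations absorbed so far and is an assumed helper, not something specified above.

    import numpy as np

    rng = np.random.default_rng(3)

    def rejuvenate(draws, log_w, absorbed_data, mh_step):
        """Resample-move: resample the particles in proportion to their weights,
        then diversify each one with a single MCMC step based on the absorbed data."""
        M = len(draws)
        w = np.exp(log_w - np.max(log_w))
        idx = rng.choice(M, size=M, replace=True, p=w / w.sum())       # resample step
        new_draws = [mh_step(draws[i], absorbed_data) for i in idx]    # move step
        return new_draws, np.zeros(M)          # log-weights reset to zero (equal weights)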

Figure 5: The resample-move step. 1) Generate an initial sample from f(θ | D_1); the ticks mark the particles, the sampled θ_i. 2) Weight based on f(θ | D_1, D_2) and resample; the length of the vertical lines indicates the number of times resampled. 3) For each θ_i perform an MCMC step to diversify the sample.

Figure 5 graphically walks through the resample-move process step-by-step. After rejuvenating the set of θ_i, we can continue where we left off, on observation n + n_1 + 1, and continue absorbing additional observations until either we include the entire dataset or the ESS again has dropped too low and we need to repeat a rejuvenation step. As opposed to standard MCMC, the particle filter implementation also admits an obvious path toward parallelization. The next section demonstrates the method on a simulated dataset.

4. EXAMPLE: MIXTURES OF TRANSITION MODELS

In this section we present a small example to demonstrate proof-of-concept. While it uses a dataset that can easily fit in main memory, it demonstrates the notion that the particle filter approach greatly reduces the number of data accesses. At least for this example, additional observations would change the posterior slightly so that they can be absorbed by linearly scanning only the newest observations one or two times.

Mixtures of transition models have been used to model users visiting web sites [2, 19, 20] and unsupervised training of robots [18]. In [2], the authors also develop visualization tools (WebCANVAS) for understanding clusters of users and apply their methodology to the msnbc.com web site. Transition models [21], or finite state Markov chains (although related, in this context these are not to be confused with Markov chain Monte Carlo), are useful for describing discrete time series where an observed series switches between a finite number of states. A particular sequence, for example (A, B, A, A, C, B), might be generated by a first-order transition model where the probability that the sequence moves to a particular state at time t + 1 depends only on the state at time t. Perhaps web users traverse a web site in such a manner. Given a set of sequences we can estimate the underlying probability transition matrix, the matrix that describes the probability of specific state to state transitions. In fact the posterior distribution is computable in closed form with a single pass through the dataset by simply counting the number of times the sequences move from state A to state A, state A to state B, and so on for all pairs of states; a sketch of this counting step appears below.

However, a particular set of sequences may not all share a common probability transition matrix. For example, visitors to a web site are heterogeneous and may differ on their likely paths through the web site depending on their profession, their Internet experience, or the information that they seek. The mixture of transition models assumes that the dataset consists of sequences, each generated by one of C transition matrices. However, neither the transition matrices nor the group assignments nor the number from each group are known.
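Before turning to the mixture, here is a minimal sketch of the single-matrix case described above: one pass over the sequences counts the transitions, and with a uniform prior each row of the posterior is Dirichlet(1 + counts). The state coding and example sequences are illustrative.

    import numpy as np

    def transition_counts(sequences, n_states):
        """One pass over the data: count i -> j transitions across all sequences."""
        counts = np.zeros((n_states, n_states))
        for seq in sequences:
            for a, b in zip(seq[:-1], seq[1:]):
                counts[a, b] += 1
        return counts

    sequences = [[0, 1, 0, 0, 2, 1], [1, 1, 2, 0]]   # states coded 0..S-1
    counts = transition_counts(sequences, n_states=3)
    dirichlet_params = counts + 1                    # row i of the posterior is Dirichlet(1 + n_i1, ..., 1 + n_iS)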
The goal, therefore, is to understand the shape of the posterior distribution of the elements of the two transition matrices and the mixing fraction given a sample of observed users' paths. Independent samples from this posterior distribution are not easily obtained directly, but the full conditionals, on the other hand, are simple enough so that the Gibbs sampler is easy to implement (see [19] for complete details).

Let C be the number of clusters and S be the number of possible states. The unknown parameters of this model are the C S × S transition matrices, P_1, ..., P_C, the mixing vector α of length C containing the fraction of observations from each cluster, and the N cluster assignments, z_j ∈ {1, ..., C}. Placing a uniform prior on all parameters, the Gibbs sampler proceeds as follows. First randomly initialize the cluster assignments, z_j. Given the cluster assignments, the full conditional of the i-th row of the transition matrix P_c is

    Dirichlet(1 + n_{i1c}, 1 + n_{i2c}, 1 + n_{i3c}, ..., 1 + n_{iSc}),    (21)

where n_{i1c}, for example, is the number of times sequences for which z_j = c transition from state i to state 1. The mixing vector is updated with a draw from

    Dirichlet(1 + Σ_j I(z_j = 1), ..., 1 + Σ_j I(z_j = C)),    (22)

where Σ_j I(z_j = c) counts the number of observations assigned to cluster c. Lastly we update the cluster assignments conditional on the newly sampled values for the transition matrices. The new cluster assignment for sequence j is drawn from a Multinomial(p_1, p_2, ..., p_C) where p_c is the probability that transition matrix P_c generated the sequence. With these new cluster assignments we return to (21) and so the Gibbs sampler iterates.
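A condensed sketch of one Gibbs iteration for the mixture, following (21), (22), and the multinomial assignment step. The data structure (a per-sequence matrix of transition counts) and the handling of initial states are illustrative choices; [19] contains the complete derivation.

    import numpy as np

    rng = np.random.default_rng(4)

    def gibbs_iteration(counts_by_seq, z, C, S):
        """counts_by_seq[j] is the S x S matrix of transition counts for sequence j;
        z is an integer array of current cluster labels in 0..C-1."""
        N = len(counts_by_seq)
        # (21): each row of each cluster's transition matrix gets a Dirichlet draw
        P = np.empty((C, S, S))
        for c in range(C):
            pooled = sum(counts_by_seq[j] for j in range(N) if z[j] == c) + np.zeros((S, S))
            for i in range(S):
                P[c, i] = rng.dirichlet(1 + pooled[i])
        # (22): mixing fractions from the current cluster memberships
        alpha = rng.dirichlet(1 + np.bincount(z, minlength=C))
        # assignment step: z_j drawn with p_c proportional to alpha_c * Pr(sequence j | P_c)
        for j in range(N):
            log_p = np.log(alpha) + np.array(
                [np.sum(counts_by_seq[j] * np.log(P[c])) for c in range(C)])
            p = np.exp(log_p - log_p.max())
            z[j] = rng.choice(C, p=p / p.sum())
        return P, alpha, z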

As noted in section 2.2, each iteration of the MCMC algorithm requires a full scan of the dataset, in this case two scans, one for the matrix update and one for the cluster assignment update. To test the improvement available using the particle filtering approach, we generated 10,000 sequences of length between 5 and 15 from two 4 × 4 transition matrices. We used the first n = 100 sequences to obtain the initial sample of M = 150 draws, step 1 of the algorithm shown in figure 2. We then sequentially accessed the additional sequences, reweighting the M draws until the ESS dropped below 15. At that point, we resampled and applied the rejuvenation step to the set of draws and continued again until the ESS dropped too low.

Figure 6 shows the results for the number of times the particle filtering algorithm accessed each observation. The lower curve indicates the number of accesses. The first 100 observations show the greatest number of accesses (348 for this example) since they were also used to generate the initial sample. However, the additional observations were accessed infrequently. For example, the algorithm accessed observation #2000 only 14 times and observation #10000 only twice.

Figure 6: The frequency of access by observation. The horizontal line at 300 refers to the full MCMC run and the lower curve refers to the particle filter. The marks along the x-axis refer to occurrences of the rejuvenation step.

For comparison, the line at 300 in figure 6 indicates the number of times the Gibbs sampler, conditioned on the entire dataset, needed to access each observation. Each of the 150 iterations required one scan for the cluster assignments and a second scan for the parameter updates (21) and (22). The slightly larger values for the first 100 observations are due to their usage in determining the starting values for the Gibbs sampler. This starting value selection process was the same for both the particle filter and the full Gibbs sampler. Figure 6 shows a 95% reduction in the total number of data accesses when using the particle filter.

The tick marks along the bottom mark the points at which a rejuvenation step took place. Note that they are very frequent at first and decrease as the algorithm absorbs additional observations. The marginal posterior standard deviation decreases approximately like O(1/√n) so that the target is shrinking at a slower rate as we add more data. From the ESS approximation in (20) we can estimate the frequency of rejuvenation. As before, let n be the size of the initial sample. Now let N_k be the total number of observations accommodated at the k-th rejuvenation step. If we rejuvenate the θ_i's when the ESS drops to pM then the N_k are approximately related according to

    (n/N_1)^d ≈ p,  (N_1/N_2)^d ≈ p,  ...,  (N_{k−1}/N_k)^d ≈ p.    (23)

Unraveling the recursion implies that

    N_k ≈ n (1/p)^{k/d},    (24)

where d is the dimension of the parameter vector from theorem 1. When we let the ESS get very small before rejuvenation, equivalently setting p to be small, the N_k can become large quickly. Naturally, there will be a balance between loss in computing efficiency and estimation efficiency.
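Equation (24) can be turned into a small helper for anticipating the rejuvenation schedule; the inputs below (initial sample size n, ESS tolerance p as a fraction of M, and dimension d) mirror the notation above, and the specific numbers are only an illustration.

    def rejuvenation_points(n, p, d, k_max):
        """Approximate number of observations absorbed by the k-th rejuvenation, equation (24)."""
        return [n * (1.0 / p) ** (k / d) for k in range(1, k_max + 1)]

    # e.g. n = 100 initial observations, rejuvenate when the ESS falls to 10% of M, d = 17
    print(rejuvenation_points(n=100, p=0.1, d=17, k_max=5))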
Fortunately N_k grows exponentially in k, so that once k exceeds d, the effective number of parameters we are trying to estimate, N_k will grow quickly. Therefore, after approximately k = d rejuvenations the algorithm has absorbed enough data points so that it can withhold future rejuvenations until many more observations have been accommodated. While N_k grows exponentially with k, it grows only linearly with n, the number of observations in D_1. This implies that it may be better to spend more computational effort on the rejuvenation steps than the initial posterior sampling effort.

For the mixture example, the effective number of parameters is no more than 25. Each transition matrix is 4 × 4 with the constraint that the rows sum to 1, so each of the two transition matrices has 12 free parameters. With the single mixing fraction parameter the total parameter count is 25. With additional correlation amongst the parameters the effective number of parameters could be less. In fact in our example we found that equation (24) matches the observed frequency of rejuvenation to near perfection when d = 17.

While efficiency as measured with the number of data accesses is important in the analysis of massive datasets, precision of parameter estimates is also important. Figure 7 shows the marginal posterior distributions for the 16 transition probabilities from the first cluster's transition matrix. The histogram is based on the M = 150 draws using the particle filtering method. The overlaid density is based on a rigorous MCMC run with 3000 draws. The histogram and density plots are nearly identical except for small fluctuations. The posterior means from the two methods virtually overlap for each parameter. The figure also marks the location of the parameter value used to generate the data. All of these values are within range of the posterior mean to the extent that we would expect from sampling variability.

Figure 7: The posterior distribution of the transition probabilities for one of the transition matrices. The histogram is based on the particle filter while the black curve is the estimated density based on a rigorous 3000 draw MCMC run. The two darker vertical lines are the posterior means based on the particle filter and the rigorous run and are nearly identical if not overlapping. The dashed vertical line, which may be further away, is the true value used to simulate the dataset.

While achieving a 95% reduction in the number of data points accessed, the algorithm shows little if any loss in the estimation of the posterior distribution and the posterior mean. Note that increasing M does not change the number of data accesses for the particle filter, while each additional draw represents yet another scan for the standard implementation. If for some reason one was not confident in the particle filter results, one could generate additional MCMC iterations utilizing the entire dataset initiated from the particle filter draws. If the densities change little, then that would be evidence in favor of, but not necessarily proof of, the algorithm's estimation accuracy.

The example described involves a fairly small dataset. In additional experimentation we sampled 1000 draws using the particle filter from the posterior distribution conditioned on a dataset containing 1,406,000 observed processes. We observed similar performance metrics. The number of total observations accessed using the particle filter was 99.4% less than if we had used the standard MCMC implementation. At the same time, equation (24) maintains its prediction of the refresh rate for a model with d = 17 effective parameters. To condition on 1.4 million processes the particle filter had to refresh 56 times, the last refresh after incorporating 136,000 processes with a single linear scan. We continued to observe no loss in estimation precision as all the true parameter values used to simulate the data always lie in regions of high posterior probability.

5. DISCUSSION

MCMC methods have been almost completely absent from data mining research while they are widely used in modern statistical analysis of complex models. Indeed, when working with massive datasets the first order of business may be obtaining simple point estimates for unknown parameters. Inevitably, analysts want to explore other aspects of the posterior distribution besides simply the posterior mean or mode. But to date MCMC methods have simply been computationally infeasible for massive datasets.

Likelihood-based data squashing [16] is also a potential tool for making Bayesian analysis in massive datasets computationally feasible. It too uses the factorization of the likelihood (11) to avoid too many scans of the dataset. Likelihood-based data squashing locates a small number of data points or pseudo-data points with appropriate weights so that a weighted analysis of the pseudo-dataset would produce the same results as the unweighted analysis of the massive dataset.

It is possible that a posterior conditioned on the pseudo-dataset may offer a good importance sampling distribution so that some combination of data squashing, importance sampling, and particle filtering could provide a coherent solution.

While clearly the method needs to undergo more empirical work to test the boundaries of its limitations, the derivation and preliminary simulation work shows promise. If we can generally reduce the number of data accesses by 95%, MCMC becomes viable for a large class of models useful in data mining. The sequential nature of the algorithm also allows the analyst to stop when uncertainty in the parameters of interest has dropped below a required tolerance limit. Parallelization of the algorithm is rather straightforward. Each processor manages a small set of the weighted draws from the posterior and is responsible for updating their weights and computing the refresh step.

The last advantage that we discuss here involves convergence of the MCMC sampler. As noted in section 2.2, the key to MCMC begins with assuming that we have an initial draw from f(θ | x). While in practice the analyst usually just starts the chain from some reasonably selected starting point, the particle filter approach allows us to sample directly from the prior to initialize the algorithm. Sampling from the prior distributions often used in practice is usually simple. Then the particle filter can run its course starting with the first observation. Even though subsequent steps introduce dependence, the algorithm will always generate new draws from the correct distribution without approximation.

Bayesian analysis coupled with Markov chain Monte Carlo methods continues to revitalize many areas of statistical analysis. Some variant of the algorithm we propose here may indeed make this pair viable for massive datasets.

6. REFERENCES

[1] J. Besag, P. Green, D. Higdon, and K. Mengersen. Bayesian computation and stochastic systems (with discussion). Statistical Science, 10:3–41.

[2] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Visualization of navigation patterns on a web site using model-based clustering. Technical Report MSR-TR-00-18, Microsoft Research, March.

[3] B. Carlin and T. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, Boca Raton, FL, 2nd edition.

[4] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York.

[5] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag.

[6] J. Elder and D. Pregibon. A statistical perspective on knowledge discovery in databases. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 4. AAAI/MIT Press.

[7] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In Neural Information Processing Systems - NIPS 2001.

[8] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall, New York.

[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6.

[10] W. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society B, 63(1).

[11] W. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall.

[12] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1(1):11–28.
[13] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

[14] A. Kong, J. Liu, and W. Wong. Sequential imputation and Bayesian missing data problems. Journal of the American Statistical Association, 89.

[15] L. Le Cam and G. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer-Verlag, New York.

[16] D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway. Instance construction via likelihood-based data squashing. In H. Liu and H. Motoda, editors, Instance Selection and Construction - A Data Mining Perspective, chapter 12. Kluwer Academic Publishers.

[17] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21.

[18] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91–121.

[19] G. Ridgeway. Finite discrete Markov process clustering. Technical Report MSR-TR-97-24, Microsoft Research, September.

[20] G. Ridgeway and S. Altschuler. Clustering finite discrete Markov chains. In Proceedings of the Section on Physical and Engineering Sciences.

[21] S. M. Ross. Probability Models. Academic Press, 5th edition.

[22] D. Spiegelhalter and R. Cowell. Learning in probabilistic expert systems. In J. Bernardo, J. Berger, A. Dawid, and A. Smith, editors, Bayesian Statistics, volume 4. Clarendon Press, Oxford, 1992.


More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

7.1 Support Vector Machine

7.1 Support Vector Machine 67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

Collapsed Variational Inference for HDP

Collapsed Variational Inference for HDP Collapse Variational Inference for HDP Yee W. Teh Davi Newman an Max Welling Publishe on NIPS 2007 Discussion le by Iulian Pruteanu Outline Introuction Hierarchical Bayesian moel for LDA Collapse VB inference

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

A Modification of the Jarque-Bera Test. for Normality

A Modification of the Jarque-Bera Test. for Normality Int. J. Contemp. Math. Sciences, Vol. 8, 01, no. 17, 84-85 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1988/ijcms.01.9106 A Moification of the Jarque-Bera Test for Normality Moawa El-Fallah Ab El-Salam

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1 Angles-Only Orbit Determination Copyright 6 Michel Santos Page 1 Abstract This ocument presents a re-erivation of the Gauss an Laplace Angles-Only Methos for Initial Orbit Determination. It keeps close

More information

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary

More information

Survey-weighted Unit-Level Small Area Estimation

Survey-weighted Unit-Level Small Area Estimation Survey-weighte Unit-Level Small Area Estimation Jan Pablo Burgar an Patricia Dörr Abstract For evience-base regional policy making, geographically ifferentiate estimates of socio-economic inicators are

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Role of parameters in the stochastic dynamics of a stick-slip oscillator

Role of parameters in the stochastic dynamics of a stick-slip oscillator Proceeing Series of the Brazilian Society of Applie an Computational Mathematics, v. 6, n. 1, 218. Trabalho apresentao no XXXVII CNMAC, S.J. os Campos - SP, 217. Proceeing Series of the Brazilian Society

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Group Importance Sampling for particle filtering and MCMC

Group Importance Sampling for particle filtering and MCMC Group Importance Sampling for particle filtering an MCMC Luca Martino, Víctor Elvira, Gustau Camps-Valls Image Processing Laboratory, Universitat e València (Spain). IMT Lille Douai CRISTAL (UMR 989),

More information

YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL

YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL TOWARDS OPTIMAL SCALING OF METROPOLIS-COUPLED MARKOV CHAIN MONTE CARLO YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL Abstract. We consier optimal temperature spacings for Metropolis-couple

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

Problems Governed by PDE. Shlomo Ta'asan. Carnegie Mellon University. and. Abstract

Problems Governed by PDE. Shlomo Ta'asan. Carnegie Mellon University. and. Abstract Pseuo-Time Methos for Constraine Optimization Problems Governe by PDE Shlomo Ta'asan Carnegie Mellon University an Institute for Computer Applications in Science an Engineering Abstract In this paper we

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

UNIFYING PCA AND MULTISCALE APPROACHES TO FAULT DETECTION AND ISOLATION

UNIFYING PCA AND MULTISCALE APPROACHES TO FAULT DETECTION AND ISOLATION UNIFYING AND MULISCALE APPROACHES O FAUL DEECION AND ISOLAION Seongkyu Yoon an John F. MacGregor Dept. Chemical Engineering, McMaster University, Hamilton Ontario Canaa L8S 4L7 yoons@mcmaster.ca macgreg@mcmaster.ca

More information

Planar sheath and presheath

Planar sheath and presheath 5/11/1 Flui-Poisson System Planar sheath an presheath 1 Planar sheath an presheath A plasma between plane parallel walls evelops a positive potential which equalizes the rate of loss of electrons an ions.

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas The Role of Moels in Moel-Assiste an Moel- Depenent Estimation for Domains an Small Areas Risto Lehtonen University of Helsini Mio Myrsylä University of Pennsylvania Carl-Eri Särnal University of Montreal

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N.

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N. Submitte to the Journal of Rheology Moeling the effects of polyispersity on the viscosity of noncolloial har sphere suspensions Paul M. Mwasame, Norman J. Wagner, Antony N. Beris a) epartment of Chemical

More information

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency Transmission Line Matrix (TLM network analogues of reversible trapping processes Part B: scaling an consistency Donar e Cogan * ANC Eucation, 308-310.A. De Mel Mawatha, Colombo 3, Sri Lanka * onarecogan@gmail.com

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

WEIGHTING A RESAMPLED PARTICLES IN SEQUENTIAL MONTE CARLO (EXTENDED PREPRINT) L. Martino, V. Elvira, F. Louzada

WEIGHTING A RESAMPLED PARTICLES IN SEQUENTIAL MONTE CARLO (EXTENDED PREPRINT) L. Martino, V. Elvira, F. Louzada WEIGHTIG A RESAMLED ARTICLES I SEQUETIAL MOTE CARLO (ETEDED RERIT) L. Martino, V. Elvira, F. Louzaa Dep. of Signal Theory an Communic., Universia Carlos III e Mari, Leganés (Spain). Institute of Mathematical

More information

Revisiting Uncertainty in Graph Cut Solutions

Revisiting Uncertainty in Graph Cut Solutions Revisiting Uncertainty in Graph Cut Solutions Daniel Tarlow Dept. of Computer Science University of Toronto tarlow@cs.toronto.eu Ryan P. Aams School of Engineering an Applie Sciences Harvar University

More information

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling Case Stuy 5: Mixe Membership Moeling LDA Collapse Gibbs Sampler, VariaNonal Inference Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox May 8 th, 05 Emily Fox 05 Task : Mixe

More information

Nonlinear Adaptive Ship Course Tracking Control Based on Backstepping and Nussbaum Gain

Nonlinear Adaptive Ship Course Tracking Control Based on Backstepping and Nussbaum Gain Nonlinear Aaptive Ship Course Tracking Control Base on Backstepping an Nussbaum Gain Jialu Du, Chen Guo Abstract A nonlinear aaptive controller combining aaptive Backstepping algorithm with Nussbaum gain

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

A Minimum Variance Method for Lidar Signal Inversion

A Minimum Variance Method for Lidar Signal Inversion 468 J O U R N A L O F A T M O S P H E R I C A N D O C E A N I C T E C H N O L O G Y VOLUME 31 A Minimum Variance Metho for Liar Signal Inversion ANDREJA SU SNIK Centre for Atmospheric Research, University

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information