Bayesian analysis of massive datasets via particle filters


Greg Ridgeway, RAND, PO Box 2138, Santa Monica, CA
David Madigan, Department of Statistics, 477 Hill Center, Rutgers University, Piscataway, NJ

ABSTRACT

Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need to produce statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating their candidacy as feasible data mining techniques. In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the rejuvenation step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses. To show proof-of-concept, we demonstrate the method on a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while offering a 95% reduction in data accesses.

1. INTRODUCTION

The need for rigorous statistical analysis has not gone unnoticed in the data mining community. Statistical concepts such as latent variables, spurious correlation, and problems involving model search and selection have appeared in widely noted data mining literature [6, 12]. However, algorithms and model fitting methods that actually work on massive datasets have been slow to appear. Bayesian analysis is a widely accepted paradigm for estimating unknown parameters from data. In applications with small to medium size datasets, Bayesian methods have found great success in statistical practice. In particular, applied statistical work has seen a surge in the use of Bayesian hierarchical models for modeling multilevel or relational data [3] in a variety of fields including health and education. Spatial models in agriculture, image analysis, and remote sensing often utilize Bayesian methods and invariably require heavy computation (see [1] for an overview). As shown in [7], kernel methods also have a convenient Bayesian formulation, producing posterior distributions over a general class of prediction models. The power of Bayesian analysis comes from the transparent inclusion of prior knowledge, a more natural probabilistic interpretation of parameter estimates, and greater flexibility in model specification.
While Bayesian models and the computational tools behind them have revolutionized the field, they continue to rely on algorithms that perform thousands, even millions, of laps through the dataset in order to produce estimates of the posterior distribution of the model parameters. For massive datasets, Bayesian methods still begin with a "load data into memory" step, make compromising assumptions, or resort to subsampling to skirt the issue. If the most severe penalty comes when requesting data, algorithms might exist that only use a small, manageable portion of the dataset at any one time. This paper proposes such an algorithm. It performs a rigorous Bayesian computation on a small, manageable portion of the dataset and adapts those calculations with the remaining observations. The adaptation attempts to minimize the number of times the algorithm loads each observation into memory.

2. TECHNIQUES FOR BAYESIAN COMPUTATION

Except for the simplest of models and regardless of the style of inference, estimation algorithms almost always require repeated scans of the dataset. We know that for well-behaved likelihoods and priors, the posterior distribution converges to a multivariate normal [4, 15]. For large but finite samples this approximation works rather well on marginal distributions and lower dimensional conditional distributions but does not always provide an accurate approximation to the full joint distribution [8]. The normal approximation also assumes that one has the maximum likelihood estimate for the parameter and the observed or expected information matrix.

Even normal posterior approximations and maximum likelihood calculations can require heavy computation. Newton-Raphson type algorithms for maximum likelihood estimation require several scans of the dataset, at least one for each iteration. When some observations also have missing data, the algorithms (EM, for example) likely will demand even more scans. For some models, dataset sizes, and applications these approximation methods may work and be preferable to a full Bayesian analysis. This will not always be the case and so the need exists for improved techniques to learn accurately from massive datasets.

Summaries of results from Bayesian data analyses often are in the form of expectations such as the marginal mean, variance, and covariance of the parameters of interest. We compute the expected value of the quantity of interest, h(θ), using

    E(h(θ) | x_1, ..., x_N) = ∫ h(θ) f(θ | x_1, ..., x_N) dθ    (1)

where f(θ | x) is the posterior distribution of the parameters given the observed data. Computation of these expectations requires calculating integrals that, for all but the simplest examples, are difficult to compute in closed form. Monte Carlo integration methods sample from the posterior, f(θ | x), and appeal to the law of large numbers to estimate the integrals,

    lim_{M→∞} (1/M) Σ_{i=1}^{M} h(θ_i) = ∫ h(θ) f(θ | x_1, ..., x_N) dθ    (2)

where the θ_i compose a sample from f(θ | x). The ability to compute these expectations efficiently is equivalent to being able to sample efficiently from f(θ | x). Sampling schemes are often difficult enough without the burden of large datasets. The additional complexity of massive datasets usually causes each iteration of the Monte Carlo sampler to be slower. When the number of iterations already needs to be large, efficient procedures within each iteration are essential to timely delivery of results.

2.1 Importance sampling

Importance sampling is a general Monte Carlo method for computing integrals. As previously mentioned, Monte Carlo methods approximate integrals of the form (1). The approximation in (2) depends on the ability to sample from f(θ | x). When a sampling mechanism is not readily available for the target distribution, f(θ | x), but one is available for another sampling distribution, g(θ), we can use importance sampling. Note that for (1) we can write

    ∫ h(θ) f(θ | x_1, ..., x_N) dθ = ∫ h(θ) [f(θ | x) / g(θ)] g(θ) dθ    (3)
                                   = lim_{M→∞} (1/M) Σ_{i=1}^{M} w_i h(θ_i)    (4)

where θ_i is a draw from g(θ) and w_i = f(θ_i | x) / g(θ_i). Note that the expected value of w_i under g(θ) is 1. Therefore, if we are able to compute the importance sampling weights, w_i, only up to a constant of proportionality, we can normalize the weights to compute the integral.

    ∫ h(θ) f(θ | x_1, ..., x_N) dθ = lim_{M→∞} [Σ_{i=1}^{M} w_i h(θ_i)] / [Σ_{i=1}^{M} w_i]    (5)

Naturally, in order for the sampling distribution to be useful, drawing from g(θ) must be easy. We also want our sampling distribution to be such that the limit converges quickly to the value of the integral. If the tails of g(θ) decay faster than f(θ | x) the weights will be numerically unstable. If the tails of g(θ) decay much more slowly than f(θ | x) we will frequently sample from regions where the weight will be close to zero, wasting computation time. Second to sampling directly from f(θ | x), we would like a sampling distribution slightly fatter than f(θ | x). In section 2.3 we show that when we set the sampling density to be f(θ | x_1, ..., x_n), where n ≪ N so that we condition on a manageable subset of the entire dataset, the importance weights for each sample θ_i require only one sequential scan of the remaining observations.
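To make (3)-(5) concrete, the following is a minimal Python sketch of self-normalized importance sampling; the particular target and sampling densities are illustrative stand-ins rather than anything from the analysis above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy target f(theta|x): a N(1, 0.5^2) "posterior"; sampling distribution g: N(0, 1).
    # Both densities are placeholders; only their ratio enters the weights.
    def log_f(theta):
        return -0.5 * ((theta - 1.0) / 0.5) ** 2 - np.log(0.5 * np.sqrt(2 * np.pi))

    def log_g(theta):
        return -0.5 * theta ** 2 - np.log(np.sqrt(2 * np.pi))

    M = 10000
    theta = rng.normal(0.0, 1.0, size=M)     # draws from g
    log_w = log_f(theta) - log_g(theta)      # log importance weights
    w = np.exp(log_w - log_w.max())          # stabilize; normalization happens below

    # Self-normalized estimate of E[h(theta)|x], equation (5), with h(theta) = theta
    estimate = np.sum(w * theta) / np.sum(w)
    print(estimate)                          # close to the target mean of 1.0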
Before beginning that discussion, the next section introduces the most popular computational method for Bayesian analysis of complex models.

2.2 Markov chain Monte Carlo

While a large group of statisticians have long felt that Bayesian analysis is appropriate for a wide class of problems, practical estimation methods were not available until Markov chain Monte Carlo (MCMC) techniques became available. Importance sampling is a useful tool, but for complex models crafting a reasonable sampling distribution can be extremely difficult. The excellent collection [11] contains a more detailed introduction to MCMC along with a variety of interesting examples and applications. As with importance sampling, the goal is to generate a set of draws from the posterior distribution f(θ | x). Rather than create independent draws and reweight, MCMC methods build a Markov chain, a sample of dependent draws, θ_1, ..., θ_M, that have stationary distribution f(θ | x). It turns out that it is often easy to create such a Markov chain with a few basic strategies. However, there is still a bit of art involved in creating an efficient chain and assessing the chain's convergence.

Figure 1 shows the Metropolis-Hastings algorithm [13, 17], a very general MCMC algorithm. Assume that we have a single draw θ_1 from f(θ | x) and a proposal distribution for a new draw, q(θ | θ_1). If we follow step 2 of the MCMC algorithm then the distribution of θ_2 will also be f(θ | x). This is one of the key properties of the algorithm. Iterating this algorithm we will obtain a sequence θ_1, ..., θ_M that has f(θ | x) as its stationary distribution.

MCMC methods have two main advantages that make them so useful for Bayesian analysis. First, we can choose q's from which sampling is easy. Any q that does not deterministically propose values and is capable of eventually visiting any value of θ will make the algorithm sample from the desired distribution. Special choices for q, which may or may not depend on the data, simplify the algorithm. If q is symmetric, for example a Gaussian centered on θ_{i−1}, then the proposal distributions cancel out entirely in (6). If we choose a q that proposes values that are very close to θ_{i−1} then it will almost always accept the proposal, but the chain will move very slowly and take a long time to converge to the stationary distribution. If q proposes new draws that are far from θ_{i−1} and outside the region with most of the posterior mass, the proposals will almost always be rejected and again the chain will converge slowly. With a little tuning the proposal distribution can usually be adjusted so that proposals are not rejected or accepted too frequently.

The second advantage is that there is no need to compute the normalization constant of f(θ | x) since it cancels out in (6).

The Gibbs sampler [9] is a special case of the Metropolis-Hastings algorithm and is especially popular. If θ is a multidimensional parameter, the Gibbs sampler sequentially updates each of the components of θ from the full conditional distribution of that component given fixed values of all the other components and the data. For many models used in common practice, even the ones that yield a complex posterior distribution, sampling from the posterior's full conditionals is often a relatively simple task. Conveniently, the acceptance probability (6) always equals 1 and yet the chains often converge relatively quickly. The example in section 4 utilizes a Gibbs sampler and goes into further detail of the example's full conditionals.

Figure 1: The Metropolis-Hastings algorithm

    1. Initialize the parameter θ_1
    2. For i in 2, ..., M do (step (a) and/or (b) requires a scan of the dataset)
       (a) Draw a proposal θ* from q(θ | θ_{i−1})
       (b) Loop through the dataset to compute the acceptance probability
           α(θ*, θ_{i−1}) = min( 1, [f(θ* | x) q(θ_{i−1} | θ*)] / [f(θ_{i−1} | x) q(θ* | θ_{i−1})] )    (6)
       (c) With probability α(θ*, θ_{i−1}) set θ_i = θ*. Otherwise set θ_i = θ_{i−1}

MCMC as specified, however, is computationally infeasible for massive datasets. Except for the most trivial examples, computing the acceptance probability (6) requires a complete scan of the dataset. Although the Gibbs sampler avoids the acceptance probability calculation, precalculations for simulating from the full conditionals of f(θ | x) require a full scan of the dataset, sometimes a full scan for each component! Since MCMC algorithms produce dependent draws from the posterior, M usually has to be very large to reduce the amount of Monte Carlo variation in the posterior estimates. While MCMC makes fully Bayesian analysis practical, it seems dead on arrival for massive dataset applications.

Although this section has not given great detail about the MCMC methods, the important ideas for the purpose of this paper are that

1. MCMC methods make Bayesian analysis practical,
2. MCMC often requires an enormous number of laps through the dataset, and
3. given a θ drawn from f(θ | x) we can use MCMC to draw another value, θ*, from the same distribution.

The last point will be the key to implementing a particle filter solution that allows us to apply MCMC methods to massive datasets. We will use this technique to switch the inner and outer loops in figure 1. The scan of the dataset will become the outer loop and the scan of the draws from f(θ | x) will become the inner loop.
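A minimal random-walk Metropolis sketch of the recipe in figure 1; the unnormalized log posterior here is a placeholder for whatever model is being fit, and because the Gaussian proposal is symmetric the q terms in (6) cancel.

    import numpy as np

    rng = np.random.default_rng(1)

    def log_post(theta, data):
        # Placeholder unnormalized log posterior: normal likelihood with unit
        # variance and a flat prior, so log f(theta|x) = -0.5 * n * (theta - mean)^2 + const.
        return -0.5 * len(data) * (theta - data.mean()) ** 2

    def metropolis_hastings(data, M=5000, step=0.5, theta0=0.0):
        draws = np.empty(M)
        theta = theta0
        lp = log_post(theta, data)
        for i in range(M):
            prop = theta + rng.normal(0.0, step)      # symmetric proposal q
            lp_prop = log_post(prop, data)            # this evaluation is the pass over the data
            if np.log(rng.uniform()) < lp_prop - lp:  # acceptance probability (6)
                theta, lp = prop, lp_prop
            draws[i] = theta
        return draws

    data = rng.normal(2.0, 1.0, size=1000)
    print(metropolis_hastings(data).mean())           # roughly the posterior mean, near 2.0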
2.3 Importance sampling for analysis of massive datasets

So far we have two tools, MCMC and importance sampling, to draw samples from an arbitrary posterior distribution. In this section we discuss a particular form of importance sampling that will help perform Bayesian analysis for massive datasets. Ideally we would like to sample efficiently and take advantage of all the information available in the dataset. A factorization of the integrand of the right hand side of (3) shows that this is possible when the observations, x_i, are independent given the parameters, θ. Such conditional independence is often satisfied, as in the class of hierarchical models, even when the observations are marginally dependent.

Let D_1 and D_2 be a partition of the dataset so that every observation is in either D_1 or D_2. As noted for (1) we would like to sample from the posterior conditioned on all of the data, f(θ | D_1, D_2). Since sampling from f(θ | D_1, D_2) is difficult due to the size of the dataset, we consider setting g(θ) = f(θ | D_1) for use as our sampling distribution and using importance sampling to adjust the draws. If θ_i, i = 1, ..., M, are draws from f(θ | D_1) then we can estimate the posterior expectation (1) as

    Ê(h(θ) | D_1, D_2) = [Σ_{i=1}^{M} w_i h(θ_i)] / [Σ_{i=1}^{M} w_i]    (7)

where the w_i's are the importance sampling weights

    w_i = f(θ_i | D_1, D_2) / f(θ_i | D_1).    (8)

Although these weights still involve f(θ_i | D_1, D_2), they greatly simplify.

    w_i = [f(D_1, D_2 | θ_i) f(θ_i) / f(D_1, D_2)] × [f(D_1) / (f(D_1 | θ_i) f(θ_i))]    (9)
        = [f(D_1 | θ_i) f(D_2 | θ_i) f(D_1)] / [f(D_1 | θ_i) f(D_1, D_2)] = f(D_2 | θ_i) / f(D_2 | D_1)    (10)
        ∝ f(D_2 | θ_i) = ∏_{x_j ∈ D_2} f(x_j | θ_i)    (11)

Line (9) follows from applying Bayes' theorem to the numerator and denominator. Equation (10) follows from (9) since the observations in the dataset partition D_1 are conditionally independent from those in D_2 given θ. Conveniently, (11) is just the likelihood of the observations in D_2 evaluated at the sampled value of θ. Figure 2 summarizes this result as an algorithm. The algorithm maintains the weights on the log scale for numerical stability.

So rather than sample from the posterior conditioned on all of the data, D_1 and D_2, which slows the sampling procedure, we need only to sample from the posterior conditioned on D_1. The remaining data, D_2, simply adjust the sampled parameter values by reweighting. The for loops in step 5 of figure 2 are interchangeable. The trick here is to have the inner loop scan through the draws so that the outer loop only needs to scan D_2 once to update the weights.

Although the same computations take place, in practice physically scanning a massive dataset is far more expensive than scanning a parameter list. However, massive models as well as massive datasets exist, and in those cases scanning the dataset may be cheaper than scanning the sampled parameter vectors. We will continue to assume that scanning the dataset is the main impediment to the data analysis. We certainly can sample from f(θ | D_1) more efficiently than from f(θ | D_1, D_2) since simulating from f(θ | D_1) will require a scan of a much smaller portion of the dataset. We also assume that, for a given value of θ, the likelihood is readily computable up to a constant, which is almost always the case.

When some data are missing, the processing of an observation in D_2 will require integrating out the missing information. Since the algorithm handles each observation case by case, computing the observed likelihood as an importance weight will be much more efficient than if it was embedded and repeatedly computed in a Metropolis-Hastings rejection probability computation. Placing observations with missing values in D_2 greatly reduces the number of times this integration step needs to occur, easing likelihood computations.

Figure 2: Importance sampling for massive datasets

    1. Load as much data into memory as possible to form D_1, taking into account space requirements for the Monte Carlo algorithm
    2. Draw M times from f(θ | D_1) via Monte Carlo or Markov chain Monte Carlo
    3. Purge the memory of D_1
    4. Create a vector of length M to store the logarithm of the weights and initialize them to 0
    5. Iterate through the remaining observations. For each observation, x_j, update the log-weights on all of the draws from f(θ | D_1)
       for x_j in the partition D_2 do {
           for i in 1, ..., M do {
               log w_i ← log w_i + log f(x_j | θ_i)
           }
       }
    6. Rescale to compute the weights: w_i ← exp(log w_i − max(log w_i))
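The steps in figure 2 translate almost line for line into code. The sketch below assumes a user-supplied log-likelihood function and a set of draws already obtained from f(θ | D_1); both are placeholders rather than anything prescribed above.

    import numpy as np

    def reweight_by_scan(draws, d2_stream, loglik):
        """Steps 4-6 of figure 2: one sequential scan of D2 updates every draw's log-weight.

        draws      : M parameter values sampled from f(theta | D1)
        d2_stream  : an iterable over the observations in D2 (read once, in order)
        loglik     : function (x, theta) -> log f(x | theta)
        """
        log_w = np.zeros(len(draws))                  # step 4
        for x in d2_stream:                           # step 5: outer loop over D2
            for i, theta in enumerate(draws):         #         inner loop over the draws
                log_w[i] += loglik(x, theta)
        return np.exp(log_w - log_w.max())            # step 6: rescale for stability

    # Example use with a unit-variance normal likelihood (purely illustrative):
    rng = np.random.default_rng(2)
    draws = rng.normal(0.0, 0.1, size=200)            # pretend draws from f(theta | D1)
    d2 = rng.normal(0.05, 1.0, size=5000)             # remaining observations
    w = reweight_by_scan(draws, d2, lambda x, t: -0.5 * (x - t) ** 2)
    posterior_mean = np.sum(w * draws) / np.sum(w)    # equation (7)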
2.4 Efficiency and the effective sample size

The algorithm shown in figure 2 does have some drawbacks. While it makes great gains in reducing the number of times the data need to be accessed, the Monte Carlo variance of the importance sampling estimates grows quickly. The problem is easily demonstrated graphically as shown in figure 3. The wide histogram represents the sampling distribution f(θ | D_1) that generates the initial posterior draws. However, the target distribution, f(θ | D_1, D_2), shown as the density plot, is shifted and narrower. About half of the draws from f(θ | D_1) will be wasted. Those that come from the right half will have importance weight near zero.

Figure 3: Comparison of f(θ | D_1, D_2) and f(θ | D_1)

Since all of the terms are positive in the familiar variance relationship

    Var(θ | D_1) = E(Var(θ | D_1, D_2)) + Var(E(θ | D_1, D_2)),    (13)

the posterior variance with the additional observations in D_2 on average will be smaller than the posterior variance conditioned only on D_1. The addition of D_2 can increase the variance (see [22] for an example) but usually D_2 is large enough so that the averaging effect dominates. Therefore, although the location of the sampling distribution should be close to the target distribution, its spread will most likely be wider than that of the target. As additional observations become available, f(θ | D_1, D_2) becomes much narrower than f(θ | D_1). The result of this narrowing is that the weights of many of the original draws from the sampling distribution approach 0 and so we have few effective draws from the target density.

As in [14], the effective sample size (ESS) is the number of observations from a simple random sample needed to obtain an estimate with Monte Carlo variation equal to the Monte Carlo variation obtained with the M weighted draws of θ_i.

    Var( (1/ESS) Σ_{i=1}^{ESS} θ_i ) = Var( [Σ_{i=1}^{M} w_i θ_i] / [Σ_{i=1}^{M} w_i] )    (14)

    (1/ESS) Var(θ) = Var(θ) [Σ_{i=1}^{M} w_i²] / [Σ_{i=1}^{M} w_i]²    (15)

Therefore,

    ESS = (Σ w_i)² / Σ w_i²    (16)
        = M / (1 + Var(w)).    (17)

Hölder's inequality implies that ESS is always less than or equal to M. With a little algebra, the ESS is also expressible in terms of the sample variance of the w_i as shown in (17), which facilitates the study of its properties in Theorem 1. If the θ_i are dependent from the start, as will be the case for MCMC draws, the effective sample size will further decrease in addition to reductions due to unequal importance weights. When the MCMC algorithm mixes well so that the set of θ_i are not too dependent, this is not too much of a problem.
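Equations (16)-(17) are simple to compute from the weights themselves; a small sketch:

    import numpy as np

    def effective_sample_size(w):
        """ESS = (sum w)^2 / sum(w^2), equation (16); equals M when all weights are equal."""
        w = np.asarray(w, dtype=float)
        return w.sum() ** 2 / np.sum(w ** 2)

    print(effective_sample_size(np.ones(1000)))             # 1000.0: no weight degeneracy
    print(effective_sample_size([1.0] * 10 + [0.0] * 990))  # 10.0: weight piled on 10 draws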

Figure 4 shows the decay of the effective sample size for a simulated example. The data come from a three-dimensional Gaussian with mean 0 and covariance equal to the identity matrix. The posterior therefore concerns the three mean and the six covariance parameters. We sample M = 1000 times from the posterior conditioned on n = 100 observations. After 300 additional data points the ESS has dropped to 10, a 99% loss in estimation efficiency from the initial Monte Carlo sample of M = 1000. At this point 65 of the initial 1000 draws account for 99% of the total weight. Figure 4 also overlays the ESS curve assuming a known covariance and the expected ESS curve derived next.

Figure 4: The reduction in effective sample size with the addition of 1,000 observations. The top jagged curve assumes a known covariance while the bottom jagged curve also estimates the covariance. The smooth curve is the expected ESS with a known covariance.

The following theorem concerning the variance of the importance sampling weights can help us gauge the effect of these problems in practice. The theorem assumes that we observe a finite set of multivariate normal data, x_i. As before we will partition the x_i's into two groups, D_1 and D_2. To get accurate estimates of the mean, µ, we will be concerned about the variance of the importance sampling weights, φ(µ | D_1, D_2, Σ)/φ(µ | D_1, Σ), where φ(·) is the normal density function. The theorem gives the variance of these importance sampling weights averaged over all possible datasets with a flat prior for µ.

Theorem 1. If, for j = 1, ..., N,
1. x_j ~ N(µ, Σ) with known covariance Σ,
2. D_1 = {x_1, ..., x_n} and D_2 = {x_{n+1}, ..., x_N}, and
3. µ ~ N(µ_0, Λ_0),
then

    lim_{Λ_0^{-1} → 0} E_{D_2} E_{D_1} Var_{µ | D_1, Σ} ( φ(µ | D_1, D_2, Σ) / φ(µ | D_1, Σ) )    (18)
        = (N/n)^d − 1,    (19)

where d is the dimension of µ.

Proof: The most straightforward proof of the theorem involves simply computing the big multivariate Gaussian integral in (18).

Theorem 1 basically says that in the multivariate normal case with a flat prior the variance of the importance sampling weights is on average (19). These results may hold approximately in the non-normal case if the posterior distributions and the likelihood are approximately normal. As we should expect, when n = N the variance of the weights is 0. As N increases relative to n the variance increases quickly. This is unfortunate in our case since we would like to use this method for large values of N and high dimensional problems. Looking at this result from the effective sample size point of view we see that

    ESS ≈ (n/N)^d M.    (20)

If we draw M times from the sampling distribution when the size of the second partition D_2 is equal to the size of the first partition D_1, the effective sample size is decreased by a factor of 2^d. Although things are looking grim for this method, recent advances in particle filters sidestep this problem by a simple rejuvenation step.
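Read as a rough normal-theory approximation, equation (20) gives a quick way to anticipate this decay before any data are scanned; the helper below is only a sketch of that back-of-the-envelope calculation, and the numbers passed to it are illustrative.

    def expected_ess(M, n, N, d):
        """Approximate ESS from equation (20): M draws conditioned on |D1| = n observations,
        reweighted by the remaining observations so that |D1| + |D2| = N, for dimension d."""
        return M * (n / N) ** d

    # e.g. M = 1000 draws conditioned on n = 100 observations, doubling the data with d = 3
    print(expected_ess(M=1000, n=100, N=200, d=3))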
3. PARTICLE FILTERING FOR MASSIVE DATASETS

The efficiency of the importance sampling scheme described in the previous section deteriorates when the importance weights accumulate on a small fraction of the initial draws. Those θ_i with the largest weights are the parameter values that have the greatest posterior mass given the data absorbed so far. The remaining draws are simply wasting space. Sequential Monte Carlo methods [5] aim to adapt estimates of posterior distributions as new data arrive. Particle filtering is the term often used to describe methods that use importance sampling to filter out those particles, the θ_i, that have the least posterior mass after incorporating the additional data. All of the methods struggle with maintaining a large effective Monte Carlo sample size while maintaining computational efficiency. The resample-move or rejuvenation step developed in [10] greatly increases the sampling efficiency of particle filters in a clever fashion.

We can iterate step 5's outer loop shown in figure 2 until the ESS has deteriorated below some tolerance limit, perhaps 10% of M. Assume that this occurs after absorbing the next n_1 observations. At that point we have an importance sample from the posterior conditioned on the first n + n_1 data points. Then resample M times with replacement from the θ_i where the probability that θ_i is selected is proportional to w_i. Note that these draws still represent a sample, albeit a dependent sample, from the posterior conditioned on the first n + n_1 data points. Several of the θ_i will be represented multiple times in this new sample. For the most part this refreshed sample will be devoid of those θ_i not supported by the data.

Remember that the basic idea behind MCMC was that given a draw from f(θ | x_1, ..., x_{n+n_1}) we can generate another observation from the same distribution by a single Metropolis-Hastings step. Although this new draw will still be dependent, it will have less dependence than leaving it so that it has duplicates in the set of draws. Additional MCMC steps will decrease that dependence and increase the ESS, but also increase the number of times the algorithm accesses the first n + n_1 observations. Therefore, to rejuvenate the sample, for each of these new θ_i's we can perform a single Metropolis-Hastings step centered around θ_i where the acceptance probability is based on all n + n_1 data points. Our rejuvenated θ_i's now represent a more diverse set of parameter values with an effective sample size closer to M again.
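A sketch of the resample-move step just described; mh_step stands in for a single Metropolis-Hastings (or Gibbs) update conditioned on all observations absorbed so far and is an assumed helper, not something specified above.

    import numpy as np

    rng = np.random.default_rng(3)

    def rejuvenate(draws, log_w, absorbed_data, mh_step):
        """Resample-move: resample the particles in proportion to their weights,
        then diversify each one with a single MCMC step based on the absorbed data."""
        M = len(draws)
        w = np.exp(log_w - np.max(log_w))
        idx = rng.choice(M, size=M, replace=True, p=w / w.sum())       # resample step
        new_draws = [mh_step(draws[i], absorbed_data) for i in idx]    # move step
        return new_draws, np.zeros(M)          # log-weights reset to zero (equal weights)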

Figure 5: The resample-move step. 1) Generate an initial sample from f(θ | D_1); the ticks mark the particles, the sampled θ_i. 2) Weight based on f(θ | D_1, D_2) and resample; the length of the vertical lines indicates the number of times resampled. 3) For each θ_i perform an MCMC step to diversify the sample.

Figure 5 graphically walks through the resample-move process step-by-step. After rejuvenating the set of θ_i, we can continue where we left off, on observation n + n_1 + 1, and continue absorbing additional observations until either we include the entire dataset or the ESS again has dropped too low and we need to repeat a rejuvenation step. As opposed to standard MCMC, the particle filter implementation also admits an obvious path toward parallelization. The next section demonstrates the method on a simulated dataset.

4. EXAMPLE: MIXTURES OF TRANSITION MODELS

In this section we present a small example to demonstrate proof-of-concept. While it uses a dataset that can easily fit in main memory, it demonstrates the notion that the particle filter approach greatly reduces the number of data accesses. At least for this example, additional observations would change the posterior slightly so that they can be absorbed by linearly scanning only the newest observations one or two times.

Mixtures of transition models have been used to model users visiting web sites [2, 19, 20] and unsupervised training of robots [18]. In [2], the authors also develop visualization tools (WebCANVAS) for understanding clusters of users and apply their methodology to the msnbc.com web site. Transition models [21], or finite state Markov chains (although related, in this context these are not to be confused with Markov chain Monte Carlo), are useful for describing discrete time series where an observed series switches between a finite number of states. A particular sequence, for example (A, B, A, A, C, B), might be generated by a first-order transition model where the probability that the sequence moves to a particular state at time t + 1 depends only on the state at time t. Perhaps web users traverse a web site in such a manner. Given a set of sequences we can estimate the underlying probability transition matrix, the matrix that describes the probability of specific state to state transitions. In fact the posterior distribution is computable in closed form with a single pass through the dataset by simply counting the number of times the sequences move from state A to state A, state A to state B, and so on for all pairs of states; a sketch of this counting step appears below.

However, a particular set of sequences may not all share a common probability transition matrix. For example, visitors to a web site are heterogeneous and may differ on their likely paths through the web site depending on their profession, their Internet experience, or the information that they seek. The mixture of transition models assumes that the dataset consists of sequences, each generated by one of C transition matrices. However, neither the transition matrices nor the group assignments nor the number from each group are known.
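Before turning to the mixture, here is a minimal sketch of the single-matrix case described above: one pass over the sequences counts the transitions, and with a uniform prior each row of the posterior is Dirichlet(1 + counts). The state coding and example sequences are illustrative.

    import numpy as np

    def transition_counts(sequences, n_states):
        """One pass over the data: count i -> j transitions across all sequences."""
        counts = np.zeros((n_states, n_states))
        for seq in sequences:
            for a, b in zip(seq[:-1], seq[1:]):
                counts[a, b] += 1
        return counts

    sequences = [[0, 1, 0, 0, 2, 1], [1, 1, 2, 0]]   # states coded 0..S-1
    counts = transition_counts(sequences, n_states=3)
    dirichlet_params = counts + 1                    # row i of the posterior is Dirichlet(1 + n_i1, ..., 1 + n_iS)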
The goal, therefore, is to understand the shape of the posterior distribution of the elements of the two transition matrices and the mixing fraction given a sample of observed users' paths. Independent samples from this posterior distribution are not easily obtained directly, but the full conditionals, on the other hand, are simple enough so that the Gibbs sampler is easy to implement (see [19] for complete details).

Let C be the number of clusters and S be the number of possible states. The unknown parameters of this model are the C S × S transition matrices, P_1, ..., P_C, the mixing vector α of length C containing the fraction of observations from each cluster, and the N cluster assignments, z_j ∈ {1, ..., C}. Placing a uniform prior on all parameters, the Gibbs sampler proceeds as follows. First randomly initialize the cluster assignments, z_j. Given the cluster assignments, the full conditional of the i-th row of the transition matrix P_c is

    Dirichlet(1 + n_{i1c}, 1 + n_{i2c}, 1 + n_{i3c}, ..., 1 + n_{iSc}),    (21)

where n_{i1c}, for example, is the number of times sequences for which z_j = c transition from state i to state 1. The mixing vector is updated with a draw from

    Dirichlet(1 + Σ_j I(z_j = 1), ..., 1 + Σ_j I(z_j = C)),    (22)

where Σ_j I(z_j = c) counts the number of observations assigned to cluster c. Lastly we update the cluster assignments conditional on the newly sampled values for the transition matrices. The new cluster assignment for sequence j is drawn from a Multinomial(p_1, p_2, ..., p_C) where p_c is the probability that transition matrix P_c generated the sequence. With these new cluster assignments we return to (21) and so the Gibbs sampler iterates.
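A condensed sketch of one Gibbs iteration for the mixture, following (21), (22), and the multinomial assignment step. The data structure (a per-sequence matrix of transition counts) and the handling of initial states are illustrative choices; [19] contains the complete derivation.

    import numpy as np

    rng = np.random.default_rng(4)

    def gibbs_iteration(counts_by_seq, z, C, S):
        """counts_by_seq[j] is the S x S matrix of transition counts for sequence j;
        z is an integer array of current cluster labels in 0..C-1."""
        N = len(counts_by_seq)
        # (21): each row of each cluster's transition matrix gets a Dirichlet draw
        P = np.empty((C, S, S))
        for c in range(C):
            pooled = sum(counts_by_seq[j] for j in range(N) if z[j] == c) + np.zeros((S, S))
            for i in range(S):
                P[c, i] = rng.dirichlet(1 + pooled[i])
        # (22): mixing fractions from the current cluster memberships
        alpha = rng.dirichlet(1 + np.bincount(z, minlength=C))
        # assignment step: z_j drawn with p_c proportional to alpha_c * Pr(sequence j | P_c)
        for j in range(N):
            log_p = np.log(alpha) + np.array(
                [np.sum(counts_by_seq[j] * np.log(P[c])) for c in range(C)])
            p = np.exp(log_p - log_p.max())
            z[j] = rng.choice(C, p=p / p.sum())
        return P, alpha, z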

As noted in section 2.2, each iteration of the MCMC algorithm requires a full scan of the dataset, in this case two scans, one for the matrix update and one for the cluster assignment update. To test the improvement available using the particle filtering approach, we generated 10,000 sequences of length between 5 and 15 from two 4 × 4 transition matrices. We used the first n = 100 sequences to obtain the initial sample of M = 150 draws, step 1 of the algorithm shown in figure 2. We then sequentially accessed the additional sequences, reweighting the M draws until the ESS dropped below 15. At that point, we resampled and applied the rejuvenation step to the set of draws and continued again until the ESS dropped too low.

Figure 6 shows the results for the number of times the particle filtering algorithm accessed each observation. The lower curve indicates the number of accesses. The first 100 observations show the greatest number of accesses (348 for this example) since they were also used to generate the initial sample. However, the additional observations were accessed infrequently. For example, the algorithm accessed observation #2000 only 14 times and observation #10000 only twice.

Figure 6: The frequency of access by observation. The horizontal line at 300 refers to the full MCMC run and the lower curve refers to the particle filter. The marks along the x-axis refer to occurrences of the rejuvenation step.

For comparison, the line at 300 in figure 6 indicates the number of times the Gibbs sampler, conditioned on the entire dataset, needed to access each observation. Each of the 150 iterations required one scan for the cluster assignments and a second scan for the parameter updates (21) and (22). The slightly larger values for the first 100 observations are due to their usage in determining the starting values for the Gibbs sampler. This starting value selection process was the same for both the particle filter and the full Gibbs sampler. Figure 6 shows a 95% reduction in the total number of data accesses when using the particle filter.

The tick marks along the bottom mark the points at which a rejuvenation step took place. Note that they are very frequent at first and decrease as the algorithm absorbs additional observations. The marginal posterior standard deviation decreases approximately like O(1/√n) so that the target is shrinking at a slower rate as we add more data. From the ESS approximation in (20) we can estimate the frequency of rejuvenation. As before, let n be the size of the initial sample. Now let N_k be the total number of observations accommodated at the k-th rejuvenation step. If we rejuvenate the θ_i's when the ESS drops to pM then the N_k are approximately related according to

    (n/N_1)^d ≈ p,  (N_1/N_2)^d ≈ p,  ...,  (N_{k−1}/N_k)^d ≈ p.    (23)

Unraveling the recursion implies that

    N_k ≈ n (1/p)^{k/d},    (24)

where d is the dimension of the parameter vector from theorem 1. When we let the ESS get very small before rejuvenation, equivalently setting p to be small, the N_k can become large quickly. Naturally, there will be a balance between loss in computing efficiency and estimation efficiency.
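Equation (24) can be turned into a small helper for anticipating the rejuvenation schedule; the inputs below (initial sample size n, ESS tolerance p as a fraction of M, and dimension d) mirror the notation above, and the specific numbers are only an illustration.

    def rejuvenation_points(n, p, d, k_max):
        """Approximate number of observations absorbed by the k-th rejuvenation, equation (24)."""
        return [n * (1.0 / p) ** (k / d) for k in range(1, k_max + 1)]

    # e.g. n = 100 initial observations, rejuvenate when the ESS falls to 10% of M, d = 17
    print(rejuvenation_points(n=100, p=0.1, d=17, k_max=5))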
Fortunately N_k grows exponentially in k, so that once k exceeds d, the effective number of parameters we are trying to estimate, N_k will grow quickly. Therefore, after approximately k = d rejuvenations the algorithm has absorbed enough data points so that it can withhold future rejuvenations until many more observations have been accommodated. While N_k grows exponentially with k, it grows only linearly with n, the number of observations in D_1. This implies that it may be better to spend more computational effort on the rejuvenation steps than the initial posterior sampling effort.

For the mixture example, the effective number of parameters is no more than 25. Each transition matrix is 4 × 4 with the constraint that the rows sum to 1, so each of the two transition matrices has 12 free parameters. With the single mixing fraction parameter the total parameter count is 25. With additional correlation amongst the parameters the effective number of parameters could be less. In fact in our example we found that equation (24) matches the observed frequency of rejuvenation to near perfection when d = 17.

While efficiency as measured with the number of data accesses is important in the analysis of massive datasets, precision of parameter estimates is also important. Figure 7 shows the marginal posterior distributions for the 16 transition probabilities from the first cluster's transition matrix. The histogram is based on the M = 150 draws using the particle filtering method. The overlaid density is based on a rigorous MCMC run with 3000 draws. The histogram and density plots are nearly identical except for small fluctuations. The posterior means from the two methods virtually overlap for each parameter. The figure also marks the location of the parameter value used to generate the data. All of these values are within range of the posterior mean to the extent that we would expect from sampling variability.

Figure 7: The posterior distribution of the transition probabilities for one of the transition matrices. The histogram is based on the particle filter while the black curve is the estimated density based on a rigorous 3000 draw MCMC run. The two darker vertical lines are the posterior means based on the particle filter and the rigorous run and are nearly identical if not overlapping. The dashed vertical line, which may be further away, is the true value used to simulate the dataset.

While achieving a 95% reduction in the number of data points accessed, the algorithm shows little if any loss in the estimation of the posterior distribution and the posterior mean. Note that increasing M does not change the number of data accesses for the particle filter, while each additional draw represents yet another scan for the standard implementation. If for some reason one was not confident in the particle filter results, one could generate additional MCMC iterations utilizing the entire dataset initiated from the particle filter draws. If the densities change little, then that would be evidence in favor of, but not necessarily proof of, the algorithm's estimation accuracy.

The example described involves a fairly small dataset. In additional experimentation we sampled 1000 draws using the particle filter from the posterior distribution conditioned on a dataset containing 1,406,000 observed processes. We observed similar performance metrics. The number of total observations accessed using the particle filter was 99.4% less than if we had used the standard MCMC implementation. At the same time, equation (24) maintains its prediction of the refresh rate for a model with d = 17 effective parameters. To condition on 1.4 million processes the particle filter had to refresh 56 times, the last refresh after incorporating 136,000 processes with a single linear scan. We continued to observe no loss in estimation precision as all the true parameter values used to simulate the data always lie in regions of high posterior probability.

5. DISCUSSION

MCMC methods have been almost completely absent from data mining research while they are widely used in modern statistical analysis of complex models. Indeed, when working with massive datasets the first order of business may be obtaining simple point estimates for unknown parameters. Inevitably, analysts want to explore other aspects of the posterior distribution besides simply the posterior mean or mode. But to date MCMC methods have simply been computationally infeasible for massive datasets.

Likelihood-based data squashing [16] is also a potential tool for making Bayesian analysis in massive datasets computationally feasible. It too uses the factorization of the likelihood (11) to avoid too many scans of the dataset. Likelihood-based data squashing locates a small number of data points or pseudo-data points with appropriate weights so that a weighted analysis of the pseudo-dataset would produce the same results as the unweighted analysis of the massive dataset.

It is possible that a posterior conditioned on the pseudo-dataset may offer a good importance sampling distribution so that some combination of data squashing, importance sampling, and particle filtering could provide a coherent solution.

While clearly the method needs to undergo more empirical work to test the boundaries of its limitations, the derivation and preliminary simulation work shows promise. If we can generally reduce the number of data accesses by 95%, MCMC becomes viable for a large class of models useful in data mining. The sequential nature of the algorithm also allows the analyst to stop when uncertainty in the parameters of interest has dropped below a required tolerance limit. Parallelization of the algorithm is rather straightforward. Each processor manages a small set of the weighted draws from the posterior and is responsible for updating their weights and computing the refresh step.

The last advantage that we discuss here involves convergence of the MCMC sampler. As noted in section 2.2, the key to MCMC begins with assuming that we have an initial draw from f(θ | x). While in practice the analyst usually just starts the chain from some reasonably selected starting point, the particle filter approach allows us to sample directly from the prior to initialize the algorithm. Sampling from the prior distributions often used in practice is usually simple. Then the particle filter can run its course starting with the first observation. Even though subsequent steps introduce dependence, the algorithm will always generate new draws from the correct distribution without approximation.

Bayesian analysis coupled with Markov chain Monte Carlo methods continues to revitalize many areas of statistical analysis. Some variant of the algorithm we propose here may indeed make this pair viable for massive datasets.

6. REFERENCES

[1] J. Besag, P. Green, D. Higdon, and K. Mengersen. Bayesian computation and stochastic systems (with discussion). Statistical Science, 10:3–41.

[2] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Visualization of navigation patterns on a web site using model-based clustering. Technical Report MSR-TR-00-18, Microsoft Research, March.

[3] B. Carlin and T. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, Boca Raton, FL, 2nd edition.

[4] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York.

[5] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag.

[6] J. Elder and D. Pregibon. A statistical perspective on knowledge discovery in databases. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 4. AAAI/MIT Press.

[7] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In Neural Information Processing Systems - NIPS 2001.

[8] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall, New York.

[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6.

[10] W. Gilks and C. Berzuini. Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society B, 63(1).

[11] W. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall.

[12] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1(1):11–28.
[13] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

[14] A. Kong, J. Liu, and W. Wong. Sequential imputation and Bayesian missing data problems. Journal of the American Statistical Association, 89.

[15] L. Le Cam and G. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer-Verlag, New York.

[16] D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway. Instance construction via likelihood-based data squashing. In H. Liu and H. Motoda, editors, Instance Selection and Construction - A Data Mining Perspective, chapter 12. Kluwer Academic Publishers.

[17] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21.

[18] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91–121.

[19] G. Ridgeway. Finite discrete Markov process clustering. Technical Report MSR-TR-97-24, Microsoft Research, September.

[20] G. Ridgeway and S. Altschuler. Clustering finite discrete Markov chains. In Proceedings of the Section on Physical and Engineering Sciences.

[21] S. M. Ross. Probability Models. Academic Press, 5th edition.

[22] D. Spiegelhalter and R. Cowell. Learning in probabilistic expert systems. In J. Bernardo, J. Berger, A. Dawid, and A. Smith, editors, Bayesian Statistics, volume 4. Clarendon Press, Oxford, 1992.


More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

7.1 Support Vector Machine

7.1 Support Vector Machine 67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

Collapsed Variational Inference for HDP

Collapsed Variational Inference for HDP Collapse Variational Inference for HDP Yee W. Teh Davi Newman an Max Welling Publishe on NIPS 2007 Discussion le by Iulian Pruteanu Outline Introuction Hierarchical Bayesian moel for LDA Collapse VB inference

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

A Modification of the Jarque-Bera Test. for Normality

A Modification of the Jarque-Bera Test. for Normality Int. J. Contemp. Math. Sciences, Vol. 8, 01, no. 17, 84-85 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1988/ijcms.01.9106 A Moification of the Jarque-Bera Test for Normality Moawa El-Fallah Ab El-Salam

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1

Angles-Only Orbit Determination Copyright 2006 Michel Santos Page 1 Angles-Only Orbit Determination Copyright 6 Michel Santos Page 1 Abstract This ocument presents a re-erivation of the Gauss an Laplace Angles-Only Methos for Initial Orbit Determination. It keeps close

More information

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary

More information

Survey-weighted Unit-Level Small Area Estimation

Survey-weighted Unit-Level Small Area Estimation Survey-weighte Unit-Level Small Area Estimation Jan Pablo Burgar an Patricia Dörr Abstract For evience-base regional policy making, geographically ifferentiate estimates of socio-economic inicators are

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Role of parameters in the stochastic dynamics of a stick-slip oscillator

Role of parameters in the stochastic dynamics of a stick-slip oscillator Proceeing Series of the Brazilian Society of Applie an Computational Mathematics, v. 6, n. 1, 218. Trabalho apresentao no XXXVII CNMAC, S.J. os Campos - SP, 217. Proceeing Series of the Brazilian Society

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Group Importance Sampling for particle filtering and MCMC

Group Importance Sampling for particle filtering and MCMC Group Importance Sampling for particle filtering an MCMC Luca Martino, Víctor Elvira, Gustau Camps-Valls Image Processing Laboratory, Universitat e València (Spain). IMT Lille Douai CRISTAL (UMR 989),

More information

YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL

YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL TOWARDS OPTIMAL SCALING OF METROPOLIS-COUPLED MARKOV CHAIN MONTE CARLO YVES F. ATCHADÉ, GARETH O. ROBERTS, AND JEFFREY S. ROSENTHAL Abstract. We consier optimal temperature spacings for Metropolis-couple

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

Problems Governed by PDE. Shlomo Ta'asan. Carnegie Mellon University. and. Abstract

Problems Governed by PDE. Shlomo Ta'asan. Carnegie Mellon University. and. Abstract Pseuo-Time Methos for Constraine Optimization Problems Governe by PDE Shlomo Ta'asan Carnegie Mellon University an Institute for Computer Applications in Science an Engineering Abstract In this paper we

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

UNIFYING PCA AND MULTISCALE APPROACHES TO FAULT DETECTION AND ISOLATION

UNIFYING PCA AND MULTISCALE APPROACHES TO FAULT DETECTION AND ISOLATION UNIFYING AND MULISCALE APPROACHES O FAUL DEECION AND ISOLAION Seongkyu Yoon an John F. MacGregor Dept. Chemical Engineering, McMaster University, Hamilton Ontario Canaa L8S 4L7 yoons@mcmaster.ca macgreg@mcmaster.ca

More information

Planar sheath and presheath

Planar sheath and presheath 5/11/1 Flui-Poisson System Planar sheath an presheath 1 Planar sheath an presheath A plasma between plane parallel walls evelops a positive potential which equalizes the rate of loss of electrons an ions.

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas The Role of Moels in Moel-Assiste an Moel- Depenent Estimation for Domains an Small Areas Risto Lehtonen University of Helsini Mio Myrsylä University of Pennsylvania Carl-Eri Särnal University of Montreal

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N.

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N. Submitte to the Journal of Rheology Moeling the effects of polyispersity on the viscosity of noncolloial har sphere suspensions Paul M. Mwasame, Norman J. Wagner, Antony N. Beris a) epartment of Chemical

More information

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency Transmission Line Matrix (TLM network analogues of reversible trapping processes Part B: scaling an consistency Donar e Cogan * ANC Eucation, 308-310.A. De Mel Mawatha, Colombo 3, Sri Lanka * onarecogan@gmail.com

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

WEIGHTING A RESAMPLED PARTICLES IN SEQUENTIAL MONTE CARLO (EXTENDED PREPRINT) L. Martino, V. Elvira, F. Louzada

WEIGHTING A RESAMPLED PARTICLES IN SEQUENTIAL MONTE CARLO (EXTENDED PREPRINT) L. Martino, V. Elvira, F. Louzada WEIGHTIG A RESAMLED ARTICLES I SEQUETIAL MOTE CARLO (ETEDED RERIT) L. Martino, V. Elvira, F. Louzaa Dep. of Signal Theory an Communic., Universia Carlos III e Mari, Leganés (Spain). Institute of Mathematical

More information

Revisiting Uncertainty in Graph Cut Solutions

Revisiting Uncertainty in Graph Cut Solutions Revisiting Uncertainty in Graph Cut Solutions Daniel Tarlow Dept. of Computer Science University of Toronto tarlow@cs.toronto.eu Ryan P. Aams School of Engineering an Applie Sciences Harvar University

More information

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling Case Stuy 5: Mixe Membership Moeling LDA Collapse Gibbs Sampler, VariaNonal Inference Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox May 8 th, 05 Emily Fox 05 Task : Mixe

More information

Nonlinear Adaptive Ship Course Tracking Control Based on Backstepping and Nussbaum Gain

Nonlinear Adaptive Ship Course Tracking Control Based on Backstepping and Nussbaum Gain Nonlinear Aaptive Ship Course Tracking Control Base on Backstepping an Nussbaum Gain Jialu Du, Chen Guo Abstract A nonlinear aaptive controller combining aaptive Backstepping algorithm with Nussbaum gain

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

A Minimum Variance Method for Lidar Signal Inversion

A Minimum Variance Method for Lidar Signal Inversion 468 J O U R N A L O F A T M O S P H E R I C A N D O C E A N I C T E C H N O L O G Y VOLUME 31 A Minimum Variance Metho for Liar Signal Inversion ANDREJA SU SNIK Centre for Atmospheric Research, University

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

eqr094: Hierarchical MCMC for Bayesian System Reliability

eqr094: Hierarchical MCMC for Bayesian System Reliability eqr094: Hierarchical MCMC for Bayesian System Reliability Alyson G. Wilson Statistical Sciences Group, Los Alamos National Laboratory P.O. Box 1663, MS F600 Los Alamos, NM 87545 USA Phone: 505-667-9167

More information