Fast Bayesian scan statistics for multivariate event detection and visualization


Research Article

Received 22 January 2010, Accepted 22 January 2010
Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.3881

Fast Bayesian scan statistics for multivariate event detection and visualization

Daniel B. Neill

The multivariate Bayesian scan statistic (MBSS) is a recently proposed, general framework for event detection and characterization in multivariate space-time data. MBSS integrates prior information and observations from multiple data streams in a Bayesian framework, computing the posterior probability of each type of event in each space-time region. MBSS has been shown to have many advantages over previous event detection approaches, including improved timeliness and accuracy of detection, easy interpretation and visualization of results, and the ability to model and accurately differentiate between multiple event types. This work extends the MBSS framework to enable detection and visualization of irregularly shaped clusters in multivariate data, by defining a hierarchical prior over all subsets of locations. While a naive search over the exponentially many subsets would be computationally infeasible, we demonstrate that the total posterior probability that each location has been affected can be efficiently computed, enabling rapid detection and visualization of irregular clusters. We compare the run time and detection power of this Fast Subset Sums method to our original MBSS approach (assuming a uniform prior over circular regions) on semi-synthetic outbreaks injected into real-world Emergency Department data from Allegheny County, Pennsylvania. We demonstrate substantial improvements in spatial accuracy and timeliness of detection, while maintaining the scalability and fast run time of the original MBSS method. Copyright 2011 John Wiley & Sons, Ltd.

Keywords: event detection; disease surveillance; scan statistics

1. Introduction

This work focuses on the task of disease surveillance, in which we monitor electronically available public health data sources, such as hospital visits and medication sales, to automatically detect emerging outbreaks of disease. Our goal is to develop disease surveillance systems which can identify outbreaks in the very early stages (typically, before a definitive diagnosis of the outbreak disease can be obtained), enabling a timely and effective public health response. The multivariate Bayesian scan statistic (MBSS) [1] is a recently proposed Bayesian framework for detection and characterization of emerging outbreaks. MBSS integrates information from multiple data streams, enabling more timely and more accurate event detection, and can model and differentiate between multiple event types. MBSS can be used to detect emerging outbreaks of disease, pinpoint the affected spatial region, distinguish between different outbreak diseases, and reduce the number of false-positive alerts due to irrelevant anomalies in the data [1]. By detecting and characterizing outbreaks in their early stages, MBSS can provide public health users with sufficient situational awareness to enable a rapid and informed response. MBSS has been shown to have numerous advantages over the frequentist spatial scan statistics approach [1]. By integrating information from multiple data streams and incorporating prior knowledge of an outbreak's effects on the monitored data streams, MBSS can achieve more timely and more accurate detection, detecting an average of 1.3 days faster than Kulldorff's multivariate scan statistic [2]. MBSS can model and accurately differentiate between multiple event types, thus distinguishing between patterns due to different outbreak diseases and those due to other irrelevant events in the data. Outbreak models can be specified by a domain expert or learned automatically from a small number of labeled training examples.
Randomization testing is not necessary in the Bayesian framework: since 999 or more Monte Carlo replications must typically be performed to obtain accurate p-values in the frequentist approach, avoiding the need for randomization yields an approximately 1000x speedup.

[Author footnote: H.J. Heinz III College, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, U.S.A. Correspondence to: Daniel B. Neill, H.J. Heinz III College, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, U.S.A. neill@cs.cmu.edu. Statist. Med. 2011.]

Finally, the results produced by MBSS (the posterior probability that each event type has affected each spatial region) are easy to interpret, visualize, and use for decision-making. MBSS computes the posterior probability Pr(H_1(S, E_k) | D) that each outbreak type E_k has affected each spatial region S, and these results can be visualized as a posterior probability map, showing the total probability

Pr(H_1(s_i) | D) = Sum_{E_k} Sum_{S: s_i in S} Pr(H_1(S, E_k) | D)

that each location s_i has been affected. The primary limitation of MBSS is a computational issue common to many spatial scan statistic methods: to compute the posterior probabilities, we must evaluate a huge number of spatial regions S. If we place no constraints on the search region, all 2^N subsets of the N locations must be considered, which is computationally infeasible for even moderately large N. MBSS, like other typical scan statistic approaches, solves this problem by only considering a reduced set of regions, placing restrictions on the region shape. As in Kulldorff's original spatial scan statistic approach [3], MBSS restricts the search space to the set of O(N^2) distinct circular regions centered at one of the N spatial locations. While this limitation of the search space makes MBSS computationally feasible, the assumption of circular outbreaks causes it to lose detection power for elongated or irregular outbreaks, as well as reducing its ability to accurately pinpoint the affected region. Here we propose a new extension of the MBSS framework, Fast Subset Sums (FSS), which enables timely and accurate detection and visualization of irregularly shaped clusters in multivariate data. FSS can efficiently compute the posterior probability map, as well as the total posterior probability of an outbreak, without computing the individual posterior probabilities of each spatial region.
It defines a hierarchical prior distribution over regions which assigns nonzero prior probability to each of the 2^N subsets of locations, and efficiently computes the summed posterior probabilities using this distribution.

2. Methods

We briefly review the recently proposed MBSS framework, discuss its limitations, and then present our new FSS method.

2.1. Multivariate Bayesian scan statistics

In the multivariate disease surveillance problem, our goal is to detect and characterize outbreaks based on their effects on the monitored data sources. We typically monitor aggregated count data for multiple spatial locations, time steps, and data streams. For example, to detect an influenza outbreak, we could monitor hospital Emergency Department (ED) visits, with each data stream representing the number of ED visits with a different symptom type (e.g. cough or fever). The MBSS framework [1] assumes a dataset D consisting of multiple data streams D_m, for m = 1...M. Each data stream consists of spatial time series data collected at a set of spatial locations s_i, for i = 1...N. For each stream D_m and location s_i, we have a time series of observed counts c^t_{i,m} and a time series of expected counts b^t_{i,m}, where t = 0 represents the current time step and t = 1...T represent the counts from 1 to T time steps ago, respectively. For example, a given observed count c^t_{i,m} might represent the total number of ED visits with fever symptoms, for a given zip code on a given day. The corresponding expected count b^t_{i,m} would be the expected number of ED visits with fever symptoms in that zip code on that day, estimated from the time series of historical fever counts for that zip code. Here we calculate expected counts using a simple, 28-day moving average, but other time series analysis methods can also be applied, in order to adjust for seasonal and day-of-week trends. MBSS is designed to detect emerging outbreaks, identify the type of outbreak (e.g.
distinguishing between anthrax and influenza), and pinpoint the affected locations. To do so, it compares the set of alternative hypotheses H_1(S, E_k), each representing the occurrence of some outbreak type E_k in some subset of locations S, against the null hypothesis H_0 that no outbreaks have occurred. These hypotheses are assumed to be mutually exclusive, and thus compound events (where multiple outbreaks are occurring simultaneously) must be modeled as separate hypotheses. The posterior probability of each hypothesis (given the dataset D) is computed using Bayes' theorem:

Pr(H_1(S, E_k) | D) = Pr(D | H_1(S, E_k)) Pr(H_1(S, E_k)) / Pr(D)

Pr(H_0 | D) = Pr(D | H_0) Pr(H_0) / Pr(D)
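The expected counts b^t_{i,m} described above are estimated by time series analysis of historical data; this paper uses a simple 28-day moving average. A minimal sketch in Python, with hypothetical count values (the paper's actual pipeline may differ in details such as seasonal adjustment):

```python
import numpy as np

def expected_counts(history: np.ndarray, window: int = 28) -> float:
    """Expected count for the current day: mean of the previous `window` daily counts."""
    return float(np.mean(history[-window:]))

# toy history of daily fever counts for one zip code (hypothetical values)
history = np.array([3, 5, 4, 6, 2] * 6)  # 30 days of counts
b = expected_counts(history)             # baseline for the current day
```

In practice one baseline b^t_{i,m} is computed per location, stream, and day, so the full baseline table has the same shape as the count data.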

Figure 1. Bayesian network representation of the MBSS method. Solid ovals represent observed quantities, and dashed ovals represent hidden quantities that are modeled. The counts c^t_{i,m} are directly observed, whereas the baselines b^t_{i,m} and the parameter priors for each stream (alpha_m, beta_m) are estimated from historical data.

In this expression, the posterior probability of each hypothesis is normalized by the total probability of the data:

Pr(D) = Pr(D | H_0) Pr(H_0) + Sum_{S, E_k} Pr(D | H_1(S, E_k)) Pr(H_1(S, E_k)).

The standard MBSS method [1] assumes a uniform prior Pr(H_1(S, E_k)) over all event types E_k and all circular spatial regions S. As discussed below, nonuniform priors can also be incorporated. In some cases, the prior distribution can be learned from a small number of labeled training examples [4]. The likelihood of the data given each hypothesis, Pr(D | H), is computed using the hierarchical Gamma-Poisson model shown in Figure 1. Each count c^t_{i,m} is assumed to be Poisson distributed: c^t_{i,m} ~ Poisson(q^t_{i,m} b^t_{i,m}), where b^t_{i,m} is the expected count (computed by time series analysis of historical data) and q^t_{i,m} is the relative risk. Under the null hypothesis of no outbreaks, each relative risk q^t_{i,m} is assumed to be generated from a Gamma distribution: q^t_{i,m} ~ Gamma(alpha_m, beta_m), where alpha_m and beta_m are estimated from the historical data for stream D_m, as described in [1]. Under the alternative hypothesis H_1(S, E_k), the expected value of each relative risk is increased by a multiplicative factor x^t_{i,m}, the impact of the outbreak for the given spatial location s_i, data stream D_m, and time step t: q^t_{i,m} ~ Gamma(x^t_{i,m} alpha_m, beta_m).
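The generative process just described is easy to simulate, which is useful for building intuition. A sketch in Python, with hypothetical parameter values; note that NumPy's gamma sampler is parameterized by shape and scale, so the paper's beta (interpreted here as a rate, giving E[q] = alpha/beta) enters as scale = 1/beta:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_count(b, alpha, beta, x=1.0):
    """Draw one count from the Gamma-Poisson model:
    q ~ Gamma(x * alpha, rate=beta), then c ~ Poisson(q * b).
    Under H0, x = 1; under H1(S, E_k), x > 1 for affected locations and time steps."""
    q = rng.gamma(shape=x * alpha, scale=1.0 / beta)  # relative risk
    return rng.poisson(q * b)                          # observed count

# hypothetical values: baseline of 20 visits, alpha = beta = 50 so E[q] = 1 under H0
c_null = sample_count(b=20.0, alpha=50.0, beta=50.0)
c_outbreak = sample_count(b=20.0, alpha=50.0, beta=50.0, x=1.4)  # 40 per cent higher in expectation
```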
For a given set of impacts X = {x^t_{i,m}}, the likelihood of the data can be computed from the Gamma-Poisson model as follows [1]:

Pr(D | X) = Prod_{i,m,t} Pr(c^t_{i,m} | b^t_{i,m}, x^t_{i,m}, alpha_m, beta_m) propto Prod_{i,m,t} (beta_m / (beta_m + b^t_{i,m}))^{x^t_{i,m} alpha_m} Gamma(x^t_{i,m} alpha_m + c^t_{i,m}) / Gamma(x^t_{i,m} alpha_m).

In this expression, terms not dependent on the x^t_{i,m} have been removed, since these are constant for all hypotheses under consideration. For the null hypothesis H_0, no events have occurred, and thus we assume x^t_{i,m} = 1 everywhere:

Pr(D | H_0) propto Prod_{i,m,t} (beta_m / (beta_m + b^t_{i,m}))^{alpha_m} Gamma(alpha_m + c^t_{i,m}) / Gamma(alpha_m).

For the alternative hypothesis H_1(S, E_k), we must marginalize over the values of x^t_{i,m}:

Pr(D | H_1(S, E_k)) = Sum_X Pr(D | X) Pr(X | H_1(S, E_k)).

The distribution of the impacts x^t_{i,m} is conditional on the outbreak type E_k, the affected region S, and two additional parameters: the outbreak severity theta and the temporal window W. The outbreak is assumed to affect those and only those locations in region S, for the most recent W time steps (t = 0...W-1). Thus we assume x^t_{i,m} = 1 for unaffected locations (s_i not in S) and for time steps before the start of the outbreak (t >= W). For affected locations and time steps (s_i in S and t < W), the impact x^t_{i,m} is computed as a function of the average impact x_{km,avg} of outbreak type E_k on data stream D_m, and the outbreak severity theta: x^t_{i,m} = 1 + theta (x_{km,avg} - 1). For example, if x_{km,avg} was equal to 1.4, an outbreak of typical severity (theta = 1) would increase counts by 40 per cent (x^t_{i,m} = 1.4), and an outbreak with severity theta = 2 would increase counts by 80 per cent (x^t_{i,m} = 1.8).
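The marginal likelihood term above is numerically delicate because of the Gamma functions, so it is usually computed in log space. A minimal sketch, assuming hypothetical parameter values and working only with the terms that depend on the impact x (the hypothesis-independent constants cancel in the ratio):

```python
import math

def log_marginal_term(c, b, alpha, beta, x=1.0):
    """Log of the Gamma-Poisson marginal likelihood term, up to constants shared
    by all hypotheses: (beta/(beta+b))^(x*alpha) * Gamma(x*alpha + c) / Gamma(x*alpha)."""
    a = x * alpha
    return a * (math.log(beta) - math.log(beta + b)) + math.lgamma(a + c) - math.lgamma(a)

def log_likelihood_ratio(c, b, alpha, beta, x):
    """log LR comparing impact x against the null (x = 1) for one count."""
    return log_marginal_term(c, b, alpha, beta, x) - log_marginal_term(c, b, alpha, beta, 1.0)

# hypothetical values: a count well above its baseline favours the outbreak hypothesis
llr = log_likelihood_ratio(c=35, b=20.0, alpha=50.0, beta=50.0, x=1.4)
```

A count well below baseline yields a negative log-likelihood ratio, i.e. evidence against the outbreak hypothesis.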

The average effects of each outbreak type on each stream, x_{km,avg}, can either be specified by a domain expert or learned from training data, as described in [1]. For the experiments below, we assume a simple, fixed set of three outbreak models E_1...E_3 for two data streams D_1 and D_2. E_1 assumes that only stream D_1 is affected (x_11 = 1.5 and x_12 = 1), E_2 assumes that only stream D_2 is affected (x_21 = 1 and x_22 = 1.5), and E_3 assumes that the outbreak has equal impact on both data streams (x_31 = x_32 = 1.5). We assume a discrete uniform distribution Theta for theta, and assume that W is drawn uniformly between 1 and W_max, where W_max is the maximum temporal window size. W_max is an important parameter for detection: larger values of W_max improve detection power for more gradually emerging outbreaks, whereas smaller values of W_max improve detection power for more rapidly emerging outbreaks [5]. We consider two typical values, W_max = 3 and W_max = 7, in our experiments below. The total likelihood of the data given the hypothesis H_1(S, E_k) can be computed by marginalizing over theta and W:

Pr(D | H_1(S, E_k)) = (1 / (W_max |Theta|)) Sum_{theta in Theta} Sum_{W = 1...W_max} Pr(D | H_1(S, E_k), theta, W).

As described in [1], the posterior probability map can be computed by MBSS in seven steps: loading the count data, computing baselines, computing parameter priors, computing location likelihood ratios, computing region likelihood ratios, computing region posterior probabilities, and computing the posterior probability map. The first three steps can be performed in time proportional to the size of the dataset, O(NMT), where N, M, and T are the numbers of locations, streams, and time steps, respectively. The fourth step is to pre-compute the likelihood ratios

LR_i = Prod_{m = 1...M} Prod_{t = 0...W-1} Pr(c^t_{i,m} | b^t_{i,m}, x_m alpha_m, beta_m) / Pr(c^t_{i,m} | b^t_{i,m}, alpha_m, beta_m)

for each location s_i, for each combination of the outbreak type E_k, temporal window W, and severity theta.
Then the likelihood ratio for a given region S (given E_k, W, and theta) can be computed by multiplying the likelihood ratios LR_i for all s_i in S. This formulation has the benefit that the expensive likelihood ratio computations are only performed a number of times proportional to the number of locations, rather than the much larger number of regions. The fifth step of the MBSS framework is to multiply the location likelihood ratios (or add log-likelihood ratios) to obtain the likelihood ratio for each spatial region S, for each combination of E_k, W, and theta. The sixth step requires computation of posterior probabilities for each combination of outbreak type E_k and spatial region S, marginalizing over W and theta, and the seventh step computes the posterior probability map by summing the posterior probabilities of all regions containing each spatial location. Each of the last three steps requires computation time proportional to the number of regions S, and thus MBSS is computationally expensive when the number of regions is large. The standard MBSS method deals with this computational issue by considering only a small fraction of the O(2^N) possible subsets of the N spatial locations, limiting its search to circular regions S and assuming that all non-circular regions have zero prior probability. As in Kulldorff's original spatial scan statistic [3], MBSS examines the set of O(N^2) circular regions S, each containing a center location s_c and its k - 1 nearest neighbors (as measured by distance between the zip code centroids), for k = 1...N.
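The circular search space just described is simply the set of k-nearest-neighbor sets of each center. A minimal sketch, with hypothetical centroid coordinates (real inputs would be the zip code centroids):

```python
import numpy as np

def circular_regions(coords: np.ndarray, k_max: int):
    """Enumerate the distinct circular regions: for each center location, the set
    containing the center and its k - 1 nearest neighbors, for k = 1...k_max."""
    n = len(coords)
    regions = []
    for c in range(n):
        d = np.linalg.norm(coords - coords[c], axis=1)  # distances to all centroids
        order = np.argsort(d)                           # order[0] is the center itself
        for k in range(1, min(k_max, n) + 1):
            regions.append(frozenset(order[:k].tolist()))
    return set(regions)  # duplicates removed, so at most N * k_max distinct regions

# five hypothetical zip-code centroids
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
regions = circular_regions(coords, k_max=5)
```

With k_max = N this yields the O(N^2) regions searched by the standard MBSS method; FSS reuses the same center/neighborhood structure but sums over all subsets of each neighborhood.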
As noted above, MBSS typically assumes a uniform region prior Pr(H_1(S, E_k) | E_k) = 1/N_S, where N_S is the total number of space-time regions, but nonuniform priors can also be incorporated, and various methods for learning these priors from data [1, 4] have been explored.

2.2. Fast subset sums

As noted above, the primary limitation of the standard MBSS method is its exhaustive computation over spatial regions S, requiring us to restrict the set of spatial regions considered to only a small fraction of the O(2^N) possible subsets of locations, e.g. only considering circular regions. This restriction prevents MBSS from searching over the huge number of irregularly shaped regions, reducing its detection power and spatial detection accuracy for highly elongated or irregular regions. However, two key insights enable us to circumvent this limitation. First, both the total posterior probability of an outbreak and the posterior probability map are sums of the posterior region probabilities Pr(H_1(S, E_k) | D): the total posterior probability Pr(H_1 | D) is a sum over all spatial regions, and the posterior probability of each spatial location s_i is a sum over all spatial regions which contain s_i. We show that each of these sums can be calculated efficiently without computing the individual posterior probabilities of each spatial region S. Second, we define a specific, nonuniform prior distribution Pr(H_1(S, E_k) | E_k) over all 2^N subsets of the data. This prior distribution assigns a non-zero prior probability to any given subset of the data S, but more compact clusters have larger priors, thus enforcing a soft constraint on spatial proximity. Most importantly, as we demonstrate below, the prior distribution has a hierarchical structure which enables us to efficiently compute sums of posterior probabilities over exponentially many regions in linear time.

We consider a hierarchical region prior conditioned on two latent variables: the center location s_c and the neighborhood size k. We assume a two-stage process in which s_c and k are drawn from their respective distributions, and then the subset S is chosen conditional on s_c and k. As in the original MBSS method (searching over circular regions), we can assume that s_c is drawn uniformly at random from the set of spatial locations, and that the neighborhood size k is drawn uniformly at random between 1 and some constant k_max (the maximum neighborhood size). We typically use k_max = N, the number of spatial locations, but smaller values of k_max can be used to reduce computation time, as shown below. Nonuniform distributions for the center s_c and neighborhood size k can easily be incorporated into our method, but the experiments described below assume uniform distributions. Once we have drawn the center location s_c and neighborhood size k, we then consider location s_c and its k - 1 nearest neighbors, and choose a subset of locations uniformly at random from the 2^k possible subsets of these k locations. Thus the probability of a given center s_c, neighborhood size k, and region S is Pr(s_c, k, S) = Pr(s_c) Pr(k) / 2^k if all locations s_i in S are contained in the k-neighborhood of s_c, and 0 otherwise. The total prior probability of a given region S can be computed by marginalizing over s_c and k. For uniform distributions of s_c and k, with k_max = N, we can compute

Pr(S) = Sum_{s_c} (2^{N - k_c + 1} - 1) / (N^2 2^N),

where k_c is the minimum neighborhood size such that the k_c-neighborhood of s_c contains all elements of S. This expression is roughly proportional to 2^{-k_min}, where k_min = min_{s_c} k_c is a measure of the compactness of region S. The advantage of this formulation is that, for a given center location s_c and a given neighborhood size k, we can compute the total posterior probability of the 2^k spatial regions in O(k) time.
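The two-stage prior can be computed directly for small N by marginalizing over centers and neighborhood sizes; this also confirms that it is a proper distribution and that compact subsets receive more mass. A sketch with four hypothetical centroids:

```python
import itertools
import numpy as np

def subset_prior(S, coords):
    """Prior Pr(S) under the hierarchical model: draw a center s_c and a neighborhood
    size k uniformly (with k_max = N), then a uniform subset of the k-neighborhood."""
    n = len(coords)
    total = 0.0
    for c in range(n):
        d = np.linalg.norm(coords - coords[c], axis=1)
        order = list(np.argsort(d))  # order[0] is the center itself
        # minimum neighborhood size k_c such that the k_c-neighborhood of s_c contains S
        k_c = 1 if not S else 1 + max(order.index(i) for i in S)
        # Pr(s_c) * Pr(k) = 1/n^2; subset probability is 2^-k for each feasible k
        total += sum(2.0 ** (-k) for k in range(k_c, n + 1)) / n ** 2
    return total

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.1], [2.0, 2.0]])
n = len(coords)
priors = {S: subset_prior(S, coords)
          for r in range(n + 1)
          for S in map(frozenset, itertools.combinations(range(n), r))}
```

Summing `priors` over all 2^N subsets gives 1, and a spatially compact pair such as {0, 1} receives a larger prior than the spread-out pair {0, 3}, illustrating the soft proximity constraint.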
As we must consider O(N) center locations and O(k_max) neighborhood sizes, this approach enables us to compute the total posterior probability in time O(N k_max^2). To do so, we define S_ck as the k-neighborhood of center location s_c (i.e. s_c and its k - 1 nearest neighbors). Then, conditioning on the outbreak type E_k, severity theta, temporal window W, center location s_c, and neighborhood size k, we can compute the total posterior probability that any region S contained in S_ck has been affected. First, we know

Sum_{S in S_ck} Pr(S | D) propto Sum_{S in S_ck} Pr(S) LR(S),

where Pr(S) is the prior probability of region S, conditioned on E_k, s_c, and k, and LR(S) is the likelihood ratio of region S, Pr(D | H_1(S, E_k)) / Pr(D | H_0), conditioned on E_k, theta, and W. Second, we know that

Sum_{S in S_ck} Pr(S) LR(S) = (1/2^k) Sum_{S in S_ck} LR(S),

as we are assuming a uniform prior over the 2^k subsets of S_ck. Third, we know that

Sum_{S in S_ck} LR(S) = Sum_{S in S_ck} Prod_{s_i in S} LR_i,

where LR_i is the likelihood ratio of location s_i, conditioned on E_k, theta, and W. Fourth, as we are summing over all 2^k subsets of S_ck, we can write the sum of 2^k products as a product of k sums:

Sum_{S in S_ck} Prod_{s_i in S} LR_i = Prod_{s_i in S_ck} (1 + LR_i).

Fifth, we obtain the final result:

Sum_{S in S_ck} Pr(S | D) propto (1/2^k) Prod_{s_i in S_ck} (1 + LR_i) = Prod_{s_i in S_ck} ((1 + LR_i) / 2).

Thus the posterior probability of an outbreak, conditioned on the outbreak type E_k, temporal window W, severity theta, center location s_c, and neighborhood size k, is proportional to the product of the smoothed likelihood ratios (1 + LR_i) / 2 for all locations s_i in S_ck. This is very similar to the case of circular regions, where the posterior probability of an outbreak (again, conditioned on E_k, W, theta, s_c, and k) was proportional to the product of the unsmoothed likelihood ratios LR_i. In each case, we must compute the total posterior probability of an outbreak by marginalizing over E_k, W, theta, s_c, and k. Finally, we consider how the posterior probability map can be efficiently computed.
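Before turning to the map, the total-posterior computation just derived can be sketched directly: for fixed event type, window, and severity, average the product of smoothed likelihood ratios over all centers and neighborhood sizes, extending each neighborhood one location at a time with a running product. Likelihood ratios and coordinates here are hypothetical:

```python
import numpy as np

def total_posterior_score(lr: np.ndarray, coords: np.ndarray, k_max: int) -> float:
    """Unnormalized total posterior probability of an outbreak, conditioned on event
    type, temporal window, and severity: the average over centers s_c and neighborhood
    sizes k of prod_{i in S_ck} (1 + LR_i) / 2, using a running product per center."""
    n = len(coords)
    km = min(k_max, n)
    score = 0.0
    for c in range(n):
        order = np.argsort(np.linalg.norm(coords - coords[c], axis=1))
        smoothed = (1.0 + lr[order]) / 2.0
        running = 1.0
        for k in range(1, km + 1):
            running *= smoothed[k - 1]   # extend the neighborhood by one location
            score += running / (n * km)  # uniform Pr(s_c) * Pr(k)
    return score

# hypothetical per-location likelihood ratios: location 2 looks anomalous
lr = np.array([0.9, 1.1, 8.0, 1.0, 0.8])
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
score = total_posterior_score(lr, coords, k_max=5)
```

As a sanity check, if every LR_i equals 1 (no evidence either way), every smoothed ratio is 1 and the score is exactly 1; evidence concentrated near one center raises it above that.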
To do so, we can efficiently compute the posterior probability of an outbreak affecting a given location s_j, conditioned on the outbreak type E_k, temporal window W, severity theta, center location s_c, and neighborhood size k, using a procedure very similar to the above. The only difference is that we must compute

Sum_{S in S_ck: s_j in S} Pr(S | D) propto (1/2^k) Sum_{S in S_ck: s_j in S} Prod_{s_i in S} LR_i.

In this case, we are summing over all 2^{k-1} subsets of S_ck that contain s_j, and can write the sum of 2^{k-1} products as LR_j times a product of k - 1 sums:

Sum_{S in S_ck: s_j in S} Prod_{s_i in S} LR_i = LR_j Prod_{s_i in S_ck \ {s_j}} (1 + LR_i),

and thus

Sum_{S in S_ck: s_j in S} Pr(S | D) = (LR_j / (1 + LR_j)) Sum_{S in S_ck} Pr(S | D).

We can compute the total posterior probability of an outbreak containing each location s_j by marginalizing over E_k, W, theta, s_c, and k, thus enabling efficient computation of the posterior probability map.

2.3. Related work

This work describes FSS, an extension of our recently proposed MBSS framework [1] which enables computationally efficient detection of irregularly shaped outbreaks. MBSS extends our Bayesian event detection framework [6] to integrate information from multiple data streams, improving detection power and reducing false positives, and also enables us to accurately model and distinguish between multiple outbreak types. The Bayesian scan statistic is a variant of the more traditional, hypothesis testing approach to spatial scan statistics, developed by Kulldorff and Nagarwalla [3, 7]. While Kulldorff's original spatial scan statistic [3] did not take the time dimension into account, later work generalized this method to the space-time scan statistic by scanning over variable-size temporal windows [8, 9]. Recent extensions, such as the expectation-based scan statistic [10] and model-based scan statistic [11], use historical data to model the expected distribution of counts in each spatial location, and the expectation-based scan statistic approach is also applied in the present work to obtain the expected counts b^t_{i,m}. Many other variants of the spatial and space-time scan statistics have been proposed, differing in both the set of regions to be searched and the underlying statistical models. Various statistical models have been proposed for the spatial scan, ranging from simple Poisson and Gaussian statistics [10, 12] to robust and nonparametric models [13, 14].
While Kulldorff's original method [3] assumed circular search regions, other methods have searched over rectangles [15], ellipses [16], and various sets of irregularly shaped regions [17-19]. We note that previous spatial scan methods for detecting irregularly shaped regions either restrict the search space (reducing power for any outbreak regions which do not fit the assumed distribution) or perform a heuristic search, in which case they are not guaranteed to find an optimal or near-optimal region; our FSS method, on the other hand, performs an exact computation over all O(2^N) subsets of locations, though it computes the posterior probability map rather than identifying the specific subset of greatest interest. Two multivariate extensions of the frequentist spatial scan have recently been proposed: Kulldorff's parametric scan [2], which directly extends the original spatial scan statistic to multiple data streams by assuming that all data streams are independent, and the nonparametric scan [14], which combines empirical p-values from multiple data streams without relying on an underlying parametric model. Unlike MBSS and FSS, neither of these two methods can differentiate between multiple outbreak types. Several other multivariate disease surveillance methods have been proposed, including multivariate extensions of traditional time series analysis methods [20-22] and network-based methods [23] that detect anomalous ratios of counts between streams. These purely temporal methods do not take spatial information into account: they may be used to detect anomalous increases in the aggregate time series of the entire area being monitored, rather than detecting and pinpointing a spatial cluster of affected locations. Additionally, these methods cannot model and differentiate between multiple event types. PANDA [24, 25] uses Bayesian network models to differentiate between multiple outbreak types (e.g.
the Centers for Disease Control and Prevention (CDC) Category A diseases), assuming an underlying entity-based model of ED visits. We have recently developed a multivariate model that incorporates spatial information into PANDA, using ED chief complaint data as evidence [26]. We have also recently developed other efficient search methods based on linear-time subset scanning (LTSS), which allow us to efficiently perform unconstrained maximization of certain univariate likelihood ratio statistics over the 2^N subsets of the data [27]. These methods cannot be easily applied to MBSS because of the added complications of non-uniform prior probabilities, multivariate data, and (most of all) the need to calculate the denominator of the posterior probability, which requires evaluation and summation of the posterior probabilities of all hypotheses under consideration. Additionally, LTSS methods do not guarantee a solution to constrained maximization problems, and thus are difficult to apply in spatial detection scenarios where we wish to enforce spatial proximity constraints on the detected regions.
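The two subset-sum identities at the heart of Fast Subset Sums, the closed form for the total posterior over a neighborhood and the LR_j / (1 + LR_j) factor for the location posterior, can be checked by brute force on a small neighborhood. The likelihood-ratio values below are hypothetical:

```python
import itertools

# hypothetical likelihood ratios for a k-neighborhood S_ck of 4 locations
lr = [0.5, 2.0, 8.0, 1.2]
k = len(lr)

def prod(xs):
    """Product of an iterable (1.0 for the empty product)."""
    p = 1.0
    for x in xs:
        p *= x
    return p

# brute force: sum of prod(LR_i) over all 2^k subsets, under the uniform 1/2^k prior
subsets = [s for r in range(k + 1) for s in itertools.combinations(range(k), r)]
total_bf = sum(prod(lr[i] for i in s) for s in subsets) / 2 ** k

# closed form: product of the smoothed likelihood ratios (1 + LR_i) / 2
total_cf = prod((1.0 + x) / 2.0 for x in lr)

# location posterior: subsets containing location j, vs. LR_j / (1 + LR_j) * total
j = 2
loc_bf = sum(prod(lr[i] for i in s) for s in subsets if j in s) / 2 ** k
loc_cf = lr[j] / (1.0 + lr[j]) * total_cf
```

For this neighborhood the 2^4 = 16-term brute-force sums agree exactly with the O(k) closed forms, which is the computational shortcut FSS exploits at scale.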

3. Evaluation

We now compare the run time, detection power, and spatial accuracy of the FSS method to the original MBSS. While FSS and MBSS both rely on the Bayesian probabilistic framework [1] described briefly above, MBSS assumes a uniform prior over the set of O(N^2) circular regions, whereas FSS assumes a nonuniform, hierarchical prior which is nonzero for all O(2^N) subsets of the data. Thus we expect FSS to improve timeliness and accuracy of detection for elongated or irregular outbreak regions, as these are not modeled well by the assumption of circular regions. However, since a much larger set of regions must be considered, we expect the run time of FSS to be slower than that of the original MBSS method, though much faster than a naive search over all O(2^N) subsets, which would be computationally infeasible for even moderate values of N. The following subsections describe our test dataset (ED data from Allegheny County), compare the run time of FSS to the MBSS and naive subset sums methods, describe our outbreak simulations, and compare the detection power and spatial accuracy of FSS and MBSS for these simulated outbreaks.

3.1. Description of emergency department data

We obtained a dataset of de-identified ED visit records collected from 10 Allegheny County hospitals from January 1, 2004 to December 31, 2005. Each record contains fields for the patient's date of admission to the ED, home zip code, chief complaint (free text), and International Classification of Diseases (ICD9) code (numeric). We removed records where the home zip code or admission date was missing, or where the home zip code was outside Allegheny County, leaving 64.8 per cent of the original records. The free-text chief complaint was present for all remaining records, and the ICD9 code was present for 84.7 per cent of the remaining records.
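The record-cleaning step above is a straightforward filter on required fields. A minimal sketch; the field names and zip codes are hypothetical (the illustrative zip set stands in for the full list of 97 Allegheny County zip codes):

```python
# illustrative subset of in-county zip codes, not the full list of 97
ALLEGHENY_ZIPS = {"15213", "15217", "15232"}

records = [
    {"date": "2004-01-01", "zip": "15213", "complaint": "cough x 3 days"},
    {"date": "2004-01-01", "zip": None,    "complaint": "nausea"},   # missing zip: dropped
    {"date": "2004-01-02", "zip": "10001", "complaint": "fever"},    # outside county: dropped
    {"date": None,         "zip": "15217", "complaint": "sob"},      # missing date: dropped
]

# keep records with a valid admission date and an in-county home zip code
kept = [r for r in records
        if r["date"] is not None and r["zip"] in ALLEGHENY_ZIPS]
```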
From this data, we created two distinct streams of count data, cough/dyspnea (CD) and nausea/vomiting (NV), by recording the number of patient records matching the given symptoms in each zip code for each day. A patient record was determined to match a given set of symptoms if its chief complaint string contained certain substrings, or if its ICD9 code matched certain values: for the CD stream, we included records with chief complaints containing the substrings "cough", "dyspnea", "shortness", or "sob", or with ICD9 codes corresponding to cough or shortness of breath. For the NV stream, we included records with chief complaints containing the substrings "naus", "vom", or "n/v", or with ICD9 codes corresponding to nausea and/or vomiting or persistent vomiting. Each set of records was manually refined to remove spurious substring matches. The time series of daily counts (aggregated over all 97 Allegheny County zip codes) for each data stream is shown in Figure 2. The CD stream had a mean daily count of 44.0 cases, with a standard deviation of 12.1, and the NV stream had a mean daily count of 25.9 cases, with a standard deviation of 7.0. As these cases were spread over the 97 Allegheny County zip codes, many zip codes had zero counts on any given day. Both streams were weakly overdispersed, with an index of dispersion (ratio of variance to mean) of 3.31 for the CD data and 1.88 for the NV data, and the daily counts of the two streams were positively correlated (r = 0.338). Both streams exhibited slight (but statistically significant) day-of-week trends, with counts peaking on Mondays, and clear seasonal trends, with counts peaking in February and March for the CD and NV datasets, respectively.

3.2. Comparison of run times

Our first experiment compared the run times of FSS, Naive Subset Sums (searching exhaustively over all subsets of locations), and the original MBSS method (searching over circular regions), as a function of the maximum neighborhood

Figure 2.
Daily counts for two streams of Allegheny County Emergency Department data (cough/dyspnea and nausea/vomiting) from January 1, 2004 to December 31, 2005: (a) CD and (b) NV.

Figure 3. Total run times for 100 days of data, for Fast Subset Sums (FSS), naive subset sums, and the original MBSS approach, as a function of the maximum neighborhood size k_max.

size k_max. Increasing the neighborhood size requires a larger number of regions to be searched: the number of search regions with neighborhood size k in {1...k_max} scales linearly with k_max if only circular regions are considered, and exponentially with k_max if we consider all subsets of the center and its k - 1 nearest neighbors. We computed the time required by each method to compute the posterior probability maps for the first 100 days of Allegheny County ED data, for each value of k_max. For each method, we used the same two data streams (CD and NV), the same three outbreak models (assuming one outbreak type that affects only CD, one outbreak type that affects only NV, and one that affects both streams equally), and the same maximum temporal window size of three days (W_max = 3). As shown in Figure 3, the run time of the original MBSS method increased gradually with increasing k_max, up to its maximum (to process 100 days of data) at k_max = 90. The run time of the Naive Subset Sums method increased exponentially, making it computationally infeasible for k_max >= 25. Run times were 5.02 h for 100 days of data for k_max = 15, and 161 h for 100 days of data for k_max = 20; for k_max = 30, the naive method would have required an estimated 68.5 days to process a single day of data, and for k_max = 90, its estimated run time would be measured in years. Finally, we observe that the run time of FSS scales quadratically with increasing k_max, up to its maximum (to process 100 days of data) at k_max = 90. Thus, while FSS is approximately 7.5x slower than the original MBSS method, it is still extremely fast, computing the posterior probability map for each day of data in under 9 s.
We also considered a fixed k_max = 10, and examined the effects of doubling the number of data streams (from 2 to 4), doubling the number of event models (from 3 to 6), and doubling the maximum temporal window size (from 3 to 6), respectively. Each of these three changes increased each method's run time proportionally, demonstrating the expected linear dependence of run time on these three parameters.

Simulation of outbreaks

Next we used a semi-synthetic testing framework (injecting simulated multivariate outbreaks into the real-world ED data) to compare the detection power and spatial accuracy of the FSS and MBSS methods. We considered a simple class of simulated outbreaks with a linear increase in the expected number of cases over the duration of the outbreak. More precisely, our outbreak simulator takes three parameters: the outbreak duration T, the outbreak severity Δ, and the subset of affected zip codes S_inject. For each injected outbreak, the simulator chooses the start date t_start uniformly at random. On each day t of the outbreak, t = 1...T, the simulator injects Poisson(t · w_{i,m} · Δ_m) cases into each stream m of each affected zip code s_i, where w_{i,m} is the weight of that zip code for that stream. We considered 10 differently shaped outbreak regions S_inject, as shown in Figures 5–14 (left panels). All outbreaks were assumed to be two weeks in duration (T = 14), and we assumed Δ_CD = Δ_NV = 1. For each outbreak region, we considered 200 different, randomly generated outbreaks, giving a total of 2000 outbreaks for evaluation.

We note that simulation of outbreaks is an active area of ongoing research in biosurveillance. The creation of realistic outbreak scenarios is important because of the difficulty of obtaining sufficient labeled data from real outbreaks, but is also very challenging. State-of-the-art outbreak simulations, such as those of Buckeridge et al. [28] and Wallstrom et al. [29], combine disease trends observed from past outbreaks with information about the current background data into which the outbreak is being injected, and also allow the user to adjust parameters such as outbreak duration and severity. While the simple linear outbreak model that we use here is not a realistic model of the temporal progression of an outbreak, it enables precise comparison of the detection power of different methods, gradually ramping up the severity of the outbreak until it is detected.

Table I. Comparison of MBSS and FSS methods, using maximum temporal window size W_max = 3.

Outbreak region   MBSS (per cent detected)   FSS (per cent detected)
1                 96.0                       98.0
2                 96.0                       96.5
3                 71.5                       85.0
4                 83.5                       90.5
5                 68.5                       95.0
6                 97.5                       99.5
7                 100                        100
8                 100                        100
9                 100                        100
10                99.0                       100
Average           91.2                       96.5

Average days to detection, and proportion of outbreaks detected, at a fixed false-positive rate of 1/month.

Table II. Comparison of MBSS and FSS methods, using maximum temporal window size W_max = 7.

Outbreak region   MBSS (per cent detected)   FSS (per cent detected)
1                 97.0                       97.0
2                 99.0                       99.0
3                 65.0                       79.0
4                 78.5                       83.5
5                 62.5                       94.5
6                 99.0                       100
7                 100                        100
8                 100                        100
9                 100                        100
10                100                        100
Average           90.1                       95.3

Average days to detection, and proportion of outbreaks detected, at a fixed false-positive rate of 1/month.

Comparison of detection power

We compared the detection power of the FSS and MBSS methods for each of the 10 outbreak regions, considering two values of the maximum temporal window size (W_max = 3 and W_max = 7) for each method. We fixed the value of k_max = N, i.e.
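The linear-ramp injection procedure described above can be sketched in a few lines. This is an illustrative sketch, not the paper's simulator: parameter names (T, delta, weights) follow the text, and the Poisson sampler uses Knuth's method so the sketch stays dependency-free; the zip codes and weights below are made up.

```python
import math
import random

def poisson_sample(lam, rng):
    # Knuth's method: adequate for the small means used here.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def inject_outbreak(T, delta, weights, rng=None):
    """For each outbreak day t = 1..T, inject Poisson(t * w[i][m] * delta[m])
    cases into stream m of each affected zip code i (a linear ramp in t)."""
    rng = rng or random.Random(0)
    injected = []
    for t in range(1, T + 1):
        day = {i: [poisson_sample(t * w_m * delta[m], rng)
                   for m, w_m in enumerate(w_i)]
               for i, w_i in weights.items()}
        injected.append(day)
    return injected

# Two affected zip codes, two streams (e.g. CD and NV), severity 1 for each:
counts = inject_outbreak(T=14, delta=[1.0, 1.0],
                         weights={"15213": [0.5, 0.5], "15217": [1.0, 0.2]})
```

The expected injected count grows linearly in t, so a method's detection day directly reflects the severity level it needs before triggering.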
the MBSS method considers all O(N²) circular regions and the FSS method considers all O(2^N) subsets of the data. For each combination of method and outbreak region, we computed the method's proportion of outbreaks detected and average number of days to detect as a function of the allowable false-positive rate. To do this, we first computed the maximum region score F* = max_S F(S) for each day of the original dataset with no outbreaks injected (the first 84 days of data are excluded, since these are used to calculate baseline estimates for our methods). Then for each injected outbreak, we computed the maximum region score for each outbreak day, and determined what proportion of the days of the original dataset have higher scores. Assuming that the original dataset contains no outbreaks, this is the proportion of false positives that we would have to accept in order to have detected the outbreak on day t. For a fixed false-positive rate r, the days to detect for a given outbreak is computed as the first outbreak day (t = 1...14) with proportion of false positives less than r. If no day of the outbreak has a proportion of false positives less than r, the method has failed to detect that outbreak: for the purposes of our days-to-detect calculation, these are counted as 14 days to detect, but could also be penalized further. The detection performance of the MBSS and FSS methods is presented in Tables I and II. For each of the 10 outbreak regions, we present each method's average detection time and percentage of outbreaks detected at a fixed false-positive rate of 1/month. We also show the AMOC curves (average detection time as a function of false-positive rate) for MBSS and FSS, averaged over all 10 outbreak regions, in Figure 4.
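The evaluation procedure above can be sketched directly. This is a hedged illustration of the described calculation, not the paper's code: thresholds come from the outbreak-free baseline days, and an outbreak counts as detected on the first day whose maximum region score exceeds all but a fraction r of the baseline scores. The score values below are invented.

```python
# Days-to-detect at a fixed false-positive rate, as described in the text.

def false_positive_proportion(score, baseline_scores):
    # Fraction of outbreak-free days whose maximum region score is higher.
    higher = sum(1 for s in baseline_scores if s > score)
    return higher / len(baseline_scores)

def days_to_detect(outbreak_scores, baseline_scores, r):
    # First outbreak day (1-based) detectable at false-positive rate r;
    # undetected outbreaks are penalized as 14 days, as in the text.
    for t, score in enumerate(outbreak_scores, start=1):
        if false_positive_proportion(score, baseline_scores) < r:
            return t
    return 14

baseline = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.2, 0.6, 0.1, 0.7]
outbreak = [0.2, 0.3, 0.5, 0.8, 1.0] + [1.2] * 9
print(days_to_detect(outbreak, baseline, r=0.1))  # → 5
```

Sweeping r over a grid and averaging days-to-detect across outbreaks yields the AMOC curves discussed below.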

Figure 4. AMOC curves for FSS and MBSS, for maximum temporal window sizes 3 and 7. Average days to detection as a function of false-positive rate.

Figure 5. Outbreak region #1 (compact cluster, downtown Pittsburgh). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

From the AMOC curves, we see that FSS consistently achieves more timely detection than MBSS across a range of false-positive rates. For a fixed false-positive rate of 1/month, FSS detected an average of one day earlier than MBSS for a maximum temporal window size W_max = 3, and 0.54 days earlier for W_max = 7, with fewer than half as many missed outbreaks in each case. Comparing the detection performance of FSS and MBSS for each of the 10 outbreak regions, we observe that both methods achieve very similar detection times for compact outbreak regions (#1, #2, and #8). For highly elongated outbreak regions (#5 and #10), FSS detected between 1.3 and 2.2 days earlier. For moderately elongated and irregularly shaped regions, the difference between the methods was smaller, for both W_max = 3 and W_max = 7. Both methods achieved more timely detection when W_max = 3, suggesting that the given outbreaks did not emerge gradually enough for the increased temporal window size to improve performance.

Comparison of spatial accuracy

For each of the 10 outbreak regions, we compared the spatial accuracy of the MBSS and FSS methods. The center panels of Figures 5–14 show the average posterior probability maps for the original MBSS method for each of the 10 outbreak regions at the midpoint of the outbreak, averaged over the 200 randomly generated outbreaks affecting that region. The right panels of Figures 5–14 show the average posterior probability maps for FSS for each outbreak region at the midpoint of the outbreak. We used a maximum temporal window size W_max = 7 for both methods.
Comparing each averaged map to the true affected regions (left panels of Figures 5–14), we observe that FSS more accurately estimates the spatial extent of the outbreak region. The detected regions were very similar for the compact outbreaks (#1, #2, and #8). For the elongated outbreaks (#4, #5, #9, and #10), FSS closely approximated the true outbreak region, whereas MBSS was not able to accurately distinguish the true region. Finally, for the irregular and disjoint outbreaks (#3, #6, and #7), FSS was better able to fit the irregular shape, whereas MBSS typically reported a compact cluster containing the irregular shape and several additional (incorrect) zip codes.

To quantify these differences in spatial detection accuracy, we computed the average overlap coefficient, precision, and recall for each method for each of the 10 outbreak types, as a function of the outbreak day. To compute the overlap coefficient for a given day of a given outbreak, we first compute the set of detected zip codes: we find the zip code with the largest posterior outbreak probability p, and consider any zip code with posterior probability greater than p/2 to have been detected. We compare the detected zip codes to the set of true zip codes containing injected cases, and count the number of hits (zip codes which are both true and detected). The overlap coefficient is then defined as N_hits / (N_true + N_detected − N_hits), the precision as N_hits / N_detected, and the recall as N_hits / N_true. Finally, we compute the average overlap coefficient, precision, and recall for each outbreak region (#1–10) and each outbreak day (1–14), averaging over all 200 outbreaks of the given type.

Figure 6. Outbreak region #2 (compact cluster, southeast of Pittsburgh). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 7. Outbreak region #3 (two disjoint clusters). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 8. Outbreak region #4 (elongated cluster, downtown Pittsburgh). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 9. Outbreak region #5 (highly elongated cluster, downtown Pittsburgh). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 10. Outbreak region #6 (irregular cluster, Pittsburgh North and South Sides). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.
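The three spatial-accuracy metrics defined above can be computed as in this short sketch (illustrative code, not the paper's; the zip codes and posterior values below are made up). A zip code counts as detected if its posterior probability exceeds half the maximum posterior; hits are zip codes that are both detected and truly affected.

```python
# Overlap coefficient, precision, and recall for one day of one outbreak.

def detected_set(posteriors):
    # Detected = posterior greater than half the maximum posterior p.
    p_max = max(posteriors.values())
    return {z for z, p in posteriors.items() if p > p_max / 2}

def spatial_accuracy(posteriors, true_zips):
    detected = detected_set(posteriors)
    hits = len(detected & true_zips)
    precision = hits / len(detected)
    recall = hits / len(true_zips)
    # Overlap coefficient = |true ∩ detected| / |true ∪ detected| (Jaccard).
    overlap = hits / (len(true_zips) + len(detected) - hits)
    return precision, recall, overlap

post = {"15213": 0.9, "15217": 0.6, "15232": 0.2, "15206": 0.5}
prec, rec, ov = spatial_accuracy(post, true_zips={"15213", "15217", "15232"})
```

Here the detection threshold is 0.45, so "15232" is missed and "15206" is a false detection, giving precision 2/3, recall 2/3, and overlap 1/2.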

Figure 11. Outbreak region #7 (irregular cluster, Monroeville area). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 12. Outbreak region #8 (compact cluster, west of Pittsburgh). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 13. Outbreak region #9 (elongated cluster, south of Pittsburgh). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 14. Outbreak region #10 (highly elongated cluster, north Allegheny County). Left: true outbreak region. Center: average posterior probability map for MBSS. Right: average posterior probability map for FSS.

Figure 15 shows the average overlap coefficients (averaged over all 10 outbreak regions) for each method as a function of the outbreak day. Tables III and IV show the average overlap coefficient, precision, and recall for each outbreak region for each method, at the midpoint of the outbreak. From the figure, we observe that FSS increased the average overlap coefficient by 6–15 per cent when compared with MBSS, for a given outbreak day and a given value of the maximum temporal window size W_max. Both methods had higher overlap coefficients for W_max = 7: as many of the affected locations may have had no counts injected on a given time step, a longer temporal window enabled the methods to observe a larger number of injected counts, thus improving their ability to pinpoint the affected region. At the midpoint of the outbreak, FSS increased the average overlap coefficient by 7.3 per cent for W_max = 3 and 10.0 per cent for W_max = 7. However, we observe from the tables that the overlap coefficients were similar for the more compact outbreaks (#1, #2, and #7), while FSS substantially improved the overlap coefficients for the more elongated outbreaks (#5, #9, and #10).
FSS obtained substantially higher precision than MBSS for all 10 outbreak regions, whereas MBSS obtained substantially higher recall for 8 of the 10 outbreak regions. Increasing W_max from 3 to 7 improved both precision and recall for FSS, while improving only precision for MBSS. FSS tended to miss affected zip codes with very small numbers of injected counts, whereas MBSS tended to include these zip codes if they were spatially proximate to other affected zip codes; on the other hand, MBSS often detected additional spatially proximate zip codes which were not part of the affected region. The most likely reason for these differences is that FSS picks out the (possibly irregularly shaped) subset of locations with sufficiently many injected cases without including the surrounding locations, whereas MBSS typically detects a larger, circular region containing these locations. When the true outbreak region is compact (circular or nearly circular), the circular region chosen by MBSS may better approximate its true extent; for elongated or irregular regions, the circular region chosen by MBSS is a poor approximation, and the greater flexibility of FSS is needed to improve spatial accuracy.

Figure 15. Spatial detection accuracy for FSS and MBSS, for maximum temporal window sizes 3 and 7. Average overlap coefficient between true and estimated regions for each outbreak day.

Table III. Comparison of MBSS and FSS methods, using maximum temporal window size W_max = 3. Average overlap coefficient, precision, and recall (per cent) for MBSS and FSS at the midpoint of the outbreak, for each outbreak region and averaged over all regions.

Table IV. Comparison of MBSS and FSS methods, using maximum temporal window size W_max = 7. Average overlap coefficient, precision, and recall (per cent) for MBSS and FSS at the midpoint of the outbreak, for each outbreak region and averaged over all regions.

4. Conclusions

The FSS method is an extension of our MBSS framework [1] which allows the computationally efficient detection of irregular clusters. By defining a hierarchical prior over all O(2^N) subsets of the N monitored locations, rather than considering only the O(N²) circular regions, FSS can achieve more timely detection of elongated or irregularly shaped clusters, and can more accurately pinpoint the affected subset of locations. Our experiments demonstrate substantial improvements in detection time (between 1.3 and 2.2 days) and spatial accuracy (more than doubling the overlap coefficient between true and detected regions) for highly elongated clusters, while achieving similar detection time and spatial accuracy for compact clusters. Most importantly, while a naive approach to computing the posterior probability map (requiring likelihood computations for each of the exponentially many subsets of the data) would be computationally infeasible, our computationally efficient FSS method scales up to the approximately 100 zip codes in Allegheny County, requiring less than 9 s to compute the posterior probability map for a given day of data.

Our continuing exploration of the FSS approach will extend the present work in two main directions. First, while our evaluation of detection power and spatial accuracy assumed a maximum neighborhood size k_max equal to the number of locations N, k_max can also be set to less than N if we wish to rule out highly elongated clusters, improving detection power for compact and nearly compact clusters and reducing computation time.
Similarly, while we assumed that each subset of the chosen neighborhood (a center location s_c and its k − 1 nearest neighbors) was equally likely, we can also generalize to the case where we independently choose (with some probability p) whether each location in the neighborhood is affected. Our current method is equivalent to assuming p = 1/2, and searching over circles is equivalent to assuming p = 1, but we can also choose a value of p between 1/2 and 1. This generalization still allows us to consider elongated and irregular regions, but gives higher weight to more compact regions. Our future work will perform a detailed analysis of the detection power of the generalized FSS method as a function of k_max, p, and the compactness of the outbreak region. Finally, we propose to incorporate incremental model learning into the FSS approach. As in our original MBSS method, the average effects of each outbreak type on each data stream can be learned from a small number of labeled training examples using a smoothed maximum likelihood approach [1]. Learning the hierarchical prior over subsets of the data is more challenging, as we must infer the distributions over center locations s_c and neighborhood sizes k (and optionally, the location probability p) from partially labeled training examples. Typically, the subset of affected locations S is labeled, but the values of s_c, k, and p are unknown. We will apply a generalized expectation-maximization (GEM) approach to estimate the distributions of these values, assuming multinomial distributions for s_c and k, and a fixed (but unknown) parameter p. This model is very similar to the latent center model described in [4], but conditions on the neighborhood size k rather than the radius r, and assumes that the conditional probability that a location is affected is uniform (within the chosen neighborhood) rather than decreasing as a function of distance from the center.
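The generalized hierarchical prior can be made concrete with a small sketch. This is an assumed form for illustration, not the paper's exact prior: the center and neighborhood size are chosen uniformly, each neighborhood location is then included independently with probability p, and (for simplicity) the sketch does not force the center itself to be affected. With p = 1/2 every subset of the chosen neighborhood is equally likely; with p = 1 only the full circular neighborhood gets mass.

```python
# Prior probability of a subset S under the generalized center/size/p model.

def subset_prior(S, neighborhoods, p):
    """P(S), summed over centers c and neighborhood sizes k.
    neighborhoods[c][k] is the set containing c and its k - 1 nearest
    neighbors; centers and sizes are taken to be uniformly distributed."""
    n_centers = len(neighborhoods)
    total = 0.0
    for c, sizes in neighborhoods.items():
        k_max = len(sizes)
        for nbhd in sizes.values():
            if S <= nbhd:  # S must lie inside the chosen neighborhood
                inside = len(S)
                total += (p ** inside) * ((1 - p) ** (len(nbhd) - inside)) \
                         / (n_centers * k_max)
    return total

# Toy example with two locations A and B:
nb = {"A": {1: {"A"}, 2: {"A", "B"}},
      "B": {1: {"B"}, 2: {"A", "B"}}}
```

Summing `subset_prior` over all subsets returns 1 for any p, and values of p between 1/2 and 1 smoothly shift mass toward larger (more circle-like) subsets of each neighborhood.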
We believe that incorporating learning into the FSS approach will enable efficient optimization of the hierarchical prior from a small number of training examples, further improving the timeliness and accuracy of outbreak detection.

Acknowledgements

This work was partially supported by National Science Foundation grants.

References

1. Neill DB, Cooper GF. A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning 2009.
2. Kulldorff M, Mostashari F, Duczmal L, Yih WK, Kleinman K, Platt R. Multivariate scan statistics for disease surveillance. Statistics in Medicine 2007; 26.
3. Kulldorff M. A spatial scan statistic. Communications in Statistics: Theory and Methods 1997; 26(6).
4. Makatchev M, Neill DB. Learning outbreak regions in Bayesian spatial scan statistics. Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health Care Applications, Helsinki, Finland.
5. Neill DB. Expectation-based scan statistics for monitoring spatial time series data. International Journal of Forecasting 2009; 25:498–517.
6. Neill DB, Moore AW, Cooper GF. A Bayesian spatial scan statistic. Advances in Neural Information Processing Systems 18, 2006.
7. Kulldorff M, Nagarwalla N. Spatial disease clusters: detection and inference. Statistics in Medicine 1995; 14.


More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Quasi-likelihood Scan Statistics for Detection of

Quasi-likelihood Scan Statistics for Detection of for Quasi-likelihood for Division of Biostatistics and Bioinformatics, National Health Research Institutes & Department of Mathematics, National Chung Cheng University 17 December 2011 1 / 25 Outline for

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Chapter 6 Classification and Prediction (2)

Chapter 6 Classification and Prediction (2) Chapter 6 Classification and Prediction (2) Outline Classification and Prediction Decision Tree Naïve Bayes Classifier Support Vector Machines (SVM) K-nearest Neighbors Accuracy and Error Measures Feature

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Automated Discovery of Novel Anomalous Patterns

Automated Discovery of Novel Anomalous Patterns Automated Discovery of Novel Anomalous Patterns Edward McFowland III Machine Learning Department School of Computer Science Carnegie Mellon University mcfowland@cmu.edu DAP Committee: Daniel B. Neill Jeff

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Steven Bergner, Chris Demwell Lecture notes for Cmpt 882 Machine Learning February 19, 2004 Abstract In these notes, a

More information

Cluster Analysis using SaTScan. Patrick DeLuca, M.A. APHEO 2007 Conference, Ottawa October 16 th, 2007

Cluster Analysis using SaTScan. Patrick DeLuca, M.A. APHEO 2007 Conference, Ottawa October 16 th, 2007 Cluster Analysis using SaTScan Patrick DeLuca, M.A. APHEO 2007 Conference, Ottawa October 16 th, 2007 Outline Clusters & Cluster Detection Spatial Scan Statistic Case Study 28 September 2007 APHEO Conference

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005 Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over

More information

Spatiotemporal Outbreak Detection

Spatiotemporal Outbreak Detection Spatiotemporal Outbreak Detection A Scan Statistic Based on the Zero-Inflated Poisson Distribution Benjamin Kjellson Masteruppsats i matematisk statistik Master Thesis in Mathematical Statistics Masteruppsats

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

More information

Statistical Data Analysis Stat 3: p-values, parameter estimation

Statistical Data Analysis Stat 3: p-values, parameter estimation Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,

More information

Naïve Bayesian. From Han Kamber Pei

Naïve Bayesian. From Han Kamber Pei Naïve Bayesian From Han Kamber Pei Bayesian Theorem: Basics Let X be a data sample ( evidence ): class label is unknown Let H be a hypothesis that X belongs to class C Classification is to determine H

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning 3. Instance Based Learning Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 Outline Parzen Windows Kernels, algorithm Model selection

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

CS-E3210 Machine Learning: Basic Principles

CS-E3210 Machine Learning: Basic Principles CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction

More information

A fast and simple algorithm for training neural probabilistic language models

A fast and simple algorithm for training neural probabilistic language models A fast and simple algorithm for training neural probabilistic language models Andriy Mnih Joint work with Yee Whye Teh Gatsby Computational Neuroscience Unit University College London 25 January 2013 1

More information

Bayes Rule. CS789: Machine Learning and Neural Network Bayesian learning. A Side Note on Probability. What will we learn in this lecture?

Bayes Rule. CS789: Machine Learning and Neural Network Bayesian learning. A Side Note on Probability. What will we learn in this lecture? Bayes Rule CS789: Machine Learning and Neural Network Bayesian learning P (Y X) = P (X Y )P (Y ) P (X) Jakramate Bootkrajang Department of Computer Science Chiang Mai University P (Y ): prior belief, prior

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Efficient Likelihood-Free Inference

Efficient Likelihood-Free Inference Efficient Likelihood-Free Inference Michael Gutmann http://homepages.inf.ed.ac.uk/mgutmann Institute for Adaptive and Neural Computation School of Informatics, University of Edinburgh 8th November 2017

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Learning a probabalistic model of rainfall using graphical models

Learning a probabalistic model of rainfall using graphical models Learning a probabalistic model of rainfall using graphical models Byoungkoo Lee Computational Biology Carnegie Mellon University Pittsburgh, PA 15213 byounko@andrew.cmu.edu Jacob Joseph Computational Biology

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Disease mapping with Gaussian processes

Disease mapping with Gaussian processes EUROHEIS2 Kuopio, Finland 17-18 August 2010 Aki Vehtari (former Helsinki University of Technology) Department of Biomedical Engineering and Computational Science (BECS) Acknowledgments Researchers - Jarno

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

A Bayesian Approach to Concept Drift

A Bayesian Approach to Concept Drift A Bayesian Approach to Concept Drift Stephen H. Bach Marcus A. Maloof Department of Computer Science Georgetown University Washington, DC 20007, USA {bach, maloof}@cs.georgetown.edu Abstract To cope with

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information