XXV Simposio Internacional de Estadística 2015
Armenia, Colombia, 5, 6, 7 y 8 de Agosto de 2015

Monitoring aggregated Poisson data with probability control limits

Victor Hugo Morales Ospina 1,a, José Alberto Vargas Navas 2,b

1 Departamento de Matemáticas y Estadística, Facultad de Ciencias, Universidad de Córdoba, Montería, Colombia
2 Departamento de Estadística, Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá, Colombia

Abstract

The aggregation of data is a common practice in many applications, especially those involving large numbers of events, so determining the impact of its use, and having schemes that allow adequate monitoring of such situations, is very important. If we assume that the count of events of interest has a Poisson distribution, then aggregating these events over an aggregation period of length d time units again yields data with a Poisson distribution, so the schemes traditionally used for monitoring processes with such data could also be used when the data are aggregated. These schemes generally require adequate knowledge of the sample sizes with which we are going to work, which is not available in settings such as healthcare surveillance, where the sample size is related to the population at risk, which frequently changes over time. If we aggregate data when the sample size varies over time, and the count of events or non-conformities follows a (conditionally) independent Poisson distribution given the corresponding sample size, the result is a process with time-varying sample sizes and a (conditional) Poisson distribution. For situations involving time-varying sample sizes, proposals such as Zhou et al. (2012) and Ryan & Woodall (2010) allow adequate monitoring of the processes. However, as Shen et al.
(2013) pointed out, these and other works were built on the assumption that the sample size follows a pre-specified random or deterministic model, known a priori when establishing appropriate control limits before the control chart is initiated. As Zhou et al. (2012) pointed out, traditional control charts are very sensitive to the correct specification of sample sizes. Furthermore, as Shen et al. (2013) noted, an inappropriate assumption about the distribution function may lead to unexpected performance of control charts, e.g., excessive false alarms in early runs of the control charts, which in turn hurt an operator's confidence in valid alarms. To overcome this drawback, Shen et al. (2013) proposed the use of probability control limits in an EWMA control chart for monitoring Poisson count data with time-varying sample sizes in Phase II. In their work, no matter what the (unknown) time-varying sample sizes are, the proposed EWMA chart always shares an identical run length distribution with the geometric distribution. Essentially, this chart uses dynamic control limits which are determined online and depend only on the current and past sample size observations. It does not require any sample size model to be specified before implementation, only the desired false alarm rate. Given the importance of determining the effect of data aggregation on monitoring schemes, and considering the large number of applications that involve time-varying sample sizes with little knowledge about their distribution, we propose to study the effect of aggregating data in monitoring schemes that employ probability control limits.

Keywords: Aggregation of data, Poisson distribution, probability control limits, time-varying sample sizes.

a Assistant Professor. E-mail: vmorales@sinu.unicordoba.edu.co
b Full Professor. E-mail: javargasn@unal.edu.co
1. Aggregated Poisson data

In many applications it is common to use methodologies and practices that facilitate the analysis of information, correcting, or at least attenuating, the difficulties that certain characteristics of the data can generate, such as the presence of excess zeros. One of these procedures is data aggregation. The effect of this practice was studied by Reynolds & Stoumbos (2000), who considered two types of CUSUM charts for monitoring changes in the proportion of defective items. The first chart was based on the binomial variables that result from counting the total number of defective items in a sample of size n. The second chart was based on the Bernoulli observations corresponding to the individual items inspected in the samples. They concluded that the Bernoulli CUSUM chart showed better overall performance, especially for detecting large increases in the nonconforming rate.

Reynolds & Stoumbos (2004a) investigated whether or not it is convenient to group observations when the goal is to monitor the process effectively, that is, whether it is better to use n = 1 or n > 1. They found that combinations of CUSUM charts for the mean and the variance generally produce the best statistical performance in detecting small and large shifts, sustained or transient, in µ or in σ, when each sample consists of one observation. Moreover, if a Shewhart chart is used, n = 1 is better when small sustained shifts must be detected. Reynolds & Stoumbos (2004b) investigated whether n = 1 is better than n > 1 from the perspective of statistical performance in monitoring the process mean and variance, comparing the performance of Shewhart, EWMA and CUSUM charts.
Their investigation shows that it is not reasonable to use the Shewhart control chart with individual observations, and that the EWMA and CUSUM charts have better statistical performance over a wide range of sample sizes and out-of-control conditions such as drifts. Likewise, no significant differences in statistical performance were found between the EWMA and CUSUM charts; with these charts, using n = 1 gives better statistical performance than n > 1.

One of the biggest areas of application of data aggregation comprises situations in which a Poisson-type variable is generated, such as public health surveillance. In this area, several well-known works deal with data aggregation. For example, Burkom et al. (2004) explore the aggregation of data by space, time and data category, and discuss, among other issues, the relevance of data aggregation to the efficacy of alerting algorithms used in public health surveillance. The authors concluded that a judicious data aggregation strategy plays an important role in improving biosurveillance systems. Gan (1994) compares the performance of two CUSUM charts: one based on the time between Poisson events, which has an exponential distribution, and the other based on aggregated counts, which have a Poisson distribution. Schuh et al. (2013) extend Gan's (1994) investigation by exploring in greater depth the relative performance of the exponential CUSUM and Poisson CUSUM control charts, considering longer aggregation periods.

2. The EWMAG chart

As indicated earlier, Shen et al. (2013) proposed the EWMAG chart, which uses probability control limits in an EWMA control chart for monitoring Poisson count data with time-varying sample sizes in Phase II.
The proposed EWMA chart is called the EWMAG chart because its in-control (IC) run length distribution is theoretically identical to the geometric distribution, i.e., the false alarm rate depends neither on the monitoring time nor on the sample sizes being monitored. Let $X_t$ be the count of an adverse event during the fixed time period $(t-1, t]$ (the count of events at time $t$). Suppose $X_t$ independently follows the Poisson distribution with mean $\theta n_t$ conditional on $n_t$, where $\theta$ and $n_t$ denote the occurrence rate of the event and the sample size at time $t$, respectively. It is desired to detect an abrupt change in the occurrence rate from $\theta_0$ to another unknown value $\theta_1 > \theta_0$. The EWMAG
chart uses the charting statistic

$$Z_i = (1-\lambda)Z_{i-1} + \lambda \frac{X_i}{n_i}, \quad i = 1, 2, \ldots, t \qquad (1)$$

where $Z_0 = \theta_0$ and $\lambda \in (0, 1]$ is a smoothing parameter which determines the weights of past observations. The control limits of the EWMAG chart must satisfy the following equations:

$$P(Z_1 > h_1(\alpha) \mid n_1) = \alpha$$
$$P(Z_t > h_t(\alpha) \mid Z_i \le h_i(\alpha), 1 \le i < t, \; n_t) = \alpha \quad \text{for } t > 1 \qquad (2)$$

where $h_t(\alpha)$ is the control limit at time $t$, and $\alpha$ is the pre-specified false alarm rate. At time $t$, the probability control limit is determined right after the value of $n_t$ is observed. Consequently, the EWMAG chart needs no assumptions about future sample sizes and does not suffer from wrong model assumptions. This property makes the EWMAG chart significantly different from previous control charts. Because of the intricacy of the conditional probability in (2), it is impossible to solve for $h_t(\alpha)$ analytically, so computational procedures are necessary. The procedure to find the probability control limits is summarized in the following algorithm:

1. If there is no out-of-control (OC) signal at time $t-1$ ($t = 1, 2, \ldots$), generate $\hat{X}_{t,i}$ ($i = 1, \ldots, M$) from the Poisson$(\theta_0 n_t)$ distribution, where $n_t$ is exactly known. Accordingly, $M$ values of the pseudo charting statistic $Z_t$ are obtained through

$$\hat{Z}_{t,i} = (1-\lambda)\hat{Z}_{t-1,j} + \lambda \frac{\hat{X}_{t,i}}{n_t} \qquad (3)$$

where $i = 1, \ldots, M$ and $\hat{Z}_{t-1,j}$, $j \in \{1, \ldots, M'\}$, is uniformly selected from $\hat{\mathbf{Z}}_{t-1,M'}$, which contains the first $M'$ ranked pseudo values of $Z_{t-1}$.

2. Sort these values in ascending order; the upper $\alpha$ empirical quantile of the $M$ values is used to approximate the control limit $h_t(\alpha)$.

3. Compare the value of $\hat{Z}_t$, calculated from the observed $X_t$ and $n_t$, with $h_t(\alpha)$ to decide whether to issue an OC signal or to continue to the next time point.

4. If monitoring continues, set $M' = \lfloor M(1-\alpha) \rfloor$ and eliminate the values $\hat{Z}_{t,(M'+1)}, \ldots, \hat{Z}_{t,(M)}$. Then go back to step 1.
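The steps above can be sketched in a short Monte Carlo routine. This is a minimal illustration under our own naming, not the authors' code; for simplicity, the pool of surviving pseudo statistics is resampled so that $M$ stays constant at each step, which is one common implementation choice.

```python
import numpy as np

def ewmag_chart(theta0, lam, alpha, n_seq, x_seq, M=20000, seed=0):
    """Sketch of the EWMAG probability-control-limit algorithm.

    n_seq, x_seq: observed sample sizes n_t and counts X_t (Phase II data).
    Returns the dynamic limits h_t(alpha) and the first OC signal time (or None).
    """
    rng = np.random.default_rng(seed)
    pool = np.full(M, theta0)          # surviving pseudo statistics; Z_0 = theta0
    z = theta0                         # observed charting statistic
    limits, signal = [], None
    for t, (n_t, x_t) in enumerate(zip(n_seq, x_seq), start=1):
        # Step 1: simulate M pseudo counts from Poisson(theta0 * n_t) and
        # update the pseudo statistics, resampling Z_{t-1} from the survivors.
        prev = rng.choice(pool, size=M)
        z_sim = (1 - lam) * prev + lam * rng.poisson(theta0 * n_t, size=M) / n_t
        # Step 2: h_t(alpha) is the (1 - alpha) empirical quantile.
        z_sim.sort()
        m_keep = int(np.floor(M * (1 - alpha)))
        h_t = z_sim[m_keep - 1]
        limits.append(h_t)
        # Step 3: update the observed statistic and check for a signal.
        z = (1 - lam) * z + lam * x_t / n_t
        if z > h_t:
            signal = t
            break
        # Step 4: keep only the pseudo values not exceeding h_t.
        pool = z_sim[:m_keep]
    return limits, signal
```

For instance, with $\theta_0 = 2$, $\lambda = 0.2$ and in-control counts $X_t \approx \theta_0 n_t$, the chart runs without signalling, while a sustained jump in $X_t/n_t$ triggers an alarm shortly after it occurs.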
3. The EWMAG chart for aggregated Poisson data

Our proposal is to determine the effect of aggregating data when a Poisson process with sample sizes that vary over time is monitored. For this purpose, we adapt the proposal of Shen et al. (2013) to this context; simulation procedures are therefore used to find the control limits of the EWMAG chart with observations aggregated over time. The effect of data aggregation is determined by comparing the sensitivity of the EWMAG chart in detecting out-of-control processes under different aggregation periods with the sensitivity of the chart as proposed by Shen et al. (2013). The comparison criterion will be the ARL. The simulations will be made considering t sampling points corresponding to an equal number of samples. The observations obtained will be monitored with the EWMAG chart, also taking into account different aggregation periods according to our proposal. A more detailed description of this process is given below.
If at times $t_1, t_2, \ldots, t_i, \ldots, t_n, \ldots$ Poisson data $X_1, X_2, \ldots, X_i, \ldots, X_n, \ldots$ are observed with means $\theta_0 n_{t_i}$ conditional on $n_{t_i}$, respectively ($X_i \sim P(\theta_0 n_{t_i} \mid n_{t_i})$), where $\theta_0$ is the occurrence rate of the event and $n_{t_i}$ is the sample size at time $t_i$, then over the time period $[t_i, t_j]$, $i, j \in \mathbb{Z}^+$, the observation $Y_{ij} = \sum_{k=i}^{j} X_k$ is obtained, which has a Poisson distribution with mean $\theta_0 \sum_{k=i}^{j} n_{t_k}$ conditional on $\sum_{k=i}^{j} n_{t_k}$. Thus, following Shen et al. (2013), it is possible to implement the EWMAG chart for the variable $Y$.

Let $Y_{1r}$ be the variable $Y$ observed over the time period $[t_1, t_r]$. Under the IC condition, $Y_{1r}$ should follow the Poisson distribution with mean $\theta_0 \sum_{k=1}^{r} n_{t_k}$ conditional on $\sum_{k=1}^{r} n_{t_k}$, which is exactly known. Therefore we can obtain the control limit for the first aggregation period by randomly generating $\hat{Y}_{1r,i}$, $i = 1, \ldots, M$, from the Poisson$(\theta_0 \sum_{k=1}^{r} n_{t_k})$ distribution and correspondingly calculating $M$ values of the pseudo statistic $Z_{1r}$ from (1) with $Z_0 = \theta_0$, say $\hat{Z}_{1r,1}, \ldots, \hat{Z}_{1r,M}$. These values are then sorted in ascending order and stored in a vector $\hat{\mathbf{Z}}_{1rM}$. The control limit $h_{1r}(\alpha)$ can be approximated as the $M'$-th smallest value in $\hat{\mathbf{Z}}_{1rM}$, where $M' = \lfloor M(1-\alpha) \rfloor$ and $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$. After determining $h_{1r}(\alpha)$, we compare with it the value of $\hat{Z}_{1r}$, calculated from the observed $Y_{1r}$ and $\sum_{k=1}^{r} n_{t_k}$. An out-of-control (OC) signal is issued if $\hat{Z}_{1r} > h_{1r}(\alpha)$. Otherwise, we move forward to the next time period, $[t_{r+1}, t_s]$, $s > r + 1$. According to (2), in order to determine the control limit $h_{r+1,s}(\alpha)$ corresponding to the time period $[t_{r+1}, t_s]$, we must ensure that the value of the pseudo statistic $Z_{1r}$ is less than or equal to $h_{1r}(\alpha)$. Hence only the ranked values $\hat{Z}_{1r,(1)}, \ldots, \hat{Z}_{1r,(M')}$ should be kept to determine $h_{r+1,s}(\alpha)$. We store the $M'$ ranked pseudo values of $Z_{1r}$ in a vector $\hat{\mathbf{Z}}_{1rM'}$.
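As a quick sanity check on the aggregation step, the following snippet (illustrative only; the rate $\theta_0$ and the sample sizes are arbitrary choices of ours) verifies numerically that the aggregated count $Y$ behaves like a Poisson variable with mean $\theta_0 \sum_k n_{t_k}$, whose mean and variance coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
n = rng.integers(50, 150, size=4)        # hypothetical sample sizes n_{t_1}, ..., n_{t_4}
N = n.sum()

# Aggregate Y = X_1 + ... + X_4 with X_k | n_k ~ Poisson(theta0 * n_k),
# replicated many times to inspect the distribution of Y.
Y = rng.poisson(theta0 * n, size=(200_000, 4)).sum(axis=1)

# A Poisson(theta0 * N) variable has mean = variance = theta0 * N.
print(Y.mean(), Y.var(), theta0 * N)
```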
Let $Y_{r+1,s}$ be the variable $Y$ observed over the aggregation period $[t_{r+1}, t_s]$. Given $\sum_{k=r+1}^{s} n_{t_k}$, a vector $\hat{\mathbf{Z}}_{r+1,s,M'}$ of dimension $M'$ can be obtained through

$$\hat{Z}_{r+1,s,i} = (1-\lambda)\hat{Z}_{1,r,j} + \lambda \frac{\hat{Y}_{r+1,s,i}}{\sum_{k=r+1}^{s} n_{t_k}} \qquad (4)$$

where $i = 1, \ldots, M'$, $\hat{Z}_{1,r,j}$ is uniformly selected from $\hat{\mathbf{Z}}_{1rM'}$ with $j \in \{1, \ldots, M'\}$, and the $\hat{Y}_{r+1,s,i}$ are randomly generated from the Poisson$(\theta_0 \sum_{k=r+1}^{s} n_{t_k})$ distribution. After sorting the $M'$ elements of $\hat{\mathbf{Z}}_{r+1,s,M'}$ in ascending order, we obtain the control limit $h_{r+1,s}(\alpha)$ by setting it at the $(1-\alpha)$ quantile of the $M'$ elements. An out-of-control (OC) signal is issued if $\hat{Z}_{r+1,s} > h_{r+1,s}(\alpha)$. Otherwise, we move forward to the next time period, $[t_{s+1}, t_u]$, $u > s + 1$. The above procedure is repeated by simulating samples from the Poisson$(\theta_0 \sum_{k=s+1}^{u} n_{t_k})$ distribution, and so on. Finally, for different out-of-control states, the variables $X$ and $Y$ will be monitored with the proposal of Shen et al. (2013) and compared in terms of the ARL in each case.
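The recursion for one aggregation period can be sketched as follows. This is a minimal illustration under our own naming, not the authors' code: `N_period` stands for $\sum_k n_{t_k}$ over the period, `pool` for the surviving ranked pseudo statistics carried over from the previous period (initialized to $Z_0 = \theta_0$), and the `(N, Y)` pairs in the usage example are made up.

```python
import numpy as np

def aggregated_period_limit(theta0, lam, alpha, N_period, pool, rng):
    """One period of the EWMAG chart on aggregated data: simulate pseudo
    statistics via (4), return the limit h(alpha) and the surviving pool."""
    M = len(pool)
    prev = rng.choice(pool, size=M)                  # Z-hat from the previous period
    y_sim = rng.poisson(theta0 * N_period, size=M)   # Y-hat ~ Poisson(theta0 * N)
    z_sim = np.sort((1 - lam) * prev + lam * y_sim / N_period)
    m_keep = int(np.floor(M * (1 - alpha)))
    h = z_sim[m_keep - 1]                            # (1 - alpha) empirical quantile
    return h, z_sim[:m_keep]                         # limit and surviving pseudo values

# Hypothetical usage over successive aggregation periods:
rng = np.random.default_rng(0)
pool = np.full(10_000, 1.5)                          # Z_0 = theta0 = 1.5
z = 1.5
for N_period, y_obs in [(300, 452), (280, 410)]:     # made-up (N, Y) observations
    h, pool = aggregated_period_limit(1.5, 0.2, 0.01, N_period, pool, rng)
    z = (1 - 0.2) * z + 0.2 * y_obs / N_period
    if z > h:
        print("out-of-control signal")
        break
```

Note that the pool shrinks by a factor $(1-\alpha)$ at every period, mirroring the elimination step of the algorithm in Section 2.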
4. Bibliography

Burkom, H., Elbert, Y., Feldman, A. and Lin, J. (2004). Role of Data Aggregation in Biosurveillance Detection Strategies with Applications from ESSENCE. MMWR Supplements, 53(Supplement), pp. 67-73.

Gan, F. F. (1994). Design of Optimal Exponential CUSUM Control Charts. Journal of Quality Technology, 26(2), pp. 109-124.

Reynolds, M. R., Jr. and Stoumbos, Z. G. (2000). A General Approach to Modelling CUSUM Charts for a Proportion. IIE Transactions, 32, pp. 515-535.

Reynolds, M. R., Jr. and Stoumbos, Z. G. (2004a). Should Observations Be Grouped for Effective Process Monitoring? Journal of Quality Technology, 36(4), pp. 343-366.

Reynolds, M. R., Jr. and Stoumbos, Z. G. (2004b). Control Charts and the Efficient Allocation of Sampling Resources. Technometrics, 46(2), pp. 200-214.

Ryan, A. G. and Woodall, W. H. (2010). Control Charts for Poisson Count Data with Varying Sample Sizes. Journal of Quality Technology, 42, pp. 260-274.

Schuh, A., Woodall, W. H. and Camelio, J. A. (2013). The Effect of Aggregating Data When Monitoring a Poisson Process. Journal of Quality Technology, 45(3), pp. 260-271.

Shen, X., Zou, C., Tsung, F. and Jiang, W. (2013). Monitoring Poisson Count Data with Probability Control Limits When Sample Sizes Are Time-Varying. Naval Research Logistics, 60(8), pp. 625-636.

Zhou, Q., Zou, C., Wang, Z. and Jiang, W. (2012). Likelihood-Based EWMA Charts for Monitoring Poisson Count Data with Time-Varying Sample Sizes. Journal of the American Statistical Association, 107, pp. 1049-1062.