SCALABLE ROBUST MONITORING OF LARGE-SCALE DATA STREAMS

By Ruizhi Zhang and Yajun Mei, Georgia Institute of Technology


Submitted to the Annals of Statistics

Online monitoring of large-scale data streams has many important applications, such as industrial quality control, signal detection, and biosurveillance, but unfortunately it is highly non-trivial to develop scalable schemes that are able to tackle two robustness concerns: (1) the unknown sparse number or subset of affected data streams, and (2) the uncertainty of model specification for high-dimensional data. In this article, we develop a family of scalable robust schemes in the scenario when the local data streams are from Tukey-Huber's gross error models with outliers. We first define a new local detection statistic, called the Lα-CUSUM statistic, that can reduce the effect of outliers by using the Box-Cox transformation of the likelihood function. Then we propose to raise a global alarm based upon the sum of the soft-thresholding transformation of these local Lα-CUSUM statistics, so as to filter out unaffected data streams. In addition, we propose a new concept of false alarm breakdown point to measure the robustness of schemes with respect to outliers, and characterize the breakdown point of our proposed schemes. Asymptotic analysis and extensive numerical simulations are conducted to illustrate the robustness and usefulness of our proposed schemes.

1. Introduction. Robust statistics has been extensively studied in the offline context, when the full data set, possibly contaminated with outliers, is available for analysis: e.g., robust estimation (Huber, 1964; Basu et al., 1998), robust hypothesis testing (Huber, 1965; Heritier and Ronchetti, 1994), and robust regression (Yohai, 1987; Cantoni and Ronchetti, 2001). See also the classical books Huber and Ronchetti (2009) and Hampel et al. (2011) for a literature review. Here we propose scalable robust methods in the online context of monitoring large-scale data streams.
Our research is motivated by real-world applications in industrial quality control, biosurveillance, and key infrastructure or internet traffic monitoring, in which sensors are deployed to constantly monitor the changing environment; see Shmueli and Burkom (2010); Tartakovsky, Polunchenko and Sokolov (2013); Yan, Paynabar and Shi (2015). One would like to detect an undesirable event as quickly as possible by monitoring the large-scale data streams generated from these sensors, but there are two robustness concerns here. The first one is that only a sparse number of data streams might be affected by the event, but we do not know which subset of data streams is affected or the exact number of affected data streams.

Keywords and phrases: Change-point, CUSUM, robustness, quickest detection, scalable, sparsity.

Fig 1. A local data stream with outliers when a change in distribution occurs at time ν = 50.

Hence we want to effectively detect the changing event regardless of the combination of affected data streams. Xie and Siegmund (2013) were the first to tackle this sparsity/robustness issue, via a semi-Bayesian approach, and later Wang and Mei (2015) developed shrinkage-estimation-based schemes. Chan (2017) developed asymptotic optimality theory for large-scale independent Gaussian data streams. Unfortunately, all this research is based on specific parametric models (e.g., Gaussian) for the observations, which may easily be violated in practice. Moreover, these existing methods are computationally expensive and not scalable for monitoring large-scale data streams. The second, and possibly more serious, concern is that local data streams might involve random outliers under the normal state that do not indicate the changing event. One specific example is vehicle rush-hour traffic monitoring, where one would like to decide whether or not to add new lanes or build new roads due to a larger population or the construction of a new stadium or shopping mall. However, the observed traffic data might involve outliers caused by other events, such as major car accidents or severe weather, which are related but probably should not be the decisive factors in the decision making. Another example is the detection of distributed denial-of-service (DDoS) attacks in cyber-security by monitoring the numbers of attempted connections. When the observed number of connections suddenly becomes huge, it might be due to normal fluctuations and does not necessarily imply a DDoS attack, especially if it goes back to the normal state shortly afterwards; see Tartakovsky, Polunchenko and Sokolov (2013).
For better illustration, Figure 1 plots a sequence of simulated one-dimensional observations whose distribution changes from N(0, 1) to N(1, 1) at time 50, with some contaminated outliers. These outliers often cause standard statistical methods

to raise local false alarms, and thus the system-wide false alarm rate can be huge given the large number of sensors or data streams. Indeed, too frequent system-wide false alarms are the main reason why the usefulness of bio- and syndromic surveillance via a huge sensor grid (e.g., hospitals, state/county surveillance systems) throughout the U.S. is in debate; see Stoto, Schonlau and Mariano (2004). In this paper, we develop scalable robust methods to tackle the above-mentioned two robustness issues when online monitoring large-scale data streams. Here we adopt the classical offline robust statistics approach: we have a parametric, idealized model that might be a good approximation to the true model, but we cannot and do not assume that the assumed model is exactly correct; see Huber and Ronchetti (2009). It is desirable to develop a statistical method that has reasonably good efficiency at the assumed model, and that still performs well if there are small deviations from the assumed model. In particular, under our online monitoring context, the assumed model specifies the number of affected local data streams as well as the local distributions of the data streams. Our proposed methods are robust in the sense that small deviations from the assumed number of affected local data streams or from the assumed local distributions should not impair the performance too much. From the modeling viewpoint, we assume that the true model for each local data stream is Tukey-Huber's gross error model (Tukey, 1962; Huber, 1964), which is a two-component mixture model with one component being the assumed, idealized model for signals and the other component accounting for outliers. The occurring event changes the signal component of Tukey-Huber's gross error model for only a few affected local data streams.
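A data stream of the kind shown in Figure 1 can be generated directly from the gross error model: each observation is drawn from the signal component (N(0, 1) before ν = 50, N(1, 1) afterwards) with probability 1 − ɛ, and from a contamination density with probability ɛ. A minimal Python sketch, where the contamination density g = N(0, 5²) and the ratio ɛ = 0.05 are our own illustrative choices, not values from the paper:

```python
import random

def simulate_stream(nu=50, horizon=100, eps=0.05, seed=7):
    """One local data stream from the gross error model: the signal
    component changes from N(0,1) to N(1,1) at the change-point nu,
    while a fraction eps of observations are heavy-tailed outliers."""
    rng = random.Random(seed)
    xs = []
    for t in range(1, horizon + 1):
        if rng.random() < eps:
            xs.append(rng.gauss(0.0, 5.0))  # contamination component g
        else:
            mean = 0.0 if t < nu else 1.0   # signal: f0 before nu, f1 after
            xs.append(rng.gauss(mean, 1.0))
    return xs

stream = simulate_stream()
```

Plotting such a stream reproduces the qualitative picture of Figure 1: a mean shift at ν that is partially masked by occasional large outliers on both sides of the change-point.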
From the methodology viewpoint, our proposed schemes tackle these two robustness issues by combining the newly developed online robust idea of sum-shrinkage of local detection statistics in Liu, Zhang and Mei (2017) with the contemporary offline robust concept of Lq-likelihood in Ferrari and Yang (2010); Qin and Priebe (2017). We should acknowledge that online monitoring of one-dimensional or low-dimensional multivariate data streams has been well studied in the sequential change-point detection literature. Many classical procedures have been developed under parametric models, such as Page's CUSUM procedure (Page, 1954), the Shiryaev-Roberts procedure (Shiryaev, 1963; Roberts, 1966), window-limited procedures (Lai, 1995) and scan statistics (Glaz et al., 2001), and classical, fundamental results were established in Shiryaev (1963), Lorden (1971), Pollak (1985, 1987), Moustakides (1986), Ritov (1990), Lai (1995), etc. For a review, see the books Basseville and Nikiforov (1993), Poor and Hadjiliadis (2009), and Tartakovsky, Nikiforov and Basseville (2014). There is limited research on robust monitoring of one-dimensional data, and most of it involves nonparametric

methods, such as the rank-based method in Gordon and Pollak (1994, 1995), or the kernel-based method in Desobry, Davy and Doncarli (2005). However, these nonparametric methodologies generally lose efficiency under specific parametric or semi-parametric models. An exception is Unnikrishnan, Veeravalli and Meyn (2011), which reduces the problem of robustly monitoring one-dimensional data to the problem of detecting a change in the least favorable distribution, which heavily depends on the proportion of outliers. Unfortunately, it is unclear how to extend the concept of least favorable distribution to the context of large-scale data streams when there is uncertainty about the subset of affected local data streams. In addition, we should also mention that, in a completely different context, Chen and Zhang (2015) propose a nonparametric, graph-based method for monitoring large-scale data streams. Their research was motivated by the change of friendship network structure over time, and focused on detecting a global change in the correlation structure between data streams. Our motivating examples, in contrast, are biosurveillance, traffic and security monitoring, where we are interested in detecting local changes in a sparse subset of affected data streams among a large number of data streams. Intuitively, it is more challenging to detect sparse local changes than global changes across all data streams. Our research makes several contributions to the statistics field by combining robust statistics with sequential change-point detection for online monitoring of large-scale data streams. First, our proposed method is robust with respect to the uncertainty in the number or subset of affected data streams as well as the uncertainty in the model assumptions for the local data.
Second, our proposed method is scalable and computationally simple, as its recursive form allows one to easily implement it over a long time period for large-scale data streams, via parallel computing at each local data stream with fixed memory requirements. Third, inspired by the concept of breakdown point (Hampel, 1968) in offline robust statistics and by excessive false alarms in practice, we propose a novel concept of false alarm breakdown point to measure the robustness of any scheme, and show that our proposed scheme indeed has a much larger false alarm breakdown point than the classical CUSUM-based schemes. Finally, from the mathematical viewpoint, we use Chebyshev's inequality, not the standard renewal theory, to derive non-asymptotic lower bounds on the average run length to false alarm for our proposed method. The non-asymptotic results hold regardless of the number of data streams, and allow us to provide deep insight into monitoring large-scale data streams in the modern asymptotic regime when the number of data streams goes to ∞. The remainder of this article is organized as follows. In Section 2, we present preliminaries and

background information on quickest change detection, or sequential change-point detection. Then we develop our proposed schemes for online robust monitoring of large-scale data streams in Section 3. The theoretical properties of our proposed schemes are provided in Section 4. In Section 5, we introduce the concept of false alarm breakdown point and characterize the breakdown point of our proposed schemes. The simulation results are presented in Section 6, and the proofs of our main theorems are postponed to Section 7.

2. Preliminaries and background. Suppose we are monitoring K local data streams over time, and denote by X_{k,n} the local observation at the k-th local data stream at time n:

(1) Data Stream 1: X_{1,1}, X_{1,2}, ...
    Data Stream 2: X_{2,1}, X_{2,2}, ...
    ...
    Data Stream K: X_{K,1}, X_{K,2}, ...

Initially, the system is in control, but at some unknown time ν, an undesired event may occur and affect some unknown data streams, in the sense of changing the local distributions of some, but not all, of the local data streams X_{k,n}. Equivalently, we are monitoring K-dimensional random vectors, X_n = (X_{1,n}, ..., X_{K,n}), over time n, and the occurring event affects some, but not necessarily all, components of the X_n's. We would like to utilize the observed data X_{k,n} to raise an alarm as quickly as possible once the true change occurs, subject to a false alarm constraint. To highlight our main ideas, we make the simplifying assumption that the X_{k,n}'s are independent and identically distributed (i.i.d.) over time and across different data streams. Here the X_{k,n}'s might be one-dimensional or low-dimensional raw data, or derived features or residuals from some spatial-temporal models. Note that this independence assumption is not as restrictive as one might think in many practical applications.
For instance, one can first build some baseline spatio-temporal models, and then monitor the independent residuals instead of the dependent raw data; see Xie, Huang and Willett (2013) and Liu, Mei and Shi (2015) for two real-world applications in solar flare detection and the hot-forming process. Another possibility is to monitor independent features; see Paynabar, Zou and Qiu (2016) and Wang, Paynabar and Mei (2017), which use principal component analysis (PCA) to extract independent coefficient features of multi-channel profiles and then monitor the independent PCA coefficients instead of the raw profile data. For the purpose of a more rigorous presentation, let us begin with the local models, or the local

pre- and post-change distributions, for the local data streams. At a high level, a good approximation of the local models is the classical change-point model of detecting a change in the local distribution from one known density function f_0(·) to another known density function f_1(·) at some unknown time ν; see Lorden (1971). Below we will refer to this as the idealized model. As mentioned in the Introduction, due to the outliers, it often makes more sense to assume that these two given distributions, f_0(x) and f_1(x), capture most, but not all, information about the data. This motivates us to follow the classical offline robust statistics literature and define the true model of the local observations as Tukey-Huber's model. Specifically, we assume that the data X_{k,n}'s are i.i.d. with density h_0(x) when n ≤ ν − 1, but i.i.d. with density h_1(x) when n ≥ ν, where h_0 and h_1 are the two-component mixture densities of Tukey-Huber's model:

(2) h_0(x) = (1 − ɛ)f_0(x) + ɛg(x), and h_1(x) = (1 − ɛ)f_1(x) + ɛg(x).

Here ɛ ∈ [0, 1) is the contamination ratio, and g(x) is the contamination density, which is usually assumed to be unknown except for having a fat tail. Below the model in (2) will be referred to as the gross error model, and clearly it becomes the idealized model when ɛ = 0. Intuitively, under the gross error model in (2), most of the data X_{k,n}'s are from the idealized pre-change or post-change distributions f_0(x) or f_1(x), but a small proportion of observations are contaminated and have another, unknown density g(x). Alternatively, the contamination distribution g can also be considered as another intrinsic post-change distribution, but we are not interested in detecting a change in density from f_0(x) to g(x), only in detecting the change from f_0(x) to f_1(x). Also note that we consider a simplified setup with the same contamination ratio ɛ and the same contamination density function g(x) under the pre-change and post-change distributions.
However, we should emphasize that our proposed methods can easily be extended to more general cases where ɛ and g differ between h_0 and h_1, since our schemes only utilize the knowledge of f_0(x) and f_1(x), not of ɛ or g(x) (though the asymptotic or optimality properties will depend on ɛ and g). Under the hypothesis of no change, the data X_{k,n}'s are i.i.d. with density h_0, and we denote the corresponding probability measure and expectation by P_ɛ^(∞) and E_ɛ^(∞). Here and below, the subscript ɛ is used to highlight the proportion ɛ of outliers in the gross error model. Under the alternative hypothesis that a change occurs at time ν, m out of K data streams are affected, and for those m affected local data streams, the observations X_{k,n}'s are i.i.d. with density h_1 when

n ≥ ν, whereas the observations from the unaffected data streams are still i.i.d. with density h_0. The probability measure and expectation in this case are denoted by P_ɛ^(ν) and E_ɛ^(ν). Next, we present the mathematical formulation of our online monitoring problem under the standard minimax formulation for sequential change-point detection (Lorden, 1971). In our context, a statistical procedure is defined as a stopping time T, which represents the time when we raise an alarm to declare that a change has occurred. Here T is an integer-valued random variable, and the event {T = t} is based only on the observations in the first t time steps. Under the standard minimax formulation in Lorden (1971), when the number m of affected data streams, the contamination ratio ɛ and the contamination distribution g are known, one would like to find a stopping time T that asymptotically minimizes the detection delay

(3) D_ɛ(T) = sup_{ν ≥ 1} ess sup E_ɛ^(ν)[ (T − ν + 1)^+ | F_{ν−1} ]

for each and every combination of m affected local data streams, subject to the false alarm constraint

(4) E_ɛ^(∞)(T) ≥ γ

for some pre-specified large constant γ > 0. Here F_{ν−1} = (X_{1,[1,ν−1]}, ..., X_{K,[1,ν−1]}) denotes the past global information at time ν, and X_{k,[1,ν−1]} = (X_{k,1}, ..., X_{k,ν−1}) is the past local information for the k-th data stream. When m, ɛ or g is unknown, ideally the false alarm constraint in (4) and the detection delay minimization in (3) would hold uniformly over all possible ɛ and g. Since this is clearly impossible, we will investigate the asymptotic properties of our schemes for given m, ɛ and g, and then compare their efficiency and robustness with other procedures through asymptotic analysis and numerical simulations under various hypothetical conditions, especially with different m and ɛ.
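For a concrete stopping rule, both performance metrics can be estimated by Monte Carlo. The sketch below is purely illustrative and not a procedure from this paper: one data stream with f_0 = N(0, 1) and f_1 = N(1, 1), a simple cumulative-score stopping rule of our own choosing, and runs truncated at a finite horizon (so the average run length to false alarm is only a truncated, conservative estimate).

```python
import random

def stopping_time(xs, c=5.0):
    """First time n at which the cumulative score sum_{i<=n}(x_i - 0.5)
    reaches c, or len(xs) + 1 if it never does within the horizon.
    (x - 0.5 is the log-likelihood ratio of N(1,1) versus N(0,1).)"""
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s += x - 0.5
        if s >= c:
            return n
    return len(xs) + 1

def estimate_metrics(reps=200, horizon=400, seed=1):
    """Monte Carlo estimates of the two performance metrics: the average
    run length to false alarm (change-point nu = infinity, truncated at
    the horizon) and the average delay when the change occurs at nu = 1."""
    rng = random.Random(seed)
    arl = delay = 0.0
    for _ in range(reps):
        pre = [rng.gauss(0.0, 1.0) for _ in range(horizon)]   # no change
        post = [rng.gauss(1.0, 1.0) for _ in range(horizon)]  # change at nu = 1
        arl += stopping_time(pre) / reps
        delay += stopping_time(post) / reps
    return arl, delay
```

Running `estimate_metrics()` exhibits the tradeoff formalized above: raising the threshold c increases the average run length to false alarm at the cost of a longer detection delay.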
Finally, let us review the classical Cumulative Sum (CUSUM) procedure in Page (1954) for monitoring local data streams under the idealized model with ɛ = 0, and discuss the challenges of extending it to the gross error model. When the local distribution of the X_{k,n}'s may change from f_0 to f_1 at some unknown time ν, the problem of online monitoring local data streams can be formulated as repeatedly testing the null hypothesis H_0: ν = ∞ (i.e., no change) against the composite alternative hypothesis H_1: 1 ≤ ν < ∞ (i.e., a change occurs at some finite time) at each and every time step n. For the observed data X_{k,1}, ..., X_{k,n} at the k-th local stream at time n, the joint density function when the change occurs at ν is

f_ν(X_{k,1}, ..., X_{k,n}) = ∏_{i=1}^n f_0(X_{k,i}), if ν = ∞ or n + 1 ≤ ν < ∞;
f_ν(X_{k,1}, ..., X_{k,n}) = ∏_{i=1}^{ν−1} f_0(X_{k,i}) · ∏_{i=ν}^n f_1(X_{k,i}), if 1 ≤ ν ≤ n.

Then, for the k-th local data stream, the logarithm of the generalized likelihood ratio (GLR) statistic at time n is defined as

(5) W_{k,n} = max_{1 ≤ ν < ∞} log[ f_ν(X_{k,1}, ..., X_{k,n}) / f_{ν=∞}(X_{k,1}, ..., X_{k,n}) ] = max{ max_{1 ≤ ν ≤ n} ∑_{i=ν}^n log[ f_1(X_{k,i}) / f_0(X_{k,i}) ], 0 },

which can be computed recursively as

(6) W_{k,n} = max( W_{k,n−1} + log[ f_1(X_{k,n}) / f_0(X_{k,n}) ], 0 ) for n ≥ 1,

with the initial value W_{k,0} = 0. In the sequential change-point detection literature, W_{k,n} in (5) or (6) is referred to as the CUSUM statistic, and the classical CUSUM procedure raises a local alarm at the first time n when the CUSUM statistic W_{k,n} in (5) or (6) exceeds some pre-specified constant. It is not surprising that, being a GLR statistic, the CUSUM statistic W_{k,n} in (5) or (6) yields a statistically efficient procedure for monitoring local data streams under the idealized model with ɛ = 0, see Lorden (1971); Moustakides (1986); Ritov (1990), but its statistical efficiency degrades significantly in the presence of even mild outliers when monitoring local data streams under the gross error model in (2). While one may, in theory, still apply the GLR principle directly to the gross error model by maximizing over the uncertainty in the contamination ratio ɛ or the contamination distribution g, the corresponding local GLR statistic no longer has a recursive form, and thus the corresponding GLR procedure loses computational efficiency. Moreover, since the subset of affected data streams is unknown, the GLR principle would search over all possible combinations of affected local data streams, which can be huge for large-scale data streams. Hence, we want to develop alternative schemes that are efficient and scalable for monitoring large-scale data streams and can better balance the tradeoff between statistical efficiency and computational efficiency.

3. Our proposed methodology.
At a high level, our proposed scalable scheme monitors each local data stream individually in parallel, and then combines the local detection statistics together to raise a global alarm. For ease of understanding, we split the presentation into two subsections. Subsection 3.1 discusses how to construct robust local detection statistics in the presence of outliers via the contemporary offline robust concept of Lq-likelihood in Ferrari and Yang (2010); Qin and Priebe (2017), and Subsection 3.2 presents how to use the sum-shrinkage technique in Liu, Zhang and Mei (2017) to combine the local statistics together to raise a global alarm under uncertainty about the affected local data streams.

3.1. Construction of local Lα-CUSUM statistics. In this subsection, we apply the offline robust concept of Ferrari and Yang (2010) and Qin and Priebe (2017) to the online monitoring context to construct robust local detection statistics under the gross error model in (2). Recall that in the offline setting, when one has the full dataset available for analysis, one naive approach to dealing with outliers is to first detect and remove the outliers, and then conduct data analysis on the remaining data. However, such a naive approach is generally inefficient in the online setting of monitoring large-scale streams, since some abnormal observations might just be observations that are trying to tell us that a change has occurred, and removing them as outliers will prevent one from detecting the true change quickly. Here we follow the offline robust statistics literature: we keep all observations but de-emphasize the role of abnormal observations in online monitoring. To be more concrete, in offline statistics the log-likelihood function ∑_{i=1}^n log f_θ(X_i), or the log-likelihood ratio test statistic sup_{θ_1 ∈ Θ_1} ∑_{i=1}^n log f_{θ_1}(X_i) − sup_{θ_0 ∈ Θ_0} ∑_{i=1}^n log f_{θ_0}(X_i), plays an important role in point estimation or hypothesis testing under the idealized model, but its properties degrade significantly under the gross error model in (2). To better balance the tradeoff between efficiency and robustness, Ferrari and Yang (2010) propose a robust point estimator that maximizes ∑_{i=1}^n ([f_θ(X_i)]^α − 1)/α, and Qin and Priebe (2017) propose a robust hypothesis testing statistic by essentially considering sup_{θ_1 ∈ Θ_1} ∑_{i=1}^n ([f_{θ_1}(X_i)]^α − 1)/α − sup_{θ_0 ∈ Θ_0} ∑_{i=1}^n ([f_{θ_0}(X_i)]^α − 1)/α before bias correction. Note that the power transformation (u^α − 1)/α is known as the Box-Cox transformation in the data transformation context, where it transforms raw data so as to be closer to normally distributed; see Box and Cox (1964).
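The behavior of the Box-Cox transformation (u^α − 1)/α that drives its robustness can be checked numerically: for moderate likelihood values it tracks log u, while for tiny likelihood values (outliers) it is bounded below by −1/α. A small sketch, with α = 0.2 chosen arbitrarily for illustration:

```python
import math

def box_cox(u, alpha):
    """Box-Cox power transformation (u^alpha - 1)/alpha; -> log(u) as alpha -> 0."""
    return (u ** alpha - 1.0) / alpha

alpha = 0.2
# for a moderate likelihood value and small alpha, the transform tracks log u
u = 0.4
gap = abs(box_cox(u, 1e-4) - math.log(u))  # nearly zero
# for an outlier, the log-likelihood explodes but the transform stays bounded
tiny = 1e-12
print(math.log(tiny))          # very negative (about -27.6)
print(box_cox(tiny, alpha))    # bounded below by -1/alpha = -5
```

This is exactly the tradeoff described next: small α behaves like the log-likelihood (efficiency), while larger α caps the influence of any single outlier (robustness).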
Here the power transformation is applied to the likelihood function f(x), instead of to the raw data, in order to de-emphasize the role of abnormal observations in the robust context. A high-level intuition is as follows. On the one hand, for typical observations X_i, the value of f(X_i) should be moderate, and thus ([f(X_i)]^α − 1)/α ≈ log f(X_i) as α → 0; that is, the statistical efficiency of the likelihood function can be maintained. On the other hand, for outlier data X_i, the value of the likelihood function f(X_i) can be very small. Thus the log-likelihood log f(X_i) might go to −∞, but for a given α > 0, the value of ([f(X_i)]^α − 1)/α is bounded below by −1/α. Hence the effect of these outliers can be severe for the log-likelihood function, but is controlled under the power transformation. With a suitable choice of α, the power transformation ([f(x)]^α − 1)/α can strike a good balance between statistical efficiency and robustness. Below we extend the above robust concept to the online monitoring context to construct local

Lα-CUSUM statistics that are robust under the gross error model. To be more specific, starting from the classical CUSUM statistics in (5) and (6) for the idealized model, we propose to replace the log-likelihood ratio log f_1(X_{k,n}) − log f_0(X_{k,n}) by ([f_1(X_{k,n})]^α − 1)/α − ([f_0(X_{k,n})]^α − 1)/α = ([f_1(X_{k,n})]^α − [f_0(X_{k,n})]^α)/α for some α > 0 under the gross error model. That is, for the k-th local data stream, we define the local Lα-CUSUM statistic by the following recursive formula over time n:

(7) W_{α,k,n} = max( W_{α,k,n−1} + ([f_1(X_{k,n})]^α − [f_0(X_{k,n})]^α)/α, 0 ), for n ≥ 1, and W_{α,k,0} = 0.

Here α ≥ 0 is a tuning parameter that controls the tradeoff between statistical efficiency and robustness under the gross error model in (2). It is interesting to note that as α → 0, our proposed local Lα-CUSUM statistic W_{α,k,n} in (7) converges to the classical CUSUM statistic W_{k,n} in (5) or (6); the choice of α will be discussed later in Subsection 4.3.

3.2. Our proposed global monitoring scheme. When online monitoring large-scale data streams under the idealized model, a standard approach is to apply shrinkage to the post-change parameter estimates in order to deal with the uncertainty of sparse affected local data streams, see Xie and Siegmund (2013); Wang and Mei (2015); Chan (2017), but unfortunately such approaches are often computationally expensive. Recently, a scalable approach was proposed in Liu, Zhang and Mei (2017) for the case when the number m of affected local data streams is known. The key idea is to apply shrinkage to the local detection statistics, not to the local post-change parameters, since this can also filter out as many unaffected data streams as possible, e.g., via the order-thresholding transformation that keeps only the largest m local detection statistics. In this paper, we face an additional challenge: the number of affected local data streams is itself uncertain.
Fortunately, the soft-thresholding transformation turns out to be effective, since it not only filters out the unaffected streams, but also keeps any local data streams that might provide information about the changing event. Mathematically, our proposed global monitoring scheme is defined as the stopping time N_α(b, d) that raises a global alarm at the first time

(8) N_α(b, d) = inf{ n ≥ 1 : ∑_{k=1}^K max{0, W_{α,k,n} − d} ≥ b },

where W_{α,k,n} is the local Lα-CUSUM statistic in (7), the constant d ≥ 0 is a tuning parameter to filter out the unaffected data streams, and the control limit b > 0 is chosen to satisfy the false alarm constraint in (4).
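The scheme in (7)-(8) is straightforward to implement: each stream keeps a single number (its Lα-CUSUM statistic), updated recursively, and the global statistic is the sum of the soft-thresholded local statistics. A sketch for Gaussian pre-/post-change densities (f_0 = N(0, 1), f_1 = N(1, 1)); the tuning values α = 0.1, b = 5, d = 1 below are our own illustrative choices, not recommendations from the paper:

```python
import math

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

class LAlphaCusum:
    """Recursive L_alpha-CUSUM statistic of (7) for one data stream."""
    def __init__(self, alpha):
        self.alpha = alpha
        self.w = 0.0
    def update(self, x):
        f1, f0 = normal_pdf(x, 1.0), normal_pdf(x, 0.0)
        increment = (f1 ** self.alpha - f0 ** self.alpha) / self.alpha
        self.w = max(self.w + increment, 0.0)
        return self.w

def global_statistic(local_stats, d):
    """Sum of soft-thresholded local statistics, as in (8)."""
    return sum(max(0.0, w - d) for w in local_stats)

def monitor(streams, alpha=0.1, b=5.0, d=1.0):
    """Raise a global alarm at the first time the statistic in (8) exceeds b."""
    trackers = [LAlphaCusum(alpha) for _ in streams]
    horizon = len(streams[0])
    for n in range(horizon):
        stats = [t.update(s[n]) for t, s in zip(trackers, streams)]
        if global_statistic(stats, d) >= b:
            return n + 1  # alarm time
    return None  # no alarm within the horizon
```

Note the scalability claimed above: the memory requirement is one number per stream (O(K)), and each time step costs O(K) work that parallelizes trivially across streams.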

We should mention that, besides the soft-thresholding transformation, there are other approaches to combining the local detection statistics together to make a global alarm. Two popular approaches in the literature are the MAX and the SUM schemes, see Tartakovsky and Veeravalli (2008) and Mei (2010):

(9) N_{α,max}(b) = inf{ n ≥ 1 : max_{1 ≤ k ≤ K} W_{α,k,n} ≥ b },
(10) N_{α,sum}(b) = inf{ n ≥ 1 : ∑_{k=1}^K W_{α,k,n} ≥ b }.

Unfortunately, the MAX and SUM approaches are generally statistically inefficient except in the extreme cases of very few or very many affected local data streams. For the purpose of fair comparison, besides the methods in Chan (2017) for Gaussian data under the idealized model, we also consider several other comparison methods to better illustrate the advantages of our proposed global monitoring scheme in (8). Regarding robustness in the presence of local outliers, we compare our proposed local Lα-CUSUM statistic in (7) with the classical CUSUM statistic W_{k,n} in (5) or (6). In other words, the baseline scheme is the special case N_{α=0}(b, d) of our proposed scheme in (8) with α = 0, which is based on the soft-thresholding transformation of the local CUSUM statistics. On the other hand, regarding robustness with respect to the number of affected data streams, our proposed scheme N_α(b, d) in (8) will be compared to the MAX and SUM schemes, N_{α,max}(b) and N_{α,sum}(b) in (9) and (10), with the same parameter α.

4. Theoretical Properties. In this section, we investigate the performance properties of our proposed scheme N_α(b, d) in (8) under the gross error model in (2), and we pay special attention to the dimension effect of the number K of data streams as K → ∞. For that purpose, it is necessary to introduce two technical assumptions on the variable Y = ([f_1(X)]^α − [f_0(X)]^α)/α when X is distributed according to h_0 or h_1 under the gross error model in (2). Note that when α = 0, the variable Y should be interpreted as log(f_1(X)/f_0(X)).
The first assumption is the analogue of the Kullback-Leibler information of the idealized model.

Assumption 4.1. Given ɛ ≥ 0 and α ≥ 0, assume that

(11) I_1(ɛ, α) = E_{h_1}[ ([f_1(X)]^α − [f_0(X)]^α)/α ] = (1 − ɛ) E_{f_1}[ ([f_1(X)]^α − [f_0(X)]^α)/α ] + ɛ E_g[ ([f_1(X)]^α − [f_0(X)]^α)/α ]

is positive, where E_{h_1}, E_{f_1} and E_g denote the expectations when the density function of X is h_1, f_1 and g, respectively. We should emphasize that this assumption is rather mild for small ɛ, α > 0. For instance, when ɛ = α = 0, I_1(ɛ = 0, α = 0) becomes the well-known Kullback-Leibler information number

(12) I(f_1, f_0) = I_1(0, 0) = E_{f_1} log( f_1(X)/f_0(X) ),

which is always positive unless f_0 = f_1. Since all functions involved are continuous with respect to α and ɛ, it is reasonable to assume that I_1(ɛ, α) is also positive for small ɛ, α > 0. The second assumption involves some probability background. For a random variable Y with pdf s(y), assume that the moment generating function ϕ(λ) = E(e^{λY}) = ∫ e^{λy} s(y) dy is well-defined. Then ϕ(λ) is a convex function of λ with ϕ(0) = 1, and there often exists another, non-zero constant λ* such that ϕ(λ*) = 1; see Lemma 7.1 below. If such a λ* exists, it is easy to show that λ* > 0 if and only if E(Y) < 0, since ϕ(λ) is convex and ϕ′(0) = E(Y). Moreover, such a λ* allows us to construct a new probability density function q(y) = e^{λ* y} s(y), so that e^{λ* y} is just the likelihood ratio q(y)/s(y). Our second assumption essentially says that such a λ* > 0 exists for Y = ([f_1(X)]^α − [f_0(X)]^α)/α under the pre-change hypothesis, and is rigorously stated as follows.

Assumption 4.2. Given ɛ ≥ 0 and α ≥ 0, assume there exists a number λ(ɛ, α) > 0 such that

(13) 1 = E_{h_0} exp{ λ(ɛ, α) ([f_1(X)]^α − [f_0(X)]^α)/α } = (1 − ɛ) E_{f_0} exp{ λ(ɛ, α) ([f_1(X)]^α − [f_0(X)]^α)/α } + ɛ E_g exp{ λ(ɛ, α) ([f_1(X)]^α − [f_0(X)]^α)/α }.

When α = ɛ = 0, it is easy to see that λ(ɛ = 0, α = 0) = 1 in Assumption 4.2, since E_{f_0}(e^Y) = 1 for Y = log(f_1(X)/f_0(X)). This suggests that Assumption 4.2 is reasonable, at least when ɛ and α are small. With Assumptions 4.1 and 4.2, we are able to present the properties of our proposed scheme N_α(b, d) in (8) in the following subsections.
Subsection 4.1 discusses the false alarm properties, whereas Subsection 4.2 investigates the detection delay properties, including the robustness with respect to the number of affected local data streams. The choice of the tuning parameters of our proposed scheme N_α(b, d) in (8) is provided in Subsection 4.3. Since the false alarm robustness with respect to outliers is very important in practice, we present it separately in Section 5.
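In practice, λ(ɛ, α) in (13) can be computed numerically: ϕ(λ) = E_{h_0} exp(λY) is convex with ϕ(0) = 1, so the positive root of ϕ(λ) = 1 can be found by bisection once it is bracketed. A sketch for f_0 = N(0, 1) and f_1 = N(1, 1), with a hypothetical contamination density g = N(0, 25) and a simple Riemann-sum integration; both choices are ours, purely for illustration:

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def phi(lam, alpha, eps, grid):
    """Riemann-sum approximation of E_{h0} exp(lam * Y), where
    Y = ([f1(X)]^alpha - [f0(X)]^alpha)/alpha and
    h0 = (1 - eps) N(0,1) + eps N(0,25) (g chosen for illustration)."""
    total, dx = 0.0, grid[1] - grid[0]
    for x in grid:
        f0, f1 = normal_pdf(x, 0.0), normal_pdf(x, 1.0)
        y = (f1 ** alpha - f0 ** alpha) / alpha
        h0 = (1.0 - eps) * f0 + eps * normal_pdf(x, 0.0, 5.0)
        total += math.exp(lam * y) * h0 * dx
    return total

def solve_lambda(alpha, eps, lo=0.5, hi=4.0, tol=1e-6):
    """Bisection for the positive root of phi(lam) = 1 (Assumption 4.2);
    assumes the bracket satisfies phi(lo) < 1 < phi(hi)."""
    grid = [-12.0 + 0.002 * i for i in range(12001)]
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if phi(mid, alpha, eps, grid) < 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For ɛ = 0 and α close to 0, `solve_lambda` recovers λ ≈ 1, matching the remark following Assumption 4.2.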

4.1. False alarm analysis. In this subsection, we analyze the global false alarm rate of our proposed scheme N_α(b, d) in (8) for online monitoring of K local data streams under the gross error model in (2), no matter how large K is. The classical techniques in sequential change-point detection for one-dimensional data are based on change-of-measure arguments, and then use renewal theory to conduct overshoot analysis in the asymptotic setting as the global threshold b goes to ∞. Unfortunately, such renewal-theory-based analysis often yields poor approximations when the dimension K is moderately large, since the overshoot constant generally increases exponentially as a function of the dimension K. Moreover, it cannot be extended to the modern asymptotic regime where the number K of local data streams goes to ∞. In other words, these classical techniques are unable to provide deep insight into the effects of the dimension K. Here we present an alternative approach that is based on Chebyshev's inequality and provides useful bounds on the global false alarm rate regardless of how large the number K of data streams is.

Theorem 4.1. Suppose Assumption 4.2 holds for given ɛ ≥ 0 and α ≥ 0, i.e., λ(ɛ, α) > 0. If λ(ɛ, α)b > K exp{−λ(ɛ, α)d}, then the average run length to false alarm of our proposed scheme N_α(b, d) in (8) satisfies

(14) E_ɛ^(∞)[N_α(b, d)] ≥ (1/4) exp( λ(ɛ, α)b − K exp{−λ(ɛ, α)d} ).

The detailed proof of Theorem 4.1 is postponed to Subsection 7.1; here let us add some comments to better understand the theorem. First, to the best of our knowledge, our rigorous, non-asymptotic result in (14) is the first of its kind in the sequential change-point detection literature, and it holds no matter how large the number K of data streams is. This allows us to investigate the modern asymptotic regime where the dimension K goes to ∞.
Second, the assumption λ(ɛ, δ)b > K exp{−λ(ɛ, δ)d} essentially says that the global threshold b of our proposed scheme N_δ(b, d) in (8) should be large enough if one wants to control the global false alarm rate when online monitoring large-scale streams. In particular, in order to satisfy the false alarm constraint γ in (4), it is natural to set the right-hand side of (14) equal to γ. This yields a conservative choice of b that satisfies √(λ(ɛ, δ)b) = √(K exp{−λ(ɛ, δ)d}) + √(log(4γ)). Such a choice of b automatically satisfies the key assumption λ(ɛ, δ)b > K exp{−λ(ɛ, δ)d} of the theorem. Third, when ɛ = δ = 0, we have λ(ɛ = 0, δ = 0) = 1, and our lower bound (14) is similar to, though slightly looser than, the results in equation (3.17) of Liu, Zhang and Mei (2017),

whose arguments are heuristic under a more refined assumption on a certain tail distribution (see G(x) defined in (39) below). Here we provide a rigorous mathematical statement in Theorem 4.1 with fewer assumptions, though the price we pay is that the corresponding lower bound is a little looser. Finally, it turns out that our lower bound (14) provides the correct first-order term for the classical CUSUM procedure when online monitoring K = 1 data stream under the idealized model. In that case, we have ɛ = δ = d = 0, and the classical CUSUM procedure is the special case N_{δ=0}(b, d = 0) of our procedure. Since λ(ɛ = 0, δ = 0) = 1, our lower bound (14) applies for any b > 1 and shows that

(15)    liminf_{b→∞} (1/b) log E_∞^(ɛ=0)[N_{δ=0}(b, d = 0)] ≥ 1.

Meanwhile, for the classical CUSUM procedure, it is well-known from the classical renewal-theory-based techniques that lim_{b→∞} (1/b) log E_∞^(ɛ=0)[N_{δ=0}(b, d = 0)] = 1; see Lorden (1971). Hence, our lower bound (14) provides the correct first-order term for log E_∞^(ɛ)[N_δ(b, d)] in the one-dimensional case as b → ∞. As a result, we feel our lower bound in (14) is not bad in the modern asymptotic regime when the dimension K goes to ∞.

4.2. Detection delay analysis. In this subsection, we provide the detection delays of our proposed scheme N_δ(b, d) in (8) under the gross error model in (2) when m out of K data streams are affected by the occurring event, for some given 1 ≤ m ≤ K. The following theorem presents the detection delay properties, and the proof is postponed to Subsection 7.2.

Theorem 4.2. Suppose Assumption 4.1 of I_1(ɛ, δ) > 0 in (11) holds, and assume m out of K local data streams are affected. If b/m + d → ∞, then the detection delay of N_δ(b, d) satisfies

(16)    D_ɛ(N_δ(b, d)) ≤ (1 + o(1)) (1/I_1(ɛ, δ)) ( b/m + d ),

where the o(1) term does not depend on the dimension K, and might depend on m and δ as well as the distributions h_0 and h_1.
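As a quick numerical illustration of how the upper bound (16) scales, take the idealized Gaussian example with f_0 = N(0, 1) and f_1 = N(1, 1), for which I_1(0, 0) = I(f_1, f_0) = 1/2; the threshold and shrinkage values below are hypothetical, and the o(1) term is dropped.

```python
# KL divergence I(f1, f0) between N(1,1) and N(0,1): a mean shift mu at unit
# variance gives mu^2 / 2, so I1 = 0.5 here.
I1 = 0.5

def delay_upper_bound(b, m, d, info=I1):
    """First-order upper bound (16) on the detection delay (o(1) dropped)."""
    return (b / m + d) / info

# The bound shrinks as the number m of affected streams grows, since the
# global threshold b is effectively shared across the affected streams.
for m in (1, 10, 100):
    print(m, delay_upper_bound(b=200.0, m=m, d=5.0))
```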
So far, Theorems 4.1 and 4.2 have investigated the performance properties of our proposed scheme N_δ(b, d) in (8) without considering the false alarm constraint γ in (4). Let us now investigate the detection delay properties of N_δ(b, d) under the gross error model in (2), subject to the false alarm constraint γ in (4).

The following corollary characterizes such detection delay properties when the number m of affected data streams is known. It also includes suitable choices of the soft-thresholding parameter d and the global detection threshold b under the asymptotic regime in which the false alarm constraint γ = γ(K) → ∞ as the dimension K → ∞, whereas the number of affected data streams m = m(K) may or may not go to ∞.

Corollary 4.1. Under the assumptions of Theorems 4.1 and 4.2, for a given δ ≥ 0 and a given d ≥ 0, a choice of global detection threshold

(17)    b_γ = (1/λ(ɛ, δ)) ( √(log(4γ)) + √(K exp{−λ(ɛ, δ)d}) )²

will guarantee that our proposed scheme N_δ(b, d) satisfies the global false alarm constraint γ in (4). Moreover, in the asymptotic regime when the false alarm constraint γ = γ(K) → ∞ and m = m(K) << min(log γ, K) as the dimension K → ∞, with b = b_γ in (17), a first-order optimal choice of the soft-thresholding parameter d that minimizes the upper bound on the detection delay in (16) is

(18)    d_opt = (1/λ(ɛ, δ)) { log(K/m) + log( (log γ)/m ) },

and the detection delay of the corresponding optimized scheme N_δ(b_γ, d_opt) in (8) satisfies

(19)    D_ɛ(N_δ(b_γ, d_opt)) ≤ ( (1 + o(1)) / (λ(ɛ, δ) I_1(ɛ, δ)) ) { (log γ)/m + log( (log γ)/m ) + log(K/m) }.

Proof: The choice of b = b_γ in (17) follows directly from Theorem 4.1. To prove (18), we abuse notation and write λ for λ(ɛ, δ) for simplicity. By Theorem 4.2, the optimal d is the non-negative value that minimizes the function

(20)    l(d) := b_γ/m + d = (1/(λm)) ( √(log(4γ)) + √(K e^{−λd}) )² + d.

This is an elementary optimization problem, and the optimal d can be found by taking the derivative of l(d) with respect to d, since l(d) is a convex function of d. To see this,

l'(d) = 1 + log(4γ)/(4m) − (1/m) ( √(K e^{−λd}) + (1/2)√(log(4γ)) )²,
l''(d) = (λ/m) ( √(K e^{−λd}) + (1/2)√(log(4γ)) ) √(K e^{−λd}) > 0.

Thus l(d) is a convex function on [0, +∞), and the optimal value d_opt can be found by setting l'(d) = 0:

√(K e^{−λd}) = √( m + log(4γ)/4 ) − (1/2)√(log(4γ)).

This gives the unique optimal value

(21)    d_opt = (1/λ) log [ K / ( √( m + log(4γ)/4 ) − (1/2)√(log(4γ)) )² ]
              = (1/λ) [ log( ( √( m + log(4γ)/4 ) + (1/2)√(log(4γ)) )² / m ) + log(K/m) ],

which is equivalent to (18) under the assumption that m = m(K) << min(log γ, K). Plugging d = d_opt from (21) back into (17) yields (19), and thus the corollary is proved.

Note that on the right-hand side of (19), the dominant order is max( (log γ)/m, log(K/m) ), and the second term log( (log γ)/m ) might be negligible. However, we decided to keep it in Corollary 4.1, since this term reflects the effect of the assumed number of affected data streams. In practice, we often do not know the true number m of affected data streams. We may make an imperfect assumption that m_0 out of K data streams are affected, and then adopt the corresponding mis-optimized scheme N_δ(b_γ, d_opt) in Corollary 4.1 under the imperfect assumption m = m_0; e.g., d_opt in (18) is defined with m = m_0. The following corollary establishes the robustness of our proposed scheme with respect to the number of affected local data streams.

Corollary 4.2. Assume the optimized scheme N_δ(b_γ, d_opt) in Corollary 4.1 is designed under the assumption that m_0 data streams are affected. When the true number of affected local data streams is m, if max(m, m_0) << min(log γ, K), its detection delay satisfies

(22)    D_ɛ(N_δ(b_γ, d_opt)) ≤ ( (1 + o(1)) / (λ(ɛ, δ) I_1(ɛ, δ)) ) { (log γ)/m + log( (log γ)/m_0 ) + log(K/m_0) },

which is asymptotically equivalent to the right-hand side of (19) whenever

(23)    log m − log m_0 << max( (log γ)/m, log(K/m) ).

Corollary 4.2 follows at once from Corollary 4.1, Theorem 4.2, and the fact that the difference between the right-hand sides of (22) and (19) is proportional to log m − log m_0. Condition (23) essentially means

that we do not mis-specify the number of affected data streams very badly. In such a scenario, Corollary 4.2 suggests that the detection delay of the mis-optimized scheme is similar to that of the correctly optimized scheme, and thus our proposed scheme is robust with respect to the assumed number of affected data streams.

It is useful to add some remarks to better understand Corollary 4.1, as research is rather limited in the sequential change-point detection literature in the modern asymptotic regime when the number K of data streams goes to ∞. If we compare the optimal soft-thresholding parameter d_opt in (18) with the minimum detection delay in (19), the effects of the dimension K are the same, but the effects of the false alarm constraint γ are different. Thus different asymptotic scenarios may arise depending on the asymptotic orders of (log γ)/m, log(K/m) and log( (log γ)/m ), and below we consider several extreme cases.

First, let us consider the extreme case when log(K/m) << log( (log γ)/m ), i.e., K << log γ. This is consistent with the classical asymptotic regime when K is fixed and the false alarm constraint γ goes to ∞. In this case, for our proposed scheme, the minimum detection delay in (19) is of order (log γ)/m. To be more concrete, for the idealized model with ɛ = 0, the optimal choice is δ = 0, with λ(ɛ = 0, δ = 0) = 1 and I_1(ɛ = 0, δ = 0) = I(f_1, f_0), the Kullback-Leibler divergence. Hence the delay of N_{δ=0}(b_γ, d_opt) would be bounded above by ( (1 + o(1)) / I(f_1, f_0) ) (log γ)/m. Meanwhile, under the idealized model, for any scheme T satisfying the false alarm constraint γ in (4), it is well-known that D_{ɛ=0}(T) ≥ ( (1 + o(1)) / I(f_1, f_0) ) (log γ)/m as γ goes to ∞; see Mei (2010). This suggests that our proposed scheme with δ = 0 attains the classical asymptotic lower bound under the idealized model with ɛ = 0 in the classical asymptotic regime of K << log γ.
Second, let us consider another extreme case when log(K/m) >> (log γ)/m, or equivalently, when log γ << m log(K/m). This may occur when the number m of affected data streams is fixed and log γ = o(log K), i.e., the false alarm constraint γ is relatively small as compared to K. In this case, both the optimal soft-thresholding parameter d_opt in (18) and the minimum detection delay in (19) are of order log(K/m), and the impact of the false alarm constraint γ is negligible. In other words, our proposed scheme needs to take at most O(log K) observations to detect the sparse post-change scenario when only m out of K data streams are affected. This is consistent with the modern asymptotic results in offline statistics that O(log p) observations can fully recover the signal in p-dimensional observations under the sparsity assumption; see Candes and Tao (2007). Third, the other extreme case is when log(K/m) and log( (log γ)/m ) have the same order. This can

occur if m = K^{1−β} and log γ = K^ζ for some 0 < β, ζ < 1, which was first investigated in Chan (2017) under the idealized model for Gaussian data. It is interesting to compare our results with those in Chan (2017). Under the idealized model with ɛ = 0, the optimal choice is δ = 0, and our results in Corollary 4.1 show that the detection delay of our proposed scheme is of order K^{ζ+β−1} + (ζ + 2β − 1) log K, which is of order log K if (1 − ζ)/2 < β < 1 − ζ, but of order K^{ζ+β−1} if ζ + β > 1. These two cases are exactly the assumptions in Theorems 1 and 4 of Chan (2017). While the assumption m << min(log γ, K) in Corollary 4.1 corresponds to ζ + β > 1, in which case our detection delay bound is identical to the optimal detection bound in Chan (2017), it is not difficult to see that the proof of Corollary 4.1 can be extended to the case of (1 − ζ)/2 < β < 1 − ζ, in which our results are only slightly weaker than those of Chan (2017), in the sense that the order is the same but our constant coefficient is larger. The latter is understandable because Chan (2017) used the Gaussian assumptions extensively to conduct a more careful detection delay analysis than our results in (16), and his results are more refined for Gaussian data under the idealized model. Meanwhile, our results are more general, as they are applicable to any distributions and to the gross error models. More importantly, our results give a simpler and more intuitive explanation of the assumptions in the theorems of Chan (2017), and provide deeper insight into online monitoring of large-scale data streams under general settings.

4.3. Optimal choices of tuning parameters. Note that there are three tuning parameters in our proposed scheme N_δ(b, d) in (8): the robustness parameter δ, the shrinkage parameter d and the control threshold b. For any given δ and d, the control threshold b is chosen to satisfy the false alarm constraint γ in (4).
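The calibration of b and d can be sketched numerically: the functions below implement the threshold choice (17) and the exact minimizer (21) of the objective l(d) in (20), with all parameter values hypothetical. Plugging b_γ back into the right-hand side of (14) recovers the target γ, and the closed-form d_opt matches a brute-force grid minimization of l(d).

```python
import math

def b_gamma(lam, K, d, gamma):
    """Conservative threshold choice (17) meeting the false alarm constraint."""
    root = math.sqrt(math.log(4.0 * gamma)) + math.sqrt(K * math.exp(-lam * d))
    return root ** 2 / lam

def l(d, lam, m, K, gamma):
    """Objective (20): l(d) = b_gamma / m + d."""
    return b_gamma(lam, K, d, gamma) / m + d

def d_opt_exact(lam, m, K, gamma):
    """Unique root of l'(d) = 0, i.e., the exact minimizer in (21)."""
    A = math.log(4.0 * gamma)
    u = math.sqrt(m + A / 4.0) - 0.5 * math.sqrt(A)   # u = sqrt(K exp(-lam*d))
    return math.log(K / u ** 2) / lam

# Hypothetical setting: lam = 1, m = 5 affected streams, K = 1000, gamma = 1e4.
lam, m, K, gamma = 1.0, 5, 1000, 1.0e4
d_exact = d_opt_exact(lam, m, K, gamma)
d_grid = min((i * 1e-3 for i in range(20000)), key=lambda d: l(d, lam, m, K, gamma))

# Sanity check: with b = b_gamma, the right-hand side of (14) equals gamma.
b = b_gamma(lam, K, d_exact, gamma)
check = 0.25 * math.exp((math.sqrt(lam * b) - math.sqrt(K * math.exp(-lam * d_exact))) ** 2)
print(d_exact, d_grid, check)
```

In this hypothetical setting the exact and grid minimizers agree to grid precision, illustrating the convexity argument in the proof of Corollary 4.1.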
Also, the (asymptotically) optimal choice of the shrinkage parameter d is given in (18), which is consistent with our intuition that the shrinkage parameter should depend on the number m of affected data streams. Below we focus on the choice of the robustness parameter δ, which balances the tradeoff between statistical efficiency and robustness under the gross error model. By (19) in Corollary 4.1, an optimal choice of δ is one that maximizes λ(ɛ, δ)I_1(ɛ, δ). For the purpose of better illustration, we treat δ = 0 as the baseline, since it corresponds to the classical CUSUM scheme that is optimal under the idealized model. Then relation (19) inspires us to define the asymptotic efficiency improvement of the proposed scheme N_δ(b, d) with δ ≥ 0, as compared to the baseline scheme N_{δ=0}(b, d), as

(24)    e(ɛ, δ) = ( λ(ɛ, δ) I_1(ɛ, δ) ) / ( λ(ɛ, δ = 0) I_1(ɛ, δ = 0) ) − 1.

Hence, the optimal choice of δ can also be defined by maximizing the efficiency improvement e(ɛ, δ). That is,

(25)    δ_opt(ɛ) = arg max_{δ ≥ 0} [ λ(ɛ, δ) I_1(ɛ, δ) ] = arg max_{δ ≥ 0} [ e(ɛ, δ) ].

It is non-trivial to derive the theoretical properties of δ_opt as a function of ɛ, as it depends on the relationships between f_0, f_1 and the contamination density g. One possible approach is to investigate the local structure of e(ɛ, δ) in the neighborhood of (ɛ, δ) = (0, 0) by considering the second-order Taylor expansions of λ(ɛ, δ) in (4.2) and I_1(ɛ, δ) in (4.1) at (ɛ, δ) = (0, 0). This allows us to approximate λ(ɛ, δ)I_1(ɛ, δ) as a quadratic polynomial in δ for a given ɛ. Maximizing this quadratic function shows that δ_opt(ɛ) = C_1 ɛ + o(ɛ) for some constant C_1 that depends only on f_0, f_1 and g. In other words, the optimal choice δ_opt(ɛ) appears to be linearly dependent on ɛ for very small ɛ. Unfortunately, this works only for very small ɛ, and the expression for the constant C_1 is too complicated to be useful in practice. It remains an open problem to derive a meaningful theoretical characterization of δ_opt for general values of ɛ, but the good news is that the numerical value of δ_opt can be found fairly easily.

Below we provide an efficient algorithm to numerically compute e(ɛ, δ) in (24) and the optimal δ_opt in (25) for given ɛ and g. The main tools are Monte Carlo integration and grid search, and our key idea for reducing the computational complexity is to run the Monte Carlo simulation once to compute λ(ɛ, δ) in (11) and I_1(ɛ, δ) in (13) simultaneously for many possible combinations of (ɛ, δ). When computing λ(ɛ, δ) in (11), we first generate one set of m (e.g., m = 10^4) i.i.d. random variables X_1^(1), ..., X_m^(1) from the density f_0, and another set of m i.i.d. random variables X_1^(2), ..., X_m^(2) from the density g. Next, we conduct a grid search by specifying a list of values for ɛ > 0, δ > 0 and λ > 0.
For each combination of those (ɛ, δ, λ), we compute the objective function

H(ɛ, δ, λ) = ( (1 − ɛ)/m ) Σ_{i=1}^m exp{ λ ( [f_1(X_i^(1))]^δ − [f_0(X_i^(1))]^δ ) }
           + ( ɛ/m ) Σ_{i=1}^m exp{ λ ( [f_1(X_i^(2))]^δ − [f_0(X_i^(2))]^δ ) }.

Then, for each fixed pair (ɛ, δ), we estimate λ(ɛ, δ) by numerically searching for the λ such that H(ɛ, δ, λ) ≈ 1. This algorithm is computationally efficient, as it only needs to generate the random variables X_i^(1) and X_i^(2) once; they can then be used repeatedly and simultaneously for all pairs (ɛ, δ) via matrix operations. Similar ideas can be applied to efficiently estimate I_1(ɛ, δ).
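The recipe above can be sketched in a few lines of NumPy for the concrete Gaussian example discussed below (f_0 = N(0, 1), f_1 = N(1, 1), g = N(0, 3²)); the objective H and the λ grid range follow the description in the text, while the variable names and the simple first-crossing search are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                          # Monte Carlo sample size, as in the text

def phi(x, mean=0.0, sd=1.0):
    """Gaussian probability density function."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# One set of draws from f0 = N(0,1) and one from the contamination g = N(0, 9);
# both sets are generated once and reused for every (eps, delta, lambda) triple.
x1 = rng.standard_normal(m)         # X^(1) ~ f0
x2 = 3.0 * rng.standard_normal(m)   # X^(2) ~ g

def H(eps, delta, lam):
    """Monte Carlo estimate of the objective H(eps, delta, lambda)."""
    t1 = np.exp(lam * (phi(x1, mean=1.0) ** delta - phi(x1) ** delta)).mean()
    t2 = np.exp(lam * (phi(x2, mean=1.0) ** delta - phi(x2) ** delta)).mean()
    return (1.0 - eps) * t1 + eps * t2

# Estimate lambda(eps, delta) as the first positive grid value at which H,
# which equals 1 at lambda = 0 and initially dips below 1 in this example,
# returns to 1.
eps, delta = 0.1, 0.2
grid = np.arange(0.1, 5.0, 0.05)
vals = np.array([H(eps, delta, lam) for lam in grid])
crossings = np.nonzero(vals >= 1.0)[0]
lam_hat = grid[crossings[0]] if crossings.size else None
print(lam_hat)
```

Because the two Monte Carlo samples are fixed once, evaluating H for all grid triples reduces to vectorized array operations, which is what makes the grid search over (ɛ, δ, λ) cheap.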

This allows us to efficiently compute e(ɛ, δ) for many combinations of (ɛ, δ). Finally, for a given ɛ ≥ 0, a brute-force exhaustive search over δ yields the optimal value that maximizes e(ɛ, δ).

As an illustration, we consider a concrete example in which f_0 is the pdf of N(0, 1), f_1 is the pdf of N(1, 1), and g is the pdf of N(0, 3²). We run the above-mentioned numerical algorithm with m = 10^4 random samples, where ɛ varies from 0 to 0.15 with step size 0.01, δ varies from 0 to 0.9 with step size 0.01, and λ varies from 0.1 to 5. The computation time is around 8 minutes on a Windows 10 laptop with an Intel i5-6200U CPU at 2.30 GHz.

Figure 2 plots e(ɛ, δ) as a function of the tuning parameter δ for several fixed ɛ. From Figure 2, it is clear that when ɛ = 0, the e(ɛ = 0, δ) curve (red curve) is linearly decreasing as a function of δ ≥ 0, and thus the optimal choice of δ is 0 for ɛ = 0. This is consistent with the optimality properties of the CUSUM statistic under the idealized model without outliers. Meanwhile, for any other contamination rate ɛ > 0, the e(ɛ, δ) curve first increases and then decreases as δ increases. Thus the optimal choice δ_opt is often positive when ɛ > 0. For instance, when ɛ = 0.1, Figure 2 (blue curve) shows that δ_opt(ɛ = 0.1) ≈ 0.21 and e(ɛ = 0.1, δ = 0.21) ≈ 0.63. This suggests that our proposed L_δ-CUSUM based scheme with δ = 0.21 will be 63% more efficient than the baseline CUSUM based scheme under the gross error model when there are 10% outliers. Figure 3 shows the efficiency improvement of our proposed L_δ-CUSUM based scheme with δ = 0.21 under different contamination ratios ɛ from 0 to 0.15. From the plot, we can see that, as compared to the classical CUSUM based method, our proposed L_δ-CUSUM based scheme with δ = 0.21 gains 40%–70% more efficiency when the contamination ratio ɛ ∈ [2%, 15%], and the price we pay is to lose 5% efficiency under the idealized model with ɛ = 0.

5. The breakdown point analysis.
In classical offline robust statistics, the breakdown point is one of the most popular measures of the robustness of statistical procedures. At a high level, in the context of finite samples, the breakdown point is the smallest percentage of contamination that may cause an estimator or statistical test to perform arbitrarily poorly. For instance, when estimating the parameter of a distribution, the breakdown point of the sample mean is 0, since a single outlier can completely change the value of the sample mean, whereas the breakdown point of the sample median is 1/2. This suggests that the sample median is more robust than the sample mean. Since the pioneering work of Hampel (1968) on the asymptotic definition of the breakdown point, much research has been done to investigate the breakdown point of different robust estimators or


Introduction to Bayesian Statistics Bayesian Parameter Estimation Introduction to Bayesian Statistics Harvey Thornburg Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California

More information

Composite Hypotheses and Generalized Likelihood Ratio Tests

Composite Hypotheses and Generalized Likelihood Ratio Tests Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve

More information

Dynamic Data-Driven Adaptive Sampling and Monitoring of Big Spatial-Temporal Data Streams for Real-Time Solar Flare Detection

Dynamic Data-Driven Adaptive Sampling and Monitoring of Big Spatial-Temporal Data Streams for Real-Time Solar Flare Detection Dynamic Data-Driven Adaptive Sampling and Monitoring of Big Spatial-Temporal Data Streams for Real-Time Solar Flare Detection Dr. Kaibo Liu Department of Industrial and Systems Engineering University of

More information

Online Seismic Event Picking via Sequential Change-Point Detection

Online Seismic Event Picking via Sequential Change-Point Detection Online Seismic Event Picking via Sequential Change-Point Detection Shuang Li, Yang Cao, Christina Leamon, Yao Xie H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology

More information

Quickest Anomaly Detection: A Case of Active Hypothesis Testing

Quickest Anomaly Detection: A Case of Active Hypothesis Testing Quickest Anomaly Detection: A Case of Active Hypothesis Testing Kobi Cohen, Qing Zhao Department of Electrical Computer Engineering, University of California, Davis, CA 95616 {yscohen, qzhao}@ucdavis.edu

More information

Performance of Certain Decentralized Distributed Change Detection Procedures

Performance of Certain Decentralized Distributed Change Detection Procedures Performance of Certain Decentralized Distributed Change Detection Procedures Alexander G. Tartakovsky Center for Applied Mathematical Sciences and Department of Mathematics University of Southern California

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

A Novel Asynchronous Communication Paradigm: Detection, Isolation, and Coding

A Novel Asynchronous Communication Paradigm: Detection, Isolation, and Coding A Novel Asynchronous Communication Paradigm: Detection, Isolation, and Coding The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

Approximation of Average Run Length of Moving Sum Algorithms Using Multivariate Probabilities

Approximation of Average Run Length of Moving Sum Algorithms Using Multivariate Probabilities Syracuse University SURFACE Electrical Engineering and Computer Science College of Engineering and Computer Science 3-1-2010 Approximation of Average Run Length of Moving Sum Algorithms Using Multivariate

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

An Effective Approach to Nonparametric Quickest Detection and Its Decentralized Realization

An Effective Approach to Nonparametric Quickest Detection and Its Decentralized Realization University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Doctoral Dissertations Graduate School 5-2010 An Effective Approach to Nonparametric Quickest Detection and Its Decentralized

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

Optimization for Compressed Sensing

Optimization for Compressed Sensing Optimization for Compressed Sensing Robert J. Vanderbei 2014 March 21 Dept. of Industrial & Systems Engineering University of Florida http://www.princeton.edu/ rvdb Lasso Regression The problem is to solve

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

The Moment Method; Convex Duality; and Large/Medium/Small Deviations Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Ordinal Optimization and Multi Armed Bandit Techniques

Ordinal Optimization and Multi Armed Bandit Techniques Ordinal Optimization and Multi Armed Bandit Techniques Sandeep Juneja. with Peter Glynn September 10, 2014 The ordinal optimization problem Determining the best of d alternative designs for a system, on

More information

Robustness Meets Algorithms

Robustness Meets Algorithms Robustness Meets Algorithms Ankur Moitra (MIT) ICML 2017 Tutorial, August 6 th CLASSIC PARAMETER ESTIMATION Given samples from an unknown distribution in some class e.g. a 1-D Gaussian can we accurately

More information

Lecture 22: Error exponents in hypothesis testing, GLRT

Lecture 22: Error exponents in hypothesis testing, GLRT 10-704: Information Processing and Learning Spring 2012 Lecture 22: Error exponents in hypothesis testing, GLRT Lecturer: Aarti Singh Scribe: Aarti Singh Disclaimer: These notes have not been subjected

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-088 The public reporting burden for this collection of information is estimated to average hour per response, including the time for reviewing instructions,

More information

Quantifying Stochastic Model Errors via Robust Optimization

Quantifying Stochastic Model Errors via Robust Optimization Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations

More information

Bayesian Quickest Change Detection Under Energy Constraints

Bayesian Quickest Change Detection Under Energy Constraints Bayesian Quickest Change Detection Under Energy Constraints Taposh Banerjee and Venugopal V. Veeravalli ECE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign, Urbana,

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Lessons learned from the theory and practice of. change detection. Introduction. Content. Simulated data - One change (Signal and spectral densities)

Lessons learned from the theory and practice of. change detection. Introduction. Content. Simulated data - One change (Signal and spectral densities) Lessons learned from the theory and practice of change detection Simulated data - One change (Signal and spectral densities) - Michèle Basseville - 4 6 8 4 6 8 IRISA / CNRS, Rennes, France michele.basseville@irisa.fr

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Optimum Joint Detection and Estimation

Optimum Joint Detection and Estimation Report SSP-2010-1: Optimum Joint Detection and Estimation George V. Moustakides Statistical Signal Processing Group Department of Electrical & Computer Engineering niversity of Patras, GREECE Contents

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

Accuracy and Decision Time for Decentralized Implementations of the Sequential Probability Ratio Test

Accuracy and Decision Time for Decentralized Implementations of the Sequential Probability Ratio Test 21 American Control Conference Marriott Waterfront, Baltimore, MD, USA June 3-July 2, 21 ThA1.3 Accuracy Decision Time for Decentralized Implementations of the Sequential Probability Ratio Test Sra Hala

More information

Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization

Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization Proceedings of Machine Learning Research vol 65:1 33, 017 Ignoring Is a Bliss: Learning with Large Noise Through Reweighting-Minimization Daniel Vainsencher Voleon Shie Mannor Faculty of Electrical Engineering,

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Robust Statistics, Revisited

Robust Statistics, Revisited Robust Statistics, Revisited Ankur Moitra (MIT) joint work with Ilias Diakonikolas, Jerry Li, Gautam Kamath, Daniel Kane and Alistair Stewart CLASSIC PARAMETER ESTIMATION Given samples from an unknown

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Sandeep Juneja Tata Institute of Fundamental Research Mumbai, India joint work with Peter Glynn Applied

More information

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

Decentralized Detection In Wireless Sensor Networks

Decentralized Detection In Wireless Sensor Networks Decentralized Detection In Wireless Sensor Networks Milad Kharratzadeh Department of Electrical & Computer Engineering McGill University Montreal, Canada April 2011 Statistical Detection and Estimation

More information

Minimum Hellinger Distance Estimation with Inlier Modification

Minimum Hellinger Distance Estimation with Inlier Modification Sankhyā : The Indian Journal of Statistics 2008, Volume 70-B, Part 2, pp. 0-12 c 2008, Indian Statistical Institute Minimum Hellinger Distance Estimation with Inlier Modification Rohit Kumar Patra Indian

More information

arxiv: v1 [math.st] 13 Sep 2011

arxiv: v1 [math.st] 13 Sep 2011 Methodol Comput Appl Probab manuscript No. (will be inserted by the editor) State-of-the-Art in Sequential Change-Point Detection Aleksey S. Polunchenko Alexander G. Tartakovsky arxiv:119.2938v1 [math.st]

More information

COMPARISON OF THE ESTIMATORS OF THE LOCATION AND SCALE PARAMETERS UNDER THE MIXTURE AND OUTLIER MODELS VIA SIMULATION

COMPARISON OF THE ESTIMATORS OF THE LOCATION AND SCALE PARAMETERS UNDER THE MIXTURE AND OUTLIER MODELS VIA SIMULATION (REFEREED RESEARCH) COMPARISON OF THE ESTIMATORS OF THE LOCATION AND SCALE PARAMETERS UNDER THE MIXTURE AND OUTLIER MODELS VIA SIMULATION Hakan S. Sazak 1, *, Hülya Yılmaz 2 1 Ege University, Department

More information

SEQUENTIAL CHANGE DETECTION REVISITED. BY GEORGE V. MOUSTAKIDES University of Patras

SEQUENTIAL CHANGE DETECTION REVISITED. BY GEORGE V. MOUSTAKIDES University of Patras The Annals of Statistics 28, Vol. 36, No. 2, 787 87 DOI: 1.1214/95367938 Institute of Mathematical Statistics, 28 SEQUENTIAL CHANGE DETECTION REVISITED BY GEORGE V. MOUSTAKIDES University of Patras In

More information

10-704: Information Processing and Learning Fall Lecture 24: Dec 7

10-704: Information Processing and Learning Fall Lecture 24: Dec 7 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 24: Dec 7 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of

More information

Surveillance of BiometricsAssumptions

Surveillance of BiometricsAssumptions Surveillance of BiometricsAssumptions in Insured Populations Journée des Chaires, ILB 2017 N. El Karoui, S. Loisel, Y. Sahli UPMC-Paris 6/LPMA/ISFA-Lyon 1 with the financial support of ANR LoLitA, and

More information