Monitoring Multivariate Data via KNN Learning

Size: px

Start display at page:

Download "Monitoring Multivariate Data via KNN Learning"

Diane Lloyd
5 years ago
Views:

1 Monitoring Multivariate Data via KNN Learning Chi Zhang 1, Yajun Mei 2, Fugee Tsung 1 1 Department of Industrial Engineering and Logistics Management, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong 2 H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology Abstract Process monitoring of multivariate quality attributes is important in many industrial applications, in which rich historical data are often available thanks to modern sensing technologies. While multivariate statistical process control (SPC) has been receiving increasing attention, existing methods are often inadequate as they either cannot deliver satisfactory detection performance or cannot cope with massive amounts of complex data. In this paper, we propose a novel k-nearest neighbors empirical cumulative sum (KNN-ECUSUM) control chart for monitoring multivariate data by utilizing historical data under in-control and out-of-control scenarios. Our proposed method utilizes the k-nearest neighbors (KNN) algorithm for dimension reduction to transform multi-attribute data into univariate data, and then applies the CUSUM procedure to monitor the change in the empirical distributions of the transformed univariate data. Simulation studies and a real industrial example based on a disk monitoring system demonstrates the effectiveness of our proposed method. Keywords: Category Variable, CUSUM, Empirical Probability Mass Function, KNN, Statistical Process Control 1

2 1 Introduction Due to the rapid development of sensing technologies, modern industrial processes are generating large amounts of real-time measurements taken by many different kinds of sensors to reflect system status. For instance, in chemical factories, the boiler temperature, air pressure and chemical action time are all recorded in real time during the boiler heating process. For the purpose of quality control, it is important to take advantage of these real-time measurements and develop efficient statistical methods that can detect undesired events as quickly as possible before the catastrophic failure of the system. The early detection of undesired events has been investigated in the subject of the subfield of statistical process control (SPC), and the corresponding statistical methods are often referred as control charts. In general, a control chart computes a real-valued monitoring statistic at each time step, and raises an alarm whenever this monitoring statistic exceeds a pre-specified control limit. In the SPC framework, the performance of a control chart is often measured by two kinds of average run lengthes (ARLs): one is in-control (IC) ARL, and the other is out-of-control (OC) ARL. Here the IC and OC ARLs are defined as the average amount of time/number of observations from the start of monitoring to the first alarm, respectively, when the system is IC or OC. For a given IC ARL, a control chart with a smaller OC ARL will be able to monitor processes more efficiently. In the context of monitoring multivariate data, many control charts have been proposed in the literature, including the multivariate cumulative sum (MCUSUM, Woodall & Ncube, 1985; Crosier, 1988), the multivariate exponentially weighted moving average (MEWMA, Lowry et al. 1992), the regression-adjusted control chart (Hawkins, 1991) and charts based on variable selection (Zou & Qiu, 2009; Wang & Jiang, 2009). We recommend Lowry and Montgomery (1995), Lu et al. (1998) and Woodall and Montgomery (2014) for excellent literature reviews. Most of the aforementioned multivariate control charts make parametric assumptions, e.g., the data are multivariate normally distributed. However, for the observed multivariate data in modern processes, the multivariate normal assumption is often violated, and it is nontrivial, if not impossible, to model their true parametric distributions or to infer the model parameters. Recently, some nonparametric or model-free control charts have been proposed in the literature, see Sun & Tsung (2003); Hwang et al. (2007); Camci et al. (2008); Qiu (2008); Sukchotrat et al. (2010); Zou and Tsung (2011); 2

3 Ning & Tsung (2012); Zou et al. (2012); and the references therein. Also see Chakraborti et al. (2001) for more reviews. However, these existing nonparametric or model-free charts often require that the data distribution is unimodal, and thus they might be inadequate for monitoring modern sensing systems when the unimodal or other assumptions are violated. Thus, more comprehensive studies on nonparametric or model-free charts are called for to cope with modern processes. A new feature of modern processes is that rich historical data are often available thanks to modern sensing systems. The historical data often include both in-control (IC) and outof-control (OC) data. A concrete motivating example is the application of the hard disk drive monitoring system (HDDMS). The HDDMS is a computerized system that records various attributes related to disk failure and provides early warnings of disk degradation. The HDDMS collects data on 23,010 IC hard disks and 270 OC hard disks. For each disk, 14 attributes are recorded once every hour for one week. In this way, large amount of historical IC and OC data are provided, and the challenge is how to make use of these rich historical IC and OC data. While the historical IC data have been extensively used by the traditional SPC methods, unfortunately the OC data are often overlooked, although they could give additional insights to develop efficient control charts in some scenarios (Zhang et al. 2015). In this paper, we develop a nonparametric or model-free control chart for monitoring multivariate data with the focus of efficiently detecting OC patterns that are similar to those in the historical OC dataset. Our method applies the k-nearest neighbors (KNN) algorithm to convert the multivariate data into a one-dimensional category variable, and a Phase II CUSUM control chart is constructed based on the estimated empirical probability mass function (p.m.f) of the one-dimensional category variable under both IC and OC states. Our proposed monitoring scheme has the following advantages: (i) it is data-driven and can be easily adapted to any multivariate distribution; (ii) it makes full use of historical IC and OC data and holds desirable performance; (iii) it is statistically appealing since it is based on the empirical p.m.f of the derived one-dimensional category variable; (iv) it is easy to implement and compute. The remainder of this paper is organized as follows. The problem of monitoring multivariate data is stated in Section 2, and our proposed KNN-based control chart is presented 3

4 in Section 3. Extensive simulation studies are reported in Section 4, and the case study involving the HDDMS data is presented in Section 5. Several concluding remarks are made in Section 6, and some of the technical or mathematical details are provided in the appendix. 2 Problem Description and Literature Review Following the standard SPC literature, we monitor a sequence of independent p-dimensional random vectors {x 1, x 2,..., x i,...} over time i = 1, 2,, from some process. Initially the process is in the IC state, and the cumulative distribution function (c.d.f.) of the x i is F (x; θ 0 ) for some IC parameter θ 0. At some unknown time point τ, the process is OC in the sense that the x i s have another c.d.f. F (x; θ 1 ) for some OC parameter θ 1. In other words, the multivariate data x i can be described by the following change-point model: x i { F (x; θ0 ) for i = 1,..., τ F (x; θ 1 ) for i = τ + 1,..., (1) where the change time τ is an unknown integer-valued parameter. When monitoring multivariate data process x i s, our goal is to detect the change time τ as soon as possible after the change occurs. This can be formulated as testing the simple null hypothesis H 0 : τ = (i.e., no change) against the composite alternative hypothesis H 1 : τ = 1, 2, (i.e., a change occurs at some finite time), but with the twist that we must test these hypotheses at each and every time step n until we feel we have enough evidence to reject the null hypothesis and declare that a change has occurred. A statistical procedure is defined as a control chart that raises an alarm at time N based on the first N observations, and the expectation, E(T ) is often referred as the IC or OC average run length (ARL), depending on whether the data is in the IC or OC state. One would like to find a control chart that minimizes the OC ARL, E oc (T ), subject to the following false alarm constraint under the IC state: E ic (T ) γ. (2) where γ > 0 is a pre-specified constant. In other words, the control chart is required to process at least γ observations on average before raising a false alarm when all observations are IC. 4

5 This problem is well studied under the parametric model when the distributions of x i under the IC and OC states are fully specified, and one classical method is the CUSUM procedure developed by Page (1954). To define the CUSUM procedure, denote by f ic ( ) and f oc ( ) the probability density functions (p.d.f.s) or p.m.f.s of the x i under IC and OC scenarios, respectively. Then the CUSUM statistic at time t can be defined recursively as for t = 1, 2, and W 0 W t = max{w t 1 + log f oc(x t ) f ic (x t ), 0} = 0. The CUSUM procedure will raise an alarm at the first time t whenever the CUSUM statistic W t L, where the pre-specified threshold L > 0 is chosen to satisfy the IC ARL constraint in (??). It is well known that the CUSUM statistic W t is actually the generalized likelihood ratio statistic of all observations up to time t, {x 1,, x t }, and the CUSUM procedure enjoys certain exactly optimal properties (Moustakides, 1986). Unfortunately the CUSUM procedure requires complete information of the IC and OC p.d.f.s, f ic ( ) and f oc ( ), and thus it may not be feasible in practice when it is non-trivial to model the distributions of multivariate data. In this article, we do not make any parametric assumptions on the c.d.f.s or p.d.f.s of the multivariate data x i under the IC or OC states. Instead we assume that two large historical datasets are available: one is an IC dataset, denoted by X IC = {x ic1, x ic2,...}, and the other is an OC dataset denoted by X OC = {x oc1, x oc2,...}. When monitoring the new observations x i, we are interested in detecting those OC patterns that are similar to the historical OC patterns. Our task is to utilize these historical IC and OC datasets to build an efficient control chart that can effectively monitor the possible changes in the new data x i, which can be regarded as the test data. As mentioned in the introduction, nonparametric control charts have been developed in the SPC literature (see Qiu, 2008; Zou et al., 2012), though they often only utilize the training IC dataset, and ignore the historical OC data. For the purpose of illustration and comparison, below we present the multivariate sign exponentially weighted moving average (MSEWMA) control chart proposed by Zou and Tsung (2011), which will be compared with our proposed method in our simulation studies in later sections. The MSEWMA control chart of Zou and Tsung (2011) is based on the sign test. The observed p-dimensional 5

6 multivariate vector x i is first unitized by the transformation and then the monitoring statistics are defined by v i = A 0(x i θ 0 ) A 0 (x i θ 0 ), (3) ω i = (1 λ)ω i 1 + λv i and Q i = 2 λ λ pω iω i. (4) In the MSEWMA control chart of Zou and Tsung (2011), an alarm is raised whenever Q i exceeds some prespecified threshold. The two parameters {θ 0, A 0 } in v i are estimated from the IC samples, and denotes the Euclidean norm. In general, the transformed unit vector v i in the MSEWMA control chart is uniformly distributed on the unit p-dimensional sphere when the observations are IC, and thus the statistic Q i is just the T 2 statistic of the Hotelling multivariate control chart applied to the p-dimensional univariate vectors v i. Clearly, the MSEWMA control chart of Zou and Tsung (2011), or in fact many other nonparametric methods, only utilize the training dataset from the IC scenario, and raise an alarm even if the observed data do not follow the historical IC patterns. By doing so, the benefit is to be able to detect any general OC patterns, but the disadvantage is to lose the statistical efficiency when one is interested in detecting specifical OC patterns that are similar to the historical ones. Our proposed nonparametric monitoring method utilizes both IC and OC data from the training datasets, and focuses on efficiently detecting OC patterns that are similar to those in the historical OC dataset. 3 The Proposed KNN-ECUSUM Control Chart Let us first provide a high-level description of our proposed k-nearest neighbors empirical cumulative sum (KNN-ECUSUM) control chart which is designed for monitoring p- dimensional observations x i based on the historical/training IC and OC datasets X IC and X OC. Our proposed KNN-ECUSUM control chart consists of the following three steps: Step 1: Use the KNN method to transform p-dimensional data x to one-dimensional category data z(x) = Z(x; X knn IC, X knn OC ), which is the proportion of the k nearest neighbors of x that are IC. Here X knn IC KNN classifier. and X knn OC are the subsets of training data for constructing the 6

7 Step 2: Estimate the empirical IC and OC probability mass functions (p.m.f.) of the one-dimensional category data z(x) by using the subsets of training IC and OC data X IC, X OC, which should be disjoint with the subsets X knn IC and X knn OC in Step 1. Step 3: Apply the classical CUSUM procedure to monitor the one-dimensional data z(x i ), where z(x i ) is computed for each new pieces of online data x i as in step 1, and the IC and OC p.m.f. of z(x i ) are estimated in Step 2. Below we will describe each component of our proposed KNN-ECUSUM control chart in detail and also provide some practical guidelines. The remainder of this section is organized as follows. Section 3.1 presents the construction of the transformation z(x) through the KNN method, and Section 3.2 discusses the estimation of the empirical p.m.f. of the transformed variable. In Section 3.3, the procedure for building the CUSUM chart for the one-dimensional data z(x i ), is elaborated. Section 3.4 gives a summary and practical guidance for our proposed control chart. 3.1 Construction of one-dimensional data z(x) In this subsection, we discuss the transformation of p-dimensional data x to one-dimensional data z(x) by applying the KNN algorithm to the IC and OC training data. For this purpose, we randomly select some training data, X knn IC and X knn OC, from the provided IC and OC training sets {X IC, X OC }. As a simple classification method, the KNN method is then applied to the datasets X knn IC and X knn OC to predict the label of any p-dimensional data x according to its nearest k neighbors. Here we extend the binary classification output of the KNN method to the k-category probability output. as To be more specific, the distance between two p-dimensional vectors {x i, x j } is defined dis(x i, x j ) = x i x j 2 where 2 denotes the Mahalanobis distance. Next, for any given p-dimensional vector x, let V (x, k) denote the k nearest points of x in the training datasets X knn IC and X knn OC under the above mentioned distance criterion. Note that the k points in the set V (x, k) may contain both IC and OC training samples in general. Then the transformed variable 7

8 z(x) is defined as the probability output of the KNN algorithm: z(x) = # of points in the intersection V (x, k) Xknn IC, (5) k i.e., z(x) is the proportion of IC samples in the k neighbors of x. Since the value of z(x) can only be i for an integer i = 0, 1, 2,, k, the transformed variable z(x) is a onedimensional category variable with k + 1 possible values. Hence the problem of k monitoring multivariate observations x i becomes a problem of monitoring one-dimensional category variables z(x i ) instead. Traditional multivariate control charts also convert multi-dimensional information into one-dimensional statistics that are often based on the probability distributions or the likelihood ratio test statistics of the multivariate data. However, from the statistical viewpoint, while the distribution of the aw p-dimensional observation x can also be estimated directly, e.g., by empirical likelihood (Owen 2001) and kernel density estimation (Terrell & Scott, 1992), such estimation generally becomes inefficient when the dimension p is medium or high. In our proposed method, we essentially use the KNN algorithm to conduct dimension reduction, so that the problem of monitoring the p-dimensional variables x i s is reduced to a problem of monitoring the one-dimensional variables z(x i ) s. 3.2 Estimation of the empirical probability mass function After using the KNN algorithm to convert the p-dimensional variable x to a one-dimensional category variable z(x), we face the problem of monitoring one-dimensional category data. Existing methods for monitoring category data are available from Marcucci (1985), Woodall (1997) and Montgomery (2009). However, these methods require counting the number of data points in each category, and are not suitable for our context where the category variable z(x) is calculated from a single observation x. Here our proposed method applies the classical CUSUM procedure to the one-dimensional category variable z(x), and for that purpose, we need to estimate the p.m.f. of z(x) under the IC and OC scenarios. Let us first review how the empirical p.m.f. of a category variable Y is estimated. Assume the category variable Y can take k + 1 possible values, say, t 0 t 1... t k, and assume that we observe G i.i.d. random samples y 1, y 2,..., y G of this variable. The 8

9 empirical p.m.f. of Y is ˆf(t j ) = 1 G n i=1 I(y i = t j ) = # of y i in j-th category ; j = 0, 1,..., k (6) G where I( ) denotes the indicator function. Next, let us illustrate how this empirical p.m.f. can be adapted to our context by applying the boostrapping or random sampling method to the IC and OC data. Taking the IC data as an example. Each time a subset of m IC observations are randomly (and possibly repeatedly) sampled from X IC (we exclude the observations in X knn IC that are used in step 1 of constructing the KNN classifier). After this subset of m pieces of IC data are input into the KNN classifier, we collect m transformed outputs z(x)s. This resampling and transformation procedure is repeated B times, and thus we will obtain a total of Bm category observations z(x)s. As in equation (??), the empirical p.m.f. of z(x) under the IC scenario is simply the proportion of Bm observations that fall under each category, and can be calculated as ˆf ic (z) = { # of z(x) s =z Bm 0 otherwise if z = j, j = 0, 1,..., k k (7) since z(x) equals only j, j = 0, 1,..., k. Also note that the empirical IC p.m.f. satisfies k k ˆf i=0 ic ( i ) = 1, which is the fundamental requirement for a mass function. Similarly, we k can derive the empirical OC p.m.f., ˆfoc (z), of the category variable z(x) under the OC scenario by applying the above procedure to the OC dataset X OC. Some issues on IC or OC samples for calculating the empirical densities will be discussed in more detail later in Subsection Monitoring the transformed variables z(x i ) After data transformation and p.m.f. estimation, our proposed control chart is to apply the classical CUSUM procedure to the transformed variable with the estimated IC and OC p.m.f.s. Specifically, when we monitor p-dimensional data x i in real time, we first use the KNN method in Step 1 to obtain the transformed variable z(x i ). In Step 2, when x i from either the IC or OC data distribution become available, the corresponding p.m.f. of the transformed variable z(x i ) can be estimated as ˆf ic (z) or ˆf oc (z). Thus the 9

10 problem of monitoring p-dimensional data x i can be simplified into one of monitoring the one-dimensional variables z(x i ) with a possible p.m.f. change from ˆf ic (z) to ˆf oc (z). Our proposed KNN-ECUSUM control chart simply involves applying the classical CUSUM procedure to the problem of monitoring the one-dimensional variable z(x i ) with the estimated IC and OC p.m.f.s, ˆfic (z) and ˆf oc (z). statistic as Our method defines the CUSUM W n = max(w n 1 + ˆf oc (z(x i )), 0) (8) ˆf ic (z(x i )) for n 1 with the initial value W 0 = 0. Our proposed KNN-ECUSUM control chart raises an alarm whenever W n exceeds a prespecified control limit L > 0. Recall that the CUSUM procedure is exactly optimal when the p.d.f.s/p.m.f.s under the IC and OC scenarios are completely specified. In our context, we utilize the historical IC and OC datasets to estimate the p.m.f. of the transformed variable. When the OC patterns are similar to those in the historical data, the empirical p.m.f. will be similar to the true p.m.f., and thus our proposed control chart would perform similarly to the optimal CUSUM procedure with the true density functions. 3.4 Summary and design issues of our KNN-ECUSUM method Our proposed KNN-ECUSUM control chart is summarized in Table 1. The novelty of our proposed control chart lies in combining four well established methods together: the KNN algorithm, the empirical probability mass function, bootstrapping, and CUSUM. The basic idea is to use the KNN algorithm as a dimension reduction tool. Rather than directly estimating the distributions of the raw p-dimensional variables, we monitor the transformed one-dimensional category variables, whose distribution can be easily estimated from their empirical densities. Our proposed KNN-ECUSUM control chart involves several tuning parameters, which must be chosen carefully when applying our control chart in practice. Below we will discuss how to choose these tuning parameters appropriately. 10

11 Table 1: Procedures for implementing our proposed KNN-ECUSUM control chart Algorithm 1: Procedure for Implementing the KNN-ECUSUM Chart Step 1 Step 2 Goal: To Build the KNN model for data transformation. Input: Training Samples X knn IC a given testing sample x. and X knn OC (sampled from X IC and X OC ); Output: The category variable z(x) = Z(x; X knn IC, X knn OC ). Goal: To calculate the empirical density of the transformed variable. Input: Samples X IC and X OC. Model: The KNN model z(x) built in the last step. Iterate: For the IC scenario, t = 1, 2,... B times. 1. Randomly select m IC samples from X IC where X knn IC are excluded. 2. The selected IC samples serve as input to z(x i ); collect the model outputs and determine the size of each category. Output: For the IC scenario, summarize the empirical densities after boostrapping; the densities are calculated from ˆf ic ( i k ) = #(testing result = i k ) Bm. Repeat: Using the same model, repeat the same iterative procedure for the OC scenario and estimate ˆf oc ( i k ). Step 3 Goal: To monitor online observations. Input: Online samples {x 1, x 2,..., }; control limit L. Charting Statistic: W n = max(w n 1 + ˆf oc (z(x i )) ˆf ic (z(x i )), 0), W 0 = 0. Output: If W n > L, the control chart raises an alarm; otherwise, the system is considered to be operating normally. 11

12 3.4.1 Selecting k in the KNN method In building the KNN model, the number k of neighbors is a very important parameter to set. Generally speaking, a small k may overfit the KNN training dataset and decrease the prediction accuracy. For the category variable z(x), a small k will also lead to imprecise output. On the other hand, a large k increases the computational load and may not necessarily bring a significant increase in the classification rate for prediction. Often people use cross-validation to achieve an appropriate k. Following the empirical rule of thumb, k is often chosen to be of order n knn, where n knn is the number of observations in the training IC and OC datasets reserved for the KNN method. This is used as the starting point in some KNN packages. Here we suggest choosing k = α n, where the value of α depends on practical conditions such as the differences between the IC and OC samples. Our experience from real-data studies suggests that α [0.5, 3] is reasonable Selecting the control limit When constructing the control chart, it is crucial to find an accurate control limit L for the monitoring statistics to satisfy the IC ARL constraint γ in (??). In our proposed KNN- ECUSUM control chart, the control limit L for the CUSUM statistics W n in (??) is related to both the IC and OC training datasets, and we propose a resampling method to find its value. For a given control limit L, we randomly select an observation from the IC training dataset X IC, and the CUSUM statistic W 1 is calculated as in (??). If the CUSUM statistic W 1 < L, another observation will be randomly selected from the IC training dataset X IC and the CUSUM statistic W 2 will be calculated. This procedure is repeated until the CUSUM statistic W T L after taking T observations from the IC training dataset. This completes one Monte Carlo simulation, and the number T of IC observations is recorded and is often called one realization of the run length. Then, we repeat the above steps n 0 times and obtain n 0 realizations of the run length, denoted by T 1,, T n0. The ARL of our proposed KNN-ECUSUM control chart with the control limit L is then estimated as ARL(L) ˆ = (T T n0 )/n 0. The control limit L is then adjusted through bisection search so that the obtained ARL(L) ˆ is close to the preset ARL constraint γ. In general, finding the control limit L is straightforward conceptually, 12

13 but unfortunately it can be very computationally demanding Sample size considerations The sample size is another factor that can affect the performance of our proposed KNN- CUSUM control chart. In the following, guidelines on choosing an appropriate sample size are provided. In Step 1 of our proposed control chart, it is crucial to have enough training samples for the KNN algorithm, since the quality of the transformed category variable z(x) can affect how our proposed method performs. However, limited by the computational capacity and the need to reserve sufficient training data for calculating empirical densities, it is also inappropriate for the KNN method to use all training data. In our studies below, when the size of the training dataset is N, the amount of training data for the KNN method in Step 1 is chosen as n knn O( N) for a large N or n knn [0.25, 0.75]N for a small N. In particular, for the real-data example in Section 5, we choose n knn 0.4N oc. In Step 2 of our proposed control chart, more samples will lead to more precise estimates of the p.m.f. of the transformed category variable. We suggest making use of much, or possibly all, of the training data especially when the training IC or OC dataset is not large enough. For a large training dataset, containing millions of observations, it might be acceptable to use a subset to estimate the p.m.f. It is important to note that the training data for building the KNN classifier in Step 1 should not overlap with those for estimating the empirical p.m.f. In the case of a small training dataset, repeated use of the training data might be acceptable, although it is generally not recommended. To better understand the effects of sample size, a simple simulation study is performed in later sections to see how the estimated densities approximate the true ones When multiple OC patterns exist So far we have only considered the binary classification when constructing the KNN classifier in Step 1, as we have assumed that all OC data exhibit a single OC pattern. However, sometimes multiple OC patterns may occur, and the historical OC data may contain more than one OC cluster. In this case, it might be inappropriate to keep the OC data in one 13

14 cluster, and more efforts are needed to investigate different OC clusters. It turns out that our proposed KNN-ECUSUM control chart can be easily extended to handle the case of multiple OC clusters. For that purpose, we need to divide the historical or training OC data into different clusters. This clustering procedure might be available during the Phase I analysis of data, or can be done by standard unsupervised learning methods such as the K-Means method, principal component analysis (PCA) and so on. From on our experience, while our proposed method can be extended to any number C of OC clusters, a smaller number of clusters will generally lead to better performance in monitoring changes, and thus a small number C 4 of OC clusters is suggested. Now we describe the extension of our proposed control chart. Denote the IC data and the C OC clusters from training data by X ic, X 1 oc..., X C oc. As in Step 1 in Table 1, we also build a KNN classifier but now it is a multi-class KNN classifier based on the subsets of these C + 1 training datasets. Next, we calculate the empirical p.m.f.s, ˆf ic ( ), ˆf oc( ), 1..., ˆf oc( ), C using boostrapping in Step 2. Notice that there are C +1 probability densities to be calculated, rather than two densities in the case of a single OC cluster. In Step 3, we construct the CUSUM statistic for each OC cluster and obtain C CUSUM statistics, where W 1,n = max(w 1,n 1 + ˆf 1 oc (d(x i)) ˆf ic (d(x i )), 0)),..., W C,n = max(w C,n 1 + ˆf C oc(d(x i )) ˆf ic (d(x i )), 0)). Then we define the monitoring statistics based on the maximum of all the C CUSUM statistics, W n = max(w 1,n, W 2,n,..., W C,n ), and raise an alarm when the maximum W n exceeds the control limit L. This control chart is intuitive as it is equivalent to raising an alarm if at least one CUSUM-type statistic W j,n exceeds the control limit L. 4 Simulation Studies In this section, we conduct extensive numerical simulation studies to demonstrate the usefulness of our proposed KNN-ECUSUM control chart. For the purpose of comparison, we also report the results of three alternative control charts. The two main competitors are the nonparametric MSEWMA chart in (??) proposed by Zou & Tsung (2011), and the traditional parametric MEWMA method (T 2 type). In addition, the performance of the classical parametric CUSUM procedure is also reported, since it is exactly optimal and provides the best possible bound when the distribution is known and correctly specified. 14

15 In Section 4.1, our KNN-ECUSUM method is compared with competitors for the case when there is only one cluster in the historical OC data. It is extended to the case of multiple clusters in Section 4.2. Section 4.3 contains sample size analysis, and we investigate how much training data are needed to estimate the empirical densities. Throughout the simulation, the IC ARL is fixed at 600. The MATLAB code for implementing the proposed method is available from the authors upon request. 4.1 When only one cluster exists in historical OC data In this subsection we will investigate the case when only one cluster exists in the historical OC data. Here we focus on detecting a change in the mean, and consider three different generative models of the p-dimensional multivariate data x i in order to evaluate the robust properties of control charts. The three different generative models are: the multivariate normal distribution N(0 p 1, Σ p p ), where p is the dimension; the multivariate t distribution t ζ (0 p 1, Σ p p ) with ζ degrees of freedom; the multivariate mixed distribution, rn(µ 1, Σ p p )+(1 r)n(µ 2, Σ p p ), i.e., a mixture of two multivariate normal distributions. In our simulation, we set r = 0.5, and the p 1 mean vector µ 1 (µ 2 ) is 3( 3) in the first component but is 0 for all other components. In our simulation, for all three generative models, the covariance matrices Σ p p of the p-dimensional variable x i are same: the diagonal elements of Σ p p are equal to 1, and the off-diagonal elements are set to 0.5. For each IC generative model, we consider the corresponding OC generative model that has a mean shift δ in the first component of the observations unless stated otherwise. We will consider different magnitudes of mean shift δ ranging from 0.2 to 2 with the step size 0.2. In addition, we will consider two choices of the dimension p: p = 6 and p = 20. The tuning parameter λ in the MEWMA method (T 2 type) and the MSEWMA chart of Zou & Tsung (2011) is set λ = 0.2 for simplicity. For the classic CUSUM procedure, the OC shift magnitude is assumed known beforehand in our simulation. Thus, the classic CUSUM method delivers the lowest bound in OC detection. 15

16 We emphasize that our proposed control chart does not use any information pertaining to the generative IC or OC model, and will only use the data generated from the IC or OC model: the KNN algorithm in Step 1 is based on 1000 pieces of IC and OC training data with the number of nearest neighbors k = 30, and the p.m.f. estimation in Step 2 is based on 100, 000 pieces of IC and OC training data with B = 1000 loops for bootstrapping. To illustrate the rationale of our proposed control chart, Figures 1-3 plot the estimated p.m.f. of the transformed one-dimensional category data z(x i ) under the three generative models. In Figure 1, two discrete densities of the KNN classifier s output are shown. From the generative model of the multivariate normal distribution, the estimated density of IC samples ( ˆf ic ) increases as the KNN output increases (except in some extreme situations), whereas the density of the OC cluster ( ˆf oc ) gradually decreases. This is consistent with our intuition that the neighboring samples of an IC sample are likely from the IC distribution, and the close neighbours of a OC sample have high OC probability densities. In Figures 2 and 3, a similar conclusion can be drawn from the multivariate t or multivariate mixed data. The significant differences in the estimated empirical IC and OC densities in Figures 1-3 demonstrate that the transformed variable z(x i ) can be used to detect changes in the original multivariate data x i, and thus our proposed KNN-ECUSUM control chart is reasonable. Tables 2-4 summarize the simulated ARL performance of different control charts under three different generative models. Table 2 presents the case of detecting mean shifts for the multivariate normal distribution, and it is clear that our proposed KNN-ECUSUM chart performs slightly worse than the optimal CUSUM chart, yet it has a much smaller OC ARL as compared to the MEWMA or MSEWMA control charts for all different kind of mean shift magnitudes. From Table 2, the performance of the MSEWMA control chart is similar to that of the MEWMA chart, but our proposed method has a much better performance in terms of a smaller OC ARL, especially when the mean shift magnitude is small. Our method is effective because it uses the OC information, which is missed in MEWMA or MSEWMA control chart. Thus, it is not surprising that our method can detect true changes more quickly than the alternatives. Table 3 reports the simulation results for multivariate t distributed data. The superiority of our method is maintained, e.g., our proposed KNN-ECUSUM chart still works 16

17 Table 2: (OC) ARL performance comparison for multivariate normal data Dimension Shift Size KNN-ECUSUM MEWMA MSEWMA CUSUM p δ λ = 0.2 λ = (595) 600(601) 602(594) 599(597) (126) 318(314) 322(323) 122(120) (34.3) 101(95.3) 104(97.3) 35.5(31.7) (13.5) 36.7(30.8) 39.5(33.0) 15.1(11.6) (6.47) 18.0(12.5) 20.2(13.4) 8.67(5.72) (3.76) 11.0(6.48) 13.3(7.24) 5.87(3.30) (2.44) 7.84(3.83) 9.80(4.36) 4.43(2.15) (1.74) 6.08(2.63) 7.97(2.87) 3.58(1.60) (1.32) 4.97(2.00) 6.93(2.18) 3.00(1.21) (1.09) 4.25(1.56) 6.24(1.78) 2.62(0.98) (0.89) 3.69(2.66) 5.76(1.45) 2.31(0.80) 0 600(586) 599(606) 599(589) 601(589) (127) 410(408) 396(400) 118(114) (36.0) 171(167) 163(158) 33.1(29.6) (13.6) 62.9(57.5) 62.1(54.1) 14.2(11.0) (6.37) 28.6(22.2) 29.1(21,7) 7.91(5.18) (3.91) 15.9(10.3) 16.8(10.2) 5.41(2.98) (2.50) 10.6(5.58) 11.7(5.71) 4.10(1.96) (1.86) 7.93(3.54) 8.94(3.68) 3.31(1.44) (1.74) 6.26(2.47) 7.41(2.59) 2.77(1.08) (1.16) 5.25(1.94) 6.40(1.97) 2.41(0.88) (0.93) 4.57(1.57) 5.71(1.58) 2.17(0.76) NOTE: Standard deviations are in parentheses 17

18 Table 3: OC performance comparison for multivariate t data Dimension Shift Size KNN-ECUSUM MEWMA MSEWMA CUSUM p δ λ = 0.2 λ = (603) 600(606) 600(587) 600(603) (146) 578(568) 342(339) 135(132) (41.4) 479(480) 115(108) 41.3(36.7) (14.7) 348(350) 44.7(37.4) 17.7(13.7) (7.33) 228(228) 23.1(16.6) 10.0(6.63) (4.10) 133(128) 15.1(8.74) 6.89(3.75) (2.86) 73.3(65.4) 11.1(5.32) 5.31(2.54) (2.04) 40.0(32.3) 9.05(3.86) 4.43(1.90) (1.64) 23.7(16.0) 7.77(2.88) 3.85(1.48) (1.27) 15.5(8.80) 6.94(2.36) 3.42(1.22) (1.10) 11.6(5.55) 6.38(1.95) 3.19(1.04) 0 601(589) 599(603) 600(589) 599(597) (135) 595(602) 420(414) 129(126) (39.6) 564(578) 180(171) 35.5(32.7) (14.4) 492(479) 33.1(25.8) 8.48(5.60) (4.04) 430(348) 19.1(12.5) 5.91(3.29) (2.67) 346(348) 13.1(7.19) 4.54(2.25) (1.94) 282(281) 10.0(4.63) 3.70(1.69) (1.49) 211(205) 8.25(3.35) 3.21(1.38) (1.25) 149(144) 7.10(2.59) 2.89(1.17) (1.10) 99.9(91.0) 6.34(2.09) 2.63(1.01) NOTE: Standard deviations are in parentheses 18

19 Table 4: OC performance comparison for multivariate mixed data Dimension Shift Size KNN-ECUSUM MEWMA MSEWMA CUSUM p δ λ = 0.2 λ = (600) 600(597) 601(583) 600(594) (128) 598(598) 584(583) 124(120) (36.3) 609(615) 565(562) 35.3(32.2) (13.5) 591(573) 528(530) 15.0(11.6) (6.39) 600(593) 516(514) 8.52(5.50) (3.76) 597(605) 471(467) 5.85(3.28) (2.44) 595(591) 455(446) 4.42(2.14) (1.73) 596(604) 427(423) 3.58(1.57) (1.38) 599(591) 414(407) 3.04(1.20) (1.11) 603(593) 382(381) 2.65(0.95) (0.92) 605(605) 377(373) 2.37(0.76) 0 602(594) 599(602) 600(581) 600(590) (128) 609(605) 604(615) 122(122) (37.1) 591(596) 595(589) 33.5(29.5) (13.3) 612(617) 585(580) 14.0(11.2) (6.68) 588(582) 584(583) 7.97(5.09) (3.88) 593(586) 568(564) 5.41(3.00) (2.57) 608(607) 556(546) 4.06(1.92) (1.84) 593(596) 549(559) 3.32(1.41) (1.41) 608(600) 539(546) 2.80(1.10) (1.13) 594(601) 528(529) 2.46(0.86) (0.96) 598(605) 524(525) 2.23(0.70) NOTE: Standard deviations are in parentheses 19

20 very effectively. It is also interesting to compare the results in Tables 2 and 3: when the true distribution of the data changes from multivariate normal to multivariate t, the performance of the MEWMA method deteriorates significantly. This is not surprising since the MEWMA method relies heavily on the assumption of the multivariate normal distribution. Meanwhile, the performances of MSEWMA and our proposed KNN-ECUSUM method are very stable and are not sensitive to the data distribution. From Table 3, our KNN-ECUSUM method has a smaller detection delay than the nonparametric MSEWMA chart in (??) proposed by Zou & Tsung (2011). Another issue that should be pointed out is the effect of data dimension. As the data dimension increases, both the MEWMA and MSEWMA charts require a longer ARL to detect small shifts, while our KNN-ECUSUM method achieves a robust performance. Table 4 reports the simulation results for mixed data. It can be seen that both the MEWMA and MSEWMA charts fail to detect the shift under the mixed distribution. One reason for their ineffectiveness is that the mixed data are not elliptically distributed, which is a fundamental requirement of both MEWMA and MSEWMA methods. Meanwhile, our proposed KNN-ECUSUM method still performs stably, and it is able to give satisfactory results under the mixed data scenario. To summarize, when only one cluster exists in historical OC data, our KNN-ECUSUM chart presents a better detection capability than its two competitors. The chart performs robustly and is adaptive to various distributions. The simulation studies reveal a similar detection efficiency between our method and the optimal CUSUM chart. Therefore, our KNN-ECUSUM method could be considered appealing. 4.2 When multiple clusters exist in historical OC data In this subsection, we conduct additional simulation studies to illustrate the usefulness of our proposed method in the case of multiple OC clusters in the OC training data. In the simulation, we only include the multivariate normal data and t data for simplicity, and the corresponding dimension ranges from p = 6 to p = 40. The MEWMA and MSEWMA schemes are also used for comparison. The simulation settings under multiple OC clusters are similar to the ones under the one OC cluster. The mean and covariance matrix of the IC data are still 0 p 1 and Σ p p ), and for each OC cluster, the shift magnitude is 1 but the 20

21 Table 5: OC performance comparison for the case of multiple clusters Dimenson # of Clusters Historical OC Mean Multivariate Normal Multivariate t p C KNN-ECUSUM MEWMA MSEWMA KNN-ECUSUM MEWMA MSEWMA [1,0,0,0,0,0] 12.7(7.68) 11.9(6.94) 13.2(7.09) 13.6(7.93) 136(130) 15.0(8.90) 6 2 [0,1,0,0,0,0] 9.89(5.57) 12.0(6.97) 13.4(7.19) 12.2(7.25) 132(124) 15.3(9.10) [1,0,0,0,0,0] 12.6(7.31) 11.9(6.94) 13.2(7.09) 13.7(7.65) 136(130) 15.0(8.90) 6 2 [0,1.5,0,0,0,0] 4.43(2.23) 5.74(2.37) 7.47(2.55) 5.61(2.93) 30.5(22.8) 8.45(3.38) [1,0,0,0,0,0] 19.9(13.4) 11.9(6.94) 13.2(7.09) 17.6(10.9) 136(130) 15.0(8.90) 6 3 [0,1,0,0,0,0] 14.5(8.76) 12.0(6.97) 13.4(7.19) 17.7(10.6) 132(124) 15.3(9.10) [0,0,1,0,0,0] 16.4(10.5) 11.9(6.99) 13.5(7.32) 17.9(10.7) 133(127) 15.4(9.08) 1 in the 1st com 12.8(7.84) 15.9(10.3) 16.8(10.2) 10.7(6.30) 420(424) 19.1(12.3) in the 2nd com 11.6(6.83) 16.0(10.3) 17.4(10.7) 12.7(7.70) 419(413) 19.7(13.1) 1 in the 1st com 11.3(6.66) 15.9(10.3) 16.8(10.2) 11.3(6.67) 420(424) 19.1(12.3) in the 2nd com 4.88(2.66) 6.96(3.00) 8.17(3.11) 4.88(2.66) 248(243) 9.20(4.03) 1.5 in the 1st com 10.1(6.11) 8.79(4.07) 9.82(4.18) 10.1(6.11) 405(408) 11.1(5.42) in the 2nd com 8.57(4.84) 8.73(4.00) 10.2(4.42) 8.57(4.84) 408(409) 11.1(5.47) 1.5 in the 3rd com 7.68(4.19) 8.77(4.02) 9.97(4.26) 7.68(4.19) 407(411) 11.2(5.44) NOTE: Standard deviations are in parentheses shift occurs in various dimensions in different clusters. For example, when the number of OC clusters C = 2 and data dimension p = 2, the mean of two OC clusters are [1, 0] and [0, 1]. The estimated densities of the KNN classifier s output are similar to those in Figures 1-3, and thus are omitted here for simplicity. The simulation results are shown in Table 5. As suggested in Section 3.4, the number of clusters is chosen to be small, and we consider C = 2, 3 here. The results in Table 5 demonstrate the detection competence of our method under the multi-cluster scenario. Although degrading a little as C increases, the performance of our chart is still desirable. All charts deliver similar performance levels under the normal distribution. The advantages of our chart become significant under the t data scenario. Also, the performances of both the MEWMA and MSEWMA charts are still influenced by data dimension, while our method appear to be more robust. Combining the results in this and the last subsection, we conclude that our method is quite a powerful tool for monitoring multivariate data, and it is more suitable for practice use than the alternative methods. 21

22 4.3 Sample size analysis When implementing the control chart, a practical issue is the sample size. As mentioned before, the sample size can influence the choice of the number k of neighbors, and the accuracy of the estimated density. Although the number of required samples depends on multiple factors such as data dimension and the true data distribution, it is still necessary to investigate how large a sample size is generally appropriate and how the estimated densities changes as the sample size increases. In this section, simulation studies are performed to demonstrate the effect of the sample size. As the KNN classifier delivers a probability estimator that the test sample is of a particular status, let us first consider how to achieve the theoretical true probablity. Denoting the true density functions of the IC and OC samples by f ic and f oc, according to the conditional probability formulation, the true probability that sample x is of a particular status can be written as P(x) = P(x IC x IC or OC) = Then the density of the above probablity f( ) can be derived as f ic (x) f ic (x) + f oc (x). (9) f(t) = and we call f( ) the true probability density. dp(p(x) t) dt (10) In the simulation, both the IC and OC samples are assumed to be normally distributed (p = 6) with mean µ 0 = 0, µ 1 = [1, 1, 1, 0, 0, 0], and the covariance matrix equals the identity matrix. Following equations?? and??, the true density is obtained after simple calculations (provided in the appendix) as f(t) = 1 π exp( 1 6 (3 2 log t 1 1 t )2 )( 1 t + 1 ). (11) t As we want to see how the estimated densities approximate the true densities, various sample sizes are tested and the results are shown in Figures 4 and 5. In the figures, numtrain and num-test denote the sample size for training the KNN model and the number of samples for calculating the densities, respectively. We find that the estimated density 22

23 approximates the true one better with more training samples. Comparing the three plots in Figure 4, it is evident that fewer test samples may lead to an unstable density, and the density curve becomes smoother as the number of test samples increases. Combining the results from all the plots, we see that at least 1000 training samples and test data points are required when the distribution is normal. Another interesting finding from Figures 4 and 5 is the choice of k. Recall that we suggested k = α n before. This general suggestion is validated by the results in these two figures. We can see that when the training sample size is equal to 1000, the density curve with k = 50 is the closest one to the true curve. Also, as the sample size increases to 4000, the performance levels of the k = 50 and k = 100 curves are quite similar. Futhermore, when the training sample size is not large enough, it is inappropriate to choose a large k, such as k = 500. Thus, the selection of k relies heavily on the training sample size. In summary, we can see that the sample size plays an important role in reaching an accurate empirical density. Since our method focuses on monitoring the one-dimensional output, the accuracy of its empirical density is essential to the performance of our control chart. Therefore, the accuracy of the estimated density must be ensured when implementing our method. 5 A Real Data Application In this section, we will use a real data example to demonstrate the usefulness of our proposed method by comparing it with other existing methods. Here we use the traditional MEWMA method and nonparametric MSEWMA chart for comparisons. The ARL values are obtained from 5,000 replicated simulations. Let us revisit the application of HDDMS. The HDDMS system records various attributes and provides early warnings of disks failure. The provided datasets (the IC and OC datasets) consist of the 14 attributes observed from IC disks and 270 OC disks that were continuously monitored once every hour for one week. Since the values of some attributes are constant or change very little in some disks, we select their average values to denote the working status (we also tried the full dataset in our experiment, and the results are similar). After the necessary data preprocessing, we delete four useless attributes. 23

24 This leads to a final dataset with p = 10 attributes. To reveal the nature of unknown distribution of the collected data, we pick three attributes (the 2nd, 3rd and 12th) and 200 random samples from the IC dataset for demonstration. The 200 samples are plotted in Figures 6 (a)-(c), while their normal Q-Q plots are shown in Figures 6 (d)-(f). It can be seen that the attributes do not follow the normal distribution, or any other commonly used distributions. Thus, our method is suitable and can be used in this scenario. In the Phase I procedure, although the IC and OC datasets are provided separately, we still need to cluster the OC dataset to find out possible shifts. In this study, the traditional K-Means clustering algorithm is applied. We find two main shifts in the OC dataset. The 270 disks are divided into two groups, containing 185 and 85 disks. Since more than one OC pattern exists in the historical data, we apply the multi-cluster strategy and build the KNN-ECUSUM chart in Phase II monitoring. After clustering the historical OC data, the KNN classifier can be constructed easily following Step 1 in Table 1. The number of training samples in each cluster is selected to be 50, and the number of nearest neighbors k equals 15. In Step 2, the empirical densities can be computed quickly using the provided OC dataset when the KNN classifier is provided. The result of the estimated category variable density is plotted in Figure 7. The conclusion that the empirical density of the IC cluster gradually increases while that of the OC cluster decreases still holds. Particularly, for the second OC cluster, its empirical density values are all 0 when the KNN output is larger than 2, which shows that the shift size of this 15 cluster is quite large compared to that of the first OC cluster. In Step 3, the KNN-ECUSUM chart can be implemented for online monitoring. We fix the IC ARL to 1000, and the control limit of our chart is found to be We compare our chart with two alternatives through a steady-state simulation, where each time 30 IC samples are selected before the change point. The results of this performance comparison are shown in Table 6. outperforms the competitors significantly. We can see that in detecting the first OC cluster, our method For the second OC cluster, the performance levels of our method and the MEWMA chart are close, and they both work slightly better than the MSEWMA chart. Thus, the superiority of the proposed method is proved. 24

A new multivariate CUSUM chart using principal components with a revision of Crosier's chart

Title A new multivariate CUSUM chart using principal components with a revision of Crosier's chart Author(s) Chen, J; YANG, H; Yao, JJ Citation Communications in Statistics: Simulation and Computation,