A Multi-Step, Cluster-based Multivariate Chart for Retrospective Monitoring of Individuals


A Multi-Step, Cluster-based Multivariate Chart for Retrospective Monitoring of Individuals

J. Marcus Jobe and Michael Pokojovy

April 29, 2009

J. Marcus Jobe is Professor, Decision Sciences and Management Information Systems Department, Miami University, Oxford, Ohio 45056. Michael Pokojovy is a Mathematics Research Fellow, Fachbereich Mathematik und Statistik, Universität Konstanz, Konstanz, Germany (michael.pokojovy@uni-konstanz.de).

Abstract

The presence of several outliers in an individuals retrospective multivariate control chart distorts both the sample mean vector and covariance matrix, making the classical Hotelling's $T^2$ approach unreliable for outlier detection. To overcome the distortion, or masking, we propose a computer-intensive multi-step cluster-based method. Compared to classical and robust estimation procedures, simulation studies show that our method is usually better, and sometimes much better, at detecting randomly occurring outliers as well as outliers arising from shifts in the process location. Additional comparisons based on real data are given.

Key Words: Breakdown point; Cluster analysis; Kernel estimation; Mahalanobis distance; Moving average and medoid; Outlier.

Introduction

Individuals retrospective multivariate control charts are constructed to determine whether, in a multivariate sense, the already obtained, sequentially ordered data points $X = \{x_i\}_{i=1,\dots,n} \subset \mathbb{R}^d$ are stable, i.e., free of outliers, upsets or shifts. Some refer to this kind of analysis as a retrospective Phase I analysis of a historical data set (HDS). The sample mean vector and covariance matrix are estimated from $X$. Based on these estimates, Hotelling's $T^2$ chart is constructed and used to flag outliers, upsets or shifts in the process (Mason and Young 2002; Fuchs and Kenett 1998). For $X$, Hotelling's $T^2$ statistic at time $i$ is

$$T_i^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}), \qquad (1)$$

where $\bar{x}$ and $S$ are the usual sample mean vector and covariance matrix based on $X$. If $T_i^2$ exceeds the upper control limit (UCL) given by

$$\mathrm{UCL} = \frac{(n-1)^2}{n}\, B\!\left(\alpha;\, \frac{d}{2},\, \frac{n-d-1}{2}\right), \qquad (2)$$

then $x_i$ is determined to be an outlier and a special cause is assumed to have occurred at time $i$ or before. Equation (2) assumes the $x_i$'s are independent and come from a multivariate normal distribution. However, the $T_i^2$'s are correlated (Mason and Young 2002) because they each depend on the same $\bar{x}$ and $S$. Mahmoud and Woodall (2004) and Williams et al. (2006) stated that, because of the correlation, the approximate overall probability of a false alarm that performs well for a control limit like that given in equation (2) is $\alpha_{\text{overall}} = 1 - (1-\alpha)^n$. Hence, the $\alpha$ in equation (2) becomes $\alpha = 1 - (1 - \alpha_{\text{overall}})^{1/n}$, and $B(\alpha; d/2, (n-d-1)/2)$ is the $(1-\alpha)$ quantile of the beta distribution with parameters $d/2$ and $(n-d-1)/2$ (Tracy, Young and Mason 1992; Wierda 1994). $T_i^2$ is the squared Mahalanobis distance from the $i$-th data point $x_i$ to the center of $X$ described by the sample mean. However, if multiple individuals or clusters of data points are separated from a main group, the sample mean vector $\bar{x}$, thought to represent the data center, will likely be pulled away from the middle of the larger group of points. Likewise, the sample covariance matrix will be distorted, $T_i^2$ will misrepresent the squared Mahalanobis

distance from the center of the group to a data point in question, and the UCL given in (2) will not be effective at outlier detection. These effects of outliers or groups of outliers on the sample mean and covariance matrix are typically referred to as masking effects. One natural approach to overcome these effects is to substitute into equation (1) estimators of the mean vector and covariance matrix which are not affected by outliers or groups of outliers (robust estimators). The resulting $T_i^2$, together with a UCL determined via simulation, could be used to effectively identify outliers or sustained shifts in the mean vector. Vargas (2003) and Jensen et al. (2007) evaluated the performance of several different retrospective multivariate control charts constructed in such a fashion. Each of the charts was based on a $T^2$ statistic calculated using a selected combination of robust mean vector and covariance matrix estimators. Most of the robust estimators Vargas (2003) and Jensen et al. (2007) considered do not take time into account, which is a shortcoming when it comes to detecting certain outlier configurations such as sustained shifts in the mean vector. The robust methods proposed by Rousseeuw (1984), Rousseeuw and Leroy (1987), Rousseeuw and van Zomeren (1990) and Rousseeuw and van Driessen (1999) were considered by Vargas (2003) and Jensen et al. (2007). In particular, Vargas (2003) selected robust estimators of the mean and covariance matrix referred to as the minimum volume ellipsoid estimators (MVE), the minimum covariance determinant estimators (MCD), estimators generated using a trimming approach and two alternatives developed by Sullivan and Woodall (1996). Jensen et al. (2007) focused on the MVE and MCD methodologies. A thorough discussion of the five methodologies selected by Vargas (2003) is given in the next section. Important limitations are noted when each is used for 1) detection of an important shift in the mean vector and 2) outlier detection in the context of individuals retrospective multivariate process data. The notion of a breakdown point is outlined, and an overview of our proposed approach for detecting an important shift in the mean vector and outliers from individuals multivariate data occurring over time is given. Next, three methodologies needed for the construction of our proposed cluster-based individuals retrospective multivariate control chart are presented. A synthesis of the three methodologies is then

set forth. Using simulation, we compare the performance of our proposed method to the simulation results determined by Jensen et al. (2007) for the MVE and MCD and by Vargas (2003) for the $T^2$ approach based on MVE, MCD, Hotelling's $T^2$ and the two Sullivan and Woodall robust estimators. An analysis and interpretation of these comparisons are stated as well. Further, comparisons of our method to the aforementioned methods are made based on applications to the same data set analyzed by Vargas (2003).

Limitations of Robust Estimators

Robust estimators derived from the MVE and MCD methodologies are the more prominent candidates for reducing the distortion or masking that occurs in individuals retrospective multivariate control chart applications. For a set $X \subset \mathbb{R}^d$ of $n$ data points, the MVE method attempts to find the subset of $g$ points such that the minimum volume ellipsoid containing those $g$ points has the smallest volume among all ellipsoids containing any other subset of $g$ points from $X$. For the set $X$, the MCD method attempts to find the subset of $g$ points such that the determinant of the estimated covariance matrix derived from that subset of $g$ points is the minimum among all determinants of the covariance matrix derived from any other subset of $g$ points contained in $X$. The MVE and MCD robust estimators then become the corresponding mean and appropriately scaled covariance matrix calculated from the identified subset. Davies (1987), Rousseeuw and Leroy (1987) and Lopuhaä and Rousseeuw (1991) showed that $g = g^* = [(n + d + 1)/2]$ should be used to obtain the largest breakdown point for both the MVE and MCD estimators. Note the Gauss bracket $[\,\cdot\,]$ applied to $x \in \mathbb{R}$ denotes the greatest integer not exceeding $x$. Simply put, a breakdown point is the proportion $(n - g)/n$ of points such that if the number of outliers in the sample exceeds $n - g$, the estimators can be severely distorted. For any sample of interest, there is a subtle presupposition associated with the MVE and MCD estimators. That presupposition is that there exists a baseline group of good points which exceeds 50% of the sample and one or several groups of bad points which together are less than 50% of the sample. This assumption is not necessarily true for multivariate processes occurring in

time. In fact, there may occur some baseline high density, consistent group of points $B$ which is not necessarily 50% or more of the sample and other, potentially multiple, groups of points which are shifted away from $B$, each with membership percentage and density less than that of $B$. These separate groups may occur in close proximity to $B$ both in time and position, only in time, only in position scattered across the set of $n$ time periods, perhaps in some sort of trend across time, or far away scattered across time. The goal of the retrospective individuals multivariate control chart is to detect whether some type of shift, upset or change has occurred away from the baseline set $B$ of points. Since the MVE and MCD estimators determine the size $g$ of set $B$ without knowledge of the sequentially occurring multivariate data points $X$, the resulting MVE and MCD estimators can be very poor. Additionally, no provision is made for time with the MVE and MCD estimators. Robust estimators derived from a trimming algorithm have been proposed by Rousseeuw and Leroy (1987). These estimators have a breakdown point of $\gamma = 1/(d+1)$. When considered for use in a retrospective individuals multivariate control chart application, the downsides of the robust estimators provided by trimming are the same as for the MVE and MCD estimators, along with a rather small breakdown point $\gamma$. Sullivan and Woodall (1996) proposed looking at the differences between successive data points. Based on the successive differences, an estimator of the covariance matrix, together with the usual sample average based on $X$, was suggested. Using the sample average from all data points has an obvious downside. Another downside has to do with the use of the $n-1$ successive differences from the sequence $X$: the estimated covariance matrix is adversely affected by randomly occurring multiple outliers, which likely produce many large successive differences. Sullivan and Woodall (1996) proposed another method for obtaining robust estimators of the mean and covariance matrix. This method is iterative and has the same downsides as the trimming method of Rousseeuw and Leroy (1987). In summary, the suggested robust estimators for the mean vector and covariance matrix based on sample sizes $g < n$ are hampered by how the value of $g$ is determined and by the lack of attention given to the time ordering of the data. Rousseeuw and Leroy (1987, p. 263)

stated: "Both the MVE and MCD are very drastic, because they are intended to safeguard against up to 50% of outliers. If one is certain that the fraction of outliers is at most $\gamma$ ($0 < \gamma \le 0.5$), then one can work with estimators MVE($\gamma$) and MCD($\gamma$) obtained by replacing $g$ by $k(\gamma) = [n(1-\gamma)]$." The question then becomes: how do we decide $\gamma$ and simultaneously take time into account? The answer to this question is at the heart of our proposed individuals retrospective multivariate control chart scheme for detecting sustained shifts in the mean and outliers in the presence of masking. We suggest letting the data tell us the answer by using a combination of carefully constructed moving averages, nonparametric kernel estimates of multivariate densities, a density-based clustering method and a signal calculation based on a quadratic form determined from an identified bulk $B$. We combine these four tools into a two-step approach. An overview of our proposed method follows, beginning with Step 1. The Step 1 and Step 2 sections which appear later detail both steps. The reader should not confuse the two-step characterization of our method with the usual two-stage vocabulary within a Phase I analysis defined by Alt (1985). Our Steps 1 and 2 are the two segments of our proposed algorithm. For our method, we transform each original data vector $x_{\text{orig}} \in X_{\text{orig}}$ using the usual Mahalanobis standardization. Throughout the rest of this paper, $x \in X$ becomes $x = S_{\text{orig}}^{-1/2}(x_{\text{orig}} - \bar{x}_{\text{orig}})$, where $\bar{x}_{\text{orig}}$ and $S_{\text{orig}}$ come from the original untransformed data $X_{\text{orig}}$ (a small code sketch of this standardization appears at the end of this overview). In Step 1 of our outlier detection algorithm, we look for the presence of a sustained shift in the underlying data. Sullivan (2002) labeled as a change point the point where a sustained shift in the mean occurs. Initially, we repeatedly transform the data points using a weighted moving average procedure. The repeated application of moving averages smoothes the scattered data but preserves discontinuities in the time trend. Distances between adjacent observations in the transformed space, scaled by a certain normalization factor, are considered. If the distance between two successive points exceeds a certain threshold, a sustained shift in the trend is assumed to have happened. Moving averages lying between each two sequential jumps are then assigned to the same group. By averaging the data lying in each group, $l$ group centers $c_1, \dots, c_l$ are obtained. Each original data point $x_i$ is assigned to the center $c_j$ that

has the minimal Euclidean distance to $x_i$ among all centers. The group of points associated with the center $c_j$ is denoted by $X_j$. This procedure helps to detect whether a sustained shift in the process location has occurred based on the underlying data. Further, this methodology identifies potential clusters that would not ordinarily be recognized with the typical cluster analysis that ignores the time dimension of occurrence. For example, perhaps three clusters of data exist, the first cluster occurring within the first $n_1$ time periods, the second cluster within the second $n_2$ time periods and the third cluster within the third $n_3$ time periods. If the $n_1$, $n_2$ and $n_3$ time periods are combined, the three clusters together may not be recognizable, but separated according to time, the three clusters are more readily identified. In Step 2, we consider the largest group of points $X_{j^*}$ among all groups obtained in Step 1. A sequence of substeps applied to $X_{j^*}$ produces a subset of points upon which a preliminary sample covariance matrix is calculated. Using this matrix, a nonparametric kernel estimate of the multivariate density function for all points $x_i$ in $X_{j^*}$ is determined. The data contained in $X_{j^*}$ are then clustered using the nonparametric clustering by mode identification introduced by Li, Ray and Lindsay (2007). Their clustering requires a nonparametric estimate of the multivariate density, a modal expectation maximization algorithm and a mode association methodology. The biggest cluster $C$ of points is identified. Depending on the dimension and size of $C$, a set of points having large values of a density symmetry measure we propose is peeled. The remaining points in $C$ become the bulk $B$. The usual mean and covariance matrix are computed from $B$. A quadratic form based on these two estimators produces the outlier detection signal. We note that Rocke and Woodruff (2001) proposed the use of cluster analysis for detecting certain outlier configurations which have been shown to be problematic for the MVE and MCD estimators. Harnish et al. (2009) developed a time-based clustering method. Others such as Coleman and Woodruff (2000) and Coleman et al. (1999) recommended a combined approach to outlier detection that includes clustering. The fundamental components of our computer-intensive, two-step, cluster-based control chart are presented in the following three sections.
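Before detailing the components, it may help to fix the baseline computationally. The following is a minimal Python sketch of the classical chart of equations (1)-(2) and of the Mahalanobis standardization just described. The naming is ours (the authors' implementation was in Matlab), and only numpy and scipy are assumed:

```python
import numpy as np
from scipy.stats import beta
from scipy.linalg import sqrtm

def hotelling_t2(X):
    """Equation (1): T2_i = (x_i - xbar)' S^{-1} (x_i - xbar) for each row of X (n x d)."""
    xbar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    D = X - xbar
    return np.einsum('ij,jk,ik->i', D, S_inv, D)

def t2_ucl(n, d, alpha_overall=0.05):
    """Equation (2) with the per-observation alpha = 1 - (1 - alpha_overall)^(1/n)."""
    alpha = 1.0 - (1.0 - alpha_overall) ** (1.0 / n)
    return (n - 1) ** 2 / n * beta.ppf(1.0 - alpha, d / 2.0, (n - d - 1) / 2.0)

def mahalanobis_standardize(X_orig):
    """x = S_orig^{-1/2} (x_orig - xbar_orig), applied row-wise."""
    root = np.real(sqrtm(np.cov(X_orig, rowvar=False)))   # symmetric square root of S_orig
    return (X_orig - X_orig.mean(axis=0)) @ np.linalg.inv(root)
```

Flagging `np.where(hotelling_t2(X) > t2_ucl(*X.shape))[0]` then reproduces what we later call the usual method.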

Modified Moving Averages

Consider again the transformed data set $X$. If the sample contains a certain trend (in particular, a discontinuous one), our aim is to reveal it by transforming this set into a new space of the same dimension $d$. A naïve approach would be to take a point $x_i$ and to average it with its neighbors in time. This approach would produce false information if the data tend to be inconsistent, i.e., when neighboring points come from two or more different distributions with substantially different means. Even robust smoothing techniques such as Locally Weighted Scatterplot Smoothing (LOWESS), introduced by Cleveland (1979), often produce unsatisfactory results when applied to data with a discontinuous trend. For this reason, we introduce the notion of a modified moving average that combines the advantages of moving averages and medoids. Let $s \ge 1$ be an arbitrary integer denoting the spread of a neighborhood over a certain point. We define a cyclical numeration $\varphi_n : \mathbb{Z} \to \{1,\dots,n\}$ mapping the integers onto $\{1,\dots,n\}$ by means of

$$\varphi_n(i) = 1 + ((i-1) \bmod n). \qquad (3)$$

$\varphi_n$ is thus a periodic function with period $n$ since $\varphi_n(i + kn) = \varphi_n(i)$ for $i, k \in \mathbb{Z}$. For each point $x_i$ we take its $s$ neighbors to the left and to the right with respect to the cyclical indexing $\varphi_n$. The $s$-neighborhood $N_i$ of $x_i$ is thus given by $N_i = \{x_{\varphi_n(i-s)}, \dots, x_{\varphi_n(i+s)}\}$. Then, we find a medoid of $N_i$, i.e., a point $m_i \in N_i$ that minimizes the function $\delta_i(x) = \sum_{j=-s}^{s} d(x, x_{\varphi_n(i+j)})$ over $N_i$. Note that $d(\cdot,\cdot)$ denotes here the standard Euclidean distance in $\mathbb{R}^d$ and that, for negative arguments, $\varphi_n$ wraps around cyclically; for example, with $n = 30$, $\varphi_{30}(-3)$ becomes 27. The point $m_i$ is on average the closest to all other points in $N_i$. We take the $s+1$ points $x_{j_{i,1}}, \dots, x_{j_{i,s+1}}$ from $N_i$ that lie closest to $m_i$. Statistically speaking, these points are likely to come from the same population. The moving average $\tilde{x}_i$ is now given by $\tilde{x}_i = \frac{1}{s+1}(x_{j_{i,1}} + \dots + x_{j_{i,s+1}})$. Altogether, we have defined a mapping $T : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$, $\{x_i\}_{i=1,\dots,n} \mapsto \{\tilde{x}_i\}_{i=1,\dots,n}$. Figures 1 and 2 give a comparison of our modified moving average algorithm and LOWESS; a code sketch of one pass of $T$ follows.
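A sketch of one pass of the map $T$ in Python makes the construction concrete (names ours; only numpy is assumed):

```python
import numpy as np

def modified_moving_average(X, s):
    """One application of T: for each time index i, find the medoid of the
    cyclical s-neighborhood N_i and average the s+1 points of N_i closest to it."""
    n = X.shape[0]
    out = np.empty_like(X, dtype=float)
    for i in range(n):
        idx = np.arange(i - s, i + s + 1) % n            # cyclical indexing (eq. (3))
        N = X[idx]
        D = np.linalg.norm(N[:, None, :] - N[None, :, :], axis=2)
        medoid = np.argmin(D.sum(axis=1))                # minimizes delta_i over N_i
        closest = np.argsort(D[medoid])[: s + 1]         # includes the medoid itself
        out[i] = N[closest].mean(axis=0)
    return out
```

Iterating this map $m$ times gives the repeated smoothing used in Step 1 below.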

[Figure 1: Smoothing applied to $\{x_i\}_{i=1,\dots,30}$; panel (a) shows the original sample with the modified moving averages, panel (b) the original sample with LOWESS, each plotted against the time period $i$.]

Two univariate ordered samples $\{x_i\}_{i=1,\dots,30}$ and $\{w_i\}_{i=1,\dots,30}$ were generated, each of size $n = 30$. The $x_i$'s were taken from the normal distribution $N(0, 1)$ for $i = 1,\dots,30$ (see Figure 1) and the $w_i$'s from the distribution $N(\mu_i, 1)$ with $\mu_i = 0$ for $i = 1,\dots,15$ and $\mu_i = 1$ for $i = 16,\dots,30$ (see Figure 2). The modified moving average algorithm with spread $s = 9$, as well as LOWESS with the span specified as 33.33% of the data, were then applied $m = 5$ times to the data (see the section on Choice of Factors and Thresholds later in this paper regarding the selection of $s$ and $m$).

[Figure 2: Smoothing applied to $\{w_i\}_{i=1,\dots,30}$; panel (a) shows the original sample with the modified moving averages, panel (b) the original sample with LOWESS.]

In Figure 1, we see our moving averages are not as sensitive as LOWESS to data volatility. As can be seen in Figure 2, the moving averages defined above effectively distinguish between points from different distributions, whereas the LOWESS estimates do not readily respond to a sustained shift in the data.

Multivariate Density Estimation and Bandwidth Selection

Given a sample $X$ selected independently from some underlying general population with an unknown density function $f(x)$ in $d$-dimensional Euclidean space, the problem is to nonparametrically estimate $f(x)$. We have chosen to develop our multivariate density estimation based on the well-known method of Parzen (1962). Throughout the following density estimation discussion, $h$ will denote the bandwidth or window size. Let the kernel $K : \mathbb{R}^d \to \mathbb{R}$ be a function satisfying certain regularity and moment properties. The estimator of $f$ at $x$ is then given by

$$\hat{f}_h(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right). \qquad (4)$$

The smoothing factor $h > 0$ is typically referred to as the bandwidth or window size. If $h$ depends only on the space dimension $d$ and the sample size $n$, i.e., $h = h(d, n)$, it is called a global bandwidth. If it depends on $x$, $x_i$ and $\{x_j\}_{j=1,\dots,n}$, i.e., $h = h(x, x_i, \{x_j\}_{j=1,\dots,n})$, it is referred to as a variable bandwidth. In the most general case, the bandwidth (usually written as $H$) can be a nonsingular $d \times d$ matrix. The estimator is then defined as

$$\hat{f}_H(x) = \frac{1}{n \det H} \sum_{i=1}^{n} K\!\left(H^{-1}(x - x_i)\right). \qquad (5)$$

The quality of the estimator thus depends on the choice of the kernel $K$ and the bandwidth $h$. Scott (1992) stated that the quality of a density estimate is widely recognized to be determined primarily by the choice of a smoothing parameter, and only in a minor way by the choice of a kernel. In the present paper, the normal probability density function $\varphi(x) = (2\pi)^{-d/2} \exp(-\|x\|^2/2)$ is chosen as the kernel since $\varphi$ has nice regularity properties and produces smooth estimators $\hat{f}_h$. Here $\|x\| = \sqrt{x'x}$ is the Euclidean norm on $\mathbb{R}^d$.
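A direct sketch of the estimator (5) with the normal kernel $\varphi$, assuming numpy (names ours):

```python
import numpy as np

def kernel_density(x, X, H):
    """Equation (5): f_H(x) = (n det H)^{-1} * sum_i phi(H^{-1}(x - x_i))
    with the standard normal kernel phi on R^d; x is a single point, X is n x d."""
    n, d = X.shape
    U = (x - X) @ np.linalg.inv(H).T                     # rows are H^{-1}(x - x_i)
    phi = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.einsum('ij,ij->i', U, U))
    return phi.sum() / (n * np.linalg.det(H))
```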

For a given $K$, one way to measure the quality of estimation at a point $x$ is to use the mean square error $\mathrm{MSE}(x) = E\{\hat{f}_h(x) - f(x)\}^2$. The overall error is described by the mean integrated square error $\mathrm{MISE} = \int_{\mathbb{R}^d} \mathrm{MSE}(x)\, dx$. Härdle and Müller (1997) expressed the asymptotic mean integrated square error (AMISE) as an estimate of the MISE for the limit case $h \to 0$ for sufficiently smooth $f$ and $H = hH_0$, where $H_0$ is a fixed matrix with $\det H_0 = 1$. Over the past few decades, the problem of optimal bandwidth selection has been extensively studied in the statistical literature. Many automatic, data-driven bandwidth selection methods (Härdle and Müller 1997) have been proposed. The most popular are the plug-in methods (especially the normal reference rule-of-thumb and Scott's rule), cross-validation techniques as well as adaptive methods such as the k-nearest-neighbors bandwidth. Whereas general plug-in methods are not widely used in multidimensional settings (Section 1.7 in Li and Racine (2007)), the normal reference rule-of-thumb suggesting

$$H = \left(\frac{4}{d+2}\right)^{\frac{1}{d+4}} n^{-\frac{1}{d+4}}\, \Sigma^{1/2} \qquad (6)$$

is often preferred by practitioners. $\Sigma$ is the covariance matrix of the underlying general population. The bandwidth determined by equation (6) is appealing for our purposes because it minimizes the AMISE when the underlying distribution is normal. In the present work, a simple robust bandwidth determined by a modified normal reference rule-of-thumb is proposed. Our bandwidth estimator is obtained by plugging a preliminary robust covariance estimator $S$ into equation (6).

Modal Expectation Maximization and Mode Association Clustering

The Modal EM (MEM) algorithm solves a local maximization problem for a mixture density by ascending iterations starting from any initial point. This procedure was introduced by Li, Ray and Lindsay (2007) as a modification of the well-known EM algorithm of Dempster, Laird and Rubin (1977). Though the MEM algorithm is based on expectation and maximization steps similar to the EM, the aim of MEM is to find local maxima, i.e., modes, of a

given density function. For each point $x \in X$, MEM determines an ascending path to a local maximum. All points in $X$ whose paths end at the same maximum are assigned to the same cluster. This assignment is referred to by Li, Ray and Lindsay (2007) as Modal Association Clustering (MAC). See the appendix for details of the MEM/MAC iterative algorithm. We use MAC and MEM in Step 2 of our outlier detection scheme to identify the largest cluster $C \subseteq X_{j^*}$ from Step 1. A peeling algorithm (see the Step 2 section below) is then applied to $C$, which helps to ensure that the bulk $B$ has a symmetric probability distribution function. In the following two sections, our new individuals retrospective multivariate control chart methodology is presented. We continue to assume the sample $X = \{x_i\}_{i=1,\dots,n} \subset \mathbb{R}^d$ is standardized according to Mahalanobis as described earlier. Standardizing the input data ensures that the detection signals are invariant under any nonsingular affine linear transformation.

Step 1

Let $s = [\min(\sqrt{dn}, n/2)]$ and $m = [\sqrt{d}\, \log_2 n]$. For details about the selection of $s$ and $m$, see the Choice of Factors and Thresholds section.

(a) Identify the $n$ neighboring groups $N_i$, each containing $2s + 1$ points. Find the medoid in each of the $n$ groups. See the Modified Moving Averages section for details.

(b) Find the $s + 1$ $x_i$'s in each group which are closest to the medoid of that group.

(c) Denote by $y_i$ the average of each such set of $s + 1$ $x_i$'s. Each of these is a suitably weighted moving average of the neighborhood of points $N_i$, where the weights are either $0$ or $1/(s+1)$.

(d) Iteratively repeat steps (a)-(c) $m - 1$ additional times.

(e) Find the Euclidean distances $\mu_i = d(y_{\varphi_n(i)}, y_{\varphi_n(i+1)})$ between the resulting adjacent moving averages. Find the Euclidean distances $\nu_i = d(x_{\varphi_n(i)}, x_{\varphi_n(i+1)})$ between the ordered $x_i \in X$. The jump at position $i$ is then given by $\tau_i = \mu_i / \bar{\nu}$, where $\bar{\nu}$ is the median of the $\nu_i$'s. If $\tau_i$ exceeds the threshold $\tau_\theta$ given in Table 2, a jump is assumed to have happened at position $i$.

(f) Let $I = \{i_1, \dots, i_l\}$ contain all jump positions. If $I$ has at least two positions, the sequential unshifted points can be represented as the collection of sets $Y_1, \dots, Y_l$ given by $Y_1 = \{y_{\varphi_n(i_1+1)}, \dots, y_{\varphi_n(i_2)}\}$, $Y_2 = \{y_{\varphi_n(i_2+1)}, \dots, y_{\varphi_n(i_3)}\}$, ..., $Y_{l-1} = \{y_{\varphi_n(i_{l-1}+1)}, \dots, y_{\varphi_n(i_l)}\}$, $Y_l = \{y_{\varphi_n(i_l+1)}, \dots, y_n, y_1, \dots, y_{\varphi_n(i_1)}\}$. Let $c_i$ denote the average over the set $Y_i$. If $I$ has fewer than two positions, then define $c_1 = \frac{1}{n}\sum_{i=1}^{n} y_i$.

(g) Each data vector $x_i$ is assigned to the center $c_j$ closest to it in Euclidean distance. All points allocated to the same center $c_j$ are combined into a group referred to as $X_j$.

(h) The largest group amongst all the $X_j$'s is identified as $X_{j^*}$. If there are several such groups, we choose $X_{j^*}$ to be the group with the smallest $j$.

(i) The group of data points $X_{j^*}$ becomes the input to Step 2. If $X_{j^*}$ contains fewer than $d + 1$ points, we set $X_{j^*} = X$ and $c_{j^*} = \bar{x} = 0$. We note that $X_{j^*}$ contains $n^*$ points and has center $c_{j^*}$.

Step 2

(a) Denote $n^* = n_{j^*}$ and $c^* = c_{j^*}$. The center point $c^*$ is a type of robust mean vector estimate. We trim points around $c^*$. Unlike peeling data values around a sample mean, trimming points having extreme values of an appropriately defined distance from $c^*$ eliminates outliers without exhausting good points. Let $d^2(x) = (x - c^*)' S^{-1} (x - c^*)$ for every $x \in X_{j^*}$, where $S = \frac{1}{n^* - 1} \sum_{i=1}^{n^*} (x_i - c^*)(x_i - c^*)'$. A point $x \in X_{j^*}$ having the maximum value of $d^2(x)$ is identified and the trimmed sample $X_{j^*} \setminus \{x\}$ is considered. This is repeated until $n^* - q$ points from $X_{j^*}$ are peeled, where $q = [(n^* + d + 1)/2]$. Let the remaining set of $q$ points be denoted by $X_1$.

(b) Calculate the usual $\bar{x}_1$ and $S_1$ from $X_1$.

(c) Calculate $S_2 = s_{1,\theta}(d, n)\, (c_{d,.05})^{-1}\, S_1$, where $c_{d,\alpha} = \frac{1}{d} \int_{\{x \in \mathbb{R}^d :\; \|x\|^2 \le \chi^2_{d,\alpha}\}} \|x\|^2\, \varphi(x)\, dx$ (see Table 1) and $s_{1,\theta}(d, n)$ is a small sample correction factor (see the Choice of Factors and Thresholds section for details and Table 3). Note that $\chi^2_{d,\alpha}$ is the $(1-\alpha)$-th quantile of the $\chi^2_d$-distribution and $\varphi$ is the probability density function of $N(0, I_d)$. The correction factor $c_{d,\alpha}$ is used to maintain consistency in the multivariate normal context.

[Table 1: Values of $c_{d,\alpha}$ for $d = 2, 3, 5, 10$ and two values of $\alpha$.]

(d) Since only $q \approx n^*/2$ points are in $X_1$, to increase the number of points upon which to compute a preliminary estimate of $\Sigma$ to be used in equation (6), we find the set $X'$ of all $x \in X_{j^*}$ such that $(x - \bar{x}_1)' S_2^{-1} (x - \bar{x}_1) \le \chi^2_{d,\alpha}$. (To be consistent with our overall error probability of .05, we let $\alpha = .05$.)

(e) Let $\bar{x}$ be determined from $X' \cup X_1$ and $\alpha$ as in (d). Thus,

$$S = s_{2,\theta}(d, n)\, (c_{d,\alpha})^{-1} \frac{1}{|X' \cup X_1| - 1} \sum_{x \in X' \cup X_1} (x - \bar{x})(x - \bar{x})', \qquad (7)$$

where $s_{2,\theta}$ is a small sample correction factor listed in Table 3.

(f) Plug $S$ from (e) into equation (6), giving

$$H = \left(\frac{4}{d+2}\right)^{\frac{1}{d+4}} (n^*)^{-\frac{1}{d+4}}\, S^{1/2}. \qquad (8)$$

Since $S$ is a preliminary robust estimator of $\Sigma$, our chosen bandwidth is more resistant to outliers and performs well for contaminated samples.

(g) Estimate the multivariate probability density via $\hat{f}_H$ given by equation (5), using $H$ from equation (8). Recall that $n^*$ is the number of data points in $X_{j^*}$.

(h) Apply the Mode Association Clustering (MAC) to $\hat{f}_H$. Among the clusters $C_1, \dots, C_r$ determined in $X_{j^*}$, the biggest cluster $C$ and the corresponding mode $u$ are selected. In case of ambiguity, we pick the cluster $C_i$ with the smallest index $i$.

(i) Let the set $C'$ contain the top 25% of points $x \in C$ having the largest $(x - c^*)' S^{-1} (x - c^*)$, where $c^*$ is defined in Step 2(a) and $S$ is defined by equation (7). The 25% value was subjectively chosen.

(j) For every $x \in C'$, determine a mirror point with respect to $u$ by $x' = 2u - x$.

(k) Compute the density symmetry measure $\lambda(x) = \frac{\max\{\hat{f}_H(x),\, \hat{f}_H(x')\}}{\min\{\hat{f}_H(x),\, \hat{f}_H(x')\}}$. This helps to filter out skewness which may dilute the effectiveness of our method.

(l) All $x \in C'$ with $\lambda(x) > s_{3,\theta}$ (see Table 3 for $s_{3,\theta}$) are assigned to $C''$.

(m) Define

$$\bar{x}_{\text{robust}} = \frac{1}{|C \setminus C''|} \sum_{x \in C \setminus C''} x, \qquad S_{\text{robust}} = \frac{1}{|C \setminus C''| - 1} \sum_{x \in C \setminus C''} (x - \bar{x}_{\text{robust}})(x - \bar{x}_{\text{robust}})'. \qquad (9)$$

(n) The detection signal becomes

$$T_i^2 = (x_i - \bar{x}_{\text{robust}})' S_{\text{robust}}^{-1} (x_i - \bar{x}_{\text{robust}}), \qquad (10)$$

where $C \setminus C''$ is the bulk $B$, and $\bar{x}_{\text{robust}}$ and $S_{\text{robust}}$ are from equation (9).

(o) Control limits for selected $n$, $d$ and $\alpha_{\text{overall}} = .05$ are given in Table 4.

Choice of Factors and Thresholds

In this section, we discuss the choice of $\tau_\theta$, $s_{1,\theta}$, $s_{2,\theta}$ and $s_{3,\theta}$ necessary for our two-step approach. The jump threshold $\tau_\theta$ referred to in the Step 1 section is determined as a certain percentile of the distribution of the maximal jump $\tau_{\max} = \max_{i=1,\dots,n} \tau_i$ from a randomly selected sample of size $n$ from $N(0, I_d)$. The generated sample data are transformed with the Mahalanobis standardization procedure because the sample mean and covariance in general differ from the assumed parameters. Maintaining a 5% overall false detection probability, we selected the 98-th percentile of the maximal jump $\tau_{\max}$ obtained from a simulation of 2000 samples, each of size $n$. Table 2 lists the 98-th percentile for some $d$ and $n$.

[Table 2: 98-th percentile of the maximum jump $\tau_{\max} = \max_{i=1,\dots,n} \tau_i$ for selected $d$ and $n$.]
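The threshold simulation just described can be sketched as follows, reusing `mahalanobis_standardize` and `modified_moving_average` from the earlier sketches; the formulas for $s$ and $m$ are those of the Step 1 section, the number of replications follows the text, and the remaining names are ours:

```python
import numpy as np

def simulate_tau_threshold(n, d, reps=2000, q=0.98, seed=None):
    """98th percentile of tau_max over standardized in-control N(0, I_d) samples."""
    rng = np.random.default_rng(seed)
    s = int(min(np.sqrt(d * n), n / 2))                  # s = [min(sqrt(dn), n/2)]
    m = int(np.sqrt(d) * np.log2(n))                     # m = [sqrt(d) log2 n]
    tau_max = np.empty(reps)
    for r in range(reps):
        X = mahalanobis_standardize(rng.standard_normal((n, d)))
        Y = X
        for _ in range(m):                               # Step 1 (a)-(d): m passes of T
            Y = modified_moving_average(Y, s)
        mu = np.linalg.norm(np.roll(Y, -1, axis=0) - Y, axis=1)   # adjacent smoothed gaps
        nu = np.linalg.norm(np.roll(X, -1, axis=0) - X, axis=1)   # adjacent raw gaps
        tau_max[r] = (mu / np.median(nu)).max()                   # Step 1 (e)
    return np.quantile(tau_max, q)
```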

In the spirit of Rousseeuw and van Zomeren (1990), we assume $s_{1,\theta}$, $s_{2,\theta}$ and $s_{3,\theta}$ to depend on $n/d$, which describes the spatial sparsity of a sample. Based on knowledge of the asymptotic behavior $\lim_{n/d \to \infty} s_{i,\theta} = 1$, along with numerical simulations, we determined a general expression for $s_{1,\theta}$, $s_{2,\theta}$ and $s_{3,\theta}$ (see Table 3). The functions $s_{1,\theta}$ and $s_{2,\theta}$ are the small sample correction factors for covariance matrix estimation. For $n/d \le 5$, our preliminary simulations showed that larger correction factors $s_{1,\theta}$ and $s_{2,\theta}$ were necessary to decrease the volatility of the preliminary covariance matrix estimator. This corresponds to the empirical observations of Rousseeuw and van Zomeren (1990), who stated that the MVE becomes unreliable for $n/d \le 5$. Hence, we constructed $s_{1,\theta}$ and $s_{2,\theta}$ to perform well for most $n/d$ considered. The factor $s_{3,\theta}$ is the symmetry threshold for the estimated empirical density function in Step 2(l).

[Table 3: Small sample correction factors and density symmetry threshold; as printed, $s_{1,\theta}(d, n) = \exp(\dots\, n/d)$, $s_{2,\theta}(d, n) = \exp(\dots\, n/d)(n/d - 1)^{-1}$ and $s_{3,\theta}(d, n) = \exp(.6767\, n/d)(n/d - 1)$.]

Recall that $s$ and $m$ are determined according to $s = [\min(\sqrt{dn}, n/2)]$ and $m = [\sqrt{d}\, \log_2 n]$. We selected $s$ and $m$ to comply with the following properties: $sm \to \infty$ and $sm/n \to 0$ as $n \to \infty$, as required for the asymptotic consistency of the moving average estimator.

Simulation, Analysis and Conclusions

A simulation was initially conducted to determine appropriate control limits for the detection signal $T_i^2$ in equation (10), where the estimated mean vector and covariance matrix were produced by our two-step approach outlined in the Step 1 and Step 2 sections. For selected combinations of $n$ and $d$, 5000 sets of $n$ data values, assumed to have come from an in-control process, were generated by simulation. We simulated an in-control process by generating data from a multivariate normal distribution with zero mean vector and identity covariance matrix. For each of the $j = 1, \dots, 5000$ sets of $n$ data values, $T_j^2 = \max_{i=1,\dots,n} T_{ij}^2$ was recorded, and the 4750-th ranked $T_j^2$ was identified as the control limit with an overall $\alpha = 5\%$ false alarm rate for each $n$ and $d$ combination. Mason and Young (2002) noted this approach was necessary for determining the control limits because of the dependence among the $T_{ij}^2$'s within the

$j$-th set of $n$ data points.

[Table 4: Control limits for our two-step approach with $\alpha_{\text{overall}} = .05$, for selected $d$ and $n$.]

The control limits for our two-step method are displayed in Table 4 as a function of $n$ and $d$, where the overall false alarm rate for each $n$, $d$ combination is $\alpha = 5\%$. We simulated a variety of out-of-control or shifted situations and calculated the detection probability of our combined outlier detection scheme. Jensen et al. (2007) and Vargas (2003) noted that for affine linear equivariant signal computation procedures (see Rousseeuw and van Zomeren (1990)), any out-of-control setting corresponding to a designated shift in the mean with the same covariance structure depends only on the non-centrality parameter $\mathrm{ncp} = (\mu_1 - \mu_0)' \Sigma^{-1} (\mu_1 - \mu_0)$, where $\mu_1$ is the mean vector of the shifted scenario and $\mu_0$ is the in-control mean vector. Hence, shifted or out-of-control events that can distort the usual mean and covariance estimators were simulated by generating data from a multivariate normal with mean $\mu_1$ and identity covariance matrix that would produce a selected ncp of interest. For comparison purposes, we selected the same $n$, $d$, ncp and simulation sizes as Vargas (2003). Additional $n$ and $d$ values for $\mathrm{ncp} = 5, 15$ and $25$ were selected for comparison to the probabilities found by Jensen et al. (2007), whose simulation sizes were much larger. Letting $k$ equal the number of bad points generated from an out-of-control scenario indexed by a selected ncp value, we arranged in random order the $k$ bad points with the $n - k$ good points coming from an in-control process. For example, suppose $n = 30$, $d = 2$, $\mathrm{ncp} = 4$ and $k = 1$. We generated $r = 1500$ sets of $n = 30$ data points. Each of the $r$ sets had 29 bivariate in-control data points and $k = 1$ bivariate data point generated from a bivariate normal distribution with $\mu_1 = (\sqrt{2}, \sqrt{2})'$ and identity covariance matrix. For each of the $j = 1, \dots, r$ sets of $n = 30$ points, the $k = 1$ bad point and the $n - k$ good points were randomly arranged as to order, and 30 $T_{ij}^2$'s were calculated, each based on estimators

determined by our proposed combination approach. If, for some $j$, one or more of the $T_{ij}^2$'s exceeded the corresponding control limit, an out-of-control signal was assigned to that set of $n = 30$ data points. The detection probability for an out-of-control masking scenario indexed by $\mathrm{ncp} = 4$, $k = 1$, $n = 30$ and $d = 2$ was estimated with the proportion of out-of-control signals from the $r$ sets. Jensen et al. (2007) and Vargas (2003) performed simulations based on the usual Hotelling's $T^2$, a $T_i^2$ statistic calculated with MVE estimators and a $T_i^2$ statistic calculated with MCD estimators. In Figure 3(a), outlier detection probabilities are plotted versus ncp for each of the four methods, where $n = 30$, $d = 2$ and $k = 1$. We see that our method is as good as or better than the MVE and MCD methods and often much better for detecting arbitrarily occurring outliers. Further, for $k = 1$, our method is competitive with, but not necessarily better than, the approach based on $T_i^2$ in (1) and the UCL in (2), where all data points are considered. We call this the usual method. Consider now Figures 3(b), 3(c) and 3(d). For $k > 1$, it appears our method of detecting outliers is usually better than each of the three other methods and often much better. An expanded set of outlier detection probabilities for our method and the MVE method (determined by Vargas (2003)) is given in Table 5. Since, for the selected $n$, $d$, $k$ and ncp values, Vargas (2003) recommended the MVE method over the usual, the MCD and the two methods of Sullivan and Woodall for detecting randomly occurring outliers, we only repeat the MVE outlier detection probabilities for comparison purposes. Examination of Figure 3, along with Table 5, suggests our method is almost always better than MVE and often much better for detecting randomly occurring outliers. Hence, it is reasonable to conclude that our method for detecting multiple outliers also surpasses the effectiveness of the other four control chart methods, at least for the $n$, $d$, $k$ and ncp values considered. One exception: for $k = 1$, the usual method does seem to be somewhat better than our proposed method. In addition to the arbitrary occurrence(s) of an upset in a process, we considered the scenario where the individuals multivariate process was consistent at some unknown mean vector $\mu_0$ and then shifted to a different unknown mean vector $\mu_1$ after $\Delta = n/2$ time periods.
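Before turning to that scenario, here is a sketch of how one contaminated sample in the random-outlier setting above can be generated (our reading of the scheme; the equal-coordinate choice of $\mu_1$ is just one convenient mean vector attaining a given ncp under the identity covariance):

```python
import numpy as np

def random_outlier_sample(n, d, k, ncp, seed=None):
    """n - k in-control N(0, I_d) points plus k points shifted to mu_1 with
    mu_1' mu_1 = ncp, randomly arranged in time."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X[:k] += np.sqrt(ncp / d)    # e.g. d = 2, ncp = 4 gives mu_1 = (sqrt(2), sqrt(2))'
    rng.shuffle(X)               # random time order of good and bad points
    return X
```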

[Figure 3: Estimated outlier detection probabilities versus the non-centrality parameter for $k$ outliers in $d = 2$ dimensions; panels (a) $k = 1$, (b) $k = 3$, (c) $k = 5$ and (d) $k = 7$ compare Our method, MVE, MCD and the usual method.]

The outlier detection method by Sullivan and Woodall based on a moving range estimator of the covariance matrix was identified by Vargas (2003) to be the best at detecting a shift of this type. We did a simulation under these same conditions and applied our two-step outlier detection method. The estimated detection probabilities of our method are given in Table 6 along with the performance of Sullivan and Woodall's approach (SW) and MVE as reported by Vargas (2003). For the ncps and shift considered in Table 6, the simulated detection probabilities of our method exceed those based on the MVE and SW for $d = 2$ and $n = 30$.
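The sustained-shift scenario of Table 6 differs from the random-outlier setting only in where the bad points are placed: the shift affects every observation after the change point $\Delta$. A sketch under the same conventions:

```python
import numpy as np

def sustained_shift_sample(n, d, ncp, delta, seed=None):
    """In control for periods 1,...,delta; mean shifted by mu_1 (mu_1' mu_1 = ncp)
    for periods delta+1,...,n (Table 6 uses n = 30, d = 2, delta = 15)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X[delta:] += np.sqrt(ncp / d)    # sustained shift after the change point
    return X
```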

Next, a formal comparison of the simulated probabilities in Figure 3, Tables 5 and 6 is presented.

[Table 5: Estimated outlier detection probabilities for Our method and MVE as reported by Vargas (2003), indexed by ncp and $k$, for $d = 3, 5, 10$ and $n = 30, 50, 100$; non-shaded regions correspond to Our method being superior.]

Let $\hat{p}_{\text{our}}$, $\hat{p}_{\text{MVE}}$ and $\hat{p}_{\text{SW}}$ represent the simulated outlier detection probabilities determined for our method, MVE and Sullivan and Woodall.

[Table 6: Estimated outlier detection probabilities for a sustained shift in the mean after $\Delta = 15$ time periods for $d = 2$ and $n = 30$ (MVE and SW as reported by Vargas (2003)); non-shaded regions correspond to Our method being superior.]

Assuming negligible variation in the simulated control limits, the approximate standard deviation of the difference $\hat{p}_{\text{our}} - \hat{p}_{\text{MVE}}$ can be represented as $s = \sqrt{\frac{1}{1500}\left(\hat{p}_{\text{our}}(1 - \hat{p}_{\text{our}}) + \hat{p}_{\text{MVE}}(1 - \hat{p}_{\text{MVE}})\right)}$. For selected $n$ and $d$, 90% lower one-sided Bonferroni family-wise confidence intervals were computed for the corresponding differences Our$-$MVE or Our$-$SW in Figure 3, Tables 5 and 6. Those intervals with a negative lower bound are shaded in the tables and indicate that the particular difference is not statistically significant. The differences not shaded are statistically significant, indicating that our approach is superior to the compared approach. Inspection of the shading indicates that our method is similar to the others for smaller values of ncp and superior to MVE or SW for most other ncp values when $k > 1$. Independent of the $n$, $d$ combination, the improved power of our method over the others to detect outliers in the presence of extreme masking ($\mathrm{ncp} > 5$) is apparent. Further, for $d = 2$ and $n = 30$, for all levels of masking considered ($\mathrm{ncp} \le 40$), our method is just as good as, and usually better than, the others (except when the usual method is applied and $k = 1$). Vargas (2003) recommended the simultaneous application of two outlier detection approaches (MVE and SW). Thus, to maintain an overall false detection probability of $\alpha = .05$, the actual outlier detection probabilities become less than the individual detection probabilities reported by Vargas (2003) due to the necessary increase of the corresponding control limits. Hence, the performance of our method compared to the combined MVE and SW methods is even better than suggested by the previous assessment. Jensen et al. (2007) performed extensive simulations for $n$, $d$ and $k$ combinations beyond

that of Vargas (2003).

[Figure 4: Probability of a signal for Our, MVE, MCD and the usual estimators, where $n = 50$, $d = 3$ and $k$ equals the number of outliers; panels (a) $k = 4$, (b) $k = 8$, (c) $k = 12$ and (d) $k = 16$.]

The more recent work of Jensen et al. (2007) noted that for $n \ge 50$, MCD tended to be a better method than MVE at detecting multiple outliers for $\mathrm{ncp} = 5, 10, 15, 20, 25$. Further, for $n < 50$, their work determined MVE to be the preferred method for detecting multiple outliers. We have done extensive additional simulations using our method for $\mathrm{ncp} = 5, 15$ and $25$, $d = 2, 3, 5$ and $10$ and $n = 30, 50, 75, 100$ and $125$. For $d > 2$ and all $n$ and ncp, our method proved equal to or better than MCD. For $d = 2$ and all $n$ and ncp, our method was equal to or better than MCD about 66% of the time. See Figure 4 for the relative performance of our method compared to MVE and MCD for $n = 50$, $d = 3$

and $k = 4, 8, 12$ and $16$. Our method outperforms both MVE and MCD in this example. Further, MCD performs better than MVE for larger $k$ and worse than MVE for smaller $k$. Additionally, our method showed an ability, and sometimes a strong ability, to detect shifts of size $\mathrm{ncp} \ge 15$ when the number of outliers $k$ exceeds the breakdown point of MCD and MVE. See Table 7 for our detection probabilities when $k$ exceeds the breakdown point. The control limits we determined for the new simulation showed some slight variation from those listed in Table 4 for $n = 30, 50$ and $100$.

[Table 7: Detection probabilities for Our method for $k$ larger than the breakdown point of MVE and MCD, for $d = 2, 3, 5, 10$ and selected $n$ and ncp; in particular, $k$ equals 50% of the sample size $n$.]

Our simulations were carried out using Matlab 7 (R2006b). Pseudo-random numbers were produced using the ziggurat algorithm implemented in the function mvnrnd of Matlab on a Pentium IV PC (2.6 GHz, 1 Gb RAM). For $d = 2$ and $n = 30$, the maximum average runtime over all ncp- and $k$-scenarios was approximately 0.31 sec. For $d = 10$ and $n = 125$, this number was 8.56 sec. The runtime grows superlinearly in $n$ and linearly in $d$.

Example

For two different data sets previously analyzed by Vargas (2003), we compared the outlier detection effectiveness of our two-step method to the MVE, the usual, the MCD and the two methods of Sullivan and Woodall. The data can be found in Table 8. Originally, the complete data set presented by Quesenberry (2001) had 11 variables, but for illustration purposes Vargas (2003) considered only two of the variables. Our method and all but one of the methods applied by

Vargas (2003) detected the same outlier for the data in Table 8. The MCD method failed to detect any outlier. Figure 5(a) displays the bivariate data and the corresponding outlier detection ellipsoids determined by MVE and by our two-step approach. For comparison purposes, observations 16 and 24 were modified by Vargas (2003) to (.469, 56.23) and (.496, 56.8). Only the MVE method detected them as outliers (the usual, the MCD and both Sullivan and Woodall methods failed to detect the 2 new outliers). Our two-step approach applied to the same altered data correctly detected the 2 new outliers.

[Figure 5: Outlier detection ellipsoids for the two data sets, MVE (dashed line) and our two-step method (solid line); panel (a) original sample, panel (b) altered sample, with the outlying observations annotated.]

[Figure 6: Estimated density for the two data sets; panel (a) original sample, panel (b) altered sample.]

Figure 5(b) gives the outlier detection ellipsoids determined by MVE and by our two-step approach for the altered data. For both data sets, our ellipsoid has a smaller volume (1.985) than that of MVE (2.491), detects the same outliers as MVE and has an inclination that better matches the data. This is consistent with the fact that the MVE does not take time ordering into account and is restricted to a bulk size which does not use information from the data. Figures 6(a) and 6(b) are the estimated bivariate density functions determined by our method for the unaltered and altered data. The slight skewness seen in Figure 6(a) corresponds to the presence of a single outlier in the unaltered data. The secondary mound seen in Figure 6(b) reflects the presence of the 3 identified outliers in the altered data.

[Table 8: Bivariate data set.]

The $\bar{x}_{\text{robust}}$ and $S_{\text{robust}}$ determined by equation (9) were computed for both the original and the altered data. Table 9 lists the detection signals determined by equation (10) for the original and altered samples, together with the corresponding control limit from Table 4. No jumps were found for either data set. Shaded regions in Table 9 correspond to signals (and observation indices) that exceed the control limit.
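Once Steps 1 and 2 have produced the bulk $B$, the signals of Table 9 follow from equations (9) and (10); a sketch, with the bulk supplied as row indices (names ours):

```python
import numpy as np

def detection_signals(X, bulk_idx):
    """Equations (9)-(10): xbar_robust and S_robust from the bulk B = C \\ C'',
    then T2_i for every point of the standardized sample X."""
    B = X[bulk_idx]
    xbar = B.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(B, rowvar=False))
    D = X - xbar
    return np.einsum('ij,jk,ik->i', D, S_inv, D)

# signals = detection_signals(X, bulk_idx)
# outliers = np.where(signals > ucl)[0]   # ucl: simulated control limit (Table 4)
```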

[Table 9: Detection signals for the original sample and the altered sample.]

Summary

We have developed a two-step method of identifying the largest bulk of similar multivariate data from a time-ordered sequence of individual multivariate responses. Using the mean vector and covariance matrix from the data in this bulk, control limits have been developed for selected $n$ and $d$ to determine whether any of the individual multivariate data points in the selected time-ordered sequence of size $n$ suggest the observed multivariate process of interest has shifted from a standard represented by the identified bulk of data. Extensive simulations assuming a multivariate normal distribution have shown that our method is as good as, and usually better than, the MVE and MCD at detecting multiple outliers for $d = 2, 3, 5$ and $10$ and $n = 30, 50, 75, 100$ and $125$. Further, our method shows strong ability at outlier detection even when the number of outliers exceeds the breakdown point for MCD and MVE. If outliers occur systematically, our method performs even better than if the outliers occur sporadically. Finally, the focus of the comparative aspect of our paper has been the detectability of up to 50% sample contamination. Actually, the flexibility of our method permits the detection of multiple shifts of different magnitudes in different directions where the total contamination could exceed 50%.

Appendix

A brief outline of the MEM algorithm according to Li, Ray and Lindsay (2007) is given. Let a mixture density be defined as $f(x) = \sum_{i=1}^{n} \pi_i f_i(x)$ at every point $x \in \mathbb{R}^d$, where $f_i$ is the unimodal density of mixture component $i$ and $\pi_i$ is its a priori probability. Given any initial value $x^{(0)} \in \mathbb{R}^d$, MEM finds a local maximum of the mixture density by alternating the following two steps, thus producing a sequence $\{x^{(r)}\}_{r \ge 0}$:

1. Let $p_i = \dfrac{\pi_i f_i(x^{(r)})}{f(x^{(r)})}$, $i = 1, \dots, n$.

2. Update $x^{(r+1)} = \arg\max_{x \in \mathbb{R}^d} \sum_{i=1}^{n} p_i \log f_i(x)$.

The first step is the expectation step, where the a posteriori probability of each mixture component $i$ at the current point $x^{(r)}$ is computed. The second step is the maximization step. The function $\sum_{i=1}^{n} p_i \log f_i(x)$ has a unique maximum due to the unimodality of the $f_i$. According to Wu (1983), if the $f_i$ are normal densities, all the limit points of $\{x^{(r)}\}_r$ are stationary points of $f$, i.e., $\operatorname{grad} f(x) = 0$ if $x = \lim_{r \to \infty} x^{(r)}$ and $f$ is smooth. It is possible that $\{x^{(r)}\}_r$ converges to a stationary, but not locally maximal, point. A detailed treatment of the convergence of EM-style algorithms can be found in Wu (1983). For the practical use of MEM, it is sufficient to define a termination rule, e.g., stop if $\frac{\|x^{(r+1)} - x^{(r)}\|}{\max\{\|x^{(r+1)}\|,\, 1\}} < \varepsilon$ for a small $\varepsilon > 0$.

We present here a simplification of the nonparametric clustering algorithm of Li, Ray and Lindsay (2007). Due to the selection of $H$ in equation (8), the construction of a hierarchy of clusters by gradually increasing the bandwidth of Gaussian kernels can be omitted. Let $X$ be the set of data to be clustered. A nonparametric density estimator is formed for a nonsingular $H$ according to (5):

$$\hat{f}_H(x) = \frac{1}{n \det H} \sum_{i=1}^{n} \varphi(x \mid x_i, H), \qquad (11)$$

where $\varphi(\cdot \mid \mu, \Sigma)$ is the probability density function of a normal random variable with mean $\mu$ and covariance matrix $\Sigma$, i.e., $\varphi(x \mid x_i, H) = \varphi(H^{-1}(x - x_i))$ for $\varphi(x) = (2\pi)^{-d/2} \exp(-\|x\|^2/2)$. The clustering algorithm reads as follows:

1. Form a kernel density $\hat{f}_H(x)$ as in (11).

2. Use $\hat{f}_H(x)$ as the density function. Use each $x_i$, $i = 1, \dots, n$, as the initial value in the MEM algorithm described earlier in the appendix. Let the mode identified by starting from $x_i$ be $m_H(x_i)$.

3. Extract the distinctive values from the set $\{m_H(x_i) \mid i = 1, \dots, n\}$ to form a set $M$. Label the elements in $M$ from 1 to $|M|$. In practice, due to finite precision, two modes are regarded as identical if they agree up to a small numerical tolerance.
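For the Gaussian kernel mixture (11), the maximization step has a closed form: maximizing $\sum_i p_i \log \varphi(H^{-1}(x - x_i))$ over $x$ gives the weighted mean $\sum_i p_i x_i$, so MEM reduces to a fixed-point iteration of mean-shift type. A sketch (names ours):

```python
import numpy as np

def mem_mode(x0, X, H, eps=1e-6, max_iter=1000):
    """MEM ascent to a mode of the kernel mixture (11) with the Gaussian kernel."""
    H_inv = np.linalg.inv(H)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        U = (x - X) @ H_inv.T
        p = np.exp(-0.5 * np.einsum('ij,ij->i', U, U))   # E-step (common factors cancel)
        p /= p.sum()
        x_new = p @ X                                    # M-step: weighted mean of the data
        if np.linalg.norm(x_new - x) / max(np.linalg.norm(x_new), 1.0) < eps:
            return x_new                                 # termination rule from the text
        x = x_new
    return x

# MAC: run mem_mode from every x_i and assign points whose modes coincide
# (up to a small tolerance) to the same cluster.
```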


Robust estimation of scale and covariance with P n and its application to precision matrix estimation Robust estimation of scale and covariance with P n and its application to precision matrix estimation Garth Tarr, Samuel Müller and Neville Weber USYD 2013 School of Mathematics and Statistics THE UNIVERSITY

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles

Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles Weihua Zhou 1 University of North Carolina at Charlotte and Robert Serfling 2 University of Texas at Dallas Final revision for

More information

CHAPTER 5. Outlier Detection in Multivariate Data

CHAPTER 5. Outlier Detection in Multivariate Data CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for

More information

Estimating Gaussian Mixture Densities with EM A Tutorial

Estimating Gaussian Mixture Densities with EM A Tutorial Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables

More information

Asymptotic Relative Efficiency in Estimation

Asymptotic Relative Efficiency in Estimation Asymptotic Relative Efficiency in Estimation Robert Serfling University of Texas at Dallas October 2009 Prepared for forthcoming INTERNATIONAL ENCYCLOPEDIA OF STATISTICAL SCIENCES, to be published by Springer

More information

ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX

ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX STATISTICS IN MEDICINE Statist. Med. 17, 2685 2695 (1998) ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX N. A. CAMPBELL *, H. P. LOPUHAA AND P. J. ROUSSEEUW CSIRO Mathematical and Information

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Introduction to Robust Statistics Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Multivariate analysis Multivariate location and scatter Data where the observations

More information

Mustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson

Mustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson Proceedings of the 0 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. RELATIVE ERROR STOCHASTIC KRIGING Mustafa H. Tongarlak Bruce E. Ankenman Barry L.

More information

Identification of Multivariate Outliers: A Performance Study

Identification of Multivariate Outliers: A Performance Study AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 127 138 Identification of Multivariate Outliers: A Performance Study Peter Filzmoser Vienna University of Technology, Austria Abstract: Three

More information

Research Article Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations

Research Article Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations Journal of Quality and Reliability Engineering Volume 3, Article ID 4, 4 pages http://dx.doi.org/./3/4 Research Article Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual

More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Bayesian Decision Theory

Bayesian Decision Theory Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian

More information

Modelling Non-linear and Non-stationary Time Series

Modelling Non-linear and Non-stationary Time Series Modelling Non-linear and Non-stationary Time Series Chapter 2: Non-parametric methods Henrik Madsen Advanced Time Series Analysis September 206 Henrik Madsen (02427 Adv. TS Analysis) Lecture Notes September

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

The S-estimator of multivariate location and scatter in Stata

The S-estimator of multivariate location and scatter in Stata The Stata Journal (yyyy) vv, Number ii, pp. 1 9 The S-estimator of multivariate location and scatter in Stata Vincenzo Verardi University of Namur (FUNDP) Center for Research in the Economics of Development

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

The Performance of Mutual Information for Mixture of Bivariate Normal Distributions Based on Robust Kernel Estimation

The Performance of Mutual Information for Mixture of Bivariate Normal Distributions Based on Robust Kernel Estimation Applied Mathematical Sciences, Vol. 4, 2010, no. 29, 1417-1436 The Performance of Mutual Information for Mixture of Bivariate Normal Distributions Based on Robust Kernel Estimation Kourosh Dadkhah 1 and

More information

Fast and robust bootstrap for LTS

Fast and robust bootstrap for LTS Fast and robust bootstrap for LTS Gert Willems a,, Stefan Van Aelst b a Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium b Department of

More information

Joint Estimation of Risk Preferences and Technology: Further Discussion

Joint Estimation of Risk Preferences and Technology: Further Discussion Joint Estimation of Risk Preferences and Technology: Further Discussion Feng Wu Research Associate Gulf Coast Research and Education Center University of Florida Zhengfei Guan Assistant Professor Gulf

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Design and Implementation of CUSUM Exceedance Control Charts for Unknown Location

Design and Implementation of CUSUM Exceedance Control Charts for Unknown Location Design and Implementation of CUSUM Exceedance Control Charts for Unknown Location MARIEN A. GRAHAM Department of Statistics University of Pretoria South Africa marien.graham@up.ac.za S. CHAKRABORTI Department

More information

PARSIMONIOUS MULTIVARIATE COPULA MODEL FOR DENSITY ESTIMATION. Alireza Bayestehtashk and Izhak Shafran

PARSIMONIOUS MULTIVARIATE COPULA MODEL FOR DENSITY ESTIMATION. Alireza Bayestehtashk and Izhak Shafran PARSIMONIOUS MULTIVARIATE COPULA MODEL FOR DENSITY ESTIMATION Alireza Bayestehtashk and Izhak Shafran Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA

More information

Composite Hypotheses and Generalized Likelihood Ratio Tests

Composite Hypotheses and Generalized Likelihood Ratio Tests Composite Hypotheses and Generalized Likelihood Ratio Tests Rebecca Willett, 06 In many real world problems, it is difficult to precisely specify probability distributions. Our models for data may involve

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Robust scale estimation with extensions

Robust scale estimation with extensions Robust scale estimation with extensions Garth Tarr, Samuel Müller and Neville Weber School of Mathematics and Statistics THE UNIVERSITY OF SYDNEY Outline The robust scale estimator P n Robust covariance

More information

Solving Corrupted Quadratic Equations, Provably

Solving Corrupted Quadratic Equations, Provably Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin

More information

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 4: Measures of Robustness, Robust Principal Component Analysis

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 4: Measures of Robustness, Robust Principal Component Analysis MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 4:, Robust Principal Component Analysis Contents Empirical Robust Statistical Methods In statistics, robust methods are methods that perform well

More information

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline. MFM Practitioner Module: Risk & Asset Allocation September 11, 2013 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood

Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

A Bayesian Criterion for Clustering Stability

A Bayesian Criterion for Clustering Stability A Bayesian Criterion for Clustering Stability B. Clarke 1 1 Dept of Medicine, CCS, DEPH University of Miami Joint with H. Koepke, Stat. Dept., U Washington 26 June 2012 ISBA Kyoto Outline 1 Assessing Stability

More information

368 XUMING HE AND GANG WANG of convergence for the MVE estimator is n ;1=3. We establish strong consistency and functional continuity of the MVE estim

368 XUMING HE AND GANG WANG of convergence for the MVE estimator is n ;1=3. We establish strong consistency and functional continuity of the MVE estim Statistica Sinica 6(1996), 367-374 CROSS-CHECKING USING THE MINIMUM VOLUME ELLIPSOID ESTIMATOR Xuming He and Gang Wang University of Illinois and Depaul University Abstract: We show that for a wide class

More information

Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures

Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures David Hunter Pennsylvania State University, USA Joint work with: Tom Hettmansperger, Hoben Thomas, Didier Chauveau, Pierre Vandekerkhove,

More information

A Multivariate Process Variability Monitoring Based on Individual Observations

A Multivariate Process Variability Monitoring Based on Individual Observations www.ccsenet.org/mas Modern Applied Science Vol. 4, No. 10; October 010 A Multivariate Process Variability Monitoring Based on Individual Observations Maman A. Djauhari (Corresponding author) Department

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

Nonparametric Modal Regression

Nonparametric Modal Regression Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric

More information

ECE 275B Homework #2 Due Thursday 2/12/2015. MIDTERM is Scheduled for Thursday, February 19, 2015

ECE 275B Homework #2 Due Thursday 2/12/2015. MIDTERM is Scheduled for Thursday, February 19, 2015 Reading ECE 275B Homework #2 Due Thursday 2/12/2015 MIDTERM is Scheduled for Thursday, February 19, 2015 Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html

More information

Accurate and Powerful Multivariate Outlier Detection

Accurate and Powerful Multivariate Outlier Detection Int. Statistical Inst.: Proc. 58th World Statistical Congress, 11, Dublin (Session CPS66) p.568 Accurate and Powerful Multivariate Outlier Detection Cerioli, Andrea Università di Parma, Dipartimento di

More information

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Noname manuscript No. (will be inserted by the editor) A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Mai Zhou Yifan Yang Received: date / Accepted: date Abstract In this note

More information

Nonrobust and Robust Objective Functions

Nonrobust and Robust Objective Functions Nonrobust and Robust Objective Functions The objective function of the estimators in the input space is built from the sum of squared Mahalanobis distances (residuals) d 2 i = 1 σ 2(y i y io ) C + y i

More information

Improved Feasible Solution Algorithms for. High Breakdown Estimation. Douglas M. Hawkins. David J. Olive. Department of Applied Statistics

Improved Feasible Solution Algorithms for. High Breakdown Estimation. Douglas M. Hawkins. David J. Olive. Department of Applied Statistics Improved Feasible Solution Algorithms for High Breakdown Estimation Douglas M. Hawkins David J. Olive Department of Applied Statistics University of Minnesota St Paul, MN 55108 Abstract High breakdown

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

Detecting outliers in weighted univariate survey data

Detecting outliers in weighted univariate survey data Detecting outliers in weighted univariate survey data Anna Pauliina Sandqvist October 27, 21 Preliminary Version Abstract Outliers and influential observations are a frequent concern in all kind of statistics,

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

INVARIANT COORDINATE SELECTION

INVARIANT COORDINATE SELECTION INVARIANT COORDINATE SELECTION By David E. Tyler 1, Frank Critchley, Lutz Dümbgen 2, and Hannu Oja Rutgers University, Open University, University of Berne and University of Tampere SUMMARY A general method

More information

Supervised Learning: Non-parametric Estimation

Supervised Learning: Non-parametric Estimation Supervised Learning: Non-parametric Estimation Edmondo Trentin March 18, 2018 Non-parametric Estimates No assumptions are made on the form of the pdfs 1. There are 3 major instances of non-parametric estimates:

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Gradient Descent. Sargur Srihari

Gradient Descent. Sargur Srihari Gradient Descent Sargur srihari@cedar.buffalo.edu 1 Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors

More information

Data Exploration and Unsupervised Learning with Clustering

Data Exploration and Unsupervised Learning with Clustering Data Exploration and Unsupervised Learning with Clustering Paul F Rodriguez,PhD San Diego Supercomputer Center Predictive Analytic Center of Excellence Clustering Idea Given a set of data can we find a

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points Ettore Marubini (1), Annalisa Orenti (1) Background: Identification and assessment of outliers, have

More information