The Algorithm for Multiple Outliers Detection Against Masking and Swamping Effects


Int. J. Contemp. Math. Sciences, Vol. 3, 2008, no. 17

Jung-Tsung Chiang
Department of Business Administration
Ling Tung University, No. 1 Ling-Tung Rd., Taichung City, Taiwan
jungtsung@mail2000.com.tw

Abstract

Gentleman and Wilk's method for detecting $k$ outliers searches for the subgroup of $n-k$ observations with the minimum residual sum of squares (Min SSE). The method proposed here modifies that criterion to the minimum prediction error sum of squares (Min PRESS). A fast algorithm then finds the best construction data set (the one containing all good observations) from the absolute Jackknife residuals at the first stage: the entire data set is divided into two groups, a clean group used to compute the predicted function and a second group, containing the outliers, which is examined by ADPs (absolute deviations of predictions). A simulation study and two well-known examples are presented.

Keywords: OLS method, Masking effect, Swamping effect, Jackknife residuals, Cook's Distance, Data Structure

1 Introduction

The OLS method is widely used in linear models to detect outliers. Outliers in a sample reflect either (1) errors of measurement or (2) intrinsic variability (mean shift, inflation of variances, or others). In case (1) the outliers should be excluded from the sample or corrected; in case (2) further methods can be developed where possible. Several authors have studied multiple-outlier detection. Gentleman and Wilk (1975) [3] proposed the deletion method to identify a subset of $k$ outliers: observations are deleted sequentially so as to produce the largest reduction in the residual sum of squares. This is equivalent to finding the $(n-k)$-subset of the original data with the minimum residual sum of squared errors (MSSE).

A problem arises from the substantial computation, which may be infeasible for data sets with a large number of observations. Marasinghe (1985) [8] therefore introduced a new test statistic, $F_k$, for detecting multiple outliers in linear regression models; the statistic is derived from a sequential testing procedure. Paul and Fung (1991) [10] studied a generalized extreme studentized residual (GESR) procedure, which controls the type I error and uses a two-phase scheme to identify outliers with high leverage values. Hadi and Simonoff (1993) [5] offered adjusted residuals as criteria for outlier detection and introduced two algorithms for detecting multiple outliers; in both, the approximated MSSE subsets are tested for the no-outlier hypothesis with a combination of single-linkage clustering (Hartigan 1981) [6] and back-stepping (Rosner 1975 [11]; Simonoff 1984a [13]) to avoid the masking and swamping effects. More recently, Wei and Fung (1999) [16] proposed the mean-shift outlier model for general weighted regression.

In this study we give an explicit, clean definition of outliers based on a best construction data set and develop a fast algorithm, via data splitting, to identify them. A reasonable data splitting is required: the difficulty lies in establishing an effective factor space in which the least squares equation satisfies the interpolation rule. The ordinary least squares (OLS) method must also be modified, since the distribution of the outliers differs from that of the clean data. Moreover, influential points affect the OLS predicted function and may suffer from masking and swamping effects (see Chiang, 2007 [2]). Throughout the article we focus on the case of mean-shifted expected values (mean-shift models). Two well-known examples and a simulation study are presented.

2 The Formulation of the Methods for Multiple Outliers

We consider the full-rank linear model

$$Y = X\beta + \epsilon \qquad (1)$$

where $X$ is a known $n \times p$ ($n \geq p+1$) full-rank matrix, $\beta$ is an unknown $p \times 1$ vector, and $\epsilon$ is an $n \times 1$ error vector with i.i.d. components, $\epsilon \sim N(0, \sigma^2 I_n)$. The least squares residuals $e$ and their variance are

$$e = Y - \hat{Y} = (I - H)Y, \qquad \mathrm{Var}(e) = (I - H)\sigma^2,$$

where the hat matrix $H = X(X'X)^{-1}X'$ is symmetric and idempotent. Now delete the $i$th observation and use the remaining $n-1$ observations to compute the fitted value of the $i$th case, $\hat{Y}_{i(i)}$. The difference between the observed value $Y_i$ and $\hat{Y}_{i(i)}$ is called the deleted residual of the $i$th case, denoted by $d_i = Y_i - \hat{Y}_{i(i)}$.
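These diagnostics need no refitting, since the deleted residual has the standard closed form $d_i = e_i/(1-h_{ii})$. A minimal numpy sketch (the function name is illustrative, and the design matrix `X` is assumed to already contain a column of ones):

```python
import numpy as np

def ols_diagnostics(X, Y):
    """Hat matrix, OLS residuals, leverages, and deleted residuals."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X(X'X)^{-1}X'
    e = Y - H @ Y                          # OLS residuals e = (I - H)Y
    h = np.diag(H)                         # leverages h_ii
    d = e / (1.0 - h)                      # deleted residuals d_i = e_i/(1 - h_ii)
    return H, e, h, d
```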

The Jackknife residual $r_i^*$ is defined as

$$r_i^* = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}, \qquad i = 1, 2, \ldots, n \qquad (2)$$

where $s_{(i)}^2 = \frac{Y_{(i)}'(I-H_{(i)})Y_{(i)}}{n-p-1}$ and $h_{ii} = X_i'(X'X)^{-1}X_i$ is the $i$th diagonal element of $H$; deletion of the $i$th observation is indicated by the index $(i)$ in brackets. If $\mathrm{rank}(X_{(i)}) = p$ and the $\epsilon$ are i.i.d. $N(0, \sigma^2 I_n)$, then the Jackknife residuals $r_i^*$, $i = 1, 2, \ldots, n$, are $t_{n-p-1}$ distributed (Beckman and Trussell, 1974 [1]).

Next, the prediction error sum of squares is defined as $PRESS = \sum_{i=1}^{n} d_i^2$, where $d_i = Y_i - \hat{Y}_{i(i)}$ is the deleted residual of the $i$th case. Then

$$PRESS = \sum_{i=1}^{n} \frac{e_i^2}{(1-h_{ii})^2} \qquad (3)$$

The PRESS treats each observation as a new one when it is deleted from the data set; it is equivalent to the cross-validation criterion CV(1) (Stone, 1973 [14]). A model with small PRESS is considered well fitted, so a modified criterion for identifying a single outlier is

$$\mathrm{Min}: PRESS_{G_i}, \quad i = 1, 2, \ldots, n \qquad (4)$$

where $G_i$ is the subgroup without the $i$th observation. For a large sample, PRESS and SSE are asymptotically equivalent (McQuarrie, 1999 [9]); criterion (4) is then also equivalent to Grubbs' test (Grubbs, 1950 [4]).

Here we consider a data set $M$ of size $n$ containing $k$ outliers, and the entirely clean data set $C$ with $n_c = n-k$ good observations is a subset of $M$. Suppose all good observations of the subset $C$ come from the target distribution of the linear model, i.e., $Y_c \sim N(X_c\beta, \sigma^2 I_{n_c})$. The OLS estimator $\hat{\beta}_c = (X_c'X_c)^{-1}X_c'Y_c$ is invariant if $k$ new observations $(X_{n_c+1}, \hat{Y}_{n_c+1}), \ldots, (X_{n_c+k}, \hat{Y}_{n_c+k})$ are added to the data set $C$. That is,

$$\hat{\beta}_c = (X'X)^{-1}X'\tilde{Y} \qquad (5)$$

where $X$ is an $n \times p$ matrix, $\tilde{Y} = (Y_1, \ldots, Y_{n_c}, \hat{Y}_{n_c+1}, \ldots, \hat{Y}_{n_c+k})'$, and $\hat{Y}_{n_c+j} = X_{n_c+j}'\hat{\beta}_c$, $j = 1, \ldots, k$. The new data set $B$ of $(X, \tilde{Y})$ is associated with the residuals $(e_1, e_2, \ldots, e_{n_c}, 0, \ldots, 0)'$. Next, let the original data set $M$ be expressed as

$$\{(X_1, Y_1), \ldots, (X_{n_c}, Y_{n_c}), (X_{n_c+1}, \hat{Y}_{n_c+1} + d_{n_c+1}), \ldots, (X_{n_c+k}, \hat{Y}_{n_c+k} + d_{n_c+k})\}$$

where $d_j > 2\sigma Z_{1-\alpha/2}$, $j = n_c+1, \ldots, n_c+k$, and the last $k$ observations of $M$ are the outliers. Obviously $Y_M = \tilde{Y} + \Delta Y$, with $\Delta Y = (0, \ldots, 0, d_{n_c+1}, \ldots, d_{n_c+k})'$.
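Before examining the effect of $\Delta Y$, note that the quantities in (2)-(4) also have leave-one-out closed forms ($SSE_{(i)} = SSE - e_i^2/(1-h_{ii})$), so none of them require $n$ refits. A minimal sketch, with illustrative function names:

```python
import numpy as np

def jackknife_residuals(X, Y):
    """Jackknife residuals r_i* of (2), via the usual leave-one-out closed forms."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = Y - H @ Y
    h = np.diag(H)
    # s_(i)^2 = (SSE - e_i^2/(1 - h_ii)) / (n - p - 1): variance with case i removed
    s2_i = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_i * (1.0 - h))

def press(X, Y):
    """PRESS of (3): sum of e_i^2 / (1 - h_ii)^2."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = Y - H @ Y
    h = np.diag(H)
    return float(np.sum((e / (1.0 - h)) ** 2))

def single_outlier_by_press(X, Y):
    """Criterion (4): the case i whose deletion gives the smallest PRESS_{G_i}."""
    scores = [press(np.delete(X, i, axis=0), np.delete(Y, i)) for i in range(len(Y))]
    return int(np.argmin(scores))
```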

The OLS residuals of the original data set $M$ are then

$$\tilde{e} = (I-H)Y_M = (I-H)(\tilde{Y} + \Delta Y) = (I-H)\tilde{Y} + (I-H)\Delta Y$$
$$= (e_1, \ldots, e_{n_c}, 0, \ldots, 0)' + \Delta Y - H\Delta Y = (e_1, \ldots, e_{n_c}, d_{n_c+1}, \ldots, d_{n_c+k})' - H\Delta Y = e - H\Delta Y$$

where $H$ is the $n \times n$ hat matrix and $e$ is the $n \times 1$ vector computed from the clean data set $C$. Consequently, the studentized residuals $\tilde{r}_i$ corresponding to $\tilde{e}_i$ are

$$\tilde{r}_i = \frac{e_i - \sum_{j=n_c+1}^{n} h_{ij}d_j}{s\sqrt{1-h_{ii}}}, \quad i = 1, 2, \ldots, n_c \qquad (6)$$

and

$$\tilde{r}_i = \frac{d_i - \sum_{j=n_c+1}^{n} h_{ij}d_j}{s\sqrt{1-h_{ii}}}, \quad i = n_c+1, \ldots, n_c+k \qquad (7)$$

These show that (1) for $i = 1, 2, \ldots, n_c$, some $\tilde{r}_i$ may exceed the critical points, so the swamping effect appears on those cases: they are incorrectly regarded as outliers; and (2) for $i = n_c+1, \ldots, n_c+k$, some $\tilde{r}_i$ may fall below the critical points, so the masking effect appears on those cases: they are incorrectly regarded as good observations (inliers). As argued above, the masking and swamping effects depend on the locations $h_{ii}$, the correlations $h_{ij}$, the signs of the $d_i$, and the permutations of the outliers.

Basically, the $k$ outliers can be regarded as a perturbation of the clean data set $C$. In the linear model $Y = X\beta + \epsilon \sim N(X\beta, \sigma^2)$, suppose the true function $Y = X\beta$ is known. Then the entire subset of $k$ outliers satisfies

$$\mathrm{Max}: \sum_{i \in I_j} \epsilon_i^2 \qquad (8)$$

where $I_j$ ranges over the subsets of size $k$ of the sample. This is equivalent to

$$\mathrm{Min}: \sum_{i \notin I_j} \epsilon_i^2 \qquad (9)$$

If the true function is known, only one entire subset of $k$ outliers in a sample satisfies this definition. It may happen, however, that two deleted subsets, say $I_1$ and $I_2$, give $SSE_{(I_1)} = SSE_{(I_2)} = \min_i SSE_{(I_i)}$; a good choice between $I_1$ and $I_2$ as the entire subset of $k$ outliers is then the one with the smaller $\mathrm{Tr}(X'X)^{-1}$, which corresponds to a shorter confidence interval for $\hat{\beta}_c$.
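To see the masking mechanism of (6)-(7) concretely, the following small simulation plants two same-side outliers at remote design points, so that the cross terms $\sum_j h_{ij}d_j$ are large. This sketch reuses `jackknife_residuals` from above; all settings (seed, sample size, shift of 5) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean simple-regression data plus two planted outliers at the two most
# remote x values, so their leverages h_ii and cross terms h_ij are large
# and the -sum h_ij d_j term in (7) can pull their residuals down.
n = 30
x = np.sort(rng.uniform(0.0, 10.0, n))
x[-2:] = [19.0, 20.0]                 # two high-leverage design points
X = np.column_stack([np.ones(n), x])
Y = 1.0 + x + rng.normal(size=n)
Y[-2:] += 5.0                         # mean shifts d_i = 5 on both cases

r = jackknife_residuals(X, Y)         # sketch from earlier in this section
print(np.round(r[-2:], 2))            # per (7), these may sit below the cutoff
```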

An ideal criterion for the best clean data set is to minimize the predictive sum of squared errors:

$$\mathrm{Min}: PRESS_{G_i}, \quad i = 1, 2, \ldots, C(n, n-k) \qquad (10)$$

where $G_i$ is an arbitrary subgroup of $n-k$ observations. Now, if the outliers come from mean shifts of $k$ observations, the mean-shift model is

$$Y = X\beta + D + \epsilon \qquad (11)$$

where $D = (0, \ldots, 0, d_{n-k+1}, \ldots, d_n)'$ and $d_i > 2Z_{1-\alpha/2}\sigma$. The $\{(X_i, Y_i)\}_{i=n-k+1}^{n}$ are the $k$ outliers with mean shifts $\{d_i\}_{i=n-k+1}^{n}$. The best clean data set, denoted by $C$, is $\{(X_i, Y_i)\}_{i=1}^{n_c}$, where $n_c = n-k$ or less. The predictive confidence interval $I_p$ computed from the clean data set $C$ is

$$X_i'\hat{\beta}_c \pm C_{\alpha}\, s_c \sqrt{1 + X_i'(X_c'X_c)^{-1}X_i} \qquad (12)$$

where $s_c = \sqrt{MSE_c}$ and $X = (X_c', X_i)'$. Since $X_i'(X_c'X_c)^{-1}X_i = \frac{h_{ii}}{1-h_{ii}}$ (see p. 127 of Applied Linear Regression by Weisberg, 1985 [15]), the interval $I_p$ can be written as

$$X_i'\hat{\beta}_c \pm \frac{C_{\alpha}\, s_c}{\sqrt{1-h_{ii}}} \qquad (13)$$

where the approximate critical points $C_{\alpha} = t(1-\alpha/(2n_c),\, n_c-p-1)$ follow from the upper bound of the Bonferroni inequality and the $t$ distribution of the Jackknife residuals. If needed, an adjusted critical value $t(1-\frac{SSE}{PRESS}\,\alpha/(2n_c),\, n_c-p-1)$ may be used instead, depending on the data structure. The subset $I$ of $k$ outliers is then obtained by

$$Y_i \notin I_p, \quad i = 1, \ldots, n \qquad (14)$$

Here only $i = n-k+1, \ldots, n$ will be correctly identified as outlying cases. Moreover, each $C \cup \{i\}$, $i \in I$, forms a new data set with a single outlier, whose absolute Jackknife residual $|r_i^*|$ is greater than the critical value $C_{\alpha}$. Note that the size $n_c = n-k$ of the ideal subset $C$ of good observations may be reduced to $n-k-1$, $n-k-2$, or $n-k-3$; this depends on the researcher's judgment of the residual plots and on $\sum_{i=1}^{n_c} \mathrm{sign}(e_i) \approx 0$ for a large sample.
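Interval (12) with the Bonferroni critical point is simple to compute; here is a minimal sketch (function name illustrative; `scipy.stats.t.ppf` supplies the $t$ quantile, and `X_new` holds the rows to be screened by (14)):

```python
import numpy as np
from scipy import stats

def prediction_interval(Xc, Yc, X_new, alpha=0.05):
    """Interval (12)-(13) from the clean set C, with the Bonferroni critical
    point C_alpha = t(1 - alpha/(2 n_c), n_c - p - 1)."""
    nc, p = Xc.shape
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    beta_c = XtX_inv @ Xc.T @ Yc
    resid = Yc - Xc @ beta_c
    s_c = np.sqrt(resid @ resid / (nc - p))          # s_c = sqrt(MSE_c)
    C_alpha = stats.t.ppf(1.0 - alpha / (2.0 * nc), nc - p - 1)
    half = C_alpha * s_c * np.sqrt(1.0 + np.sum((X_new @ XtX_inv) * X_new, axis=1))
    center = X_new @ beta_c
    return center - half, center + half              # Y_i outside => flag by (14)
```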

3 The Algorithm, Famous Examples, and Simulation Studies

An approach to outlier detection in linear models is available through data splitting. If an entire data set of size $n$ contains $k$ outliers, then all $C(n, n-k)$ partitions of the data are candidate construction data sets $G_i$, $i = 1, 2, \ldots, C(n, n-k)$, and the best construction data set with $n-k$ clean observations is the one achieving

$$\mathrm{Min}: PRESS_{G_i}, \quad i = 1, 2, \ldots, C(n, n-k)$$

However, this search involves a great deal of computation, so a fast data-splitting algorithm is proposed below.

k = 1 (one outlier). A single outlier is easy to identify from the maximum absolute Jackknife residual; the leave-one-out procedure is the best way here. In practice $k$ is unknown, so we instead use data splitting to find the subset of the $k$ most likely outliers, with the Jackknife residuals $r_i^*$ and Cook's distances $D_i$ used to pick the construction data set at the first stage.

Algorithm (k unknown):

Step 1: Use all observations to calculate the Jackknife residual $r_i^*$, $i = 1, \ldots, n$, of every observation.

Step 2: Arrange the absolute values $|r_i^*|$ in ascending order, $|r^*|_{(1)} \leq |r^*|_{(2)} \leq \cdots \leq |r^*|_{(n)}$, where $|r^*|_{(1)} = \min_i |r_i^*|$ and $|r^*|_{(n)} = \max_i |r_i^*|$.

Step 3: Choose the observations $\{|r^*|_{(i)}\}_{i=1}^{[qn]}$ ($0.8 \leq q \leq 0.95$) as a construction data set $C_1$; the remaining observations form a validation data set $V_1$.

Step 4: Confirm that $C_1$ is clean from the plot of its Jackknife residuals: the maximum absolute Jackknife residual of $C_1$ should be less than the critical point $C_{\alpha}$.

Step 5: Calculate the ADP (absolute deviation of prediction) for the $n_v$ validation observations, $|Y_{i,V_1} - \hat{Y}_{i,C_1}|$, $i = 1, 2, \ldots, n_v$. Then move the observations of $V_1$ with small ADPs (try ADP $< C_{\alpha}\sqrt{MSE_c}$) into $C_1$ to obtain a new construction data set $C_2$; the remaining observations form $V_2$, and the new predicted values $\hat{Y}_{i,C_2}$ are computed from $C_2$.

Step 6: In the same way, move the observations with small ADPs into $C_2$ to get $C_3$, and so on. Steps 4 and 5 are repeated until the best construction data set of size $n-k$ is obtained; the remaining $k$ observations are outliers, since they do not belong to $I_p$.

If Cook's distance $D_i$ replaces $|r^*|_{(i)}$ in Steps 1-3, the outcome may differ slightly.
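A compact sketch of Steps 1-6 under the stated assumptions follows. It reuses `jackknife_residuals` from Section 2; the function name and defaults are illustrative, and the visual check of Step 4 is left to the analyst:

```python
import numpy as np
from scipy import stats

def split_algorithm(X, Y, q=0.9, alpha=0.05):
    """Data-splitting sketch of Steps 1-6; returns (construction set, outliers)."""
    n, p = X.shape
    r_abs = np.abs(jackknife_residuals(X, Y))        # Step 1
    order = np.argsort(r_abs)                        # Step 2: ascending |r_i*|
    cut = int(q * n)
    C, V = list(order[:cut]), list(order[cut:])      # Step 3: C_1 and V_1
    while V:                                         # Steps 5-6
        Xc, Yc = X[C], Y[C]
        beta = np.linalg.lstsq(Xc, Yc, rcond=None)[0]
        mse = float(np.sum((Yc - Xc @ beta) ** 2)) / (len(C) - p)
        c_alpha = stats.t.ppf(1.0 - alpha / (2.0 * len(C)), len(C) - p - 1)
        adp = np.abs(Y[V] - X[V] @ beta)             # |Y_{i,V} - Yhat_{i,C}|
        back = [v for v, a in zip(V, adp) if a < c_alpha * np.sqrt(mse)]
        if not back:                                 # nothing moves: V are outliers
            break
        C += back
        V = [v for v in V if v not in back]
    return sorted(C), sorted(V)
```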

The relation between $D_i$ and $r_i^*$ is

$$D_i = \frac{(n-p)\,h_{ii}\,(r_i^*)^2}{p\,(1-h_{ii})\,(n-p-1+(r_i^*)^2)} \qquad (15)$$

which follows from the definition of Cook's distance and the relation between $r_i$ and $r_i^*$. Some of these points are elaborated in the following subsections.

3.1 Two Famous Examples

(1) Hawkins, Bradu and Kass's constructed data. The data set is reproduced from the above authors (see Table 1) [7], and Tables 2 and 3 give the Jackknife residuals and the related statistics (Figures 1 to 3) used to find the best construction subset. The data contain outliers at cases 1-10, but these go undetected owing to the masking effect: their Jackknife residuals, starting from 1.18, all fall below the critical point. The good observations 11 and 12 are incorrectly identified as outliers owing to the swamping effect; their Jackknife residuals are 4.03 and 5.29, respectively. OLS methods applied to the entire data set fail to identify the outliers, with the exception of Hadi's algorithm. We propose the following fast algorithm to find the outliers:

Step 1: Use the original data $C = \{1, 2, \ldots, 75\}$ to calculate the Jackknife residuals of all observations.

Step 2: Delete the subset $V_1 = \{1, 2, \ldots, 14, 44\}$, whose absolute Jackknife residuals rank in the top fifteen. The new construction data set $C_1 = C - V_1$ is clean according to the plot of its Jackknife residuals, and the corresponding predicted function $\hat{Y}$ in $X1$, $X2$, $X3$ is fitted from $C_1$.

Step 3: Compute the ADPs of $V_1$, i.e., $\{|Y_{i,V_1} - \hat{Y}_{i,C_1}|\}$, and move the cases 11-14 and 44 back into $C_1$, since their ADPs are less than $C_{\alpha}\sqrt{MSE_{c_1}}$. The construction data set $C_2 = C_1 \cup \{11\text{-}14, 44\}$ then yields the new predicted function.

Step 4: $C_2$ is a good clean data set according to the plot of residuals (see Figure 4).

Step 5: Examining the data sets $C_3, C_4, \ldots, C_{10}$, each contains exactly one outlier according to the Jackknife residuals. It is worth noting that observations 1-10 are significantly outlying, with $|r^*|$ values much greater than the critical point $C_{\alpha}$. We therefore declare observations 1-10 to be the outliers in the original data set. We analyzed the Hadi and Simonoff data set in the same way, with satisfactory results.
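Relation (15) also lets Cook's distance be computed directly from the Jackknife residual and the leverage when comparing the two rankings of Steps 1-3, as in Table 2. A one-line helper (name illustrative; works elementwise on numpy arrays):

```python
def cooks_from_jackknife(r_star, h, n, p):
    """Relation (15): Cook's distance D_i from the Jackknife residual r_i*
    and the leverage h_ii."""
    return ((n - p) * h * r_star**2) / (p * (1.0 - h) * (n - p - 1 + r_star**2))
```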

(2) Hadi and Simonoff's artificial data. The data set was created by Hadi and Simonoff (1993) [5] (see Table 4). They picked two predictors $X1$ and $X2$ distributed as uniform$(0, 15)$ and set the regression function $Y = X1 + X2 + \epsilon$, where $\epsilon \sim N(0, 1)$; then observations 1-3 had the quantity 4 added, giving the mean-shift model $Y = X1 + X2 + 4 + \epsilon$ for those cases. The outliers 1-3 go unidentified because of the masking effect (see Table 5), so the plain OLS method is at a disadvantage for outlier detection here, and alternative methods such as LMS and LTS (Rousseeuw, 1984) [12] also fail to identify them. The modified method of Hadi and Simonoff detects the three outliers successfully, but it is somewhat involved. In this example our procedure yields a well-fitted regression function in $X1$ and $X2$ based on a clean data set and identifies observations 1, 2, 3 as the outliers. The masking effect appears in the entire data set because the three outliers 1, 2, 3 lie on the same side and their leverage values are very close together; Section 2 discusses this situation in detail, and the mean-shift value of 4 corresponds to it.

3.2 Simulation Studies for Samples with Two and Three Outliers

Example 1: $Y = 1 + X + \epsilon$, $\epsilon \sim N(0, 1)$. A data set of size 51 is generated from the linear model $Y = 1 + X + \epsilon$, where $X_i \sim U(0, 10)$ and $\epsilon \sim N(0, 1)$. Two outliers are planted at observations 1 and 20 by adding the quantity 5 to the null data set (see Table 6).

Example 2: $Y = 2 + X1 + X2 + \epsilon$, $\epsilon \sim N(0, 1)$. A data set of size 60 is generated from the linear model $Y = 2 + X1 + X2 + \epsilon$, where $X_{1i} \sim U(0, 10)$, $X_{2i} \sim U(0, 15)$ and $\epsilon \sim N(0, 1)$. Three outliers are planted at observations 1, 2, 3 by adding the quantity 4.5 to the null data set (see Table 7).

In both simulation studies the procedure again gives satisfactory outlier detection.
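The simulations (originally generated in R) are easy to reproduce; a sketch of the Example 1 setup, feeding the `split_algorithm` sketch from above (seed and variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2008)

# Example 1 setup: Y = 1 + X + eps with X ~ U(0, 10), n = 51, and outliers
# planted at observations 1 and 20 by adding 5 (0-based indices 0 and 19).
n = 51
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + x + rng.normal(size=n)
Y[[0, 19]] += 5.0

C, outliers = split_algorithm(X, Y)   # the sketch from Section 3
print(outliers)                       # ideally recovers cases {0, 19}
```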

4 Conclusions

The OLS method is widely used in linear regression models, but masking and swamping effects remain unavoidable when the data contain a few high-leverage points. In general, the data structure provides much of the information on this topic, and the Jackknife residuals and Cook's distance are good indicators. In Section 2 we showed that the locations of the outliers, the signs of their residuals, and the permutations of the outliers are the main factors behind masking and swamping. Gentleman and Wilk's criterion $\mathrm{Min}: SSE_{G_i}$ for multiple outliers can be modified to $\mathrm{Min}: PRESS_{G_i}$. The fast algorithm developed here for finding multiple outliers is based on data splitting and Jackknife residuals, and the two well-known examples in Section 3 illustrate that it is much simpler than GESR, the multistage procedure, and Hadi's algorithms. The diagnosis of a single outlier in linear models thus extends to multiple-outlier detection, so that the masking and swamping effects are no longer a problem.

Acknowledgements. The author would like to thank his advisor Dr. Kenny Ye, Prof. Nancy Mendell, and Prof. Hongshik Ahn for their assistance in revising the paper. This paper is part of his doctoral dissertation (August 2002, AMS Dept., SUNY-Stony Brook, USA).

References

[1] R.J. Beckman and H.J. Trussell, The distribution of an arbitrary studentized residual and the effects of updating in multiple regression, J. Amer. Statist. Assoc., 69 (1974).

[2] J.T. Chiang, The masking and swamping effects using the planted mean-shift outliers models, Int. J. Contemp. Math. Sciences, 2 (2007), no. 7, 297-307.

[3] J.F. Gentleman and M.B. Wilk, Detecting outliers II: Supplementing the direct analysis of residuals, Biometrics, 31 (1975a).

[4] F.E. Grubbs, Sample criteria for testing outlying observations, Annals of Mathematical Statistics, 21 (1950).

[5] A.S. Hadi and J.S. Simonoff, Procedures for the identification of multiple outliers in linear models, Journal of the American Statistical Association, 88 (1993), no. 424.

[6] J.A. Hartigan, Consistency of single linkage for high-density clusters, Journal of the American Statistical Association, 76 (1981).

[7] D.M. Hawkins, D. Bradu and G.V. Kass, Location of several outliers in multiple-regression data using elemental sets, Technometrics, 26 (1984).

[8] M.G. Marasinghe, A multistage procedure for detecting several outliers in linear regression, Technometrics, 27 (1985).

[9] A.D.R. McQuarrie and C.-L. Tsai, Regression and Time Series Model Selection, World Scientific Publishing Co. Pte. Ltd., 1999.

[10] S.R. Paul and Y. Fung, A generalized extreme studentized residual multiple-outlier-detection procedure in linear regression, Technometrics, 33 (1991).

[11] B. Rosner, On the detection of many outliers, Technometrics, 17 (1975).

[12] P.J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association, 79 (1984).

[13] J.S. Simonoff, The calculation of outlier detection statistics, Communications in Statistics, Part B - Simulation and Computation, 13 (1984a).

[14] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Ser. B, 36 (1973).

[15] S. Weisberg, Applied Linear Regression, John Wiley and Sons, Inc., 1985.

[16] W.H. Wei and W.K. Fung, The mean-shift outlier model in general weighted regression and its applications, Computational Statistics and Data Analysis, 30 (1999).

Table 1: Hawkins, Bradu and Kass's constructed data (columns: Obs, Y, X1, X2, X3, for the 75 observations; the numeric entries were lost in extraction and are not reproduced here).

Table 2: Cook's distance $D_i$ and related statistics for the Hawkins data (columns: Obs, $D_i$, $r^*$, $h_{ii}$, CVR, DFFITS; numeric entries not reproduced here).
*The observations with the top fifteen largest $D_i$ are 1-14 and 43, slightly different from the top fifteen largest $|r_i^*|$, which are 1-14 and 44. These fifteen cases are deleted at the first stage.

Table 3: Obtaining the best construction data set step by step on the Hawkins et al. data set (columns: $V_i$, observations in $V_i$, Max $|r^*|$ of $C_i$, construction data set $C_i = C - V_i$, for $i = 1, \ldots, 13$; numeric entries not reproduced here).
*In the third column, the number in parentheses indicates the observation attaining the maximum. The best construction data set, of size 65, is $C_2 = \{11, 12, \ldots, 75\}$.

Table 4: Hadi and Simonoff's artificial data (columns: Obs, Y, X1, X2; numeric entries not reproduced here).

Table 5: Deleting subsets based on the Jackknife residuals $r^*$ (rows: Obs, $r^*$ values at each deletion stage, and an outlier indicator, Yes for Obs 1-3 and No for all others; numeric entries not reproduced here).
*The best construction data set is $C - \{1, 2, 3\}$, and observations 1, 2, 3 are the outliers, computed from the Hadi and Simonoff data set.

Table 6: The simulation data set with two outliers (columns: Obs, Y, X; numeric entries not reproduced here). The sample of size 51 was generated by the R project, version R 1.4.1.

Table 7: The simulation data set with three outliers (columns: Obs, Y, X1, X2; numeric entries not reproduced here). The sample of size 60 was generated by the R project, version R 1.4.1.

Figure 1: The scatter plot of Jackknife residuals of the Hawkins et al. data set shows that the linear model is not well fitted for the entire set.
*Several outliers create perturbation effects on this data set.

Figure 2: Index plot of the leverage measure (diagonal of the hat matrix) for the Hawkins et al. data set.
*The horizontal line is at the mean of the $h$ values, and the segments function joins pairs of points by a line. Observation 14 is a high-leverage point.

Figure 3: Scatter plot of absolute Jackknife residuals vs. Cook's distance for the Hawkins data, highlighting the observations with larger Cook's distance.
*Note that $D_i = \frac{(n-p)\,h_{ii}\,(r_i^*)^2}{p\,(1-h_{ii})\,(n-p-1+(r_i^*)^2)}$ is not a one-to-one correspondence when mapping $(h_{ii}, r_i^*)$ to $D_i$.

Figure 4: The scatter plot of residuals shows that the linear model is well fitted for the subset $C_2$.

Received: December 9, 2007
