
Bringing Order to Outlier Diagnostics in Regression Models

D. R. Jensen and D. E. Ramirez
Virginia Polytechnic Institute and State University and University of Virginia
http://www.math.virginia.edu/~der/home.html

0.1 INTRODUCTION

Consider the model $\{Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i\ (1 \le i \le n)\}$ relating a response $Y_i$ to fixed regressors $\{X_{i1}, X_{i2}, \ldots, X_{ik}\}$ through $p = k+1$ unknown parameters $\beta = [\beta_0, \beta_1, \ldots, \beta_k]'$.

THE FULL MODEL $Y = X_0\beta + \varepsilon$ is partitioned according to the proposed outliers $\{Y_i : i \in I\}$ of cardinality $r$:
$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} X \\ Z \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \qquad Y_1\ ((n-r)\times 1),\quad Y_2\ (r\times 1),$$
with least-squares estimator $\hat\beta$ $(p\times 1)$. All matrices are assumed to be of full rank.

THE REDUCED MODEL $Y_1 = X\beta + \varepsilon_1$, with the proposed outliers deleted, has least-squares estimator $\hat\beta_I$.
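The partitioned fit can be illustrated with a short numerical sketch. The simulated data, the index set, and the use of numpy are my own assumptions; only the roles of $X_0$, $X$, $Z$, $\hat\beta$, $\hat\beta_I$, $s^2$, and $s_I^2$ follow the text.

```python
# Minimal sketch of the full and reduced least-squares fits (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 31, 3, 2                     # n cases, k regressors, r proposed outliers
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # full design, p = k + 1 columns
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)

I = [9, 28]                            # indices of the proposed outliers (cardinality r)
keep = np.setdiff1d(np.arange(n), I)
X, Z = X0[keep], X0[I]                 # partition of the design: rows kept vs. rows in I
Y1, Y2 = Y[keep], Y[I]

# Full-model least-squares estimator beta_hat and residual variance s^2
beta_hat, *_ = np.linalg.lstsq(X0, Y, rcond=None)
s2 = np.sum((Y - X0 @ beta_hat) ** 2) / (n - p)

# Reduced-model estimator beta_hat_I (proposed outliers deleted) and s_I^2
beta_hat_I, *_ = np.linalg.lstsq(X, Y1, rcond=None)
s2_I = np.sum((Y1 - X @ beta_hat_I) ** 2) / (n - r - p)
print(beta_hat, beta_hat_I, s2, s2_I)
```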

0.2 STANDARD OUTLIER DIAGNOSTICS

0.2.1 GROUP I DIAGNOSTICS (t-like statistics)

STUDENTIZED DELETED RESIDUAL TEST (Gold Standard)
$$\frac{y_i - \hat y_{(i)i}}{s_{(i)}\sqrt{1 + x_i(X'X)^{-1}x_i'}} = \frac{y_i - \hat y_{(i)i}}{s_{(i)}\sqrt{1 + h_{ii}/(1 - h_{ii})}} \sim t(n-p-1) \quad (1)$$

THE EXTERNALLY STUDENTIZED RESIDUALS (R-Student statistic), easy to compute:
$$t_i = \frac{y_i - \hat y_i}{s_{(i)}\sqrt{1 - x_i(X_0'X_0)^{-1}x_i'}} = \frac{y_i - \hat y_i}{s_{(i)}\sqrt{1 - h_{ii}}} \sim t(n-p-1) \quad (2)$$

DFFITS
$$\mathrm{DFFITS}_i = \frac{\hat y_i - \hat y_{(i)i}}{s_{(i)}\sqrt{h_{ii}}} = t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}$$
(tabulated at leverages $h_{ii} = 0.25, \ldots, 0.95$ in the original).
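A minimal sketch of the Group I quantities for a single case follows; the simulated data, names, and use of numpy/scipy are assumptions. It computes the leverage $h_{ii}$, the R-Student value $t_i$ of (2) from a deleted refit, and DFFITS$_i$.

```python
# Group I diagnostics for one case i: leverage, R-Student, DFFITS (hedged sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 31, 3
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)

H = X0 @ np.linalg.solve(X0.T @ X0, X0.T)    # hat matrix H = X0 (X0'X0)^{-1} X0'
h = np.diag(H)                               # leverages h_ii
e = Y - H @ Y                                # ordinary residuals y_i - yhat_i

i = 9                                        # case under scrutiny
keep = np.delete(np.arange(n), i)
X = X0[keep]
b_i = np.linalg.solve(X.T @ X, X.T @ Y[keep])            # fit with case i deleted
s2_i = np.sum((Y[keep] - X @ b_i) ** 2) / (n - p - 1)    # s_(i)^2

t_i = e[i] / np.sqrt(s2_i * (1 - h[i]))                  # R-Student, eq. (2)
dffits_i = t_i * np.sqrt(h[i] / (1 - h[i]))              # DFFITS_i
p_val = 2 * stats.t.sf(abs(t_i), df=n - p - 1)           # reference distribution t(n-p-1)
print(h[i], t_i, dffits_i, p_val)
```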

DFBETAS
$$\mathrm{DFBETAS}_{(i)j} = \frac{\hat\beta_j - \hat\beta_{(i)j}}{s_{(i)}\sqrt{(X_0'X_0)^{-1}_{jj}}} = t_i\,\frac{[(X_0'X_0)^{-1}X_0']_{ji}}{\sqrt{(X_0'X_0)^{-1}_{jj}\,(1-h_{ii})}}$$

USEFUL RELATIONSHIPS
$$\hat\beta - \hat\beta_{(i)} = \frac{(X_0'X_0)^{-1}x_i'(y_i - \hat y_i)}{1-h_{ii}}, \qquad \frac{(y_i - \hat y_i)^2}{1-h_{ii}} = (n-p)s^2 - (n-p-1)s_{(i)}^2 \quad (3)$$

STANDARD RECOMMENDATION (Myers (1990)): "As we indicated earlier, these two measures (DFFITS and DFBETAS) are t-like. Surely any analyst who is familiar with the concept of a standard error knows that if DFBETAS$_{(i)j}$ exceeds 2.0 in magnitude, the influence of the data point is unquestioned."

Note: $\hat Y - \hat Y_{(i)}$ and $\hat\beta - \hat\beta_{(i)}$ are OUTLIER measures, while $h_{ii}$ and $1/(1-h_{ii})$ are measures of influence.

CONCLUSION: All Group I diagnostics yield the same p-values: Jensen and Ramirez (1996) #1 and 2, Jensen (1998, 2000) #2, 3 and 3, and LaMotte (1999) #1, 2, 3 and 3.

0.2.2 GROUP II DIAGNOSTICS (heuristics based on $s_I^2/s^2$)

The ratio $s_I^2/s^2$ is the Neyman-Pearson likelihood ratio test statistic.

R-FISHER parallels the R-Student statistic (Jensen (1999, 2001)) (Gold Standard):
$$F_I = \frac{e_I'(I_r - Z(X_0'X_0)^{-1}Z')^{-1}e_I}{r s_I^2} = \frac{((n-p)s^2 - (n-p-r)s_I^2)/r}{s_I^2} = \frac{e_I'(I_r - H_{II})^{-1}e_I}{r s_I^2} \sim F_{r,\,n-p-r} \quad (4)$$
where
$$X_0(X_0'X_0)^{-1}X_0' = H = \begin{bmatrix} H_{00} & H_{0I} \\ H_{I0} & H_{II} \end{bmatrix}.$$

THE MEAN SHIFT OUTLIER MODEL is based on
$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} X \\ Z \end{bmatrix}\beta + \begin{bmatrix} 0 \\ I_r \end{bmatrix}\delta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}.$$
Outliers can be checked jointly with model building. Outliers are tested under the null hypothesis $H_0: \delta = 0$ with
$$F_I = \frac{(\mathrm{RSS}(H_0) - \mathrm{RSS}(H_a))/r}{\mathrm{RSS}(H_a)/(n-p-r)} = \frac{((n-p)s^2 - (n-p-r)s_I^2)/r}{s_I^2} \sim F_{r,\,n-p-r} \quad (5)$$
with $F_{(i)} = t_i^2$ when $r = 1$.
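The equivalence of the two expressions for $F_I$ in (4)-(5) can be checked directly by fitting the mean-shift model with one indicator column per case in $I$. A hedged sketch, with simulated data and helper names that are assumptions rather than anything from the text:

```python
# Mean-shift F test for a subset I of r cases, checked against the s^2 / s_I^2 form.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, r = 31, 3, 2
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)
I = [9, 28]

def rss(A, y):
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ b) ** 2)

# H_a: mean-shift model with one indicator column per proposed outlier
D = np.zeros((n, r))
D[I, np.arange(r)] = 1.0
rss0, rssa = rss(X0, Y), rss(np.column_stack([X0, D]), Y)
F_I = ((rss0 - rssa) / r) / (rssa / (n - p - r))                 # eq. (5)

s2, s2_I = rss0 / (n - p), rssa / (n - p - r)                    # RSS(H_a) equals the deleted-case RSS
F_I_alt = ((n - p) * s2 - (n - p - r) * s2_I) / (r * s2_I)       # eq. (4); identical to F_I
p_val = stats.f.sf(F_I, r, n - p - r)
print(F_I, F_I_alt, p_val)
```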

$\mathrm{OUT}_I$ of Barnett and Lewis (1994) ($r \ge 1$):
$$\mathrm{OUT}_I = 1 - \frac{s_I^2}{s^2} \quad (6)$$
$$-\frac{r}{n-p-r} \le \mathrm{OUT}_I = \frac{r(F_I - 1)}{rF_I + (n-p-r)} \le 1$$

COVRATIO ($r = 1$):
$$\mathrm{CR}_{(i)} = \frac{|s_{(i)}^2(X'X)^{-1}|}{|s^2(X_0'X_0)^{-1}|} = \frac{1}{1-h_{ii}}\left(\frac{s_{(i)}^2}{s^2}\right)^p \quad (7)$$
$$0 < \mathrm{CR}_{(i)} = \frac{1}{1-h_{ii}}\left(\frac{n-p}{t_i^2+(n-p-1)}\right)^p \le \frac{1}{1-h_{ii}}\left(\frac{n-p}{n-p-1}\right)^p \quad (8)$$
(tabulated at leverages $h_{ii} = 0.25, \ldots, 0.95$ in the original).

$\mathrm{AP}_i$ of Andrews and Pregibon (1978) ($r = 1$):
$$\mathrm{AP}_i = 1 - \frac{(n-p-1)\, s_{(i)}^2\, |X'X|}{(n-p)\, s^2\, |X_0'X_0|} \quad (9)$$
$$h_{ii} \le \mathrm{AP}_i = \frac{t_i^2 + (n-p-1)h_{ii}}{t_i^2 + (n-p-1)} \le 1$$
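As a numerical spot-check of the single-case closed forms above (including the determinant form of $\mathrm{AP}_i$ as reconstructed here), the sketch below compares the deletion-based definitions with the $t_i$, $h_{ii}$ expressions. The simulated data and variable names are illustrative assumptions only.

```python
# Spot-check of the r = 1 Group II identities: OUT_i, COVRATIO, AP_i.
import numpy as np

rng = np.random.default_rng(7)
n, k = 31, 3
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)

H = X0 @ np.linalg.solve(X0.T @ X0, X0.T)
h = np.diag(H)
e = Y - H @ Y
s2 = e @ e / (n - p)

i = 9
keep = np.delete(np.arange(n), i)
X = X0[keep]
b_i = np.linalg.solve(X.T @ X, X.T @ Y[keep])
s2_i = np.sum((Y[keep] - X @ b_i) ** 2) / (n - p - 1)
t_i = e[i] / np.sqrt(s2_i * (1 - h[i]))

out_i = 1 - s2_i / s2                                         # eq. (6) with r = 1
cr_i = (np.linalg.det(s2_i * np.linalg.inv(X.T @ X))
        / np.linalg.det(s2 * np.linalg.inv(X0.T @ X0)))       # eq. (7), determinant form
cr_alt = (s2_i / s2) ** p / (1 - h[i])                        # closed form in s_(i)^2 and h_ii
ap_i = 1 - ((n - p - 1) * s2_i * np.linalg.det(X.T @ X)
            / ((n - p) * s2 * np.linalg.det(X0.T @ X0)))      # eq. (9)
ap_alt = (t_i ** 2 + (n - p - 1) * h[i]) / (t_i ** 2 + (n - p - 1))
print(np.isclose(cr_i, cr_alt), np.isclose(ap_i, ap_alt), out_i)
```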

FVARATIO of Belsley, Kuh and Welsch (1980) ($r = 1$):
$$\mathrm{FV}_i = \frac{s_{(i)}^2}{s^2(1-h_{ii})} \quad (10)$$
$$0 < \mathrm{FV}_i = \frac{(n-p)/(t_i^2 + n-p-1)}{1-h_{ii}} \le \frac{(n-p)/(n-p-1)}{1-h_{ii}}$$
(tabulated at leverages $h_{ii} = 0.25, \ldots, 0.95$ in the original).

USEFUL RELATIONSHIPS
$$\hat\beta - \hat\beta_I = (X_0'X_0)^{-1}Z'(I - H_{II})^{-1}(Y_I - \hat Y_I)$$
$$(Y_I - \hat Y_I)'(I - H_{II})^{-1}(Y_I - \hat Y_I) = (n-p)s^2 - (n-p-r)s_I^2$$

CONCLUSION: All Group II statistics yield the same p-values: Jensen (1999, 2001) #4, 6, 7, 9, 10; Jensen and Ramirez (1998) #5; and LaMotte (1999) #7.
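The two subset relationships just listed can be verified numerically. A minimal sketch, assuming simulated data and numpy; here $H_{II}$ and $e_I = Y_I - \hat Y_I$ denote the leverage block and the full-model residuals for the subset $I$, as in the text.

```python
# Numerical check of the subset-deletion relationships for beta_hat - beta_hat_I
# and for e_I' (I - H_II)^{-1} e_I.
import numpy as np

rng = np.random.default_rng(3)
n, k, r = 31, 3, 2
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)
I = [9, 28]
keep = np.setdiff1d(np.arange(n), I)
X, Z = X0[keep], X0[I]

G0 = np.linalg.inv(X0.T @ X0)
beta_hat = G0 @ X0.T @ Y
beta_hat_I = np.linalg.solve(X.T @ X, X.T @ Y[keep])
s2 = np.sum((Y - X0 @ beta_hat) ** 2) / (n - p)
s2_I = np.sum((Y[keep] - X @ beta_hat_I) ** 2) / (n - p - r)

H_II = Z @ G0 @ Z.T                        # leverage block for the subset I
e_I = Y[I] - Z @ beta_hat                  # full-model residuals at the subset rows
M = np.linalg.inv(np.eye(r) - H_II)

lhs1, rhs1 = beta_hat - beta_hat_I, G0 @ Z.T @ M @ e_I
lhs2, rhs2 = e_I @ M @ e_I, (n - p) * s2 - (n - p - r) * s2_I
print(np.allclose(lhs1, rhs1), np.isclose(lhs2, rhs2))
```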

0.2.3 GROUP III DIAGNOSTICS (Distance measures)

COOK'S $D_I$ STATISTICS (1977) take the form
$$D_I(\hat\beta, M, c\hat\sigma^2) = \frac{(\hat\beta_I - \hat\beta)'M(\hat\beta_I - \hat\beta)}{c\hat\sigma^2}$$
where $M$ $(p \times p)$ is non-negative definite, $\hat\sigma^2$ is some estimator of the variance $\sigma^2$, and $c$ is a user-defined constant.

Standard Choices

$C_i$ of Cook (1977): $\hat\sigma^2 = s^2$, $c = p$, and $M = X_0'X_0 = \sigma^2[\mathrm{cov}(\hat\beta)]^{-1}$:
$$C_i = \frac{(\hat\beta_{(i)} - \hat\beta)'X_0'X_0(\hat\beta_{(i)} - \hat\beta)}{p s^2} = \frac{\|\hat Y_{(i)} - \hat Y\|^2}{p s^2} \quad (11)$$
$$C_i = \frac{(n-p)\,h_{ii}\,t_i^2}{p(1-h_{ii})\,(t_i^2 + (n-p-1))}$$
(tabulated at leverages $h_{ii} = 0.25, \ldots, 0.95$ in the original).

$\mathrm{WK}_i$ of Welsch and Kuh (1977): $\hat\sigma^2 = s_{(i)}^2$, $c = p$, and $M = X_0'X_0 = \sigma^2[\mathrm{cov}(\hat\beta)]^{-1}$:
$$\mathrm{WK}_i = \frac{(\hat\beta_{(i)} - \hat\beta)'X_0'X_0(\hat\beta_{(i)} - \hat\beta)}{p s_{(i)}^2} = \frac{\|\hat Y_{(i)} - \hat Y\|^2}{p s_{(i)}^2} = \frac{h_{ii}\,t_i^2}{p(1-h_{ii})} \quad (12)$$
(tabulated at leverages $h_{ii} = 0.25, \ldots, 0.95$ in the original).

Our recommendation:
$$D_I(\hat\beta, X_0'X_0, r s_I^2) = \frac{(\hat\beta_I - \hat\beta)'X_0'X_0(\hat\beta_I - \hat\beta)}{r s_I^2} \quad (13)$$
$$\sim F_r(t; \gamma_1^2, \ldots, \gamma_r^2; n-p-r) \quad (14)$$
where $\{\gamma_1^2 \ge \cdots \ge \gamma_r^2 > 0\}$ are the ordered eigenvalues of $Z(X'X)^{-1}Z'$.

STANDARD RECOMMENDATION: "Cook's D ... is an F-like statistic with degrees of freedom $p$ and $n-p$. The critical value has been suggested by Cook to be $F_{p,n-p}(0.50)$."

$W_i$ of Welsch (1982): $\hat\sigma^2 = s_{(i)}^2$, $c = p$, and $M = X'X = \sigma^2[\mathrm{cov}(\hat\beta_I)]^{-1}$:
$$W_i = \frac{(\hat\beta_{(i)} - \hat\beta)'X'X(\hat\beta_{(i)} - \hat\beta)}{p s_{(i)}^2} = \frac{h_{ii}\,t_i^2}{p} \quad (15)$$
(tabulated at leverages $h_{ii} = 0.25, \ldots, 0.95$ in the original).

Our recommendation: $\hat\sigma^2 = s_I^2$, $c = r$, and $M = X'X = \sigma^2[\mathrm{cov}(\hat\beta_I)]^{-1}$:
$$D_I(\hat\beta, X'X, r s_I^2) = \frac{(\hat\beta_I - \hat\beta)'X'X(\hat\beta_I - \hat\beta)}{r s_I^2} \sim F_r(t; \lambda_1, \ldots, \lambda_r; n-p-r)$$
where $\{\lambda_1 \ge \cdots \ge \lambda_r > 0\}$ are the ordered eigenvalues of $Z(X_0'X_0)^{-1}Z'$, the canonical subset leverages.
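A short sketch of the two recommended subset statistics and of the weights that index their reference distributions, assuming simulated data and numpy; it also confirms numerically that the canonical subset leverages satisfy $\lambda_i = \gamma_i^2/(1+\gamma_i^2)$, a relation recorded in the notation that follows.

```python
# Subset Cook-type statistics (13) and the X'X version, with the eigenvalue weights
# gamma^2 and lambda of their generalized F reference distributions (hedged sketch).
import numpy as np

rng = np.random.default_rng(4)
n, k, r = 31, 3, 2
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)
I = [9, 28]
keep = np.setdiff1d(np.arange(n), I)
X, Z = X0[keep], X0[I]

beta_hat = np.linalg.solve(X0.T @ X0, X0.T @ Y)
beta_hat_I = np.linalg.solve(X.T @ X, X.T @ Y[keep])
s2_I = np.sum((Y[keep] - X @ beta_hat_I) ** 2) / (n - p - r)
d = beta_hat_I - beta_hat

D_full = d @ (X0.T @ X0) @ d / (r * s2_I)       # D_I(beta_hat, X0'X0, r s_I^2), eq. (13)
D_red  = d @ (X.T @ X)  @ d / (r * s2_I)        # D_I(beta_hat, X'X,  r s_I^2)

gamma2 = np.linalg.eigvalsh(Z @ np.linalg.inv(X.T @ X) @ Z.T)    # weights for (14)
lam    = np.linalg.eigvalsh(Z @ np.linalg.inv(X0.T @ X0) @ Z.T)  # canonical subset leverages
print(D_full, D_red)
print(np.allclose(np.sort(lam), np.sort(gamma2 / (1 + gamma2)))) # lambda_i = gamma_i^2/(1+gamma_i^2)
```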

$D_I$ of Jensen and Ramirez (1998) (Gold Standard), the Normalized Cook Statistic: $\hat\sigma^2 = s_I^2$, $c = r$, and $M = ((X'X)^{-1} - (X_0'X_0)^{-1})^{+} = \sigma^2[\mathrm{cov}(\hat\beta_I - \hat\beta)]^{+}$:
$$D_I = \frac{(\hat\beta_I - \hat\beta)'((X'X)^{-1} - (X_0'X_0)^{-1})^{+}(\hat\beta_I - \hat\beta)}{r s_I^2} = F_I \sim F_{r,\,n-p-r},$$
with $D_i = t_i^2$ when $r = 1$.

CONCLUSION: All Group III statistics yield the same p-values when $r = 1$: Jensen (1999, 2001) #11, 12, and 15; Jensen and Ramirez (1996) #11; and LaMotte (1999) #11. For $r > 1$ the p-values are different. Their distribution is now known.

Notation: $\{\gamma_1^2 \ge \cdots \ge \gamma_r^2\}$ are the ordered eigenvalues of $Z(X'X)^{-1}Z'$; $\{\lambda_1 \ge \cdots \ge \lambda_r\}$ are the ordered eigenvalues of $Z(X_0'X_0)^{-1}Z'$; and $0 < \lambda_i = \gamma_i^2/(1+\gamma_i^2) < 1$ for $1 \le i \le r$, with $\lambda_1 = h_{ii}$ when $r = 1$ (the original tabulates $\gamma_i^2$ at leverages $h_{ii} = 0.25, \ldots, 0.95$).
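The equality $D_I = F_I$ for the normalized Cook statistic can be checked with a Moore-Penrose inverse. A hedged sketch with simulated data; the names and the use of numpy/scipy are assumptions.

```python
# Normalized Cook statistic with M = ((X'X)^{-1} - (X0'X0)^{-1})^+ versus F_I.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k, r = 31, 3, 2
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = X0.shape[1]
Y = X0 @ rng.normal(size=p) + rng.normal(size=n)
I = [9, 28]
keep = np.setdiff1d(np.arange(n), I)
X = X0[keep]

beta_hat = np.linalg.solve(X0.T @ X0, X0.T @ Y)
beta_hat_I = np.linalg.solve(X.T @ X, X.T @ Y[keep])
s2 = np.sum((Y - X0 @ beta_hat) ** 2) / (n - p)
s2_I = np.sum((Y[keep] - X @ beta_hat_I) ** 2) / (n - p - r)

M = np.linalg.pinv(np.linalg.inv(X.T @ X) - np.linalg.inv(X0.T @ X0))  # Moore-Penrose inverse
d = beta_hat_I - beta_hat
D_I = d @ M @ d / (r * s2_I)                                   # normalized Cook statistic
F_I = ((n - p) * s2 - (n - p - r) * s2_I) / (r * s2_I)         # R-Fisher statistic (4)
print(np.isclose(D_I, F_I), stats.f.sf(D_I, r, n - p - r))     # equal, referred to F(r, n-p-r)
```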

SINGLE OUTLIER THEOREM (Jensen and Ramirez 1996): When $r = 1$, the following tests are all equivalent.
$$\frac{y_i - \hat y_{(i)}}{s_{(i)}\sqrt{1 + x_i(X'X)^{-1}x_i'}} \sim t(n-p-1)$$
$$\frac{y_i - \hat y_i}{s_{(i)}\sqrt{1 - x_i(X_0'X_0)^{-1}x_i'}} \sim t(n-p-1)$$
$$F_i \sim F(1, n-p-1)$$
$$D_i(\hat\beta, X_0'X_0, s_{(i)}^2) \sim \gamma_1^2\,F(1, n-p-1)$$
$$D_i(\hat\beta, X'X, s_{(i)}^2) \sim \lambda_1\,F(1, n-p-1)$$
$$D_i(\hat\beta, \sigma^2[\mathrm{cov}(\hat\beta_i - \hat\beta)]^{+}, s_{(i)}^2) \sim F(1, n-p-1)$$

GENERALIZED F DISTRIBUTION (Jensen and Ramirez, 1998)

Suppose that the elements of $U = [U_1, \ldots, U_r]'$ are independent $\{N_1(\omega_i, 1);\ 1 \le i \le r\}$ random variables; let $\{\alpha_1, \ldots, \alpha_r\}$ be non-increasing positive weights; and identify $T = \alpha_1 U_1^2 + \cdots + \alpha_r U_r^2$. If $\mathcal{L}(V) = \chi^2(\nu)$ independently of $U$, then the cdf of
$$W = \frac{T/r}{V/\nu}$$
is denoted by $F_r(t; \alpha_1, \ldots, \alpha_r; \omega; \nu)$.

THEOREM (Ramirez and Jensen (1991)). For $\omega = 0$, the pdf of the central generalized F distribution $W = (T/r)/(V/\nu)$ has the representation in terms of central F distributions
$$\sum_{i=0}^{\infty} c_i\,\frac{r}{(r+2i)\,\delta}\,f_F\!\left(\frac{r\,w}{(r+2i)\,\delta};\ r+2i,\ \nu\right) \quad (16)$$
with $f_F$ the pdf of $F(\nu_1, \nu_2)$, and whose coefficients $\{c_i\}$ are defined recursively with $0 < \delta < \alpha_r$.

Ramirez (2000) gave Fortran code for computing p-values for $F_r(t; \alpha_1, \ldots, \alpha_r; \nu)$. Dunkl and Ramirez (2001) have developed a fast algorithm for computing the p-values of $W$, without having to integrate the density function, in terms of Lauricella $F_D^{(r)}$ functions.

STOCHASTIC BOUNDS (Jensen and Ramirez (1991)) for $F_r$:
$$F_r(t; \alpha_1; \nu) \le F_r(t; \alpha_1, \ldots, \alpha_r; \nu) \le F_r(t; \bar\alpha; \nu)$$
with $\bar\alpha$ the geometric mean of $\{\alpha_1, \ldots, \alpha_r\}$.
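The series (16) can be evaluated numerically. The sketch below is not the Fortran implementation of Ramirez (2000); it assembles the mixture using Ruben's standard recursion for the coefficients $\{c_i\}$, which I am supplying as an assumption since the text only notes that the $c_i$ are defined recursively, and it uses scipy's central F cdf. With equal weights the generalized F reduces to an ordinary $F(r, \nu)$, which gives a quick check.

```python
# Hedged sketch: cdf of the central generalized F, F_r(t; a_1,...,a_r; nu), via the
# mixture representation (16) with Ruben-type coefficients (assumed recursion).
import numpy as np
from scipy import stats

def gen_f_cdf(t, alphas, nu, n_terms=200):
    alphas = np.asarray(alphas, dtype=float)
    r = alphas.size
    delta = 0.9 * alphas.min()                 # any 0 < delta < alpha_r works
    # Ruben coefficients: c_0 = prod sqrt(delta/alpha_j),
    # c_i = (1/(2i)) * sum_{k=0}^{i-1} g_{i-k} c_k, with g_m = sum_j (1 - delta/alpha_j)^m
    ratios = 1.0 - delta / alphas
    c = np.zeros(n_terms)
    c[0] = np.prod(np.sqrt(delta / alphas))
    g = np.array([np.sum(ratios ** m) for m in range(1, n_terms + 1)])
    for i in range(1, n_terms):
        c[i] = np.sum(g[i - 1 - np.arange(i)] * c[:i]) / (2 * i)
    # Mixture of central F cdfs, i.e. (16) integrated term by term:
    dfs = r + 2 * np.arange(n_terms)
    return float(np.sum(c * stats.f.cdf(r * t / (dfs * delta), dfs, nu)))

# Equal weights: the generalized F collapses to an ordinary F(2, 20) cdf.
print(gen_f_cdf(2.5, [1.0, 1.0], nu=20), stats.f.cdf(2.5, 2, 20))
```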

THEOREM (Jensen and Ramirez (200x)). For $\omega \ne 0$, the pdf of the non-central generalized F distribution $W = (T/r)/(V/\nu)$ has the representation in terms of central F distributions
$$\sum_{i=0}^{\infty} c_i\,\frac{r}{(r+2i)\,\delta}\,f_F\!\left(\frac{r\,w}{(r+2i)\,\delta};\ r+2i,\ \nu\right)$$
with $f_F$ the pdf of $F(\nu_1, \nu_2)$, and whose coefficients $\{c_i\}$ are defined recursively with $0 < \delta < \alpha_r$, where the $\{c_i\}$ are now functions of $\omega$.

NORMAL THEORY RESULTS (Jensen and Ramirez (1998)): If $\mathcal{L}(Y) = N_n(X_0\beta, \sigma^2 I_n)$, then with $\nu = n-p-r$:
$$D_I(\hat\beta, X_0'X_0, r s_I^2) \sim F_r(t; \gamma_1^2, \ldots, \gamma_r^2; n-p-r),$$
where $\{\gamma_1^2 \ge \cdots \ge \gamma_r^2\}$ are the ordered eigenvalues of $Z(X'X)^{-1}Z'$;
$$D_I(\hat\beta, X'X, r s_I^2) \sim F_r(t; \lambda_1, \ldots, \lambda_r; n-p-r),$$
where $\{\lambda_1 \ge \cdots \ge \lambda_r\}$ are the ordered eigenvalues of $Z(X_0'X_0)^{-1}Z'$ and $\lambda_i = \gamma_i^2/(1+\gamma_i^2)$, $1 \le i \le r$; and
$$D_I(\hat\beta, \sigma^2[\mathrm{cov}(\hat\beta_I - \hat\beta)]^{+}, r s_I^2) \sim F_r(t; 1, \ldots, 1; n-p-r).$$

JOINT OUTLIER RESULTS: When $r > 1$, outliers can be tested with
$$F_I \sim F_r(t; 1, \ldots, 1; n-p-r)$$
$$D_I(\hat\beta, \sigma^2[\mathrm{cov}(\hat\beta_I - \hat\beta)]^{+}, r s_I^2) = D_I \sim F_r(t; 1, \ldots, 1; n-p-r)$$
$$D_I(\hat\beta, X_0'X_0, r s_I^2) \sim F_r(t; \gamma_1^2, \ldots, \gamma_r^2; n-p-r) \quad (17)$$
$$D_I(\hat\beta, X'X, r s_I^2) \sim F_r(t; \lambda_1, \ldots, \lambda_r; n-p-r) \quad (18)$$

The mean shift statistic $F_I$ and the normalized Cook statistic $D_I$ yield the same p-values. For the Cook-like statistics, we prefer (18) over (17) for computational reasons, since the number of terms required in the series expansion for the cdf of $(T/r)/(V/\nu)$ is smaller. This number increases with the condition number of the weights, which satisfy $\gamma_1^2/\gamma_r^2 \ge \lambda_1/\lambda_r$.
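The computational preference for (18) can be illustrated by comparing the spread of the two weight sets on a simulated design; a larger spread forces more terms in the series for the cdf, and $\gamma_1^2/\gamma_r^2$ is never smaller than $\lambda_1/\lambda_r$. The data and names in this small sketch are assumptions.

```python
# Condition numbers of the weights behind (17) and (18) on an illustrative design.
import numpy as np

rng = np.random.default_rng(6)
n, k, r = 31, 3, 2
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
I = [9, 28]
keep = np.setdiff1d(np.arange(n), I)
X, Z = X0[keep], X0[I]

gamma2 = np.linalg.eigvalsh(Z @ np.linalg.inv(X.T @ X) @ Z.T)    # weights in (17)
lam = np.linalg.eigvalsh(Z @ np.linalg.inv(X0.T @ X0) @ Z.T)     # weights in (18)
print(gamma2.max() / gamma2.min(), lam.max() / lam.min())        # the first ratio is never smaller
```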

0.3 EXAMPLE

We use the Drill Data Set from Cook and Weisberg (1982, p. 149) with $n = 31$ and $k = 10$. With $r = 1$, we find the influential rows (with $p < .025$) to be rows 9, 28, and 31. The p-values are (necessarily) the same using $D_i(\hat\beta, X_0'X_0, s_{(i)}^2)$, $D_i = D_i(\hat\beta, X'X, s_{(i)}^2)$, $D_i(\hat\beta, \sigma^2[\mathrm{cov}(\hat\beta_I - \hat\beta)]^{+}, s_{(i)}^2)$, or $t_i$ (R-Student).

[Table: Row, $s_{(i)}^2$, $D_i$, $\lambda_i$, p-value, and $D_i/\lambda_i$ for rows 9, 28, and 31; the numerical entries did not survive transcription.]

With $r = 2$, we find 39 influential pairs (with $p < .01$, using the average of the p-value bounds) using $D_I(\hat\beta, X_0'X_0, 2s_I^2)$, 35 pairs using $D_I(\hat\beta, X'X, 2s_I^2)$, and 16 pairs (directly) using $D_I(\hat\beta, \sigma^2[\mathrm{cov}(\hat\beta_I - \hat\beta)]^{+}, 2s_I^2)$. All contain rows from {9, 28, 31} except for the pair (5, 26). For this pair, using $D_I(\hat\beta, X'X, 2s_I^2)$, we have:

[Table: Rows, $s_I^2$, $D_I$, $\lambda_1$, $\lambda_2$, LB, p, UB for the pair (5, 26); the numerical entries did not survive transcription.]

The p-values for this pair, using $D_I(\hat\beta, X_0'X_0, 2s_I^2)$, $D_I(\hat\beta, X'X, 2s_I^2)$, and $D_I(\hat\beta, \sigma^2[\mathrm{cov}(\hat\beta_I - \hat\beta)]^{+}, 2s_I^2)$:

[Table: Test, p-value, N for the three statistics; the numerical entries did not survive transcription.]

Other Heuristics

We now consider joint outliers for the Intercountry Life-Cycle Savings Data from Belsley et al. (1980, p. 41). We consider only pairs of observations, $I = \{i_1, i_2\}$; the extension to larger subsets is immediate. Belsley et al. (1980) used a number of different empirical procedures to identify 15 pairs of observations that may be influential. These procedures include MDFFIT (MD), COVRATIO (CR), RESRATIO (RR), MEWDFFIT (ME), Wilks' $\Lambda$ statistic (WL), and the Andrews-Pregibon statistic (AP). Table I shows which pairs of observations are of note under each of these criteria. The last column records the p-values from $D_I(\hat\beta, X_0'X_0, 2s_I^2)$. These empirical procedures do not agree with each other, nor do they detect the joint outliers which we are able to determine using $D_I$ as an omnibus test of model fit under the mean shift and variance shift outlier model. Outliers are often influential; however, not all influential subsets are outliers (see Barnett and Lewis (1994, p. 317)).

Table I: Multiple-Row Influence Using Empirical Procedures

[Table I lists the 15 selected pairs of rows, including (34,46), (7,46), and (6,44) and several further pairs involving rows 23, 46, and 49, flags each pair under the criteria MD, CR, RR, ME, WL, and AP, and records the p-value from $D_I(\hat\beta, X_0'X_0, 2s_I^2)$; among the surviving p-values are .081, .329, .438, and .699. Most individual entries did not survive transcription.]

MDFFIT (MD) (the numerator of $D_I(\hat\beta, X'X, 2s_I^2)$):
$$(\hat\beta_I - \hat\beta)'X'X(\hat\beta_I - \hat\beta)$$

COVRATIO (CR):
$$\mathrm{CR}_{(i)} = \frac{|s_{(i)}^2(X'X)^{-1}|}{|s^2(X_0'X_0)^{-1}|}$$

RESRATIO (RR) (the same as the mean-shift statistic):
$$F_I = \frac{((n-p)s^2 - (n-p-r)s_I^2)/r}{s_I^2}$$

MEWDFFIT (ME):
$$\mathrm{ME} = \sum_{i,j \in I} \frac{h_{ij}\,e_i e_j}{(1-h_i)(1-h_j)}$$

Wilks' $\Lambda$ statistic (WL):
$$\mathrm{WL} = 1 - \frac{n}{r(n-r)}\,\mathbf{1}'Z(Z'Z)^{-1}Z'\mathbf{1}$$

Andrews and Pregibon (AP):
$$\mathrm{AP}_I = 1 - \frac{(n-p-r)\,s_I^2\,|X'X|}{(n-p)\,s^2\,|X_0'X_0|}$$

0.4 R-Fisher distribution

Assumptions A:
A1. $E(\varepsilon) = 0 \in \mathbb{R}^{n-r}$ and $E(\varepsilon_I) = \delta \in \mathbb{R}^{r}$;
A2. $V(\varepsilon_0) = \mathrm{Diag}(\sigma^2 I_{n-r}, \sigma_1^2 I_r)$; and
A3. $\mathcal{L}(\varepsilon, \varepsilon_I - \delta) = N_n(0, \mathrm{Diag}(\sigma^2 I_{n-r}, \sigma_1^2 I_r))$.

This model allows for shifts in both location and scale at the design points in $Z$:
$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} X \\ Z \end{bmatrix}\beta + \begin{bmatrix} 0 \\ I_r \end{bmatrix}\delta + \begin{bmatrix} \varepsilon \\ \varepsilon_I - \delta \end{bmatrix}.$$

With $\delta = 0$ and $\kappa = \sigma_1^2/\sigma^2 = 1$ (Jensen (1999, 2001)):
$$F_I = \frac{e_I'(I_r - Z(X_0'X_0)^{-1}Z')^{-1}e_I}{r s_I^2} = \frac{((n-p)s^2 - (n-p-r)s_I^2)/r}{s_I^2} = \frac{e_I'(I_r - H_{II})^{-1}e_I}{r s_I^2} \sim F_{r,\,n-p-r}.$$

For the general model with $\delta \ne 0$ and $\kappa = \sigma_1^2/\sigma^2 > 1$, we have:

THEOREM (Jensen (1999) for $\kappa = 1$ and Jensen and Ramirez (200x) for $\kappa \ne 1$). Consider the R-Fisher diagnostic $F_I$ under Assumptions A; let $\{\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0\}$ comprise the canonical subset leverages (the eigenvalues of $H_{II}$); set $\kappa = \sigma_1^2/\sigma^2 \ge 1$; let $\theta = Q'\delta$, where $Q$ is such that $Q'H_{II}Q = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_r)$; and identify $\nu = n-p-r$.

i) The cdf of $F_I$ is given by $F_r(w; \alpha_1, \ldots, \alpha_r; \omega_1, \ldots, \omega_r; \nu)$, with weights $\{\alpha_i = \kappa - (\kappa-1)\lambda_i;\ 1 \le i \le r\}$ satisfying $\{\alpha_r \ge \cdots \ge \alpha_1 \ge 1\}$, and with location parameters $\{\omega_i = \theta_i/(\sigma[\kappa + \lambda_i/(1-\lambda_i)]^{1/2});\ 1 \le i \le r\}$.

ii) Bounds for the cdf in terms of Fisher's noncentral distribution are given by
$$F(w/\alpha_r;\ r, \nu, \lambda) \le F_r(w; \alpha_1, \ldots, \alpha_r; \omega_1, \ldots, \omega_r; \nu) \le F(w/\bar\alpha;\ r, \nu, \lambda) \quad (19)$$
where $\lambda = \sum_{i=1}^{r} \theta_i^2/(\sigma^2[\kappa + \lambda_i/(1-\lambda_i)])$ and $\bar\alpha = (\alpha_1 \cdots \alpha_r)^{1/r}$ is the geometric mean.
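The bounds (19) translate directly into bounds on the power of the 5% test based on $F_I$. The sketch below is illustrative only: the canonical leverages, shifts, and $\kappa$ are assumed values rather than the BOQ quantities, and scipy's noncentral F is used for the Fisher cdf.

```python
# Power bounds for F_I implied by (19) under a joint shift in location (theta) and scale (kappa).
import numpy as np
from scipy import stats

r, nu = 2, 16                      # subset size and nu = n - p - r (assumed)
lam_lev = np.array([0.55, 0.35])   # canonical subset leverages lambda_1 >= lambda_2 (assumed)
kappa = 2.0                        # scale shift sigma_1^2 / sigma^2 (assumed)
theta_over_sigma = np.array([2.0, 1.5])   # theta_i / sigma (assumed location shifts)

alpha = kappa - (kappa - 1.0) * lam_lev                     # weights, 1 <= alpha_1 <= ... <= alpha_r
omega = theta_over_sigma / np.sqrt(kappa + lam_lev / (1 - lam_lev))
nc = np.sum(omega ** 2)                                     # noncentrality lambda in (19)
alpha_bar = np.prod(alpha) ** (1.0 / r)                     # geometric mean of the weights
alpha_max = alpha.max()                                     # alpha_r, the largest weight

crit = stats.f.ppf(0.95, r, nu)                             # 5% critical value under the null
power_lo = stats.ncf.sf(crit / alpha_bar, r, nu, nc)        # from the upper cdf bound in (19)
power_hi = stats.ncf.sf(crit / alpha_max, r, nu, nc)        # from the lower cdf bound in (19)
print(power_lo, power_hi)
```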

0.4.1 Case Study

Data (Myers (1990)) regarding the administration of Bachelor Officers Quarters (BOQ) were reported for sites at $n = 25$ naval installations. Monthly man-hours ($Y$) were related linearly to average daily occupancy ($X_1$), monthly number of check-ins ($X_2$), weekly hours of service desk operation ($X_3$), size of common use area ($X_4$), number of building wings ($X_5$), operational berthing capacity ($X_6$), and number of rooms ($X_7$). The data are reported in Myers (1990, p. 218 ff.), together with detailed analyses using single-case deletion diagnostics. Subset deletion diagnostics are not reported there. We focus on the sites {15, 20, 21, 23, 24}, having individual leverages {0.5576, ...} (the remaining values did not survive transcription) and individual R-Student values exceeding the widely used ±2 rule. The pairs of sites selected are $S_1 = \{20, 21\}$, $S_2 = \{15, 20\}$, and $S_3 = \{23, 24\}$, reflecting smaller, intermediate, and larger individual leverages.

[Table II (low, medium, and high leverage pairs): power of the diagnostic $F_I$ for $S_1$, $S_2$, and $S_3$ at selected shifts in location (in units of $\sigma$) and in scale $\kappa$. The row labels and several entries did not survive transcription; the surviving powers increase from .3606 to .9426 for $S_1$ and from .2779 to .8826 for $S_2$, but only from .0734 to .1613 for the high-leverage pair $S_3$ (parenthesized integers as in the original).]

0.4.2 Conclusions

The single-case deletion diagnostics of Groups I, II, and III are all equivalent. Moreover, the Group II subset deletion diagnostics and the Group III diagnostic $D_I$ are all equivalent to the R-Fisher diagnostic $F_I$. These subset deletion diagnostics have been studied under outliers arising from shifts in both location and scale. In these circumstances the distribution of $F_I$ has been shown to be a noncentral generalized F distribution. We have given series expansions for these distributions, as well as global error bounds for their partial sums. Bounds for the cdfs of noncentral generalized F distributions have also been given in terms of noncentral Fisher distributions.

The noncentral generalized F distributions have been used to compute the power of the diagnostic $F_I$, and thus of all equivalent diagnostics, under shifts in location and scale at selected subsets of the BOQ data. These case studies demonstrate that subsets with large canonical leverages tend to be associated with tests having low power, so that shifts in both location and scale may be masked at points of high leverage.

References

[1] Andrews, D. F. and Pregibon, D. (1978). Finding outliers that matter. J. Royal Statist. Soc. B 40.
[2] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. John Wiley and Sons, New York.
[3] Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley and Sons, New York.
[4] Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
[5] Dunkl, C. F. and Ramirez, D. E. (2001). Computation of the generalized F distribution. Australian and New Zealand Journal of Statistics 43.
[6] Jensen, D. R. and Ramirez, D. E. (1996). Computing the cdf of Cook's D_I statistic. In Proceedings of the 12th Symposium in Computational Statistics, Barcelona, Spain, 1996, Prat, A. and Ripoll, eds.
[7] LaMotte, L. R. (1999). Collapsibility hypotheses and diagnostic bounds in regression analysis. Metrika 50.
[8] Jensen, D. R. (1998). The use of standardized diagnostics in regression. Research Report R079, Department of Statistics, Virginia Polytechnic Institute and State University, October 1998.
[9] Jensen, D. R. (1999). Properties of selected subset diagnostics in regression. Research Report R080, Department of Statistics, Virginia Polytechnic Institute and State University, March 1999.
[10] Jensen, D. R. (2000). The use of standardized diagnostics in regression. Metrika 52.
[11] Jensen, D. R. (2001). Properties of selected subset diagnostics in regression. Statist. Probab. Letters 51.
[12] Jensen, D. R. and Ramirez, D. E. (1991). Misspecified T² tests, I. Location and scale. Comm. Stat. Theory Meth. A 20, No. 1.
[13] Jensen, D. R. and Ramirez, D. E. (1998). Some exact properties of Cook's D_I. In Handbook of Statistics, Vol. 16, Balakrishnan, N. and Rao, C. R., eds. Elsevier Science Publishers B. V.
[14] Jensen, D. R. and Ramirez, D. E. (200x). Detecting mean-shift outliers via distances. Journal of Statistical Computation and Simulation.
[15] Jensen, D. R. and Ramirez, D. E. (200x). Detecting shifts in location and scale in regression. South African Statistical Journal.
[16] Myers, R. H. (1990). Classical and Modern Regression with Applications, Second ed. PWS-Kent Publishing Company, Boston, MA.
[17] Ramirez, D. E. (2000). The generalized F distribution. Journal of Statistical Software 5(1).
[18] Ramirez, D. E. and Jensen, D. R. (1991). Misspecified T² tests, II. Series expansions. Commun. Statist. Simula. 20.
