Bayesian Outlier Detection in Regression Model

Younshik Chung and Hyungsoon Kim

Abstract

The problem of 'outliers', observations which look suspicious in some way, has long been one of the main concerns of experimenters and data analysts. We propose a model for the outlier problem and analyze it in the linear regression model using a Bayesian approach, combining the mean-shift model with the SSVS idea of George and McCulloch (1993), which is based on the data augmentation method. The advantage of the proposed method is that it finds, through posterior probabilities, the subset of the data which is most suspicious under the given model. The MCMC method (Gibbs sampler) is used to overcome the complicated Bayesian computation. Finally, the proposed method is applied to a simulated data set and a real data set.

KEYWORDS: Gibbs sampler; Latent variable; Linear mixed normal model; Linear regression model; Mean-shift model; Outlier; Variance-inflation model.

1. INTRODUCTION

The problem of 'outliers', observations which look suspicious in some way, has long been one of the main concerns of experimenters and data analysts. In this paper, we propose a model for the outlier problem and analyze it using a Bayesian approach. Bayesian approaches to outlier detection can be classified into two kinds of procedures, according to whether or not an alternative model for outliers is used. Among methods without an alternative model, the predictive distribution is used in Geisser (1985) and Pettit and Smith (1985), and the posterior distribution in Johnson and Geisser (1983), Chaloner and Brant (1988) and Guttman and Pena (1993). Among methods with an alternative model, the mean-shift model or the variance-inflation model is used in Guttman (1973) and Sharples (1990).

Let Y be an observation vector from N(μ, σ²). The mean-shift model assumes that a suspicious observation is distributed as N(μ + m, σ²).
If m is not zero, the corresponding observation is declared an outlier; Guttman (1973) applied the mean-shift model to a linear model. The variance-inflation model assumes that an observation y_i comes from N(μ, b_i σ²), where an observation y_i with b_i >> 1 is treated as an outlier (Box and Tiao,

Department of Statistics, Pusan National University, Pusan, 609-735 Korea
1968). Sharples (1990) showed how variance inflation can be incorporated easily into general hierarchical models, retaining tractability of analysis.

In this paper, we use the mean-shift model in the linear regression model. Specifically, we assume that for some particular (n × p) matrix X of constants (or design matrix), it is intended to generate data Y = (Y_1, …, Y_n)^t such that

Y = Xβ + ε    (1.1)

where β = (β_1, …, β_p)^t is a set of p unknown regression parameters, and where the (n × 1) error vector ε is normally distributed with mean 0 and variance-covariance matrix σ² I_n, with σ² unknown. Despite the fact that (1.1) is the intended generation scheme, it is feared that departures from this model will occur; indeed, the observations y may be generated as follows: for i = 1, …, n,

y_i = Σ_{j=1}^{p} x_ij β_j + m_i + ε_i.    (1.2)

Therefore, if m_i = 0, then the ith observation is not an outlier; otherwise, it is considered an outlier. For detecting outliers, we use the SSVS (stochastic search variable selection) method of George and McCulloch (1993). SSVS was introduced as a method for selecting the best predictors in the multiple regression model using the Gibbs sampler. In the usual likelihood approaches, one finds the single most suspicious outlier, then the most suspicious pair, then the most suspicious triple, and so on. One then cannot compare the single most suspicious outlier with the most suspicious pair, and masking sometimes occurs. Our Bayesian method overcomes such problems; its greatest advantage is that it finds the subset of outliers which is most suspicious among the data.

The plan of this article is as follows. In Section 2, in order to find the outliers, we introduce and motivate the hierarchical framework that depends on the SSVS of George and McCulloch (1993). In Section 3, we illustrate the proposed method on a simulated example and a real data set (Darwin's data: Box and Tiao, 1973). Finally, in Section 4, we extend the proposed method to the normal linear mixed model.

2. SSVS For Outliers
2.1. Hierarchical Bayesian Formulation

For detecting outliers in the linear regression model, we apply the mean-shift model to the linear regression model. Despite the fact that (1.1) is the intended generation scheme, it is feared that departures from this model will occur; indeed, the
observations y may be generated as follows: for i = 1, …, n,

y_i = Σ_{j=1}^{p} x_ij β_j + m_i + ε_i.    (2.1)

Therefore our model is re-expressed as

Y = X̃θ̃ + ε    (2.2)

where X̃ = [X, I_n], θ̃ = (β_1, …, β_p, m_1, …, m_n)^t, and I_n denotes the (n × n) identity matrix.

By introducing the latent variable γ_j = 0 or 1, we represent our normal mixture as

m_j | γ_j ~ (1 − γ_j) N(0, τ_j²) + γ_j N(0, c_j² τ_j²)    (2.3)

and

Pr(γ_j = 1) = 1 − Pr(γ_j = 0) = p_j.    (2.4)

This is based on the data augmentation idea of Tanner and Wong (1987). George and McCulloch (1993) use the same structure (2.3) and (2.4) for selecting variables in the linear regression model. Diebolt and Robert (1994) have also successfully used this approach for selecting the number of components in a mixture distribution. When γ_i = 0, m_i ~ N(0, τ_i²), and when γ_i = 1, m_i ~ N(0, c_i² τ_i²). Following George and McCulloch (1993), first, we set τ_i (> 0) small, so that if γ_i = 0, then m_i would probably be so small that it could be "safely" estimated by 0. Second, we set c_i large (c_i > 1 always), so that if γ_i = 1, then a non-zero estimate of m_i means that the corresponding datum y_i should probably be an outlier in the given model. To obtain (2.3) as the prior for m_i | γ_i, we use a multivariate normal prior

m | γ ~ N_n(0, D_γ R D_γ)    (2.5)

where m = (m_1, …, m_n), γ = (γ_1, …, γ_n), R is the prior correlation matrix, and

D_γ = diag[a_1 τ_1, …, a_n τ_n]    (2.6)

with a_i = 1 if γ_i = 0 and a_i = c_i if γ_i = 1. For selecting the values of the tuning factors c_i and τ_i, see George and McCulloch (1993) and Section 2.2. Also see George and McCulloch (1993) for the selection of R; in particular, the identity matrix can be used for R.

The choice of f(γ) should incorporate any available prior information about which subsets of y_1, …, y_n should be outliers in the given model. Although this may seem difficult with 2^n possible choices, especially with large n, symmetry
considerations may simplify this work. For example, a reasonable choice might have the γ_i's independent with marginal distributions (2.4), so that

f(γ | p_1, …, p_n) = Π_{i=1}^{n} p_i^{γ_i} (1 − p_i)^{(1 − γ_i)}.    (2.7)

Although (2.7) implies that the outlier status of y_i is independent of that of y_j for all i ≠ j, we have found it to work well in various situations. The uniform or indifference prior f(γ) = 2^{−n} is the special case of (2.7) where each y_i has an equal chance of being an outlier.

Also, it is assumed that the prior of β = (β_1, …, β_p) is a normal distribution with mean vector μ = (μ_1, …, μ_p) and variance-covariance matrix Σ. That is,

β ~ N(μ, Σ).    (2.8)

Finally, we use the inverse gamma conjugate prior

σ² | γ ~ IG(ν/2, νλ/2).    (2.9)

In this procedure, our main concern in embedding the normal linear model (2.2) into the hierarchical mixture model is to obtain the marginal posterior distribution f(γ | Y) ∝ f(Y | γ) f(γ), which contains the information relevant to outliers. As mentioned before, f(γ) may be interpreted as the statistician's prior probability that the y_i's corresponding to the non-zero components of γ should be outliers in the given model. The posterior density f(γ | Y) updates the prior probabilities on each of the 2^n possible values of γ. Identifying each γ with a subset of the data, via the correspondence that γ_i = 1 is equivalent to y_i being an outlier, those γ with higher posterior probability f(γ | Y) identify the subsets of data which are most suspicious in light of the data and the statistician's prior information. Therefore, f(γ | Y) provides a ranking that can be used to select the subset of the most suspicious data.

2.2. Choice of c_j and τ_j

As discussed in George and McCulloch (1993), the choice of c_j and τ_j should be based on two considerations. First, the choice determines a neighborhood of 0 where SSVS treats m_j as equivalent to zero. This neighborhood is obtained as (−ξ_j, ξ_j), where −ξ_j and ξ_j are the intersection points of the densities N(0, τ_j²) and N(0, c_j² τ_j²). Because (−ξ_j, ξ_j) is the interval in which N(0, τ_j²) dominates N(0, c_j² τ_j²), the posterior probability of γ_j = 0 is more likely to be large when m_j is in (−ξ_j, ξ_j). Thus SSVS entails estimating m_j by 0 according to whether m_j is in (−ξ_j, ξ_j). Second, the choice determines how different the two densities are. If N(0, τ_j²) is too peaked or N(0, c_j² τ_j²) is too spread out, the Gibbs sampler may converge too slowly when the data are not very informative.
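Equating the two normal densities and solving for m gives the intersection point in closed form; a minimal numerical check (the values τ_j = 1 and c_j = 10 are merely illustrative):

```python
import math

def intersection_point(tau, c):
    """Point xi > 0 where the densities N(0, tau^2) and N(0, c^2 tau^2) cross.

    Setting the two normal densities equal and solving for m gives
    xi = tau * sqrt(2 * log(c) * c**2 / (c**2 - 1)).
    """
    return tau * math.sqrt(2.0 * math.log(c) * c ** 2 / (c ** 2 - 1.0))

def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

tau, c = 1.0, 10.0
xi = intersection_point(tau, c)
# The two densities agree at xi ...
assert abs(normal_pdf(xi, tau) - normal_pdf(xi, c * tau)) < 1e-12
# ... and the narrow density dominates strictly inside (-xi, xi).
assert normal_pdf(0.5 * xi, tau) > normal_pdf(0.5 * xi, c * tau)
print(xi / tau)
```

For c_j = 10 the ratio ξ_j/τ_j is about 2.15, and for c_j = 100 about 3.04, matching the values quoted in the next paragraph.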
Our recommendation is to first choose ξ_j. This may be obtained from practical considerations, such as setting ξ_j equal to the largest value of |m_j| for which the corresponding change in y_j would make no practical difference. Once ξ_j has been chosen, we recommend choosing c_j between 10 and 100. This range for c_j seems to provide separation between N(0, τ_j²) and N(0, c_j² τ_j²) which is large enough to yield a useful posterior and small enough to avoid Gibbs sampling convergence problems. When such prior information is available, we also recommend choosing c_j so that N(0, c_j² τ_j²) gives reasonable probability to all plausible values of m_j. Finally, τ_j is obtained from ξ_j and c_j by τ_j = [2 log(c_j) c_j²/(c_j² − 1)]^{−1/2} ξ_j. Note that for c_j = 10, ξ_j = 2.15 τ_j, and for c_j = 100, ξ_j = 3.04 τ_j. For selecting the regression parameters in the linear model, a similar semiautomatic strategy for selecting τ_j and c_j based on statistical significance is described in George and McCulloch (1993).

2.3. Full Conditional Distributions

Densities are denoted generically by brackets, so joint, conditional, and marginal forms, for example, appear as [X, Y], [X | Y] and [X], respectively. The joint posterior density of β, m, σ², γ given y_1, …, y_n is

[β, m, σ², γ | y_1, …, y_n] ∝ (σ²)^{−n/2} exp{−(1/(2σ²))(Y − X̃θ̃)^t (Y − X̃θ̃)}
  × |Σ|^{−1/2} exp{−(1/2)(β − μ)^t Σ^{−1} (β − μ)}
  × |D_γ R D_γ|^{−1/2} exp{−(1/2) m^t (D_γ R D_γ)^{−1} m}
  × (σ²)^{−(ν/2+1)} exp{−νλ/(2σ²)}
  × Π_{i=1}^{n} p_i^{γ_i} (1 − p_i)^{1−γ_i}.    (2.10)

For the Gibbs sampler, the full conditional distributions are needed, as follows:

[β | m, σ², γ, y_1, …, y_n] ∝ exp{−(1/(2σ²))(θ̃^t X̃^t X̃θ̃ − 2θ̃^t X̃^t Y) − (1/2)(β − μ)^t Σ^{−1}(β − μ)},    (2.11)

[m | β, σ², γ, y_1, …, y_n] ∝ exp{−(1/(2σ²))(θ̃^t X̃^t X̃θ̃ − 2θ̃^t X̃^t Y) − (1/2) m^t (D_γ R D_γ)^{−1} m},    (2.12)

[σ² | β, m, γ, y_1, …, y_n] = IG((n + ν)/2, (|Y − X̃θ̃|² + νλ)/2).    (2.13)
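The inverse gamma draw for σ² above can be carried out with a gamma variate and an inversion, since 1/V ~ IG(a, b) when V ~ Gamma(shape a, rate b); a minimal sketch using only the standard library, with an illustrative residual sum of squares:

```python
import random

def draw_sigma2(resid_ss, n, nu=0.0, lam=0.0):
    """One draw of sigma^2 from IG((n + nu)/2, (resid_ss + nu*lam)/2),
    where resid_ss plays the role of |Y - X~ theta~|^2."""
    shape = 0.5 * (n + nu)
    rate = 0.5 * (resid_ss + nu * lam)
    # random.gammavariate takes (alpha, beta) with beta a SCALE parameter,
    # so a rate of `rate` corresponds to beta = 1/rate.
    return 1.0 / random.gammavariate(shape, 1.0 / rate)

random.seed(1)
draws = [draw_sigma2(resid_ss=30.0, n=20) for _ in range(20000)]
# For IG(a, b) with a = n/2 = 10 and b = 15, the mean is b/(a - 1) = 15/9.
print(sum(draws) / len(draws))
```

The Monte Carlo average of the draws should be close to the theoretical mean 15/9 ≈ 1.67.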
Finally, the vector γ is obtained componentwise by sampling consecutively from the conditional distribution of γ_i,

[γ_i | y_1, …, y_n, β, m, σ², γ_(−i)] = [γ_i | m, σ², γ_(−i)]    (2.14)

where γ_(−i) = (γ_1, …, γ_{i−1}, γ_{i+1}, …, γ_n). The last equality in (2.14) holds because the independence of (2.14) from Y results from the hierarchical structure, in which γ affects Y only through m, and γ is independent of β by the assumptions above. Since [m, γ_i, γ_(−i), σ²] = [m | γ_i, γ_(−i), σ²][σ² | γ_i, γ_(−i)][γ_i, γ_(−i)],

[γ_i = 1 | y_1, …, y_n, β, m, σ², γ_(−i)] = [m, γ_i = 1, γ_(−i), σ²] / Σ_{k=0}^{1} [m, γ_i = k, γ_(−i), σ²] = a/(a + b)    (2.15)

where a = [m | γ_i = 1, γ_(−i), σ²][σ² | γ_i = 1, γ_(−i)][γ_i = 1, γ_(−i)] and b = [m | γ_i = 0, γ_(−i), σ²][σ² | γ_i = 0, γ_(−i)][γ_i = 0, γ_(−i)]. Note that under the prior (2.7) on γ, and when the prior parameters for σ² in (2.9) are constant, (2.15) can be obtained more simply from

a = [m | γ_i = 1, γ_(−i)] p_i,  b = [m | γ_i = 0, γ_(−i)](1 − p_i).    (2.16)

3. Illustrative Examples

3.1. Simulated data

In this section we illustrate the performance of SSVS on a simulated example. We consider the simple linear regression model

y_i = β_0 + β_1 x_i + ε_i,  i = 1, …, n    (3.1)

with n = 20, and let m_i = 0 for i ≠ 7 and m_7 = 10, with x_i ~ N(0, 1), ε_i ~ N(0, 1) for i = 1, …, 20 and (β_0, β_1) = (0.5, 1). Therefore, since m_7 = 10, observation 7 is an outlier in this data set. For notational convenience, let

β̂_0 = (Σ y_i + μ_0 − β_1 Σ x_i − Σ m_i)/(n + 1)

and

β̂_1 = (Σ x_i y_i + μ_1 − β_0 Σ x_i − Σ m_i x_i)/(Σ x_i² + 1).

Then

[β_0 | β_1, m, σ², γ, y_1, …, y_n] = N(β̂_0, σ²/(n + 1)),    (3.2)
[β_1 | β_0, m, σ², γ, y_1, …, y_n] = N(β̂_1, σ²/(Σ x_i² + 1))    (3.3)

and, for i = 1, …, n,

[m_i | β_0, β_1, m_(−i), σ², γ, y_1, …, y_n] = N(m̂_i, (σ^{−2} + (a_i τ_i)^{−2})^{−1})    (3.4)

where m̂_i = (y_i − β_0 − β_1 x_i)/(1 + σ²(a_i τ_i)^{−2}). Also,

[σ² | β, m, γ, y_1, …, y_n] = IG(n/2, |Y − X̃θ̃|²/2)    (3.5)

and

[γ_i = 1 | y_1, …, y_n, β, m, σ², γ_(−i)] = a/(a + b)    (3.6)

where a = [m | γ_i = 1, γ_(−i)] p_i and b = [m | γ_i = 0, γ_(−i)](1 − p_i).

Table 3.1.
number of outliers  observations    proportion
0                   none            0.16
1                   1               0.01
1                   3               0.01
1                   4               0.01
1                   6               0.01
1                   7               0.36
1                   10              0.05
2                   1, 7            0.01
2                   2, 7            0.01
2                   4, 7            0.01
2                   5, 7            0.01
2                   7, 9            0.01
2                   7, 10           0.03
2                   9, 10           0.01
3                   1, 2, 5         0.01
3                   1, 6, 7         0.01
4                   1, 4, 6, 10     0.01
4                   1, 7, 8, 9      0.01

We applied SSVS to our model with the indifference prior f(γ) = (1/2)^20: p_1 = ⋯ = p_20 = 0.5, c_1 = ⋯ = c_20 = 10, R = I, and ν = 0. In Table 3.1, observation 7
is considered an outlier, since its corresponding posterior probability, 0.36, is the highest among all the data. Also, we may consider that there are no outliers in the data set, since that posterior probability is the second highest.

3.2. Darwin's Data

Consider the analysis of Darwin's data on the differences in heights of self- and cross-fertilized plants, quoted by Box and Tiao (1973, p. 153). The data consist of measurements on 15 pairs of plants. Each pair contained a self-fertilized and a cross-fertilized plant grown in the same pot and from the same seed. Arranged for convenience in order of magnitude, the n = 15 observations (differences in heights, in eighths of an inch, of self-fertilized and cross-fertilized plants) are: -67, -48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75.

Guttman, Dutter and Freeman (1978) re-examined Darwin's data to detect outlier(s) using a Bayesian approach with the model

Y = θ + ε,  ε ~ N(0, σ² I_n)    (3.7)

where θ = (θ, …, θ)^t. Guttman et al. (1978) mentioned that observations 1 and 2, having values -67 and -48, are identified as spurious observations since they have the highest posterior probability. Table 3.2 shows that observations 1 and 2, having values -67 and -48 respectively, are considered outliers, since they have the highest posterior probability, 0.57. Also, observations 1, 2 and 7 may be considered as outliers.

Table 3.2.
number of outliers  observations    proportion
0                   none            0.00
1                   1               0.00
2                   1, 2            0.57
3                   1, 2, 7         0.38
3                   1, 2, 8         0.01
4                   1, 2, 7, 8      0.04

4. Extension to Linear Mixed Normal Model

In this section, we use the mean-shift model in the linear mixed model. Specifically, we assume that for some particular (n × p) matrix X and (n × l) matrix Z of constants (or design matrices), it is intended to generate data Y = (Y_1, …, Y_n)^t,
such that

Y = Xβ + Zb + ε    (4.1)

where β = (β_1, …, β_p)^t and b = (b_1, …, b_l)^t. We assume that the (n × 1) error vector ε is normally distributed with mean 0 and variance-covariance matrix σ² I_n, where σ² is unknown, and that the (l × 1) vector b is normal with zero mean vector and variance-covariance matrix σ_b² I_l. Also, b and ε are assumed independent. For detecting the outliers, we consider that the observations y may be generated as follows: for i = 1, …, n,

y_i = Σ_{j=1}^{p} x_ij β_j + Σ_{j=1}^{l} z_ij b_j + m_i + ε_i.    (4.2)

Therefore, if m_i = 0, then the ith observation is not an outlier; otherwise, it is considered an outlier. For detecting outliers, we use the SSVS (stochastic search variable selection) method of Section 2. Our model is re-expressed as

Y = X̃θ̃ + Zb + ε    (4.3)

where X̃ = [X, I_n], θ̃ = (β_1, …, β_p, m_1, …, m_n)^t, and I_n denotes the (n × n) identity matrix.

As before, we use all the forms in (2.3)-(2.7) in our model (4.3). Also, it is assumed that the prior of β = (β_1, …, β_p) is a normal distribution with mean vector μ = (μ_1, …, μ_p) and variance-covariance matrix Σ. That is,

β ~ N(μ, Σ).    (4.4)

Finally, we use the inverse gamma conjugate priors

σ² | γ ~ IG(ν/2, νλ/2),  σ_b² ~ IG(ν_b/2, ν_b λ_b/2).    (4.5)

4.1. Full Conditional Densities

The joint posterior density of β, b, m, σ², σ_b², γ given y_1, …, y_n is

[β, b, m, σ², σ_b², γ | y_1, …, y_n] ∝ (σ²)^{−n/2} exp{−(1/(2σ²))(Y − X̃θ̃ − Zb)^t (Y − X̃θ̃ − Zb)}
  × |Σ|^{−1/2} exp{−(1/2)(β − μ)^t Σ^{−1}(β − μ)}
  × |D_γ R D_γ|^{−1/2} exp{−(1/2) m^t (D_γ R D_γ)^{−1} m}
  × (σ²)^{−(ν/2+1)} exp{−νλ/(2σ²)} × (σ_b²)^{−(ν_b/2+1)} exp{−ν_b λ_b/(2σ_b²)}
  × Π_{i=1}^{n} p_i^{γ_i} (1 − p_i)^{1−γ_i}
  × (σ_b²)^{−l/2} exp{−b^t b/(2σ_b²)}.    (4.6)

For the Gibbs sampler, the full conditional distributions are needed, as follows:

[β | b, m, σ², γ, y_1, …, y_n] ∝ exp{−(1/(2σ²))(θ̃^t X̃^t X̃θ̃ − 2θ̃^t X̃^t (Y − Zb)) − (1/2)(β − μ)^t Σ^{−1}(β − μ)},    (4.7)

[b | β, m, σ², γ, σ_b², y_1, …, y_n] ∝ exp{−(1/(2σ²σ_b²))[σ_b²(−2Y^t Zb + 2θ̃^t X̃^t Zb + b^t Z^t Zb) + σ² b^t b]},    (4.8)

[m | β, b, σ², γ, σ_b², y_1, …, y_n] ∝ exp{−(1/(2σ²))(Y − X̃θ̃ − Zb)^t (Y − X̃θ̃ − Zb) − (1/2) m^t (D_γ R D_γ)^{−1} m},    (4.9)

[σ² | β, b, m, σ_b², γ, y_1, …, y_n] = IG((n + ν)/2, (|Y − X̃θ̃ − Zb|² + νλ)/2),    (4.10)

[σ_b² | β, b, m, σ², γ, y_1, …, y_n] = IG((l + ν_b)/2, (b^t b + ν_b λ_b)/2).    (4.11)

Finally, the vector γ is sampled as in (2.15) and (2.16).

4.2. Generated data

In this section we illustrate the performance of SSVS on a simulated example. We consider the simple linear mixed normal model

y_i = β_0 + β_1 x_i + b_1 z_1i + b_2 z_2i + ε_i,  i = 1, …, n    (4.12)

with n = 20, x_i, z_1i, z_2i ~ N(0, 1), ε_i ~ N(0, 1) for i = 1, …, 20, (β_0, β_1) = (0.5, 1), and b_i ~ N(0, 1) for i = 1, 2. We let m_7 = 10 and m_i = 0 for i ≠ 7, so observation 7 is an outlier in this data set. For notational convenience, let

β̂_0 = (Σ y_i + μ_0 − β_1 Σ x_i − b_1 Σ z_1i − b_2 Σ z_2i − Σ m_i)/(n + 1),
β̂_1 = (Σ x_i y_i + μ_1 − β_0 Σ x_i − Σ(b_1 x_i z_1i + b_2 x_i z_2i + m_i x_i))/(Σ x_i² + 1),

b̂_1 = σ_b² [Σ y_i z_1i − Σ(β_0 + β_1 x_i + m_i) z_1i − b_2 Σ z_1i z_2i]/(σ_b² Σ z_1i² + σ²)

and

b̂_2 = σ_b² [Σ y_i z_2i − Σ(β_0 + β_1 x_i + m_i) z_2i − b_1 Σ z_1i z_2i]/(σ_b² Σ z_2i² + σ²).

Then

[β_0 | β_1, b, m, σ², γ, y_1, …, y_n] = N(β̂_0, σ²/(n + 1)),    (4.13)

[β_1 | β_0, b, m, σ², γ, y_1, …, y_n] = N(β̂_1, σ²/(Σ x_i² + 1)),    (4.14)

[b_1 | b_2, β, m, σ², σ_b², γ, y_1, …, y_n] = N(b̂_1, σ²σ_b²/(σ_b² Σ z_1i² + σ²)),    (4.15)

[b_2 | b_1, β, m, σ², σ_b², γ, y_1, …, y_n] = N(b̂_2, σ²σ_b²/(σ_b² Σ z_2i² + σ²)).    (4.16)

For i = 1, …, n,

[m_i | β_0, β_1, b, m_(−i), σ², γ, y_1, …, y_n] = N(m̂_i, (σ^{−2} + (a_i τ_i)^{−2})^{−1})    (4.17)

where m̂_i = (y_i − β_0 − β_1 x_i − b_1 z_1i − b_2 z_2i)/(1 + σ²(a_i τ_i)^{−2}). Also,

[σ² | β, b, m, γ, y_1, …, y_n] = IG(n/2, |Y − X̃θ̃ − Zb|²/2)    (4.18)

and

[γ_i = 1 | y_1, …, y_n, β, b, m, σ², γ_(−i)] = a/(a + b)    (4.19)

where a = [m | γ_i = 1, γ_(−i)] p_i and b = [m | γ_i = 0, γ_(−i)](1 − p_i).

We applied SSVS to our model with the indifference prior f(γ) = (1/2)^20: p_1 = ⋯ = p_20 = 0.5, c_1 = ⋯ = c_20 = 10, R = I, ν = 0 and ν_b = 0. Table 4.1 shows that observation 7 is considered an outlier, since its proportion (which is an approximate posterior probability) is the highest. Also, it may be assumed that there are no outliers.
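With R = I the prior (2.5) factorizes, so the componentwise update a/(a + b) for γ_i depends only on the current m_i; a minimal sketch of this step, with τ_i = 1, c_i = 10 and p_i = 0.5 taken as illustrative settings:

```python
import math

def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def prob_gamma_one(m_i, tau=1.0, c=10.0, p=0.5):
    """Pr(gamma_i = 1 | m, gamma_(-i)) = a / (a + b) with R = I, where
    a = N(m_i; 0, c^2 tau^2) * p and b = N(m_i; 0, tau^2) * (1 - p)."""
    a = normal_pdf(m_i, c * tau) * p
    b = normal_pdf(m_i, tau) * (1.0 - p)
    return a / (a + b)

# A mean shift near zero keeps gamma_i = 1 improbable;
# a large shift drives the probability toward 1.
print(prob_gamma_one(0.1), prob_gamma_one(10.0))
```

Drawing γ_i as a Bernoulli variate with this success probability, for each i in turn, completes one Gibbs scan over γ.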
Table 4.1.
number of outliers  observations    proportion
0                   none            0.13
1                   1               0.09
1                   4               0.03
1                   6               0.01
1                   7               0.35
1                   8               0.07
1                   10              0.01
2                   1, 4            0.01
2                   1, 7            0.06
2                   1, 9            0.01
2                   4, 6            0.01
2                   4, 8            0.01
2                   6, 7            0.01
2                   6, 10           0.01
3                   1, 4, 8         0.01
3                   4, 8, 17        0.01
4                   1, 4, 9, 10     0.01
4                   1, 7, 8, 9      0.01

References

Box, G. E. P. and Tiao, G. C. (1968), A Bayesian Approach to Some Outlier Problems, Biometrika, 55, 119-129.

Box, G. E. P. and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis, Addison-Wesley Publishing Co., U.S.A.

Chaloner, K. and Brant, R. (1988), A Bayesian Approach to Outlier Detection and Residual Analysis, Biometrika, 75, 651-659.

Diebolt, J. and Robert, C. (1994), Estimation of Finite Mixture Distributions Through Bayesian Sampling, Journal of the Royal Statistical Society, Ser. B, 56, 363-375.

Geisser, S. (1985), On the Prediction of Observables: A Selective Update, in Bayesian Statistics 2, Ed. Bernardo, J. M., DeGroot, M. H., Lindley, D. V. and Smith, A. F. M., 203-230, Amsterdam: North Holland.
Gelfand, A. and Smith, A. F. M. (1990), Sampling-Based Approaches to Calculating Marginal Densities, Journal of the American Statistical Association, 85, 398-409.

George, E. I. and McCulloch, R. E. (1993), Variable Selection Via Gibbs Sampling, Journal of the American Statistical Association, 88, 881-889.

Guttman, I. (1973), Care and Handling of Univariate or Multivariate Outliers in Detecting Spuriosity - A Bayesian Approach, Technometrics, 15, 4, 723-738.

Guttman, I., Dutter, R. and Freeman, P. R. (1978), Care and Handling of Univariate Outliers in the General Linear Model to Detect Spuriosity - A Bayesian Approach, Technometrics, 20, 2, 187-193.

Guttman, I. and Pena, D. (1993), A Bayesian Look at Diagnostics in the Univariate Linear Model, Statistica Sinica, 3, 367-390.

Johnson, W. and Geisser, S. (1983), A Predictive View of the Detection and Characterization of Influential Observations in Regression Analysis, Journal of the American Statistical Association, 78, 137-144.

Pettit, L. I. and Smith, A. F. M. (1985), Outliers and Influential Observations in Linear Models, in Bayesian Statistics 2, Ed. Bernardo, J. M., DeGroot, M. H., Lindley, D. V. and Smith, A. F. M., 473-494, Amsterdam: North Holland.

Sharples, L. D. (1990), Identification and Accommodation of Outliers in General Hierarchical Models, Biometrika, 77, 3, 445-453.

Tanner, M. and Wong, W. (1987), The Calculation of Posterior Distributions by Data Augmentation (with discussion), Journal of the American Statistical Association, 82, 528-550.