DETECTION OF OUTLIER PATCHES IN AUTOREGRESSIVE TIME SERIES

Ana Justel*, Daniel Peña** and Ruey S. Tsay***

* Department of Mathematics, Universidad Autónoma de Madrid, ana.justel@uam.es
** Department of Statistics and Econometrics, Universidad Carlos III de Madrid, dpena@est-econ.uc3m.es
*** Graduate School of Business, University of Chicago, rst@gsbrst.uchicago.edu

Abstract

This paper proposes a procedure to detect patches of outliers in an autoregressive process. The procedure is an improvement over the existing detection methods via Gibbs sampling. We show that the standard outlier detection via Gibbs sampling may be extremely inefficient in the presence of severe masking and swamping effects. The new procedure identifies the beginning and end of possible outlier patches using the existing Gibbs sampling, then carries out an adaptive procedure with block interpolation to handle patches of outliers. Empirical and simulated examples show that the proposed procedure is effective.

Key words: Gibbs sampler, multiple outliers, sequential learning, time series.

Short running title: Outlier patches in autoregressive processes.

Correspondence to: Ana Justel, Departamento de Matemáticas, Universidad Autónoma de Madrid, 28049 Madrid, Spain. Phone: +34 9 397 537. Fax: +34 9 397 4889.
1. INTRODUCTION

Outliers in a time series can have adverse effects on model identification and parameter estimation. Fox (1972) defined additive and innovative outliers in a univariate time series. Let {x_t} be an autoregressive process of order p, AR(p), satisfying

x_t = φ_0 + φ_1 x_{t-1} + ... + φ_p x_{t-p} + a_t,   (1.1)

where {a_t} is a sequence of independent and identically distributed Gaussian variables with mean zero and variance σ_a², and the polynomial φ(B) = 1 − φ_1 B − ... − φ_p B^p has no zeros inside the unit circle. An observed time series y_t has an additive outlier (AO) at time T of size β if it satisfies y_t = β I_t^(T) + x_t, t = 1, ..., n, where I_t^(T) is an indicator variable such that I_t^(T) = 0 if t ≠ T, and I_t^(T) = 1 if t = T. The series has an innovative outlier (IO) at time T if the outlier directly affects the noise process, that is, y_t = φ(B)^{-1} β I_t^(T) + x_t, t = 1, ..., n. Chang and Tiao (1983) show that additive outliers can cause serious bias in parameter estimation, whereas innovative outliers have only minor effects in estimation.

We deal with additive outliers that occur in patches in an AR process. The main motivation for our study is that multiple outliers, especially those which occur closely in time, often have severe masking effects that can render the usual outlier detection methods ineffective. Several procedures are available in the literature to handle outliers in a time series. Chang and Tiao (1983), Chang, Tiao and Chen (1988) and Tsay (1986, 1988) proposed an iterative procedure to detect four types of disturbance in an autoregressive integrated moving-average (ARIMA) model. However, these procedures may fail to detect multiple outliers due to masking effects. They can also misspecify "good" data points as outliers, resulting in what is commonly referred to as the swamping or smearing effect. Chen and Liu (1993) proposed a modified iterative procedure to reduce masking effects by jointly estimating the model parameters and the magnitudes of outlier effects.
This procedure may also fail since it starts with parameter estimation that assumes no outliers in the data; see Sánchez and Peña (1997). Peña (1987, 1990) proposed diagnostic statistics to measure the influence of an observation. Similar to the
case of independent data, influence measures based on data deletion (or, equivalently, on missing-value techniques in time series analysis) will encounter difficulties due to masking effects.

A special case of multiple outliers is a patch of additive outliers. This type of outlier can appear in a time series for various reasons. First, and perhaps most importantly, as shown by Tsay, Peña and Pankratz (1998), a multivariate innovative outlier in a vector time series can introduce a patch of additive outliers in the univariate marginal time series. Second, an unusual shock may temporarily affect the mean and variance of a univariate time series in a manner that cannot be adequately described by the four types of outlier commonly used in the literature, or by conditional heteroscedastic models. The effect is a patch of additive outliers. Because outliers within a patch tend to interact with each other, introducing masking or smearing, it is important in applications to detect them.

Bruce and Martin (1989) were the first to analyze patches of outliers in a time series. They proposed a procedure to identify outlying patches by deleting blocks of consecutive observations. However, efficient procedures to determine the block sizes and to carry out the necessary computation have not been developed. McCulloch and Tsay (1994) and Barnett, Kohn and Sheather (1996, 1997) used Markov chain Monte Carlo (MCMC) methods to detect outliers and compute the posterior distribution of the parameters in an ARIMA model. In particular, McCulloch and Tsay (1994) showed that Gibbs sampling provides accurate parameter estimation and effective outlier detection for an AR process when the additive outliers are not in patches. However, as clearly shown by the following example, the usual Gibbs sampling may be very inefficient when the outliers occur in a patch.

Consider the outlier-contaminated time series shown in Figure 1.
The outlier-free data consist of a random realization of n = 50 observations generated from the AR(3) model

x_t = 2.1 x_{t-1} − 1.46 x_{t-2} + 0.336 x_{t-3} + a_t,   t = 1, ..., 50,   (1.2)

where σ_a² = 1. The roots of the autoregressive polynomial are 0.6, 0.7 and 0.8, so that
the series is stationary. A single additive outlier of size −13 has been added at time index t = 27, and a patch of four consecutive additive outliers has been introduced from t = 38 to t = 41, with sizes (10, 10, 9, 10). The data are available from the authors upon request.

Assuming that the AR order p = 3 is known, we performed the usual Gibbs sampling to estimate model parameters and to detect outliers. Figure 1 gives some summary statistics of the Gibbs sampling output using the last 1,000 samples from a Gibbs sampler of 26,000 iterations (when the Markov chains have stabilized, as shown in Figure 1-c). Figure 1-a provides the posterior probability of being an outlier for each data point, and Figure 1-b gives the posterior means of the outlier sizes. From the plots, it is clear that the usual Gibbs sampler easily detects the isolated outlier at t = 27, with posterior probability close to one and posterior mean of outlier size −13.3. Meanwhile, the usual Gibbs sampler encounters several difficulties. First, it fails to detect the inner points of the outlying patch as outliers (the outlying posterior probabilities are very low at t = 39 and 40). This phenomenon is referred to as masking. Second, the sampler misspecifies the "good" data points at t = 37 and 42 as outliers, because the outlying posterior probabilities of these two points are close to unity. The posterior means of the sizes of these two erroneous outliers are −3.42 and −2.9, respectively. In short, two "good" data points at t = 37 and 42 are swamped by the patch of outliers. Third, the sampler correctly identifies the boundary points of the outlier patch at t = 38 and 41 as outliers, but substantially underestimates their sizes (the posterior means of the outlier sizes are only 3.3 and 4.5, respectively). The example clearly illustrates the masking and swamping problems encountered by the usual Gibbs sampler when additive outliers exist in a patch. The objective of this paper is to propose a new procedure to overcome these difficulties.
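A contaminated series of this kind can be reproduced in a few lines. The sketch below is ours, not part of the paper: the random seed, the start-up values (and hence the exact realization) are arbitrary choices, and the outlier layout follows our reading of the example above.

```python
import random

def simulate_ar(phi, n, sigma_a=1.0, seed=1, burn_in=100):
    """Outlier-free AR(p): x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p} + a_t,
    with i.i.d. innovations a_t ~ N(0, sigma_a^2)."""
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p                                   # crude start-up values
    for _ in range(n + burn_in):                    # discard a burn-in stretch
        a = rng.gauss(0.0, sigma_a)
        x.append(sum(phi[i] * x[-1 - i] for i in range(p)) + a)
    return x[-n:]

def add_outliers(x, outliers):
    """Additive outliers: y_t = x_t + beta_t at the listed (t, beta), 1-based t.
    AOs perturb the observed series only, never the underlying AR dynamics."""
    y = list(x)
    for t, beta in outliers:
        y[t - 1] += beta
    return y

x = simulate_ar([2.1, -1.46, 0.336], 50)            # model (1.2)
y = add_outliers(x, [(27, -13), (38, 10), (39, 10), (40, 9), (41, 10)])
```

The roots quoted for (1.2) can be checked directly: (1 − 0.6B)(1 − 0.7B)(1 − 0.8B) multiplies out to 1 − 2.1B + 1.46B² − 0.336B³.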
Limited experience shows that the proposed approach is effective. The paper is organized as follows. Section 2 reviews the application of the standard Gibbs sampler to outlier identification in an AR time series. Section 3 proposes a new adaptive Gibbs algorithm to detect outlier patches. The conditional posterior distributions of blocks of observations are obtained and used to expedite the convergence
of Gibbs sampling. Section 4 illustrates the performance of the proposed procedure in two examples.

Figure 1: Top: AR(3) artificial time series with five outliers at periods t = 27 and 38–41 (marked with a dot). Bottom: results of the Gibbs sampler after 26,000 iterations: (a) posterior probabilities for each data point to be an outlier; (b) posterior mean estimates of the outlier sizes for each data point; and (c) convergence monitoring by plotting the parameter values drawn at each iteration.

2. OUTLIER DETECTION IN AN AR PROCESS

2.1. AR model with additive outliers

Suppose the observed data y = (y_1, ..., y_n) are generated by y_t = δ_t β_t + x_t, t = 1, ..., n, where x_t is given by (1.1) and δ = (δ_1, ..., δ_n) is a binary random vector of outlier indicators; that is, δ_t = 1 if the t-th observation is contaminated by an additive outlier of size β_t, and δ_t = 0 otherwise. For simplicity, assume that x_1, ..., x_p are fixed and x_t = y_t for t = 1, ..., p, i.e., there are no outliers in the first p observations. The indicator vector of outliers then becomes δ = (δ_{p+1}, ..., δ_n) and the size vector is β = (β_{p+1}, ..., β_n). Let z_t = (1, x_{t-1}, ..., x_{t-p}) and φ = (φ_0, φ_1, ..., φ_p). The
observed series can be expressed as a multiple linear regression model given by

y_t = δ_t β_t + z_t'φ + a_t,   t = p+1, ..., n.   (2.3)

We assume that the outlier indicator δ_t and the outlier magnitude β_t are independent and distributed as Bernoulli(α) and N(0, τ²), respectively, for all t, where α and τ² are hyperparameters. Therefore, the prior probability of being contaminated by an outlier is the same for all observations, namely P(δ_t = 1) = α, for t = p+1, ..., n. The prior distribution for the contamination parameter α is Beta(γ_1, γ_2), with expectation E(α) = γ_1/(γ_1 + γ_2).

Abraham and Box (1979) obtained the posterior distributions for the parameters in model (2.3) under Jeffreys' reference prior distribution for (φ, σ_a²). In this case the conditional posterior distribution of φ, given any outlier configuration δ_r, is a multivariate-t distribution, where δ_r is one of the 2^{n−p} possible outlier configurations. Then the posterior distribution of φ is a mixture of 2^{n−p} multivariate-t distributions:

P(φ | y) = Σ_r w_r P(φ | y, δ_r),   (2.4)

where the summation is over all 2^{n−p} possible outlier configurations and w_r is the weight of configuration δ_r, i.e., w_r = P(δ_r | y). For such a model, we can identify the outliers using the posterior marginals of the elements of δ,

p_t = P(δ_t = 1 | y) = Σ P(δ_{rt} | y),   t = p+1, ..., n,   (2.5)

where the summation is now over the 2^{n−p−1} outlier configurations δ_{rt} with δ_t = 1 (the posterior probabilities P(δ_{rt} | y) are easy to compute). The posterior distributions of the outlier magnitudes are mixtures of Student-t distributions:

P(β_t | y) = Σ_r w_r P(β_t | y, δ_r),   t = p+1, ..., n.   (2.6)

Therefore, it is possible to derive the posterior probabilities for the parameters in model (2.3). In practice, however, the computation is very intensive even when the sample size is small, since these probabilities are mixtures of 2^{n−p} or 2^{n−p−1} distributions. The approach becomes infeasible when the sample size is moderate or large, and some alternative approach is needed. One alternative is to use MCMC methods.
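The combinatorial burden behind the mixture (2.4) is easy to appreciate: even for the small artificial series of Section 1 the exact posterior sums over an astronomical number of outlier configurations (a one-line illustration, ours):

```python
# Number of outlier configurations in the exact mixture posterior (2.4):
def n_configurations(n, p):
    return 2 ** (n - p)

# For the artificial series of Section 1 (n = 50, p = 3):
print(n_configurations(50, 3))   # 140737488355328, about 1.4e14 mixture terms
```

With roughly 1.4 × 10^14 terms, direct enumeration is out of the question, which is what motivates the MCMC approach that follows.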
2.2. The standard Gibbs sampling and its difficulties

McCulloch and Tsay (1994) proposed to compute the posterior distributions (2.4)–(2.6) by Gibbs sampling. The procedure requires the full conditional posterior distribution of each parameter in model (2.3) given all the other parameters. Barnett, Kohn and Sheather (1996) generalized the model to include innovative outliers and order selection; they used MCMC methods with Metropolis–Hastings and Gibbs sampling algorithms. For ease of reference, we summarize the conditional posterior distributions, obtained first by McCulloch and Tsay (1994), with conjugate prior distributions for φ and σ_a². We use non-informative priors for these parameters. Note that, as the priors for δ and β are proper, the joint posterior is proper even if we assume improper priors for (φ, σ_a²).

If the hyperparameters γ_1, γ_2 and τ² are known, the conditional posterior distribution of the AR parameter vector φ is multivariate normal N_{p+1}(φ*, σ_a² Σ^{-1}), where the mean vector and the covariance matrix are given by

φ* = Σ^{-1} Σ_{t=p+1}^n z_t x̃_t   and   Σ = Σ_{t=p+1}^n z_t z_t',

with x̃_t = y_t − δ_t β_t the outlier-adjusted observation. The conditional posterior distribution of the innovation variance σ_a² is Inverted-Gamma((n−p)/2, (Σ_{t=p+1}^n a_t²)/2). The conditional posterior distribution of α depends only on the vector δ: it is Beta(γ_1 + Σ δ_t, γ_2 + (n−p) − Σ δ_t). The conditional posterior mean of α can then be expressed as a linear combination of the prior mean and the sample mean of the δ data: E(α | δ) = ω E(α) + (1−ω) δ̄, where ω = (γ_1 + γ_2)/(γ_1 + γ_2 + n − p) and δ̄ = Σ δ_t/(n−p).

The conditional posterior distribution of δ_j, j = p+1, ..., n, is Bernoulli with probability

P(δ_j = 1 | y, φ, σ_a², α, β, δ^(j)) = α / (α + (1−α) B^(j)),   (2.7)

where δ^(j) is obtained from δ by eliminating the element δ_j and B^(j) is the Bayes factor.
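A sketch of the draw of a single indicator δ_j via (2.7) follows. It assumes the Bayes factor compares the residual sums of squares over t = j, ..., T_j under the two hypotheses, as the text describes next; the function names and the 0-based indexing are ours.

```python
import math

def one_step_residuals(x, phi0, phi):
    """e_t = x_t - phi0 - sum_i phi_i x_{t-i}, for t = p, ..., len(x)-1 (0-based)."""
    p = len(phi)
    return [x[t] - phi0 - sum(phi[i] * x[t - 1 - i] for i in range(p))
            for t in range(p, len(x))]

def prob_delta_j(x_adj, j, beta_j, phi0, phi, sigma2_a, alpha):
    """P(delta_j = 1 | rest) as in (2.7).  x_adj is the series already corrected
    for all the other identified outliers; beta_j is the current outlier size."""
    p = len(phi)
    Tj = min(len(x_adj) - 1, j + p)            # an AO at j affects e_j, ..., e_Tj
    x1 = list(x_adj)
    x1[j] -= beta_j                            # hypothesis delta_j = 1: remove it
    def ssq(series):                           # sum of e_t^2 for t = j, ..., Tj
        e = one_step_residuals(series, phi0, phi)
        return sum(v * v for v in e[j - p:Tj - p + 1])
    log_B = (ssq(x1) - ssq(x_adj)) / (2.0 * sigma2_a)
    return alpha / (alpha + (1.0 - alpha) * math.exp(log_B))
```

Drawing δ_j then amounts to a Bernoulli draw with this probability; a large outlier at j makes the δ_j = 1 residuals much smaller, driving the probability toward one.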
The logarithm of the Bayes factor B^(j) is

log B^(j) = (1/(2σ_a²)) [ Σ_{t=j}^{T_j} e_t(δ_j = 1)² − Σ_{t=j}^{T_j} e_t(δ_j = 0)² ],   (2.8)

where T_j = min(n, j + p), e_t(δ_j) = x̃_t − φ_0 − Σ_{i=1}^p φ_i x̃_{t-i}, and x̃_t = y_t − δ_t β_t is the series corrected for the outliers identified in δ^(j) and δ_j. It is easy to see that e_t(δ_j = 0) = e_t(δ_j = 1) + π_{t-j} β_j, where π_0 = 1 and π_j = −φ_j for j = 1, ..., p.

The probability (2.7) has a simple interpretation. The two hypotheses δ_j = 1 (y_j is contaminated by an outlier) and δ_j = 0 (y_j is outlier free) affect, given the data, only the residuals e_j, ..., e_{T_j}. Assuming the parameters are known, we can judge the likelihoods of these hypotheses by (a) computing the residuals e_t(δ_j = 1) for t = j, ..., T_j; (b) computing the residuals e_t(δ_j = 0); and (c) comparing the two sets of residuals. The Bayes factor is the usual way of comparing the likelihoods of the two hypotheses. Since the residuals are one-step-ahead prediction errors, (2.8) compares the sums of prediction errors in the periods j, j+1, ..., T_j when the forecasts are evaluated under the hypothesis δ_j = 1 and under the hypothesis δ_j = 0. This is equivalent to the Chow test (1960) for structural changes when the variance is known.

The conditional posterior distributions of the outlier magnitudes β_j, for j = p+1, ..., n, are N(β_j*, ρ_j²), where

β_j* = (ρ_j²/σ_a²) [ e_j(0) − φ_1 e_{j+1}(0) − ... − φ_{T_j−j} e_{T_j}(0) ]   (2.9)

and ρ_j² = τ² σ_a² / (τ² ν²_{T_j−j} + σ_a²), with ν²_{T_j−j} = 1 + φ_1² + ... + φ²_{T_j−j}, and e_t(0) denoting e_t(δ_j = 0). When y_j is identified as an outlier and there is no prior information about the outlier magnitude (τ² → ∞), the conditional posterior mean of β_j tends to β̂_j = ν^{-2}_{T_j−j} [e_j(0) − φ_1 e_{j+1}(0) − ... − φ_{T_j−j} e_{T_j}(0)], which is the least squares estimate when the parameters are known (Chang, Tiao and Chen (1988)); the conditional posterior variance of β_j tends to the variance of the estimate β̂_j. The conditional posterior mean in (2.9) can also be seen as a linear combination of the prior mean and the outlier magnitude estimated from the data. The magnitude estimate is the difference between
the observation y_j and the conditional expectation of y_j given all the data, ŷ_{j|n}, which is the linear predictor of y_j that minimizes the mean squared error. Then (2.9) can be expressed as

β_j* = [τ² ν²_{T_j−j} / (τ² ν²_{T_j−j} + σ_a²)] (y_j − ŷ_{j|n}) + [σ_a² / (τ² ν²_{T_j−j} + σ_a²)] μ,   (2.10)

where μ is the prior mean of β_j (zero in this paper). For the AR(p) model under study, the optimal linear predictor ŷ_{j|n} is a combination of the past and future p values of y_j:

ŷ_{j|n} = ν^{-2}_{T_j−j} φ̃_{T_j−j} φ_0 − ν^{-2}_{T_j−j} ( Σ_{i=1}^p [Σ_{t=0}^{T_j−j−i} π_t π_{t+i}] x_{j−i} + Σ_{i=1}^{T_j−j} [Σ_{t=0}^{T_j−j−i} π_t π_{t+i}] x_{j+i} ),   (2.11)

where π_0 = 1 and π_i = −φ_i as in (2.8), and φ̃_t = 1 − φ_1 − ... − φ_t for t ≤ p. Using the truncated autoregressive polynomial φ_{T_j−j}(B) = (1 − φ_1 B − ... − φ_{T_j−j} B^{T_j−j}) and the "truncated" variance ν²_{T_j−j} = (1 + φ_1² + ... + φ²_{T_j−j}) of the dual process

x^D_t = φ̃_p φ_0 + a_t − φ_1 a_{t−1} − ... − φ_p a_{t−p},   (2.12)

the filter (2.11) can be written as a function of the "truncated" autocorrelation generating function ρ^D_{T_j−j}(B) = ν^{-2}_{T_j−j} φ_p(B) φ_{T_j−j}(B^{-1}) of the dual process: ŷ_{j|n} = ν^{-2}_{T_j−j} φ̃_{T_j−j} φ_0 + [1 − ρ^D_{T_j−j}(B)] x_j.

We may use the above results to draw a sample of a Markov chain using Gibbs sampling. This Markov chain converges to the joint posterior distribution of the parameters. When the number of iterations is sufficiently large, the Gibbs draws can be regarded as a sample from the joint posterior distribution. These draws are easy to obtain because they are from well-known probability distributions. However, as shown by the simple example in Section 1, such a Gibbs sampling procedure may fare poorly when additive outliers appear in patches.

To understand the situation more fully, consider the simplest situation in which the time series follows an AR(1) model with mean zero and there exists a patch of three additive outliers at times T−1, T, T+1. To simplify the analysis, we first assume that the three outliers have the same size, β_t = β for t = T−1, T, T+1, and the AR
parameter φ is known. Checking whether an observation is contaminated by an additive outlier amounts to comparing the observed value with its optimal interpolator derived using the model and the rest of the data. For an AR(1) process the optimal interpolator (2.11) is ŷ_{t|n} = φ(1+φ²)^{-1}(y_{t-1} + y_{t+1}). The first outlier in the patch is tested by comparing y_{T-1} = x_{T-1} + β with ŷ_{T-1|n} = x̂_{T-1|n} + φ(1+φ²)^{-1}β. It is detected if β is sufficiently large, but its size will be underestimated. For instance, if φ is close to unity the estimated size will be only about half the true size. For the second outlier, we compare y_T = x_T + β with ŷ_{T|n} = x̂_{T|n} + φ(1+φ²)^{-1}(β + β) = x̂_{T|n} + φ(1+φ²)^{-1}(2β). If φ is close to one the outlier cannot be easily detected, because the difference between the observed value and the optimal interpolator is x_T − x̂_{T|n} + β(1−φ)²(1+φ²)^{-1}, which is small when φ is close to unity. Note that in practice the masking effect is likely to remain even if we correctly identify an outlier at T−1 and adjust y_{T-1} accordingly because, as mentioned above, the outlier size at T−1 is underestimated. In short, if φ is close to unity the middle outlier at time T is very hard to detect. The detection of the outlier at time T+1 is also difficult, and its size is underestimated.

Next consider the case in which the AR parameter is estimated from the sample. Without outliers the least squares estimate of the AR parameter is φ̂ = Σ x_t x_{t-1} (Σ x_t²)^{-1} = r_x(1), the lag-1 sample autocorrelation. Suppose the series is contaminated by a patch of 2k+1 additive outliers at times T−k, ..., T, ..., T+k, of sizes β_t = o_t s_x, where s_x² = Σ x_t²/n is the sample variance of the outlier-free series. In this case, the least squares estimate of the AR coefficient based on the observed time series y_t is given, dropping terms of order o(n^{-1}), by

φ̂_y = [ r_x(1) + n^{-1} Σ_{j=−k}^{k} o_{T+j}(x̃_{T+j−1} + x̃_{T+j+1}) + n^{-1} Σ_{j=−k}^{k} o_{T+j} o_{T+j−1} ] / [ 1 + 2 n^{-1} Σ_{j=−k}^{k} o_{T+j} x̃_{T+j} + n^{-1} Σ_{j=−k}^{k} (o_{T+j})² ],

where x̃_t = x_t / s_x.
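The two attenuation factors derived above for the AR(1) patch are easy to tabulate (a small numeric check; the function names are ours):

```python
def edge_absorption(phi):
    """Fraction of a neighbouring outlier's size absorbed by the interpolator at
    the patch boundary: phi / (1 + phi^2); -> 1/2 as phi -> 1, so the boundary
    outlier's size is underestimated by about one half."""
    return phi / (1 + phi ** 2)

def middle_gap_factor(phi):
    """Coefficient of beta in y_T - yhat_{T|n} for the middle outlier:
    (1 - phi)^2 / (1 + phi^2); -> 0 as phi -> 1, which is the masking effect."""
    return (1 - phi) ** 2 / (1 + phi ** 2)
```

For example, at φ = 0.9 the middle-gap factor is about 0.0055, so even a very large β barely separates y_T from its interpolator, while the boundary interpolator already absorbs about 0.497 of each neighbour's size.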
If the outliers are large, with the same sign and similar sizes, then for a fixed sample size n, φ̂_y will approach unity as the outlier sizes increase. This makes the identification of the outliers difficult. Note that the characteristics of the outlier patch will appear clearly if the length of the patch is known and one interpolates the whole patch using observations before and
after the patch. This is the main idea of the adaptive Gibbs sampling proposed in the next section. For the standard outlier detection procedure using Gibbs sampling, proper "block" interpolation can only occur if one uses ideal initial values that identify each point of the outlier patch as an outlier.

Figure 2 shows the evolution of the masking problem as the Gibbs sampling iterations proceed for the simulated data in (1.2). At each iteration, the Gibbs sampler checks whether an observation is an outlier by comparing the observed value with its optimal interpolator using (2.8) and (2.10). The outlier sizes for t = 37 to 42 generated at each iteration are represented in Figure 2 as continuous sequences; grey shadows represent iterations where the data points are identified as outliers. The outliers at t = 38 and 41 are identified and corrected in a few iterations, but their sizes are underestimated. The central outliers, t = 39 and 40, are identified in very few iterations, and their sizes on these occasions are small. We can see that the four outliers are never identified in the same iteration. If we use ideal initial values that assign each point of the outlier patch as an outlier, the sizes obtained at each iteration are the dots in Figure 2 (horizontal lines represent the true outlier sizes). At the right side of the graphs, the estimates of the posterior distributions are represented for the standard Gibbs sampling (the full curve) and for the Gibbs sampler with "block" interpolation.

One can regard these difficulties as a practical convergence problem for the Gibbs sampler. This problem also appears in the regression case with multiple outliers, as shown by Justel and Peña (1996). High parameter correlations and the large dimension of the parameter space slow down the convergence of the Gibbs sampler (see Hills and Smith (1992)). Note that the dimension of the parameter space is 2n − p + 3 and, for outliers in a patch, correlations are large among the outlier positions and among the outlier magnitudes.
When Gibbs draws are from a joint distribution of highly correlated parameters, movements from one iteration to the next are parallel to the coordinate axes, whereas the mass of the distribution is concentrated along the principal-component directions of the parameter space; the sampler therefore traverses the space slowly.
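This slow-mixing phenomenon is easy to reproduce on a toy target: a Gibbs sampler for a bivariate normal with correlation ρ moves parallel to the axes, and its draws become highly autocorrelated as |ρ| grows (an illustration in our own notation, not from the paper):

```python
import math, random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampling for (X, Y) standard bivariate normal with correlation rho:
    each full conditional is N(rho * other, 1 - rho^2)."""
    rng = random.Random(seed)
    x = y = 0.0
    sd = math.sqrt(1.0 - rho ** 2)
    xs = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)     # move parallel to the x-axis
        y = rng.gauss(rho * x, sd)     # move parallel to the y-axis
        xs.append(x)
    return xs

def lag1_autocorr(xs):
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den
```

With ρ = 0.95 the x-chain's lag-1 autocorrelation is near ρ² ≈ 0.90, whereas with ρ = 0.1 it is negligible — exactly the behaviour that slows the sampler when outlier positions and magnitudes within a patch are strongly correlated.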
Figure 2: Artificial time series: outlier sizes β_37 to β_42 generated at each iteration. Continuous sequences are for the standard Gibbs sampler, and dots for the Gibbs sampler with "block" interpolation. The grey shadows show the iterations where the data points are identified as outliers. The horizontal lines represent the true values.

3. DETECTING OUTLIER PATCHES

Our new procedure consists of two Gibbs runs. In the first run, the standard Gibbs sampling of Section 2 is applied to the data. The results of this Gibbs run are then used to implement a second Gibbs sampling that is adaptive in treating identified outliers and in using block interpolation to reduce possible masking and swamping effects. In what follows, we divide the details of the proposed procedure into subsections. The discussion focuses on the second Gibbs run and assumes that the results of the first Gibbs run are available. For ease of reference, let φ̂^(s), σ̂_a^(s), p̂^(s) and β̂^(s) be the posterior means based on the last r iterations of the first Gibbs run, which uses s
iterations, where the j-th element of p̂^(s) is p̂^(s)_{p+j}, the posterior probability that y_{p+j} is contaminated by an outlier.

3.1. Location and joint estimation of outlier patches

The biases in β̂^(s) induced by the masking effects of multiple outliers come from several sources. Two main sources are (a) drawing the values of δ_j one by one, and (b) the misspecification of the prior mean of β_j, fixed at zero. One-by-one drawing overlooks the dependence between parameters. For an AR(p) process, an additive outlier affects p+1 residuals, and the usual interpolation (or filtering) involves p observations before and after the time index of interest. Consequently, an additive outlier affects the conditional posterior distributions of 2p+1 observations; see (2.8) and (2.9). Chen and Liu (1993) pointed out that estimates of outlier magnitudes computed separately can differ markedly from those obtained by joint estimation. The situation becomes more serious in the presence of k consecutive additive outliers, which affect 2p+k observations.

We make use of the results of the first Gibbs sampler to identify possible locations and block sizes of outlier patches. The tentative specification of locations and block sizes of outlier patches is done by a forward-backward search using a window around the outliers identified by the first Gibbs run. Let c_1 be a critical value between 0 and 1 used to identify potential outliers. An observation whose posterior probability of being an outlier exceeds c_1 is classified as an "identified" outlier. More specifically, y_j is identified as an outlier if p̂^(s)_j > c_1. Typically we use c_1 = 0.5. Let {t_1, ..., t_m} be the collection of time indexes of the outliers identified by the first Gibbs run. Next, consider patch sizes. We select another critical value c_2, with c_2 ≤ c_1, to specify the beginning and end points of a "potential" outlier patch associated with an identified outlier.
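This forward-backward boundary search can be sketched as follows (the window parameter h and the merge rules are described in the text; the implementation details here are our own, with 0-based indexes):

```python
def find_patch_bounds(probs, t_i, h, p, c2):
    """Given posterior outlier probabilities from the first Gibbs run, scan the
    h*p points on each side of an identified outlier at index t_i and return the
    farthest points with probability above c2 as the tentative patch bounds."""
    lo = t_i
    for t in range(max(0, t_i - h * p), t_i):               # leftmost first
        if probs[t] > c2:
            lo = t
            break
    hi = t_i
    for t in range(min(len(probs) - 1, t_i + h * p), t_i, -1):  # rightmost first
        if probs[t] > c2:
            hi = t
            break
    return lo, hi
```

Applied around each identified outlier, the returned intervals are then merged when they overlap or touch, giving the tentative outlier patches used by the second run.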
In addition, because the length of an outlier patch cannot be too large relative to the sample size, we select a window of length 2hp in which to search for the boundary points of a possible outlier patch. For example, consider an "identified" outlier y_{t_i}. First, we
check the hp observations before y_{t_i} and compare their posterior probabilities p̂^(s)_j with c_2. Any point within the window with p̂^(s)_j > c_2 is regarded as a possible beginning point of an outlier patch associated with y_{t_i}. We then select the farthest such point from y_{t_i} as the beginning point of the outlier patch, and denote it by y_{t_i − k_i}. Second, we do the same for the hp observations after y_{t_i} and select the farthest point from y_{t_i} with p̂^(s)_j > c_2 as the end point of the outlier patch, denoted by y_{t_i + v_i}. Combining the two blocks yields a tentative candidate for an outlier patch associated with y_{t_i}, denoted by (y_{t_i − k_i}, ..., y_{t_i + v_i}). Finally, consider jointly all the identified outlier patches for further refinement: overlapping or consecutive patches should be merged to form a larger patch; if the total number of outliers is greater than n/2, where n is the sample size, increase c_2 and re-specify possible outlier patches; if increasing c_2 cannot sufficiently reduce the total number of outliers, choose a smaller h and re-specify the outlier patches.

With outlier patches tentatively specified, we draw Gibbs samples jointly within a patch. Suppose that a patch of k outliers starting at time index j is identified. Let δ_{j,k} = (δ_j, ..., δ_{j+k−1}) and β_{j,k} = (β_j, ..., β_{j+k−1}) be the vectors of outlier indicators and magnitudes, respectively, for the patch. To complete the sampling scheme we need the conditional posterior distributions of δ_{j,k} and β_{j,k}, given the others. We give these distributions in the next theorem; derivations are in the Appendix.

Theorem 1. Let y = (y_1, ..., y_n) be a vector of observations generated according to (2.3), with no outliers in the first p data points. Assume δ_t ~ Bernoulli(α), t = p+1, ..., n, and

P(φ, σ_a², α, β) ∝ σ_a^{-2} α^{γ_1 − 1} (1−α)^{γ_2 − 1} exp( −(1/(2τ²)) Σ_{t=p+1}^n β_t² ),

where the hyperparameters γ_1, γ_2 and τ² are known.
Let e_t(δ_{j,k}) = x̃_t − φ_0 − Σ_{i=1}^p φ_i x̃_{t-i} be the residual at time t when the series is adjusted for all identified outliers not in the interval [j, j+k−1] and for the outliers identified in δ_{j,k}, and let T_{j,k} = min{n, j+k+p−1}. Then the following hold.

a) The conditional posterior probability of a block configuration δ_{j,k}, given the sample
and the other parameters, is

p_{j,k} = C α^{s_{j,k}} (1−α)^{k − s_{j,k}} exp( −(1/(2σ_a²)) Σ_{t=j}^{T_{j,k}} e_t(δ_{j,k})² ),   (3.3)

where s_{j,k} = Σ_{t=j}^{j+k−1} δ_t, and C is a normalization constant such that the total probability of the 2^k possible configurations of δ_{j,k} is one.

b) The conditional posterior distribution of β_{j,k}, given the sample and the other parameters, is N_k(β*_{j,k}, Σ_{j,k}), with

β*_{j,k} = σ_a^{-2} Σ_{j,k} D_{j,k} Σ_{t=j}^{T_{j,k}} e_t(0) π_{t-j},   (3.4)

Σ_{j,k} = σ_a² ( D_{j,k} [Σ_{t=j}^{T_{j,k}} π_{t-j} π'_{t-j}] D_{j,k} + (σ_a²/τ²) I_k )^{-1},   (3.5)

where e_t(0) denotes e_t(δ_{j,k}) evaluated at δ_{j,k} = (0, ..., 0), D_{j,k} is a k × k diagonal matrix with elements δ_j, ..., δ_{j+k−1}, and π_t = (π_t, π_{t−1}, ..., π_{t−k+1})' is a k × 1 vector, with π_0 = 1, π_i = −φ_i for i = 1, ..., p, and π_i = 0 for i < 0 or i > p.

After computing the probabilities (3.3) for all 2^k possible configurations of the block δ_{j,k}, the outlying status of each observation in the outlier patch can be classified separately. Another possibility, suggested by a referee, is to engineer some Metropolis–Hastings moves by using Theorem 1. An advantage of our procedure is that we can generate large block configurations from (3.3) without computing C, but an optimal criterion (in the sense of acceptance rate and movement) should be found. This possibility will be explored in future work.

Let W_1 = σ_a^{-2} Σ_{j,k} D_{j,k} [Σ_{t=j}^{T_{j,k}} π_{t-j} π'_{t-j}] D_{j,k} and W_2 = τ^{-2} Σ_{j,k}. Then β*_{j,k} can be written as β*_{j,k} = W_1 β̃_{j,k} + W_2 μ, where W_1 + W_2 = I, implying that the mean of the conditional posterior distribution of β_{j,k} is a linear combination of the prior mean vector μ and the least squares estimate (or the maximum likelihood estimate) of the
outlier magnitudes for an outlier patch,

β̃_{j,k} = ( D_{j,k} [Σ_{t=j}^{T_{j,k}} π_{t-j} π'_{t-j}] D_{j,k} )^{-1} D_{j,k} Σ_{t=j}^{T_{j,k}} e_t(0) π_{t-j}.   (3.6)

Peña and Maravall (1991) proved that, when δ_t = 1 for all the points in the patch, the estimate in (3.6) is equivalent to the vector of differences between the observations (y_j, ..., y_{j+k−1}) and the predictions ŷ_t = E(y_t | y_1, ..., y_{j−1}, y_{j+k}, ..., y_n) for t = j, ..., j+k−1. The matrix Ψ = Σ_{t=j}^{T_{j,k}} π_{t-j} π'_{t-j} is the k × k submatrix of the "truncated" autocovariance generating matrix of the dual process in (2.12). Specifically, writing T = T_{j,k},

Ψ = ( ν²_{T−j}          ψ^(1)_{T−j−1}     ...   ψ^(k−1)_{T−j−k+1} )
    ( ψ^(−1)_{T−j}      ν²_{T−j−1}        ...   ψ^(k−2)_{T−j−k+1} )
    (   ...                ...            ...        ...          )
    ( ψ^(−k+1)_{T−j}    ψ^(−k+2)_{T−j−1}  ...   ν²_{T−j−k+1}      ),

where ψ^(i)_l = ν²_l ρ^{D,i}_l, ν²_l is the "truncated" variance of the dual process, and ρ^{D,i}_l is the coefficient of B^i in the "truncated" autocorrelation generating function of the dual process, ρ^D_l(B) = ν^{-2}_l φ_p(B) φ_l(B^{-1}).

3.2. The Second Gibbs Sampling

We now discuss the second, adaptive Gibbs run of the proposed procedure. The results of the first Gibbs run provide useful information to start the second Gibbs sampling and to specify the prior distributions of the parameters. The starting values of δ_t are as follows: δ_t^(0) = 1 if p̂^(s)_t > 0.5, i.e., if y_t belongs to an identified outlier patch; otherwise, δ_t^(0) = 0. The prior distributions of β_t are as follows.

a) If y_t is identified as an isolated outlier, the prior distribution of β_t is N(β̂^(s)_t, τ²), where β̂^(s)_t is the Gibbs estimate of β_t from the first Gibbs run.

b) If y_t belongs to an outlier patch, the prior distribution of β_t is N(β̃^(s)_t, τ²), where β̃^(s)_t is the conditional posterior mean given in (3.6).
c) If y_t does not belong to any outlier patch and is not an isolated outlier, then the prior distribution of β_t is N(0, τ²).

For each outlier patch, the results of Theorem 1 are used to draw δ_{j,k} and β_{j,k} in the second Gibbs sampling. The second Gibbs sampling is also run for s iterations, but only the results of the last r iterations are used to make inference. The number s can be determined by any sequential method proposed in the literature to monitor the convergence of Gibbs sampling. We use a method that can be easily implemented, based on comparing the estimates of the outlying probability for each data point computed with non-overlapping segments of samples. Specifically, after a burn-in period of b iterations, we assume convergence has been achieved if the standard test for the equality of two proportions is not rejected. Thus, calling p̂^(s)_t the probability that y_t is an outlier computed with the iterations from s−r+1 to s, we assume convergence if for all t = p+1, ..., n the differences p̂^(s)_t − p̂^(s−r)_t are smaller than ε = 3(0.5² · 2/r)^{1/2}.

An alternative procedure for handling outlier patches is to use the ideas of Bruce and Martin (1989). This procedure involves two steps. First, select a positive integer k in the interval [1, n/2] as the maximum length of an outlier patch. Second, start the Gibbs sampler with n−k−p parallel trials. In the j-th trial, for j = 1, ..., n−k−p, the points at t = p+j to p+k+j are assigned initially as outliers. For the other data points, use the usual method to assign initial values. In application, one can use several different k values. However, such a procedure requires intensive computation, especially when n is large.

4. APPLICATIONS

Here we re-analyze the simulated time series of Section 1 and then consider some real data. We compare the results of the usual Gibbs sampling, referred to as standard Gibbs sampling, with those of the adaptive Gibbs sampling to see the efficacy of the latter algorithm.
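The convergence rule can be sketched as follows; the threshold shown is an assumed reading of the (partly garbled in our copy) formula — three standard errors of the difference of two proportions in the worst case p = 0.5, each estimated from r draws:

```python
import math

def proportions_converged(p_prev, p_curr, r):
    """Declare convergence when, for every data point, the outlier-probability
    estimates from two non-overlapping batches of r Gibbs draws differ by less
    than eps = 3 * sqrt(2 * 0.5**2 / r)."""
    eps = 3.0 * math.sqrt(2.0 * 0.25 / r)
    return all(abs(a - b) < eps for a, b in zip(p_prev, p_curr))
```

In use, the sampler is extended in blocks of r iterations after the burn-in, and stops at the first block for which the check passes for every data point.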
The example demonstrates the applicability and effectiveness of the adaptive Gibbs sampling, and it shows that patches of outliers occur in applications.
Parameter     β_27    β_37    β_38    β_39    β_40    β_41    β_42
True Value    −13     0       10      10      9       10      0
Standard GS   −13.3   −3.42   3.3     0.5     −0.5    4.5     −2.9
Adaptive GS   −13.6   −0.9    10.97   10.63   10.43   10.9    −0.23

Table 1: Outlier magnitudes: true values and estimates obtained by the standard and adaptive Gibbs samplers.

4.1. Simulated Data Revisited

As shown in Figure 1, standard Gibbs sampling can easily detect the isolated outlier at t = 27 of the simulated AR(3) example, but it has difficulty with the outlier patch in the period 38–41. For the adaptive Gibbs sampling, we choose hyperparameters γ_1 = 5, γ_2 = 95 and τ = 3, implying that the contamination parameter α has a prior mean E(α) = 0.05, and that the prior standard deviation of β_t is three times the residual standard deviation. Using ε = 0.147 to monitor convergence, we obtained s = 26,000 iterations for the first Gibbs sampling and s = 7,000 iterations for the second, adaptive Gibbs sampling. All parameter estimates reported are the sample means of the last r = 1,000 iterations. For specifying the location of an outlier patch, we chose c_1 = 0.5, c_2 = 0.3, and a window of length 2p in which to search for boundary points of possible outlier patches, where p = 3 is the autoregressive order of the series. Additional checking confirms that the results are stable under minor modifications of these parameter values.

The results of the first run are shown in Figure 1 and summarized in Table 1. As before, the procedure indicates a possible patch of outliers from t = 37 to 42. In the second run the initial conditions and the prior distributions are specified by the proposed adaptive procedure. The posterior probability of being an outlier for each data point, p̂^(s)_t, is shown in Figure 3-a, and the posterior means of the outlier sizes are shown in Figure 3-b. The adaptive Gibbs sampling successfully specifies all the outliers, and there are no swamping or masking effects. In Figure 3-c we compare the posterior distributions of the adaptive (shaded area) and the standard (dotted curve) Gibbs sampling for the error variance and the autoregressive parameters; in Table 1 we compare some of the
outlier posterior probabilities.8.6.4.2 (a) 2 3 4 5 t outlier sizes 5 5 (b) 5 2 3 4 5 t (c) σ 2.5.5 2 2.5 3 3.5 4 φ.5.5 φ 2.5.5 2 φ 2 2.5.5 2.5 φ 3.5.5 Figure 3: Adaptive Gibbs sampling results with 7, iterations for the articial time series with ve outliers: (a) posterior probabilities for each data point to be outlier, (b) posterior mean estimates of the outlier sizes for each data, and (c) kernel estimates of the posterior marginal parameter distributions the dotted lines are estimates from the rst run and vertical lines mark true values. outlier sizes with the true values. One clearly sees the ecacy and added value of the adaptive Gibbs sampling in this way. Figure 4 shows the scatterplots of Gibbs draws of outlier sizes for t = 37 ::: 42. The right panel is for adaptive Gibbs sampling whereas the left panel is for the standard Gibbs sampling. The plots on the diagonals are histograms of outlier sizes. Adaptive Gibbs sampling exhibits high correlations between sizes of consecutive outliers, in agreement with outlier sizes used. On the other hand, scatterplots of the standard Gibbs sampling do not show adequately correlations between outlier sizes. Finally, we also ran a more extensive simulation study where the locations for the isolated outlier and the patch were randomly selected. Using the same x t sequence, we ran standard and adaptive Gibbs sampling for 2 cases with the following results.. The isolated outlier was identied by standard and adaptive Gibbs sampling procedures in all cases. 2. The standard Gibbs sampler failed to detect all the outliers in each of the 2 simulations. A typical outcome, as pointed out in Section 3, had the extreme points of the outlier patch and their neighboring points identied as outliers observations in the middle of the outlier patch were subjected to masking eects. Occasionally, the standard Gibbs sampler did correctly identify three of the four 8
β 37 β 38 β 39 β 4 β 4 β 42 5 5 5 β 37 β 38 β 39 β 4 5 β 4 β 42 5 5 2 5 5 5 5 β 37 5 5 2 5 5 5 5 β 38 β 39 β 4 β 4 β 42 Figure 4: Scatterplots of the standard (left) and adaptive (right) Gibbs sampler output for 37 to 42 and the histograms of each magnitude in the diagonal. outliers in the patch. Figure 5-left shows a bar plot for the relative frequencies of outlier detection for each data point in the 2 runs. We summarize the relative frequency of identication for each outlier in the patch (bars st ; 4th) and their two neighboring \good" observations. 3. In all simulations, the proposed procedure indicates a possible patch of outliers that includes the true outlier patch. 4. The adaptive Gibbs sampler failed in 3 cases, corresponding to randomly selected patches in the same region of the series. Figure 5-right shows the positive performance of the adaptive algorithm in detecting all outliers in the patch. 4.2. A Real Example Consider the data of monthly U.S. Industry-unlled orders for radio and TV, in millions of dollars, as studied by Bruce and Martin (989) among others. We use the logged series from January 958 to October 98, and focus on the seasonally adjusted series, 9
Standard GS Adaptive GS.8.8.6.6.4.4.2.2 st 2 nd 3 th 4 th outlier patch st 2 nd 3 th 4 th outlier patch Figure 5: Bar plots for the relative frequencies, in 2 simulations, of outlier detection of each data in the patch (bars st; 4th), previous and next \good" observations. Left: standard Gibbs sampling. Right: adaptive Gibbs sampling. where the seasonal componentwas removed by the well-known -ARIMA procedure. The seasonally adjusted series is shown in Figure 6, and can be download from the same web site as the simulated data. An AR(3) model is t to the data and the estimated parameter values are given in the rst row oftable 2. The residual plot of the t, also shown in Figure 6, indicates possible isolated outliers and outlier patches, especially in the latter part of the series. The hyperparameters needed to run the adaptive Gibbs algorithm are set by the same criteria as those of the simulated example: = 5, 2 = 95, = 3 a = :273. In this particular instance, r = and the stopping criterion =:47 is achieved by 96 iterations in the rst Gibbs run and by iterations in the second. As before, to specify possible outlier patches prior to running the adaptive Gibbs sampling, the window width is set to twice the AR order, c = :5 and c 2 = :3. In addition, we assume that the maximum length of an outlier patch is months, just below one year. Using.5 as the cut-o posterior probability to identify outliers, we summarize the results of standard and adaptive Gibbs sampling in Table 3. The standard Gibbs algorithm identies 3 data points as outliers. Nine isolated outliers and two outlier patches both of length 2. The two outlier patches are 6-7/968 and 2-3/978. The outlier posterior probability for each data point is shown in Figure 7. On the other 2
6.5 log of TV & radio orders 6 5.5.5 residuals.5 958 96 962 964 966 968 97 972 974 976 978 98 Figure 6: Top: Seasonally adjusted series of the logarithm of U.S. Industry-unlled orders for radio and TV. Bottom: Residual plot for the AR(3) model tted to the series. hand, the second, adaptive Gibbs sampling species data points as outliers six isolated outliers, and two outlier patches of length 2 and 3 at 2-3/978 and 3-5/979, respectively. The outlier posterior probabilities based on adaptive Gibbs sampling are also presented in Figure 7. Finally, Table 4 presents outlier posterior probabilities for each data point in the detected outlier patches, for both the standard and adaptive Gibbs samplings. For the possible outlier patch from January to July 968, the two algorithms show dierent results: standard Gibbs sampling identies the patch 6-7/968 as outliers adaptive Gibbs sampling detects an isolated outlier at 5/68. For the possible outlier patch from September 976 to March 977, the standard algorithm detects an isolated outlier at /77, while the adaptive algorithm does not detect any outlier within the patch. For the possible outlier patch from March to September 979, the standard algorithm identies two isolated outliers in April and September. On the other hand, the adaptive algorithm substantially increases the outlying posterior probabilities for March and April of 979 and, hence, changes an isolated outlier into a patch of three outliers. The isolated outlier in September remains unchanged. Based on the estimated outlier sizes in Table 3, the standard Gibbs algorithm seems to encounter 2
Parameter 2 3 a Initial model.28.6.9.5.9 Standard GS.8.83.9 -.5.62 Adaptive GS.8.78.23 -.4.62 Table 2: Estimated parameter values with the initial model and those obtained by the standard and the adaptive Gibbs sampling algorithms. Date 5/68 6/68 7/68 /73 9/76 /77 3/77 8/77 Standard GS Outlier probability.3.74.66.54.99.53.99.99 Outlier size -.7.7.2 -.9 -.22 -.8 -.23.2 Adaptive GS Outlier probability.62.45.27.37.99.24.99. Outlier size -.5.4.3 -.6 -.2 -.3 -.2.2 Date 2/78 3/78 9/78 3/79 4/79 5/79 9/79 Standard GS Outlier probability....5.36.. Outlier size -.26 -.27.37.9.6 -.4.26 Adaptive GS Outlier probability....92.89.. Outlier size -.28 -.28.37.24.23 -.28.32 Table 3: Posterior probabilities for each data point tobeanoutlierby the standard and the adaptive Gibbs sampling algorithms. Estimated outlier sizes by the standard and the adaptive Gibbs sampling algorithms. severe masking eects for March and April of 979. Figure 8 shows the time plot of posterior means of residuals obtained by adaptive Gibbs sampling and should be compared with the residuals we obtained without outlier adjustment in Figure 6. The results of this example demonstrate that (a) outlier patches occur frequently in practice and (b) the standard Gibbs sampling to outlier detection may encounter severe masking and swamping eects while the adaptive Gibbs sampling performs reasonably well. 22
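The patch-specification step used above (first-run posterior probabilities, thresholds c1 and c2, a search window of width 2p, and a cap on patch length) can be sketched as follows. The exact boundary-search rule is the one given in Section 3 of the paper; the extension logic below is a simplified assumption, not the authors' precise algorithm.

```python
import numpy as np

def candidate_patches(p_hat, c1=0.5, c2=0.3, window=6, max_len=11):
    """Sketch of a patch-boundary search from first-run posterior outlier
    probabilities. Anchor a candidate patch at any point with p_hat >= c1,
    then extend its boundaries within a window of width 2p (window=6 for
    an AR(3)) over neighbours with p_hat >= c2, capping the patch length.
    Overlapping candidates are merged."""
    p_hat = np.asarray(p_hat)
    anchors = np.flatnonzero(p_hat >= c1)
    patches = []
    for t in anchors:
        lo = hi = t
        # extend left and right while nearby points look suspicious
        while lo - 1 >= max(0, t - window) and p_hat[lo - 1] >= c2:
            lo -= 1
        while hi + 1 <= min(len(p_hat) - 1, t + window) and p_hat[hi + 1] >= c2:
            hi += 1
        hi = min(hi, lo + max_len - 1)     # cap the patch length
        if not patches or lo > patches[-1][1]:
            patches.append((lo, hi))
        else:                              # merge with the previous patch
            patches[-1] = (patches[-1][0], max(patches[-1][1], hi))
    return patches

probs = [0.1, 0.4, 0.9, 0.6, 0.35, 0.1, 0.05, 0.8, 0.1]
print(candidate_patches(probs))            # -> [(1, 4), (7, 7)]
```

The point of the lower threshold c2 is precisely the masking effect: interior points of a patch can have moderate posterior probabilities in the first run, so they are swept into the candidate patch even though they fall below the detection cut-off c1.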
Figure 7: Posterior probability for each data point to be an outlier, with the standard (top) and the adaptive (bottom) Gibbs sampling.

Figure 8: Mean residuals with the adaptive Gibbs sampler.
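Posterior probabilities such as those plotted in Figure 7 are simply averages of the sampled outlier indicators over Gibbs iterations. A minimal sketch, with synthetic indicator draws standing in for real sampler output:

```python
import numpy as np

# Synthetic Gibbs output: delta_draws[i, t] is the sampled outlier
# indicator for time t at iteration i (real draws would come from the sampler).
rng = np.random.default_rng(1)
n_iter, n_obs = 1000, 8
true_p = np.array([0.02, 0.02, 0.9, 0.95, 0.9, 0.02, 0.6, 0.02])
delta_draws = rng.random((n_iter, n_obs)) < true_p

# Posterior probability that each point is an outlier: the Monte Carlo
# mean of the indicator draws, p_hat_t = (1/N) * sum_i delta_t^(i).
p_hat = delta_draws.mean(axis=0)

# With the paper's .5 cut-off, flag the outliers.
flagged = np.flatnonzero(p_hat >= 0.5)
print(flagged)
```

With the fixed seed above, the flagged positions are those whose underlying indicator probability exceeds the cut-off.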
Date          1/68   2/68   3/68   4/68   5/68   6/68   7/68
Standard GS    .35    .24    .7     .293   .38    .738   .66
Adaptive GS    .3     .5     .4     .489   .62    .447   .269

Date          9/76   10/76  11/76  12/76  1/77   2/77   3/77
Standard GS    .999   .4     .3     .47    .526   .97    .99
Adaptive GS    .999    .     .3     .59    .244   .36    .989

Date          3/79   4/79   5/79   6/79   7/79   8/79   9/79
Standard GS    .5     .356   1.     .23    .25    .36    1.
Adaptive GS    .98    .893   1.     .56     .      .66    1.

Table 4: Posterior outlier probabilities for each data point in the three largest possible outlier patches.

ACKNOWLEDGEMENTS

The research of Ana Justel was financed by the European Commission under the Training and Mobility of Researchers Programme (TMR). Ana Justel and Daniel Peña acknowledge research support provided by DGES (Spain) under grants PB97-2 and PB96-, respectively. The work of Ruey S. Tsay is supported by the National Science Foundation, the Chiang Chien-Kuo Foundation, and the Graduate School of Business, University of Chicago.
APPENDIX: PROOF OF THEOREM 1

The conditional distribution of $\delta_{j,k}$ given the sample and the other parameters is
$$P(\delta_{j,k} \mid y, \theta, \beta_{j,k}) \propto f(y \mid \theta, \delta_{j,k}, \beta_{j,k})\, \alpha^{s_{j,k}} (1-\alpha)^{k-s_{j,k}}. \qquad (4.1)$$
The likelihood function can be factorized as
$$f(y \mid \theta, \delta_{j,k}, \beta_{j,k}) = f(y_{-p+1}^{\,j-1} \mid \theta)\, f(y_j^{\,T_{j,k}} \mid y_{-p+1}^{\,j-1}, \theta, \delta_{j,k}, \beta_{j,k})\, f(y_{T_{j,k}+1}^{\,n} \mid y_{-p+1}^{\,T_{j,k}}, \theta, \delta_{j,k}, \beta_{j,k}),$$
where $y_j^k = (y_j, \ldots, y_k)$. Only $f(y_j^{\,T_{j,k}} \mid y_{-p+1}^{\,j-1}, \theta, \delta_{j,k}, \beta_{j,k})$ depends on $\delta_{j,k}$, and it is the product of the conditional densities
$$f(y_j \mid y_{-p+1}^{\,j-1}, \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{(e_j(0) - \delta_j\beta_j)^2}{2\sigma_a^2} \right\},$$
$$\vdots$$
$$f(y_{j+k-1} \mid y_{-p+1}^{\,j+k-2}, \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{(e_{j+k-1}(0) - \delta_{j+k-1}\beta_{j+k-1} + \pi_1\delta_{j+k-2}\beta_{j+k-2} + \cdots + \pi_{k-1}\delta_j\beta_j)^2}{2\sigma_a^2} \right\},$$
$$f(y_{j+k} \mid y_{-p+1}^{\,j+k-1}, \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{(e_{j+k}(0) + \pi_1\delta_{j+k-1}\beta_{j+k-1} + \cdots + \pi_k\delta_j\beta_j)^2}{2\sigma_a^2} \right\},$$
$$\vdots$$
$$f(y_{T_{j,k}} \mid y_{-p+1}^{\,T_{j,k}-1}, \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{(e_{T_{j,k}}(0) + \pi_{T_{j,k}-j-k+1}\delta_{j+k-1}\beta_{j+k-1} + \cdots + \pi_{T_{j,k}-j}\delta_j\beta_j)^2}{2\sigma_a^2} \right\}.$$
Hence the likelihood function can be expressed as
$$f(y \mid \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{1}{2\sigma_a^2}\left[ \sum_{t=j}^{j+k-1}\Big(e_t(0) + \sum_{i=0}^{t-j} \pi_i\delta_{t-i}\beta_{t-i}\Big)^{\!2} + \sum_{t=j+k}^{T_{j,k}}\Big(e_t(0) + \sum_{i=t-j-k+1}^{t-j} \pi_i\delta_{t-i}\beta_{t-i}\Big)^{\!2} \right]\right\} \qquad (4.2)$$
and the residual $e_t(\delta_{j,k}\beta_{j,k})$ is given by
$$e_t(\delta_{j,k}\beta_{j,k}) = \begin{cases} e_t(0) + \sum_{i=0}^{t-j} \pi_i\delta_{t-i}\beta_{t-i}, & t = j,\ldots,j+k-1,\\[4pt] e_t(0) + \sum_{i=t-j-k+1}^{t-j} \pi_i\delta_{t-i}\beta_{t-i}, & t > j+k-1, \end{cases}$$
where $\pi_0 = -1$, $\pi_i = \phi_i$ for $i = 1,\ldots,p$, and $\pi_i = 0$ for $i < 0$ and $i > p$. Therefore (4.2) can be written as
$$f(y \mid \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{1}{2\sigma_a^2} \sum_{t=j}^{T_{j,k}} e_t(\delta_{j,k}\beta_{j,k})^2 \right\}.$$
Then, replacing in (4.1), we obtain (3.3) for any configuration of the vector $\delta_{j,k}$. □

The conditional distribution of $\beta_{j,k}$ given the sample and the other parameters is
$$P(\beta_{j,k} \mid y, \theta, \delta_{j,k}) \propto f(y \mid \theta, \delta_{j,k}, \beta_{j,k})\, P(\beta_{j,k}).$$
Using (4.2),
$$f(y \mid \theta, \delta_{j,k}, \beta_{j,k}) \propto \exp\left\{ -\frac{1}{2\sigma_a^2} \sum_{t=j}^{T_{j,k}} \big(e_t(0) + \pi_{t-j}' D_{j,k}\beta_{j,k}\big)^2 \right\}.$$
Therefore,
$$P(\beta_{j,k} \mid y, \theta, \delta_{j,k}) \propto \exp\left\{ -\frac{1}{2\sigma_a^2} \sum_{t=j}^{T_{j,k}} \big(e_t(0) + \pi_{t-j}' D_{j,k}\beta_{j,k}\big)^2 \right\} \exp\left\{ -\frac{1}{2\sigma_\beta^2}\, \beta_{j,k}'\beta_{j,k} \right\}$$
$$\propto \exp\left\{ -\frac{\sigma_a^{-2}}{2} \left[ 2\Big(\sum_{t=j}^{T_{j,k}} e_t(0)\,\pi_{t-j}'\Big) D_{j,k}\beta_{j,k} + \beta_{j,k}'\Big( D_{j,k} \sum_{t=j}^{T_{j,k}} \pi_{t-j}\pi_{t-j}'\, D_{j,k} + \frac{\sigma_a^2}{\sigma_\beta^2}\, I \Big) \beta_{j,k} \right]\right\}$$
$$\propto \exp\left\{ -\frac{1}{2} \big(\beta_{j,k} - \beta_{j,k}^{*}\big)' \Lambda_{j,k}^{-1} \big(\beta_{j,k} - \beta_{j,k}^{*}\big) \right\},$$
where $\Lambda_{j,k}$ and $\beta_{j,k}^{*}$ are defined in (3.4) and (3.5), respectively. □
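The completion-of-square step above, a Gaussian likelihood times a Gaussian prior yielding a Gaussian with the moments of (3.4)-(3.5), can be checked numerically. This sketch is not part of the paper: the pi-weights and residuals below are random stand-ins, and the posterior mean is cross-checked against the equivalent ridge-regression solution.

```python
import numpy as np

rng = np.random.default_rng(0)
k, T = 4, 9                      # patch length and number of residual terms
sigma_a, sigma_b = 1.0, 3.0      # noise sd and prior sd of the outlier sizes
Pi = rng.normal(size=(T, k))     # row t holds the pi-weights multiplying beta
e0 = rng.normal(size=T)          # residuals e_t(0), before outlier adjustment
delta = np.array([1.0, 0.0, 1.0, 1.0])
D = np.diag(delta)               # D_{j,k} = diag of the outlier indicators

# Posterior moments implied by completing the square:
#   Lambda^{-1} = sigma_a^{-2} D (sum pi pi') D + sigma_b^{-2} I
#   beta*       = -Lambda sigma_a^{-2} D (sum pi e_t(0))
Lam_inv = D @ (Pi.T @ Pi) @ D / sigma_a**2 + np.eye(k) / sigma_b**2
beta_star = -np.linalg.solve(Lam_inv, D @ Pi.T @ e0 / sigma_a**2)

# Cross-check: the same beta* minimizes the penalized sum of squares
#   sigma_a^{-2} sum_t (e_t(0) + pi_t' D beta)^2 + sigma_b^{-2} beta'beta,
# i.e. a ridge regression with design Pi @ D and response -e0.
X = np.vstack([Pi @ D / sigma_a, np.eye(k) / sigma_b])
z = np.concatenate([-e0 / sigma_a, np.zeros(k)])
beta_ridge, *_ = np.linalg.lstsq(X, z, rcond=None)
print(np.allclose(beta_star, beta_ridge))   # True
```

Because the prior adds the term I/sigma_b^2, Lambda^{-1} is positive definite even when some indicators delta are zero, so the block draw of all k sizes is always well defined.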
REFERENCES

Abraham, B. and Box, G.E.P. (1979). Bayesian analysis of some outlier problems in time series. Biometrika 66, 229-236.

Barnett, G., Kohn, R. and Sheather, S. (1996). Bayesian estimation of an autoregressive model using Markov chain Monte Carlo. Journal of Econometrics 74, 237-254.

Barnett, G., Kohn, R. and Sheather, S. (1997). Robust Bayesian estimation of autoregressive-moving average models. Journal of Time Series Analysis 18, 11-28.

Bruce, A.G. and Martin, D. (1989). Leave-k-out diagnostics for time series (with discussion). Journal of the Royal Statistical Society, Series B 51, 363-424.

Chang, I. and Tiao, G.C. (1983). Estimation of time series parameters in the presence of outliers. Technical Report 8, Statistics Research Center, University of Chicago.

Chang, I., Tiao, G.C. and Chen, C. (1988). Estimation of time series parameters in the presence of outliers. Technometrics 30, 193-204.

Chen, C. and Liu, L. (1993). Joint estimation of model parameters and outlier effects in time series. Journal of the American Statistical Association 88, 284-297.

Chow, G.C. (1960). A test for equality between sets of coefficients in two linear regressions. Econometrica 28, 591-605.

Fox, A.J. (1972). Outliers in time series. Journal of the Royal Statistical Society, Series B 34, 350-363.

Hills, S.E. and Smith, A.F.M. (1992). Parameterization issues in Bayesian inference. In Bayesian Statistics 4 (Eds. J.M. Bernardo, J. Berger, A.P. Dawid and A.F.M. Smith), 641-649. Oxford University Press.

Justel, A. and Peña, D. (1996). Gibbs sampling will fail in outlier problems with strong masking. Journal of Computational and Graphical Statistics 5, 176-189.

McCulloch, R.E. and Tsay, R.S. (1994). Bayesian analysis of autoregressive time series via the Gibbs sampler. Journal of Time Series Analysis 15, 235-250.

Peña, D. (1987). Measuring the importance of outliers in ARIMA models. In New Perspectives in Theoretical and Applied Statistics (Eds. Puri et al.), 109-118. John Wiley, New York.

Peña, D. (1990). Influential observations in time series. Journal of Business & Economic Statistics 8, 235-241.

Peña, D. and Maravall, A. (1991). Interpolation, outliers and inverse autocorrelations. Communications in Statistics, Theory and Methods 20, 3175-3186.

Sánchez, M.J. and Peña, D. (1997). The identification of multiple outliers in ARIMA models. Working Paper 97-76, Universidad Carlos III de Madrid.

Tsay, R.S. (1986). Time series model specification in the presence of outliers. Journal of the American Statistical Association 81, 132-141.

Tsay, R.S. (1988). Outliers, level shifts, and variance changes in time series. Journal of Forecasting 7, 1-20.

Tsay, R.S., Peña, D. and Pankratz, A.E. (1998). Outliers in multivariate time series. Working Paper 98-96, Universidad Carlos III de Madrid.