Discrimination and Classification of Nonstationary Time Series Using the SLEX Model

Size: px

Start display at page:

Download "Discrimination and Classification of Nonstationary Time Series Using the SLEX Model"

Damian Reeves
6 years ago
Views:

1 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. Discrimination and Classification of Nonstationary ime Series Using the SLEX Model 60 Hsiao-Yun HUANG, Hernando OMBAO, and David S. SOFFER 4 63 Statistical discrimination for nonstationary random processes is important in many applications. Our goal was to develop a discriminant scheme that can extract local features of the time series, is consistent, and is computationally efficient. Here, we propose a discriminant scheme based on the SLEX (smooth localized complex exponential) library. he SLEX library forms a collection of Fourier-type bases 8 that are simultaneously orthogonal and localized in both time and frequency domains. hus, the SLEX library has the ability to extract 67 9 local spectral features of the time series. he first step in our procedure, which is the feature extraction step based on work by Saito, is to 68 0 find a basis from the SLEX library that can best illuminate the difference between two or more classes of time series. In the next step, we 69 construct a discriminant criterion that is related to the Kullback Leibler divergence between the SLEX spectra of the different classes. he 70 discrimination criterion is based on estimates of the SLEX spectra that are computed using the SLEX basis selected in the feature extraction step. We show that the discrimination method is consistent and demonstrate via finite sample simulation studies that our proposed method 3 performs well. Finally, we apply our method to a seismic waves dataset with the primary purpose of classifying the origin of an unknown 72 4 seismic recording as either an earthquake or an explosion KEY WORDS: Kullback Leibler divergence; Likelihood ratio; Nonstationary random process; Seismic time series; SLEX library INRODUCION he extension of classical pattern-recognition techniques to experimental time series is a problem of great practical interest. A series of observations indexed in time often produces a pattern that may form a basis for discriminating between different classes of events. As an example, Figure shows regional (00 2,000 km) recordings of a typical Scandinavian earthquake, a mining explosion, and an event of unknown origin measured by stations in Scandinavia. he unknown event took place near the Russian nuclear test facility in Novaya Zemlya. he problem of discriminating between mining explosions and earthquakes is a reasonable proxy for the problem of discriminating between nuclear explosions and earthquakes. his latter problem is of critical importance for monitoring a comprehensive test-ban treaty. An extensive discussion of this problem can be found in Shumway and Stoffer (2000), chapter 5; the data are available on the website for the text, and its mirrors. ime series classification problems are not restricted to geophysical applications, but occur under many and varied circumstances in other fields. raditionally, detecting a signal embedded in a noise series has been analyzed in the engineering literature by statistical pattern recognition techniques. he historical approaches to the problem of discriminating among different classes of time series can be divided into two distinct categories. he optimality approach, as found in the engineering and statistics literature, makes specific Gaussian assumptions about the probability density functions of the separate groups and then develops solutions that satisfy welldefined minimum error criteria. ypically, in the time series case, we might assume the difference between classes is expressed through differences in the theoretical mean and covariance functions, and use likelihood methods to develop an optimal classification function. A second class of techniques, which might be described as a feature extraction approach, proceeds more heuristically by looking at quantities that tend to be good visual discriminators for well-separated populations and have some basis in physical theory or intuition. Less attention is paid to finding functions that are approximations to some welldefined optimality criterion. When analyzing time series data, both time domain and frequency domain approaches are typically available; this is true in the case of discrimination and classification. For relatively short stationary series, a time domain approach that follows conventional multivariate discriminant analysis is viable; an example is given in Shumway and Stoffer (2000), example 5.. For longer stationary time series, a frequency domain approach is computationally easier because it reduces the dimension of the problem. For nonstationary time series, the reduction in dimension is essential for computational efficiency. here has been extensive research on the discrimination problem of stationary time series. For example, Shumway (982) reviewed many different discriminant methods for time series, including time domain and frequency domain approaches. Kakizawa, Shumway, and aniguchi (998) proposed the method of discrimination for multivariate time series by using the Kullback Leibler discrimination information and the Chernoff information measure. Pulli (996) considered using the ratio of spectra to process the discriminant problem between earthquakes and explosions. hese methods are developed for stationary time series. A discussion of the current state of the art for stationary series can be found in Shumway and Stoffer (2000), section 5.7. In many practical problems, however, the time series are realizations of nonstationary random processes. For example, it is clear that the seismic waves displayed in Figure have variance and spectrum that change over time. Various models of 22 8 Hsiao-Yun Huang is Assistant Professor, Department of Statistics and Information Science, Fu-Jen Catholic University, aiwan ( stat202@mails. erature. Priestley (965) was the first to introduce the concept 52 nonstationary random processes have been proposed in the lit- 53 fju.edu.tw). Hernando Ombao is Assistant Professor, Department of Statistics, University of Illinois, Champaign, IL 6820 ( ombao@uiuc.edu). of a Cramér representation with time-varying transfer function. 2 David S. Stoffer is Professor, Department of Statistics, University of Pittsburgh, his idea was later refined by Dahlhaus (997), who established Pittsburgh, PA 5260 ( stoffer@pitt.edu). his research was supported 56 in part by National Science Foundation grant 0025 and National Institute of 5 57 Mental Health grant 62298, and it is based on Hsiao-Yun Huang s Ph.D. Dissertation, University of Pittsburgh. he authors thank Professor aniguchi for Journal of the American Statistical Association 0 American Statistical Association 6 supplying an advanced copy of Sakiyama and aniguchi (2003), and thank the???? 0, Vol. 0, No. 0, heory and Methods 59 associate editor and reviewers for suggestions that improved the presentation. DOI 0.98/00 8

2 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 2 2 Journal of the American Statistical Association,???? Figure. Example of Seismic Recordings an asymptotic framework for locally stationary processes. Recently, Ombao, Raz, von Sachs, and Guo (2002) introduced the SLEX (smooth localized complex exponentials) model of a nonstationary random process. he SLEX model uses the SLEX vectors (orthogonal and localized Fourier vectors) as stochastic building blocks in the Cramér representation. Sakiyama and aniguchi (2003) and Shumway (2003) discussed discrimination for Dahlhaus locally stationary time series. We approach the problem of discrimination and classification of nonstationary time series using the SLEX model. One very important consideration in using the SLEX library is that it employs computationally efficient algorithms. In particular, it uses the fast Fourier transform algorithm to compute the SLEX transform and the best basis algorithm (BBA) of Coifman and Wickerhauser (992) to search for the best basis for discrimination between classes. Our discriminant scheme contains two parts: a feature extraction part and a classification part. he feature extraction step consists of selecting a basis from the SLEX library that illuminates the most differences between the groups. We follow Saito (994), who proposed a criterion that is based on the Kullback Leibler divergence. his method is aimed at selecting the basis from any library of orthogonal bases that can best present the time series as clouds in space with maximal distance (distance is between the normalized time-varying spectrum). Because the SLEX functions are local, they are able to extract local spectral features of the time series. After selecting a basis using training datasets, we compute the SLEX periodogram of the time series that we need to classify. We propose a classification criterion which we show to be asymptotically consistent. he essential idea is that an observed time series is assigned to a class (population or group) c if the Kullback Leibler divergence between the estimated spectrum (SLEX periodogram computed from the data) and the spectrum of c is smaller than that between the estimated spectrum and the spectra of any other class. In the next section we present our procedure for selecting the best local discriminant basis from the SLEX library. In Section 3, we discuss our discriminant criterion. Finally, in Section 4, we present some simulation studies and an analysis of seismic recordings. 2. SELECING HE BES LOCAL DISCRIMINAN BASIS FROM HE SLEX LIBRARY he first step in our proposed method is to choose a basis from the SLEX library, which is a collection of bases; each basis consists of orthogonal and localized basis vectors. he SLEX vector is a time and frequency localized orthogonal generalization of the Fourier (complex exponential) basis vectors. It is ideal for analyzing nonstationary time series. Ombao, Raz, von Sachs, and Mallow (200) used the SLEX to estimate the spectral density matrix of a bivariate nonstationary time series. Moreover, Ombao et al. (2002) introduced a model of a nonstationary random process that has a spectral representation in terms of the SLEX. he SLEX library is a rich collection of localized bases. hus it is able to extract local spectral features of the time series and is well suited to the problem of discrimination and classification of nonstationary time series. 2. he SLEX Basis Vectors Here, we give a brief overview of the SLEX transform. For details, we refer the reader to Ombao et al. (200) for specific applications to time series and to Wickerhauser (994) for a broader discussion on local trigonometric transforms. Fourier basis vectors are perfectly localized in frequency and hence are ideal for representing stationary time series. However, they cannot adequately represent nonstationary time series, that is, the time series with spectra that change over time. In this article, we will use the SLEX basis vectors, which are simultaneously orthogonal and localized in time and frequency. hey are constructed by applying a projection operator on the Fourier vectors. It turns out that the action of a projection operator on any periodic vector is identical to applying two specially constructed smooth windows to the Fourier vectors. A SLEX basis vector ψ S,ω (t) that has support on the discrete time block

JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 3 Huang, Ombao, and Stoffer: Discrimination and Classification Using the SLEX Model 3 60 4 63 Figure 2.

We denote the j = /2 j. he SLEX 70 S ={α 0 ɛ +,...,α ɛ} and oscillates at frequency ω has the form ( ψ S,ω (t) = S,+ (t) exp i2πω t ) S ( + S, (t) exp i2πω t ), (2.

3 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 3 Huang, Ombao, and Stoffer: Discrimination and Classification Using the SLEX Model Figure 2. Real and Imaginary Parts of a SLEX Waveform With Support on Block Indexed by ime t {52,...,,023} With ɛ = 32. block b on level j to be S(j,b) and M ɛ = /2 J +, which is the same for all levels j. We denote the j = /2 j. he SLEX 70 S ={α 0 ɛ +,...,α ɛ} and oscillates at frequency ω has the form ( ψ S,ω (t) = S,+ (t) exp i2πω t ) S ( + S, (t) exp i2πω t ), (2.) S where ω [ /2, /2], S =α α 0, and ɛ is a small overlap between two consecutive time blocks (the size of ɛ is given in Sec. 2.2). he windows S,+ (t) and S, (t) are the two particularly constructed smooth windows, which take the form ( ) ( ) t S,+ (t) = r 2 α0 r 2 α t, (2.2) ɛ ɛ ( ) ( ) t α0 α0 t S, (t) = r r ɛ ɛ ( ) ( ) t α α t r r, (2.3) ɛ ɛ 22 8 where r( ) is called a rising cut-off function. In our implementation, we use the sine rising cut-off function ( ) π r(u)= sin ( + u), where u [, ]. (2.4) 4 34 t 93 Other types of rising cut-off functions may be used. See Wickerhauser (994) for details. he SLEX basis vectors are defined at dyadic blocks that overlap. One important property of the SLEX vectors is that they remain orthogonal despite the overlap. Moreover, in our implementation, we use the fundamental frequencies ω k = k/ S, k = S /2 +,..., S /2, where S is the length of time block S. An example of a SLEX basis vector is given in Figure he SLEX Library he SLEX library is a collection of bases, each having orthogonal vectors with time support that is obtained by segmenting the time series, of length, in a dyadic manner. he library is constructed by first specifying the finest resolution level J (smallest time block has length /2 J ). At resolution level j (where j = 0,...,J), the time series is divided into 2 j overlapping blocks. he amount of overlap is set to vectors on block S(j,b) are allowed to oscillate at different fundamental frequencies ω k = k/m j (where k = M j /2 +,...,M j /2). o illustrate this further, consider the bottom of Figure 3, where the SLEX library is constructed by setting J = 2. In this example, the SLEX library consists of five orthogonal bases and we enumerate the support of these basis vectors: () S(0, 0); (2) S(, 0) S(, ); (3)S(2, 0) S(2, ) S(2, 2) S(2, 3); (4) S(, 0) S(2, 2) S(2, 3); (5)S(2, 0) S(2, ) S(, ). Clearly, the SLEX basis vectors are allowed to have different lengths of support (or different time and frequency resolutions). Moreover, each basis captures local spectral features of the time series that are useful for discrimination and classification. In the latter part of this section, we discuss our criterion for selecting a basis from the SLEX library. 2.3 he SLEX Coefficients and Periodograms he SLEX transform consists of the set of coefficients that correspond to all the SLEX vectors defined in the library. he SLEX coefficients on block S = S(j,b) are defined by θ S,k = X t, ψ S,ωk (t), (2.5) Mj where the fundamental frequency is ω k = k/m j and k = M j /2 +,...,M j /2. he SLEX periodogram, an analog of the Fourier periodogram, is defined to be α S,k = θ S,k 2. (2.6) 2.4 he Discriminant Criterion We use the best local discriminant criterion that was described by Saito (994) as a criterion for selecting the best basis. In the actual search over the SLEX library, we can employ either the algorithm used by Saito or the BBA of Coifman and Wickerhauser (992). Both of these are computationally efficient algorithms, each of order O(). Intuitively, a best basis for discrimination is that which gives the largest disparity between two or more different classes. One well-known measure of disparity is relative entropy (also known as cross-entropy, Kullback Leibler divergence, or I divergence), which we now define. Let p ={p i } n i= and q = {q i } n i= be two nonnegative sequences that satisfy i p i = i q i =. he relative entropy between p and q is defined to be 56 5 Figure 3. he SLEX Library Constructed With Level J = 2. he n 58 shaded blocks S(, 0) S(2, 2) S(2, 3) represent one basis (out of 7 I(p, q) = p i log p i. 59 the five in this library). q i 8 i=

4 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 4 4 Journal of the American Statistical Association,???? 0 By convention, we define log 0 =,log(x/0) =+ for x>0, and 0 (± ) = 0. Relative entropy satisfies I(p, q) 0 and the equality holds if and only if p q. his quantity is not a metric because it is not commutative and does not satisfy the triangle inequality. However, it measures the discrepancy of p from q. Moreover, relative entropy, an additive measure in the sense that I(p, q) = n i= I(p i,q i ), enables us to use this divergence measure in the best basis algorithm. For C 2 classes of time series, we define C 70 I(p,...,p C ) = C I(p l, p k ). Now, let {x c i }N c i= be a set of N c training signals belonging to class c, forc =,...,C. hen the time frequency energy map of class c, denoted by Ɣ c, is a table of real values specified by the triplet (j,b,k)as / 78 N c Nc criterion at the parent block is greater than or equal to the corresponding sum at the children blocks, then the children blocks Ɣ c (j,b,k) α (j,b),k i x c i 2, (2.7) i= i= do not separate the populations better than the parent block, and 22 8 where α i 23 (j,b),k 82 is the periodogram of the ith time series computed at level j = 0,...,J in block b = 0,...,2 j, and frequency ω k,wherek = M j /2 +,...,M j /2. Note that the time frequency energy map for a class c is in fact average normalized periodograms, that is, for any basis B in each class c, (j,b,k) B Ɣ c (j,b,k) =. he time frequency energy map gives the location (in time and frequency) where the energy in a particular class is mostly concentrated. hus, our procedure for classification is based on the time frequency energy concentration of the signals. After computing the time frequency energy map of each class, we then calculate the divergence between the C signals at block S(j,b) as 37 M 96 j /2 j,b = I[Ɣ (j,b,k),...,ɣ C (j,b,k)]. 39 k= M j /2+ 98 In this case, j,b gives a local measure of disparity between 42 the time frequency energy maps of the C classes in each time 0 X t, = M i /2 θ i,k, ψ i,k (t)z i,k, (2.8) 43 block S(j,b). Our procedure selects the basis that consists of Mi 02 Si B k= M i /2+ blocks S(j,b), where, essentially, the distance between classes, j,b, is maximized. 2.5 he Algorithm for Selecting the Best Basis In this section, we give the specific steps for selecting the best basis. Saito (994) developed the local discriminant basis algorithm (LDBA), which we apply to the SLEX library. Let A j,b represent the best basis (maximizer of the local discriminant criterion) and let B j,b denote the SLEX vectors at level j in block b. he algorithm for selecting the best basis is as follows: 56 5 Step 0. Specify the maximum depth of decomposition J. Step. Construct time frequency energy maps Ɣ c for c =,...,C. Step 2. Set A J,b = B J,b. Determine the A j,b for b = 0,...,2 j andj = J,...,0 by the following rule: If j,b j+,2b + j+,2b+,thena j,b = B j,b, else A j,b = A j+,2b A j+,2b+ and set j,b = j+,2b + j+,2b+. Step 3. Extract the blocks S(j,b ) that are defined by A(j, b) in the previous step. he SLEX periodograms computed at the blocks defined by the best basis will be used to construct the classification rule. he algorithm essentially compares a parent block against its children blocks; for example, the parent block S(j,b) ver- 3 l= k=l+ 72 sus its children blocks S(j +, 2b) S(j +, 2b + ). Ifthe value of the local discriminant criterion at the parent block is smaller than the corresponding sum at the children blocks, then the children blocks separate the populations better than the parent block. hus the children blocks are chosen in favor of their parent in the best basis B. If the value of the local discriminant the parent block would be selected in favor of its children. Finally, we point out the connection between LDBA and the BBA. he basis that maximizes the I divergence is equivalent to the basis that minimizes the negative I divergence. One may use the BBA to search for the best local discriminant basis in the SLEX library. hus, our procedure is computationally efficient and can handle massive time series datasets. 2.6 he SLEX Model We now describe the SLEX model for a nonstationary time series X t, for t = 0,...,. Let B be the best local discriminant basis selected from the SLEX library and let S i be the blocks in B. Note that S i is a particular dyadic segmentation of the time series. Define M i to be the number of points on the block S i.letj be the highest time resolution level in B, that is, the smallest time block in B has length /2 J. he frequencies defined on S i are the grid frequencies ω ki = k i /M i for k i = M i /2 +,...,M i /2. he spectral representation of X t, is where θ i,k, is the transfer function on time block S i and frequency k; ψ i,k is the SLEX basis vector oscillating at frequency k and having support at block S i ;andz i,k is an orthonormal random process with finite fourth moment he SLEX Spectrum. he SLEX spectrum is defined analogously to the spectrum of a stationary process. It is the square of the modulus of the time-varying transfer function. It is defined on rescaled time [0, ]. Letu be in an interval I [0, ] such that [u ] is in some time block S i on B.he SLEX spectrum is f (u, ω k ) = θ i,k, 2 [u ] S i.note that for a fixed frequency ω k, f (u, ω k ) is constant within each time subinterval (or block). his is because for each fixed, the SLEX model gives an explicit partitioning of the time frequency plane as determined by the blocks i S i in the basis B. 59 8

5 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 5 Huang, Ombao, and Stoffer: Discrimination and Classification Using the SLEX Model 5 Ombao et al. (2002) developed asymptotic theory on the SLEX model. In this article, we discuss only the ideas that are essential in proving the consistency of our discrimination 60 4 procedure. As, f (u, ω k ) approaches the limiting 63 5 spectrum f(u,ω), which is independent of. he limiting = M i /2{ } α i,k log f (u i,ω k ) +. (2.0) 64 f (u i,ω k ) 6 Si B k=0 65 log-spectrum is defined subsequently. As increases, the number of observations in each block also increases. However, the number of blocks in B should also be allowed to increase if we want to model a limiting spectrum that, as a function of time, has an infinite wavelet expansion. hus, we need to allow J to increase as increases, but at a rate that is slower than K = log 2 ( ). he infinite wavelet expansion in Ombao et al. (2002) was defined on log f(u,ω) instead of f(u,ω) to guarantee positivity of the spectrum. Let the Haar wavelet basis of L 2 ([0, ]) be {ϕ0,0 } {ϕ l,m} l 0,m 0,whereϕ is the father Haar wavelet and ϕ is the mother wavelet. he infinite wavelet expansion of the log-limiting SLEX spectrum is log f(u,ω)= lim log f (u, ω) ( = lim β,0 (ω)ϕ0,0 (u) 22 is consistent, that is, the probability of incorrectly classifying a 8 24 J 2 l 83 + β l,m (ω)ϕ l,m (u) ), (2.9) where the coefficients are β,0 (ω) = log f(u,ω)ϕ0,0 (u) du, 0 respectively. Our rule is to classify the time series x into if p [ α(x)] p 2 [ α(x)]. In terms of the log-likelihood ratio, the β l,m (ω) = log f(u,ω)ϕ l,m (u) du Ombao et al. (2002) demonstrated that for a given, log f(u,ω), and SLEX basis B with finest time resolution level J,logf (u, ω) arises as a finite wavelet expansion of log f(u,ω)truncated to a level J. As a final remark, Ombao et al. (2002) showed that the SLEX model and the Dahlhaus (997) model of a locally stationary process are asymptotically mean-squared equivalent. he equivalence implies nonstationary processes that have smoothly time-varying spectrum that can be modeled using the SLEX model. For example, one may model autoregressive moving average processes with parameters that vary smoothly over time by a SLEX model with a spectrum that varies with time but is piecewise constant within small time blocks he Likelihood of the SLEX Periodograms. We now 48 derive the likelihood of the SLEX periodograms. When the orthonormal random increment processes, z i,k,arecomplex = M i /2{ } log f (u α i,k i,ω k ) + 08 f 50 Si B k=0 (u (3.2) i,ω k ) 09 Gaussian, then the SLEX coefficients, θ i,k, are independent complex Gaussian. Moreover, the SLEX periodograms α i,k = θ i,k 2 are distributed as f (u i,ω k )V i,k,wherev i,k are independent over i and k 0 and are distributed as χ2 2/2when 54 k 0,M i /2, and χ 3 2 when k = 0,M i/2. Because the periodograms at each level j and block b are symmetric about the = M i /2{ } 55 log f 2 (u α i,k i,ω k ) + 4 f 2 56 Si B k=0 (u. (3.3) i,ω k ) 5 zero frequency (k = 0), we consider only the periodograms at nonnegative frequencies. Denote α ={ α i,k }.Letp( α) be the joint density of the periodograms under the SLEX process and let f be the spectrum. hen the log-likelihood of the periodograms is l(f α) In the next section, we develop a classification rule that is based on the log-likelihood ratio. 3. HE CLASSIFICAION RULE Let x ={x 0,...,x } be a time series that we wish to classify as a realization of either or 2 with time-varying spectra f or f 2, respectively. We consider only the case where there are C = 2 classes of time series, although our method is applicable for C>2. Our classification criterion is based on a frequency domain approach, that is, we use the spectrum as the basic signature for classifying time series. We extract the SLEX periodograms computed from the blocks in the basis selected in the feature extraction step. In this section, we derive our criterion using both the log-likelihood ratio and the Kullback Leibler divergence. We then show that this criterion time series goes to zero as the length of the time series. 3. Classification Rule as Log-Likelihood Ratio 26 l=0 m=0 he basic idea of our approach is as follows. We first ex- 85 tract the SLEX periodograms, which we denote as α(x), from 28 the blocks of the best basis selected. We then form the like- 87 lihood under and 2, denoted as p [ α(x)] and p 2 [ α(x)], discriminant statistic is D (f,f2 ; x) = logp [ α(x)] p 2 [ α(x)] 59 8 (3.) and the classification rule is to assign x as a realization of if D 0. Otherwise, it is assigned to 2. In this section, we derive the discrimination statistic (loglikelihood ratio) and then show that this criterion is consistent, that is, the probability of incorrectly classifying a time series goes to zero as the length of the time series. 3.. he SLEX Likelihood Ratio. Denote α ={ α i,k }.Let p ( α) and p 2 ( α) be the density under processes and 2, respectively, and let f and f 2 be the spectra under these two processes. hen the log-likelihoods under these two densities are, respectively, l(f α) and l(f 2 α) Our classification rule is based on the likelihood ratio, that is, we classify x into if l(f α) l(f 2 α); otherwise,itis classified into 2. Following (3.) (3.3), the discriminant sta-

6 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 6 6 Journal of the American Statistical Association,???? 0 tistic is 60 M i /2 D (f,f2 ; x) = { log f 2(u i,ω k ) f (u i,ω k ) 4 63 i 5 S i B k= M i /2+ 64 [ ]} + α i,k f 2(u i,ω k ) f (u. (3.4) i,ω k ) We classify x into if D (f,f2 ; x) 0. Remark. First, we note that D (f,f2 ; x) 2 [l(f α) l(f 2 α)]. In the computation of the log-likelihood, we do not explicitly include the negative frequencies, because the periodograms are symmetric about k = Remark 2. o simplify the computation of the log-likelihood 74 KL( α,f 6 l(f 75 ) and l(f 2 ) = M i /2 α i,k log ), we ignored the fact that the distribution of i S f i B k= M i /2+ (u i,ω k ) α i,k /f (u i,ω k ) at k = 0andM i /2isχ 2, whereas for k =,...,M i /2, the ratio is χ2 2/2. However, for M i sufficiently large, we expect the difference to be practically negligible. α i,k Remark 3. he derivation of D (f,f2 ; x) in the SLEX 22 approach is inspired by the work of Sakiyama and aniguchi 8 23 (2003), who derived a discriminant statistic for the Dahlhaus KL( α,f 2 ) = M i /2 { α i,k log (997) model of a locally stationary random process. Denote 83 i S f i B k= M i /2+ 2(u i,ω k ) the spectrum for the Dahlhaus model as f D (u, ω). he likelihood is obtained by segmenting the time series into N time blocks, each having length M, and then computing the Fourier periodograms P M (b, ω),atblocksb =,...,N. heir classification criterion is 27 α i,k 86 D (f,f 2 ; x) 32 /2 9 = N N log f 2 D (u j,ω) his leads to the same classification rule discussed in Sec- 34 j= /2 fd (u j,ω) tion 3.. following (3.4), namely classify x into if D 0; 93 [ ] + P M (u j,ω) fd 2 (u j,ω) fd (u dω, (3.5) j,ω) which is comparable to (3.4). Remark 4. In practice, the true spectra, f and f 2, are unknown. We replace these quantities in D by the SLEX periodograms that are averaged across time series replicates. Suppose that there are N l independent time series from class l for l =, 2. Denote the periodograms computed from the We now state the results pertinent to the consistency of our classification scheme based on the SLEX model. Denote the probability of misclassifying a time series x to process 2, given that the true process is, to be P (2 ) = P {D (f,f2 ; x) <0 }. Similarly, we denote P ( 2) = P {D (f,f2 ; x) 0 2}. hen, under appropriate condi- 45 mth time series in the training dataset from class l to be α l,m i,k. tions, 04 hen the estimate of f l (u i,ω k ) is f 49 l(u i,ω k ) = N l α l,m lim i,k, l=, 2. (3.6) P ( 2) = 0. (3.0) N l 08 m= 50 are replaced 09 Essentially, we perform averaging across subjects, which gives the similar desired effect of reducing the variance when smoothing over frequency. However, if N is small, we might not achieve the desired reduction in the variance and thus we may need to do an extra smoothing step, which is over frequency. Hence, we may use the weighted form 56 5 f l(u i,ω k ) = M W s f l (u i,ω k s ), where the 2M + is the span of the weights, W s 0, and Ms= M W s =. he estimate f 2(u i,ω k ) can be obtained in a similar manner if N 2 is small. 3.2 he Classification Rule as Kullback Leibler Divergence he discriminant statistic, D given in (3.4), can be derived using the Kullback Leibler divergence. As before, let f and f 2 represent the spectra of and 2, respectively. Let the periodogram of the time series x,tobeclassifiedintoeither and 2, be denoted by α ={ α i,k }. he Kullback Leibler divergences between α and f and f 2 are 59 8 s= M and { [ + α i,k f (u i,ω k ) ]} [ + α i,k f 2(u i,ω k ) ]}. We classify the series x into if its periodogram α is closer to f than f 2;thatis,ifKL( α,f ) KL( α,f2 ) or if D (f,f2 ; x) = KL( α,f2 ) KL( α,f ) 0. otherwise, classify x into Consistency heorem lim P (2 ) = 0, (3.9) Moreover, (3.9) and (3.0) hold when f and f 2 by the consistent estimates defined in (3.6) if, in addition, we have N l for l =, 2. he conditions and proofs of these results are given in Appendix B. 4. SIMULAION RESULS AND DAA ANALYSIS 4. Simulation Results A number of numerical experiments were performed to test how well our proposed method was able to classify time series.

7 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 7 Huang, Ombao, and Stoffer: Discrimination and Classification Using the SLEX Model 7 able. Misclassification Rate for the First Simulation Study Where Π 60 2 Is Gaussian White Noise and Π 2 Is AR() With Parameter φ 6 φ Error rate We discuss a few of simulation experiments here. In the first experiment we had two classes, where processes from were Gaussian white noise and processes from 2 were stationary first-order autoregressive with parameter φ. he length of each time series was set to =,024 and the training set for each group consisted of eight time series. Several simulation studies were performed by varying φ. Using the training datasets, we selected the best local discriminant basis and then set up the discriminant criterion. We then generated 50 time series datasets Figure 4. he Coefficient a t;τ versus t for τ =.5 ( ), τ =.4 ( ), 75 7 from each of and 2. In able we report the error rate, τ =.3 ( ), and τ =.2 ( ). 76 which is the number of incorrect classifications out of a total of 00 trials. Note that there is a small misclassification rate when the value of φ is close to 0. For example, when φ =±., there is some misclassification error because these AR() processes are close to white noise. In the second experiment, processes from were defined by { () Y t if t / Y t = 27 Y (2) (4.) 86 t if /2 + t, where Y t () is white noise and Y (2) is AR() with parameter φ =.. he processes from 2 were defined by { X () t if t /2 X t = (4.2) 29 t 88 X t (2) if /2 + t, t is AR() with parameter where X t () is white noise and X (2) φ =.. We generated N = 8, 6 training datasets from each group, and 2, with = 52,,024, 2,048. In each case Figure 5. A Simulated Slowly Changing AR(2) Dataset From Π. we generated 50 time series datasets from and another 50 from 2. he misclassification rates are displayed in able 2, and again we note the favorable results. In the third experiment, the processes from and 2 were slowly time-varying AR(2) processes. he processes from were generated as Y t = a t;τ Y t.8y t 2 + ɛ t, t =,...,,024, where a t;τ =.8[ τ cos(πt/,024)], τ =.5, and ɛ t were iid standard normal. We performed three subexperiments, where 2 was defined by the processes Y t = a t;τ Y t.8y t 2 + ɛ t, t =,...,,024, where τ takes on the three different values τ =.4,.3,.2. he plots of a t;τ for the various values of τ are shown in Figure 4. A simulated dataset from process is shown in Figure 5 and that from 2 with τ =.3 is shown in Figure able 2. Misclassification Rate for the Second Simulation Study 4 Where Π Is (4.) and Π 2 Is (4.2) ,024,024 2,048 2, N Error rate Figure 6. A Simulated Slowly Changing AR(2) Dataset From Π 2 With 7 59 τ =.3. 8

8 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 8 8 Journal of the American Statistical Association,???? 0 3 τ Error rate We generated 0 datasets from each of and 2 as the training data. From the training dataset, we obtained the best basis and the estimated spectra. hen we generated 0 new data for each category to evaluate the error rate. he simulation results are shown in able 3. We can see that the error rates are zero when the τ =.3,.2in 2. here was a misclassification error for the case where the τ s are close for both processes, that is, τ =.4 for 2 and τ =.5for Data Analysis As discussed in the Introduction, discriminating between nuclear explosions and earthquakes is a problem of critical importance for monitoring a comprehensive test-ban treaty. We apply the proposed SLEX methodology to construct the discriminant rule for classifying a time series as either an explosion or an earthquake. Because the proliferation of nuclear explosions is monitored in regional distances (00 2,000 km) nowadays, the data on mining explosions can serve as a reasonable proxy. A dataset constructed by Blandford (993) that able 3. he Simulation Results for the Slowly Varying comprises regional (00 2,000 km) recordings of several typi- 60 AR(2) Experiment cal Scandinavian earthquakes and mining explosions measured by stations in Scandinavia are used in this study. A list of these events (eight earthquakes and eight explosions) and an extra event of uncertain origin that was located in the Novaya Zemlya region of Russia (called the NZ event) were given by Kakizawa et al. (998). he problem was discussed in detail by Shumway and Stoffer (2000, chap. 5) and the data are availiable online from the website of the text (the URL is given in the Introduction). he earthquake and explosion seismic recordings are given in Figures 7 and 8, respectively. he SLEX spectral estimates for one earthquake and one explosion time series are given in Figures 9 and 0, respectively, whereas the SLEX spectral estimate of the NZ event is given in Figure. o evaluate our method, we used the holdout procedure; that is, we removed one time series (at a time) to be used for classification and used all the remaining time series in the training dataset (excluding the unknown NZ event) to construct the classification rule. For each holdout time series, we performed the following steps in sequence: We selected a basis for discrimination, we extracted the SLEX periodograms computed at the blocks in this basis, we constructed the classification rule, and, finally, we assigned the holdout time series. We repeated the process for each of the holdout time series. In the Figure 7. Seismic Recordings of Earthquake Origin. 8

9 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 9 Huang, Ombao, and Stoffer: Discrimination and Classification Using the SLEX Model Figure 8. Seismic Recordings of Explosion Origin. 92 classification criterion, we estimated the spectra for each group by first averaging the SLEX periodograms across subjects and then smoothing the averaged SLEX periodograms across frequency. Denote the estimated spectra for the earthquake and the explosion classes to be, respectively, f and f 2. he classification criterion was D ( f, f 2 ; x) and the testing data was classified as an earthquake event if D ( f, f 2 ; x) 0andas an explosion if D ( f, f 2 ; x)<0. he results are reported in Figure 9. he SLEX Spectral Estimate of an Earthquake ime Series Figure 0. he SLEX Spectral Estimate of an Explosion ime Series. 8

10 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 0 0 Journal of the American Statistical Association,???? 0 60 As mentioned in Section 2.6, the existence of the limiting SLEX 4 63 spectrum depends on three assumptions, which we state now for completeness. 5 Details can be found in Ombao et al. (2002) Assumption. For f(u, ω) as a function of ω [ /2, /2], uniformly in u [0, ], we assume a Hölder condition of order µ (0, ] 66 with constant L>0: 0 f(u,ω) f(u,ω ) L ω ω µ. 69 Assumption 2. here exists a hierarchical collection I of dyadic 70 2 subintervals of [0, ], I = {[ 2 l m, 2 l (m + ) ) : l = 0,,...; m = 0,...,2 l }, 5 and a subset of intervals, say I v =[u v,u v+ ) I, such that v I v = 74 6 [0, ] and that f(u,ω), as a function of u, is Hölder of order 0 <s v < 75 7 on I v for all ω [ /2, /2]. Moreover, the transition points u v between the intervals I 76 Figure. he SLEX Spectral Estimate of the NZ Event. 8 v and I v allow for a finite number of possible 77 jumps of finite height. 20 Assumption 3. (a) Either the maximum depth J is fixed or, if 79 able 4, where we denote Eqk to be the kth earthquake in the 2 J = J is allowed to grow as a function of (i.e., J ), then 80 dataset and Expk to be the kth explosion in the dataset. he 22 we must have 2 J / 0as.Furthermore,J J 2. 8 proposed method had a zero misclassification rate. Moreover, (b) he length M 23 j of each segment S j satisfies M J /M j for 82 the unknown NZ event here is classified as an explosion, which j = 0,,...,J agrees with the result in Kakizawa et al. (998). 5. CONCLUSION In Section 3.3 we claimed that our discrimination rule is consistent in the sense of (3.9) and (3.0). We now prove these results un- 28 In this article, we proposed the SLEX method for discrimination and classification of nonstationary time series that is based spectrum, f to the SLEX limiting spectrm, f (recall Remark 4 in 87 der Assumptions 3. First, we state a theorem that relates the SLEX 30 on the SLEX transform. he SLEX transform is localized in Sec. 2.6). 89 both time and frequency domains; thus our method is able to heorem (Ombao et al. 2002). Under Assumptions 3, given the extract local features of the data. Moreover, our method is computationally efficient and hence is able to handle large datasets. SLEX limiting spectrum f, and a sequence of frequencies ω k, ω SLEX model as defined in Section 2.6, with SLEX spectrum f 33 and 92 Our procedure computes the Kullback Leibler distance of the as, we have the following situations: time frequency energy maps (normalized SLEX periodograms 36. Let u I v and let 2 J µ/(µ+sv).hen 95 or the estimated normalized spectrum) of the classes and selects 37 log f 96 the basis from the SLEX library that can best discriminate between classes of nonstationary time series. Our classification uniformly in u I v. (u, ω k, ) log f(u,ω) =O ( s vµ/(µ+s v ) ) 40 rule, which is based on the Kullback Leibler distance between 2. Define s := inf v s v and let 2 J µ/(µ+sv).hen 99 4 the estimated spectra of the groups and the time series that [ needs to be classified, is shown to be consistent; that is, the log f (u, ω k, ) log f(u,ω) ] 2 du = O ( s vµ/(µ+s v ) ) probability of misclassification goes to zero as the length of o aid in the proof of the consistency of our method, we first define the following functions. Let 03 the time series goes to infinity. Finite sample simulation studies and data analysis demonstrate that the method performs well in APPENDIX A: ASSUMPIONS ON HE LIMIING SPECRUM OF HE SLEX MODEL APPENDIX B: PROOF OF CONSISENCY φ (u, ω) = /f 2 46 practice. (u, ω) /f (u, ω) 05 and 48 φ(u,ω)= /f 2 (u, ω) /f (u, ω) able 4. Result of Holdout Procedure for Classifying 08 Seismic Recordings Define 5 Holdout series D 0 ( f, f 2 ; x) Holdout series D ( f, f 2 ; x) H (φ ) = M i /2 φ (u i,ω k ) α i,k, 52 Eq.526 Exp.5096 i S i B k= M i /2+ 53 Eq2.627 Exp Eq3.832 Exp /2 3 Eq4.03 Exp H(φ) = φ(u,ω)f (u, ω) dω du, Eq Exp /2 56 Eq6.789 Exp where α i,k is the SLEX periodogram of x, i 57 Eq7.66 Exp S i represents the blocks 6 corresponding to the best basis B Eq Exp8.044,andf is the SLEX limiting spectrum that corresponds, generically, to class (i.e., x ). Simi- NZ larly, f 8 represents the SLEX spectrum that corresponds to.he

11 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. Huang, Ombao, and Stoffer: Discrimination and Classification Using the SLEX Model proofs of the following lemmas are brief to save space; explicit details Proof. Without loss of generality, we prove the result for x can be obtained from the authors. Set 6 Lemma. Under the established notation and conditions, 4 R 63 ([ ] 5 M µ ) ([ 2 J ] µ ) = M i /2 log f 2(u i,ω k ) 64 E 6 H (φ ) = H(φ)+ O + O + O(b s ), i S f i B k= M i /2+ (u i,ω k ). 65 hen where E 8 denotes expection with respect to (wrt), M := sup i M i, D 67 b 9 := M J /,ands := inf v s v. (f,f2 ; x) = R + H (φ ) 68 and 0 Proof. 69 E { D (f 70 E H (φ ) = M i /2,f2 ; x) } = R + E{H (φ ) }. M i φ (u i,ω k )f 3 72 i 4 S M (u i,ω k ) Hence, i i B k= M i /2+ E {[ D (f,f2 ; x) E{D (f,f2 ; x)}] 2 } = M i /2 M i [φ(u i,ω k ) + O(b s 75 7 i S M )] = Var{H (φ ) } 0 i i B k= M i /2+ as by Lemma [f (u i,ω k ) + O(b s )] We are now ready to establish the main result on the consistency of 9 our discrimination rule. 78 = [ M /2 i φ(u i,ω)f (u i,ω)dω heorem 2. Under the established notation and conditions, 2 80 i S /2 i B lim 22 P (2 ) = lim P {D (f 8 ([ 23 2 J ] µ ) ],f2 ; x)<0 }=0, 82 + O + O(b s 24 ) lim P ( 2) = lim P {D (f,f2 ; x) 0 2}= / = φ(u,ω)f Proof. We prove the first result; the second result follows in a similar manner. Using Lemma, we have 85 (u, ω) dω du 27 0 /2 86 ([ ] 28 M µ ) ([ 2 J ] µ ) 87 + O + O + O(b s 29 ). E [ D (f,f2 ; x) ] 88 Lemma 2. Under the established notation and conditions, 3 = M i /2 [ log f 2(u i,ω k ) 90 ( 32 b 9 Var 33 {H (φ )}=O( s ) ( ) + O 2 J ) i + O, S f i B k= M i /2+ (u i,ω k ) + f ] (u i,ω k ) f 2(u i,ω k ) 92 /2 [ 34 = log f 2 (u, ω) 93 where Var 35 denotes variance wrt. 0 /2 f (u, ω) + f ] (u, ω) f 2 (u, ω) dωdu Proof. Let P 95 i = M i /2 k= M i /2+ φ ([ ] (u i,ω k ) α i,k.hen M µ ) ([ 2 J ] µ ) ([ ] s ) M 37 + O + O + O. 96 { M i /2 } 39 Var {H (φ )}=Var φ (u i,ω k ) α However, i,k i S i B k= M i /2+ 99 log f 2 (u, ω) 42 = 0 2 Var (P i ) + f (u, ω) + f (u, ω) f 2 (u, ω) 0, 43 i 2 Cov (P i,p j ) so that i j E [ D 03 in obvious notation. Using arguments similar to the proof of Lemma (f,f2 ; x) ] C 0 (A.) 45 and the result of Lemma, we can show 04 as. In light of (A.) and Lemma 3, the result follows immediately. 47 ( b Var {P i }=O( s ) ) + O Corollary. Let f i l(u i,ω k ) for l =, 2 be the estimates given in (3.6) and let N = min{n,n 2 }. hen, under the established notation and conditions, 08 and 5 ( 2 J ) Cov (P i,p j ) = O. lim lim P { D ( f N, f 2; x) } < 0 = 0, i j 54 Lemma 3. Let D (f lim lim P { D ( f 3,f2 ; x) denote the discriminant statistic defined in (3.4), where f 4 N, f 2; x) } 0 2 = and f 2 are the SLEX spectra in classes Proof. By the strong law of large numbers, D ( f 56 and 2, respectively. hen under the established notation and conditions, as, (f,f2 ; x) as N, and the result follows immediately from, f 2 as ; x) 5 D heorem D (f,f2 ; x) E{ D (f p,f2 ; x) i} 0 forx i,i=, 2. [Received May Revised January 2004.] 8

12 JASA jasa v.2003/0/20 Prn:23/03/2004; 5:42 F:jasatm034r2.tex; (Aiste) p. 2 2 Journal of the American Statistical Association,???? 0 REFERENCES Pulli, J. J. (996), Extracting and Processing Signal Parameters for Regional Seismic Event Identification, in Monitoring a Comprehensive est Blandford, R. R. (993), Discrimination of Earthquakes and Explosions, 3 Report AFAC-R HQ, Air Force echnical Applications Center, Ban reaty, eds. E. S. Husebye and A. M. Dainty, Dordrecht: Kluwer, 62 4 Patrick Air Force Base, FL. pp Coifman, R., and Wickerhauser, M. (992), Entropy Based Algorithms Saito, N. (994), Local Feature Extraction and Its Applications, Ph.D. dissertation, Yale University, Dept. of Mathematics. 5 for Best Basis Selection, IEEE ransactions on Information heory, 32, Sakiyama, K., and aniguchi, M. (2003), Discrimination for Locally Stationary Random Processes, Journal of Multivariate Analysis, in press Dahlhaus, R. (997), Fitting ime Series Models to Nonstationary Processes, he Annals of Statistics, 25, Kakizawa, Y., Shumway, R. H., and aniguchi, M. (998), Discrimination and Shumway, R. H. (982), Discriminant Analysis for ime Series, in Classification, Pattern Recognition and Reduction of Dimensionality, Handbook of Clustering for Multivariate ime Series, Journal of the American Statistical Association, 93, Statistics, Vol. 2, eds. P. R. Krishnaiah and L. N. Kanal, Amsterdam: North- Ombao, H., Raz, J., von Sachs, R., and Guo, W. (2002), he SLEX Model of Holland, pp. 46. Nonstationary Processes, Annals of the Institute of Statistical Mathematics, 70 (2003), ime Frequency Clustering and Discriminant Analysis, Statistics and Probability Letters, in press , Ombao, H., Raz, J., von Sachs, R., and Malow, B. (200), Automatic Statistical 3 Shumway, R. H., and Stoffer, D. S. (2000), ime Series Analysis and Its Applications, New York: Springer-Verlag Analysis of Bivariate Nonstationary ime Series, Journal of the American 4 Statistical Association, 96, Priestley, M. (965), Evolutionary Spectra and Nonstationary Processes, Wickerhauser, M. (994), Adapted Wavelet Analysis From heory to Software, 74 Journal of the Royal Statistical Society, Ser. B, 28, New York: IEEE Press

Discrimination and Classification of Nonstationary Time Series using the SLEX Model

Discrimination and Classification of Nonstationary ime Series using the SLEX Model Hsiao-Yun Huang, Hernando Ombao, and David S. Stoffer Abstract Statistical discrimination for nonstationary random processes