Time Series Analysis and Signal Modeling


Andreas Jakobsson
Lund University
Version 4

This is an early version of a set of notes for the course on Time Series Analysis offered at Lund University. Any and all comments and suggestions are most welcome and appreciated.

Andreas Jakobsson
aj@maths.lth.se


Contents

1 Signal Modeling
  1.1 Introduction

2 Stochastic Processes
  2.1 Introduction
  2.2 Stochastic vectors
    2.2.1 Properties and peculiarities
    2.2.2 Normal distributed vectors
    2.2.3 Conditional expectations
    2.2.4 Linear projections
  2.3 Stochastic processes
    2.3.1 Properties and peculiarities
    2.3.2 The power spectral density
    2.3.3 Filtering of a stochastic process
    2.3.4 The moving average process
    2.3.5 The autoregressive process
    2.3.6 The Levinson-Durbin algorithm
    2.3.7 ARMA, ARIMA, and SARIMA processes
  2.4 Estimating the power spectral density

3 Identification and Modeling
  3.1 Introduction
  3.2 Finding an appropriate model structure
    3.2.1 The partial auto-correlation function
    3.2.2 Data with trends
  3.3 Estimating the unknown parameters
    3.3.1 Least squares estimation
    3.3.2 Maximum likelihood estimation
    3.3.3 The Cramér-Rao lower bound
  3.4 Estimating the model order

4 Prediction of Stochastic Processes
  4.1 Optimal linear prediction
  4.2 Prediction of ARMA processes
    4.2.1 Implementation using Matlab
  4.3 Prediction of ARMAX processes

5 Multivariate Processes
  5.1 Introduction
  5.2 Identification and Estimation
    5.2.1 Maximum likelihood estimation

A Complements
  A.1.2 Complex-valued Normal distributed vectors
  A.1.3 Wishart distribution

B Matlab functions

Glossary and Notation

Glossary and Abbreviations

1-D, 2-D, etc.   One-dimensional, two-dimensional, etc.
ACF              Auto-covariance function
AR               Autoregressive
ARIMA            Integrated ARMA
ARMA             Autoregressive moving average
ARMAX            ARMA with exogenous input
ARX              AR with exogenous input
BLUE             Best linear unbiased estimate
CRLB             Cramér-Rao lower bound
dB               Decibel
DFT              Discrete Fourier transform
FIM              Fisher information matrix
FFT              Fast Fourier transform
LS               Least-squares
MA               Moving average
MIMO             Multiple-input multiple-output
MISO             Multiple-input single-output
ML               Maximum likelihood
MSE              Mean squared error
NLS              Non-linear least-squares
PACF             Partial auto-correlation function
PDF              Probability density function
PSD              Power spectral density
SARIMA           Seasonal ARIMA
SARIMAX          SARIMA with exogenous input
SARMA            Seasonal ARMA
SARMAX           SARMA with exogenous input
SISO             Single-input single-output
SNR              Signal-to-noise ratio
SVD              Singular value decomposition
VARMA            Vector ARMA

Notational Conventions

a, b, ...         boldface lower case letters are used for vectors
A, B, Σ, ...      boldface upper case letters are used for matrices
A, a, α, ...      non-bold letters are generally used to denote scalars
a^T, A^T, ...     (·)^T denotes the matrix or vector transpose
a^*, A^*, ...     (·)^* denotes the Hermitian (conjugate) transpose
Â, â, α̂, ...      (·)̂ is used to denote an estimate
C^{n×m}           the complex n×m-dimensional space
C^n               the complex n-dimensional plane
CN(m, R)          the complex-valued Normal distribution
N(m, R)           the real-valued Normal distribution
arg max f(x)      the argument that maximizes f(x)
arg min f(x)      the argument that minimizes f(x)
E{·}              the expectation operator
exp(·)            the exponential function; exp(a) = e^a
i or j            the imaginary unit, √-1, unless otherwise specified
I                 the identity matrix (of unspecified dimension)
I_{m,n}           the m×n identity matrix
I_θ               the Fisher information matrix
Im(·)             the imaginary part of
log(·)            the natural logarithm
R^{n×m}           the real n×m-dimensional space
R^n               the real n-dimensional plane (R is used for n = 1)
Re(·)             the real part of
tr(·)             the trace of a matrix
≜                 defined as
|                 conditioned on; e.g., a|b means a conditioned on b
det(·)            matrix determinant
‖·‖               the L2-norm (Euclidean norm)
⌊x⌋               the integer part of x

Chapter 1

Signal Modeling

1.1 Introduction

Figure 1.1: A female voice uttering: "Why were you away a year, Roy?" [amplitude versus time in seconds]

Example 1.1. Figure 1.1 shows the sampled speech signal of a female speaker uttering the phrase "Why were you away a year, Roy?". The signal is sampled at f_s = 8000 Hz, which is a common sampling frequency for speech signals.

Example 1.2. Figure 1.2 shows the measured temperature in the Swedish city Svedala. The temperature data is sampled every hour during a period in April and May 1994.

Figure 1.2: Temperature measurements in the Swedish city Svedala. The temperature data is sampled every hour during a period in April and May 1994. [temperature versus time in days]

Chapter 2

Stochastic Processes

A thing of beauty is a joy for ever:
Its loveliness increases; it will never
Pass into nothingness; but still will keep
A bower quiet for us, and a sleep
Full of sweet dreams, and health, and quiet breathing.

John Keats

2.1 Introduction

We first examine vectors of stochastic variables, and then extend the discussion to also deal with stochastic processes.

2.2 Stochastic vectors

2.2.1 Properties and peculiarities

Let x denote a vector containing p stochastic variables, such that

    x = [ x_1 ... x_p ]^T    (2.1)

where (·)^T and x_l denote the transpose and the l:th element of the vector x, respectively. Denote the mean of the vector

    m_x = E{x} = [ E{x_1} ... E{x_p} ]^T,    (2.2)

where E{·} denotes the statistical expectation, defined as

    E{ g(x) } = ∫ g(x) f(x) dx,    (2.3)

with f(x), depending on the dimensionality of x in (2.2) and (2.3), denoting the probability density function (PDF) of the stochastic vector or variable, respectively. Furthermore, denote the covariance matrix of x and y by

    R_{x,y} = C{x, y} = E{ [x - m_x] [y - m_y]^* },    (2.4)

where (·)^* denotes the conjugate transpose, and where the q-dimensional vectors y and m_y are defined similarly to x. Thus, R_{x,y} is a (p × q)-dimensional matrix with elements

    R_{x,y} = [ C{x_1, y_1}  ...  C{x_1, y_q}
                    ...      ...      ...
                C{x_p, y_1}  ...  C{x_p, y_q} ]    (2.5)

In the particular case when y = x, this yields the (p × p)-dimensional (auto)covariance matrix R_{x,x}, which will then be a positive semi-definite Hermitian matrix, i.e.,

The (auto)covariance matrix R_{x,x} will be:

(i) Positive semi-definite, here denoted R_{x,x} ≥ 0, implying that w^* R_{x,x} w ≥ 0 for all vectors w. This also implies that the eigenvalues of R_{x,x} are real-valued and non-negative.

(ii) Hermitian, i.e., the matrix will satisfy R_{x,x} = R_{x,x}^*. If x is a real-valued vector, this implies that R_{x,x} = R_{x,x}^T; such matrices are termed symmetric.

When there is no risk for confusion, we will, for notational convenience, often denote R_{x,x} simply by R_x. Although we have here, in the interest of generality, used the general definitions that also allow the stochastic vector to be complex-valued, we will throughout these notes generally assume that the examined data is real-valued, and only treat the particularities of complex-valued measurements in separate, optional, sections. When working with covariances, the following lemma is often useful:

Lemma 2.1. Let x, u and y, v denote p- and q-dimensional stochastic vectors, respectively, and let A and B be (n × p)- and (m × q)-dimensional deterministic matrices, respectively. Then, it holds that

    C{ A(x + u), B(y + v) } = A C{x, y} B^* + A C{x, v} B^* + A C{u, y} B^* + A C{u, v} B^*    (2.6)

which implies that

    V{ Ax + a } = C{ Ax + a, Ax + a }    (2.7)
                = A V{x} A^*    (2.8)

    C{ x + u, y } = C{x, y} + C{u, y}    (2.9)

where a is a deterministic vector of appropriate dimension.
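As a quick numerical sanity check of (2.7)-(2.8), the following minimal Matlab sketch draws a large number of realizations of a real-valued stochastic vector and compares the sample covariance of Ax + a with A V{x} A^T; all numerical values below are arbitrary, illustrative choices and are not taken from the notes.

% Monte Carlo illustration of V{Ax + a} = A V{x} A^T, cf. (2.7)-(2.8).
% All matrices below are arbitrary examples (real-valued case).
p  = 3;  n = 2;  K = 1e5;                  % dimensions and number of realizations
A  = [1 0.5 -1; 0 2 1];                    % (n x p)-dimensional deterministic matrix
a  = [1; -2];                              % deterministic n-dimensional vector
Rx = [2 0.3 0; 0.3 1 0.5; 0 0.5 1.5];      % a positive definite covariance matrix
x  = chol(Rx,'lower')*randn(p,K);          % zero-mean realizations with V{x} = Rx
y  = A*x + repmat(a,1,K);                  % the transformed vectors Ax + a
disp(cov(y.'))                             % sample covariance of Ax + a
disp(A*Rx*A.')                             % A V{x} A^T; the two should agree closely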

2.2.2 Normal distributed vectors

Throughout these notes, we often consider p-dimensional Normal distributed vectors, i.e., x ~ N(m_x, R_x). By this notation, we imply that x follows a real-valued multivariate Normal distribution with a PDF of the form

    f_x(x) = (2π)^{-p/2} det(R_x)^{-1/2} exp{ -(1/2) [x - m_x]^T R_x^{-1} [x - m_x] }    (2.10)

where det(R_x) denotes the determinant of R_x. In the one-dimensional case, i.e., when p = 1, (2.10) naturally simplifies to the well-known form

    f_x(x) = (2πσ_x²)^{-1/2} exp{ -[x - m_x]² / (2σ_x²) },    (2.11)

where m_x and σ_x² denote the mean and the variance of the stochastic variable x, respectively. In the particular case when the vector contains equally distributed and independent stochastic variables, i.e., it is an independent identically distributed (iid) white vector,

    R_x = σ² I,    (2.12)

where I is a (p × p)-dimensional identity matrix. The distribution for a complex-valued Normal distributed vector, x ~ CN(m_x, R_x), is given in Appendix A.1.2. Hereafter, we will by these notations indicate if x is real- or complex-valued.

Theorem 2.1. Let x be Normal distributed with x ~ N(m_x, Σ); then

    y = Cx    (2.13)

is distributed according to y ~ N(C m_x, C Σ C^T) for non-singular C. The proof can be found, for instance, in [1].

2.2.3 Conditional expectations

We proceed to examine conditional expectations, using the following definitions:

Definition 2.1. The conditional distribution of the random variable y, given that x = x_0, for some value x_0, is defined as

    f_{y|x=x_0}(y) = f_{x,y}(x_0, y) / f_x(x_0) = f_{x,y}(x_0, y) / ∫ f_{x,y}(x_0, y) dy    (2.14)

Furthermore, the conditional expectation is defined as

    E{ y | x = x_0 } = ∫ y f_{y|x=x_0}(y) dy    (2.15)

where, from here on, the integration, unless specified, is from -∞ to ∞.

In the interest of notational convenience, we will most commonly omit that x = x_0, simply writing the conditional expectation as E{y | x}. Note that, from the above definitions, it is clear that if the stochastic variables x and y are independent, i.e., if

    f_{x,y}(x, y) = f_x(x) f_y(y),    (2.16)

this implies that f_{y|x}(y) = f_y(y), and that

    E{ y | x } = E{ y }    (2.17)

Using Definition 2.1, one can easily show several useful results

    E{ g(x) y | x } = g(x) E{ y | x }    (2.18)
    E{ cu + dv | x } = c E{ u | x } + d E{ v | x }    (2.19)
    E{ y } = E_x{ E_y{ y | x } }    (2.20)
    E{ g(x) y } = E_x{ g(x) E_y{ y | x } }    (2.21)

where c and d are some deterministic constants, g(x) is a function of x, and where in (2.20) and (2.21), in the interest of clarity, we have indicated the variable over which the expectation is taken as a subscript. The nested expectation in (2.20), commonly referred to as an iterated expectation, or as the tower rule, may, at first sight, seem confusing, and may therefore deserve some further comments. Note that

    E_x{ E_y{ y | x } } = ∫ E_y{ y | x } f_x(x) dx    (2.22)
                        = ∫ [ ∫ y f_{y|x}(y) dy ] f_x(x) dx    (2.23)
                        = ∫ ∫ y f_{y,x}(x, y) dy dx    (2.24)
                        = ∫ y [ ∫ f_{y,x}(x, y) dx ] dy    (2.25)
                        = ∫ y f_y(y) dy    (2.26)

which proves (2.20). Here, we have in obtaining (2.24) made use of (2.14), and in obtaining (2.26) the fact that the inner integral over x will integrate out the dependence on x from f_{y,x}(x, y). The result in (2.21) is shown similarly. We proceed to define the conditional covariance:

Definition 2.2. The conditional covariance is defined as

    C{ y, z | x } = E{ [y - m_{y|x}] [z - m_{z|x}]^* | x }    (2.27)

where m_{y|x} = E{y | x} and m_{z|x} = E{z | x}. Setting z = y immediately yields a similar expression for V{y | x}.

Equipped with this definition and the results above, we are then ready to state the variance separation theorem.

Theorem 2.2. The variance separation theorem states that

    V{ y } = E{ V[y | x] } + V{ E[y | x] }    (2.28)
    C{ y, z } = E{ C[y, z | x] } + C{ E[y | x], E[z | x] }    (2.29)

where the expectations and (co)variances are taken with respect to the appropriate variables.

2.2.4 Linear projections

We will now examine how to exploit the above defined conditional expectations to formulate the optimal linear projection of y given knowledge of x. To do so, we begin by defining what we mean by a linear projection:

Definition 2.3. The linear projection of y onto the space spanned by x, here denoted R(x), is defined as

    E{ y | x } = a + Bx    (2.30)

where a ∈ R(x) and B is a deterministic matrix of appropriate dimension.

Expressed differently, we decompose the (stochastic) vector y into two components, one that can be written as a linear combination of x, and one that cannot. This can geometrically be viewed as depicted in the figure below, where the vector y is decomposed into the vector E{y | x}, which lies in the range space of x, and a vector e = y - E{y | x}, which is orthogonal to the range space of x.

[Figure: the vector y decomposed into its projection p = E{y | x}, lying in R(x), and the orthogonal error e = y - E{y | x}.]

From the above figure, we can immediately conclude that the difference, here termed the error, e, is orthogonal to any vector in R(x), i.e.,

    C{ y - E{y | x}, x } = 0    (2.31)

Alternatively, one may show this using the variance separation theorem. Using

(2.30) and (2.18), note that

    C{ y - E{y | x}, x } = C{ y - a - Bx, x }    (2.32)
                         = C{ y, x } - C{ a + Bx, x }    (2.33)
                         = C{ y, x } - B V{x}    (2.34)

However, using Theorem 2.2,

    C{ y, x } = E{ C[y, x | x] } + C{ E[y | x], E[x | x] }    (2.35)
              = E{ C[y, x | x] } + C{ a + Bx, x }    (2.36)
              = E{ C[y, x | x] } + B V{x}    (2.37)
              = B V{x}    (2.38)

where we in the last step have used Definition 2.2. Inserting (2.38) in (2.34) immediately yields (2.31). It is worth stressing that the geometrical interpretation offers a valuable insight, allowing us to directly conclude (2.31) without having to go through the steps of the detailed proof.

Theorem 2.3. Let z denote the concatenated vector

    z = [ x^T  y^T ]^T    (2.39)

having mean E{z} = [ m_x^T  m_y^T ]^T and covariance matrix (cf. (2.5))

    R_{z,z} = [ R_{x,x}  R_{x,y}
                R_{y,x}  R_{y,y} ]    (2.40)

Then, the linear projection in (2.30) can be expressed as

    E{ y | x } = m_y + R_{y,x} R_{x,x}^{-1} (x - m_x)    (2.41)

with the difference e = y - E{y | x} having variance

    V{ e } = R_{y,y} - R_{y,x} R_{x,x}^{-1} R_{x,y} = E{ V{y | x} }    (2.42)

and is the optimal linear projection, i.e., the projection that yields the minimum variance among all linear projections. Furthermore, if x and y are Normal distributed, then e and x are independent; otherwise, they are uncorrelated.

Here, the last step of (2.42) follows from Theorem 2.2. Furthermore, recall that stochastic variables are said to be uncorrelated if

    E{ xy } = E{ x } E{ y },    (2.43)

which is a notably weaker requirement than independence. A direct result of Theorem 2.3 is that if we wish to use the vector x to form an optimal linear prediction of y, this is obtained as E{y | x}. We will return to this notion in further detail in Chapter 4.
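To make Theorem 2.3 concrete, the following minimal Matlab sketch generates jointly Normal x and y, forms the linear projection (2.41), and verifies numerically that the resulting error is (essentially) uncorrelated with x and has the variance in (2.42). The joint covariance used below is an arbitrary, illustrative choice, not a value from the notes.

% Illustration of the optimal linear projection in Theorem 2.3 (real-valued case).
mx  = [1; 0];  my = 2;                  % means of x (2-dimensional) and y (scalar)
Rxx = [1 0.4; 0.4 2];                   % covariance of x
Ryx = [0.8 0.5];                        % cross-covariance C{y,x} (1 x 2)
Ryy = 1.5;                              % variance of y
Rz  = [Rxx Ryx.'; Ryx Ryy];             % joint covariance of z = [x; y]
K   = 1e5;                              % number of realizations
z   = chol(Rz,'lower')*randn(3,K) + repmat([mx; my],1,K);
x   = z(1:2,:);   y = z(3,:);
xc  = x - repmat(mx,1,K);               % x - m_x for each realization
yhat = my + Ryx*(Rxx\xc);               % E{y|x} = m_y + R_yx R_xx^{-1}(x - m_x), cf. (2.41)
e    = y - yhat;                        % projection error
disp(e*xc.'/K)                          % sample C{e,x}; close to zero, cf. (2.31)
disp([var(e)  Ryy - Ryx*(Rxx\Ryx.')])   % sample versus theoretical error variance, cf. (2.42)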

2.3 Stochastic processes

We will now proceed to extend the earlier discussion to also treat stochastic processes. In doing so, we will initially consider only one-dimensional stochastic processes, and then, in Chapter 5, extend on this discussion to also allow for multi-dimensional stochastic processes.

2.3.1 Properties and peculiarities

In these notes, we will restrict our attention to wide-sense stationary (WSS) processes:

A stochastic process is wide-sense stationary (WSS) if

(i) The mean of the process is constant.

(ii) The auto-covariance C{y_s, y_t} only depends on the difference (s - t), and not on the actual values of s and t.

(iii) The variance of the process is finite, i.e., E{ |y_t|² } < ∞.

For a WSS process y_t, we define the auto-covariance, cross-covariance, auto-correlation, and cross-correlation functions as:

Definition 2.4. The auto-covariance function (ACF) for y_t is defined as

    r_y(k) = C{ y_t, y_{t-k} } = E{ [y_t - m_y] [y_{t-k} - m_y]^* }    (2.44)
           = E{ y_t y_{t-k}^* } - m_y m_y^*    (2.45)

Similarly, we define the cross-covariance of the WSS processes x_t and y_t as

    r_{x,y}(k) = C{ x_t, y_{t-k} } = E{ [x_t - m_x] [y_{t-k} - m_y]^* },    (2.46)

where m_x and m_y denote the means of the respective processes.

Definition 2.5. The auto-correlation function of y_t is defined as

    ρ_y(k) = r_y(k) / r_y(0)    (2.47)

and will therefore be bounded such that |ρ_y(k)| ≤ 1, with equality for k = 0, as well as, possibly, for k = ±l, with l > 0, if the signal is periodic with period l (see, e.g., Example 2.2). Similarly, we define the cross-correlation of the processes x_t and y_t as

    ρ_{x,y}(k) = r_{x,y}(k) / √( r_x(0) r_y(0) ),    (2.48)

which will be bounded as |ρ_{x,y}(k)| ≤ 1.

Example 2.1. Consider a complex-valued sinusoidal signal with frequency ω_0,

    x_t = A e^{i t ω_0 + i φ},    (2.49)

where φ is a uniformly distributed random variable on [-π, π]. Then,

    m_x = E{ A e^{i t ω_0 + i φ} } = (1/2π) ∫_{-π}^{π} A e^{i t ω_0 + i φ} dφ = 0    (2.50)

and

    r_x(k) = E{ A² e^{i t ω_0 + i φ} e^{-i (t-k) ω_0 - i φ} } = A² e^{i ω_0 k}.    (2.51)

Thus, the ACF of a complex-valued sinusoid is also a sinusoid, both having the same frequency.

Example 2.2. Consider instead a real-valued sinusoidal signal

    x_t = A cos(t ω_0 + φ),    (2.52)

where φ is a uniformly distributed random variable on [-π, π]. Using Euler's formula, we rewrite x_t as

    x_t = (A/2) ( e^{i t ω_0 + i φ} + e^{-i t ω_0 - i φ} ),    (2.53)

which, using steps similar to the ones in Example 2.1, yields

    r_x(k) = (A²/2) cos(ω_0 k)    (2.54)

Comparing the steps needed here and in Example 2.1 clearly illustrates that working with complex-valued signals is often simpler than working with their real-valued counterparts.

Figure 2.1: (a) An example of a real-valued periodic signal; the signal is a voiced speech signal extracted from the utterance in Example 1.1, together with (b) the estimated correlation function of the signal.

Example 2.3. Figure 2.1(a) shows an example of a real-valued periodic signal. This signal is a voiced speech signal extracted from the utterance in Example 1.1, and is sampled at f_s = 8000 Hz. Clearly, the signal exhibits strong periodicities,

and one can therefore conclude from Example 2.2 that the covariance function of this signal should contain the same periodicities as the actual signal. Figure 2.1(b) illustrates this, as well as the fact that ρ_y(k) is symmetric, and bounded as |ρ_y(k)| ≤ 1, with equality for k = 0.

The definition of WSS processes implies some very useful properties of the ACF, namely:

The auto-covariance function (ACF) of a WSS process satisfies:

(i) The ACF is conjugate symmetric, i.e., r_y(k) = r_y^*(-k).

(ii) The variance is always non-negative, i.e., r_y(0) = E{ |y_t - m_y|² } ≥ 0.

(iii) The ACF takes its largest value at lag 0, i.e., r_y(0) ≥ |r_y(k)|, for all k.

These properties are easily verified for the above examples. In these notes, we will typically assume that we have observed a single realization of the process, containing, say, N samples, numbered from t = 1, ..., N. Generally, the true r_y(k) is also unknown, and we will therefore need to estimate r_y(k) as accurately as possible from this one (vector) observation. When no assumptions are made on the measurements, there are two standard ways to estimate the ACF, namely, the unbiased ACF estimate

    r̂_y(k) = 1/(N-k) Σ_{t=k+1}^{N} ( y_t - m̂_y )( y_{t-k} - m̂_y )^*,    (2.55)

and the biased ACF estimate

    r̂_y(k) = (1/N) Σ_{t=k+1}^{N} ( y_t - m̂_y )( y_{t-k} - m̂_y )^*,    (2.56)

for 0 ≤ k ≤ N-1, where the mean of the process is estimated as

    m̂_y = (1/N) Σ_{t=1}^{N} y_t.    (2.57)

These covariance estimates deserve some further commenting; firstly, note that the sums in (2.55) and (2.56) start at t = k+1. This is due to the first available sample being y_1; terms with t < k+1 in the sums would use measurements y_l, for l < 1, which are not available. Secondly, note that the

estimates only differ in the normalization constant before the sum. As

    E{ r̂_y(k) } = 1/(N-l) Σ_{t=k+1}^{N} E{ ( y_t - m̂_y )( y_{t-k} - m̂_y )^* }    (2.58)
                = 1/(N-l) Σ_{t=k+1}^{N} r_y(k)    (2.59)
                = (N-k)/(N-l) r_y(k),    (2.60)

for l = k or l = 0, corresponding to (2.55) and (2.56), respectively, the estimate in (2.55) will clearly result in an unbiased estimate of r_y(k), i.e.,

    E{ r̂_y(k) } = r_y(k),    (2.61)

whereas (2.56) will result in an estimate that is only asymptotically unbiased, i.e., it is unbiased only as N → ∞. Similar to the above discussion, we should estimate the cross-covariance between the stationary processes x_t and y_t as

    r̂_{x,y}(k) = (1/N) Σ_{t=k+1}^{N} ( x_t - m̂_x )( y_{t-k} - m̂_y )^*,    (2.62)

where the means of the processes have been estimated similar to (2.57). In both the estimation of r̂_y(k) and r̂_{x,y}(k), it is important to be aware of the difficulty of estimating these covariances accurately for higher order lags. Due to finite sample effects, both these estimates can exhibit correlation with themselves, making it appear as if there were a correlation among higher lags that is not there. Often, this correlation appears as a pattern at larger lags that is also seen in the lower lags. Therefore, as a practical rule of thumb, one should at most calculate these covariances for lags up to N/4. Obviously, this rule also holds for the corresponding correlation functions.

A convenient way to represent a set of measurements is in vector form. To allow for further flexibility, we will here divide the measurement into a collection of subvectors y_t, each containing L ≤ N samples of y_t, i.e.,

    y_t = [ y_t ... y_{t+L-1} ]^T,    (2.63)

for t = 1, ..., M, where M = N - L + 1 denotes the number of available subvectors y_t. Often, we will also only use a single vector containing all the samples, i.e., L = N, and will then, for simplicity, commonly just write y in place of y_1. Following the discussion in Section 2.2.1, we can form the

covariance matrix of y_t as

    R_y = E{ y_t y_t^* }    (2.64)
        = [ r_y(0)      r_y^*(1)    ...  r_y^*(L-1)
            r_y(1)      r_y(0)      ...  r_y^*(L-2)
              ...         ...       ...     ...
            r_y(L-1)    r_y(L-2)    ...  r_y(0)    ]    (2.65)

Similarly to the covariance matrix for a general stochastic vector, R_y will be a positive semi-definite Hermitian matrix. However, as can be seen from (2.65), the matrix will also have a Toeplitz structure, i.e., the matrix will have the same element along each of the diagonals.

Figure 2.2: The estimated correlation function for the white noise signal in Example 2.4. Figure (b) is a magnified version of parts of the figure in (a). The dashed lines correspond to ±2/√N.

Example 2.4. Let e_t be a zero-mean white (real-valued) Gaussian process with variance σ_e². Then,

    r_e(k) = σ_e² δ_K(k),    (2.66)

where δ_K(k) is the Kronecker delta function,

    δ_K(k) = { 1,  k = 0
               0,  k ≠ 0    (2.67)

and the covariance matrix of the L-dimensional subvectors e_t, formed similar to (2.63), is given by

    R_e = σ_e² I    (2.68)

where I is the L × L identity matrix. Figure 2.2 illustrates the estimated correlation function for N samples of a white noise process. Looking at Figure 2.2(a), showing ρ̂_e(k), it is clear that ρ̂_e(k) is a symmetric function taking its maximal value ρ̂_e(k) = 1,

for k = 0. An even more interesting thing to note is that, counter to what could be expected from (2.66),

    ρ̂_e(k) ≠ 0,  for k ≠ 0    (2.69)

where ρ̂_e(k) is estimated using (2.56). This is due to the limited number of available samples; given N samples, one is simply not able to estimate r_e(k), and thus ρ_e(k), with better accuracy than this. Being able to determine the variance of these estimates is most useful when trying to determine if a process is white or not, and we will in Chapter 3 make good use of the following result:

Theorem 2.4. Let e_t, for t = 1, ..., N, be a realization of a zero-mean white process with variance σ_e². If ρ̂_e(k) is estimated according to Definition 2.5, i.e.,

    ρ̂_e(k) = r̂_e(k) / r̂_e(0),    (2.70)

where r̂_e(k) is estimated using (2.56), then asymptotically

    E{ ρ̂_e(k) } = 0    (2.71)
    V{ ρ̂_e(k) } = 1/N    (2.72)

for k ≠ 0. Furthermore, ρ̂_e(k) is asymptotically Normal distributed.

An important consequence of Theorem 2.4 is that the 95% (approximative) confidence interval of ρ̂_e(k), for k ≠ 0, is ±1.96/√N, i.e., with 95% confidence,

    ρ̂_e(k) ∈ [ -2/√N, 2/√N ],  for k ≠ 0    (2.73)

This means that all values of the estimate |ρ̂_e(k)| < 2/√N should be treated as zero, i.e., we are unable to tell the difference between values within ±2/√N, and should therefore treat all of them as being zero. As seen in Figure 2.2(b), illustrating the correlation estimate with the corresponding confidence intervals, the estimate can thus be seen to satisfy (2.66). We will exploit this result frequently in what follows.

Example 2.5. The covariance matrix for the process in Example 2.1 is

    R_x = A² a_L(ω_0) a_L^*(ω_0)    (2.74)

where a_L(ω) is a so-called Fourier vector,

    a_L(ω) = [ 1  e^{iω}  ...  e^{iω(L-1)} ]^T    (2.75)

Thus, R_x is a rank-one positive semi-definite Toeplitz matrix.

Similar to the above discussion for r_y(k), we will generally need to estimate R_y from the available measurement. The definition of R_y in (2.65) immediately suggests the forming of R̂_y as the Toeplitz matrix formed from the estimated

r̂_y(k), obtained using (2.56). An alternative estimate can be obtained by instead forming the outer-product estimate

    R̂_y = (1/M) Σ_{t=1}^{M} y_t y_t^*,    (2.76)

which, due to finite sample effects, typically will not exhibit a Toeplitz structure. Imposing the Toeplitz structure, as is done if using r̂_y(k), often yields undesirable effects (more on this later on), and one therefore often prefers using (2.76) instead. As all Toeplitz matrices will be persymmetric¹ (although it should be noted that the opposite is not true), i.e.,

    A = J^T A^T J    (2.77)

for some matrix A, where J is the L × L exchange (or reversal) matrix formed as

    J = [          1
               ⋰
          1          ]    (2.78)

where all the empty values of the matrix are zero, one further alternative is to instead impose a persymmetric structure on R̂_y, forming the so-called forward-backward covariance matrix estimate

    R̂_y^{fb} = (1/2) ( R̂_y + J R̂_y^T J ),    (2.79)

where R̂_y is formed using (2.76). Often, R̂_y^{fb} yields estimates that are superior to R̂_y, and if not otherwise specified, this should be our choice for estimating R_y. The estimates in (2.76) and (2.79) can be computed using the Matlab function covm provided in Appendix B.

2.3.2 The power spectral density

An often convenient way to characterize a stochastic process is via its power spectral density (PSD), defined as the discrete-time Fourier transform (DFT) of the ACF, i.e.,

    φ_y(ω) = Σ_{k=-∞}^{∞} r_y(k) e^{-iωk}    (2.80)

The inverse transform recovers r_y(k),

    r_y(k) = (1/2π) ∫_{-π}^{π} φ_y(ω) e^{iωk} dω    (2.81)

¹ It is worth noting that the inverse of a (per)symmetric matrix will be (per)symmetric. Generally, the inverse of a Toeplitz matrix is not Toeplitz, but as all Toeplitz matrices are persymmetric, the inverse of a Toeplitz matrix will be persymmetric. Furthermore, the inverse of a symmetric Toeplitz matrix will be centrosymmetric, i.e., it is both symmetric and persymmetric. If A ∈ C^{L×L} is such that A = J^T A^* J, we instead say that A is a perhermitian matrix; in this case, the inverse of a Hermitian Toeplitz matrix will instead be centrohermitian.

from which we note that

    r_y(0) = (1/2π) ∫_{-π}^{π} φ_y(ω) dω.    (2.82)

Since r_y(0) = E{ |y_t|² } measures the power of y_t, the equality in (2.82) shows that φ_y(ω) is indeed correctly named a power spectral density as it represents the distribution of the signal power over frequencies. Under weak assumptions², it can be shown that (2.80) is equivalent to

    φ_y(ω) = lim_{N→∞} E{ (1/N) | Σ_{t=1}^{N} y_t e^{-iωt} |² }    (2.83)

Using the DFT,

    Y_N(ω) = Σ_{t=1}^{N} y_t e^{-iωt},    (2.84)

the PSD in (2.83) can be expressed as

    φ_y(ω) = lim_{N→∞} E{ (1/N) |Y_N(ω)|² },    (2.85)

which also suggests the most natural way to estimate the PSD, i.e., as the magnitude square of the DFT of the data vector, i.e.,

    φ̂_y(ω) = (1/N) |Y_N(ω)|² = (1/N) | Σ_{t=1}^{N} y_t e^{-iωt} |²    (2.86)

This estimator, termed the periodogram, was introduced in 1898 by Sir Arthur Schuster³, who derived it to determine hidden periodicities (non-obvious periodic signals) in time series [2; 3]. As an alternative, one could use the definition in (2.80) to instead form the estimate of the PSD as

    φ̂_y(ω) = Σ_{k=-(N-1)}^{N-1} r̂_y(k) e^{-iωk},    (2.87)

where r̂_y(k) is the biased ACF estimate defined in (2.56). The resulting estimate is commonly referred to as the correlogram. As shown in, e.g., [4], the estimate in (2.86) will coincide with the estimate in (2.87) as long as the latter is formed using the biased ACF estimate. This is most convenient as it is often simpler to use (2.87) when analyzing the performance of the estimate, whereas it is computationally simpler to use (2.86) when actually computing the estimate.

² The ACF needs to decay sufficiently rapidly, i.e., lim_{N→∞} (1/N) Σ_{k=-N}^{N} |k| |r_y(k)| = 0.

³ Schuster applied the periodogram to find hidden periodicities in the monthly sunspot numbers for the years 1749 to 1894, yielding the classical estimate of about 11 years for the sunspot cycle.
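To make the equivalence of the two estimators concrete, the following minimal Matlab sketch (using only built-in functions; the data is simply a white noise realization and the sample size is an arbitrary choice) computes the periodogram (2.86) via the FFT and the correlogram (2.87) from the biased ACF estimate (2.56), evaluated on the same frequency grid; the two agree to numerical precision.

% Periodogram (2.86) versus correlogram (2.87) formed from the biased ACF (2.56).
N = 512;                                 % number of samples (illustrative choice)
y = randn(N,1);  y = y - mean(y);        % white realization with the mean removed
Pper = abs(fft(y)).^2 / N;               % periodogram at omega_m = 2*pi*(m-1)/N
r = zeros(N,1);                          % biased ACF estimate for lags 0,...,N-1
for k = 0:N-1
    r(k+1) = sum( y(k+1:N).*y(1:N-k) )/N;
end
w = 2*pi*(0:N-1).'/N;                    % the same frequency grid as the FFT
Pcorr = zeros(N,1);
for m = 1:N                              % correlogram, using r(-k) = r(k) for real data
    Pcorr(m) = r(1) + 2*real( exp(-1i*w(m)*(1:N-1))*r(2:end) );
end
max(abs(Pper - Pcorr))                   % should be (numerically) zero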

Figure 2.3: The periodogram estimate of the white noise signal in Example 2.6.

Since φ_y(ω) is a power density, it is natural to assume that it should be real-valued and non-negative. This is indeed the case, which can readily be seen from (2.83). Hence,

    φ_y(ω) ≥ 0,  for all ω.    (2.88)

Further, the power spectral density is periodic, such that

    φ_y(ω) = φ_y(ω + 2πk),    (2.89)

for any integer k. In the particular case when the process is real-valued, the PSD is symmetric, so that φ_y(ω) = φ_y(-ω). Otherwise, if it is complex-valued, the PSD is non-symmetric.

Example 2.6. The white process in Example 2.4 has the PSD

    φ_e(ω) = σ_e²,    (2.90)

which, as expected, is real-valued and positive. Figure 2.3 illustrates the periodogram estimate of a realization of this process, with σ_e² = 1 and the realization consisting of N samples. Figure 2.3(a) shows the (regular) periodogram estimate, whereas Figure 2.3(b) instead plots the estimate in decibel (dB). The second plot is obtained as

    φ̂_e^{dB}(ω) = 10 log_10( φ̂_e(ω) ),    (2.91)

where φ̂_e^{dB}(ω) is the periodogram estimate expressed in dB, whereas φ̂_e(ω) is the periodogram estimate expressed in the regular (linear) domain. Plotting the signal in dB allows us to more easily see the full range of values, as even the relatively small values are visible, whereas if expressed in the regular domain, it would be hard to see these. Here, since the spectrum will be symmetric (as e_t is real-valued), the figures

only show the positive frequencies. As is clear from the figures, the periodogram estimate seems to be unbiased, having a mean of about 0 dB (i.e., 1), but exhibits a very large variance. We will return to this aspect and discuss it in further detail in Section 2.4.

Example 2.7. Let y_t = x_t + e_t, for t = 1, ..., N, where e_t is assumed to be a zero-mean white noise, with variance σ_e², independent of x_t, and

    x_t = Σ_{l=1}^{n} A_l e^{i ω_l t + i ϕ_l}    (2.92)

Then,

    R_y = Σ_{l=1}^{n} |A_l|² a_L(ω_l) a_L^*(ω_l) + σ_e² I    (2.93)

and

    φ_y(ω) = Σ_{l=1}^{n} |A_l|² δ_D(ω - ω_l) + σ_e²    (2.94)

where δ_D(ω) is the Dirac delta, satisfying

    f(a) = ∫ f(x) δ_D(x - a) dx.    (2.95)

In the particular case when x_t is a sum of real-valued sinusoids, it is clear that the spectrum will be symmetric, whereas it will otherwise not be.

Figure 2.4: The periodogram estimate of the voiced speech signal in Example 2.8. The fundamental frequency of the signal is about 240 Hz.

Example 2.8. Figure 2.4 illustrates the periodogram estimate of the voiced speech signal in Example 2.3. As is typical for voiced speech, the signal can be seen to contain several spectral peaks at frequencies being an integer multiple

of the first peak frequency, the so-called fundamental frequency. One common model for such signals is (see also, e.g., [5])

    y_t = Σ_{k=1}^{d} α_k sin(ω_k t + φ_k) + e_t,    (2.96)

where α_k, ω_k, and φ_k are the amplitude, frequency, and phase of the k:th sinusoidal component, with e_t denoting some additive noise, and the frequencies ω_k = kω_0, with ω_0 being the fundamental frequency. In this example, the fundamental frequency of the signal is about 240 Hz.

2.3.3 Filtering of a stochastic process

We are herein particularly interested in the filtering of stochastic processes through an asymptotically stable linear system. Let

    H(z) = Σ_{k=-∞}^{∞} h_k z^{-k}    (2.97)

denote an asymptotically stable linear time-invariant system, where z^{-1} denotes the unit delay operator, defined as

    z^{-1} y_t = y_{t-1},    (2.98)

and assume that the process y_t is formed as the output of this system, i.e.,

    y_t = Σ_{k=-∞}^{∞} h_k x_{t-k},    (2.99)

where x_t is the input to the system. Then,

    m_y = E{ Σ_k h_k x_{t-k} } = Σ_k h_k E{ x_{t-k} }    (2.100)
        = m_x Σ_k h_k = m_x H(0)    (2.101)

with

    H(ω) = Σ_{k=-∞}^{∞} h_k e^{-iωk}    (2.102)

The mean of the output process is thus the mean of the input process scaled with the gain of the filter. Comparing (2.97) and (2.102), it is clear that we are here using the fact that z = e^{iω}. For this reason, the notation H(e^{iω}) is often also used for H(ω). We proceed to examine the covariance of the output

process, noting that the covariance of the output process can be expressed as

    r_y(t+k, t) = E{ y_{t+k} y_t^* } = E{ y_{t+k} Σ_{l=-∞}^{∞} x_l^* h_{t-l}^* }    (2.103)
                = Σ_{l=-∞}^{∞} h_{t-l}^* E{ y_{t+k} x_l^* }    (2.104)
                = Σ_{l=-∞}^{∞} h_{t-l}^* r_{y,x}(t+k, l)    (2.105)

where we with the notation r_y(t+k, t) and r_{y,x}(t+k, l) indicate that the auto-covariance and the cross-covariance may, possibly, not be WSS, and can therefore not be written as a function of only the time difference k. Expanding the cross-covariance in (2.105) as

    r_{y,x}(t+k, t) = E{ y_{t+k} x_t^* } = E{ Σ_{l=-∞}^{∞} h_l x_{t+k-l} x_t^* }    (2.106)
                    = Σ_{l=-∞}^{∞} h_l E{ x_{t+k-l} x_t^* }    (2.107)
                    = Σ_{l=-∞}^{∞} h_l r_x(k-l)    (2.108)

indicates that the cross-covariance, and, as a result of (2.105), the auto-covariance of the output process, only depend on the time difference k, thus indicating that y_t will also be WSS. Changing the summation index in (2.105) by setting m = t - l yields

    r_y(k) = Σ_{m=-∞}^{∞} h_m^* r_{y,x}(m + k),    (2.109)

which using (2.108) yields

    r_y(k) = Σ_{m=-∞}^{∞} Σ_{l=-∞}^{∞} h_m^* h_l r_x(m + k - l)    (2.110)
           = r_x(k) * h_k * h_{-k}^*    (2.111)

where * denotes the convolution operator, or, in the frequency domain,

    φ_y(ω) = |H(ω)|² φ_x(ω)    (2.112)

It is worth noting that this also implies that

    r_y(0) = Σ_{m=-∞}^{∞} Σ_{l=-∞}^{∞} h_m^* h_l r_x(m - l),    (2.113)

which, for finite length filters, say, of length n, implies that

    r_y(0) = σ_y² = h^* R_x h,    (2.114)

where

    h = [ h_0 ... h_{n-1} ]^T    (2.115)

and R_x is defined as in (2.65). We proceed to define:

Definition 2.6. The cross spectral density of the two stationary processes x_t and y_t is defined as the DFT of the cross-covariance function, i.e.,

    φ_{x,y}(ω) = Σ_{k=-∞}^{∞} r_{x,y}(k) e^{-iωk}    (2.116)

where r_{x,y}(k) is defined as in (2.46). In general, φ_{x,y}(ω) is complex-valued.

Definition 2.7. The (complex) coherence spectrum of the two stationary processes x_t and y_t is defined as

    C_{x,y}(ω) = φ_{x,y}(ω) / √( φ_x(ω) φ_y(ω) )    (2.117)

The coherence spectrum is generally complex-valued and is bounded as

    |C_{x,y}(ω)| ≤ 1,    (2.118)

with equality, for all ω, if and only if x_t and y_t are related as in (2.99). From (2.106)-(2.108), it is clear that the cross spectrum of the input and output is related via the so-called Wiener-Hopf equation

    φ_{x,y}(ω) = H(ω) φ_x(ω)    (2.119)

We have thus concluded the following:

When filtering the WSS process x_t through the stable linear time-invariant system h_k, the output y_t will satisfy:

    φ_y(ω) = |H(ω)|² φ_x(ω)
    φ_{x,y}(ω) = H(ω) φ_x(ω)
    m_y = m_x H(0)
    r_y(0) = h^* R_x h

with H(ω) and h defined in (2.102) and (2.115), respectively.

We will make good use of these important relations in the following.
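The boxed relations are straightforward to verify numerically. The following minimal Matlab sketch (with an arbitrary, illustrative FIR filter and a white input; none of the values are taken from the notes) compares the sample variance of the filter output with h^T R_x h, and a lag-one covariance estimate with the corresponding theoretical value.

% Numerical check of the filtering relations for a white WSS input.
% The FIR filter below is an arbitrary, illustrative choice.
h  = [1 0.5 -0.3 0.2].';                  % impulse response, length n = 4
N  = 1e5;  sx2 = 2;                       % sample size and input variance
x  = sqrt(sx2)*randn(N,1);                % white input, phi_x(w) = sx2
y  = filter(h, 1, x);                     % y_t = sum_k h_k x_{t-k}, cf. (2.99)
Rx = sx2*eye(length(h));                  % input covariance matrix, cf. (2.65)
[var(y)   h.'*Rx*h]                       % sample r_y(0) versus h'*R_x*h
ry1_hat    = sum( (y(2:N)-mean(y)).*(y(1:N-1)-mean(y)) )/N;   % biased estimate of r_y(1)
ry1_theory = sx2*sum( h(1:end-1).*h(2:end) );                 % sx2 * sum_m h_m h_{m+1}
[ry1_hat   ry1_theory]                    % the two should be close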

2.3.4 The moving average process

We proceed to define the first of two basic forms of linear filters, namely:

Definition 2.8. The process y_t is called a moving average (MA) process if

    y_t = e_t + c_1 e_{t-1} + ... + c_q e_{t-q} = C(z) e_t,    (2.120)

where C(z) is a monic polynomial of order q, i.e.,

    C(z) = 1 + c_1 z^{-1} + ... + c_q z^{-q},    (2.121)

where c_q ≠ 0, and e_t is a zero-mean white noise process with variance σ_e². The process is always stable, and is invertible if and only if all the zeros of the generating polynomial C(z) are strictly within the unit circle.

Figure 2.5: Generation of an MA(q)-process.

Figure 2.5 illustrates the generation of an MA(q)-process. As seen in (2.120), the generating polynomial, C(z),

    C(z) = 1 + c_1 z^{-1} + ... + c_q z^{-q} = Σ_{k=0}^{q} c_k z^{-k},    (2.122)

where c_0 = 1, allows us to express the MA(q) process as

    y_t = C(z) e_t,    (2.123)

which suggests that the transfer function of the corresponding (linear) filter is C(z). If the zeros of the polynomial C(z) are inside the unit circle, the polynomial is invertible, allowing one to form the inverse filter, i.e., one may form the (driving) noise process e_t as

    e_t = C^{-1}(z) y_t = D(z) y_t = Σ_{k=0}^{∞} d_k y_{t-k},    (2.124)

where D(z) is the inverse filter generating the noise process. It is worth noting that D(z) will generally have an infinite impulse response (IIR). From Definition 2.8, one can easily conclude that:

An MA(q)-process will satisfy

    m_y = E{ C(z) e_t } = 0    (2.125)

    r_y(k) = σ_e² Σ_{l=0}^{q-|k|} c_l c_{l+|k|}^*,  if |k| ≤ q,    and    r_y(k) = 0,  if |k| > q    (2.126)

    φ_y(ω) = σ_e² |C(e^{iω})|²    (2.127)

where C(e^{iω}) indicates that the polynomial has been evaluated at frequency ω, i.e., z = e^{iω}.

Example 2.9. Consider the (real-valued) MA(1)-process y_t = e_t + c_1 e_{t-1}, i.e., the process having the generating polynomial

    C(z) = 1 + c_1 z^{-1}.    (2.128)

Thus, if |c_1| < 1, the process is invertible. The ACF of y_t is (cf. (2.126))

    r_y(0) = σ_e² (1 + c_1²)    (2.129)
    r_y(1) = σ_e² c_1    (2.130)
    r_y(k) = 0, for k ≥ 2    (2.131)

To easily verify the above, as well as other similar cases, it is helpful to write out the covariances explicitly, i.e., for instance for r_y(1) (cf. (2.44))

    r_y(1) = E{ [e_t + c_1 e_{t-1}] [e_{t-1} + c_1 e_{t-2}] }    (2.132)
           = E{ e_t e_{t-1} + c_1 e_t e_{t-2} + c_1 e_{t-1} e_{t-1} + c_1² e_{t-1} e_{t-2} }    (2.133)
           = c_1 E{ e_{t-1} e_{t-1} } = c_1 σ_e².    (2.134)

Similarly, the PSD of y_t is

    φ_y(ω) = σ_e² |1 + c_1 e^{-iω}|² = σ_e² ( 1 + c_1² + 2 c_1 cos(ω) ),    (2.135)

for ω = 2πf, with -0.5 ≤ f ≤ 0.5.

It is worth stressing that for an MA(q)-process, r_y(k) = 0 for |k| > q. This insight allows for a way to identify if a measurement can be well modeled as an MA-process; if the estimated ACF is zero for lags higher than l, it may be reasonable to model the measurement as a realization of an MA(l)-process. We will return to this discussion further in Chapter 3.

Example 2.10. Consider the MA(4)-process formed using

    C(z) = 1 + 0.8 z^{-1} + 0.5 z^{-2} + 0.2 z^{-3} + 0.6 z^{-4}    (2.136)

Figure 2.6 illustrates a realization of this process together with the estimated correlation function, spectral density, and the roots of the C(z)-polynomial.
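A realization of this kind is easily generated; the following minimal Matlab sketch (an illustration only, not the code used to produce Figure 2.6, and with an arbitrarily chosen sample size) simulates the MA(4)-process in (2.136) and forms the estimated correlation function, which should be small for lags above q = 4.

% Simulation of the MA(4)-process in (2.136) and its estimated correlation function.
C  = [1 0.8 0.5 0.2 0.6];                % C(z) = 1 + 0.8 z^-1 + ... + 0.6 z^-4
N  = 500;  se2 = 1;                      % illustrative sample size and noise variance
e  = sqrt(se2)*randn(N,1);               % driving white noise
y  = filter(C, 1, e);                    % y_t = C(z) e_t, cf. (2.120)
y  = y - mean(y);
maxlag = 20;
rhat = zeros(maxlag+1,1);                % biased ACF estimate (2.56) for lags 0..20
for k = 0:maxlag
    rhat(k+1) = sum( y(k+1:N).*y(1:N-k) )/N;
end
rho = rhat/rhat(1);                      % estimated correlation function
[(0:maxlag).'  rho]                      % lags above q = 4 should be small, roughly
                                         % within the bounds given by Theorem 2.5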

Figure 2.6: The figure illustrates the MA(4)-process discussed in Example 2.10, with (a) showing a realization of the process, (b) the estimated correlation function, (c) the estimated power spectral density, and (d) the roots of the C(z)-polynomial.

These figures deserve some further comments. First, it is worth noting in Figure 2.6(b) that the estimated covariance function is not zero for lags higher than 4, as would be expected from (2.126). Similar to the discussion following Example 2.4, this is due to the difficulty of estimating r_y(k) accurately given a finite amount of data. This can be seen better in Figure 2.7, which shows a closer look at the correlation function in Figure 2.6(b), together with the corresponding confidence intervals as given by Theorem 2.5 (below). Secondly, in Figure 2.6(c), the periodogram estimate is plotted together with the true PSD. As can be seen in the figure, the spectrum contains two nulls, i.e., two frequencies for which the PSD has low power. This can also be seen from the location of the roots of the C(z)-polynomial. These roots are shown in Figure 2.6(d), and are (approximately) z_1 = -0.76 + 0.64i, z_2 = -0.76 - 0.64i, z_3 = 0.36 + 0.69i, and z_4 = 0.36 - 0.69i. If expressed using z = e^{iω}, with ω = 2πf, this corresponds to the frequencies f_1 = 0.39, f_2 = -0.39, f_3 = 0.17, and f_4 = -0.17. These frequencies can also be viewed as the angles of the

vectors pointing to the corresponding roots. Thus, the angle of the vector for the root corresponding to f_1, which is marked with an arrow in the figure, will be ω_1 = 2π · 0.39. An important insight is that the spectrum is nothing but |C(e^{iω})|² evaluated along the unit circle, which implies that the spectrum will have dips at the frequencies that correspond to the angles of the roots z_l, for l = 1, ..., 4. Moreover, the closer the actual root is to the unit circle, the deeper the null, with the spectrum being zero if the root is on the unit circle. Examining the root corresponding to f_1, we note that this root is closer to the unit circle as compared to z_3, and will thus exhibit a deeper null in the resulting spectrum as compared to the one at frequency f_3, just as we see in Figure 2.6(c).

Figure 2.7: A closer look at the estimated correlation function for the MA(4)-process in Example 2.10. This is a magnified version of Figure 2.6(b). The dashed lines correspond to the confidence interval given by Theorem 2.5.

It is often helpful to compute the roots of the generating polynomial as well as the angles of these roots. Using Matlab, this is easily done using the following lines of code:

C = [1 0.8 0.5 0.2 0.6];
f = angle( roots(C) )/pi/2

In Matlab, all polynomial coefficient vectors are indexed as starting at the z^0 coefficient. Thus, the first line will be interpreted by Matlab as forming the polynomial C(z) in (2.136). The second line will compute the roots of C(z), followed by finding the argument of the roots, and normalizing these arguments by 2π.

As noted in the above discussion, we also need to formulate a generalization of Theorem 2.4 for MA-processes. It can be shown that:

Theorem 2.5. Let y_t, for t = 1, ..., N, be a realization of an MA(q)-process.

If ρ̂_y(k) is estimated according to Definition 2.5, then asymptotically

    E{ ρ̂_y(k) } = 0    (2.137)
    V{ ρ̂_y(k) } = (1/N) ( 1 + 2( ρ̂_y²(1) + ... + ρ̂_y²(q) ) )    (2.138)

for k = q+1, q+2, .... Furthermore, ρ̂_y(k), for k > q, is asymptotically Normal distributed.

2.3.5 The autoregressive process

We proceed to define the second basic linear process, namely:

Definition 2.9. The process y_t is called an autoregressive (AR) process if

    A(z) y_t = y_t + a_1 y_{t-1} + ... + a_p y_{t-p} = e_t,    (2.139)

where A(z) is a monic polynomial of order p, i.e.,

    A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p},    (2.140)

where a_p ≠ 0, and e_t is a zero-mean white noise process with variance σ_e², being uncorrelated with y_{t-l}, for l > 0. The process is stationary (and thus an AR-process) if and only if all the zeros of the generating polynomial A(z) are strictly within the unit circle. An AR-process is always invertible.

Figure 2.8: Generation of an AR(p)-process.

Figure 2.8 illustrates the generation of an AR(p)-process. The mean of an AR-process is easily found by taking the expectation on both sides of (2.139), i.e.,

    E{ y_t + a_1 y_{t-1} + ... + a_p y_{t-p} } = E{ e_t } = 0.    (2.141)

Thus, m_y (1 + a_1 + ... + a_p) = m_y A(1) = 0, which implies that m_y = 0 as all the zeros of A(z) are strictly within the unit circle, implying that A(1) ≠ 0.
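Such a process is readily simulated by all-pole filtering of white noise. The following minimal Matlab sketch (using an arbitrary, stable example polynomial, not one taken from the notes) generates an AR(2) realization, checks that the zeros of A(z) lie strictly inside the unit circle, and verifies that the sample mean is close to zero.

% Generating an AR(p)-process by all-pole filtering of white noise, A(z) y_t = e_t.
% The polynomial below is an arbitrary, stable example.
A  = [1 -0.5 0.3];                       % A(z) = 1 - 0.5 z^-1 + 0.3 z^-2
abs(roots(A))                            % all moduli < 1, so the process is stationary
N  = 1e4;  se2 = 1;
e  = sqrt(se2)*randn(N,1);               % driving white noise
y  = filter(1, A, e);                    % y_t formed from A(z) y_t = e_t, cf. Definition 2.9
mean(y)                                  % close to zero, consistent with m_y = 0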

From (2.44) and (2.139), as well as m_y = 0, one may also find the covariance function of the process by post-multiplying the process with y_{t-k}^* and taking the expectation, i.e.,

    E{ e_t y_{t-k}^* } = E{ y_t y_{t-k}^* + a_1 y_{t-1} y_{t-k}^* + ... + a_p y_{t-p} y_{t-k}^* }    (2.142)
                       = r_y(k) + a_1 r_y(k-1) + ... + a_p r_y(k-p)    (2.143)

Since e_t is uncorrelated with y_{t-l}, for l > 0, E{ e_t y_{t-k}^* } = σ_e² δ_K(k), with δ_K(k) defined as in (2.67), implying that

    r_y(k) + a_1 r_y(k-1) + ... + a_p r_y(k-p) = σ_e² δ_K(k),    (2.144)

which is known as the Yule-Walker equations. Expressed in matrix form for k = 0, ..., n, (2.144) implies

    [ r_y(0)    r_y(-1)   ...  r_y(-n)
      r_y(1)    r_y(0)    ...  r_y(-n+1)
        ...       ...     ...    ...
      r_y(n)    r_y(n-1)  ...  r_y(0)   ] [ 1, a_1, ..., a_n ]^T = [ σ_e², 0, ..., 0 ]^T    (2.145)

Introducing

    θ = [ a_1 ... a_p ]^T    (2.146)

and, by using all but the first row of (2.145),

    [ r_y(0)    r_y(-1)   ...  r_y(-n+1)
        ...       ...     ...    ...
      r_y(n-1)  r_y(n-2)  ...  r_y(0)   ] [ a_1, ..., a_n ]^T = - [ r_y(1), ..., r_y(n) ]^T    (2.147)

or, with obvious definitions,

    r_n + R_n θ = 0,    (2.148)

implying that

    θ̂ = - R_n^{-1} r_n,    (2.149)

which directly yields an estimate of the AR coefficients. We will here refer to this as the Yule-Walker estimate of the AR-coefficients (see also Example 3.5). It is worth noting that R_n is a Toeplitz matrix, a fact that we will make good use of in the following.

Example 2.11. Consider the real-valued AR(1)-process formed using

    y_t + a_1 y_{t-1} = e_t.    (2.150)

Clearly, if |a_1| > 1, then y_t = e_t - a_1 y_{t-1} will grow exponentially as t grows, and y_t will therefore not be a stationary process, confirming that the roots of the A(z)-polynomial need to be strictly inside the unit circle for the process to be an AR-process. Using (2.144) implies that

    r_y(0) + a_1 r_y(1) = σ_e²    (2.151)
    r_y(1) + a_1 r_y(0) = 0    (2.152)

where we have exploited that r_y(k) = r_y^*(-k). Clearly, this allows us to estimate a_1 as a function of r_y(0) and r_y(1), i.e.,

    a_1 = - r_y(1) / r_y(0)    (2.153)
    σ_e² = r_y(0) + a_1 r_y(1) = ( r_y²(0) - r_y²(1) ) / r_y(0)    (2.154)

Alternatively, we may assume we know a_1 and instead solve for r_y(k), i.e.,

    r_y(0) = σ_e² / (1 - a_1²)    (2.155)
    r_y(1) = - a_1 r_y(0) = - a_1 σ_e² / (1 - a_1²).    (2.156)

As r_y(k) + a_1 r_y(k-1) = 0, we may extend this to

    r_y(k) = (-a_1)^{|k|} σ_e² / (1 - a_1²),    (2.157)

where we have again exploited the symmetry of r_y(k). The power spectrum of y_t can be found easily by expressing the process as formed by filtering a white noise with variance σ_e² through the first-order all-pole filter

    H(z) = 1 / ( 1 + a_1 z^{-1} ),    (2.158)

which, using (2.112), yields

    φ_y(ω) = σ_e² / ( [1 + a_1 e^{-iω}] [1 + a_1 e^{iω}] ) = σ_e² / ( 1 + a_1² + 2 a_1 cos ω )    (2.159)

From the expression of φ_y(ω), it is worth noting that the power in y_t will be concentrated at low frequencies if a_1 < 0, and the process is therefore referred to as a low-pass process, with the power being more concentrated close to ω = 0 if a_1 is closer to -1 (recall that |a_1| < 1 to ensure stability), whereas if a_1 > 0, the power will instead be concentrated at high frequencies, and the process is then called a high-pass process.

Generalizing the formulation of φ_y(ω) in the above example to an AR(p)-process, we find that:

An AR(p)-process will satisfy

    m_y = 0    (2.160)
    r_y(k) = σ_e² δ_K(k) - Σ_{l=1}^{p} a_l r_y(k-l)    (2.161)
    φ_y(ω) = σ_e² / |A(e^{iω})|²    (2.162)

with A(e^{iω}) indicating that the polynomial has been evaluated at frequency ω, i.e., z = e^{iω}.

Figure 2.9: The figure illustrates the AR(4)-process discussed in Example 2.12, with (a) showing a realization of the process, (b) the estimated correlation function, (c) the estimated power spectral density, and (d) the roots of the A(z)-polynomial.

Example 2.12. Consider the AR(4)-process formed using

    A(z) = 1 + 0.4 z^{-1} + 0.4 z^{-2} + 0.7 z^{-3} + 0.6 z^{-4}    (2.163)

Figure 2.9 illustrates a realization of this process together with the estimated

correlation function, spectral density, and the roots of the A(z)-polynomial.

Figure 2.10: (a) The roots of the estimated A(z)-polynomial, as well as (b) the estimated power spectral density for the signal in Example 2.3.

2.3.6 The Levinson-Durbin algorithm

In this section, we will discuss a computationally efficient method for computing the Yule-Walker estimate of the AR-coefficients, as given in (2.149). The presentation here follows the one in [4]. The computation of (2.149), as stated, is computationally expensive, requiring about O(n³) operations, meaning that the cost can be written as c_1 n³ + c_2 n² + c_3 n + c_4, for some constants c_l, for l = 1, ..., 4, i.e., the operation has a complexity of order n³. Fortunately, this complexity can be drastically reduced by exploiting the Toeplitz structure of the covariance matrix. Recall the Yule-Walker equations in (2.145),

    [ r_y(0)    r_y(-1)   ...  r_y(-n)
      r_y(1)    r_y(0)    ...  r_y(-n+1)
        ...       ...     ...    ...
      r_y(n)    r_y(n-1)  ...  r_y(0)   ] [ 1, a_1, ..., a_n ]^T = [ σ_n², 0, ..., 0 ]^T    (2.164)

or, using matrix notation,

    R_{n+1} [ 1, θ_n^T ]^T = [ σ_n², 0^T ]^T    (2.165)

where 0 denotes a column vector with zero elements of appropriate dimension, and where we now use the notation σ_n² and θ_n in place of σ_e² and θ, respectively, to stress the order n of the nested structure. Using this structure, we may form

the vector

    R_{n+2} [ 1, θ_n^T, 0 ]^T = [ σ_n², 0^T, α_n ]^T    (2.166)

where θ̃_n and r̃_n indicate that the corresponding vectors have been ordered in the opposite direction, i.e., (cf. (2.147)-(2.148))

    r̃_n = [ r_y(n) ... r_y(1) ]^T    (2.167)

and where

    α_n = r_y(n+1) + r̃_n^T θ_n    (2.168)

is obtained from the bottom row. Thus, if α_n could be nulled, (2.166) would be the counterpart of (2.165), with n increased by one. To achieve this, we introduce the reflection coefficient k_{n+1}, defined as

    k_{n+1} = - α_n / σ_n²    (2.169)

and form

    R_{n+2} ( [ 1, θ_n^T, 0 ]^T + k_{n+1} [ 0, θ̃_n^T, 1 ]^T ) = [ σ_n² + k_{n+1} α_n, 0^T, α_n + k_{n+1} σ_n² ]^T = [ σ_n² + k_{n+1} α_n, 0^T, 0 ]^T    (2.170)

where we have made use of the fact that, for any Hermitian Toeplitz matrix R,

    y = Rx  implies  ỹ = R x̃    (2.171)

where, as before, x̃ indicates that the vector x has been ordered in the opposite direction. The expression in (2.170) has the same form as (2.165), with n increased by one, i.e.,

    R_{n+2} [ 1, θ_{n+1}^T ]^T = [ σ_{n+1}², 0^T ]^T    (2.172)

This suggests that we may compute an order-recursive estimate of θ as

    θ_{n+1} = [ θ_n^T, 0 ]^T + k_{n+1} [ θ̃_n^T, 1 ]^T    (2.173)
    σ_{n+1}² = σ_n² ( 1 - |k_{n+1}|² )    (2.174)

The initialization is straightforward, and the algorithm can be summarized as:

The Levinson-Durbin algorithm

Initialization:

    θ_1 = - r_y(1) / r_y(0) = k_1
    σ_1² = r_y(0) - |r_y(1)|² / r_y(0)

Then, for iteration n = 1, ..., n_max,

    k_{n+1} = - ( r_y(n+1) + r̃_n^T θ_n ) / σ_n²
    σ_{n+1}² = σ_n² ( 1 - |k_{n+1}|² )
    θ_{n+1} = [ θ_n^T, 0 ]^T + k_{n+1} [ θ̃_n^T, 1 ]^T

As can be seen from the table, the Levinson-Durbin algorithm will reduce the complexity of computing θ to O(n²) operations, which is a substantial computational reduction, particularly important for larger values of n. One should note that an estimate of r_y(k) is needed prior to computing the Levinson-Durbin estimate. As this is also a relatively computationally expensive estimate, an algorithm that could estimate θ directly from the measurements y_t, without the need of first computing r_y(k), would clearly be preferable. Such algorithms exist and work exceedingly well (see, e.g., Example 3.5). The most well-known of these are the so-called Burg algorithm [6] and the modified covariance method [7], where the latter is generally perceived to be the method of choice for estimating θ both efficiently and accurately. If using Matlab, these estimates can be found by using the functions arburg and armcov. The interested reader is referred to [4; 7] for a further discussion of these algorithms. It is also worth stressing that the Levinson-Durbin algorithm will produce an exact solution of (2.149). If one allows for an approximate solution, one can achieve further substantial computational reductions using, for instance, the preconditioned conjugate gradient algorithm, which only requires O(2n log(2n)) operations (see, e.g., [8]).

2.3.7 ARMA, ARIMA, and SARIMA processes

We now proceed to combine the two basic processes to form an ARMA process:

Definition 2.10. The process y_t is called an autoregressive moving average (ARMA) process if

    A(z) y_t = C(z) e_t,    (2.175)

where A(z) and C(z) are monic polynomials of order p and q, respectively, i.e.,

    A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p}    (2.176)
    C(z) = 1 + c_1 z^{-1} + ... + c_q z^{-q}    (2.177)
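Before continuing with the ARMA model, it is worth illustrating the Yule-Walker estimate (2.149) and the Levinson-Durbin recursion of Section 2.3.6 in code. The following minimal Matlab sketch (real-valued data, an arbitrary AR(2) test polynomial, and a plain loop implementation; in practice, functions such as arburg or armcov are typically preferred, as noted above) estimates the AR coefficients from simulated data.

% A minimal Levinson-Durbin implementation of the Yule-Walker estimate (2.149),
% sketched for real-valued data.  The AR(2) test polynomial is an arbitrary example.
N  = 1000;  A0 = [1 -0.5 0.3];           % sample size and true (stable) A(z)
y  = filter(1, A0, randn(N,1));          % simulated AR data (illustrative only)
p  = 2;                                  % model order
r  = zeros(p+1,1);
for k = 0:p
    r(k+1) = sum( y(k+1:N).*y(1:N-k) )/N;         % biased ACF estimate (2.56)
end
theta = -r(2)/r(1);  sig2 = r(1) - r(2)^2/r(1);   % initialization (order 1)
for n = 1:p-1                                     % recursion up to order p
    alpha = r(n+2) + r(n+1:-1:2).'*theta;         % alpha_n = r_y(n+1) + rtilde_n' theta_n
    k     = -alpha/sig2;                          % reflection coefficient k_{n+1}
    theta = [theta; 0] + k*[flipud(theta); 1];    % order update of the AR coefficients
    sig2  = sig2*(1 - k^2);                       % noise variance update
end
[1 theta.'; A0]                          % estimated [1 a_1 ... a_p] versus the true A(z)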


More information

Part III Example Sheet 1 - Solutions YC/Lent 2015 Comments and corrections should be ed to

Part III Example Sheet 1 - Solutions YC/Lent 2015 Comments and corrections should be  ed to TIME SERIES Part III Example Sheet 1 - Solutions YC/Lent 2015 Comments and corrections should be emailed to Y.Chen@statslab.cam.ac.uk. 1. Let {X t } be a weakly stationary process with mean zero and let

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Elements of Multivariate Time Series Analysis

Elements of Multivariate Time Series Analysis Gregory C. Reinsel Elements of Multivariate Time Series Analysis Second Edition With 14 Figures Springer Contents Preface to the Second Edition Preface to the First Edition vii ix 1. Vector Time Series

More information

Lecture 1: Fundamental concepts in Time Series Analysis (part 2)

Lecture 1: Fundamental concepts in Time Series Analysis (part 2) Lecture 1: Fundamental concepts in Time Series Analysis (part 2) Florian Pelgrin University of Lausanne, École des HEC Department of mathematics (IMEA-Nice) Sept. 2011 - Jan. 2012 Florian Pelgrin (HEC)

More information

3F1 Random Processes Examples Paper (for all 6 lectures)

3F1 Random Processes Examples Paper (for all 6 lectures) 3F Random Processes Examples Paper (for all 6 lectures). Three factories make the same electrical component. Factory A supplies half of the total number of components to the central depot, while factories

More information

Notes on Random Processes

Notes on Random Processes otes on Random Processes Brian Borchers and Rick Aster October 27, 2008 A Brief Review of Probability In this section of the course, we will work with random variables which are denoted by capital letters,

More information

A time series is called strictly stationary if the joint distribution of every collection (Y t

A time series is called strictly stationary if the joint distribution of every collection (Y t 5 Time series A time series is a set of observations recorded over time. You can think for example at the GDP of a country over the years (or quarters) or the hourly measurements of temperature over a

More information

ECE531 Lecture 12: Linear Estimation and Causal Wiener-Kolmogorov Filtering

ECE531 Lecture 12: Linear Estimation and Causal Wiener-Kolmogorov Filtering ECE531 Lecture 12: Linear Estimation and Causal Wiener-Kolmogorov Filtering D. Richard Brown III Worcester Polytechnic Institute 16-Apr-2009 Worcester Polytechnic Institute D. Richard Brown III 16-Apr-2009

More information

Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering. Stochastic Processes and Linear Algebra Recap Slides

Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering. Stochastic Processes and Linear Algebra Recap Slides Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering Stochastic Processes and Linear Algebra Recap Slides Stochastic processes and variables XX tt 0 = XX xx nn (tt) xx 2 (tt) XX tt XX

More information

Chapter 4: Models for Stationary Time Series

Chapter 4: Models for Stationary Time Series Chapter 4: Models for Stationary Time Series Now we will introduce some useful parametric models for time series that are stationary processes. We begin by defining the General Linear Process. Let {Y t

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

5: MULTIVARATE STATIONARY PROCESSES

5: MULTIVARATE STATIONARY PROCESSES 5: MULTIVARATE STATIONARY PROCESSES 1 1 Some Preliminary Definitions and Concepts Random Vector: A vector X = (X 1,..., X n ) whose components are scalarvalued random variables on the same probability

More information

Machine Learning. A Bayesian and Optimization Perspective. Academic Press, Sergios Theodoridis 1. of Athens, Athens, Greece.

Machine Learning. A Bayesian and Optimization Perspective. Academic Press, Sergios Theodoridis 1. of Athens, Athens, Greece. Machine Learning A Bayesian and Optimization Perspective Academic Press, 2015 Sergios Theodoridis 1 1 Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens,

More information

Reliability and Risk Analysis. Time Series, Types of Trend Functions and Estimates of Trends

Reliability and Risk Analysis. Time Series, Types of Trend Functions and Estimates of Trends Reliability and Risk Analysis Stochastic process The sequence of random variables {Y t, t = 0, ±1, ±2 } is called the stochastic process The mean function of a stochastic process {Y t} is the function

More information

Adaptive Filtering. Squares. Alexander D. Poularikas. Fundamentals of. Least Mean. with MATLABR. University of Alabama, Huntsville, AL.

Adaptive Filtering. Squares. Alexander D. Poularikas. Fundamentals of. Least Mean. with MATLABR. University of Alabama, Huntsville, AL. Adaptive Filtering Fundamentals of Least Mean Squares with MATLABR Alexander D. Poularikas University of Alabama, Huntsville, AL CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is

More information

X random; interested in impact of X on Y. Time series analogue of regression.

X random; interested in impact of X on Y. Time series analogue of regression. Multiple time series Given: two series Y and X. Relationship between series? Possible approaches: X deterministic: regress Y on X via generalized least squares: arima.mle in SPlus or arima in R. We have

More information

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES

LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES Abstract March, 3 Mads Græsbøll Christensen Audio Analysis Lab, AD:MT Aalborg University This document contains a brief introduction to pitch

More information

Ross Bettinger, Analytical Consultant, Seattle, WA

Ross Bettinger, Analytical Consultant, Seattle, WA ABSTRACT DYNAMIC REGRESSION IN ARIMA MODELING Ross Bettinger, Analytical Consultant, Seattle, WA Box-Jenkins time series models that contain exogenous predictor variables are called dynamic regression

More information

Time Series Analysis. Solutions to problems in Chapter 5 IMM

Time Series Analysis. Solutions to problems in Chapter 5 IMM Time Series Analysis Solutions to problems in Chapter 5 IMM Solution 5.1 Question 1. [ ] V [X t ] = V [ǫ t + c(ǫ t 1 + ǫ t + )] = 1 + c 1 σǫ = The variance of {X t } is not limited and therefore {X t }

More information

Some Time-Series Models

Some Time-Series Models Some Time-Series Models Outline 1. Stochastic processes and their properties 2. Stationary processes 3. Some properties of the autocorrelation function 4. Some useful models Purely random processes, random

More information

Statistical Signal Processing Detection, Estimation, and Time Series Analysis

Statistical Signal Processing Detection, Estimation, and Time Series Analysis Statistical Signal Processing Detection, Estimation, and Time Series Analysis Louis L. Scharf University of Colorado at Boulder with Cedric Demeure collaborating on Chapters 10 and 11 A TT ADDISON-WESLEY

More information

On Moving Average Parameter Estimation

On Moving Average Parameter Estimation On Moving Average Parameter Estimation Niclas Sandgren and Petre Stoica Contact information: niclas.sandgren@it.uu.se, tel: +46 8 473392 Abstract Estimation of the autoregressive moving average (ARMA)

More information

Time Series Examples Sheet

Time Series Examples Sheet Lent Term 2001 Richard Weber Time Series Examples Sheet This is the examples sheet for the M. Phil. course in Time Series. A copy can be found at: http://www.statslab.cam.ac.uk/~rrw1/timeseries/ Throughout,

More information

E 4101/5101 Lecture 6: Spectral analysis

E 4101/5101 Lecture 6: Spectral analysis E 4101/5101 Lecture 6: Spectral analysis Ragnar Nymoen 3 March 2011 References to this lecture Hamilton Ch 6 Lecture note (on web page) For stationary variables/processes there is a close correspondence

More information

Lesson 1. Optimal signalbehandling LTH. September Statistical Digital Signal Processing and Modeling, Hayes, M:

Lesson 1. Optimal signalbehandling LTH. September Statistical Digital Signal Processing and Modeling, Hayes, M: Lesson 1 Optimal Signal Processing Optimal signalbehandling LTH September 2013 Statistical Digital Signal Processing and Modeling, Hayes, M: John Wiley & Sons, 1996. ISBN 0471594318 Nedelko Grbic Mtrl

More information

STAT Financial Time Series

STAT Financial Time Series STAT 6104 - Financial Time Series Chapter 4 - Estimation in the time Domain Chun Yip Yau (CUHK) STAT 6104:Financial Time Series 1 / 46 Agenda 1 Introduction 2 Moment Estimates 3 Autoregressive Models (AR

More information

Next tool is Partial ACF; mathematical tools first. The Multivariate Normal Distribution. e z2 /2. f Z (z) = 1 2π. e z2 i /2

Next tool is Partial ACF; mathematical tools first. The Multivariate Normal Distribution. e z2 /2. f Z (z) = 1 2π. e z2 i /2 Next tool is Partial ACF; mathematical tools first. The Multivariate Normal Distribution Defn: Z R 1 N(0,1) iff f Z (z) = 1 2π e z2 /2 Defn: Z R p MV N p (0, I) if and only if Z = (Z 1,..., Z p ) (a column

More information

1 Linear Difference Equations

1 Linear Difference Equations ARMA Handout Jialin Yu 1 Linear Difference Equations First order systems Let {ε t } t=1 denote an input sequence and {y t} t=1 sequence generated by denote an output y t = φy t 1 + ε t t = 1, 2,... with

More information

Difference equations. Definitions: A difference equation takes the general form. x t f x t 1,,x t m.

Difference equations. Definitions: A difference equation takes the general form. x t f x t 1,,x t m. Difference equations Definitions: A difference equation takes the general form x t fx t 1,x t 2, defining the current value of a variable x as a function of previously generated values. A finite order

More information

On Input Design for System Identification

On Input Design for System Identification On Input Design for System Identification Input Design Using Markov Chains CHIARA BRIGHENTI Masters Degree Project Stockholm, Sweden March 2009 XR-EE-RT 2009:002 Abstract When system identification methods

More information

Stochastic Processes. A stochastic process is a function of two variables:

Stochastic Processes. A stochastic process is a function of two variables: Stochastic Processes Stochastic: from Greek stochastikos, proceeding by guesswork, literally, skillful in aiming. A stochastic process is simply a collection of random variables labelled by some parameter:

More information

Parametric Signal Modeling and Linear Prediction Theory 1. Discrete-time Stochastic Processes (cont d)

Parametric Signal Modeling and Linear Prediction Theory 1. Discrete-time Stochastic Processes (cont d) Parametric Signal Modeling and Linear Prediction Theory 1. Discrete-time Stochastic Processes (cont d) Electrical & Computer Engineering North Carolina State University Acknowledgment: ECE792-41 slides

More information

Adaptive Systems Homework Assignment 1

Adaptive Systems Homework Assignment 1 Signal Processing and Speech Communication Lab. Graz University of Technology Adaptive Systems Homework Assignment 1 Name(s) Matr.No(s). The analytical part of your homework (your calculation sheets) as

More information

Econ 623 Econometrics II Topic 2: Stationary Time Series

Econ 623 Econometrics II Topic 2: Stationary Time Series 1 Introduction Econ 623 Econometrics II Topic 2: Stationary Time Series In the regression model we can model the error term as an autoregression AR(1) process. That is, we can use the past value of the

More information

Part III Spectrum Estimation

Part III Spectrum Estimation ECE79-4 Part III Part III Spectrum Estimation 3. Parametric Methods for Spectral Estimation Electrical & Computer Engineering North Carolina State University Acnowledgment: ECE79-4 slides were adapted

More information

Linear Stochastic Models. Special Types of Random Processes: AR, MA, and ARMA. Digital Signal Processing

Linear Stochastic Models. Special Types of Random Processes: AR, MA, and ARMA. Digital Signal Processing Linear Stochastic Models Special Types of Random Processes: AR, MA, and ARMA Digital Signal Processing Department of Electrical and Electronic Engineering, Imperial College d.mandic@imperial.ac.uk c Danilo

More information

Multivariate ARMA Processes

Multivariate ARMA Processes LECTURE 8 Multivariate ARMA Processes A vector y(t) of n elements is said to follow an n-variate ARMA process of orders p and q if it satisfies the equation (1) A 0 y(t) + A 1 y(t 1) + + A p y(t p) = M

More information

Subspace Identification

Subspace Identification Chapter 10 Subspace Identification Given observations of m 1 input signals, and p 1 signals resulting from those when fed into a dynamical system under study, can we estimate the internal dynamics regulating

More information

CONTENTS NOTATIONAL CONVENTIONS GLOSSARY OF KEY SYMBOLS 1 INTRODUCTION 1

CONTENTS NOTATIONAL CONVENTIONS GLOSSARY OF KEY SYMBOLS 1 INTRODUCTION 1 DIGITAL SPECTRAL ANALYSIS WITH APPLICATIONS S.LAWRENCE MARPLE, JR. SUMMARY This new book provides a broad perspective of spectral estimation techniques and their implementation. It concerned with spectral

More information

LECTURE 10 LINEAR PROCESSES II: SPECTRAL DENSITY, LAG OPERATOR, ARMA. In this lecture, we continue to discuss covariance stationary processes.

LECTURE 10 LINEAR PROCESSES II: SPECTRAL DENSITY, LAG OPERATOR, ARMA. In this lecture, we continue to discuss covariance stationary processes. MAY, 0 LECTURE 0 LINEAR PROCESSES II: SPECTRAL DENSITY, LAG OPERATOR, ARMA In this lecture, we continue to discuss covariance stationary processes. Spectral density Gourieroux and Monfort 990), Ch. 5;

More information

EC402: Serial Correlation. Danny Quah Economics Department, LSE Lent 2015

EC402: Serial Correlation. Danny Quah Economics Department, LSE Lent 2015 EC402: Serial Correlation Danny Quah Economics Department, LSE Lent 2015 OUTLINE 1. Stationarity 1.1 Covariance stationarity 1.2 Explicit Models. Special cases: ARMA processes 2. Some complex numbers.

More information

MGR-815. Notes for the MGR-815 course. 12 June School of Superior Technology. Professor Zbigniew Dziong

MGR-815. Notes for the MGR-815 course. 12 June School of Superior Technology. Professor Zbigniew Dziong Modeling, Estimation and Control, for Telecommunication Networks Notes for the MGR-815 course 12 June 2010 School of Superior Technology Professor Zbigniew Dziong 1 Table of Contents Preface 5 1. Example

More information

Linear models. Chapter Overview. Linear process: A process {X n } is a linear process if it has the representation.

Linear models. Chapter Overview. Linear process: A process {X n } is a linear process if it has the representation. Chapter 2 Linear models 2.1 Overview Linear process: A process {X n } is a linear process if it has the representation X n = b j ɛ n j j=0 for all n, where ɛ n N(0, σ 2 ) (Gaussian distributed with zero

More information

Chapter 9: Forecasting

Chapter 9: Forecasting Chapter 9: Forecasting One of the critical goals of time series analysis is to forecast (predict) the values of the time series at times in the future. When forecasting, we ideally should evaluate the

More information

(a)

(a) Chapter 8 Subspace Methods 8. Introduction Principal Component Analysis (PCA) is applied to the analysis of time series data. In this context we discuss measures of complexity and subspace methods for

More information

ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions

ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions Problem Solutions : Yates and Goodman, 9.5.3 9.1.4 9.2.2 9.2.6 9.3.2 9.4.2 9.4.6 9.4.7 and Problem 9.1.4 Solution The joint PDF of X and Y

More information

Statistics of stochastic processes

Statistics of stochastic processes Introduction Statistics of stochastic processes Generally statistics is performed on observations y 1,..., y n assumed to be realizations of independent random variables Y 1,..., Y n. 14 settembre 2014

More information

New Introduction to Multiple Time Series Analysis

New Introduction to Multiple Time Series Analysis Helmut Lütkepohl New Introduction to Multiple Time Series Analysis With 49 Figures and 36 Tables Springer Contents 1 Introduction 1 1.1 Objectives of Analyzing Multiple Time Series 1 1.2 Some Basics 2

More information

Estimation Theory Fredrik Rusek. Chapters

Estimation Theory Fredrik Rusek. Chapters Estimation Theory Fredrik Rusek Chapters 3.5-3.10 Recap We deal with unbiased estimators of deterministic parameters Performance of an estimator is measured by the variance of the estimate (due to the

More information

Univariate Time Series Analysis; ARIMA Models

Univariate Time Series Analysis; ARIMA Models Econometrics 2 Fall 24 Univariate Time Series Analysis; ARIMA Models Heino Bohn Nielsen of4 Outline of the Lecture () Introduction to univariate time series analysis. (2) Stationarity. (3) Characterizing

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

Classical Decomposition Model Revisited: I

Classical Decomposition Model Revisited: I Classical Decomposition Model Revisited: I recall classical decomposition model for time series Y t, namely, Y t = m t + s t + W t, where m t is trend; s t is periodic with known period s (i.e., s t s

More information

where r n = dn+1 x(t)

where r n = dn+1 x(t) Random Variables Overview Probability Random variables Transforms of pdfs Moments and cumulants Useful distributions Random vectors Linear transformations of random vectors The multivariate normal distribution

More information

Definition of a Stochastic Process

Definition of a Stochastic Process Definition of a Stochastic Process Balu Santhanam Dept. of E.C.E., University of New Mexico Fax: 505 277 8298 bsanthan@unm.edu August 26, 2018 Balu Santhanam (UNM) August 26, 2018 1 / 20 Overview 1 Stochastic

More information

Practical Spectral Estimation

Practical Spectral Estimation Digital Signal Processing/F.G. Meyer Lecture 4 Copyright 2015 François G. Meyer. All Rights Reserved. Practical Spectral Estimation 1 Introduction The goal of spectral estimation is to estimate how the

More information

Stochastic Processes: I. consider bowl of worms model for oscilloscope experiment:

Stochastic Processes: I. consider bowl of worms model for oscilloscope experiment: Stochastic Processes: I consider bowl of worms model for oscilloscope experiment: SAPAscope 2.0 / 0 1 RESET SAPA2e 22, 23 II 1 stochastic process is: Stochastic Processes: II informally: bowl + drawing

More information

7. MULTIVARATE STATIONARY PROCESSES

7. MULTIVARATE STATIONARY PROCESSES 7. MULTIVARATE STATIONARY PROCESSES 1 1 Some Preliminary Definitions and Concepts Random Vector: A vector X = (X 1,..., X n ) whose components are scalar-valued random variables on the same probability

More information

Statistical and Adaptive Signal Processing

Statistical and Adaptive Signal Processing r Statistical and Adaptive Signal Processing Spectral Estimation, Signal Modeling, Adaptive Filtering and Array Processing Dimitris G. Manolakis Massachusetts Institute of Technology Lincoln Laboratory

More information

Gaussian processes. Basic Properties VAG002-

Gaussian processes. Basic Properties VAG002- Gaussian processes The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity

More information

DOA Estimation using MUSIC and Root MUSIC Methods

DOA Estimation using MUSIC and Root MUSIC Methods DOA Estimation using MUSIC and Root MUSIC Methods EE602 Statistical signal Processing 4/13/2009 Presented By: Chhavipreet Singh(Y515) Siddharth Sahoo(Y5827447) 2 Table of Contents 1 Introduction... 3 2

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Akaike criterion: Kullback-Leibler discrepancy

Akaike criterion: Kullback-Leibler discrepancy Model choice. Akaike s criterion Akaike criterion: Kullback-Leibler discrepancy Given a family of probability densities {f ( ; ψ), ψ Ψ}, Kullback-Leibler s index of f ( ; ψ) relative to f ( ; θ) is (ψ

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

ARIMA Modelling and Forecasting

ARIMA Modelling and Forecasting ARIMA Modelling and Forecasting Economic time series often appear nonstationary, because of trends, seasonal patterns, cycles, etc. However, the differences may appear stationary. Δx t x t x t 1 (first

More information

Spectral Analysis. Jesús Fernández-Villaverde University of Pennsylvania

Spectral Analysis. Jesús Fernández-Villaverde University of Pennsylvania Spectral Analysis Jesús Fernández-Villaverde University of Pennsylvania 1 Why Spectral Analysis? We want to develop a theory to obtain the business cycle properties of the data. Burns and Mitchell (1946).

More information

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University

A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University Lecture 19 Modeling Topics plan: Modeling (linear/non- linear least squares) Bayesian inference Bayesian approaches to spectral esbmabon;

More information

Autoregressive Moving Average (ARMA) Models and their Practical Applications

Autoregressive Moving Average (ARMA) Models and their Practical Applications Autoregressive Moving Average (ARMA) Models and their Practical Applications Massimo Guidolin February 2018 1 Essential Concepts in Time Series Analysis 1.1 Time Series and Their Properties Time series:

More information