Stochastic Processes (Master degree in Engineering) Franco Flandoli


Contents

Preface

Chapter 1. Preliminaries of Probability
 1. Transformation of densities
 2. About covariance matrices
 3. Gaussian vectors

Chapter 2. Stochastic processes. Generalities
 1. Discrete time stochastic process
 2. Stationary processes
 3. Time series and empirical quantities
 4. Gaussian processes
 5. Discrete time Fourier transform
 6. Power spectral density
 7. Fundamental theorem on PSD
 8. Signal to noise ratio
 9. An ergodic theorem

Chapter 3. ARIMA models
 1. Definitions
 2. Stationarity, ARMA and ARIMA processes
 3. Correlation function
 4. Power spectral density

Preface

These notes are planned to be the last part of a course on Probability and Stochastic Processes. The first part is devoted to the introduction of the following topics, taken for instance from the book of Baldi (in Italian) or Billingsley (in English):

- Probability space (Ω, F, P)
- Conditional probability and independence of events
- Factorization formula and Bayes formula
- Concept of random variable, random vector X = (X_1, ..., X_n)
- Law of a r.v., probability density (discrete and continuous)
- Distribution function and quantiles
- Joint law of a vector and marginal laws, and their relations
- (Transformation of densities and moments) (see complements below)
- Expectation and its properties
- Moments, variance, standard deviation, properties
- Covariance and correlation coefficient, covariance matrix
- Generating function and characteristic function
- (Discrete r.v.: Bernoulli, binomial, Poisson, geometric)
- Continuous r.v.: uniform, exponential, Gaussian, Weibull, Gamma
- Notions of convergence of r.v.
- (Limit theorems: LLN, CLT; Chebyshev inequality)

Since we need some more specialized material, Chapter 1 is a complement to this list of items.

CHAPTER 1

Preliminaries of Probability

1. Transformation of densities

Exercise 1. If X has cdf F(x) and g is increasing and continuous, then Y = g(X) has cdf

F_Y(y) = F(g^{-1}(y))

for all y in the image of g. If g is decreasing and continuous, the formula is F_Y(y) = 1 − F(g^{-1}(y)).

Exercise 2. If X has continuous pdf f(x) and g is increasing and differentiable, then Y = g(X) has pdf

f_Y(y) = f(g^{-1}(y)) / g'(g^{-1}(y)) = [f(x) / g'(x)] evaluated at x = g^{-1}(y)

for all y in the image of g. If g is decreasing and differentiable, the formula is f_Y(y) = −[f(x) / g'(x)] evaluated at x = g^{-1}(y) (note that g' < 0 in this case).

Thus, in general, we have the following result.

Proposition 1. If g is monotone and differentiable, the transformation of densities is given by

f_Y(y) = [f(x) / |g'(x)|] evaluated at x = g^{-1}(y).

Remark 1. Under proper assumptions, when g is not injective the formula generalizes to

f_Y(y) = Σ_{x : g(x) = y} f(x) / |g'(x)|.

Remark 2. A second proof of the previous formula comes from the following characterization of the density: f is the density of X if and only if

E[h(X)] = ∫_R h(x) f(x) dx

for all continuous bounded functions h. Let us use this fact to prove that f_Y(y) = [f(x)/|g'(x)|] evaluated at x = g^{-1}(y) is the density of Y = g(X). Let us compute E[h(Y)] for a generic continuous bounded function h. We

have, from the definition of Y and from the characterization applied to X,

E[h(Y)] = E[h(g(X))] = ∫_R h(g(x)) f(x) dx.

Let us change variable y = g(x), under the assumption that g is monotone, bijective and differentiable. We have x = g^{-1}(y) and dx = dy / |g'(g^{-1}(y))| (we put the absolute value since we do not change the extremes of integration, but just rewrite ∫_R as ∫_R), so that

∫_R h(g(x)) f(x) dx = ∫_R h(y) f(g^{-1}(y)) / |g'(g^{-1}(y))| dy.

If we set f_Y(y) := [f(x)/|g'(x)|] evaluated at x = g^{-1}(y), we have proved that

E[h(Y)] = ∫_R h(y) f_Y(y) dy

for every continuous bounded function h. By the characterization, this implies that f_Y(y) is the density of Y. This proof is thus based on the change of variable formula.

Remark 3. The same proof works in the multidimensional case, using the change of variable formula for multiple integrals. Recall that in place of dy = g'(x) dx one has to use dy = |det Dg(x)| dx, where Dg is the Jacobian (the matrix of first derivatives) of the transformation g : R^n → R^n. In fact we need the inverse transformation, so we use the corresponding formula

dx = |det Dg^{-1}(y)| dy = dy / |det Dg(g^{-1}(y))|.

With the same passages performed above, one gets the following result.

Proposition 2. If g is a differentiable bijection and Y = g(X), then

f_Y(y) = [f(x) / |det Dg(x)|] evaluated at x = g^{-1}(y).

Exercise 3. If X (in R^n) has density f(x) and Y = UX, where U is an orthogonal linear transformation of R^n (it means that U^{-1} = U^T), then Y has density

f_Y(y) = f(U^T y).

1.1. Linear transformation of moments. The solution of the following exercises is based on the linearity of the expected value (and thus of the covariance in each argument).

Exercise 4. Let X = (X_1, ..., X_n) be a random vector, A a d × n matrix, and Y = AX. Let µ = (µ_1, ..., µ_n) be the vector of mean values of X, namely µ_i = E[X_i]. Then µ_Y := Aµ is the vector of mean values of Y, namely (µ_Y)_i = E[Y_i].

Exercise 5. Under the same assumptions, if Q_X and Q_Y are the covariance matrices of X and Y, then

Q_Y = A Q_X A^T.
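As a concrete check of the one-dimensional transformation formula, one can take X ~ N(0, 1) and g(x) = e^x, so that Y = g(X) should have density f(g^{-1}(y))/|g'(g^{-1}(y))| = φ(log y)/y (the lognormal density). The short Python sketch below (the notes' own examples use R; Python is used here only for illustration) compares this with a numerical derivative of the exact cdf F_Y(y) = Φ(log y):

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# g(x) = exp(x): increasing, differentiable, g^{-1}(y) = log y, g'(x) = exp(x).
# Proposition 1 then gives, for Y = g(X) with X ~ N(0, 1):
#     f_Y(y) = phi(log y) / y        (the lognormal density)
def f_Y(y):
    x = math.log(y)                  # x = g^{-1}(y)
    return phi(x) / math.exp(x)      # f(x) / |g'(x)| at x = g^{-1}(y)

# Independent check: differentiate the exact cdf F_Y(y) = Phi(log y) numerically.
y, h = 2.0, 1e-5
numeric = (Phi(math.log(y + h)) - Phi(math.log(y - h))) / (2 * h)
print(abs(f_Y(y) - numeric) < 1e-6)  # prints True: the two agree
```

The same comparison can be repeated at any point of the image of g; the agreement is up to the O(h²) error of the central difference.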

2. About covariance matrices

The covariance matrix Q of a vector X = (X_1, ..., X_n), defined as Q_ij = Cov(X_i, X_j), is symmetric:

Q_ij = Cov(X_i, X_j) = Cov(X_j, X_i) = Q_ji

and non-negative definite:

x^T Q x = Σ_{i,j=1}^n Q_ij x_i x_j = Σ_{i,j=1}^n Cov(X_i, X_j) x_i x_j = Σ_{i,j=1}^n Cov(x_i X_i, x_j X_j) = Cov(Σ_{i=1}^n x_i X_i, Σ_{j=1}^n x_j X_j) = Var[W] ≥ 0

where W = Σ_{i=1}^n x_i X_i. The spectral theorem states that any symmetric matrix Q can be diagonalized: there exists an orthonormal basis e_1, ..., e_n of R^n in which Q takes the form

Q_e = diag(λ_1, ..., λ_n).

Moreover, the numbers λ_i are eigenvalues of Q, and the vectors e_i are corresponding eigenvectors. Since the covariance matrix Q is also non-negative definite, we have

λ_i ≥ 0,  i = 1, ..., n.

Remark 4. To understand this theorem better, recall a few facts of linear algebra. R^n is a vector space with a scalar product ⟨·, ·⟩, namely a set of elements (called vectors) with certain operations (sum of vectors, multiplication by real numbers, scalar product between vectors) and their properties. We may call intrinsic the objects defined in these terms, as opposed to the objects defined by means of numbers with respect to a given basis. A vector x ∈ R^n is an intrinsic object; but we can write it as a sequence of numbers (x_1, ..., x_n) in infinitely many ways, depending on the basis we choose. Given an orthonormal basis u_1, ..., u_n, the components of a vector x ∈ R^n in this basis are the numbers ⟨x, u_j⟩, j = 1, ..., n. A linear map L in R^n, given the basis u_1, ..., u_n, can be represented by the matrix of components ⟨Lu_i, u_j⟩. We shall write y^T x for ⟨x, y⟩ (or ⟨y, x⟩).

Remark 5. After these general comments, we see that a matrix represents a linear transformation, given a basis. Thus, given the canonical basis of R^n, which we shall denote by u_1, ..., u_n, the matrix Q defines a linear transformation L from R^n to R^n.
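The two properties just proved (symmetry, and x^T Q x = Var[W] ≥ 0) can be verified numerically on an empirical covariance matrix. A minimal Python sketch (the notes' own examples use R; the 3-dimensional sample construction below, a common factor plus individual noise, is purely illustrative):

```python
import random

random.seed(0)

# Build an empirical covariance matrix Q for a 3-dimensional vector whose
# components share a common factor (hypothetical construction, for
# illustration only), then verify symmetry and x^T Q x = Var[W] >= 0.
N = 20000
samples = []
for _ in range(N):
    f = random.gauss(0, 1)
    samples.append((f + random.gauss(0, 0.5),
                    -f + random.gauss(0, 0.5),
                    0.5 * f + random.gauss(0, 0.5)))

def cov(a, b):
    ma, mb = sum(a) / N, sum(b) / N
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / N

cols = list(zip(*samples))
Q = [[cov(cols[i], cols[j]) for j in range(3)] for i in range(3)]

# Symmetry: Q_ij = Q_ji.
print(all(abs(Q[i][j] - Q[j][i]) < 1e-12 for i in range(3) for j in range(3)))

# Non-negativity: x^T Q x equals the empirical variance of W = sum_i x_i X_i.
x = (1.0, 2.0, -1.5)
quad = sum(x[i] * Q[i][j] * x[j] for i in range(3) for j in range(3))
w = [sum(xi * si for xi, si in zip(x, s)) for s in samples]
print(abs(quad - cov(w, w)) < 1e-6, quad >= 0)  # prints True True
```

The identity quad = Var[W] holds exactly (up to rounding) because the empirical covariance is bilinear, exactly as in the algebraic computation above.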
The spectral theorem states that there is a new orthonormal basis e_1, ..., e_n of R^n such that, if Q_e represents the linear transformation L in this new basis, then Q_e is diagonal.

Remark 6. Let us recall more facts about linear algebra. Start with an orthonormal basis u_1, ..., u_n, which we call the canonical or original basis. Let e_1, ..., e_n be another orthonormal basis. The vector u_1, in

4. PRELIMINARIES OF PROBABILITY the canonical basis, has components u = B @ and so on for the other vectors. Each vector e j has certain components. Denote by U the matrix such that its rst column has the same components as e (those of the canonical basis), and so on for the other columns. We could write U = (e ; :::; e n ). Also, U ij = e T j u i. Then U B @ ::: ::: C A C A = e and so on, namely U represents the linear map which maps the canonical (original) basis of R n into e ; :::; e n. This is an orthogonal transformation: U = U T : Indeed, U maps e ; :::; e n into the canonical basis (by the above property of U), and U T same: e T U T e e = B e T @ e C ::: A = B C @ ::: A e T n e and so on. does the Remark 7. Let us now go back to the covariance matrix Q and the matrix Q e given by the spectral theorem: Q e is a diagonal matrix which represents the same linear transformation L in a new basis e ; :::; e n. Assume we do not know anything else, except they describe the same map L and Q e is diagonal, namely of the form Q e = @ ::: A : n Let us deduce a number of facts: i) Q e = UQU T ii) the diagonal elements j are eigenvalues of L, with eigenvectors e j iii) j, j = ; :::; n. To prove (i), recall from above that (Q e ) ij = e T j Le i and Q ij = u T j Lu i :

Moreover, U_ij = e_i^T u_j, hence e_j = Σ_k U_jk u_k, and thus

(Q_e)_ij = e_j^T L e_i = Σ_{k,k'} U_jk U_ik' u_k^T L u_k' = Σ_{k,k'} U_ik' Q_{k'k} U_jk = (U Q U^T)_ij.

To prove (ii), let us write the vector L e_1 in the basis e_1, ..., e_n: in this basis e_1 has coordinates (1, 0, ..., 0)^T and the map L is represented by Q_e, hence L e_1 is given by

Q_e (1, 0, ..., 0)^T = (λ_1, 0, ..., 0)^T = λ_1 (1, 0, ..., 0)^T,

which is λ_1 e_1 in the basis e_1, ..., e_n. We have checked that L e_1 = λ_1 e_1, namely that λ_1 is an eigenvalue and e_1 a corresponding eigenvector. The proof for λ_2, etc., is the same. To prove (iii), just note that, denoting by w_j the j-th coordinate vector (which represents e_j in the new basis),

w_j^T Q_e w_j = λ_j.

But w_j^T Q_e w_j = w_j^T U Q U^T w_j = v^T Q v, where v = U^T w_j = e_j, and v^T Q v ≥ 0 because Q is non-negative definite. Hence λ_j ≥ 0.

3. Gaussian vectors

Recall that a Gaussian, or Normal, r.v. N(µ, σ²) is a r.v. with probability density

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

We have shown that µ is the mean value and σ² the variance. The standard Normal is the case µ = 0, σ² = 1. If Z is a standard Normal r.v., then µ + σZ is N(µ, σ²). We may give the definition of Gaussian vector in two ways, generalizing either the expression of the density or the property that µ + σZ is N(µ, σ²). Let us start with a lemma.

Lemma 1. Given a vector µ = (µ_1, ..., µ_n) and a symmetric positive definite n × n matrix Q (namely v^T Q v > 0 for all v ≠ 0), consider the function

f(x) = (1/√((2π)^n det(Q))) exp(−(x − µ)^T Q^{-1} (x − µ)/2)

where x = (x_1, ..., x_n) ∈ R^n. Notice that the inverse Q^{-1} is well defined for positive definite matrices, (x − µ)^T Q^{-1} (x − µ) is a non-negative quantity, and det(Q) is a positive number. Then:

i) f(x) is a probability density;

ii) if X = (X_1, ..., X_n) is a random vector with such a joint probability density, then µ is the vector of mean values, namely

µ_i = E[X_i],

and Q is the covariance matrix:

Q_ij = Cov(X_i, X_j).

Proof. Step 1. In this step we explain the meaning of the expression f(x). We have recalled above that any symmetric matrix Q can be diagonalized: there exists an orthonormal basis e_1, ..., e_n of R^n in which Q takes the form Q_e = diag(λ_1, ..., λ_n). Moreover, the numbers λ_i are eigenvalues of Q, and the vectors e_i are corresponding eigenvectors. See above for more details. Let U be the matrix introduced there, such that U^{-1} = U^T, and recall the relation Q_e = U Q U^T. Since v^T Q v > 0 for all v ≠ 0, we deduce

v^T Q_e v = (U^T v)^T Q (U^T v) > 0 for all v ≠ 0

(since U^T v ≠ 0). Taking for v the i-th coordinate vector, we get λ_i > 0. Therefore the matrix Q_e is invertible, with inverse

Q_e^{-1} = diag(λ_1^{-1}, ..., λ_n^{-1}).

It follows that Q, being equal to U^T Q_e U (this relation comes from Q_e = U Q U^T), is also invertible, with inverse Q^{-1} = U^T Q_e^{-1} U. Easily one gets (x − µ)^T Q^{-1} (x − µ) > 0 for x ≠ µ. Moreover,

det(Q) = det(U^T) det(Q_e) det(U) = det(Q_e) = λ_1 ⋯ λ_n > 0

where det(U^T) det(U) = 1 comes from

1 = det(I) = det(U^T U) = det(U^T) det(U) = det(U)²

so that det(U) = ±1 (to be used also in Exercise 3). Therefore det(Q) > 0. The formula for f(x) is meaningful and defines a positive function.

Step 2. Let us prove that f(x) is a density. By the theorem of change of variables in multidimensional integrals, with the change of variable x = U^T y,

∫_{R^n} f(x) dx = ∫_{R^n} f(U^T y) dy

because |det U^T| = 1 (and the Jacobian of a linear transformation is the linear map itself). Now, since U Q^{-1} U^T = Q_e^{-1}, the function f(U^T y) is equal to

f_e(y) = (1/√((2π)^n det(Q_e))) exp(−(y − µ_e)^T Q_e^{-1} (y − µ_e)/2)

where µ_e = Uµ. Since

(y − µ_e)^T Q_e^{-1} (y − µ_e) = Σ_{i=1}^n (y_i − (µ_e)_i)²/λ_i

and det(Q_e) = λ_1 ⋯ λ_n, we get

f_e(y) = ∏_{i=1}^n (1/√(2πλ_i)) exp(−(y_i − (µ_e)_i)²/(2λ_i)).

Namely, f_e(y) is the product of n Gaussian densities N((µ_e)_i, λ_i). We know from the theory of joint probability densities that the product of densities is the joint density of a vector with independent components. Hence f_e(y) is a probability density, and therefore ∫_{R^n} f_e(y) dy = 1. This proves ∫_{R^n} f(x) dx = 1, so that f is a probability density.

Step 3. Let X = (X_1, ..., X_n) be a random vector with joint probability density f, written in the original basis. Let Y = UX. Then (Exercise 3) Y has density f_Y(y) given by f_Y(y) = f(U^T y). Thus

f_Y(y) = f_e(y) = ∏_{i=1}^n (1/√(2πλ_i)) exp(−(y_i − (µ_e)_i)²/(2λ_i)).

Thus (Y_1, ..., Y_n) are independent N((µ_e)_i, λ_i) r.v.'s, and therefore

E[Y_i] = (µ_e)_i,  Cov(Y_i, Y_j) = δ_ij λ_i.

From Exercises 4 and 5 we deduce that X = U^T Y has mean and covariance

µ_X = U^T µ_Y,  Q_X = U^T Q_Y U.

Since µ_Y = µ_e and µ_e = Uµ, we readily deduce µ_X = U^T Uµ = µ. Since Q_Y = Q_e and Q = U^T Q_e U, we get Q_X = Q. The proof is complete.

Definition 1. Given a vector µ = (µ_1, ..., µ_n) and a symmetric positive definite n × n matrix Q, we call Gaussian vector of mean µ and covariance Q a random vector X = (X_1, ..., X_n) having joint probability density function

f(x) = (1/√((2π)^n det(Q))) exp(−(x − µ)^T Q^{-1} (x − µ)/2)

where x = (x_1, ..., x_n) ∈ R^n. We write X ~ N(µ, Q).
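Claim i) of the Lemma can also be illustrated numerically: for a concrete positive definite Q, the density integrates to 1. A Python sketch (the notes' examples use R; the particular µ and Q below are arbitrary illustrative values, and the box and step of the quadrature are chosen so that the truncation and discretization errors are negligible):

```python
import math

# An arbitrary (hypothetical) positive definite 2x2 example.
mu = (0.0, 0.0)
Q = [[2.0, 0.8],
     [0.8, 1.0]]
detQ = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]            # = 1.36 > 0
Qinv = [[ Q[1][1] / detQ, -Q[0][1] / detQ],
        [-Q[1][0] / detQ,  Q[0][0] / detQ]]

def f(x1, x2):
    """The N(mu, Q) density of the Lemma, written out for n = 2."""
    d0, d1 = x1 - mu[0], x2 - mu[1]
    quad = Qinv[0][0] * d0 * d0 + 2 * Qinv[0][1] * d0 * d1 + Qinv[1][1] * d1 * d1
    return math.exp(-quad / 2) / math.sqrt((2 * math.pi) ** 2 * detQ)

# Midpoint rule on [-8, 8]^2; the Gaussian mass outside the box is negligible.
h, L = 0.05, 8.0
n = int(2 * L / h)
total = sum(f(-L + (i + 0.5) * h, -L + (j + 0.5) * h)
            for i in range(n) for j in range(n)) * h * h
print(abs(total - 1.0) < 1e-3)  # prints True
```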

The only drawback of this definition is the restriction to strictly positive definite matrices Q. It is sometimes useful to have the notion of Gaussian vector also in the case when Q is only non-negative definite (sometimes called the degenerate case). For instance, we shall see that any linear transformation of a Gaussian vector is a Gaussian vector, but in order to state this theorem in full generality we need to consider also the degenerate case. In order to give a more general definition, let us take the idea recalled above for the 1-dimensional case: affine transformations of Gaussian r.v.'s are Gaussian.

Definition 2. i) The standard d-dimensional Gaussian vector is the random vector Z = (Z_1, ..., Z_d) with joint probability density

f(z_1, ..., z_d) = ∏_{i=1}^d p(z_i), where p(z) = (1/√(2π)) e^{−z²/2}.

ii) All other Gaussian vectors X = (X_1, ..., X_n) (in any dimension n) are obtained from standard ones by affine transformations: X = AZ + b, where A is a matrix and b is a vector. If X has dimension n, we require A to be n × d and b to have dimension n (but n can be different from d).

[Figure: the bell-shaped graph of the density of a standard 2-dimensional Gaussian vector.]

The graphs of the densities of the other Gaussian vectors can be guessed by linear deformations of the base plane xy (deformations defined by A) and a shift (by b). For instance, if A is a matrix which enlarges the x axis by some factor, we get the graph

3. GAUSSIAN VECTORS 9.5 z..5 4. 4 4 4 y x First, let us compute the mean and covariance matrix of a vector of the form = AZ + b, with Z of standard type. From exercises 4 and 5 we readily have: by Proposition 3. Mean and covariance Q matrix of a vector of the previous form are given = b Q = AA T : When two di erent de nitions are given for the same object, one has to prove their equivalence. If Q is positive de nite, the two de nition aim to describe the same object, but for Q non-negative de nite but not strictly positive de nite, we have only the last de nition, so we do not have to check any compatibility. Proposition 4. If Q is positive de nite, then de nitions and are equivalent. More precisely, if = ( ; :::; n ) is a Gaussian random vector with mean and covariance Q in the sense of de nition, then there exists a standard Gaussian random vector Z = (Z ; :::; Z n ) and a n n matrix A such that = AZ + : One can take A = p Q, as described in the proof. Vice versa, if = ( ; :::; n ) is a Gaussian random vector in the sense of de nition, of the form = AZ + b, then is Gaussian in the sense of de nition, with mean and covariance Q given by the previous proposition. Proof. Let us prove the rst claim. Let us de ne p Q = U T p Q e U where p Q e is simply de ned as We have p Qe = @ p ::: A p : n p Q T = U T p Qe T U = U T p Q e U = p Q

and

(√Q)² = U^T √Q_e U U^T √Q_e U = U^T √Q_e √Q_e U = U^T Q_e U = Q

because √Q_e √Q_e = Q_e. Set

Z = (√Q)^{-1} (X − µ)

where we notice that √Q is invertible, by its definition and the strict positivity of the λ_i. Then Z is Gaussian. Indeed, from the formula for the transformation of densities,

f_Z(z) = [f(x) / |det Dg(x)|] evaluated at x = g^{-1}(z)

where g(x) = (√Q)^{-1}(x − µ); hence

det Dg(x) = det((√Q)^{-1}) = 1/√(λ_1 ⋯ λ_n) = 1/√det(Q);

therefore

f_Z(z) = √det(Q) f(√Q z + µ) = √det(Q) (1/√((2π)^n det(Q))) exp(−(√Q z)^T Q^{-1} (√Q z)/2) = (2π)^{−n/2} exp(−z^T z/2)

which is the density of a standard Gaussian vector. From the definition of Z we get X = √Q Z + µ, so the first claim is proved. The proof of the second claim is a particular case of the next exercise, which we leave to the reader.

Exercise 6. Let X = (X_1, ..., X_n) be a Gaussian random vector, B an m × n matrix, c a vector of R^m. Then Y = BX + c is a Gaussian random vector of dimension m. The relations between the means and the covariances are

µ_Y = Bµ_X + c,  Q_Y = B Q_X B^T.

Remark 8. We see from the exercise that we may start with a non-degenerate vector X and get a degenerate one Y, if B is not a bijection. This always happens when m > n.

Remark 9. The law of a Gaussian vector is determined by the mean vector and the covariance matrix. This fundamental fact will be used below when we study stochastic processes.

Remark 10. Some of the previous results are very useful if we want to generate random vectors according to a prescribed Gaussian law. Assume we have prescribed a mean µ and covariance Q, n-dimensional, and want to generate a random sample (x_1, ..., x_n) from such an N(µ, Q). Then we may generate n independent samples z_1, ..., z_n from the standard one-dimensional Gaussian law and compute

√Q z + µ

3. GAUSSIAN VECTORS where z = (z ; :::; z n ). In order to have the entries of the matrix p Q, if the software does not provide them (certain software do it), we may use the formula p Q = U T p Q e U. The matrix p Q e is obvious. In order to get the matrix U recall that its columns are the vectors e ; :::; e n written in the original basis. And such vectors are an orthonormal basis of eigenvectors of Q. Thus one has to use at least a software that makes the spectral decomposition of a matrix, to get e ; :::; e n.

CHAPTER 2

Stochastic processes. Generalities

1. Discrete time stochastic process

We call discrete time stochastic process any sequence X_0, X_1, ..., X_n, ... of random variables defined on a probability space (Ω, F, P), taking values in R. This definition is not so rigid with respect to small details: the same name is given to sequences X_1, X_2, ..., X_n, ..., or to the case when the r.v.'s X_n take values in a space different from R. We shall also describe below the case when the time index takes negative values. The main objects attached to a r.v. are its law, its first and second moments (and possibly higher order moments, the characteristic or generating function, and the distribution function). We do the same for a process (X_n)_{n≥0}: the probability density of the r.v. X_n, when it exists, will be denoted by f_n(x), the mean by µ_n, the standard deviation by σ_n. Often we shall write t in place of n, but nevertheless here t will always be a non-negative integer. So, our first concepts are:

i) the mean function and the variance function:

µ_t = E[X_t],  σ_t² = Var[X_t],  t = 0, 1, 2, ...

In addition, the time-correlation is very important. We introduce three functions:

ii) the autocovariance function C(t, s), t, s = 0, 1, 2, ...,

C(t, s) = E[(X_t − µ_t)(X_s − µ_s)]

and the function

R(t, s) = E[X_t X_s]

(the name will be discussed below). They are symmetric (R(t, s) = R(s, t), and the same for C(t, s)), so it is sufficient to know them for t ≥ s. We have

C(t, s) = R(t, s) − µ_t µ_s,  C(t, t) = σ_t².

In particular, when µ_t ≡ 0 (which is often the case), C(t, s) = R(t, s). Most of the importance will be given to µ_t and R(t, s). In addition, let us introduce:

iii) the autocorrelation function

ρ(t, s) = C(t, s)/(σ_t σ_s).

We have ρ(t, t) = 1 and |ρ(t, s)| ≤ 1. The functions C(t, s), R(t, s), ρ(t, s) are used to detect repetitions in the process, self-similarities under time shift.
For instance, if (X_n)_{n≥0} is roughly periodic of period P, ρ(t + P, t) will be significantly higher than the other values of ρ(t, s) (except ρ(t, t), which is always equal to 1). Also a trend

is a form of repetition, self-similarity under time shift: indeed, when there is a trend, all values of ρ(t, s) are quite high compared to the cases without trend. See the numerical example below. Other objects (when defined) related to the time structure are:

iv) the joint probability density of the vector (X_{t_1}, ..., X_{t_n}),

f_{t_1,...,t_n}(x_1, ..., x_n),  t_1 > t_2 > ... > t_n ≥ 0

and v) the conditional density

f_{t|s}(x|y) = f_{t,s}(x, y) / f_s(y),  t > s.

Now, a remark about the name of R(t, s). In Statistics and Time Series Analysis the name "autocorrelation function" is given to ρ(t, s), as we said above. But in certain disciplines related to signal processing, R(t, s) is called the autocorrelation function. There is no special reason, except the fact that R(t, s) is the fundamental quantity to be understood and investigated, the others (C(t, s) and ρ(t, s)) being simple transformations of R(t, s). Thus R(t, s) is given the name which mostly reminds one of the concept of self-relation between values of the process at different times. In the sequel we shall use both languages, and sometimes we shall call ρ(t, s) the autocorrelation coefficient. The last object we introduce is concerned with two processes simultaneously, (X_n)_{n≥0} and (Y_n)_{n≥0}. It is called:

vi) the cross-correlation function

C_{X,Y}(t, s) = E[(X_t − E[X_t])(Y_s − E[Y_s])].

This function is a measure of the similarity between two processes, shifted in time. For instance, it can be used for the following purpose: one of the two processes, say Y, is known and has a shape of interest for us; the other process, X, is the process under investigation, and we would like to detect portions of X which have a shape similar to Y. Hence we shift X in all possible ways and compute the correlation with Y. When more than one process is investigated, it may be better to write R_X(t, s), C_X(t, s) and so on for the quantities associated to the process X.

1.1. Example 1: white noise.
The white noise with intensity σ² is the process (X_n)_{n≥0} with the following properties:

i) X_0, X_1, ..., X_n, ... are independent r.v.'s;
ii) X_n ~ N(0, σ²).

It is a very elementary process, with a trivial time-structure, but it will be used as a building block for other classes of processes, and as a comparison object to understand the features of more complex cases. The following picture has been obtained with the R software by the commands x <- rnorm(1000); ts.plot(x).
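The trivial time-structure of white noise can be seen empirically: the estimated autocorrelation coefficient is 1 at lag 0 and essentially 0 at every other lag. A Python sketch of the same experiment as the R command above (the sample size is illustrative):

```python
import random

random.seed(3)

# Simulate a white noise of intensity sigma^2 = 1 (as x <- rnorm(n) does in R)
# and estimate the autocorrelation coefficient rho(n) from the sample.
N = 100000
x = [random.gauss(0, 1) for _ in range(N)]
mean = sum(x) / N
var = sum((v - mean) * (v - mean) for v in x) / N

def rho_hat(n):
    """Empirical autocorrelation coefficient at lag n."""
    c = sum((x[t] - mean) * (x[t + n] - mean) for t in range(N - n)) / (N - n)
    return c / var

print(abs(rho_hat(0) - 1.0) < 1e-12)                   # rho(0) = 1
print(abs(rho_hat(1)) < 0.02, abs(rho_hat(5)) < 0.02)  # rho(n) ≈ 0 for n != 0
```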

Let us compute all its relevant quantities (the check is left as an exercise):

µ_t = 0,  σ_t² = σ²,  R(t, s) = C(t, s) = σ² δ(t − s)

where the symbol δ(t − s) denotes 0 for t ≠ s and 1 for t = s,

ρ(t, s) = δ(t − s),

f_{t_1,...,t_n}(x_1, ..., x_n) = ∏_{i=1}^n p(x_i), where p(x) = (1/√(2πσ²)) e^{−x²/(2σ²)},

f_{t|s}(x|y) = p(x).

1.2. Example 2: random walk. Let (W_n)_{n≥0} be a white noise (or, more generally, a process with independent identically distributed W_0, W_1, W_2, ...). Set

X_0 = 0,  X_{n+1} = X_n + W_n,  n ≥ 0.

This is a random walk. White noise has been used as a building block: (X_n)_{n≥0} is the solution of a recursive linear equation, driven by white noise (we shall see more general examples later on). The following picture has been obtained with the R software by the commands x <- rnorm(1000); y <- cumsum(x); ts.plot(y). The random variables X_n are not independent (X_{n+1} obviously depends on X_n). One has

X_{n+1} = Σ_{i=0}^n W_i.

We have the following facts, which we prove by means of the iterative relation (this generalizes better to more complex discrete linear equations). First,

µ_0 = 0,  µ_{n+1} = µ_n,  n ≥ 0

hence µ_n = 0 for every n. By induction, X_n and W_n are independent for every n, hence:

Exercise 7. Denote by σ² the intensity of the white noise; find a relation between σ_{n+1}² and σ_n², and prove that

σ_n = σ√n,  n ≥ 0.

An intuitive interpretation of the result of the exercise is that X_n behaves as σ√n, in a very rough way. As to the time-dependent structure, C(t, s) = R(t, s), and:

Exercise 8. Prove that R(m, n) = σ²n for all m ≥ n (prove it for m = n, m = n + 1, m = n + 2, and extend). Then prove that

ρ(m, n) = √(n/m).

The result of this exercise implies that ρ(m, n) → 0 as m → ∞, for every fixed n. We may interpret this result by saying that the random walk loses memory of the initial position.

2. Stationary processes

A process (X_n)_{n≥0} is called wide-sense stationary if µ_t and R(t + n, t) are independent of t. It follows that also σ_t, C(t + n, t) and ρ(t + n, t) are independent of t. Thus we speak of:

i) the mean µ;
ii) the standard deviation σ;
iii) the covariance function C(n) := C(n, 0);
iv) the autocorrelation function (in the improper sense described above) R(n) := R(n, 0);
v) the autocorrelation coefficient (or also autocorrelation function, in the language of Statistics) ρ(n) := ρ(n, 0).
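Before examining examples, the random-walk formulas of Exercises 7 and 8 (σ_n = σ√n and ρ(m, n) = √(n/m)) can be checked by simulating many independent paths. A Python sketch (the notes' own examples use R; the sample sizes and the times n, m below are illustrative):

```python
import random

random.seed(4)

# Simulate M independent random walks driven by unit-intensity white noise
# and check sigma_n = sigma*sqrt(n) and rho(m, n) = sqrt(n/m) (Exercises 7, 8).
M, n, m = 20000, 25, 100
Xn, Xm = [], []
for _ in range(M):
    s, xn = 0.0, 0.0
    for t in range(1, m + 1):
        s += random.gauss(0, 1)
        if t == n:
            xn = s          # record the walk at time n
    Xn.append(xn)
    Xm.append(s)            # the walk at time m

mn, mm = sum(Xn) / M, sum(Xm) / M
var_n = sum((v - mn) ** 2 for v in Xn) / M
var_m = sum((v - mm) ** 2 for v in Xm) / M
cov_nm = sum((u - mn) * (v - mm) for u, v in zip(Xn, Xm)) / M
rho = cov_nm / (var_n * var_m) ** 0.5

print(abs(var_n - n) / n < 0.05)           # Var[X_n] ≈ n * sigma^2 = 25
print(abs(rho - (n / m) ** 0.5) < 0.03)    # rho(m, n) ≈ sqrt(n/m) = 0.5
```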

A process is called strongly stationary if the law of the generic vector (X_{n_1+t}, ..., X_{n_k+t}) is independent of t. This implies wide-sense stationarity. The converse is not true in general, but it is true for Gaussian processes (see below).

2.1. Example: white noise. We have R(t, s) = σ² δ(t − s), hence

R(n) = σ² δ(n).

2.2. Example: linear equation with damping. Consider the recurrence relation

X_{n+1} = αX_n + W_n,  n ≥ 0

where (W_n)_{n≥0} is a white noise with intensity σ² and α ∈ (−1, 1). The following picture has been obtained with the R software by the commands (α = 0.9, σ = 1):

w <- rnorm(1000)
x <- rnorm(1000)
x[1] = 0
for (i in 1:999) {
  x[i+1] <- 0.9*x[i] + w[i]
}
ts.plot(x)

It has some features similar to white noise, but it is less random, more persistent in the direction in which it moves. Let X_0 be a r.v. independent of the white noise, with zero average and variance σ_e². Let us show that (X_n)_{n≥0} is stationary (in the wide sense) if σ_e² is properly chosen with respect to σ².

First we have
$$\mu_{n+1} = \alpha\,\mu_n, \qquad n \ge 0,$$
hence $\mu_n = 0$ for every $n$. The mean function is constant.

As a preliminary computation, let us impose that the variance function is constant. By induction, $X_n$ and $W_n$ are independent for every $n$, hence
$$\sigma^2_{n+1} = \alpha^2\sigma^2_n + \sigma^2, \qquad n \ge 0.$$
If we want $\sigma^2_{n+1} = \sigma^2_n$ for every $n$, we need
$$\sigma^2_n = \alpha^2\sigma^2_n + \sigma^2,$$
namely
$$\sigma^2_n = \frac{\sigma^2}{1-\alpha^2}, \qquad n \ge 0.$$
In particular, this imposes the relation
$$\tilde\sigma^2 = \frac{\sigma^2}{1-\alpha^2}.$$
It is here that we first see the importance of the condition $|\alpha| < 1$. If we assume this condition on the law of $X_0$, then we find
$$\sigma^2_1 = \alpha^2\tilde\sigma^2 + \sigma^2 = \frac{\alpha^2\sigma^2}{1-\alpha^2} + \sigma^2 = \frac{\sigma^2}{1-\alpha^2} = \tilde\sigma^2,$$
and so on, $\sigma^2_{n+1} = \sigma^2_n$ for every $n$. Thus the variance function is constant; call its common value $\sigma^2_X = \frac{\sigma^2}{1-\alpha^2}$.

Finally, we have to show that $R(t+n, t)$ is independent of $t$. We have
$$R(t+1, t) = E[(\alpha X_t + W_t) X_t] = \alpha\,\sigma^2_X,$$
which is independent of $t$; and so on,
$$R(t+n, t) = E[(\alpha X_{t+n-1} + W_{t+n-1}) X_t] = \alpha R(t+n-1, t) = \dots = \alpha^n R(t,t) = \alpha^n \sigma^2_X,$$
which is independent of $t$. The process is stationary. We have
$$R(n) = \alpha^n \sigma^2_X.$$
It also follows that
$$\rho(n) = \alpha^n.$$
The autocorrelation coefficient (as well as the autocovariance function) decays exponentially in time.
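A short simulation illustrates the exponential decay (a Python sketch; the notes use R, and all names here are ours). Starting $X_0$ from the stationary law, the empirical autocorrelation coefficient at lag $n$ should be close to $\alpha^n$:

```python
import numpy as np

# Simulate X_{n+1} = alpha X_n + W_n with alpha = 0.9, sigma = 1, starting
# from the stationary law N(0, sigma^2 / (1 - alpha^2)); then the empirical
# autocorrelation coefficient at lag n should be close to alpha^n.
rng = np.random.default_rng(1)
alpha, N = 0.9, 500_000
x = np.empty(N)
x[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - alpha**2))
w = rng.normal(0.0, 1.0, N - 1)
for i in range(N - 1):
    x[i + 1] = alpha * x[i] + w[i]

def rho_hat(x, lag):
    # empirical autocorrelation coefficient at the given lag
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

rho1, rho5 = rho_hat(x, 1), rho_hat(x, 5)    # theory: 0.9 and 0.9^5 ~ 0.59
```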

2.3. Processes defined also for negative times. We may extend a little bit the previous definitions and call a discrete-time stochastic process also a two-sided sequence $(X_n)_{n\in\mathbb{Z}}$ of random variables. Such processes are thus defined also for negative times. The idea is that the physical process they represent started in the far past and continues in the future. This notion is particularly natural in the case of stationary processes. The function $R(n)$ (similarly $C(n)$ and $\rho(n)$) is thus defined also for negative $n$:
$$R(n) = E[X_n X_0], \qquad n \in \mathbb{Z}.$$
By stationarity, $R(-n) = R(n)$, because
$$R(-n) = E[X_{-n} X_0] = E[X_{-n+n} X_{0+n}] = E[X_0 X_n] = R(n).$$
Therefore we see that this extension does not contain much new information; however, it is useful, or at least it simplifies some computations.

3. Time series and empirical quantities

A time series is a sequence of real numbers $x_1, \dots, x_n$. Empirical samples also have the same form. The name time series is appropriate when the index $i$ of $x_i$ has the meaning of time. A finite realization of a stochastic process is a time series. Ideally, when we have an experimental time series, we think that there is a stochastic process behind it. Thus we try to apply the theory of stochastic processes.

Recall from elementary statistics that empirical estimates of mean values of a single r.v. $X$ are computed from an empirical sample $x_1, \dots, x_n$ of that r.v.; the higher $n$ is, the better the estimate. A single sample $x_1$ is not sufficient to estimate moments of $X$. Similarly, we may hope to compute empirical estimates of $R(t,s)$ etc. from time series. But here, when the stochastic process has special properties (stationary and ergodic; see below for the concept of ergodicity), one sample is sufficient! By one sample we mean one time series (which is one realization of the process, as the single $x_1$ is one realization of the r.v. $X$). Again, the higher $n$ is, the better the estimate, but here $n$ refers to the length of the time series.
Consider a time series $x_1, \dots, x_n$. In the sequel, $t$ and $n_t$ are such that $t + n_t = n$. Let us define
$$\bar{x}_t = \frac{1}{n_t}\sum_{i=1}^{n_t} x_{i+t}, \qquad \hat\sigma^2_t = \frac{1}{n_t}\sum_{i=1}^{n_t}\big(x_{i+t} - \bar{x}_t\big)^2,$$
$$\hat{R}(t) = \frac{1}{n_t}\sum_{i=1}^{n_t} x_i\, x_{i+t}, \qquad \hat{C}(t) = \frac{1}{n_t}\sum_{i=1}^{n_t}\big(x_i - \bar{x}_0\big)\big(x_{i+t} - \bar{x}_t\big),$$
$$\hat\rho(t) = \frac{\hat{C}(t)}{\hat\sigma_0\,\hat\sigma_t} = \frac{\sum_{i=1}^{n_t}\big(x_i - \bar{x}_0\big)\big(x_{i+t} - \bar{x}_t\big)}{\sqrt{\sum_{i=1}^{n_t}\big(x_i - \bar{x}_0\big)^2}\,\sqrt{\sum_{i=1}^{n_t}\big(x_{i+t} - \bar{x}_t\big)^2}}.$$
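The empirical quantities just defined translate directly into code (a Python sketch; the notes use R, and the function and variable names below are ours):

```python
import numpy as np

# Empirical mean, variance, autocorrelation, autocovariance and
# autocorrelation coefficient at time lag t, with t + n_t = n.
def empirical(x, t):
    x = np.asarray(x, dtype=float)
    n_t = len(x) - t
    head, shifted = x[:n_t], x[t:]       # x_i and x_{i+t}, i = 1, ..., n_t
    xbar_t = shifted.mean()
    sig2_t = ((shifted - xbar_t) ** 2).mean()
    R_hat = (head * shifted).mean()
    C_hat = ((head - head.mean()) * (shifted - xbar_t)).mean()
    rho_hat = C_hat / np.sqrt(((head - head.mean()) ** 2).mean() * sig2_t)
    return xbar_t, sig2_t, R_hat, C_hat, rho_hat

# For the perfectly "linearly similar" series 1, 2, 3, 4, 5 at lag 1,
# the autocorrelation coefficient equals 1.
xbar, sig2, R, C, rho = empirical([1, 2, 3, 4, 5], 1)
```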

These quantities are taken as approximations of $\mu_t$, $\sigma_t$, $R(t,0)$, $C(t,0)$, $\rho(t,0)$ respectively. In the case of stationary processes, they are approximations of $\mu$, $\sigma$, $R(t)$, $C(t)$, $\rho(t)$. In the section on ergodic theorems we shall see rigorous relations between these empirical and theoretical functions.

The empirical correlation coefficient
$$\hat\rho_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
between two sequences $x_1, \dots, x_n$ and $y_1, \dots, y_n$ is a measure of their linear similarity. If there are coefficients $a$ and $b$ such that the residuals $\varepsilon_i = y_i - (a x_i + b)$ are small, then $|\hat\rho_{X,Y}|$ is close to 1; precisely, $\hat\rho_{X,Y}$ is close to 1 if $a > 0$, close to $-1$ if $a < 0$. A value of $\hat\rho_{X,Y}$ close to 0 means that no such linear relation is really good (in the sense of small residuals). Precisely, smallness of residuals must be understood compared to the empirical variance $\hat\sigma^2_Y$ of $y_1, \dots, y_n$: one can prove that
$$\hat\rho^2_{X,Y} = 1 - \frac{\hat\sigma^2_\varepsilon}{\hat\sigma^2_Y}$$
(the so-called explained variance: the proportion of variance which has been explained by the linear model).

After these remarks, the intuitive meaning of $\hat{R}(t)$, $\hat{C}(t)$ and $\hat\rho(t)$ should be clear: they measure the linear similarity between the time series and its $t$-translation. This is useful to detect repetitions, periodicity, trend.

Example. Consider the following time series, taken from the EUROSTAT database. It collects export data concerning motor vehicle accessories, from January 1995 to December 2008. [Figure: the time series and its empirical autocorrelation function $\hat\rho(t)$.]

We see high values of $\hat\rho(t)$ (its values are always smaller than 1 in absolute value) for all time lags $t$. The reason is the trend of the original time series (highly non-stationary).

Example. If we consider only the last few years of the same time series, precisely January 2005 - December 2008, the data are much more stationary; the trend is less strong. The autocorrelation function $\hat\rho(t)$ now shows a moderate annual periodicity.

4. Gaussian processes

If the generic vector $(X_{t_1}, \dots, X_{t_n})$ is jointly Gaussian, we say that the process is Gaussian. The law of a Gaussian vector is determined by the mean vector and the covariance matrix. Hence the laws of the marginals of a Gaussian process are determined by the mean function $\mu_t$ and the autocorrelation function $R(t,s)$.

Proposition 5. For Gaussian processes, stationarity in the wide and strong sense are equivalent.

Proof. Given a Gaussian process $(X_n)_{n\in\mathbb{N}}$, the generic vector $(X_{t_1+s}, \dots, X_{t_n+s})$ is Gaussian, hence with law determined by the mean vector of components
$$E[X_{t_i+s}] = \mu_{t_i+s}$$
and the covariance matrix of components
$$\mathrm{Cov}\big(X_{t_i+s}, X_{t_j+s}\big) = R(t_i+s,\, t_j+s) - \mu_{t_i+s}\,\mu_{t_j+s}.$$
If the process is stationary in the wide sense, then $\mu_{t_i+s} = \mu$ and
$$R(t_i+s,\, t_j+s) - \mu_{t_i+s}\,\mu_{t_j+s} = R(t_i - t_j) - \mu^2$$
do not depend on $s$. Then the law of $(X_{t_1+s}, \dots, X_{t_n+s})$ does not depend on $s$. This means that the process is stationary in the strict sense. The converse is a general fact. The proof is complete.

Most of the models in these notes are obtained by linear transformations of white noise. White noise is a Gaussian process, and linear transformations preserve Gaussianity. Hence the resulting processes are

Gaussian. Since we deal very often with processes that are stationary in the wide sense, being Gaussian they are also strictly stationary.

5. Discrete time Fourier transform

Given a sequence $(x_n)_{n\in\mathbb{Z}}$ of real or complex numbers such that $\sum_{n\in\mathbb{Z}} |x_n|^2 < \infty$, we denote by $\hat{x}(\omega)$, or by $\mathcal{F}[x](\omega)$, the discrete time Fourier transform (DTFT) defined as
$$\hat{x}(\omega) = \mathcal{F}[x](\omega) = \frac{1}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-i\omega n}\, x_n, \qquad \omega \in [-\pi, \pi].$$
The function can be considered for all $\omega \in \mathbb{R}$, but it is $2\pi$-periodic. Sometimes the factor $\frac{1}{\sqrt{2\pi}}$ is not included in the definition; sometimes it is preferable to use the variant
$$\hat{x}(f) = \frac{1}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-2\pi i f n}\, x_n, \qquad f \in \Big[-\frac{1}{2}, \frac{1}{2}\Big].$$
We make the choice above, independently of the fact that in certain applications it is customary or convenient to make other choices. The factor $\frac{1}{\sqrt{2\pi}}$ is included for symmetry with the inverse transform and the Plancherel formula (without $\frac{1}{\sqrt{2\pi}}$, a factor $2\pi$ appears in one of them).

The $L^2$-theory of Fourier series guarantees that the series $\sum_{n\in\mathbb{Z}} e^{-i\omega n} x_n$ converges in mean square with respect to $\omega$; namely, there exists a square integrable function $\hat{x}(\omega)$ such that
$$\lim_{N\to\infty}\int_{-\pi}^{\pi}\Big|\frac{1}{\sqrt{2\pi}}\sum_{|n|\le N} e^{-i\omega n} x_n - \hat{x}(\omega)\Big|^2 d\omega = 0.$$
The sequence $x_n$ can be reconstructed from its Fourier transform by means of the inverse Fourier transform
$$x_n = \frac{1}{\sqrt{2\pi}}\int_{-\pi}^{\pi} e^{i\omega n}\,\hat{x}(\omega)\, d\omega.$$
Among other properties, let us mention the Plancherel formula
$$\sum_{n\in\mathbb{Z}} |x_n|^2 = \int_{-\pi}^{\pi} |\hat{x}(\omega)|^2\, d\omega,$$
and the fact that under the Fourier transform the convolution corresponds to the product (with our normalization, up to the factor $\sqrt{2\pi}$):
$$\mathcal{F}\Big[\sum_{n\in\mathbb{Z}} f(\cdot - n)\, g(n)\Big](\omega) = \sqrt{2\pi}\,\hat{f}(\omega)\,\hat{g}(\omega).$$
When
$$\sum_{n\in\mathbb{Z}} |x_n| < \infty,$$
the series $\sum_{n\in\mathbb{Z}} e^{-i\omega n} x_n$ is absolutely convergent, uniformly in $\omega \in [-\pi,\pi]$, simply because
$$\sup_{\omega\in[-\pi,\pi]}\sum_{n\in\mathbb{Z}}\big|e^{-i\omega n} x_n\big| = \sum_{n\in\mathbb{Z}} |x_n| < \infty.$$
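Before continuing, the conventions above can be checked numerically on a finitely supported sequence (a Python sketch; all names are ours). With the $\frac{1}{\sqrt{2\pi}}$ factor, Plancherel reads $\sum_n |x_n|^2 = \int_{-\pi}^{\pi}|\hat x(\omega)|^2 d\omega$:

```python
import numpy as np

# DTFT of a finite sequence with the 1/sqrt(2*pi) convention, and a check of
# the Plancherel formula by a Riemann sum over a uniform grid of [-pi, pi).
def dtft(x, n0, omega):
    """DTFT of the finite sequence x[k] sitting at integers n0 + k."""
    n = n0 + np.arange(len(x))
    return np.exp(-1j * np.outer(omega, n)) @ x / np.sqrt(2 * np.pi)

x = np.array([1.0, -2.0, 3.0, 0.5])
M = 4096
omega = -np.pi + 2 * np.pi * np.arange(M) / M    # uniform grid on [-pi, pi)
xhat = dtft(x, n0=-1, omega=omega)

energy_time = np.sum(np.abs(x) ** 2)                        # 14.25
energy_freq = np.sum(np.abs(xhat) ** 2) * (2 * np.pi / M)   # Riemann sum
```

The Riemann sum is essentially exact here, because $|\hat x(\omega)|^2$ is a trigonometric polynomial of low degree.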

In this case, we may also say that $\hat{x}(\omega)$ is a bounded continuous function, not only square integrable. Notice that the assumption $\sum_{n\in\mathbb{Z}}|x_n| < \infty$ implies $\sum_{n\in\mathbb{Z}}|x_n|^2 < \infty$, because
$$\sum_{n\in\mathbb{Z}}|x_n|^2 \le \sup_{n\in\mathbb{Z}}|x_n|\sum_{n\in\mathbb{Z}}|x_n|,$$
and $\sup_{n\in\mathbb{Z}}|x_n|$ is bounded when $\sum_{n\in\mathbb{Z}}|x_n|$ converges.

One can define the DTFT also for sequences which do not satisfy the assumption $\sum_{n\in\mathbb{Z}}|x_n|^2 < \infty$, in special cases. Consider for instance the sequence
$$x_n = a\sin(\omega_0 n).$$
Compute the truncation
$$\hat{x}_N(\omega) = \frac{1}{\sqrt{2\pi}}\sum_{|n|\le N} e^{-i\omega n}\, a\sin(\omega_0 n).$$
Recall that
$$\sin t = \frac{e^{it} - e^{-it}}{2i}.$$
Hence
$$\hat{x}_N(\omega) = \frac{a}{2i\sqrt{2\pi}}\sum_{|n|\le N} e^{-i(\omega-\omega_0)n} - \frac{a}{2i\sqrt{2\pi}}\sum_{|n|\le N} e^{-i(\omega+\omega_0)n}.$$

The next lemma makes use of the concept of generalized function, or distribution, which is outside the scope of these notes. We still give the result, to be understood in some intuitive sense. We use the generalized function $\delta(t)$, called the Dirac delta, which is characterized by the property
$$(5.1)\qquad \int \delta(t - t_0)\, f(t)\, dt = f(t_0)$$
for all continuous compactly supported functions $f$. No usual function has this property. A way to get intuition is the following one. Consider a function $\delta_n(t)$ which is equal to zero for $t$ outside $\big(-\frac{1}{2n}, \frac{1}{2n}\big)$, an interval of length $\frac{1}{n}$ around the origin, and equal to $n$ in $\big(-\frac{1}{2n}, \frac{1}{2n}\big)$. Hence $\delta_n(t - t_0)$ is equal to zero for $t$ outside $\big(t_0 - \frac{1}{2n}, t_0 + \frac{1}{2n}\big)$, equal to $n$ in $\big(t_0 - \frac{1}{2n}, t_0 + \frac{1}{2n}\big)$. We have
$$\int \delta_n(t)\, dt = 1.$$
Now,
$$\int \delta_n(t - t_0)\, f(t)\, dt = n\int_{t_0 - \frac{1}{2n}}^{t_0 + \frac{1}{2n}} f(t)\, dt,$$
which is the average of $f$ around $t_0$. As $n \to \infty$, this average converges to $f(t_0)$ when $f$ is continuous. Namely, we have
$$\lim_{n\to\infty}\int \delta_n(t - t_0)\, f(t)\, dt = f(t_0),$$
which is the analogue of identity (5.1), but expressed by means of traditional concepts. In a sense, thus, the generalized function $\delta(t)$ is the limit of the traditional functions $\delta_n(t)$. But we see that $\delta_n(t)$ converges to zero for all $t \ne 0$, and to $\infty$ for $t = 0$. So, in a sense, $\delta(t)$ is equal to zero for $t \ne 0$, and to $\infty$ for $t = 0$; but this is very poor information, because it does not allow us to deduce identity (5.1)
(the way $\delta_n(t)$ goes to infinity is essential, not only the fact that $\delta(t)$ is $\infty$ for $t = 0$).

Lemma. Denote by $\delta(t)$ the generalized function such that
$$\int \delta(t - t_0)\, f(t)\, dt = f(t_0)$$
for all continuous compactly supported functions $f$ (it is called the Dirac delta distribution). Then
$$\lim_{N\to\infty}\frac{1}{2\pi}\sum_{|n|\le N} e^{-itn} = \delta(t).$$

From this lemma it follows that
$$\lim_{N\to\infty}\frac{1}{\sqrt{2\pi}}\sum_{|n|\le N} e^{-i\omega n}\, a\sin(\omega_0 n) = \frac{a\sqrt{2\pi}}{2i}\big(\delta(\omega - \omega_0) - \delta(\omega + \omega_0)\big).$$
In other words:

Corollary. The sequence $x_n = a\sin(\omega_0 n)$ has a generalized DTFT
$$\hat{x}(\omega) = \lim_{N\to\infty}\hat{x}_N(\omega) = \frac{a\sqrt{2\pi}}{2i}\big(\delta(\omega - \omega_0) - \delta(\omega + \omega_0)\big).$$

This is only one example of the possibility of extending the definition and meaning of the DTFT outside the assumption $\sum_{n\in\mathbb{Z}}|x_n|^2 < \infty$. It is also very interesting for the interpretation of the concept of DTFT. If the signal $x_n$ has a periodic component (notice that the DTFT is linear) with angular frequency $\omega_0$, then its DTFT has two symmetric peaks (Dirac delta components) at $\pm\omega_0$. This way, the DTFT reveals the periodic components of the signal.

Exercise 9. Prove that the sequence $x_n = a\cos(\omega_0 n)$ has a generalized DTFT
$$\hat{x}(\omega) = \lim_{N\to\infty}\hat{x}_N(\omega) = \frac{a\sqrt{2\pi}}{2}\big(\delta(\omega - \omega_0) + \delta(\omega + \omega_0)\big).$$

6. Power spectral density

Given a stationary process $(X_n)_{n\in\mathbb{Z}}$ with autocorrelation function $R(n) = E[X_n X_0]$, $n \in \mathbb{Z}$, we call power spectral density (PSD) the function
$$S(\omega) = \frac{1}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-i\omega n}\, R(n), \qquad \omega \in [-\pi, \pi].$$
Alternatively, one can use the expression
$$S(f) = \frac{1}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-2\pi i f n}\, R(n), \qquad f \in \Big[-\frac{1}{2}, \frac{1}{2}\Big],$$
which produces easier visualizations, because we catch more easily the fractions of the interval $\big[-\frac{1}{2}, \frac{1}{2}\big]$.

Remark 1. In principle, for this series to be defined, one needs $\sum_{n\in\mathbb{Z}}|R(n)| < \infty$, or at least $\sum_{n\in\mathbb{Z}}|R(n)|^2 < \infty$. In practice, on one side the convergence may happen also in unexpected cases, due to cancellations; on the other side it may be acceptable to use a finite-time variant, something like $\sum_{|n|\le N} e^{-i\omega n} R(n)$, for practical purposes or from the computational viewpoint.

A priori, one may think that $S(f)$ may not be real valued. However, the function $R(n)$ is nonnegative definite (this means $\sum_{i,j=1}^{n} R(t_i - t_j)\, a_i a_j \ge 0$ for all $t_1, \dots, t_n$ and $a_1, \dots, a_n$), and a theorem states that the Fourier transform of a nonnegative definite function is a nonnegative function. Thus, in the end, it turns out that $S(f)$ is real and also nonnegative. We do not give the details of this fact here, because it will be a consequence of the fundamental theorem below.

6.1. Example: white noise. We have $R(n) = \sigma^2\delta(n)$, hence
$$S(\omega) = \frac{\sigma^2}{\sqrt{2\pi}}, \qquad \omega \in \mathbb{R}.$$
The spectral density is constant. This is the origin of the name white noise.

6.2. Example: perturbed periodic time series. This example is numeric only. Produce with R software the following time series:

t <- 1:1000
y <- sin(t/3) + 0.3*rnorm(1000)
ts.plot(y)

The empirical autocorrelation function is obtained by acf(y), and the power spectral density, suitably smoothed, by spectrum(y, spans = c(3, 3)).

6.3. Pink, brown, blue, violet noise. In certain applications one meets PSDs of special types, which have been given names similarly to white noise. Recall that white noise has a constant PSD. Pink noise has PSD of the form
$$S(f) \sim \frac{1}{f}.$$
Brown noise:
$$S(f) \sim \frac{1}{f^2}.$$
Blue noise:
$$S(f) \sim f.$$
Violet noise:
$$S(f) \sim f^2.$$

7. Fundamental theorem on PSD

The following theorem is often stated without assumptions in the applied literature. One of the reasons is that it can be proved under various levels of generality, with different meanings of the limit operation (it is a limit of functions). We shall give a rigorous statement under a very precise assumption on the autocorrelation function $R(n)$; the convergence we prove is rather strong. The assumption is a little bit strange, but satisfied in all our examples. The assumption is that there exists a sequence $(\varepsilon_n)_{n\in\mathbb{N}}$ of positive numbers such that
$$(7.1)\qquad \lim_{n\to\infty}\varepsilon_n = 0, \qquad \sum_{n\in\mathbb{N}}\frac{|R(n)|}{\varepsilon_n} < \infty.$$
This is just a little bit more restrictive than the condition $\sum_{n\in\mathbb{N}}|R(n)| < \infty$, which is natural to impose if we want uniform convergence of $\frac{1}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-i\omega n} R(n)$ to $S(\omega)$. Any example of $R(n)$ satisfying $\sum_{n\in\mathbb{N}}|R(n)| < \infty$ that the reader may have in mind presumably satisfies assumption (7.1) in an easy way.
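As a concrete instance before the theorem (a Python sketch; names are ours): for the damped linear equation of Section 2, $R(n) = \alpha^{|n|}\sigma_X^2$, the defining series of the PSD can be summed in closed form,
$$S(\omega) = \frac{\sigma_X^2\,(1-\alpha^2)}{\sqrt{2\pi}\,(1 - 2\alpha\cos\omega + \alpha^2)},$$
and since $R(n)$ decays geometrically, assumption (7.1) holds with, e.g., $\varepsilon_n = \alpha^{n/2}$.

```python
import numpy as np

# PSD of the damped linear equation: compare a long truncated sum of
# e^{-iwn} R(n) / sqrt(2 pi) with the closed form above; the PSD is
# real and positive, as claimed.
alpha = 0.9
sigma2_X = 1.0 / (1 - alpha**2)
omega = np.linspace(-np.pi, np.pi, 201)

n = np.arange(-2000, 2001)
R = sigma2_X * alpha ** np.abs(n)
S_sum = (np.exp(-1j * np.outer(omega, n)) @ R).real / np.sqrt(2 * np.pi)
S_closed = sigma2_X * (1 - alpha**2) / (
    np.sqrt(2 * np.pi) * (1 - 2 * alpha * np.cos(omega) + alpha**2)
)
```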

Theorem (Wiener-Khinchin). If $(X_n)_{n\in\mathbb{Z}}$ is a wide-sense stationary process satisfying assumption (7.1), then
$$S(\omega) = \lim_{N\to\infty}\frac{\sqrt{2\pi}}{2N+1}\, E\big[|\hat{X}_N(\omega)|^2\big].$$
The limit is uniform in $\omega \in [-\pi,\pi]$. Here $X_N$ is the truncated process $X\,1_{[-N,N]}$. In particular, it follows that $S(\omega)$ is real and nonnegative.

Proof. Step 1. Let us prove the following main identity:
$$(7.2)\qquad S(\omega) = \frac{\sqrt{2\pi}}{2N+1}\, E\big[|\hat{X}_N(\omega)|^2\big] + r_N(\omega),$$
where the remainder $r_N$ is given by
$$r_N(\omega) = \mathcal{F}\Big[t \mapsto \frac{1}{2N+1}\sum_{n\in\Gamma(N,t)} E[X(t+n)X(n)]\Big](\omega),$$
with
$$\Gamma(N,t) = [-N,\, -N-t) \cup (N-t,\, N],$$
the set of those $n \in [-N,N]$ for which $t+n \notin [-N,N]$ (one of the two pieces is empty, according to the sign of $t$); its cardinality is
$$|\Gamma(N,t)| = \min\big(|t|,\, 2N+1\big).$$
Since $R(t) = E[X(t+n)X(n)]$ for all $n$, we obviously have, for every $N$,
$$R(t) = \frac{1}{2N+1}\sum_{n=-N}^{N} E[X(t+n)X(n)].$$
Thus
$$(7.3)\qquad S(\omega) = \hat{R}(\omega) = \mathcal{F}\Big[t \mapsto \frac{1}{2N+1}\sum_{n=-N}^{N} E[X(t+n)X(n)]\Big](\omega).$$
Then recall that
$$\mathcal{F}\Big[\sum_{n\in\mathbb{Z}} f(\cdot - n)\, g(n)\Big](\omega) = \sqrt{2\pi}\,\hat{f}(\omega)\,\hat{g}(\omega),$$
hence
$$\mathcal{F}\Big[\sum_{n\in\mathbb{Z}} f(\cdot + n)\, g(n)\Big](\omega) = \mathcal{F}\Big[\sum_{n\in\mathbb{Z}} f(\cdot - n)\, g(-n)\Big](\omega) = \sqrt{2\pi}\,\hat{f}(\omega)\,\mathcal{F}[g(-\cdot)](\omega),$$
and
$$\mathcal{F}[g(-\cdot)](\omega) = \hat{g}(-\omega).$$

Moreover, if the input function $g$ is real, then $\hat{g}(-\omega) = \overline{\hat{g}(\omega)}$, so we get
$$\mathcal{F}\Big[\sum_{n\in\mathbb{Z}} f(\cdot + n)\, g(n)\Big](\omega) = \sqrt{2\pi}\,\hat{f}(\omega)\,\overline{\hat{g}(\omega)}.$$
If $f(n) = g(n) = X(n)\,1_{[-N,N]}(n) = X_N(n)$, then, for $t \ge 0$,
$$\sum_{n\in\mathbb{Z}} f(t+n)\,g(n) = \sum_{n=-N}^{N-t} X(t+n)X(n),$$
and for $t < 0$,
$$\sum_{n\in\mathbb{Z}} f(t+n)\,g(n) = \sum_{n=-N+|t|}^{N} X(t+n)X(n)$$
(empty sums when $|t| > 2N$). In general,
$$\sum_{n\in\mathbb{Z}} f(t+n)\,g(n) = \sum_{n\in[-N,N]\setminus\Gamma(N,t)} X(t+n)X(n).$$
Therefore
$$\mathcal{F}\Big[t \mapsto \sum_{n\in[-N,N]\setminus\Gamma(N,t)} X(t+n)X(n)\Big](\omega) = \sqrt{2\pi}\,\hat{X}_N(\omega)\,\overline{\hat{X}_N(\omega)} = \sqrt{2\pi}\,\big|\hat{X}_N(\omega)\big|^2,$$
and thus, taking expectations,
$$\mathcal{F}\Big[t \mapsto \frac{1}{2N+1}\sum_{n\in[-N,N]\setminus\Gamma(N,t)} E[X(t+n)X(n)]\Big](\omega) = \frac{\sqrt{2\pi}}{2N+1}\, E\big[|\hat{X}_N(\omega)|^2\big].$$
From (7.3), we now get (7.2).

Step 2. The proof is complete if we show that $\lim_{N\to\infty} r_N(\omega) = 0$ uniformly in $\omega \in [-\pi,\pi]$. But
$$\Big|\frac{1}{2N+1}\sum_{n\in\Gamma(N,t)} E[X(t+n)X(n)]\Big| = \frac{|\Gamma(N,t)|}{2N+1}\,|R(t)| \le \frac{\min(|t|,\,2N+1)}{2N+1}\,\varepsilon_{|t|}\cdot\frac{|R(t)|}{\varepsilon_{|t|}}.$$
Given $\delta > 0$, let $t_0$ be such that $\varepsilon_t \le \delta$ for all $t \ge t_0$. It is not restrictive to assume $\varepsilon_t \le 1$ for all $t$. Then take $N_0 \ge t_0/\delta$. Then, for $N \ge N_0$: if $t \le t_0$, then
$$\frac{\min(t,\,2N+1)}{2N+1}\,\varepsilon_t \le \frac{t_0}{2N+1} \le \delta,$$

and if $t \ge t_0$, then
$$\frac{\min(t,\,2N+1)}{2N+1}\,\varepsilon_t \le \varepsilon_t \le \delta.$$
We have proved the following statement: for all $\delta > 0$ there exists $N_0$ such that
$$\frac{\min(|t|,\,2N+1)}{2N+1}\,\varepsilon_{|t|} \le \delta$$
for all $N \ge N_0$, uniformly in $t$. Then also
$$\Big|\frac{1}{2N+1}\sum_{n\in\Gamma(N,t)} E[X(t+n)X(n)]\Big| \le \delta\,\frac{|R(t)|}{\varepsilon_{|t|}}$$
for all $N \ge N_0$, uniformly in $t$. Therefore
$$|r_N(\omega)| = \frac{1}{\sqrt{2\pi}}\Big|\sum_{t\in\mathbb{Z}} e^{-i\omega t}\Big[\frac{1}{2N+1}\sum_{n\in\Gamma(N,t)} E[X(t+n)X(n)]\Big]\Big| \le \frac{\delta}{\sqrt{2\pi}}\sum_{t\in\mathbb{Z}}\frac{|R(t)|}{\varepsilon_{|t|}} = \frac{C\,\delta}{\sqrt{2\pi}},$$
where $C = \sum_{t\in\mathbb{Z}} |R(t)|/\varepsilon_{|t|} < \infty$. This is the definition of $\lim_{N\to\infty} r_N(\omega) = 0$ uniformly in $\omega \in [-\pi,\pi]$. The proof is complete.

This theorem gives us the interpretation of the PSD. The Fourier transform $\hat{X}_N(\omega)$ identifies the frequency structure of the signal. The square $|\hat{X}_N(\omega)|^2$ drops the information about the phase and keeps the information about the amplitude, but in the sense of energy (a square); it gives us the energy spectrum, in a sense. So the PSD is the average amplitude, in the energy sense, of the oscillatory component at frequency $f = \frac{\omega}{2\pi}$. Thus the PSD is a very useful tool if you want to identify oscillatory signals in your time series data and want to know their amplitude. By the PSD, one can get a "feel" for the data at an early stage of time series analysis: the PSD tells us in which frequency ranges the variations are strong.

Remark 2. A priori, one could think that it would be more natural to compute the Fourier transform $\hat{X}(\omega) = \sum_{n\in\mathbb{Z}} e^{-i\omega n} X_n$, without a cut-off of size $N$. But the process $(X_n)$ is stationary. Therefore it does not satisfy the assumption $\sum_{n\in\mathbb{Z}} X_n^2 < \infty$, or similar ones which require a decay at infinity. Stationarity is in contradiction with decay at infinity (this can be proved, but we leave it at the obvious intuitive level).

Remark 3. Under additional assumptions (in particular a strong ergodicity one) it is possible to prove that
$$S(\omega) = \lim_{N\to\infty}\frac{\sqrt{2\pi}}{2N+1}\,\big|\hat{X}_N(\omega)\big|^2$$
without expectation. Notice that $\frac{\sqrt{2\pi}}{2N+1}|\hat{X}_N(\omega)|^2$ is a random quantity, but the limit is deterministic.
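The theorem can be illustrated by Monte Carlo for the damped linear equation $X_{n+1} = \alpha X_n + W_n$ (a Python sketch; names are ours). Dropping the $\frac{1}{\sqrt{2\pi}}$ factor from the DTFT, the statement becomes $\frac{1}{2N+1}E\big|\sum_{|n|\le N} e^{-i\omega n}X_n\big|^2 \to \sum_t e^{-i\omega t}R(t) = \sqrt{2\pi}\,S(\omega)$; for $\alpha = 0.9$ this limit is $\sigma_X^2\frac{1+\alpha}{1-\alpha} = 100$ at $\omega = 0$ and $\sigma_X^2\frac{1-\alpha}{1+\alpha} \approx 0.277$ at $\omega = \pi$.

```python
import numpy as np

# Average the periodogram of many independent stationary realizations at
# the two frequencies w = 0 and w = pi, and compare with the limits above.
rng = np.random.default_rng(3)
alpha, M, N = 0.9, 1000, 2000
L = 2 * N + 1
sigma_X = 1.0 / np.sqrt(1 - alpha**2)
X = np.empty((M, L))
X[:, 0] = rng.normal(0.0, sigma_X, M)        # start in the stationary law
for i in range(L - 1):
    X[:, i + 1] = alpha * X[:, i] + rng.normal(0.0, 1.0, M)

signs = (-1.0) ** np.arange(L)               # e^{-i*pi*n}
pgram_0 = X.sum(axis=1) ** 2 / L             # |sum e^{-i*0*n} X_n|^2 / (2N+1)
pgram_pi = (X * signs).sum(axis=1) ** 2 / L
mean_0, mean_pi = pgram_0.mean(), pgram_pi.mean()
```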

8. Signal to noise ratio

Assume the process $(X_n)_{n\in\mathbb{Z}}$ we observe is the superposition of a white noise $(W_n)_{n\in\mathbb{Z}}$ and a signal $(f_n)_{n\in\mathbb{Z}}$, namely a process (maybe deterministic) which contains information and which we would like to detect in spite of the noise corruption. The final problem is noise filtering, namely the reconstruction of a signal $(\tilde{f}_n)_{n\in\mathbb{Z}}$ as close as possible to $(f_n)_{n\in\mathbb{Z}}$ (the meaning of closeness may be different; for instance, we could be interested only in distinguishing between two a priori known signals). Let us make only preliminary comments on the size of the signal inside the noise.

Assume
$$X_n = W_n + f_n,$$
with $W$ and $f$ independent of each other and, for the sake of simplicity, assume $f$ stationary. Then
$$R_X(n) = E[X_n X_0] = E[W_n W_0] + E[W_n f_0] + E[f_n W_0] + E[f_n f_0] = \sigma^2\delta(n) + R_f(n).$$
So
$$\rho_X(n) = \frac{R_X(n)}{R_X(0)} = \frac{\sigma^2\delta(n) + R_f(n)}{\sigma^2 + R_f(0)} = \frac{1}{1+\mathrm{SNR}}\,\delta(n) + \frac{\mathrm{SNR}}{1+\mathrm{SNR}}\,\rho_f(n),$$
where
$$\mathrm{SNR} := \frac{R_f(0)}{\sigma^2} = \frac{\sigma_f^2}{\sigma_W^2}$$
is the so-called signal-to-noise ratio. We see that we appreciate the shape of $\rho_f(n)$ in $\rho_X(n)$ only if SNR is sufficiently large. [One should be more precise. Indeed, theoretically, since $\delta(n)$ is equal to zero for $n \ne 0$, we always see $\frac{\mathrm{SNR}}{1+\mathrm{SNR}}\rho_f(n)$ with infinite precision. The problem is that the measured $\hat\rho_W(n)$ is not $\delta(n)$, but something close to 1 at $n = 0$ and close to zero, but not equal to zero, for $n \ne 0$. However, the closeness to zero of $\hat\rho_W(n)$ is not just measured by SNR: it depends on the number of observed points, the whiteness of the noise, and so on, so we cannot write a simple formula.]

Second,
$$S_X(\omega) = \frac{\sigma^2}{\sqrt{2\pi}} + S_f(\omega),$$
where
$$S_f(\omega) = \frac{1}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-i\omega n}\, R_f(n) = \frac{R_f(0)}{\sqrt{2\pi}}\sum_{n\in\mathbb{Z}} e^{-i\omega n}\,\rho_f(n).$$
Thus again we see
$$\frac{\sqrt{2\pi}}{\sigma^2}\,S_X(\omega) = 1 + \mathrm{SNR}\sum_{n\in\mathbb{Z}} e^{-i\omega n}\,\rho_f(n).$$
The contribution of the signal, $\sum_{n\in\mathbb{Z}} e^{-i\omega n}\rho_f(n)$, is visible only if SNR is not too small. [Here also we could say that we can always reconstruct $\sum_{n\in\mathbb{Z}} e^{-i\omega n}\rho_f(n)$ exactly, just by taking $\frac{\sqrt{2\pi}}{\sigma^2}S_X(\omega) - 1$
; however, this is only theoretical: in practice, the measured spectrum is a moderately flat function, with fluctuations, and usually with a cut-off at large distances, all facts depending again on the size of the sample and the whiteness of the noise.]
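The decomposition of $\rho_X(n)$ can be checked numerically (a Python sketch; the choice of the signal and all names are ours). As the "signal" $f$ we take a stationary process of the damped-linear-equation type, independent of the white noise $W$, so that $R_f(0) = \frac{1}{1-\alpha^2}$ and $\rho_f(1) = \alpha$:

```python
import numpy as np

# Check rho_X(1) = (SNR / (1 + SNR)) * rho_f(1) for X = W + f with
# f a stationary AR-type signal (alpha = 0.8) independent of the noise W.
rng = np.random.default_rng(4)
alpha, N = 0.8, 500_000
f = np.empty(N)
f[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - alpha**2))
innov = rng.normal(0.0, 1.0, N - 1)
for i in range(N - 1):
    f[i + 1] = alpha * f[i] + innov[i]
W = rng.normal(0.0, 1.0, N)          # white noise, sigma_W^2 = 1
X = W + f

snr = 1.0 / (1 - alpha**2)                      # R_f(0) / sigma_W^2 ~ 2.78
rho_X1 = np.corrcoef(X[:-1], X[1:])[0, 1]
predicted = snr / (1 + snr) * alpha             # ~ 0.588
```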

9. An ergodic theorem

There exist several versions of ergodic theorems. The simplest one is the Law of Large Numbers. Let us recall it in its simplest version, with convergence in mean square.

Proposition 6. If $(X_n)_{n\ge 1}$ is a sequence of uncorrelated r.v.'s ($\mathrm{Cov}(X_i, X_j) = 0$ for all $i \ne j$), with finite and equal means $\mu$ and variances $\sigma^2$, then $\frac{1}{n}\sum_{i=1}^{n} X_i$ converges to $\mu$ in mean square:
$$\lim_{n\to\infty} E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big)^2\Big] = 0.$$
It also converges in probability.

Proof.
$$E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big)^2\Big] = E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)\Big)^2\Big] = \frac{1}{n^2}\sum_{i,j=1}^{n} E[(X_i-\mu)(X_j-\mu)]$$
$$= \frac{1}{n^2}\sum_{i,j=1}^{n}\mathrm{Cov}(X_i, X_j) = \frac{1}{n^2}\sum_{i,j=1}^{n}\sigma^2\delta_{ij} = \frac{\sigma^2}{n} \to 0.$$

Recall that the Chebyshev inequality states (in this particular case)
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big| > \varepsilon\Big) \le \frac{1}{\varepsilon^2}\, E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big)^2\Big]$$
for every $\varepsilon > 0$. Hence, from the computation of the previous proof, we deduce
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big| > \varepsilon\Big) \le \frac{\sigma^2}{\varepsilon^2 n}.$$
In itself, this is an interesting estimate of the probability that the sample average $\frac{1}{n}\sum_{i=1}^{n} X_i$ differs from $\mu$ by more than $\varepsilon$. It follows that
$$\lim_{n\to\infty} P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big| > \varepsilon\Big) = 0$$
for every $\varepsilon > 0$. This is the convergence in probability of $\frac{1}{n}\sum_{i=1}^{n} X_i$ to $\mu$.
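Both the mean-square rate $\sigma^2/n$ and the Chebyshev bound can be checked by simulation (a Python sketch; names are ours):

```python
import numpy as np

# For i.i.d. (hence uncorrelated) samples with variance sigma^2, the
# mean-square error of the sample average is sigma^2 / n, and
# P(|mean - mu| > eps) <= sigma^2 / (eps^2 n).
rng = np.random.default_rng(5)
mu, sigma, n, M = 3.0, 2.0, 100, 50_000
samples = rng.normal(mu, sigma, size=(M, n))
mse = np.mean((samples.mean(axis=1) - mu) ** 2)   # theory: sigma^2/n = 0.04

eps = 0.5
freq = np.mean(np.abs(samples.mean(axis=1) - mu) > eps)
bound = sigma**2 / (eps**2 * n)                   # Chebyshev bound: 0.16
```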

Remark 4. Often this theorem is stated only in the particular case when the r.v.'s $X_i$ are independent and identically distributed, with finite second moment. We see that the proof is very easy under much more general assumptions. We have written the proof, very classical, so that the proof of the following lemma is obvious.

Lemma 3. Let $(X_n)_{n\ge 1}$ be a sequence of r.v.'s with finite second moments and equal means $\mu$. Assume that
$$(9.1)\qquad \lim_{n\to\infty}\frac{1}{n^2}\sum_{i,j=1}^{n}\mathrm{Cov}(X_i, X_j) = 0.$$
Then $\frac{1}{n}\sum_{i=1}^{n} X_i$ converges to $\mu$ in mean square and in probability.

The lemma will be useful if we detect interesting sufficient conditions for (9.1). Here is our main ergodic theorem. Usually, by the name ergodic theorem one means a theorem which states that the time averages of a process converge to a deterministic value (the mean of the process, in the stationary case).

Theorem. Assume that $(X_n)_{n\ge 1}$ is a wide-sense stationary process (this ensures in particular that $(X_n)_{n\ge 1}$ is a sequence of r.v.'s with finite second moments and equal means $\mu$). If
$$\lim_{n\to\infty} C(n) = 0,$$
then $\frac{1}{n}\sum_{i=1}^{n} X_i$ converges to $\mu$ in mean square and in probability.

Proof. Since $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, we have
$$\Big|\sum_{i,j=1}^{n}\mathrm{Cov}(X_i, X_j)\Big| \le \sum_{i,j=1}^{n}\big|\mathrm{Cov}(X_i, X_j)\big| \le 2\sum_{i=1}^{n}\sum_{j=1}^{i}\big|\mathrm{Cov}(X_i, X_j)\big|,$$
so it is sufficient to prove that
$$\lim_{n\to\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{i}\big|\mathrm{Cov}(X_i, X_j)\big| = 0.$$
Since the process is stationary, $\mathrm{Cov}(X_i, X_j) = C(i-j)$, so we have to prove $\lim_{n\to\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{i}|C(i-j)| = 0$. But
$$\sum_{i=1}^{n}\sum_{j=1}^{i}|C(i-j)| = \sum_{i=1}^{n}\sum_{k=0}^{i-1}|C(k)|$$
$$= |C(0)| + \big(|C(0)| + |C(1)|\big) + \big(|C(0)| + |C(1)| + |C(2)|\big) + \dots + \big(|C(0)| + \dots + |C(n-1)|\big)$$
$$= n|C(0)| + (n-1)|C(1)| + (n-2)|C(2)| + \dots + |C(n-1)| = \sum_{k=0}^{n-1}(n-k)|C(k)| \le n\sum_{k=0}^{n-1}|C(k)|.$$

Therefore it is sufficient to prove $\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}|C(k)| = 0$. If $\lim_{n\to\infty} C(n) = 0$, for every $\varepsilon > 0$ there is $n_0$ such that for all $n \ge n_0$ we have $|C(n)| \le \varepsilon$. Hence, for $n \ge n_0$,
$$\frac{1}{n}\sum_{k=0}^{n-1}|C(k)| \le \frac{1}{n}\sum_{k=0}^{n_0-1}|C(k)| + \frac{1}{n}\sum_{k=n_0}^{n-1}\varepsilon \le \frac{1}{n}\sum_{k=0}^{n_0-1}|C(k)| + \varepsilon.$$
Since $\sum_{k=0}^{n_0-1}|C(k)|$ is independent of $n$, there is $n_1 \ge n_0$ such that for all $n \ge n_1$
$$\frac{1}{n}\sum_{k=0}^{n_0-1}|C(k)| \le \varepsilon.$$
Therefore, for all $n \ge n_1$,
$$\frac{1}{n}\sum_{k=0}^{n-1}|C(k)| \le 2\varepsilon.$$
This means that $\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}|C(k)| = 0$. The proof is complete.

9.1. Rate of convergence. Concerning the rate of convergence, recall from the proof of the LLN that
$$E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big)^2\Big] \le \frac{\sigma^2}{n}.$$
We can reach a similar result in the case of the ergodic theorem, under a suitable assumption.

Proposition 7. If $(X_n)_{n\ge 1}$ is a wide-sense stationary process such that
$$B := \sum_{k=0}^{\infty}|C(k)| < \infty$$
(this implies $\lim_{n\to\infty} C(n) = 0$), then
$$E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big)^2\Big] \le \frac{2B}{n}.$$

Proof. It is sufficient to put together several steps of the previous proofs:
$$E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\Big)^2\Big] = \frac{1}{n^2}\sum_{i,j=1}^{n}\mathrm{Cov}(X_i, X_j) \le \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{i}\big|\mathrm{Cov}(X_i, X_j)\big| \le \frac{2}{n}\sum_{k=0}^{n-1}|C(k)| \le \frac{2B}{n}.$$
The proof is complete.

Notice that the assumptions of these two ergodic results (especially the ergodic theorem) are very general and always satisfied in our examples.
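Proposition 7 can be illustrated for the damped linear equation with $\alpha = 0.9$ (a Python sketch; names are ours): there $C(k) = \alpha^k\sigma_X^2$ for $k \ge 0$, so $B = \frac{\sigma_X^2}{1-\alpha}$, and the true asymptotic mean-square error, $\frac{\sigma_X^2}{n}\frac{1+\alpha}{1-\alpha}$, sits comfortably below the bound $\frac{2B}{n}$.

```python
import numpy as np

# Monte Carlo estimate of E[(sample mean)^2] (mu = 0) over M stationary
# realizations of length n, compared with the bound 2B/n of Proposition 7.
rng = np.random.default_rng(6)
alpha, n, M = 0.9, 2000, 2000
sigma2_X = 1.0 / (1 - alpha**2)
X = np.empty((M, n))
X[:, 0] = rng.normal(0.0, np.sqrt(sigma2_X), M)   # stationary start, mu = 0
for i in range(n - 1):
    X[:, i + 1] = alpha * X[:, i] + rng.normal(0.0, 1.0, M)

mse = np.mean(X.mean(axis=1) ** 2)
bound = 2 * (sigma2_X / (1 - alpha)) / n          # 2B/n ~ 0.0526
```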

9.2. Empirical autocorrelation function. Very often we need the convergence of time averages of certain functions of the process: we would like to have
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) \to E[g(X)]$$
in mean square, for certain functions $g$. We need to check the assumptions of the ergodic theorem for the sequence $(g(X_n))_{n\ge 1}$. Here is a simple example.

Proposition 8. Let $(X_n)_{n\ge 1}$ be a wide-sense stationary process, with finite fourth moments, such that $E[X_n^2 X_{n+k}^2]$ is independent of $n$ and
$$\lim_{k\to\infty} E[X_0^2 X_k^2] = \big(E[X_0^2]\big)^2.$$
Then $\frac{1}{n}\sum_{i=1}^{n} X_i^2$ converges to $E[X_0^2]$ in mean square and in probability.

Proof. Consider the process $Y_n = X_n^2$. The mean function of $(Y_n)$ is $E[X_n^2]$, which is independent of $n$ by the wide-sense stationarity of $(X_n)$. For the autocorrelation function
$$R_Y(n, n+k) = E[Y_n Y_{n+k}] = E[X_n^2 X_{n+k}^2]$$
we need the new assumption of the proposition. Thus $(Y_n)$ is wide-sense stationary. Finally, the assumption $\lim_{k\to\infty} E[X_0^2 X_k^2] = (E[X_0^2])^2$ means $\lim_{k\to\infty} C_Y(k) = 0$, where $C_Y(k)$ is the covariance function of $(Y_n)$, so we can apply the ergodic theorem. The proof is complete.

More remarkable is the following result, related to the estimation of $R(k)$ by the sample path autocorrelation function. Given a process $(X_n)_{n\ge 1}$, call sample path (or empirical) autocorrelation function the process
$$\frac{1}{n}\sum_{i=1}^{n} X_i X_{i+k}.$$

Theorem 3. Let $(X_n)_{n\ge 1}$ be a wide-sense stationary process, with finite fourth moments, such that $E[X_n X_{n+k} X_{n+j} X_{n+j+k}]$ is independent of $n$ and
$$\lim_{j\to\infty} E[X_0 X_k X_j X_{j+k}] = R(k)^2.$$
Then the sample path autocorrelation function $\frac{1}{n}\sum_{i=1}^{n} X_i X_{i+k}$ converges to $R(k)$ as $n \to \infty$, in mean square and in probability. Precisely, for every $k \in \mathbb{N}$, we have
$$\lim_{n\to\infty} E\Big[\Big(\frac{1}{n}\sum_{i=1}^{n} X_i X_{i+k} - R(k)\Big)^2\Big] = 0,$$
and similarly for the convergence in probability.

Proof. Given $k \in \mathbb{N}$, consider the new process $Y_n = X_n X_{n+k}$. Its mean function is constant in $n$ because of the wide-sense stationarity of $(X_n)$. For the autocorrelation function,
$$R_Y(n, n+j) = E[Y_n Y_{n+j}] = E[X_n X_{n+k} X_{n+j} X_{n+j+k}]$$