Brooklyn College, CUNY. Lecture Notes. Christian Beneš


Brooklyn College, CUNY
Math 4506: Time Series
Lecture Notes
Spring 2015
Christian Beneš

Math 4506 (Spring 2015), January 28, 2015. Prof. Christian Beneš

Lecture #1: Introduction; Probability Review

1.1 What this course is about: time series and time series models

"Essentially, all models are wrong, but some are useful." (George E. P. Box)

Probabilists study time series models. These are abstract random objects which are completely well-defined and can generate sets of data (using random number generators). Statisticians study time series (which are data sets) and try to find the right model for them, that is, the time series model from which the data could have been generated. In that sense, probabilists and statisticians do opposite jobs, the first being (arguably) more elegant, the second being (definitely) more practical.

Below are some examples of time series. The first three are real-world data. The following six are computer-generated. Our goal in this course will be to find ways to construct models from which these data could have arisen.

[Figure: Baltimore city annual water use, liters per capita per day.]

[Figure: an index series (caption truncated in the transcription).]

[Figure: Daily value of one $US in Euros, May 6 to May 6 (dates truncated in the transcription).]

[Figure: Closing value of NASDAQ 100 index, July to January 23 (dates truncated in the transcription).]

[Figure: Ten random data points.] What can we say about the underlying distribution?

[Figure: Another ten data points.] What about these 10 data points?

[Figure: A third set of ten data points.] Same question.

Scale is important when visualizing data. Here are the same data sets as on the previous page, shown all three at the same scale:

It turns out that these data are drawn from the (multivariate) normal distributions N(0, Σ_1), N(0, Σ_2), N(0, Σ_3), respectively, where Σ_1 is the 10 × 10 identity matrix, Σ_2 is the 10 × 10 matrix with 1 in every diagonal entry and 4/5 in every off-diagonal entry,

Σ_2 = \begin{pmatrix} 1 & 4/5 & \cdots & 4/5 \\ 4/5 & 1 & \cdots & 4/5 \\ \vdots & \vdots & \ddots & \vdots \\ 4/5 & 4/5 & \cdots & 1 \end{pmatrix},

and Σ_3 is the 10 × 10 matrix with 1 in every diagonal entry and 24/25 in every off-diagonal entry.

If you're not sure what this means, don't worry. Details are coming up. In a nutshell, the samples of the first data set are drawn from independent normal random variables, while those of the other two sets are drawn from a family of pairwise positively correlated random variables (with covariances 4/5 in the first case and 24/25 in the second).

The main purpose of time series modeling is to come up (as one would expect) with the stochastic process (time series model) from which the observed data (time series) is a realization. This is an impossible task, as suggested by the quote at the beginning of this lecture. Randomness in the real world is simply too complex to grasp completely. However, there are ways to determine, according to some (sometimes subjective) criteria, which models work better and which models don't work as well in a given setting.
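As a preview of the R session that appears in Lecture 2, here is a minimal sketch of how data sets like the second one above could be generated. The function mvrnorm comes from the MASS package; the variable names are mine and are not part of the original notes.

library(MASS)                                     # for mvrnorm
n <- 10
Sigma2 <- matrix(4/5, n, n); diag(Sigma2) <- 1    # 1 on the diagonal, 4/5 off the diagonal
x <- mvrnorm(1, mu = rep(0, n), Sigma = Sigma2)   # 10 pairwise positively correlated normals
plot(x)

Replacing 4/5 by 24/25 (or by 0) gives the other two data sets.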

Here's where finding a model for data is tricky: there are many choices of model which at first (and even second) glance seem reasonable for a given data set. I am sure none of you would have been shocked if I had told you that the second-to-last data set above was drawn from independent normal random variables with mean 0 and standard deviation 1/2. Nor would you have been very troubled if I'd suggested that they were generated using independent exponential random variables with mean 1. This illustrates the fact that in time series modeling, one often has a choice between a number of models (in the case I just mentioned, types of random variables) and, within these, a number of parameters (means, variances, covariances, etc.).

In this course, you will be exposed to a number of models which all depend on a number of parameters. There usually isn't a systematic way to choose a model (and the corresponding parameters), so modeling usually requires a fair dose of theoretical understanding (to determine if a model is even acceptable in a given setting) and flair (since all models are wrong, experience comes in handy when trying to find one that is better than others).

Since the title of this course is Time Series, it might be useful if we know what a time series is!

Definition 1.1. A time series is simply a set of observations {x_t}, with each data point being observed at a specific time t. A time series model is a set of random variables {X_t}, each of which corresponds to a specific time t.

Notation. The symbol A := B means "A is defined to equal B", whereas C = D by itself means simply that C and D are equal. This is an important distinction because if you write A := B, then there is no need to verify the equality of A and B. They are equal by definition. However, if C = D, then there IS something that needs to be proved, namely the equality of C and D (which might not be obvious). For example, you may recall that for a random variable X,

Var(X) := E[(X − E[X])^2]   and   Var(X) = E[X^2] − E[X]^2.

1.2 Introduction to Random Variables

"While writing my book [Stochastic Processes] I had an argument with Feller. He asserted that everyone said 'random variable' and I asserted that everyone said 'chance variable'. We obviously had to use the same name in our books, so we decided the issue by a stochastic procedure. That is, we tossed for it and he won." (Joe Doob)

In probability, Ω is used to denote the sample space of outcomes of an experiment.

Example 1.1. Toss a die once: Ω = {1, 2, 3, 4, 5, 6}.

Example 1.2. Toss two dice: Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.

Note that in each case Ω is a finite set. (That is, the cardinality of Ω, written |Ω|, is finite.)

Example 1.3. Consider a needle attached to a spinning wheel centred at the origin. When the wheel is spun, the angle ω made by the tip of the needle with the positive x-axis is measured. The possible values of ω are Ω = [0, 2π). In this case, Ω is an uncountably infinite set. (That is, Ω is uncountable, with |Ω| = ∞.)

Definition 1.2. A random variable X is a function from the sample space Ω to the real numbers R = (−∞, ∞). Symbolically,

X : Ω → R,   ω ↦ X(ω).

Example 1.4 (1.1 continued). Let X denote the upmost face when a die is tossed. Then X(i) = i, i = 1, ..., 6.

Example 1.5 (1.2 continued). Let X denote the sum of the upmost faces when two dice are tossed. Then X((i, j)) = i + j, i = 1, ..., 6, j = 1, ..., 6. Note that the elements of Ω are ordered pairs, so that the function X(·) acts on (i, j), giving X((i, j)). We will often omit the inner parentheses and simply write X(i, j).

Example 1.6 (1.3 continued). Let X denote the cosine of the angle made by the needle on the spinning wheel and the positive x-axis. Then X(ω) = cos(ω), so that X(ω) ∈ [−1, 1].

Remark. As mentioned in the definition, a random variable is really a function whose input variable is random, that is, determined by chance (or God, or destiny, or karma, or whatever you think decides how our world works). The use of the notation X and X(ω) is EXACTLY the same as the use of f and f(x) in elementary calculus. For example, f(x) = x^2, f(t) = t^2, f(ω) = ω^2, and X(ω) = ω^2 all describe EXACTLY the same function, namely the function which takes a number and squares it. What makes random variables slightly more complicated than functions is that, unlike the variable x from calculus, the variable ω is random and therefore comes from a distribution.

1.3 Discrete and Continuous Random Variables

Definition 1.3. Suppose that X is a random variable. Suppose that there exists a function f : R → R with the properties that f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and

P({ω ∈ Ω : X(ω) ≤ a}) =: P(X ≤ a) = ∫_{−∞}^{a} f(x) dx.

We call f the (probability) density (function) of X and say that X is a continuous random variable. Furthermore, the function F defined by F(a) := P(X ≤ a) is called the (probability) distribution (function) of X.

Note 1.1. By the Fundamental Theorem of Calculus, F'(x) = f(x).

Remark. There exist continuous random variables which do not have densities. Although it's good to know that the definition of continuous random variables is slightly more general than what is suggested above, you won't need to worry about it in this course.

Example 1.7. A random variable X is said to be normally distributed with parameters µ, σ^2, if the density of X is

f(x) = (1/(σ√(2π))) exp(−(x − µ)^2 / (2σ^2)),   −∞ < µ < ∞, 0 < σ < ∞.

This is sometimes written X ~ N(µ, σ^2). In Exercise 1.2, you will show that the mean of X is µ and the variance of X is σ^2.

Definition 1.4. Suppose that X is a random variable. Suppose that there exists a function p : Z → R with the properties that p(k) ≥ 0 for all k, Σ_{k=−∞}^{∞} p(k) = 1, and

P({ω ∈ Ω : X(ω) ≤ N}) =: P(X ≤ N) = Σ_{k=−∞}^{N} p(k).

We call p the (probability mass function or) density of X and say that X is a discrete random variable. Furthermore, the function F defined by F(N) := P(X ≤ N) is called the (probability) distribution (function) of X.

Example (continued). If X is defined to be the sum of the upmost faces when two dice are tossed, then the density of X, written p(k) := P(X = k), is given by

k      2     3     4     5     6     7     8     9     10    11    12
p(k)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

and p(k) = 0 for any other k ∈ Z.

Remark. There do exist random variables which are neither discrete nor continuous; however, such random variables will not concern us.

1.4 Expectation and Variance

Suppose that X : Ω → R is a random variable (either discrete or continuous), and that g : R → R is a (piecewise) continuous function. Then Y := g ∘ X : Ω → R defined by Y(ω) = g(X(ω)) is also a random variable. We usually write Y = g(X). We now define the expectation of the random variable Y, distinguishing the discrete and continuous cases.

Definition 1.5. If X is a discrete random variable and g is as above, then the expectation of g ∘ X is given by

E[g(X)] := Σ_k g(k) p(k),

where p is the probability mass function of X.

Definition 1.6. If X is a continuous random variable and g is as above, then the expectation of g ∘ X is given by

E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx,

where f is the probability density function of X.

Notice that if g is the identity function (that is, g(x) = x for all x), we get the expectation of X itself:

E[X] := Σ_k k p(k) if X is discrete,   and   E[X] := ∫_{−∞}^{∞} x f(x) dx if X is continuous.

µ := E[X] is also called the mean of X. Note that −∞ ≤ µ ≤ ∞. If −∞ < µ < ∞, then we say that X has a finite mean, or that X is an integrable random variable, and we write X ∈ L^1.

Exercise 1.1. Suppose that X is a Cauchy random variable. That is, X is a continuous random variable with density function

f(x) = 1 / (π(1 + x^2)).

Carefully show that X ∉ L^1 (that is, X doesn't have a finite mean).

Theorem 1.1 (Linearity of Expectation). Suppose that X : Ω → R and Y : Ω → R are (discrete or continuous) random variables with X ∈ L^1 and Y ∈ L^1. Suppose also that f : R → R and g : R → R are both (piecewise) continuous and such that f(X) ∈ L^1 and g(Y) ∈ L^1. Then, for any a, b ∈ R, af(X) + bg(Y) ∈ L^1 and, furthermore,

E[af(X) + bg(Y)] = aE[f(X)] + bE[g(Y)].

Using Definitions 1.5 and 1.6, we can compute the kth moments E[X^k] of a random variable X. One frequent assumption about a random variable is that it has a finite second moment. This is to ensure that the Central Limit Theorem can be used.

Definition 1.7. If X is a random variable with E[X^2] < ∞, then we say that X has a finite second moment and write X ∈ L^2. If X ∈ L^2, then we define the variance of X to be the number σ^2 := E[(X − µ)^2]. The standard deviation of X is the number σ := √(σ^2). (As usual, this is the positive square root.)
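As a quick illustration of Definition 1.5, the pmf table from the previous page and the expectation of the sum of two dice can be checked with a few lines of R by enumerating the 36 equally likely outcomes. This is a sketch added for illustration; it is not part of the original notes.

sums <- outer(1:6, 1:6, "+")      # all 36 possible values of the sum
p <- table(sums) / 36             # the pmf p(2), ..., p(12) from the table above
sum(as.numeric(names(p)) * p)     # E[X] = sum over k of k * p(k), which equals 7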

Remark. It is an important fact that if X ∈ L^2, then it must be the case that X ∈ L^1.

The following is a useful formula when computing variances (people sometimes confuse it with the definition of variance, which it's not; for the definition, see above).

Theorem 1.2. Suppose X ∈ L^2. Then

Var(X) = E[X^2] − E[X]^2.

Proof. By linearity of expectation,

Var(X) = E[(X − µ)^2] = E[X^2 − 2µX + µ^2] = E[X^2] − E[2µX] + E[µ^2] = E[X^2] − 2µE[X] + µ^2 = E[X^2] − 2µ^2 + µ^2 = E[X^2] − µ^2 = E[X^2] − E[X]^2.

The following exercise is a little bit tedious, but you should make sure you know how to do it. If you remember doing it and remember well how it works, feel free to skip it. Since this lecture and the next are mostly review, I am including several exercises which are meant to refresh your memory on some basic ideas from probability but which you may know very well how to do already. That's why I'm including the comment "(optional)" next to them. I will not include these problems on the homework assignments.

Exercise 1.2. (optional) The purpose of this exercise is to make sure you can compute some straightforward (but messy) integrals. [Hint: A change of variables will make them easier to handle.] Suppose that X ~ N(µ, σ^2); that is, X is a normally distributed random variable with parameters µ, σ^2. (See Example 1.7 for the density of X.) Show directly (without using any unstated properties of expectations or distributions) that

E[X] = µ,   E[X^2] = σ^2 + µ^2,   E[e^{θX}] = exp(θµ + σ^2 θ^2 / 2) for 0 ≤ θ < ∞,   and   Var(X) = σ^2.

[Note that the last identity follows from the first two parts and Theorem 1.2.]

This is the reason that if X ~ N(µ, σ^2), we say that X is normally distributed with mean µ and variance σ^2 (not just with parameters µ and σ^2).
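Exercise 1.2 can be sanity-checked numerically before doing the integrals by hand. The sketch below (not part of the original notes) uses R's integrate() with one arbitrary choice of parameters, µ = 1, σ = 2, θ = 0.3.

mu <- 1; s <- 2; theta <- 0.3
f <- function(x) dnorm(x, mean = mu, sd = s)                   # the N(mu, sigma^2) density
integrate(function(x) x   * f(x), -Inf, Inf)$value             # close to mu = 1
integrate(function(x) x^2 * f(x), -Inf, Inf)$value             # close to sigma^2 + mu^2 = 5
integrate(function(x) exp(theta * x) * f(x), -Inf, Inf)$value  # numerical E[exp(theta X)]
exp(theta * mu + s^2 * theta^2 / 2)                            # the claimed closed form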

Math 4506 (Spring 2015), February 2, 2015. Prof. Christian Beneš

Lecture #2: Multivariate Random Variables

2.1 Bivariate Random Variables

Theorem 2.1. If X and Y are random variables with X ∈ L^2 and Y ∈ L^2, then the product XY is a random variable with XY ∈ L^1.

Definition 2.1. If X and Y are both random variables in L^2, then the covariance of X and Y, written Cov(X, Y), is defined to be

Cov(X, Y) := E[(X − µ_X)(Y − µ_Y)],

where µ_X := E[X], µ_Y := E[Y]. Whenever the covariance of X and Y exists, we define the correlation of X and Y to be

Corr(X, Y) := Cov(X, Y) / (σ_X σ_Y),

where σ_X is the standard deviation of X, and σ_Y is the standard deviation of Y.

Remark. By convention, 0/0 := 0 in the definition of correlation. This arbitrary choice is designed to simplify some formulas and means that if Var(X) = 0 or Var(Y) = 0, then Corr(X, Y) = 0 (this follows from the fact that if Var(X) = 0 or Var(Y) = 0, then Cov(X, Y) = 0). Since if Var(X) = 0, X is constant (in which case we call X degenerate, which in this context just means non-random), this means that the correlation of two random variables is always 0 if one of them is degenerate.

Definition 2.2. We say that X and Y are uncorrelated if Cov(X, Y) = 0 (or, equivalently, if Corr(X, Y) = 0).

Fact 2.1. If X ∈ L^2 and Y ∈ L^2, then the following computational formulas hold:

Cov(X, Y) = E[XY] − E[X]E[Y];
Var(X) = Cov(X, X).

Exercise 2.3. Verify the two computational formulas above. [Note that the formulas don't necessarily hold without the assumption that X ∈ L^2 and Y ∈ L^2, so make sure you explain why these assumptions are needed in general.]

Definition 2.3. Two random variables X and Y are said to be independent if f(x, y), the joint density of (X, Y), can be expressed as

f(x, y) = f_X(x) f_Y(y),

where f_X is the (marginal) density of X and f_Y is the (marginal) density of Y.

Remark. Notice that we have combined the cases of a discrete and a continuous random variable into one definition. You can substitute the phrases "probability mass function" or "probability density function" as appropriate.

The following result is often needed and at first glance is not completely obvious.

Theorem 2.2. If X and Y are independent random variables with X ∈ L^1 and Y ∈ L^1, then the product XY is a random variable with XY ∈ L^1, and

E[XY] = E[X] E[Y].

Exercise 2.4. (optional) Using this theorem, quickly prove that if X and Y are independent random variables, then they are necessarily uncorrelated. (As the next exercise shows, the converse, however, is not true: there do exist uncorrelated, dependent random variables.)

Exercise 2.5. (optional) Consider the random variable X defined by P(X = −1) = 1/4, P(X = 0) = 1/2, P(X = 1) = 1/4. Let the random variable Y be defined as Y := X^2. Hence,

P(Y = 0 | X = 0) = 1,   P(Y = 1 | X = −1) = 1,   P(Y = 1 | X = 1) = 1.

Show that the density of Y is P(Y = 0) = 1/2, P(Y = 1) = 1/2. Find the joint density of (X, Y), and show that X and Y are not independent. Find the density of XY, compute E[XY], and show that X and Y are uncorrelated.

The following result allows us to get a grip on the variance in algebraic manipulations when the random variables involved are independent:

Theorem 2.3 (Linearity of Variance in the Case of Independence). Suppose that X : Ω → R and Y : Ω → R are (discrete or continuous) random variables with X ∈ L^2 and Y ∈ L^2. If X and Y are independent, then X + Y ∈ L^2 and

Var(X + Y) = Var(X) + Var(Y).

2.2 Multivariate Random Variables

We just saw that pairs of random variables can be more complicated than what one might like to think. It is not enough to know the distributions of the random variables X and Y to know how they behave together. Think of the following example: You may know the distribution of the heights (X) and weights (Y) of people in a certain population. However, this by itself will not tell you how height affects weight and vice versa. The information on how the random variables are related is not contained in the distributions of X and Y (that is, the marginals). To have an idea of the relative behavior of random variables, one needs the correlation coefficient. Recall:

If we want to describe a single random variable (also called a univariate random variable), we need a density f(x), which graphically can be described as a curve (or a set of points in the discrete case) in the plane. If we want to describe a pair of random variables (also called bivariate random variables), we need a joint density f(x, y), which graphically can be described as a surface (or a set of points in the discrete case) in space.

This extends easily to higher dimensions: If we want to describe a family of n random variables, we need a joint density f(x_1, ..., x_n), which graphically can be described as a hyper-surface (or a set of points in the discrete case) in (n + 1)-dimensional space. We are usually comfortable with drawing or imagining objects in 1, 2, or 3 dimensions. In higher dimensions, we tend to get a headache before we can make sense of what we are trying to represent, so we will limit ourselves to depicting densities of univariate and bivariate random variables and will deal with the rest algebraically (and refer to pictures in dimensions ≤ 3 when we get confused and need a picture to help us out).

We will write x = (x_1, ..., x_n)' and will think of random vectors as being column vectors. Therefore, the random vector X = (X_1, ..., X_n)' has joint distribution (we will often just say distribution)

F(x_1, ..., x_n) = P(X_1 ≤ x_1, ..., X_n ≤ x_n).

An equivalent way of writing this is

F(x) = P(X ≤ x).

Recall that if F(x, y) is a bivariate distribution (say for jointly continuous r.v.'s), then

F(x) = P(X ≤ x) = P(X ≤ x, Y < ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(a, b) db da = F(x, ∞).

The distributions of subsets of random variables are obtained in the same way as in 2 dimensions: If F(x_1, ..., x_n) is a multivariate distribution, then, for instance,

F(x_1, x_2, x_n) := P(X_1 ≤ x_1, X_2 ≤ x_2, X_3 < ∞, ..., X_{n−1} < ∞, X_n ≤ x_n) = F(x_1, x_2, ∞, ..., ∞, x_n).

For univariate random variables, you know that the p.d.f. is the derivative of the distribution function. In higher dimensions, this is true as well, but since we are dealing with functions of several variables, we have to talk about partial derivatives:

f(x_1, ..., x_n) = ∂^n F(x_1, ..., x_n) / (∂x_1 ··· ∂x_n).

The random variables X_1, ..., X_n are independent if

F(x_1, ..., x_n) = F_{X_1}(x_1) ··· F_{X_n}(x_n)

or, alternatively, if the joint p.d.f. (p.m.f.) is the product of the marginal p.d.f.'s (p.m.f.'s).

Since the random vector X = (X_1, ..., X_n)' is a vector, so is its mean E[X] = (E[X_1], ..., E[X_n])'. Since there is a covariance between any two of the X_i, there is a total of n^2 covariances, which compose the covariance matrix

Σ_X = \begin{pmatrix} Cov(X_1, X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Cov(X_2, X_2) & \cdots & Cov(X_2, X_n) \\ \vdots & & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Cov(X_n, X_n) \end{pmatrix}.

Note that

Σ_X = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n) \\ \vdots & & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n) \end{pmatrix}.

Since for any i, j ∈ {1, ..., n}, Cov(X_i, X_j) = Cov(X_j, X_i), the covariance matrix is symmetric.

The following result tells us how to deal with the covariance of linear combinations of random variables.

Theorem 2.4. If X, Y, Z ∈ L^2 and a, b, c ∈ R, then

Cov(aX + bY + c, Z) = a Cov(X, Z) + b Cov(Y, Z).

Exercise 2.6. (optional) Prove Theorem 2.4.

Note 2.1. From this theorem follows another result which you already know: Var(aX) = a^2 Var(X).
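In R, the sample analogue of the covariance matrix above is computed with cov(). The short sketch below (not from the original notes) builds a data matrix with three columns of simulated observations; cov(X) is then 3-by-3, with sample variances on the diagonal and sample covariances off the diagonal, and it is symmetric, as noted above.

set.seed(1)
X <- cbind(rnorm(200), rnorm(200), rnorm(200))   # 200 observations of 3 variables
S <- cov(X)                                       # 3-by-3 sample covariance matrix
S
all.equal(S, t(S))                                # TRUE: the matrix is symmetric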

2.3 Some Basic Linear Algebra

Caveat 2.1. I may not be entirely consistent with notation in what follows. Sometimes, vectors will be represented by boldfaced symbols (x) and sometimes with an arrow, like this: x⃗. On rare occasions, I may use the same notation as for scalars, since that notation is common as well. If that's the case, you should be able to figure out from context whether you're dealing with a vector or not.

For matrices and a vector we have the following definitions:

A = [a_{ij}]_{1≤i≤k, 1≤j≤l} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,l-1} & a_{1,l} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,l-1} & a_{2,l} \\ \vdots & & & & \vdots \\ a_{k,1} & a_{k,2} & \cdots & a_{k,l-1} & a_{k,l} \end{pmatrix},

B = [b_{ij}]_{1≤i≤l, 1≤j≤n} = \begin{pmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,n-1} & b_{1,n} \\ b_{2,1} & b_{2,2} & \cdots & b_{2,n-1} & b_{2,n} \\ \vdots & & & & \vdots \\ b_{l,1} & b_{l,2} & \cdots & b_{l,n-1} & b_{l,n} \end{pmatrix},

v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{l-1} \\ v_l \end{pmatrix}.

The product of two matrices is AB = [c_{i,j}]_{1≤i≤k, 1≤j≤n}, where

c_{i,j} = Σ_{m=1}^{l} a_{i,m} b_{m,j}.

In particular, the product of a matrix and a vector is

A v = \begin{pmatrix} Σ_{i=1}^{l} a_{1,i} v_i \\ Σ_{i=1}^{l} a_{2,i} v_i \\ \vdots \\ Σ_{i=1}^{l} a_{k-1,i} v_i \\ Σ_{i=1}^{l} a_{k,i} v_i \end{pmatrix}.

The transpose of the matrix A is A' = [c_{i,j}]_{1≤i≤l, 1≤j≤k}, where c_{i,j} = a_{j,i}.

The determinant of a matrix A, written det(A), is something fairly easy to compute, but its definition isn't exactly short, so those who can't remember it should look it up in a book on linear algebra. Wikipedia also has a definition and some examples. Note that the determinant is defined only for square matrices (with the same number of rows and columns). We say that A is singular if det(A) = 0. Otherwise, A is nonsingular.

The following definitions are for the case k = l (that is, A is a square matrix):

If A is nonsingular, the inverse of A, denoted by A^{−1}, is the unique matrix such that

A A^{−1} = A^{−1} A = 1_k := \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.

If it is clear from context what the dimensions of the matrix are, we write 1 = 1_k.

A is called orthogonal if A' = A^{−1}. In that case, AA' = A'A = 1.

A is symmetric if for all 1 ≤ i, j ≤ k, a_{i,j} = a_{j,i}.

A is positive definite if for all vectors v = (v_1, ..., v_k)',

v' A v ≥ 0.

Theorem 2.5. If an n × n matrix A is symmetric and positive definite, it can be written as

A = P Λ P',

where

Λ = \begin{pmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{pmatrix}

and P is orthogonal. Here, λ_1, ..., λ_n are the eigenvalues of A.

Theorem 2.6. The covariance matrix of a random vector X is symmetric and positive definite.
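Theorems 2.5 and 2.6 can be illustrated numerically with R's eigen() function, which returns the eigenvalues and an orthogonal matrix of eigenvectors for a symmetric matrix. The example matrix below is my own; this is only a sketch, not part of the original notes. The same decomposition is what Note 2.2 on the next page uses to build the matrix square root Σ^{1/2}.

A <- matrix(c(1, 4/5, 4/5, 1), 2, 2)   # a small covariance-type matrix
e <- eigen(A)
P <- e$vectors                          # plays the role of P (orthogonal)
Lambda <- diag(e$values)                # diagonal matrix of eigenvalues
P %*% Lambda %*% t(P)                   # recovers A, as in Theorem 2.5
e$values                                # 1.8 and 0.2, both nonnegative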

Corollary 2.1. The covariance matrix Σ of a random vector X can be written in the form Σ = P Λ P', where

Λ = \begin{pmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{pmatrix}

and P is orthogonal.

Note 2.2. If we define

Λ^{1/2} := \begin{pmatrix} λ_1^{1/2} & 0 & \cdots & 0 \\ 0 & λ_2^{1/2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n^{1/2} \end{pmatrix}

and B = P Λ^{1/2} P', then, since P'P = PP' = 1,

B^2 = BB = P Λ^{1/2} P' P Λ^{1/2} P' = P Λ P' = Σ.

Since B^2 = Σ, it makes perfect sense to define

Σ^{1/2} := P Λ^{1/2} P' = B.   (1)

Since we will often deal with linear transformations of random variables, the following proposition will be useful:

Proposition 2.1. If X is a random vector, a is a (nonrandom) vector, B is a matrix, and Y = BX + a, then

E[Y] = a + B E[X],   Σ_Y = B Σ_X B'.

Proof. See first homework assignment.

2.4 Multivariate Normal Random Variables

You already know that the normal distribution is the most important of them all, since the central limit theorem tells us that as soon as we start adding up random variables, a normal pops up. Recall from Lecture 1 that a normal random variable X with parameters µ, σ^2 has density

f(x) = (1/(σ√(2π))) exp(−(x − µ)^2 / (2σ^2)),   −∞ < µ < ∞, 0 < σ < ∞.

You should verify that this is the one-dimensional particular case of the multivariate normal density with mean µ and nonsingular covariance matrix Σ (written X ~ N(µ, Σ)):

f_X(x) = (1 / ((2π)^n det(Σ))^{1/2}) exp{−(1/2) (x − µ)' Σ^{−1} (x − µ)}.

Note 2.3. Make sure you understand why one needs Σ to be nonsingular in order for the definition of the multivariate normal density to make sense.

Exercise 2.7. Suppose X ~ N(0, 1), Y ~ N(0, 2) are bivariate normal with correlation coefficient ρ(X, Y) = 1/2. Find the joint density of X and Y. Let S_1 be the square with vertices (0,0), (1,0), (0,1), and (1,1), and let S_2 be the square with vertices (0,0), (1,0), (0,−1), and (1,−1). Without doing any computations, explain which of P((X, Y) ∈ S_1) and P((X, Y) ∈ S_2) should be greater.

You probably recall that if X ~ N(µ, σ^2), you can apply a linear transformation to change X into a standard normal:

Z = (X − µ)/σ ~ N(0, 1).

The same works for the multivariate normal:

Exercise 2.8. Prove that if X ~ N(µ, Σ), then Z := Σ^{−1/2}(X − µ) ~ N(0, 1). In particular (prove this only in the bivariate case), the components of Z are independent. Hint: Use Proposition 2.1.

Note 2.4. This last exercise shows how to obtain a standard normal vector from any multivariate normal distribution. On the homework, you will also show how to do the converse, that is, obtain any multivariate normal distribution from the standard multivariate normal.

You can generate multivariate normal random variables in R using the following commands (note that comments about what a line does will follow the symbol %; these comments are not part of what you should include in your input line):

> library(MASS) % this loads the library in which the multivariate normal generator is
> S=c(1,0,0,1) % this generates the vector (1, 0, 0, 1)
> dim(S)=c(2,2) % this transforms the vector into a 2-by-2 matrix
> S % this allows you to check what S is
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> mu=c(0,0) % this is the mean (row) vector
> mu
[1] 0 0
> dim(mu)=c(2,1) % this makes the mean vector into a column vector
> mu
     [,1]
[1,]    0
[2,]    0
> N=mvrnorm(100,mu,S) % this generates 100 samples from the multivariate normal distribution with mean mu and covariance matrix S
> plot(N)

[Scatter plot of the 100 samples: N[,2] against N[,1].]

> S2=c(1,1,1,1)
> dim(S2)=c(2,2)
> N2=mvrnorm(100,mu,S2)
> plot(N2)

[Scatter plot of the 100 samples: N2[,2] against N2[,1].]

> S3=c(1,-0.8,-0.8,1)
> dim(S3)=c(2,2)
> N3=mvrnorm(100,mu,S3)
> plot(N3)

[Scatter plot of the 100 samples: N3[,2] against N3[,1].]

The following are the graphs of 3 multivariate normal densities (any two pictures on the same line are of the same pdf, but seen from different angles). Try to say as much as you can about their means and covariance matrices.

[Six surface plots: three bivariate normal densities, each shown from two viewing angles.]

[Surface plots.] The joint pdf of two independent standard normal random variables.

[Surface plots.] The joint pdf of two normal random variables with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.

[Surface plots.] The joint pdf of two normal random variables with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}.

When pictures of surfaces don't make as much sense as we'd like, we can always look at level curves. Here are the same graphs as above with level curves:

[Plots with level curves.] The joint pdf of two independent standard normal random variables.

[Plots with level curves.] The joint pdf of two normal random variables with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.

[Plots with level curves.] The joint pdf of two normal random variables with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}.

When you draw samples from a distribution, you should see most of your data points accumulate in areas of high probability. The shapes of these areas are precisely given by the level curves:

[Scatter plot.] 50 samples from a bivariate normal random variable with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

[Scatter plot.] 50 samples from a bivariate normal random variable with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.

[Scatter plot.] 50 samples from a bivariate normal random variable with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}.

The connection between the data and the distribution becomes more obvious as the data set increases in size:

[Scatter plot.] 500 samples from a bivariate normal random variable with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

[Scatter plot.] 500 samples from a bivariate normal random variable with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.

[Scatter plot.] 500 samples from a bivariate normal random variable with mean 0 and covariance matrix

Σ = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}.

Math 4506 (Spring 2015), February 4, 2015. Prof. Christian Beneš

Lecture #3: Decomposing Time Series; Stationarity

Reference. The material in this section is an introduction to time series and is meant to complement Chapter 1 in the textbook. Make sure you read that chapter in its entirety and work in parallel with R to reproduce what is being done in the textbook. This lecture also covers most of the topics from Chapter 2, which we will re-visit in more detail in the next lecture.

3.1 Basic decomposition

The following graph represents the number of monthly aircraft miles (in millions) flown by U.S. airlines between 1963 and 1970:

[Figure: time plot of the series Air.ts against Time.]

Given a data set such as the one above, how can we construct a model for it? The idea will be to decompose random data into three distinct components:

A trend component m_t (increase of populations, increase in global temperature, etc.)

A seasonal component s_t (describing cyclical phenomena such as annual temperature patterns, etc.)

A random noise component Y_t describing the non-deterministic aspect of the time series. Note that the book uses z_t for this component. In the notes, I'll write Y_t, as the letter z usually suggests a normal distribution, which may not be the actual underlying distribution of the random noise component.

A common model is the so-called additive model, that is, one where we try to find m_t, s_t, Y_t such that a given time series can be expressed as

X_t = m_t + s_t + Y_t.

We will never know what m_t, s_t, and Y_t actually are, but we can estimate them. The estimates will be called m̂_t, ŝ_t, and y_t. Note that we'll use the same notation for estimates and estimators in this case. Once we see the data, our estimates have to satisfy

x_t = m̂_t + ŝ_t + y_t,

where m̂_t is an estimate for m_t, ŝ_t is an estimate for s_t, and y_t is an estimate for Y_t.

The corresponding data set can be found online and looks like this:

[Table: monthly aircraft miles flown, one column per month, Jan through Dec; the numerical values are not reproduced in this transcription.]

In fact, this is not exactly the form in which the data set is found on that website. There, it doesn't have any labels. As it turns out, it is quite straightforward to include those labels with R.

Let's look at the graph above. Two patterns are striking:

There appears to be an increasing pattern.

There is a clear cyclical pattern with some apparently fixed period.

Some questions we'll try to answer throughout the course are:

How can we extract these patterns?

Once we've extracted the patterns, are we left with pure randomness or does the randomness have a structure?

Can we use these patterns to make predictions for future values of this time series?
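Since the aircraft-miles file itself is not reproduced in this transcription, here is a small sketch (on an artificial monthly series of my own, with a built-in trend and seasonal pattern) of how R's ts() and decompose() carry out exactly this kind of additive decomposition. The numbers are made up and only serve to illustrate the mechanics.

set.seed(1)
t <- 1:96
x <- 50 + 0.5 * t + 10 * sin(2 * pi * t / 12) + rnorm(96, sd = 3)   # trend + season + noise
X.ts <- ts(x, start = c(1963, 1), frequency = 12)                    # monthly time series
plot(X.ts)
plot(decompose(X.ts))   # estimated trend, seasonal component, and random remainder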

3.2 Stationary Time Series

We will eventually return to a more careful analysis of the trend and seasonal components of a time series, but focus for now on Y_t, the random component of a time series after extraction of a trend and cyclical component. Multidimensional distributions are very complicated objects and involve more parameters than we would like to deal with. We will focus on two essential quantities giving information about a time series: the means and the covariances.

Definition 3.1. If {X_t} is a time series with X_t ∈ L^1 for each t, then the mean function (or trend) of {X_t} is the non-random function

µ(t) := E[X_t].

Definition 3.2. If {X_t} is a time series with X_t ∈ L^2 for each t, then the autocovariance function of {X_t} is the non-random function

γ(t, s) := Cov(X_t, X_s) = E[(X_t − µ(t))(X_s − µ(s))].

The autocorrelation function of {X_t} is

ρ(t, s) = γ(t, s) / √(Var(X_t) Var(X_s)) = Corr(X_t, X_s).

Definition 3.3. We call the time series {X_t} second-order (or weakly) stationary if there is a constant µ such that µ(t) = µ for all t, and γ(t + h, t) only depends on h; that is, if γ(t + h, t) = γ(h, 0) =: γ(h) for all t and for all h.

Exercise 3.9. For a second-order stationary process, show that Var(X_t) = γ(0) for each t.

Via the last exercise, the second condition for second-order stationarity allows us to rephrase the definition above:

Definition 3.4. Suppose that {X_t} is a second-order stationary process. The autocovariance function (ACVF) at lag h of {X_t} is

γ(h) := Cov(X_{t+h}, X_t).

The autocorrelation function (ACF) at lag h of {X_t} is

ρ(h) := Corr(X_{t+h}, X_t).

Note 3.1. By Exercise 3.9,

ρ(h) = Cov(X_{t+h}, X_t) / √(Var(X_{t+h}) Var(X_t)) = γ(h) / γ(0).
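In R, the sample versions of γ(h) and ρ(h) for an observed series are computed with acf(). The short sketch below (added here for illustration, not from the original notes) applies it to simulated noise, for which the sample autocorrelations at lags h ≥ 1 should be close to zero.

set.seed(1)
z <- rnorm(200)
acf(z, type = "covariance")   # sample autocovariance function
acf(z)                        # sample autocorrelation function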

3.3 Some simple time series models

All the time series below are discrete-time; that is, the time set is a subset of the integers.

Example 3.1. (White Noise.) Often when taking measurements, little imprecisions (in the measuring device and on the part of the measurer) will yield measurements that are a little off. It is often assumed that these errors are uncorrelated and that they all come from a same distribution with zero mean. A sequence of random variables {X_n}_{n≥1} with E[X_n] = 0 and

E[X_k X_m] = σ^2 δ(k − m)

is called white noise. (The name comes from the spectrum of a stationary process, which we may discuss at the end of the semester. There are also noises that are pink, red, blue, purple, etc.) Here δ(k − m) is the Kronecker delta function, defined by

δ(x) = 1 if x = 0,   δ(x) = 0 if x ∈ R \ {0}.

Two important particular cases of white noise are:

The distribution of X_i is binary: P(X_i = a) = 1 − P(X_i = −a) = 1/2 for some a ∈ R.

X_i ~ N(0, σ^2). In this case, we talk about Gaussian white noise.

Example 3.2. (IID Noise.) A sequence of independent, identically distributed random variables {X_n}_{n≥1} with E[X_n] = 0 is called i.i.d. noise.

Example 3.3. (Random walk.) If {X_i}_{i≥1} is i.i.d. noise,

S_n = Σ_{i=1}^{n} X_i

is a random walk. In particular, if P(X_i = 1) = 1 − P(X_i = −1) = 1/2, we have a symmetric simple random walk. Random walks have been a (very crude) choice of model for the stock market for a long time.

[Figure: two independent realizations of a simple random walk of 100 time steps, plotting the position of the walker against the number of steps.]
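Pictures like the ones above can be reproduced (up to the randomness) with a few lines of R: cumulative sums of i.i.d. +1/−1 steps. This is a sketch added for illustration, not code from the original notes.

set.seed(1)
steps <- sample(c(-1, 1), 100, replace = TRUE)   # i.i.d. symmetric +1/-1 steps
S <- cumsum(steps)                                # the walk S_n = X_1 + ... + X_n
plot(S, type = "l", xlab = "Number of Steps", ylab = "Position of Walker")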

Example 3.4. (Gaussian time series.) {X_n}_{n≥1} is a Gaussian time series if for every collection of integers {i_k}_{1≤k≤n}, the vector

(X_{i_1}, ..., X_{i_n})

is multivariate Gaussian. Since many natural quantities have a normal distribution, this is a natural model in many settings. It also has the advantage of allowing many kinds of dependence between the data.

3.4 Autocovariance function: some examples

We saw that for stationary time series, covariance depends only on one parameter (the time between two given random variables), allowing us to define an autocovariance function at lag h. In the examples below, we compute the autocovariance function of the simple time series which we defined during the last lecture and use it to determine which of them are stationary and which are not.

Example 3.5 (White Noise). Suppose that {X_t} is white noise. We now verify that {X_t} is second-order stationary. First, it is obvious that µ(t) = 0 for all t. Second, if s ≠ t, then the assumption that the collection is uncorrelated implies that γ(t, s) = 0, s ≠ t. On the other hand, if s = t, then γ(t, t) = Var(X_t) = σ^2. Thus, µ(t) = 0 for all t, and

γ(h) = γ(t + h, t) = σ^2 if h = 0, and 0 if h ≠ 0.

This shows that {X_t} is indeed second-order stationary, since γ depends only on h. We write {X_t} ~ WN(0, σ^2) to indicate that {X_t} is white noise with Var(X_t) = σ^2 for each t.

Example 3.6 (IID Noise). Suppose instead that {X_t} is a collection of independent random variables, each with mean 0 and variance σ^2. We say that {X_t} is iid noise. As with white noise, we easily see that iid noise is stationary with trend µ(t) = 0 and

γ(h) = γ(t + h, t) = σ^2 if h = 0, and 0 if h ≠ 0.

We write {X_t} ~ IID(0, σ^2) to indicate that {X_t} is iid noise with Var(X_t) = σ^2 for each t.

Remark. With these two examples, we see that two different processes may both have the same trend and autocovariance function. Thus, µ(t) and γ(t + h, t) are NOT always enough to distinguish stationary processes. (However, for stationary Gaussian processes they are enough.)

Example 3.7. If S_t = Σ_{i=1}^{t} X_i (where {X_i} is a sequence of independent random variables with P(X_i = 1) = 1 − P(X_i = −1) = 1/2 and therefore Var(X_i) = 1) is a symmetric simple random walk, we find that if s > t,

γ(s, t) = Cov(S_s, S_t) = Cov(S_t + X_{t+1} + ... + X_s, S_t) = Cov(S_t, S_t) = Var(S_t) = Var(Σ_{i=1}^{t} X_i) = t.

In particular, γ(t + h, t) = t, which implies that simple random walk is not a stationary time series (since stationary time series have a constant variance).

Math 4506 (Spring 2015), February 9, 2015. Prof. Christian Beneš

Lecture #4: More stationary time series; Autocovariance; Linear Processes; MA processes

Reference. Chapter 2 and Sections 4.1 and 4.2 from the textbook.

4.1 Inequalities

Many probabilists are enthralled by inequalities (upper/lower bounds). One of the many purposes for finding upper bounds is to check that quantities are finite, by checking it for a more tractable but larger quantity. (This is something you've seen in the comparison test for integrals: Though it's not straightforward to check that

∫_{1000}^{∞} e^{−x^2} / (log log log x + 1) dx < ∞,

the fact that for x ≥ 1000,

0 ≤ e^{−x^2} / (log log log x + 1) ≤ e^{−x}

implies that

∫_{1000}^{∞} e^{−x^2} / (log log log x + 1) dx ≤ ∫_{1000}^{∞} e^{−x} dx < ∞.)

A very common inequality in analysis and probability is Jensen's inequality.

Definition 4.1. A function φ : R → R is called convex if for x, y ∈ R and 0 ≤ p ≤ 1,

φ(px + (1 − p)y) ≤ pφ(x) + (1 − p)φ(y).

Theorem 4.1 (Jensen's inequality). Suppose φ : R → R is convex. Suppose X is a random variable satisfying E[|X|] < ∞ and E[|φ(X)|] < ∞. Then

φ(E[X]) ≤ E[φ(X)].

Proof. If φ is convex, then for every x_0 ∈ R there is a c(x_0) such that

φ(x) − φ(x_0) ≥ c(x_0)(x − x_0) for all x.

Choosing x_0 = E[X] and letting x = X, we get

φ(X) ≥ c(E[X])(X − E[X]) + φ(E[X]).

Taking expectations on both sides concludes the proof.

Example 4.1. Two straightforward consequences of Jensen's inequality are:

|E[X]| ≤ E[|X|];

E[X]^2 ≤ E[X^2].

In particular, applying the second inequality to the random variable |X|, we get E[|X|]^2 ≤ E[|X|^2] = E[X^2], so that if E[X] = 0,

E[|X|] ≤ σ.   (2)

Two other very commonly useful inequalities are

Theorem 4.2 (Cauchy-Schwarz inequality). If X, Y ∈ L^2,

E[|XY|]^2 ≤ E[X^2] E[Y^2].

Note 4.1. This last inequality is the probabilistic version of the C-S inequality and should be compared with the C-S inequality in its most standard form:

(Σ_{i=1}^{n} x_i y_i)^2 ≤ (Σ_{i=1}^{n} x_i^2)(Σ_{i=1}^{n} y_i^2).   (3)

Theorem 4.3 (Triangle inequality). If x, y ∈ R,

|x + y| ≤ |x| + |y|.

By induction, if x_1, ..., x_n ∈ R,

|Σ_{i=1}^{n} x_i| ≤ Σ_{i=1}^{n} |x_i|.

4.2 Linear Processes

Definition 4.2. We define the backwards shift operator B by

B X_t = X_{t−1}.

For j ≥ 2, we define B^j X_t = B B^{j−1} X_t. In other words,

B^j X_t = X_{t−j}.

Definition 4.3. A time series {X_t}_{t∈Z} is a linear process if for every t ∈ Z, we can write

X_t = Σ_{i=−∞}^{∞} ψ_i Z_{t−i},   (4)

where Z_t ~ WN(0, σ^2) and the scalar sequence {ψ_i}_{i∈Z} satisfies Σ_{i∈Z} |ψ_i| < ∞. Using the shortcut ψ(B) = Σ_{i=−∞}^{∞} ψ_i B^i, we can write

X_t = ψ(B) Z_t.

If ψ_i = 0 for all i < 0, we call X a moving average or MA(∞) process.

Note 4.2. Infinite sums of random variables are somewhat delicate. You know what it means for an infinite sum of real numbers to converge, but for random variables, it isn't clear at first what the corresponding meaning would be. In fact, there are a number of different ways to give a meaning to the notion of convergence of random variables. For technical reasons, convergence of a sum of random variables is often taken in the mean square sense: the partial sums Σ_{k=1}^{n} Y_k converge in the mean square sense if there exists a random variable Y such that

E[(Σ_{k=1}^{n} Y_k − Y)^2] → 0 as n → ∞.

In any case, it should be intuitively clear that some requirement on the ψ_i is necessary, since if all the ψ_i were equal to 1, X_t would be an infinite sum of i.i.d. random variables, which does not converge (since we're always adding more random variables that don't shrink, the sum would not stabilize). The requirement Σ_{i∈Z} |ψ_i| < ∞ ensures that the random series Σ_{i∈Z} ψ_i Z_{t−i} has a limit. I won't expect you to completely understand what this means, but if you care about it, here's the argument:

Σ_i |ψ_i| < ∞  ⇒  Σ_i ψ_i^2 < ∞  ⇒  Σ_i ψ_i^2 E[Z_{t−i}^2] < ∞  ⇒  Σ_{i=m}^{n} ψ_i^2 E[Z_{t−i}^2] → 0 as m, n → ∞.

(The last implication is the Cauchy criterion for convergence of series.) Now, expanding the square and using the fact that the Z_{t−i} are uncorrelated with mean 0 (compare with the Cauchy-Schwarz inequality (3)),

E[(Σ_{i=m}^{n} ψ_i Z_{t−i})^2] = Σ_{i=m}^{n} ψ_i^2 E[Z_{t−i}^2].

Therefore,

Σ_{i=m}^{n} ψ_i^2 E[Z_{t−i}^2] → 0 as m, n → ∞  ⇒  E[(Σ_{i=m}^{n} ψ_i Z_{t−i})^2] → 0 as m, n → ∞  ⇒  Σ_{i=m}^{n} ψ_i Z_{t−i} converges as n, m → ∞  ⇒  Σ_i ψ_i Z_{t−i} converges.

The last implication is the Cauchy criterion for convergence of sequences of random variables.

Now that we know that the process defined in (4) exists, let's also show that for any t ∈ Z, X_t ∈ L^1: If Σ_{i∈Z} |ψ_i| < ∞, using the triangle inequality (for the first inequality; note that since it's an infinite sum, we have to take limits) and Jensen's inequality (for the last), we get

E[|X_t|] ≤ Σ_{i∈Z} E[|ψ_i Z_{t−i}|] = Σ_{i∈Z} |ψ_i| E[|Z_{t−i}|] ≤ σ Σ_{i∈Z} |ψ_i|.

4.3 Moving Average Processes

We will now construct stationary time series that have a non-zero autocovariance up to a certain lag q but have zero autocovariance at all later lags. One simple and natural way is to start with white noise Z_t (denoted Z_t ~ WN(0, σ^2)) and to construct a new sequence of random variables which depend on an overlapping subset of the Z_t.

Definition 4.4. A moving-average process of order q is defined for t ∈ Z by the equation

X_t = Z_t + θ_1 Z_{t−1} + ... + θ_q Z_{t−q} = Z_t + Σ_{i=1}^{q} θ_i Z_{t−i} = Σ_{i=0}^{q} θ_i Z_{t−i} = Θ(B) Z_t,

where {Z_t} ~ WN(0, σ^2), θ_0 = 1, θ_1, ..., θ_q are constants, and Θ(z) = 1 + Σ_{i=1}^{q} θ_i z^i.

We now check that X_t is a stationary sequence:

E[X_t] = E[Z_t] + Σ_{i=1}^{q} θ_i E[Z_{t−i}] = 0.

If h > q,

Cov(X_t, X_{t+h}) = Cov(Σ_{i=0}^{q} θ_i Z_{t−i}, Σ_{j=0}^{q} θ_j Z_{t+h−j}) = Σ_{i,j=0}^{q} θ_i θ_j Cov(Z_{t−i}, Z_{t+h−j}) = 0,

since if h > q and j ≤ q, then t + h − j > t, so that t + h − j > t − i, so that Z_{t−i} and Z_{t+h−j} are uncorrelated.

If 0 ≤ h ≤ q, the random variables X_t and X_{t+h} contain some of the same Z_i:

Cov(X_t, X_{t+h}) = Cov(θ_q Z_{t−q} + ... + θ_{q−h+1} Z_{t+h−q−1} + Σ_{i=h}^{q} θ_{q−i} Z_{t−q+i}, Σ_{i=h}^{q} θ_{q+h−i} Z_{t−q+i} + θ_{h−1} Z_{t+1} + ... + θ_0 Z_{t+h})
= σ^2 Σ_{i=h}^{q} θ_{q−i} θ_{q−i+h} = σ^2 Σ_{i=0}^{q−h} θ_{q−i−h} θ_{q−i}.

Since this covariance does not depend on t, we see that the moving-average process of order q is weakly stationary. To find the autocorrelation function, we just need to compute

E[X_t^2] = Cov(X_t, X_t) = Cov(Σ_{i=0}^{q} θ_i Z_{t−i}, Σ_{i=0}^{q} θ_i Z_{t−i}) = σ^2 Σ_{i=0}^{q} θ_i^2.

Combining all our computations above, we get

γ_X(h) = σ^2 Σ_{i=0}^{q−h} θ_{q−i−h} θ_{q−i} for 0 ≤ h ≤ q,   γ_X(h) = 0 for h > q,   (5)

and

ρ_X(h) = (Σ_{i=0}^{q−h} θ_{q−i−h} θ_{q−i}) / (Σ_{i=0}^{q} θ_i^2) for 0 ≤ h ≤ q,   ρ_X(h) = 0 for h > q.   (6)
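Formulas (5) and (6) are easy to turn into a short R function, which can then be cross-checked against R's built-in ARMAacf(). The sketch below is an illustration I have added, not part of the original notes; the sum is written with the equivalent indexing γ(h) = σ^2 Σ_{i=0}^{q−h} θ_i θ_{i+h}.

ma.acvf <- function(h, theta, sigma2 = 1) {
  th <- c(1, theta)                 # theta_0 = 1, then theta_1, ..., theta_q
  q <- length(theta)
  if (h > q) return(0)              # formula (5), second case
  sigma2 * sum(th[1:(q - h + 1)] * th[(h + 1):(q + 1)])
}
ma.acvf(0, theta = c(0.5, 0.4))     # 1 + 0.5^2 + 0.4^2 = 1.41
ma.acvf(1, theta = c(0.5, 0.4))     # 0.5 + 0.5*0.4 = 0.70
ma.acvf(3, theta = c(0.5, 0.4))     # 0, since h > q
# The ratio below is rho(1) from (6); ARMAacf() gives the same theoretical ACF:
ma.acvf(1, c(0.5, 0.4)) / ma.acvf(0, c(0.5, 0.4))
ARMAacf(ma = c(0.5, 0.4), lag.max = 3)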

Math 4506 (Spring 2015), February 11, 2015. Prof. Christian Beneš

Lecture #5: MA processes - Autocovariance; AR processes

Reference. Section 4.2 from the textbook.

5.1 ACF of MA Processes

Example 5.1. (MA(1) process) Let's examine the ACF of a MA(1) process: If X_t = Z_t + θ_1 Z_{t−1}, we have θ_0 = 1, θ_1 ≠ 0, and θ_i = 0 for all i > 1. Therefore, using (5) and (6), we get

γ_X(0) = σ^2 Σ_{i=0}^{1} θ_{1−i} θ_{1−i} = σ^2 (1 + θ_1^2),
γ_X(1) = σ^2 θ_0 θ_1 = σ^2 θ_1,
γ_X(h) = 0, h ≥ 2,

and

ρ_X(0) = 1,   ρ_X(1) = σ^2 θ_1 / (σ^2 (1 + θ_1^2)) = θ_1 / (1 + θ_1^2),   ρ_X(h) = 0, h ≥ 2.

Example 5.2. (MA(2) process) We'll now compute the ACF of a MA(2) process. Again, this is straightforward with the help of (5) and (6):

γ_X(0) = σ^2 Σ_{i=0}^{2} θ_{2−i} θ_{2−i} = σ^2 (1 + θ_1^2 + θ_2^2),
γ_X(1) = σ^2 Σ_{i=0}^{1} θ_{1−i} θ_{2−i} = σ^2 (θ_1 θ_2 + θ_1),
γ_X(2) = σ^2 θ_0 θ_2 = σ^2 θ_2,
γ_X(h) = 0, h ≥ 3.

Therefore, the ACF is

ρ_X(0) = 1,   ρ_X(1) = (θ_1 θ_2 + θ_1) / (1 + θ_1^2 + θ_2^2),   ρ_X(2) = θ_2 / (1 + θ_1^2 + θ_2^2),   ρ_X(h) = 0, h ≥ 3.

Example 5.3. Let us now simulate two MA(2) processes. First, consider the process

X_t = Z_t + Z_{t−1} − Z_{t−2}.

We can simulate it as follows:

> Z=rnorm(500)
> X=Z
> for (i in 3:500) X[i]=Z[i]+Z[i-1]-Z[i-2]
> plot(X,type="l")

[Figure: time plot of the simulated series X.]

Let's now change the signs of the coefficients in the time series above to see what the process

X_t = Z_t − Z_{t−1} + Z_{t−2}

looks like.

> Z=rnorm(500)
> X=Z
> for (i in 3:500) X[i]=Z[i]-Z[i-1]+Z[i-2]
> plot(X,type="l")

[Figure: time plot of the simulated series X.]

5.2 Autoregressive Processes

Recall the following definition:

Definition 5.1. We define the backwards shift operator B by

B X_t = X_{t−1}.

For j ≥ 2, we define B^j X_t = B B^{j−1} X_t. In other words,

B^j X_t = X_{t−j}.

Example 5.4. Recall that for n ≥ 1, we defined random walks S_n as follows: If {X_i}_{i≥1} is i.i.d. noise,

S_n = Σ_{i=1}^{n} X_i.

Another way of defining random walk is by setting S_1 = X_1 and, for n ≥ 2,

S_n = S_{n−1} + X_n,

or, with the backward shift notation,

S_n − B S_n = X_n.

We can use the factorization that we use for real numbers in this case as well, but have to be careful and realize that the symbolic factorization is for operators (in particular, 1 represents the identity operator, not the number one). This gives

(1 − B) S_n = X_n.

One natural way of introducing correlation into a time series model is by defining the time series recursively.

Definition 5.2. We define an autoregressive process of order p to be a process X satisfying, for all t ∈ Z,

X_t − φ_1 X_{t−1} − ... − φ_p X_{t−p} = Z_t   (7)

⇔ (1 − φ_1 B − φ_2 B^2 − ... − φ_p B^p) X_t = Z_t
⇔ Φ_p(B) X_t = Z_t,

where Z_t ~ WN(0, σ^2), Z_t is independent of X_s, s < t, and Φ_p(z) = 1 − Σ_{i=1}^{p} φ_i z^i.
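For comparison with the MA simulations above, here is a simulated path from a stationary AR(1) model with φ_1 = 0.7, generated with R's arima.sim(), which simulates from an ARMA model driven by Gaussian white noise. This sketch is added for illustration and is not part of the original notes.

set.seed(1)
X <- arima.sim(model = list(ar = 0.7), n = 500)   # AR(1) with phi_1 = 0.7
plot(X)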

Math 4506 (Spring 2015), February 18, 2015. Prof. Christian Beneš

Lecture #6: AR processes

Reference. Section 4.3 from the textbook.

6.1 AR processes

Definition 6.1. We define an autoregressive process of order p to be a process X satisfying, for all t ∈ Z,

X_t − φ_1 X_{t−1} − ... − φ_p X_{t−p} = Z_t   (8)

⇔ (1 − φ_1 B − φ_2 B^2 − ... − φ_p B^p) X_t = Z_t
⇔ Φ_p(B) X_t = Z_t,

where Z_t ~ WN(0, σ^2), Z_t is independent of X_s, s < t, and Φ_p(z) = 1 − Σ_{i=1}^{p} φ_i z^i.

Note that random walk S_n is defined by the equation (1 − B) S_n = X_n, so random walk is a particular case of an AR(1) process. We already saw that random walk is not stationary, so we see that there are processes satisfying the AR equation that aren't stationary. Note that this is different from MA processes, which are always stationary.

6.2 Stationarity of AR processes

It turns out that for any set of parameters {φ_i}_{1≤i≤p}, this process exists. However, it isn't always stationary. The criterion for stationarity is quite simple: An AR(p) process is stationary if and only if all roots of the characteristic equation Φ_p(z) = 0 have modulus greater than 1. In that case, the process is uniquely defined by equation (8).

In other words, if z_1, ..., z_p are the roots of the characteristic equation, we need |z_i| > 1 for all i ∈ {1, ..., p}. Note that the z_i have to be thought of as complex numbers. Let's see what might go wrong when φ = 1 by looking at simple random walk:

Example 6.1. Is the AR(3) process defined by

X_t = X_{t−2} + X_{t−3} + Z_t

stationary? We can rewrite the equation above as Φ_3(B) X_t = Z_t, where Φ_3(z) = 1 − z^2 − z^3. Therefore, we need to find the roots of the characteristic polynomial Φ_3(z) = 1 − z^2 − z^3. This is best done with the help of R. First define the vector of coefficients of the polynomial:

> a=c(1,0,-1,-1)
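The transcription of the notes stops here. A natural continuation of this R session (my guess at the next lines, not part of the original) uses polyroot(), which returns the complex roots of the polynomial whose coefficients are listed in increasing order of degree:

> polyroot(a)       % the (complex) roots of 1 - z^2 - z^3
> Mod(polyroot(a))  % their moduli, to be compared with 1

Running this shows that one of the roots (the real one, roughly 0.75) has modulus smaller than 1, so by the criterion above this AR(3) process is not stationary.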


More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

conditional cdf, conditional pdf, total probability theorem?

conditional cdf, conditional pdf, total probability theorem? 6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random

More information

STAT 248: EDA & Stationarity Handout 3

STAT 248: EDA & Stationarity Handout 3 STAT 248: EDA & Stationarity Handout 3 GSI: Gido van de Ven September 17th, 2010 1 Introduction Today s section we will deal with the following topics: the mean function, the auto- and crosscovariance

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

MATHEMATICS 154, SPRING 2009 PROBABILITY THEORY Outline #11 (Tail-Sum Theorem, Conditional distribution and expectation)

MATHEMATICS 154, SPRING 2009 PROBABILITY THEORY Outline #11 (Tail-Sum Theorem, Conditional distribution and expectation) MATHEMATICS 154, SPRING 2009 PROBABILITY THEORY Outline #11 (Tail-Sum Theorem, Conditional distribution and expectation) Last modified: March 7, 2009 Reference: PRP, Sections 3.6 and 3.7. 1. Tail-Sum Theorem

More information

Notes on Random Processes

Notes on Random Processes otes on Random Processes Brian Borchers and Rick Aster October 27, 2008 A Brief Review of Probability In this section of the course, we will work with random variables which are denoted by capital letters,

More information

Lecture 2: Review of Probability

Lecture 2: Review of Probability Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................

More information

On 1.9, you will need to use the facts that, for any x and y, sin(x+y) = sin(x) cos(y) + cos(x) sin(y). cos(x+y) = cos(x) cos(y) - sin(x) sin(y).

On 1.9, you will need to use the facts that, for any x and y, sin(x+y) = sin(x) cos(y) + cos(x) sin(y). cos(x+y) = cos(x) cos(y) - sin(x) sin(y). On 1.9, you will need to use the facts that, for any x and y, sin(x+y) = sin(x) cos(y) + cos(x) sin(y). cos(x+y) = cos(x) cos(y) - sin(x) sin(y). (sin(x)) 2 + (cos(x)) 2 = 1. 28 1 Characteristics of Time

More information

STOR 356: Summary Course Notes

STOR 356: Summary Course Notes STOR 356: Summary Course Notes Richard L. Smith Department of Statistics and Operations Research University of North Carolina Chapel Hill, NC 7599-360 rls@email.unc.edu February 19, 008 Course text: Introduction

More information

STAT 443 Final Exam Review. 1 Basic Definitions. 2 Statistical Tests. L A TEXer: W. Kong

STAT 443 Final Exam Review. 1 Basic Definitions. 2 Statistical Tests. L A TEXer: W. Kong STAT 443 Final Exam Review L A TEXer: W Kong 1 Basic Definitions Definition 11 The time series {X t } with E[X 2 t ] < is said to be weakly stationary if: 1 µ X (t) = E[X t ] is independent of t 2 γ X

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Random Variables. Cumulative Distribution Function (CDF) Amappingthattransformstheeventstotherealline.

Random Variables. Cumulative Distribution Function (CDF) Amappingthattransformstheeventstotherealline. Random Variables Amappingthattransformstheeventstotherealline. Example 1. Toss a fair coin. Define a random variable X where X is 1 if head appears and X is if tail appears. P (X =)=1/2 P (X =1)=1/2 Example

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture 3 Stationary Processes and the Ergodic LLN (Reference Section 2.2, Hayashi)

Lecture 3 Stationary Processes and the Ergodic LLN (Reference Section 2.2, Hayashi) Lecture 3 Stationary Processes and the Ergodic LLN (Reference Section 2.2, Hayashi) Our immediate goal is to formulate an LLN and a CLT which can be applied to establish sufficient conditions for the consistency

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables This Version: July 30, 2015 Multiple Random Variables 2 Now we consider models with more than one r.v. These are called multivariate models For instance: height and weight An

More information

Lecture 2: Review of Basic Probability Theory

Lecture 2: Review of Basic Probability Theory ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent

More information

Some Time-Series Models

Some Time-Series Models Some Time-Series Models Outline 1. Stochastic processes and their properties 2. Stationary processes 3. Some properties of the autocorrelation function 4. Some useful models Purely random processes, random

More information

1 Presessional Probability

1 Presessional Probability 1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional

More information

Week 12-13: Discrete Probability

Week 12-13: Discrete Probability Week 12-13: Discrete Probability November 21, 2018 1 Probability Space There are many problems about chances or possibilities, called probability in mathematics. When we roll two dice there are possible

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

Getting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1

Getting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1 1 Rows first, columns second. Remember that. R then C. 1 A matrix is a set of real or complex numbers arranged in a rectangular array. They can be any size and shape (provided they are rectangular). A

More information

STAT 520: Forecasting and Time Series. David B. Hitchcock University of South Carolina Department of Statistics

STAT 520: Forecasting and Time Series. David B. Hitchcock University of South Carolina Department of Statistics David B. University of South Carolina Department of Statistics What are Time Series Data? Time series data are collected sequentially over time. Some common examples include: 1. Meteorological data (temperatures,

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Basic Probability. Introduction

Basic Probability. Introduction Basic Probability Introduction The world is an uncertain place. Making predictions about something as seemingly mundane as tomorrow s weather, for example, is actually quite a difficult task. Even with

More information

(, ) : R n R n R. 1. It is bilinear, meaning it s linear in each argument: that is

(, ) : R n R n R. 1. It is bilinear, meaning it s linear in each argument: that is 17 Inner products Up until now, we have only examined the properties of vectors and matrices in R n. But normally, when we think of R n, we re really thinking of n-dimensional Euclidean space - that is,

More information

Probability. Paul Schrimpf. January 23, Definitions 2. 2 Properties 3

Probability. Paul Schrimpf. January 23, Definitions 2. 2 Properties 3 Probability Paul Schrimpf January 23, 2018 Contents 1 Definitions 2 2 Properties 3 3 Random variables 4 3.1 Discrete........................................... 4 3.2 Continuous.........................................

More information

Final Review: Problem Solving Strategies for Stat 430

Final Review: Problem Solving Strategies for Stat 430 Final Review: Problem Solving Strategies for Stat 430 Hyunseung Kang December 14, 011 This document covers the material from the last 1/3 of the class. It s not comprehensive nor is it complete (because

More information

Notes on Mathematics Groups

Notes on Mathematics Groups EPGY Singapore Quantum Mechanics: 2007 Notes on Mathematics Groups A group, G, is defined is a set of elements G and a binary operation on G; one of the elements of G has particularly special properties

More information

where r n = dn+1 x(t)

where r n = dn+1 x(t) Random Variables Overview Probability Random variables Transforms of pdfs Moments and cumulants Useful distributions Random vectors Linear transformations of random vectors The multivariate normal distribution

More information

Math 5a Reading Assignments for Sections

Math 5a Reading Assignments for Sections Math 5a Reading Assignments for Sections 4.1 4.5 Due Dates for Reading Assignments Note: There will be a very short online reading quiz (WebWork) on each reading assignment due one hour before class on

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get 18:2 1/24/2 TOPIC. Inequalities; measures of spread. This lecture explores the implications of Jensen s inequality for g-means in general, and for harmonic, geometric, arithmetic, and related means in

More information

LECTURES 2-3 : Stochastic Processes, Autocorrelation function. Stationarity.

LECTURES 2-3 : Stochastic Processes, Autocorrelation function. Stationarity. LECTURES 2-3 : Stochastic Processes, Autocorrelation function. Stationarity. Important points of Lecture 1: A time series {X t } is a series of observations taken sequentially over time: x t is an observation

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 18-27 Review Scott Sheffield MIT Outline Outline It s the coins, stupid Much of what we have done in this course can be motivated by the i.i.d. sequence X i where each X i is

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Time Series 2. Robert Almgren. Sept. 21, 2009

Time Series 2. Robert Almgren. Sept. 21, 2009 Time Series 2 Robert Almgren Sept. 21, 2009 This week we will talk about linear time series models: AR, MA, ARMA, ARIMA, etc. First we will talk about theory and after we will talk about fitting the models

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Lesson 4: Stationary stochastic processes

Lesson 4: Stationary stochastic processes Dipartimento di Ingegneria e Scienze dell Informazione e Matematica Università dell Aquila, umberto.triacca@univaq.it Stationary stochastic processes Stationarity is a rather intuitive concept, it means

More information

01 Probability Theory and Statistics Review

01 Probability Theory and Statistics Review NAVARCH/EECS 568, ROB 530 - Winter 2018 01 Probability Theory and Statistics Review Maani Ghaffari January 08, 2018 Last Time: Bayes Filters Given: Stream of observations z 1:t and action data u 1:t Sensor/measurement

More information

Lecture 6 Basic Probability

Lecture 6 Basic Probability Lecture 6: Basic Probability 1 of 17 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 6 Basic Probability Probability spaces A mathematical setup behind a probabilistic

More information

Random Variables and Expectations

Random Variables and Expectations Inside ECOOMICS Random Variables Introduction to Econometrics Random Variables and Expectations A random variable has an outcome that is determined by an experiment and takes on a numerical value. A procedure

More information

Math 381 Midterm Practice Problem Solutions

Math 381 Midterm Practice Problem Solutions Math 381 Midterm Practice Problem Solutions Notes: -Many of the exercises below are adapted from Operations Research: Applications and Algorithms by Winston. -I have included a list of topics covered on

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

Lecture Note 1: Probability Theory and Statistics

Lecture Note 1: Probability Theory and Statistics Univ. of Michigan - NAME 568/EECS 568/ROB 530 Winter 2018 Lecture Note 1: Probability Theory and Statistics Lecturer: Maani Ghaffari Jadidi Date: April 6, 2018 For this and all future notes, if you would

More information

A Probability Review

A Probability Review A Probability Review Outline: A probability review Shorthand notation: RV stands for random variable EE 527, Detection and Estimation Theory, # 0b 1 A Probability Review Reading: Go over handouts 2 5 in

More information

Joint Distribution of Two or More Random Variables

Joint Distribution of Two or More Random Variables Joint Distribution of Two or More Random Variables Sometimes more than one measurement in the form of random variable is taken on each member of the sample space. In cases like this there will be a few

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables

Chapter 2. Some Basic Probability Concepts. 2.1 Experiments, Outcomes and Random Variables Chapter 2 Some Basic Probability Concepts 2.1 Experiments, Outcomes and Random Variables A random variable is a variable whose value is unknown until it is observed. The value of a random variable results

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

1 Basic continuous random variable problems

1 Basic continuous random variable problems Name M362K Final Here are problems concerning material from Chapters 5 and 6. To review the other chapters, look over previous practice sheets for the two exams, previous quizzes, previous homeworks and

More information

Lecture 10: Powers of Matrices, Difference Equations

Lecture 10: Powers of Matrices, Difference Equations Lecture 10: Powers of Matrices, Difference Equations Difference Equations A difference equation, also sometimes called a recurrence equation is an equation that defines a sequence recursively, i.e. each

More information

The Multivariate Normal Distribution 1

The Multivariate Normal Distribution 1 The Multivariate Normal Distribution 1 STA 302 Fall 2017 1 See last slide for copyright information. 1 / 40 Overview 1 Moment-generating Functions 2 Definition 3 Properties 4 χ 2 and t distributions 2

More information

Probability on a Riemannian Manifold

Probability on a Riemannian Manifold Probability on a Riemannian Manifold Jennifer Pajda-De La O December 2, 2015 1 Introduction We discuss how we can construct probability theory on a Riemannian manifold. We make comparisons to this and

More information

Jointly Distributed Random Variables

Jointly Distributed Random Variables Jointly Distributed Random Variables CE 311S What if there is more than one random variable we are interested in? How should you invest the extra money from your summer internship? To simplify matters,

More information

STAT 414: Introduction to Probability Theory

STAT 414: Introduction to Probability Theory STAT 414: Introduction to Probability Theory Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical Exercises

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 Random Variables and Expectation Question: The homeworks of 20 students are collected in, randomly shuffled and returned to the students.

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

Math 300: Final Exam Practice Solutions

Math 300: Final Exam Practice Solutions Math 300: Final Exam Practice Solutions 1 Let A be the set of all real numbers which are zeros of polynomials with integer coefficients: A := {α R there exists p(x) = a n x n + + a 1 x + a 0 with all a

More information

Brief Review of Probability

Brief Review of Probability Brief Review of Probability Nuno Vasconcelos (Ken Kreutz-Delgado) ECE Department, UCSD Probability Probability theory is a mathematical language to deal with processes or experiments that are non-deterministic

More information

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets

More information

Notes for Math 324, Part 19

Notes for Math 324, Part 19 48 Notes for Math 324, Part 9 Chapter 9 Multivariate distributions, covariance Often, we need to consider several random variables at the same time. We have a sample space S and r.v. s X, Y,..., which

More information