STAT 443 (Winter 2014) Forecasting


STAT 443 (Winter 2014) Forecasting
Prof. R. Ramezan, University of Waterloo
LaTeXer: W. Kong
Last Revision: September 3, 2014

Table of Contents

1 Introduction
2 Time Series Models
  2.1 Zero Mean Models
  2.2 Models with Trend
  2.3 Models with Seasonal Component
3 Stationary Models
  3.1 Moving Averages
  3.2 The Sample Autocorrelation Function
  3.3 Linear Regression
4 Prediction
  4.1 Model Selection
  4.2 Interpolation vs Extrapolation
  4.3 Tests on Residuals
5 Smoothing Methods
  5.1 Trend Estimation
  5.2 Estimating Seasonality
  5.3 Modeling Residuals
6 Stationary and Linear Processes
  6.1 Linear Prediction
  6.2 Linear Processes
  6.3 Box-Jenkins Models
  6.4 Invertibility
  6.5 Partial Autocorrelation Function (PACF)
  6.6 ARIMA(p, d, q) Processes
  6.7 SARIMA(p, d, q) × (P, D, Q)_S Processes
7 Box-Jenkins Methodology
8 Parameter Estimation in ARMA Processes
  8.1 Yule-Walker Methods
9 Likelihood Models
10 Forecasting
  10.1 Forecasting AR(p)
  10.2 Forecasting MA(q)
  10.3 Forecasting ARMA(p, q) Processes

These notes are currently a work in progress, and as such may be incomplete or contain errors.

ACKNOWLEDGMENTS: Special thanks to Michael Baker and his LaTeX-formatted notes. They were the inspiration for the structure of these notes.

Abstract. The purpose of these notes is to provide the reader with a secondary reference to the material covered in STAT 443. The formal prerequisite to this course is STAT 331, but this author believes that the overlap between the two courses is less than 10%. Readers should have a good background in linear algebra, basic statistics, and calculus before enrolling in this course.

1 Introduction

A time series is a sub-class of stochastic processes which are indexed by time and can be represented by {X_t : t ∈ T}. Let T be an index set. The sequence of random variables {X_t : t ∈ T} is a stochastic process if X_t is a random variable for all t ∈ T. If T is a set of time points, then {X_t} is a time series. In this course, we will assume that T indexes time points. If T is a discrete (continuous) set, then the time series {X_t} is said to be a discrete (continuous) time series. The main focus of this course is to develop models for discrete time series.

Example 1.1. When we write x_5 = 10, we mean that the value of x at time 5 is equal to 10.

2 Time Series Models

Our interest lies in modeling and analyzing data collected over time (time series). Ideally, given a discrete stochastic process {X_t}, we want the joint distribution of X_1, X_2, ..., X_n for all n ∈ N. In real-world applications this is generally not possible because we do not have enough information to fully specify the joint distribution. The good news is that in most cases the information about the joint distribution is captured by the first two moments and the covariances between pairs of random variables. In other words, E(X_t), E(X_t^2) and E(X_t X_{t'}) for any t, t' ∈ N summarize most of the information content of the process. If the joint distribution is multivariate normal, then the three expectations above fully specify the joint distribution. Recall that the multivariate normal distribution is written as N_p(µ, Σ), where p is the dimension, µ is the mean vector, and Σ is the p × p variance-covariance matrix. It is easy to see that µ and Σ are parametrized by the three expectations above, which we now call (*). Since (*) contains a fair amount of information, instead of working with the joint distribution we will work with time series models which employ (*).

Definition 2.1. A time series model for observed data {x_t} is a specification of the joint distributions (or possibly only of (*)) of a sequence of random variables {X_t} of which {x_t} is postulated to be a realization.

2.1 Zero Mean Models

(1) iid noise: If {X_1, ..., X_k} are iid random variables, then

P(X_1 ≤ x_1, ..., X_k ≤ x_k) = ∏_{n=1}^{k} P(X_n ≤ x_n) = ∏_{n=1}^{k} P(X_1 ≤ x_n),

so the joint is defined by one marginal distribution with zero mean. Observe that, using the independence assumption,

P(X_{n+h} ≤ x | X_1 ≤ x_1, ..., X_n ≤ x_n) = P(X_{n+h} ≤ x).

(2) Random walk: {S_t, t ∈ N}, starting at S_0 = 0, is a random walk if S_t = ∑_{k=1}^{t} X_k, where the X_k are white noise random variables.

(3) White noise (zero mean): A white noise process is a sequence of uncorrelated random variables {X_t}, each with constant mean 0 and constant variance σ^2. We denote this by {X_t} ∼ WN(0, σ^2).
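The following is a minimal R sketch, not part of the original notes, that simulates the three zero-mean models above (iid noise, which is also a white noise, and the random walk built from it) so that their sample paths can be compared.

set.seed(443)
n <- 200
Z <- rnorm(n)      # iid N(0,1) noise, which is also a white noise WN(0,1)
S <- cumsum(Z)     # random walk S_t = X_1 + ... + X_t, with S_0 = 0

op <- par(mfrow = c(2, 1))
plot.ts(Z, main = "iid / white noise: constant mean and variance")
plot.ts(S, main = "random walk: variance grows with t, not stationary")
par(op)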

2.2 Models with Trend

Consider the model X_t = m_t + Y_t, where m_t is a slowly changing function called the trend and Y_t has zero mean. We have E(X_t) = m_t for all t. Notice that m_t is a non-random function of time t. This trend component can be linear, quadratic, or any kind of arbitrary function.

Example 2.1. Consider X_t = m_t + Y_t where m_t = 2 + t and Y_t ∼ N(0, 1). This is a linear trend with a perturbation modeled as a standard normal random variable.

2.3 Models with Seasonal Component

In a similar setup to the previous case (models with trend) we can write X_t = s_t + Y_t, where E(Y_t) = 0 for all t and s_t is a periodic function (the seasonal component) with period d (s_t = s_{t+d} for all t). In a sense, s_t is a particular kind of trend. An example would be the seasonal component s_t = α_0 + α_1 cos(α_2 t). We can also use indicator functions (on certain months) for the seasonal component. In both models X_t = m_t + Y_t and X_t = s_t + Y_t, the parameters α_0, α_1, α_2, ... are usually estimated by maximum likelihood or least squares methods.

Notice that in the general case we can use the model X_t = m_t + s_t + Y_t, which contains both the seasonal and the trend component. This is called the classical decomposition and will be frequently referred to in this course. We can use regression models to estimate m_t and s_t.

Example 2.2. Consider the average seasonal temperature over many years, where Z_1 = Spring of 2004, Z_2 = Summer of 2004, ..., up to Z_20. Suppose that we want to fit a model of the form X_t = m_t + s_t + Y_t where m_t is polynomial in t. We use the dummy coding contrasts matrix in the form [I 0]^T, in the order Spring, Summer, Fall, Winter. Suppose that X_1, X_2, X_3, X_4 represent the categories of the seasons. Then we use the general model

Z_t = ∑_{i=0}^{p} β_i t^i  (the trend m_t)  +  ∑_{j=1}^{3} α_j X_j  (the seasonal component s_t)  +  Y_t  (the error).

We should use the rule that if a periodic trend with period d is being modeled through regression analysis, d − 1 binary variates should be introduced into the model.

Now consider Example 2 (generated from R code on the UW Learn page). In this example, we fitted the model

ln(Y_t) = ∑_{i=1}^{3} β_i t^i + ∑_{j=3}^{11} β_j x_{j−2} + R_t,

with R_t being the random component, which is iid N(0, σ^2). Although the model is good in terms of fit, it does not satisfy the fundamental assumption of independent residuals. Therefore, if interest lies in forecasting, this model fails. To be able to check the independence of residuals, as well as to introduce a new class of time series models, the concept of stationarity should be introduced.

3 Stationary Models

Definition 3.1. The time series {X_t : t ∈ T} is called strictly (strongly) stationary if the joint distribution of (X_{t_1}, X_{t_2}, ..., X_{t_n}) is the same as that of (X_{t_1−k}, X_{t_2−k}, ..., X_{t_n−k}) for all n, t_1, ..., t_n, k ∈ N. In other words, {X_t} is strictly stationary if all of its statistical properties remain the same under time shifts.

In practice, strict stationarity is too limiting an assumption and rarely holds true. We mentioned earlier that a lot of the information about the joint distributions is provided by the moments E[X_t], E[X_t^2] and E[X_t X_{t'}] for all t, t'. This motivates introducing a type of stationarity based on these lower-order moments, which we will call weak stationarity. To introduce weak stationarity, we need some more definitions first.

Definition 3.2. Let {X_t} be a time series with E[X_t^2] < ∞. The mean function of {X_t} is µ_X(t) = µ_t = E[X_t], and the covariance function of {X_t} is γ_X(r, s) = Cov(X_r, X_s) = E[(X_r − µ_X(r))(X_s − µ_X(s))].

Definition 3.3. The time series {X_t} with E[X_t^2] < ∞ is said to be weakly stationary if:

1. µ_X(t) = E[X_t] is independent of t.
2. γ_X(t, t + h) = Cov(X_t, X_{t+h}) is independent of t for all h; the covariance depends only on the lag h and not on t.

3. E[X_t^2] < ∞ is also one of the conditions for weak stationarity.

In view of the second condition above, when we use the term covariance function with reference to a stationary time series {X_t}, we shall mean the function of one variable defined by γ_X(h) := γ_X(h, 0) = γ_X(t + h, t) = γ_X(t, t + h).

Exercise 3.1. If E(X_t^2) < ∞, show that strict stationarity implies weak stationarity.

Definition 3.4. Let {X_t} be a stationary time series. The autocovariance function (ACVF) of {X_t} at lag h is γ_X(h) = Cov(X_{t+h}, X_t). The autocorrelation function (ACF) of {X_t} at lag h is ρ_X(h) = γ_X(h)/γ_X(0) = Corr(X_{t+h}, X_t).

Example 3.1. Investigate the stationarity of white noise. Let {X_t} ∼ WN(0, σ^2). Automatically σ^2 < ∞ so Var(X_t) < ∞, E[X_t] = 0 does not depend on t, and Cov(X_t, X_{t+h}) = σ^2 δ_{h0}, where δ is the Kronecker delta (the covariance is σ^2 at lag 0 and 0 otherwise). Hence, white noise is weakly stationary.

Example 3.2. A random walk {S_t} is not stationary because Var(S_t) = tσ^2.

Notation 1. Whenever we refer to a stationary time series from now on (since Jan 16, 2014), we mean weakly stationary unless otherwise specified.

3.1 Moving Averages

Example 3.3. Consider the process X_t = Z_t + θZ_{t−1}, where t ∈ Z and {Z_t} ∼ WN(0, σ^2). This process is called the first-order moving average [MA(1)]. Show that {X_t} is stationary. It can be shown that Var(X_t) = σ^2(1 + θ^2) < ∞, E(X_t) = 0, and

Cov(X_t, X_{t+h}) = σ^2(1 + θ^2) if h = 0;  θσ^2 if |h| = 1;  0 if |h| > 1,

and hence, because all of these are independent of t and the variance is finite, X_t is stationary. We can also derive the ACF as

ρ(h) = γ(h)/γ(0) = 1 if h = 0;  θ/(1 + θ^2) if |h| = 1;  0 if |h| > 1.

Note that this illustrates that γ(h) is an even function.

Example 3.4. Let {X_t} be a stationary time series satisfying the equations X_t = φX_{t−1} + Z_t for t ∈ Z, where |φ| < 1 and {Z_t} ∼ WN(0, σ^2). Also let Z_t and X_s be uncorrelated for each s < t. Then the time series {X_t} is called an autoregressive process of order 1 [AR(1)]. Note that because {X_t} is stationary, we have E[X_t] = µ = φµ, so µ = 0 for any t. We also have γ(0) = φ^2 γ(0) + σ^2, so γ(0) = σ^2/(1 − φ^2). Now if h > 0, multiply both sides of the defining equation for X_t by X_{t−h} and take expectations to get

E[X_t X_{t−h}] = φ E[X_{t−1} X_{t−h}] + E[Z_t X_{t−h}],  i.e.  γ(h) = φγ(h − 1).

Repeating the same trick for h < 0 gives γ(h) = φ^{|h|} γ(0) = φ^{|h|} σ^2/(1 − φ^2), and hence the ACF is ρ(h) = φ^{|h|}.
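A short, hedged R illustration of Examples 3.3 and 3.4: the theoretical ACFs ρ(1) = θ/(1 + θ^2) for MA(1) and ρ(h) = φ^|h| for AR(1), compared with sample ACFs of simulated series. The parameter values θ = 0.6 and φ = 0.7 are chosen here purely for illustration.

set.seed(1)
theta <- 0.6; phi <- 0.7; n <- 500

x_ma <- arima.sim(model = list(ma = theta), n = n)   # MA(1)
x_ar <- arima.sim(model = list(ar = phi),  n = n)    # AR(1)

# theoretical ACFs up to lag 10
ARMAacf(ma = theta, lag.max = 10)   # lag-1 value is theta/(1+theta^2), 0 after
ARMAacf(ar = phi,  lag.max = 10)    # values are phi^h

# sample ACFs; bars outside +/- 1.96/sqrt(n) are "significant"
op <- par(mfrow = c(1, 2))
acf(x_ma, main = "MA(1): cuts off after lag 1")
acf(x_ar, main = "AR(1): decays geometrically")
par(op)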

3.2 The Sample Autocorrelation Function

What we have seen so far on the ACF is based on given models (theoretical). In practice, based on the observed data {x_1, x_2, ..., x_n}, we use the sample ACF to assess the degree of dependence in the data. The sample ACF is the estimate of the theoretical ACF (under stationarity).

Definition 3.5. Let x_1, ..., x_n be observations of a time series. The sample mean of x_1, ..., x_n is x̄ = (1/n) ∑_{i=1}^{n} x_i. The sample autocovariance function is

γ̂(h) = (1/n) ∑_{t=1}^{n−|h|} (x_{t+|h|} − x̄)(x_t − x̄),  −n < h < n.

The sample autocorrelation function is

ρ̂(h) = γ̂(h)/γ̂(0),  −n < h < n.

The convention that we will use is the hat notation for estimates and the tilde notation (e.g. γ̃(h)) for estimators. Variables without a hat or a tilde are theoretical fixed values. In summary:

γ(h): theoretical; fixed but unknown.
γ̃(h): the estimator; a random variable.
γ̂(h): the realization of γ̃(h) based on a sample.

The sample ACF measures the correlation in the data (under stationarity). Therefore, it can be used to assess the uncorrelatedness of the residuals of a regression model. Note that if we have Gaussian residuals, independent ⟺ not correlated. It can also be shown that for iid noise with finite variance,

ρ̃(h) ≈ N(0, 1/n)

for large values of n, where n is the sample size. Therefore, for data from such processes (iid noise) we expect about 95% of the sample ACFs to fall between ±1.96/√n. That is,

P(−1.96/√n < ρ̃(h) < 1.96/√n) ≈ 0.95.

Based on the trends in the plot of the sample ACF (ρ̂(h) vs h), we will decide on different models for the data (to be described later).

Remark 3.1. For the observed data {x_1, ..., x_n}: if the data contain a trend (non-constant mean), ρ̂(h) will exhibit a slow (roughly linear) decay as h increases; if the data contain a substantial deterministic periodic term, ρ̂(h) will exhibit similar periodic behaviour with the same period.

3.3 Linear Regression

This was just a review of STAT 331 and STAT 371 done in two lectures. Nothing to see here; carry on.

4 Prediction

Suppose that we have the model Y_i = α + βx_i + R_i, with R_i ∼ N(0, σ^2) iid. We want to predict Y_new for a new value x = x_new. Setting α' = α + βx̄, we can rewrite the model as Y_i = α' + β(x_i − x̄) + R_i, R_i ∼ N(0, σ^2) iid.

It can be shown that

α̂' ∼ G(α', σ/√n)  and  β̂ ∼ G(β, σ/√S_XX),

which are independent, and the estimator of µ(x_new) = E[Y | X = x_new] is

µ̂(x_new) = α̂' + β̂(x_new − x̄) ∼ N(α' + β(x_new − x̄), σ^2 (1/n + (x_new − x̄)^2 / S_XX)),

where

E[µ̂(x_new)] = α' + β(x_new − x̄) + E[R_new] = α' + β(x_new − x̄)  (since E[R_new] = 0),
Var[µ̂(x_new)] = Var(α̂') + (x_new − x̄)^2 Var(β̂) = σ^2 (1/n + (x_new − x̄)^2 / S_XX).

Since Y_new ∼ N(α' + β(x_new − x̄), σ^2), we have Y_new − µ̂_new ∼ N(0, σ^2 (1 + 1/n + (x_new − x̄)^2 / S_XX)) and

(Y_new − µ̂_new) / (σ̂ √(1 + 1/n + (x_new − x̄)^2 / S_XX)) ∼ t_{n−2}.

If c = F^{−1}_{t_{n−2}}(0.975), then the 95% prediction interval is

µ̂(x_new) ± c σ̂ √(1 + 1/n + (x_new − x̄)^2 / S_XX),

and the prediction interval for multiple linear regression is

µ̂(x_new) ± c σ̂ √(1 + x_new^T (X^T X)^{−1} x_new),

where c is from a t_{n−p−1} distribution.

Note 1. In prediction, we must always consider the bias-variance tradeoff. That is, as the model becomes more flexible (less bias and more variance), the less prediction power it has because it does not capture the underlying pattern as well.

4.1 Model Selection

We just give the formulas for the various information criteria here:

AIC = −2l(θ̂) + 2N_p
AICc = AIC + 2N_p(N_p + 1)/(n − N_p − 1)
BIC = −2l(θ̂) + N_p log(n)

where l(θ̂) is the log-likelihood function and N_p is the number of parameters. Also, here is the formula for the PRESS statistic:

PRESS = ∑_{y ∈ validation set} (y − ŷ)^2.

Here are some strategies involving these statistics:

Use a stepwise strategy.
Build up by adding one variable at a time.
Build down by subtracting one variable at a time.

Use a mixed strategy.

4.2 Interpolation vs Extrapolation

If we let h_max = max(h_ii), where H = X(X^T X)^{−1} X^T, then if the point x satisfies x^T (X^T X)^{−1} x ≤ h_max, estimating y at x is an interpolation problem; otherwise it is extrapolation (cf. Montgomery and E.A. Peck).

4.3 Tests on Residuals

The Shapiro-Wilk test is as follows: H_0: Y_1, ..., Y_n come from a Gaussian distribution. Reject H_0 if the p-value of this test is small. In R, if the data are stored in the vector y, use the command shapiro.test(y).

The difference-sign test is as follows: count the number S of values such that y_i − y_{i−1} > 0. For large iid sequences, S is approximately N(µ_S, σ_S^2) with

µ_S = E[S] = (n − 1)/2,  σ_S^2 = (n + 1)/12,  W = (S − µ_S)/σ_S ≈ N(0, 1).

A large positive (negative) value of S − µ_S indicates the presence of an increasing (decreasing) trend. We reject (H_0: the data are random) if |W| > z_{1−α/2}, but this may not work for seasonal data.

The runs test is as follows: estimate the median and call it m. Let n_1 be the number of observations > m and n_2 the number of observations < m. Let R be the number of runs, i.e. of maximal blocks of consecutive observations which are all smaller (or all larger) than m. For a large number of observations from an iid sequence,

µ_R = E[R] = 1 + 2n_1 n_2/(n_1 + n_2),  σ_R^2 = (µ_R − 1)(µ_R − 2)/(n_1 + n_2 − 1),  (R − µ_R)/σ_R ≈ N(0, 1).
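A rough R sketch of the three residual checks above, applied to simulated iid data. shapiro.test is the built-in Shapiro-Wilk test; the difference-sign and runs statistics are coded directly from the formulas above, since they are not built-in R tests.

set.seed(2)
y <- rnorm(100); n <- length(y)

# Shapiro-Wilk test of Gaussianity
shapiro.test(y)

# Difference-sign test: S = #{i : y_i - y_{i-1} > 0}
S     <- sum(diff(y) > 0)
mu_S  <- (n - 1) / 2
var_S <- (n + 1) / 12
W     <- (S - mu_S) / sqrt(var_S)
2 * (1 - pnorm(abs(W)))            # approximate two-sided p-value

# Runs test about the sample median
m  <- median(y)
s  <- sign(y - m); s <- s[s != 0]  # drop ties with the median
R  <- 1 + sum(diff(s) != 0)        # number of runs above/below the median
n1 <- sum(s > 0); n2 <- sum(s < 0)
mu_R  <- 1 + 2 * n1 * n2 / (n1 + n2)
var_R <- (mu_R - 1) * (mu_R - 2) / (n1 + n2 - 1)
2 * (1 - pnorm(abs((R - mu_R) / sqrt(var_R))))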

5 Smoothing Methods

Recall the classical decomposition X_t = m_t + s_t + Y_t with period d, noise Y_t and trend m_t. For identifiability, we need ∑_{t=1}^{d} s_t = 0 and E[Y_t] = 0. Here, the assumption of additivity is strong, in the sense that it may or may not hold. Our goal is to estimate and extract m_t and s_t and hope that the random component Y_t will turn out to be a stationary time series.

5.1 Trend Estimation

Consider a non-seasonal model with a trend and a stochastic component. If E[Y_t] ≠ 0, we define m'_t = m_t + E[Y_t] and Y'_t = Y_t − E[Y_t] to create new variables which follow our classical assumptions. There are many ways to estimate the trend in this model, which we list below:

1. (Finite moving average filter) Let q be a non-negative integer and consider the two-sided moving average of the series X_t. We have

m̂_t = (1/(2q+1)) ∑_{j=−q}^{q} X_{t−j} = (1/(2q+1)) ∑_{j=−q}^{q} m_{t−j} + (1/(2q+1)) ∑_{j=−q}^{q} Y_{t−j} ≈ m_t,

since the average of the Y_{t−j} is approximately 0.

2. (Exponential smoothing) For fixed α ∈ [0, 1], define the recursion m̂_t = αX_t + (1 − α)m̂_{t−1} with initial condition m̂_1 = X_1. This gives an exponentially decreasing weighted moving average, where in the general case, for t ≥ 2,

m̂_t = ∑_{j=0}^{t−2} α(1 − α)^j X_{t−j} + (1 − α)^{t−1} X_1.

Note that a smaller α creates a smoother plot compared to a larger α.

3. (Polynomial regression) This is just developing a parametric polynomial form of m_t, namely m_t = ∑_{i=0}^{k} β_i t^i, where k is chosen arbitrarily.

4. We can also eliminate the trend through differencing, where

∇X_t = X_t − X_{t−1} = (1 − B)X_t

and ∇, B are known as the differencing and backshift operators respectively. Exponentiating these operators is equivalent to function composition. In this case, we are applying differencing to get a stationary process (by eliminating the trend).

Example 5.1. Consider X_t = α + βt + Y_t. Since E[X_t] depends on time, this series is non-stationary. Under differencing,

∇X_t = β + Y_t − Y_{t−1} = β + ∇Y_t,

where ∇Y_t is stationary. Similarly, if X_t = α + βt + γt^2 + Y_t, then ∇X_t = β + 2γt − γ + ∇Y_t is still non-stationary, but it can be shown that ∇^2 X_t IS stationary.
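A small R sketch of the trend estimation and removal methods above, on a series with a linear trend: the two-sided moving average filter, the exponential smoothing recursion, and first differencing. The simulated series and the choices q = 5, α = 0.2 are illustrative only.

set.seed(3)
t <- 1:200
x <- 2 + 0.05 * t + rnorm(200)          # X_t = m_t + Y_t with a linear trend

# (1) finite moving average filter with q = 5 (window of length 2q + 1 = 11)
q     <- 5
m_hat <- stats::filter(x, rep(1 / (2 * q + 1), 2 * q + 1), sides = 2)

# (2) exponential smoothing: m_1 = X_1, m_t = alpha*X_t + (1 - alpha)*m_{t-1}
alpha <- 0.2
m_exp <- Reduce(function(m, xt) alpha * xt + (1 - alpha) * m, x, accumulate = TRUE)

# (4) differencing: (1 - B)X_t removes the linear trend
dx <- diff(x)

plot.ts(x); lines(m_hat, col = 2); lines(m_exp, col = 4)
acf(dx)   # the differenced series should look stationary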

5.2 Estimating Seasonality

Suppose that m̂_t is a moving average filter estimate of the trend of the data. For each k = 1, ..., d, estimate w_k as

w_k = average of {x_{k+jd} − m̂_{k+jd} : q < k + jd ≤ n − q},

i.e. the average of the detrended values belonging to season k. Normalize to get

ŝ_k = w_k − (1/d) ∑_{i=1}^{d} w_i,

so that ∑_{j=1}^{d} ŝ_j = 0. Note that ŝ_k = ŝ_{k−d} for k > d.

5.3 Modeling Residuals

Holt-Winters. This generalizes exponential smoothing to the case where there is a trend and seasonality. Following Chatfield and Yar (1988), we define trend as the long-term change in the mean level per unit time. We have a local linear trend, where the mean level at time t is µ_t = L_t + T_t t, with L_t and T_t varying slowly across time; L_t is the level and T_t is the slope of the trend at time t. Holt's idea is that

Ŷ_{t+h | Y_1, ..., Y_t} = L_t + h T_t.

There are two forms of seasonality to be added to Holt's model: additive and multiplicative.

Holt-Winters (additive case). Define the level, trend, and seasonal index at time t by L_t, T_t, I_t, where the seasonal effect has period p. The update rules are

L_t = α(X_t − I_{t−p}) + (1 − α)(L_{t−1} + T_{t−1})
T_t = β(L_t − L_{t−1}) + (1 − β)T_{t−1}
I_t = γ(X_t − L_t) + (1 − γ)I_{t−p}.

The forecast h periods ahead is L_t + hT_t + I_{t−p+h}.

Holt-Winters (multiplicative case). Define the level, trend, and seasonal index at time t by L_t, T_t, I_t, where the seasonal effect has period p. The update rules are

L_t = α(X_t / I_{t−p}) + (1 − α)(L_{t−1} + T_{t−1})
T_t = β(L_t − L_{t−1}) + (1 − β)T_{t−1}
I_t = γ(X_t / L_t) + (1 − γ)I_{t−p}.

The forecast h periods ahead is (L_t + hT_t) · I_{t−p+h}.

Holt-Winters (algorithm). This is a recursive algorithm, so you will need to provide initial values for L_t, T_t, I_t at the beginning of the series. Values will also need to be provided for α, β, γ (R minimizes the squared one-step prediction error to estimate these parameters). You then need to choose between the additive and multiplicative models (in R, this is the only step that you execute).

Holt-Winters (special cases). In the case that β = γ = 0, we have no trend or seasonal updates in the H-W algorithm. Here we have L_t = αX_t + (1 − α)L_{t−1}, which is exactly (simple) exponential smoothing with parameter α. In the case that γ = 0, we have no seasonal component and there are two H-W equations, for updating L_t and T_t; we call this case double exponential smoothing.

Example 5.2 (Inventory prediction). This example has a trend but no seasonality. Use the H-W algorithm with γ = 0 (double exponential smoothing). In the case of simple exponential smoothing, m_t = αY_t + (1 − α)m_{t−1}, where the forecaster is Ŷ_{t+1} = m_t. This is why the exponential smoother looks like it is one step behind. We can rewrite this equation as

Ŷ_{t+1} = m_{t−1} + α(Y_t − m_{t−1}) = m_{t−1} + α(Y_t − Ŷ_t),

which is a trend plus a weighted forecasting error. This case is β = γ = 0 in the H-W algorithm.

Remark 5.1. Double exponential smoothing captures more of the underlying trend than simple exponential smoothing.

[ CHECK OUT THE LECTURE SLIDES FOR MORE EXAMPLES (for the midterm)! ]
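A brief, hedged example of the Holt-Winters recursions using R's built-in HoltWinters() on the built-in AirPassengers series (monthly, p = 12). As noted above, R estimates α, β, γ by minimizing the squared one-step prediction errors; choosing additive versus multiplicative seasonality is the only modeling step left to the user.

hw_add  <- HoltWinters(AirPassengers, seasonal = "additive")
hw_mult <- HoltWinters(AirPassengers, seasonal = "multiplicative")

hw_mult$alpha; hw_mult$beta; hw_mult$gamma   # estimated smoothing weights

# h-step forecasts L_t + h*T_t plus (or times) I_{t-p+h}; here 24 months ahead
predict(hw_mult, n.ahead = 24)

# special cases from the notes:
HoltWinters(AirPassengers, gamma = FALSE)                 # double exponential smoothing
HoltWinters(AirPassengers, beta = FALSE, gamma = FALSE)   # simple exponential smoothing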

6 Stationary and Linear Processes

To perform any form of forecasting, there must be an assumption that some things are the same in the future as in the past. The idea of being constant over time is central to stationary processes. Therefore, we will use stationary processes as the main framework to develop forecasting models. In this chapter/module, we will talk about moving average (MA(q)) and autoregressive (AR(p)) processes, and will look at the connection between the two. We will also develop forecasting methods within stationary processes.

Definition 6.1. A process {X_t} is called a moving average process of order q if

X_t = Z_t + θ_1 Z_{t−1} + ... + θ_q Z_{t−q},

where {Z_t} ∼ WN(0, σ^2) and θ_1, ..., θ_q are constants. Sometimes Z_t is referred to as the innovation. Notice that these innovations are uncorrelated, have constant variance and zero mean. Deriving the mean and autocovariance function of MA(q), it is easy to see that this process is stationary.

Definition 6.2. We say that a process {X_t} is q-dependent if X_t and X_s are independent whenever |t − s| > q. That is, dependence can occur only within q steps. Similarly, we say that a stationary time series is q-correlated if γ(h) = 0 whenever |h| > q.

Example 6.1. It is easy to show that the MA(q) process is q-correlated. The converse of this statement is also true:

Proposition 6.1. If {X_t} is a stationary q-correlated time series with mean 0, then it can be represented as an MA(q) process. (ON MIDTERM?)

Definition 6.3. Consider the process {X_t} defined by X_t = φX_{t−1} + Z_t for t = 0, ±1, ±2, ..., where {Z_t} ∼ WN(0, σ^2). This process is called the first-order autoregressive process, or AR(1). We can write this process as (1 − φB)X_t = Z_t. Notice that if φ = 1, then {X_t} forms a random walk, which is not stationary. Therefore, depending on the value of φ, {X_t} may or may not be stationary.

Remark 6.1. Consider the AR(1) process with the condition |φ| < 1. We have, by repeated substitution (induction),

X_t = Z_t + φZ_{t−1} + φ^2 Z_{t−2} + ... = ∑_{j=0}^{∞} φ^j Z_{t−j}.

Defining θ_j = φ^j, we have written X_t as an MA(∞) process.

Definition 6.4. A process {X_t} is called an autoregressive process of order p if

X_t = φ_1 X_{t−1} + ... + φ_p X_{t−p} + Z_t,

where {Z_t} ∼ WN(0, σ^2) and φ_1, ..., φ_p are constants.

Definition 6.5. {X_t} is called a Gaussian time series if all its joint distributions are multivariate normal. That is, for any set i_1, ..., i_m with each index in N, the random vector (X_{i_1}, ..., X_{i_m}) follows a multivariate normal distribution.

Note 2. If (X_1, X_2) ∼ MVN((µ_1, µ_2)^T, [σ_1^2, σ_12; σ_12, σ_2^2]), then

X_1 | X_2 = x_2 ∼ N(µ_1 + (σ_12/σ_2^2)(x_2 − µ_2), σ_1^2 − σ_12^2/σ_2^2) = N(µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2), σ_1^2(1 − ρ^2)),  ρ = σ_12/(σ_1 σ_2).

Example 6.2. Consider a stationary Gaussian time series {X_t}. Suppose X_n has been observed and we want to forecast X_{n+h} using m(X_n), a function of X_n. Let us measure the quality of the forecast by

MSE = E([X_{n+h} − m(X_n)]^2 | X_n).

It can be shown that the m(·) which minimizes the MSE in the general case is m(X_n) = E(X_{n+h} | X_n). In this case, stationarity implies that E[X_{n+h}] = E[X_n] = µ, Cov(X_{n+h}, X_{n+h}) = γ(0) = Var(X_n) = σ^2, and Corr(X_{n+h}, X_n) = γ(h)/γ(0) = ρ(h). If this process is a Gaussian time series, then

(X_n, X_{n+h}) ∼ MVN((µ, µ)^T, σ^2 [1, ρ(h); ρ(h), 1]),
X_{n+h} | X_n = x_n ∼ N(µ + ρ(h)(x_n − µ), σ^2(1 − ρ(h)^2)),

so that

m(X_n) = E[X_{n+h} | X_n]   (1)   = µ + ρ(h)(X_n − µ)   (2)

and

MSE = E([X_{n+h} − m(X_n)]^2 | X_n) = Var(X_{n+h} | X_n) = σ^2(1 − ρ(h)^2).

Even if normality does not hold, we can still look at the linear predictor m(X_n) = aX_n + b, where a and b are derived from

min_{a,b} E[X_{n+h} − (aX_n + b)]^2.

Equation (1) will be shown later. Equation (2) is the best form of m(x) such that the MSE is minimized under the Gaussian process assumption.

Note that there is no conditioning in the final statement because

E[(X_{n+h} − m(X_n))^2] = E{ E[(X_{n+h} − m(X_n))^2 | X_n] },

so the function that minimizes the conditional MSE for every value of X_n, namely m(X_n) = E(X_{n+h} | X_n), also minimizes the unconditional MSE E[(X_{n+h} − m(X_n))^2].

6.1 Linear Prediction

We now consider the problem of predicting X_{n+h}, h > 0, for a stationary time series with known mean µ and ACVF γ(·), based on the previous values {X_n, ..., X_1}, denoting the linear predictor of X_{n+h} by P_n X_{n+h}. We are interested in

P_n X_{n+h} = a_0 + a_1 X_n + a_2 X_{n−1} + ... + a_n X_1,

which minimizes S(a_0, ..., a_n) = E[(X_{n+h} − P_n X_{n+h})^2]. To get a_0, a_1, ..., a_n we need to solve the system ∂S/∂a_j = 0 for j = 0, 1, ..., n. Doing so, we get

a_0 = µ(1 − ∑_{i=1}^{n} a_i),  Γ_n a_n = γ_n(h),

where a_n = (a_1, a_2, ..., a_n)^T, γ_n(h) = (γ(h), γ(h+1), ..., γ(n+h−1))^T, and

Γ_n = [γ(0) γ(1) ... γ(n−1); γ(1) γ(0) ... γ(n−2); ... ; γ(n−1) γ(n−2) ... γ(0)],

which implies

P_n X_{n+h} = a_0 + ∑_{i=1}^{n} a_i X_{n−i+1} = µ(1 − ∑_{i=1}^{n} a_i) + ∑_{i=1}^{n} a_i X_{n−i+1} = µ + ∑_{i=1}^{n} a_i (X_{n−i+1} − µ).

Note 3. Here are some properties from the above:

P_n X_{n+h} is determined by µ and γ(·).
It can be shown that E[(X_{n+h} − P_n X_{n+h})^2] = γ(0) − a_n^T γ_n(h).
E(X_{n+h} − P_n X_{n+h}) = 0.
E[(X_{n+h} − P_n X_{n+h}) X_j] = 0 for j = 1, 2, ..., n.

In a more general set-up, suppose that Y and W_1, ..., W_n are any random variables with finite second moments and means µ_Y = E(Y), µ_i = E(W_i), and that Cov(Y, Y), Cov(Y, W_i), Cov(W_i, W_j) are all known for i, j = 1, ..., n. Define W = (W_n, ..., W_1)^T and µ_W = (µ_n, ..., µ_1)^T. Then (midterm content ends here)

γ = Cov(Y, W) = (Cov(Y, W_n), ..., Cov(Y, W_1))^T,
Γ = Cov(W, W) = [Cov(W_{n+1−i}, W_{n+1−j})]_{i,j=1}^{n} ∈ R^{n×n}.

Now, by the same argument used in the derivation of P_n X_{n+h}, the best linear predictor of Y in terms of {W_n, ..., W_1} is

P_W Y = P(Y | W) = µ_Y + a_n^T (W − µ_W),

where a_n is the solution of Γ a_n = γ. Also, the MSE of this predictor is E[(Y − P_W Y)^2] = Var(Y) − a_n^T γ.

Note 4. In this setting, we have the following properties. Suppose that E[U^2] < ∞, E[V^2] < ∞, Γ = Cov(W, W), and β, α_1, ..., α_n are constants. Then:

1. P_W U = E[U] + ã_n^T (W − µ_W), where Γ ã_n = Cov(U, W).
2. E[(U − P_W U) W] = 0 and E[U − P_W U] = 0.
3. E[(U − P_W U)^2] = Var(U) − ã_n^T Cov(U, W).
4. P_W[α_1 U + α_2 V + β] = α_1 P_W U + α_2 P_W V + β.
5. P_W[∑_{i=1}^{n} α_i W_i + β] = ∑_{i=1}^{n} α_i W_i + β.
6. P_W U = E[U] if Cov(U, W) = 0.

Exercise 6.1. What is the best linear predictor of X_{n+1} in an AR(p) process, based on X_1, ..., X_n for n > p?

Example 6.3. Derive the one-step prediction for the AR(1) model (here h = 1). Suppose X_t = φX_{t−1} + Z_t, where |φ| < 1 and {Z_t} ∼ WN(0, σ^2). In a previous example, we showed that γ(h) = φ^{|h|} γ(0) for h = ±1, ±2, ..., with γ(0) = σ^2/(1 − φ^2). Also, E[X_t] = µ = 0. To find the linear predictor, we need to solve Γ_n a_n = γ_n(1), or equivalently, dividing both sides by γ(0),

[1 φ ... φ^{n−1}; φ 1 ... φ^{n−2}; ... ; φ^{n−1} φ^{n−2} ... 1] (a_1, a_2, ..., a_n)^T = (φ, φ^2, ..., φ^n)^T.

An obvious solution is a_n = (φ, 0, ..., 0)^T. Note that

P_n X_{n+1} = µ + ∑_{i=1}^{n} a_i (X_{n+1−i} − µ) = ∑_{i=1}^{n} a_i X_{n+1−i} = a_1 X_n + 0 = φX_n,

and

MSE = E[(X_{n+1} − P_n X_{n+1})^2] = E[(X_{n+1} − φX_n)^2] = E[Z_{n+1}^2] = σ^2.

Alternatively, you can use the formula for the MSE to get

MSE = γ(0) − a_n^T γ_n(1) = γ(0) − φγ(1) = γ(0) − φ^2 γ(0) = γ(0)[1 − φ^2] = σ^2.

6.2 Linear Processes

We have discussed linear prediction, in which future values are predicted by linear combinations of historical values. This section focuses on a class of linear time series which provides a general framework for studying stationary processes.

Definition 6.6. The time series {X_t} is a linear process if X_t = ∑_{j=−∞}^{∞} ψ_j Z_{t−j} for all t, where {Z_t} ∼ WN(0, σ^2) and {ψ_j} is a sequence of constants such that ∑_{j=−∞}^{∞} |ψ_j| < ∞.

Example 6.4. Show that AR(1) with |φ| < 1 is a linear process. We know that X_t = φX_{t−1} + Z_t with {Z_t} ∼ WN(0, σ^2), and we showed before that X_t = ∑_{j=0}^{∞} φ^j Z_{t−j}. Since |φ| < 1, setting ψ_j = φ^j for j ≥ 0 (and ψ_j = 0 for j < 0) gives ∑_{j=−∞}^{∞} |ψ_j| < ∞, and therefore all assumptions in the definition above are satisfied. So AR(1) is a linear process.

For prediction purposes, we may not want to have dependence on the future innovations (the Z_t's). However, the general form of a linear process involves future innovations.

Definition 6.7. A linear process ∑_{j=−∞}^{∞} ψ_j Z_{t−j} is causal, or future-independent, if ψ_j = 0 for all j < 0.

Example 6.5. Both AR(1) and MA(q) are causal, since

X_t^{AR(1)} = ∑_{j=0}^{∞} φ^j Z_{t−j},  X_t^{MA(q)} = Z_t + ∑_{j=1}^{q} θ_j Z_{t−j}.

6.3 Box-Jenkins Models

The Box-Jenkins methodology uses ARMA and ARIMA models for forecasting. The class of ARMA models tries to balance goodness of fit with a limited number of parameters. Whenever the series is not stationary, ARIMA models (ARMA with differencing) are used. When a seasonal effect is present, the more general SARIMA model is used. All of these models are identified with the help of two key functions: the ACF and the PACF.

Definition 6.8. {X_t, t ∈ T} is an ARMA(p, q) process if

1) {X_t, t ∈ T} is stationary;
2) X_t − φ_1 X_{t−1} − φ_2 X_{t−2} − ... − φ_p X_{t−p} = Z_t + θ_1 Z_{t−1} + ... + θ_q Z_{t−q}, where {Z_t} ∼ WN(0, σ^2);
3) the polynomials (1 − φ_1 z − ... − φ_p z^p) and (1 + θ_1 z + ... + θ_q z^q) have no common factors/roots. (IMPORTANT FOR THE FINAL!)

We say that {X_t, t ∈ T} is an ARMA process with mean µ if {X_t − µ} is an ARMA(p, q) process. Recall the backward shift operator BX_t = X_{t−1}. With this operator, we can rewrite the ARMA process as

(1)  φ(B) X_t = θ(B) Z_t,  where φ(B) = 1 − φ_1 B − φ_2 B^2 − ... − φ_p B^p and θ(B) = 1 + θ_1 B + θ_2 B^2 + ... + θ_q B^q.

The general model above has a unique stationary solution X_t iff φ(z) ≠ 0 for all complex z ∈ C such that |z| = 1. If φ(z) ≠ 0 for all z with |z| = 1, then there exists δ > 0 such that

(2)  1/φ(z) = ∑_{j=−∞}^{∞} χ_j z^j,  1 − δ < |z| < 1 + δ,  with ∑_{j=−∞}^{∞} |χ_j| < ∞.

Under this condition, 1/φ(B) = ∑_{j=−∞}^{∞} χ_j B^j is a linear filter, and substituting (2) into (1) we get

X_t = (1/φ(B)) θ(B) Z_t.

Since 1/φ(B) is a power series in B and θ(B) is a polynomial in B, ψ(B) = (1/φ(B)) θ(B) is also a power series in B. We then have

X_t = ψ(B) Z_t = ∑_{j=−∞}^{∞} ψ_j Z_{t−j},

where ψ(B) is in general of infinite degree and ∑_{j=−∞}^{∞} |ψ_j| < ∞. (Why?)

Definition 6.9. An ARMA(p, q) process φ(B)X_t = θ(B)Z_t, where {Z_t} ∼ WN(0, σ^2), is causal if there exist constants {ψ_j} such that ∑_{j=0}^{∞} |ψ_j| < ∞ and X_t = ∑_{j=0}^{∞} ψ_j Z_{t−j} for all t. This condition is equivalent to

φ(z) = 1 − φ_1 z − φ_2 z^2 − ... − φ_p z^p ≠ 0

for all z ∈ C such that |z| ≤ 1.

Remark 6.2. If the condition above holds true, then ψ(z) = θ(z)/φ(z), and matching coefficients in θ(z) = φ(z)ψ(z), i.e.

1 + θ_1 z + ... + θ_q z^q = (1 − φ_1 z − ... − φ_p z^p)(ψ_0 + ψ_1 z + ...),

gives 1 = ψ_0, θ_1 = ψ_1 − φ_1 ψ_0, and so on.

Note 5. We try a few special cases:

1) If φ(z) = 1, then φ(B)X_t = θ(B)Z_t reduces to X_t = θ(B)Z_t = Z_t + θ_1 Z_{t−1} + ... + θ_q Z_{t−q}, which is an MA(q) process.
2) If θ(z) = 1, we have φ(B)X_t = Z_t, i.e. X_t − φ_1 X_{t−1} − ... − φ_p X_{t−p} = Z_t, which is an AR(p) process.

From here, we can see that AR(p) and MA(q) are special cases of ARMA(p, q) processes, with AR(p) = ARMA(p, 0) and MA(q) = ARMA(0, q).

6.4 Invertibility

An ARMA(p, q) process {X_t} is invertible if there exist constants {Π_j} such that ∑_{j=0}^{∞} |Π_j| < ∞ and Z_t = ∑_{j=0}^{∞} Π_j X_{t−j} for all t. Invertibility is equivalent to the condition

θ(z) = 1 + θ_1 z + ... + θ_q z^q ≠ 0

for all z ∈ C such that |z| ≤ 1. Using the same coefficient-matching method as above (now in φ(z) = θ(z)Π(z)), one can get Π_0 = 1, −φ_1 = Π_0 θ_1 + Π_1, and so on.

Example 6.6. Consider {X_t, t ∈ T} satisfying X_t − 0.5X_{t−1} = Z_t + 0.4Z_{t−1}, where {Z_t} ∼ WN(0, σ^2). Investigate the causality and invertibility of X_t. If the series is causal (invertible), then provide the causal (invertible) solution. These are called the MA(∞) and AR(∞) representations.

[Causality] We have φ(z) = 1 − 0.5z = 0 ⟹ z = 2 ⟹ |z| > 1. Since the root is outside the unit circle, X_t is causal. We then have

1 + 0.4z = (1 − 0.5z)(ψ_0 + ψ_1 z + ψ_2 z^2 + ...) ⟹ ψ_0 = 1, ψ_1 − 0.5ψ_0 = 0.4, ψ_2 − 0.5ψ_1 = 0, ...
⟹ ψ_0 = 1, ψ_1 = 0.9, ψ_2 = 0.9(0.5), ψ_3 = 0.9(0.5)^2, ...

We can see the pattern (and prove it by induction):

ψ_j = 1 if j = 0;  ψ_j = 0.9(0.5)^{j−1} if j ≥ 1  ⟹  X_t = Z_t + 0.9 ∑_{j=1}^{∞} (0.5)^{j−1} Z_{t−j}.

[Invertibility] We have θ(z) = 1 + 0.4z = 0 ⟹ z = −10/4 ⟹ |z| > 1. Since the root is outside the unit circle, X_t is invertible. We then have, as above,

1 − 0.5z = (1 + 0.4z)(Π_0 + Π_1 z + ...) ⟹ Π_0 = 1, 0.4Π_0 + Π_1 = −0.5, 0.4Π_1 + Π_2 = 0, ...
⟹ Π_0 = 1, Π_1 = −0.9, Π_2 = −0.9(−0.4), Π_3 = −0.9(−0.4)^2, ...

We can see the pattern (and prove it by induction):

Π_j = 1 if j = 0;  Π_j = −0.9(−0.4)^{j−1} if j ≥ 1  ⟹  Z_t = X_t − 0.9 ∑_{j=1}^{∞} (−0.4)^{j−1} X_{t−j}.
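A numerical check of Example 6.6, not part of the original notes: the roots of φ(z) and θ(z) are found with polyroot(), and the MA(∞) weights ψ_j are computed with ARMAtoMA(), which should match the closed form ψ_j = 0.9(0.5)^{j−1}.

phi <- 0.5; theta <- 0.4

Mod(polyroot(c(1, -phi)))    # root of 1 - 0.5z: |z| = 2   > 1  =>  causal
Mod(polyroot(c(1,  theta)))  # root of 1 + 0.4z: |z| = 2.5 > 1  =>  invertible

psi <- ARMAtoMA(ar = phi, ma = theta, lag.max = 6)   # psi_1, ..., psi_6
cbind(ARMAtoMA = psi, closed_form = 0.9 * 0.5^(0:5))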

Remark 6.3 (ACVF of ARMA processes). Consider a causal, stationary process φ(B)X_t = θ(B)Z_t with {Z_t} ∼ WN(0, σ^2). The MA(∞) representation of X_t is X_t = ∑_{j=0}^{∞} ψ_j Z_{t−j}, where E[X_t] = 0. We have

γ(h) = E[X_t X_{t+h}] − E[X_t]E[X_{t+h}] = E[(∑_{j=0}^{∞} ψ_j Z_{t−j})(∑_{j=0}^{∞} ψ_j Z_{t+h−j})].

Noticing that E[Z_t Z_s] = 0 when t ≠ s, we then have

γ(h) = σ^2 ∑_{j=0}^{∞} ψ_j ψ_{j+|h|}.

Example 6.7. Derive the ACVF for the following ARMA(1, 1) process:

X_t − φX_{t−1} = Z_t + θZ_{t−1},

where {Z_t} ∼ WN(0, σ^2) and |φ| < 1. Note that the process is causal because 1 − φz = 0 ⟹ |z| = 1/|φ| > 1. It can be shown, with methods similar to those above, that

ψ_j = 1 if j = 0;  ψ_j = φ^{j−1}(φ + θ) if j ≥ 1.

Now if h = 0, then

γ(0) = σ^2 ∑_{j=0}^{∞} ψ_j^2 = σ^2 [1 + (θ + φ)^2 ∑_{i=0}^{∞} φ^{2i}] = σ^2 [1 + (θ + φ)^2/(1 − φ^2)].

If h ≠ 0, then

γ(h) = σ^2 ∑_{j=0}^{∞} ψ_j ψ_{j+|h|} = σ^2 [ψ_0 ψ_{|h|} + ∑_{j=1}^{∞} ψ_j ψ_{j+|h|}]
     = σ^2 [φ^{|h|−1}(θ + φ) + (θ + φ)^2 φ^{|h|} ∑_{i=0}^{∞} φ^{2i}]
     = σ^2 [φ^{|h|−1}(θ + φ) + (θ + φ)^2 φ^{|h|}/(1 − φ^2)].

6.5 Partial Autocorrelation Function (PACF)

The ACF measures the correlation between X_n and X_{n+h}. This correlation can be due to a direct connection, or indirect, through the intermediate steps X_{n+1}, X_{n+2}, ..., X_{n+h−1}. The PACF looks at the correlation between X_n and X_{n+h} once the effect of the intermediate steps has been removed. We remove the effect of the intermediate steps by deriving the linear predictors P(X_n | X_{n+1}, ..., X_{n+h−1}) and P(X_{n+h} | X_{n+1}, ..., X_{n+h−1}).

Definition 6.10. The partial autocorrelation function (PACF) is denoted by α(h) and is defined as

α(0) = 1,
α(1) = Corr(X_n, X_{n+1}) = ρ(1),
α(h) = Corr[X_n − P(X_n | X_{n+1}, ..., X_{n+h−1}), X_{n+h} − P(X_{n+h} | X_{n+1}, ..., X_{n+h−1})] otherwise.

Example 6.8. Derive the PACF for an AR(1) process with |φ| < 1. We saw in Example 6.3 that P(X_{n+1} | X_n) = φX_n, where X_t = φX_{t−1} + Z_t is an AR(1) process. We then have α(0) = 1 and α(1) = ρ(1) = φ. For h = 2 we have

Corr[X_t − P(X_t | X_{t+1}), X_{t+2} − P(X_{t+2} | X_{t+1})] = Corr[X_t − P(X_t | X_{t+1}), Z_{t+2}] = 0,

since X_{t+2} − P(X_{t+2} | X_{t+1}) = X_{t+2} − φX_{t+1} = Z_{t+2}, and X_t − P(X_t | X_{t+1}) is a function of X_t and X_{t+1}, which are uncorrelated with Z_{t+2}.

Similarly, α(h) = 0 for any h > 2. This is fairly similar to the ACF of an MA(1) process, where the values cut off after lag 1. Also notice that, as with the ACF, the PACF is symmetric in h, so h < 0 is omitted from the derivations above.

Theorem 6.1. {X_t, t ∈ T} is a causal AR(p) process if and only if its PACF satisfies:

1) α(p) ≠ 0;
2) α(h) = 0 for h > p.

Furthermore, α(p) = φ_p.

This theorem shows that the PACF is a powerful tool for identifying AR(p) processes. In fact, the ACF is to MA(q) what the PACF is to AR(p) from the visual point of view. In summary:

            ACF                       PACF
MA(q)       zero after lag q          decays exponentially
AR(p)       decays exponentially      zero after lag p

In the general case of ARMA processes, the PACF is defined as α(0) = 1 and α(h) = Φ_hh for h ≥ 1, where Φ_hh is the last component of the vector Φ_h = Γ_h^{−1} γ_h, in which

Γ_h = [γ(0) γ(1) ... γ(h−1); γ(1) γ(0) ... γ(h−2); ... ; γ(h−1) γ(h−2) ... γ(0)],  γ_h = (γ(1), γ(2), ..., γ(h))^T.

Definition 6.11. Based on observations {x_1, ..., x_n} (not all equal), the sample PACF α̂(h) is given by α̂(0) = 1 and α̂(h) = Φ̂_hh for h ≥ 1, where Φ̂_hh is the last component of Φ̂_h = Γ̂_h^{−1} γ̂_h and the terms on the right are the sample estimates.

Example 6.9. Calculate α(2) for an MA(1) process X_t = Z_t + θZ_{t−1}, {Z_t} ∼ WN(0, σ^2). We have shown before that

γ(h) = (1 + θ^2)σ^2 if h = 0;  θσ^2 if |h| = 1;  0 if |h| ≥ 2.

We have Φ_h = Γ_h^{−1} γ_h, and α(h) is the last element of Φ_h:

h = 1:  Φ_11 = γ(0)^{−1} γ(1) = γ(1)/γ(0) = θ/(1 + θ^2).

h = 2:  Φ_2 = [(1 + θ^2)σ^2, θσ^2; θσ^2, (1 + θ^2)σ^2]^{−1} (θσ^2, 0)^T, whose last element, in reduced form, is

α(2) = Φ_22 = −θ^2/(1 + θ^2 + θ^4).

It can be shown, in general, that

α(h) = Φ_hh = −(−θ)^h / ∑_{i=0}^{h} θ^{2i}.
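A quick numerical check of the ACF/PACF summary table above, using the built-in ARMAacf(); the MA(1) and AR(2) parameter values below are illustrative only.

ARMAacf(ma = 0.7, lag.max = 6)                 # MA(1) ACF: zero after lag 1
ARMAacf(ma = 0.7, lag.max = 6, pacf = TRUE)    # MA(1) PACF: decays, alternating sign

ARMAacf(ar = c(0.6, 0.2), lag.max = 6)               # AR(2) ACF: geometric decay
ARMAacf(ar = c(0.6, 0.2), lag.max = 6, pacf = TRUE)  # AR(2) PACF: zero after lag 2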

6.6 ARIMA(p, d, q) Processes

Definition 6.12. Let d be a non-negative integer. {X_t, t ∈ T} is an ARIMA(p, d, q) process if Y_t = (1 − B)^d X_t is a causal ARMA(p, q) process.

The definition above means that {X_t, t ∈ T} satisfies an equation of the form

φ*(B)X_t ≡ φ(B)(1 − B)^d X_t = θ(B)Z_t,  {Z_t} ∼ WN(0, σ^2).

Note that when d ≥ 1 we have φ*(1) = 0, so X_t is not stationary unless d = 0. Therefore, {X_t} is stationary iff d = 0, in which case it reduces to an ARMA(p, q) process as in the previous section.

Recall that if {X_t} exhibits a polynomial trend of the form m(t) = α_0 + α_1 t + ... + α_d t^d, then (1 − B)^d X_t will not have that trend any more. Therefore, ARIMA models (with d ≠ 0) are appropriate when the trend in the data is well approximated by a polynomial of degree d.

Example 6.10. Consider the process X_t = 0.8X_{t−1} + 2t + Z_t, where {Z_t} ∼ WN(0, σ^2). Write this process in ARIMA(p, d, q) format. To get rid of the linear trend 2t, we perform differencing once:

(1 − B)X_t = 0.8(X_{t−1} − X_{t−2}) + 2 + Z_t − Z_{t−1}.

If Y_t = (1 − B)X_t, then the above can be written as

Y_t − 0.8Y_{t−1} = 2 + Z_t − Z_{t−1},  i.e.  (Y_t − 10) − 0.8(Y_{t−1} − 10) = Z_t − Z_{t−1}.

Therefore {Y_t − 10} is an ARMA(1, 1) process; hence Y_t is an ARMA(1, 1) process with mean 10, and X_t is an ARIMA(1, 1, 1) process.

We have seen how differencing can be used to remove a trend. Seasonality is a particular type of trend which can be removed by a particular type of differencing. This will be discussed under SARIMA (seasonal ARIMA) models.

6.7 SARIMA(p, d, q) × (P, D, Q)_S Processes

Recall the backshift operator B, where B^k X_t = X_{t−k}. Clearly (1 − B^k) and (1 − B)^k are different filters: the latter performs differencing k times, while the former differences once at lag k. In R, we write

diff(x, differences = k) for (1 − B)^k X_t,
diff(x, lag = k) for (1 − B^k) X_t.

As an example, consider a process {X_t} where t represents the month. If there exists a seasonal effect, i.e. s(t) = s(t + 12), then the effect of the seasonal trend on X_t and on X_{t−12} should be the same. That is, Y_t = X_t − X_{t−12} should not exhibit any seasonal trend. Therefore, if we apply differencing at lag 12, we can (in theory) remove the effect of the seasonal trend. Fitting an ARMA(p, q) model to the differenced series Y_t = (1 − B^S)X_t is then the same as fitting the model φ(B)(1 − B^S)X_t = θ(B)Z_t, where S represents the season. This is a special case of the SARIMA models.

Definition 6.13. If d, D are non-negative integers, then {X_t, t ∈ T} is a seasonal ARIMA(p, d, q) × (P, D, Q)_S process with period S if the differenced series

Y_t = ∇^d ∇_S^D X_t = (1 − B)^d (1 − B^S)^D X_t

is a causal ARMA process defined by

φ(B)Φ(B^S)Y_t = θ(B)Θ(B^S)Z_t,  {Z_t} ∼ WN(0, σ^2),

where

φ(z) = 1 − φ_1 z − ... − φ_p z^p,   Φ(z) = 1 − Φ_1 z − ... − Φ_P z^P,
θ(z) = 1 + θ_1 z + ... + θ_q z^q,   Θ(z) = 1 + Θ_1 z + ... + Θ_Q z^Q.

Remark 6.4. Notice that the process {X_t, t ∈ T} is causal iff φ(z) ≠ 0 and Φ(z) ≠ 0 for all z with |z| ≤ 1.

Remark 6.5. In practice D is rarely more than 1, and P and Q are typically less than 3.

Example 6.11. Write down the equation form of the ARMA(1, 1)_12 process. This is equivalent to

ARMA(1, 1)_12 = SARIMA(0, 0, 0) × (1, 0, 1)_12,

and the general form is

(1 − Φ_1 B^12) X_t = (1 + Θ_1 B^12) Z_t,  {Z_t} ∼ WN(0, σ^2).

If d ≠ 0 or D ≠ 0, then SARIMA models are not stationary. This model, ARMA(1, 1)_12, looks like an ARMA(1, 1); in fact, it is an ARMA(1, 1) sitting on the season s = 12.

Example 6.12. Derive the ACF of SARIMA(0, 0, 0) × (0, 0, 1)_12, i.e. MA(1)_12. The general form is

X_t = Z_t + Θ_1 Z_{t−12},  {Z_t} ∼ WN(0, σ^2).

Show, as an exercise, that

γ(h) = Cov(X_t, X_{t+h}) = (1 + Θ_1^2)σ^2 if h = 0;  Θ_1 σ^2 if |h| = 12;  0 otherwise,

ρ(h) = γ(h)/γ(0) = 1 if h = 0;  Θ_1/(1 + Θ_1^2) if |h| = 12;  0 otherwise.
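A hedged R sketch of the lag-k versus k-fold differencing distinction and of fitting a seasonal model with arima(). The series (the built-in AirPassengers data, logged) and the SARIMA orders are illustrative choices, not values prescribed by the notes.

x <- log(AirPassengers)

d1  <- diff(x, differences = 1)   # (1 - B) x_t
d12 <- diff(x, lag = 12)          # (1 - B^12) x_t, removes the seasonal trend

# SARIMA(0,1,1) x (0,1,1)_12, the classical "airline" model, as an illustration
fit <- arima(x, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit
predict(fit, n.ahead = 12)$pred   # forecasts for the next 12 months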

7 Box-Jenkins Methodology

To use the Box-Jenkins methodology you do the following:

1. Check for seasonal and non-seasonal trends (stationarity).
2. Use differencing to make the process stationary.
3. Identify p, q, P, Q visually from the ACF and PACF, or with formal model selection methods.
4. Forecast the future with the appropriate model.

(See slides for more info.)

8 Parameter Estimation in ARMA Processes

This section concentrates on the estimation of the parameters φ_i, i = 1, ..., p, and θ_j, j = 1, ..., q, as well as σ^2, the variance of the white noise, in the ARMA(p, q) process φ(B)X_t = θ(B)Z_t, where {Z_t} ∼ WN(0, σ^2). We assume that p and q are correctly specified. If the mean of the series is not zero, we use the model φ(B)(X_t − µ) = θ(B)Z_t, where µ = E[X_t] for all t, and estimate µ by µ̂ = X̄ = (1/n) ∑_{i=1}^{n} X_i. The common parameter estimation methods are maximum likelihood, least squares, Yule-Walker, the innovations algorithm, and the Durbin-Levinson method. We will only focus on the first two for the remainder of this course.

8.1 Yule-Walker Methods

Consider a causal AR(p) model

(1)  X_t − φ_1 X_{t−1} − ... − φ_p X_{t−p} = Z_t,

with causal solution X_t = ∑_{j=0}^{∞} ψ_j Z_{t−j}, where {Z_t} ∼ WN(0, σ^2). Multiplying both sides of (1) by X_{t−j} for j = 0, 1, 2, ..., p and taking expectations gives

E[X_t X_{t−j}] − φ_1 E[X_{t−1} X_{t−j}] − ... − φ_p E[X_{t−p} X_{t−j}] = E[Z_t X_{t−j}],
i.e.  γ(j) − φ_1 γ(j − 1) − ... − φ_p γ(j − p) = E[Z_t X_{t−j}].

We then have

E[Z_t X_{t−j}] = E[Z_t X_t] = E[Z_t ∑_{k=0}^{∞} ψ_k Z_{t−k}] = E[Z_t^2] = σ^2 if j = 0,
E[Z_t X_{t−j}] = 0 if j > 0.

So the original equations reduce to

γ(0) − φ_1 γ(1) − ... − φ_p γ(p) = σ^2  (j = 0),
γ(j) − φ_1 γ(|j − 1|) − ... − φ_p γ(|j − p|) = 0  (j ≠ 0).

These are called the Yule-Walker equations. They can easily be written in matrix form as Γ_p φ = γ_p. Based on a sample {x_1, x_2, ..., x_n}, the parameters φ and σ^2 can be estimated by

φ̂ = Γ̂_p^{−1} γ̂_p,

where the matrices are defined in a similar fashion to the best linear predictor section. This system is called the sample Yule-Walker equations. We can write the Yule-Walker equations in terms of the ACF too: explicitly, dividing γ̂_p by γ̂(0) and scaling Γ̂_p by γ̂(0),

φ̂ = R̂_p^{−1} ρ̂_p,  where R̂_p = Γ̂_p / γ̂(0),  ρ̂_p = γ̂_p / γ̂(0),  and  σ̂^2 = γ̂(0)[1 − φ̂^T ρ̂_p].

Notice that γ̂(0) is the sample variance of {x_1, ..., x_n}. Based on a sample {x_1, ..., x_n}, the above equations provide the parameter estimates. Using advanced probability theory, it can be shown that

φ̃ = (φ̃_1, ..., φ̃_p)^T ≈ MVN(φ, (σ^2/n) Γ_p^{−1})

for large n. If we replace σ^2 and Γ_p by their sample estimates σ̂^2 and Γ̂_p, we can use this result to build large-sample confidence intervals for the parameters φ_1, ..., φ_p.
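A small R sketch of Yule-Walker estimation on simulated AR(2) data: the built-in ar() with method "yule-walker", and essentially the same estimate obtained by solving the sample equations R̂_p φ = ρ̂_p directly. The true parameters (0.6, 0.3) are chosen only for this simulation.

set.seed(4)
x <- arima.sim(model = list(ar = c(0.6, 0.3)), n = 500)

fit <- ar(x, order.max = 2, aic = FALSE, method = "yule-walker")
fit$ar         # phi-hat
fit$var.pred   # sigma^2-hat

# by hand: phi-hat = R_p^{-1} rho_p
r   <- drop(acf(x, lag.max = 2, plot = FALSE)$acf)   # rho(0), rho(1), rho(2)
R_p <- matrix(c(1, r[2], r[2], 1), 2, 2)
solve(R_p, r[2:3])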

Example 8.1. Based on a given table of sample ACF and PACF values, an AR(2) model has been proposed for the data. Provide the Yule-Walker estimates of the parameters, as well as 95% confidence intervals for the components of φ. The data were collected over a window of 200 points with sample variance γ̂(0) = 3.69. We want to estimate φ_1 and φ_2 in

X_t = φ_1 X_{t−1} + φ_2 X_{t−2} + Z_t,  {Z_t} ∼ WN(0, σ^2).

Solving the sample Yule-Walker system φ̂ = R̂_2^{−1} ρ̂_2 with the given values of ρ̂(1) and ρ̂(2) yields

φ̂ = (0.594, 0.276)^T,  σ̂^2 = γ̂(0)[1 − φ̂_1 ρ̂(1) − φ̂_2 ρ̂(2)] = 1.112.

Therefore the estimated model is

X_t = 0.594 X_{t−1} + 0.276 X_{t−2} + Z_t,  {Z_t} ∼ WN(0, 1.112).

Now φ̃ ≈ N(φ, (σ^2/n) Γ_2^{−1}), and replacing the unknowns by their sample estimates, the 95% confidence intervals for φ_1 and φ_2 are

φ̂_1 ± 1.96 √(Var̂(φ̃_1)) = 0.594 ± 0.139 = (0.455, 0.733),
φ̂_2 ± 1.96 √(Var̂(φ̃_2)) = 0.276 ± 0.139 = (0.137, 0.415).

(Johnson & Wichern discuss ellipsoidal confidence regions in Applied Multivariate Statistical Analysis.)

9 Likelihood Models

To use likelihood models, we have to make some distributional assumptions. Consider {X_t, t ∈ T} to be a Gaussian process; then Z_t in φ(B)X_t = θ(B)Z_t is iid G(0, σ). Based on the observations x_1, ..., x_n at times 1, 2, ..., n, the likelihood function of the parameters φ, θ and σ^2 is

L(θ, φ, σ^2) = (2π)^{−n/2} |Γ_n|^{−1/2} exp(−(1/2) x^T Γ_n^{−1} x).

Notice that it is assumed that E[X_t] = 0 for all t. To estimate φ, θ and σ^2, we maximize the likelihood function above. Usually it is easier to maximize the log of L(θ, φ, σ^2), which is called the log-likelihood. In this likelihood function, γ(h) depends on the parameters θ, φ and σ^2. Furthermore, as the dataset gets larger (n increases), the inversion of Γ_n can be computationally challenging. Therefore, efficient computational methods are needed for likelihood estimation.

10 Forecasting

We discuss how forecasting works under our studied processes.

10.1 Forecasting AR(p)

Let X_t = ∑_{j=1}^{p} φ_j X_{t−j} + Z_t, {Z_t} ∼ WN(0, σ^2), be a causal AR(p) process. We have, for h > 0,

X̂_{n+h} = E[X_{n+h} | X_1, ..., X_n]
        = E[∑_{j=1}^{h−1} φ_j X_{n+h−j} | X_1, ..., X_n] + E[∑_{j=h}^{p} φ_j X_{n+h−j} | X_1, ..., X_n] + E[Z_{n+h} | X_1, ..., X_n],

where the last term is 0 due to the uncorrelatedness of Z_{n+h} with X_1, ..., X_n.

If h = 1, then the above equation becomes X̂_{n+1} = ∑_{j=1}^{p} φ_j X_{n+1−j}.

If h = 2, 3, ..., p, then remark that j < h ⟹ n + h − j > n and j ≥ h ⟹ n + h − j ≤ n, and so

X̂_{n+h} = ∑_{j=h}^{p} φ_j X_{n+h−j} + ∑_{j=1}^{h−1} φ_j E(X_{n+h−j} | X_1, ..., X_n) = ∑_{j=1}^{h−1} φ_j X̂_{n+h−j} + ∑_{j=h}^{p} φ_j X_{n+h−j}.

If h > p, then n + h − j > n for all j, and X̂_{n+h} = ∑_{j=1}^{p} φ_j E(X_{n+h−j} | X_1, ..., X_n) = ∑_{j=1}^{p} φ_j X̂_{n+h−j}.

In summary, for a causal AR(p), the h-step predictor is

X̂_{n+h} = ∑_{j=1}^{p} φ_j X_{n+1−j} if h = 1;
X̂_{n+h} = ∑_{j=1}^{h−1} φ_j X̂_{n+h−j} + ∑_{j=h}^{p} φ_j X_{n+h−j} if h = 2, 3, ..., p;
X̂_{n+h} = ∑_{j=1}^{p} φ_j X̂_{n+h−j} if h > p.

In AR(p), the h-step prediction is a linear combination of the previous steps. We either have the previous p values among X_1, ..., X_n, in which case we substitute the observed values (as in the h = 1 case), or we do not have all or some of them, in which case we predict recursively. Given a dataset, the φ_j can be estimated and X̂_{n+h} computed, as in the sketch below.

Example 10.1. Based on the annual sales data of a chain store, an AR(2) model with parameters φ̂_1 = 1 and φ̂_2 = −0.21 has been fitted. The total sales of the last 3 years have been 9, 11 and 10 million dollars (X_2012 = 9, X_2011 = 11, X_2010 = 10). Forecast this year's total sales (2013) as well as those of 2015. We have

X_t = X_{t−1} − 0.21 X_{t−2} + Z_t,  {Z_t} ∼ WN(0, σ^2).

Now

X̂_2013 = X_2012 − 0.21 X_2011 = 9 − 0.21(11) = 6.69,
X̂_2015 = X̂_2014 − 0.21 X̂_2013 = X̂_2014 − 0.21(6.69),

and since

X̂_2014 = X̂_2013 − 0.21 X_2012 = 6.69 − 0.21(9) = 4.8,

then

X̂_2015 = 4.8 − 0.21(6.69) ≈ 3.4.
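A minimal R sketch, not from the notes, of the recursive h-step AR(p) predictor summarized above: observed values are used while the required index stays within the data, and earlier forecasts are substituted once it moves past the data. The call at the end mirrors Example 10.1 under the reading X_2010 = 10, X_2011 = 11, X_2012 = 9.

forecast_ar <- function(x, phi, h) {
  p <- length(phi)
  z <- c(x, rep(NA_real_, h))          # extend the series with h empty slots
  n <- length(x)
  for (k in 1:h) {
    past <- z[(n + k - 1):(n + k - p)] # X_{n+k-1}, ..., X_{n+k-p}, observed or forecast
    z[n + k] <- sum(phi * past)
  }
  z[(n + 1):(n + h)]
}

# last three annual totals, oldest to newest: 10, 11, 9
forecast_ar(c(10, 11, 9), phi = c(1, -0.21), h = 3)   # approx. 6.69, 4.80, 3.40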

10.2 Forecasting MA(q)

MA processes are linear combinations of white noise. To do forecasting in MA(q), we need to estimate θ_1, ..., θ_q as well as approximate the innovations Z_t, Z_{t−1}, .... First, consider the very simple case of MA(1), where X_t = Z_t + θZ_{t−1}, {Z_t} ∼ WN(0, σ^2). We have

X̂_{n+h} = E[X_{n+h} | X_1, ..., X_n] = E[Z_{n+h} | X_1, ..., X_n] + θ E[Z_{n+h−1} | X_1, ..., X_n].

If h = 1, then the above equation is

X̂_{n+1} = E[Z_{n+1} | X_1, ..., X_n] + θ E[Z_n | X_1, ..., X_n] = 0 + θ E[Z_n | X_1, ..., X_n] = θZ_n,

and if h > 1, then both innovation indices exceed n and the equation becomes

X̂_{n+h} = E[Z_{n+h}] + θ E[Z_{n+h−1} | X_1, ..., X_n] = 0.

Now we need to plug in a value for Z_n. We approximate the Z_t's by U_t's as follows. Let U_0 = 0 and estimate recursively from the fact that Z_t = X_t − θZ_{t−1}:

Ẑ_t = U_t = X_t − θU_{t−1},  U_0 = 0,

so that

U_0 = 0,  U_1 = X_1,  U_2 = X_2 − θX_1,  U_3 = X_3 − θX_2 + θ^2 X_1,  ...

Notice that as t → ∞, the U_t will need a convergence condition, for which |θ| < 1 is sufficient; this was the invertibility condition for MA(1). We see that the U_t's are recursively calculable, and for an invertible MA(1) process we have

X̂_{n+h} = θU_n if h = 1;  0 if h > 1,  where U_t = X_t − θU_{t−1}, U_0 = 0.

Now consider an MA(q) process X_t = Z_t + θ_1 Z_{t−1} + ... + θ_q Z_{t−q}. We have

X̂_{n+h} = E[X_{n+h} | X_1, ..., X_n] = E[Z_{n+h} | X_1, ..., X_n] + θ_1 E[Z_{n+h−1} | X_1, ..., X_n] + ... + θ_q E[Z_{n+h−q} | X_1, ..., X_n].

If h > q, then the value of the above equation is zero, since n + h − q > n. If 0 < h ≤ q, then at least some of the terms above are non-zero. In particular,

X̂_{n+h} = ∑_{j=h}^{q} θ_j E[Z_{n+h−j} | X_1, ..., X_n],

and for j = h, h + 1, ..., q we take E[Z_{n+h−j} | X_1, ..., X_n] = Z_{n+h−j}, hence

X̂_{n+h} = ∑_{j=h}^{q} θ_j Z_{n+h−j}.

Similar to MA(1), we approximate the Z_t's by U_t's, provided the MA(q) process is invertible, that is, θ(z) = 1 + θ_1 z + ... + θ_q z^q ≠ 0 for all |z| ≤ 1. Therefore, assuming that U_0 = U_{−1} = U_{−2} = ... = 0, we set U_t = X_t − ∑_{j=1}^{q} θ_j U_{t−j}, and

U_1 = X_1,  U_2 = X_2 − θ_1 X_1,  U_3 = X_3 − θ_1 X_2 + (θ_1^2 − θ_2) X_1,  ...

In summary, for an invertible MA(q) process, we have

X̂_{n+h} = ∑_{j=h}^{q} θ_j U_{n+h−j} if 1 ≤ h ≤ q;  0 if h > q,

where U_i = 0 for i ≤ 0 and U_t = X_t − ∑_{j=1}^{q} θ_j U_{t−j} for t = 1, 2, 3, ....

Example 10.2. Consider the MA(1) process X_t = Z_t + 0.5Z_{t−1}, where {Z_t} ∼ WN(0, σ^2). If X_1 = 0.3, X_2 = −0.1, X_3 = 0.1, predict X_4 and X_5. Notice that X̂_5 = X̂_{3+2}, which is a 2-step prediction based on the history X_1, X_2, X_3. Since this is an MA(1) model, hence 1-correlated, X̂_5 = 0. For X_4 we have

X̂_4 = ∑_{j=1}^{1} θ_j U_{3+1−j} = θ_1 U_3 = 0.5 U_3,

where

U_0 = 0,
U_1 = X_1 − 0.5U_0 = X_1 = 0.3,
U_2 = X_2 − 0.5U_1 = −0.1 − (0.5)(0.3) = −0.25,
U_3 = X_3 − 0.5U_2 = 0.1 − (0.5)(−0.25) = 0.225,

and hence X̂_4 = 0.5(0.225) = 0.1125.
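The truncated-innovation recursion of Example 10.2, written out as a short R sketch: U_0 = 0, U_t = X_t − θU_{t−1}, and the one-step forecast X̂_{n+1} = θU_n.

theta <- 0.5
x     <- c(0.3, -0.1, 0.1)

U <- 0                                   # U_0 = 0
for (xt in x) U <- c(U, xt - theta * tail(U, 1))
U                                        # 0, 0.300, -0.250, 0.225

xhat4 <- theta * tail(U, 1)              # 0.5 * 0.225 = 0.1125
xhat5 <- 0                               # h > 1 forecasts are zero for MA(1)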

Example 10.3. Consider the MA(1) process X_t = Z_t + θZ_{t−1} with {Z_t} ∼ WN(0, σ^2) and |θ| < 1. Show that the one-step predictor X̂_{n+1} = θU_n is equal to the predictor

X̃_{n+1} = −∑_{j=1}^{n} (−θ)^j X_{n−j+1}.

This follows from the definition of U_n, for which we can write the closed form

U_n = X_n + ∑_{i=1}^{n−1} (−θ)^i X_{n−i},  n ≥ 2,

and hence

X̂_{n+1} = θU_n = θX_n + ∑_{i=1}^{n−1} θ(−θ)^i X_{n−i} = −∑_{i=0}^{n−1} (−θ)^{i+1} X_{n−i} = −∑_{j=1}^{n} (−θ)^j X_{n−j+1} = X̃_{n+1}.

Clearly for n = 0, 1 we have X̂_{n+1} = X̃_{n+1} as well. This shows that even in the MA process, the predictor may be written as a linear function of the history.

10.3 Forecasting ARMA(p, q) Processes

For the causal and invertible ARMA(p, q) process φ(B)X_t = θ(B)Z_t, where {Z_t} ∼ WN(0, σ^2), the predictors for MA(q) and AR(p) are combined. We will not go into the theory of forecasting in ARMA processes and will use R for that matter. For example,

X̂_{n+1} = ∑_{j=1}^{p} φ_j X_{n+1−j} + ∑_{i=1}^{q} θ_i U_{n+1−i},

subject to some conditions.
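As the notes suggest, ARMA forecasting is done in R in practice. The following hedged sketch fits an ARMA(1, 1) by maximum likelihood to simulated data with arima() and produces h-step forecasts with prediction standard errors; the simulation parameters are illustrative only.

set.seed(5)
x   <- arima.sim(model = list(ar = 0.5, ma = 0.4), n = 300)
fit <- arima(x, order = c(1, 0, 1), include.mean = FALSE)
fit$coef                     # phi-hat and theta-hat

fc <- predict(fit, n.ahead = 5)
fc$pred                      # point forecasts X-hat_{n+h}
fc$se                        # forecast standard errors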


More information

Univariate Time Series Analysis; ARIMA Models

Univariate Time Series Analysis; ARIMA Models Econometrics 2 Fall 24 Univariate Time Series Analysis; ARIMA Models Heino Bohn Nielsen of4 Outline of the Lecture () Introduction to univariate time series analysis. (2) Stationarity. (3) Characterizing

More information

Ch 4. Models For Stationary Time Series. Time Series Analysis

Ch 4. Models For Stationary Time Series. Time Series Analysis This chapter discusses the basic concept of a broad class of stationary parametric time series models the autoregressive moving average (ARMA) models. Let {Y t } denote the observed time series, and {e

More information

STAD57 Time Series Analysis. Lecture 8

STAD57 Time Series Analysis. Lecture 8 STAD57 Time Series Analysis Lecture 8 1 ARMA Model Will be using ARMA models to describe times series dynamics: ( B) X ( B) W X X X W W W t 1 t1 p t p t 1 t1 q tq Model must be causal (i.e. stationary)

More information

TIME SERIES AND FORECASTING. Luca Gambetti UAB, Barcelona GSE Master in Macroeconomic Policy and Financial Markets

TIME SERIES AND FORECASTING. Luca Gambetti UAB, Barcelona GSE Master in Macroeconomic Policy and Financial Markets TIME SERIES AND FORECASTING Luca Gambetti UAB, Barcelona GSE 2014-2015 Master in Macroeconomic Policy and Financial Markets 1 Contacts Prof.: Luca Gambetti Office: B3-1130 Edifici B Office hours: email:

More information

Lecture 2: Univariate Time Series

Lecture 2: Univariate Time Series Lecture 2: Univariate Time Series Analysis: Conditional and Unconditional Densities, Stationarity, ARMA Processes Prof. Massimo Guidolin 20192 Financial Econometrics Spring/Winter 2017 Overview Motivation:

More information

Ch 6. Model Specification. Time Series Analysis

Ch 6. Model Specification. Time Series Analysis We start to build ARIMA(p,d,q) models. The subjects include: 1 how to determine p, d, q for a given series (Chapter 6); 2 how to estimate the parameters (φ s and θ s) of a specific ARIMA(p,d,q) model (Chapter

More information

Time Series 2. Robert Almgren. Sept. 21, 2009

Time Series 2. Robert Almgren. Sept. 21, 2009 Time Series 2 Robert Almgren Sept. 21, 2009 This week we will talk about linear time series models: AR, MA, ARMA, ARIMA, etc. First we will talk about theory and after we will talk about fitting the models

More information

STAT 248: EDA & Stationarity Handout 3

STAT 248: EDA & Stationarity Handout 3 STAT 248: EDA & Stationarity Handout 3 GSI: Gido van de Ven September 17th, 2010 1 Introduction Today s section we will deal with the following topics: the mean function, the auto- and crosscovariance

More information

Nonlinear time series

Nonlinear time series Based on the book by Fan/Yao: Nonlinear Time Series Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 27, 2009 Outline Characteristics of

More information

EASTERN MEDITERRANEAN UNIVERSITY ECON 604, FALL 2007 DEPARTMENT OF ECONOMICS MEHMET BALCILAR ARIMA MODELS: IDENTIFICATION

EASTERN MEDITERRANEAN UNIVERSITY ECON 604, FALL 2007 DEPARTMENT OF ECONOMICS MEHMET BALCILAR ARIMA MODELS: IDENTIFICATION ARIMA MODELS: IDENTIFICATION A. Autocorrelations and Partial Autocorrelations 1. Summary of What We Know So Far: a) Series y t is to be modeled by Box-Jenkins methods. The first step was to convert y t

More information

Discrete time processes

Discrete time processes Discrete time processes Predictions are difficult. Especially about the future Mark Twain. Florian Herzog 2013 Modeling observed data When we model observed (realized) data, we encounter usually the following

More information

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis Introduction to Time Series Analysis 1 Contents: I. Basics of Time Series Analysis... 4 I.1 Stationarity... 5 I.2 Autocorrelation Function... 9 I.3 Partial Autocorrelation Function (PACF)... 14 I.4 Transformation

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Stat 5100 Handout #12.e Notes: ARIMA Models (Unit 7) Key here: after stationary, identify dependence structure (and use for forecasting)

Stat 5100 Handout #12.e Notes: ARIMA Models (Unit 7) Key here: after stationary, identify dependence structure (and use for forecasting) Stat 5100 Handout #12.e Notes: ARIMA Models (Unit 7) Key here: after stationary, identify dependence structure (and use for forecasting) (overshort example) White noise H 0 : Let Z t be the stationary

More information

Econ 623 Econometrics II Topic 2: Stationary Time Series

Econ 623 Econometrics II Topic 2: Stationary Time Series 1 Introduction Econ 623 Econometrics II Topic 2: Stationary Time Series In the regression model we can model the error term as an autoregression AR(1) process. That is, we can use the past value of the

More information

STA 6857 Autocorrelation and Cross-Correlation & Stationary Time Series ( 1.4, 1.5)

STA 6857 Autocorrelation and Cross-Correlation & Stationary Time Series ( 1.4, 1.5) STA 6857 Autocorrelation and Cross-Correlation & Stationary Time Series ( 1.4, 1.5) Outline 1 Announcements 2 Autocorrelation and Cross-Correlation 3 Stationary Time Series 4 Homework 1c Arthur Berg STA

More information

ITSM-R Reference Manual

ITSM-R Reference Manual ITSM-R Reference Manual George Weigt February 11, 2018 1 Contents 1 Introduction 3 1.1 Time series analysis in a nutshell............................... 3 1.2 White Noise Variance.....................................

More information

MAT3379 (Winter 2016)

MAT3379 (Winter 2016) MAT3379 (Winter 2016) Assignment 4 - SOLUTIONS The following questions will be marked: 1a), 2, 4, 6, 7a Total number of points for Assignment 4: 20 Q1. (Theoretical Question, 2 points). Yule-Walker estimation

More information

Lecture 1: Fundamental concepts in Time Series Analysis (part 2)

Lecture 1: Fundamental concepts in Time Series Analysis (part 2) Lecture 1: Fundamental concepts in Time Series Analysis (part 2) Florian Pelgrin University of Lausanne, École des HEC Department of mathematics (IMEA-Nice) Sept. 2011 - Jan. 2012 Florian Pelgrin (HEC)

More information

Introduction to Stochastic processes

Introduction to Stochastic processes Università di Pavia Introduction to Stochastic processes Eduardo Rossi Stochastic Process Stochastic Process: A stochastic process is an ordered sequence of random variables defined on a probability space

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

NANYANG TECHNOLOGICAL UNIVERSITY SEMESTER II EXAMINATION MAS451/MTH451 Time Series Analysis TIME ALLOWED: 2 HOURS

NANYANG TECHNOLOGICAL UNIVERSITY SEMESTER II EXAMINATION MAS451/MTH451 Time Series Analysis TIME ALLOWED: 2 HOURS NANYANG TECHNOLOGICAL UNIVERSITY SEMESTER II EXAMINATION 2012-2013 MAS451/MTH451 Time Series Analysis May 2013 TIME ALLOWED: 2 HOURS INSTRUCTIONS TO CANDIDATES 1. This examination paper contains FOUR (4)

More information

LECTURES 2-3 : Stochastic Processes, Autocorrelation function. Stationarity.

LECTURES 2-3 : Stochastic Processes, Autocorrelation function. Stationarity. LECTURES 2-3 : Stochastic Processes, Autocorrelation function. Stationarity. Important points of Lecture 1: A time series {X t } is a series of observations taken sequentially over time: x t is an observation

More information

7. Forecasting with ARIMA models

7. Forecasting with ARIMA models 7. Forecasting with ARIMA models 309 Outline: Introduction The prediction equation of an ARIMA model Interpreting the predictions Variance of the predictions Forecast updating Measuring predictability

More information

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010

Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Problem Set 2 Solution Sketches Time Series Analysis Spring 2010 Forecasting 1. Let X and Y be two random variables such that E(X 2 ) < and E(Y 2 )

More information

Introduction to ARMA and GARCH processes

Introduction to ARMA and GARCH processes Introduction to ARMA and GARCH processes Fulvio Corsi SNS Pisa 3 March 2010 Fulvio Corsi Introduction to ARMA () and GARCH processes SNS Pisa 3 March 2010 1 / 24 Stationarity Strict stationarity: (X 1,

More information

Module 3. Descriptive Time Series Statistics and Introduction to Time Series Models

Module 3. Descriptive Time Series Statistics and Introduction to Time Series Models Module 3 Descriptive Time Series Statistics and Introduction to Time Series Models Class notes for Statistics 451: Applied Time Series Iowa State University Copyright 2015 W Q Meeker November 11, 2015

More information

Time Series Solutions HT 2009

Time Series Solutions HT 2009 Time Series Solutions HT 2009 1. Let {X t } be the ARMA(1, 1) process, X t φx t 1 = ɛ t + θɛ t 1, {ɛ t } WN(0, σ 2 ), where φ < 1 and θ < 1. Show that the autocorrelation function of {X t } is given by

More information

Forecasting using R. Rob J Hyndman. 2.4 Non-seasonal ARIMA models. Forecasting using R 1

Forecasting using R. Rob J Hyndman. 2.4 Non-seasonal ARIMA models. Forecasting using R 1 Forecasting using R Rob J Hyndman 2.4 Non-seasonal ARIMA models Forecasting using R 1 Outline 1 Autoregressive models 2 Moving average models 3 Non-seasonal ARIMA models 4 Partial autocorrelations 5 Estimation

More information

Ch 5. Models for Nonstationary Time Series. Time Series Analysis

Ch 5. Models for Nonstationary Time Series. Time Series Analysis We have studied some deterministic and some stationary trend models. However, many time series data cannot be modeled in either way. Ex. The data set oil.price displays an increasing variation from the

More information

Generalised AR and MA Models and Applications

Generalised AR and MA Models and Applications Chapter 3 Generalised AR and MA Models and Applications 3.1 Generalised Autoregressive Processes Consider an AR1) process given by 1 αb)x t = Z t ; α < 1. In this case, the acf is, ρ k = α k for k 0 and

More information

Classic Time Series Analysis

Classic Time Series Analysis Classic Time Series Analysis Concepts and Definitions Let Y be a random number with PDF f Y t ~f,t Define t =E[Y t ] m(t) is known as the trend Define the autocovariance t, s =COV [Y t,y s ] =E[ Y t t

More information

Ross Bettinger, Analytical Consultant, Seattle, WA

Ross Bettinger, Analytical Consultant, Seattle, WA ABSTRACT DYNAMIC REGRESSION IN ARIMA MODELING Ross Bettinger, Analytical Consultant, Seattle, WA Box-Jenkins time series models that contain exogenous predictor variables are called dynamic regression

More information

Ch. 14 Stationary ARMA Process

Ch. 14 Stationary ARMA Process Ch. 14 Stationary ARMA Process A general linear stochastic model is described that suppose a time series to be generated by a linear aggregation of random shock. For practical representation it is desirable

More information

University of Oxford. Statistical Methods Autocorrelation. Identification and Estimation

University of Oxford. Statistical Methods Autocorrelation. Identification and Estimation University of Oxford Statistical Methods Autocorrelation Identification and Estimation Dr. Órlaith Burke Michaelmas Term, 2011 Department of Statistics, 1 South Parks Road, Oxford OX1 3TG Contents 1 Model

More information

Circle a single answer for each multiple choice question. Your choice should be made clearly.

Circle a single answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 4, 215 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 31 questions. Circle

More information

Introduction to Time Series Analysis. Lecture 7.

Introduction to Time Series Analysis. Lecture 7. Last lecture: Introduction to Time Series Analysis. Lecture 7. Peter Bartlett 1. ARMA(p,q) models: stationarity, causality, invertibility 2. The linear process representation of ARMA processes: ψ. 3. Autocovariance

More information

Module 4. Stationary Time Series Models Part 1 MA Models and Their Properties

Module 4. Stationary Time Series Models Part 1 MA Models and Their Properties Module 4 Stationary Time Series Models Part 1 MA Models and Their Properties Class notes for Statistics 451: Applied Time Series Iowa State University Copyright 2015 W. Q. Meeker. February 14, 2016 20h

More information

at least 50 and preferably 100 observations should be available to build a proper model

at least 50 and preferably 100 observations should be available to build a proper model III Box-Jenkins Methods 1. Pros and Cons of ARIMA Forecasting a) need for data at least 50 and preferably 100 observations should be available to build a proper model used most frequently for hourly or

More information

Exercises - Time series analysis

Exercises - Time series analysis Descriptive analysis of a time series (1) Estimate the trend of the series of gasoline consumption in Spain using a straight line in the period from 1945 to 1995 and generate forecasts for 24 months. Compare

More information

Financial Time Series Analysis Week 5

Financial Time Series Analysis Week 5 Financial Time Series Analysis Week 5 25 Estimation in AR moels Central Limit Theorem for µ in AR() Moel Recall : If X N(µ, σ 2 ), normal istribute ranom variable with mean µ an variance σ 2, then X µ

More information

Univariate ARIMA Models

Univariate ARIMA Models Univariate ARIMA Models ARIMA Model Building Steps: Identification: Using graphs, statistics, ACFs and PACFs, transformations, etc. to achieve stationary and tentatively identify patterns and model components.

More information

Next tool is Partial ACF; mathematical tools first. The Multivariate Normal Distribution. e z2 /2. f Z (z) = 1 2π. e z2 i /2

Next tool is Partial ACF; mathematical tools first. The Multivariate Normal Distribution. e z2 /2. f Z (z) = 1 2π. e z2 i /2 Next tool is Partial ACF; mathematical tools first. The Multivariate Normal Distribution Defn: Z R 1 N(0,1) iff f Z (z) = 1 2π e z2 /2 Defn: Z R p MV N p (0, I) if and only if Z = (Z 1,..., Z p ) (a column

More information

Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications

Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications Lecture 3: Autoregressive Moving Average (ARMA) Models and their Practical Applications Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2018 Overview Moving average processes Autoregressive

More information

MAT 3379 (Winter 2016) FINAL EXAM (SOLUTIONS)

MAT 3379 (Winter 2016) FINAL EXAM (SOLUTIONS) MAT 3379 (Winter 2016) FINAL EXAM (SOLUTIONS) 15 April 2016 (180 minutes) Professor: R. Kulik Student Number: Name: This is closed book exam. You are allowed to use one double-sided A4 sheet of notes.

More information

6.3 Forecasting ARMA processes

6.3 Forecasting ARMA processes 6.3. FORECASTING ARMA PROCESSES 123 6.3 Forecasting ARMA processes The purpose of forecasting is to predict future values of a TS based on the data collected to the present. In this section we will discuss

More information

Characteristics of Time Series

Characteristics of Time Series Characteristics of Time Series Al Nosedal University of Toronto January 12, 2016 Al Nosedal University of Toronto Characteristics of Time Series January 12, 2016 1 / 37 Signal and Noise In general, most

More information

Identifiability, Invertibility

Identifiability, Invertibility Identifiability, Invertibility Defn: If {ǫ t } is a white noise series and µ and b 0,..., b p are constants then X t = µ + b 0 ǫ t + b ǫ t + + b p ǫ t p is a moving average of order p; write MA(p). Q:

More information

1. Stochastic Processes and Stationarity

1. Stochastic Processes and Stationarity Massachusetts Institute of Technology Department of Economics Time Series 14.384 Guido Kuersteiner Lecture Note 1 - Introduction This course provides the basic tools needed to analyze data that is observed

More information

Class 1: Stationary Time Series Analysis

Class 1: Stationary Time Series Analysis Class 1: Stationary Time Series Analysis Macroeconometrics - Fall 2009 Jacek Suda, BdF and PSE February 28, 2011 Outline Outline: 1 Covariance-Stationary Processes 2 Wold Decomposition Theorem 3 ARMA Models

More information

Time Series Outlier Detection

Time Series Outlier Detection Time Series Outlier Detection Tingyi Zhu July 28, 2016 Tingyi Zhu Time Series Outlier Detection July 28, 2016 1 / 42 Outline Time Series Basics Outliers Detection in Single Time Series Outlier Series Detection

More information

The autocorrelation and autocovariance functions - helpful tools in the modelling problem

The autocorrelation and autocovariance functions - helpful tools in the modelling problem The autocorrelation and autocovariance functions - helpful tools in the modelling problem J. Nowicka-Zagrajek A. Wy lomańska Institute of Mathematics and Computer Science Wroc law University of Technology,

More information

Econometrics II Heij et al. Chapter 7.1

Econometrics II Heij et al. Chapter 7.1 Chapter 7.1 p. 1/2 Econometrics II Heij et al. Chapter 7.1 Linear Time Series Models for Stationary data Marius Ooms Tinbergen Institute Amsterdam Chapter 7.1 p. 2/2 Program Introduction Modelling philosophy

More information

ECON 616: Lecture 1: Time Series Basics

ECON 616: Lecture 1: Time Series Basics ECON 616: Lecture 1: Time Series Basics ED HERBST August 30, 2017 References Overview: Chapters 1-3 from Hamilton (1994). Technical Details: Chapters 2-3 from Brockwell and Davis (1987). Intuition: Chapters

More information

2. An Introduction to Moving Average Models and ARMA Models

2. An Introduction to Moving Average Models and ARMA Models . An Introduction to Moving Average Models and ARMA Models.1 White Noise. The MA(1) model.3 The MA(q) model..4 Estimation and forecasting of MA models..5 ARMA(p,q) models. The Moving Average (MA) models

More information

Chapter 9: Forecasting

Chapter 9: Forecasting Chapter 9: Forecasting One of the critical goals of time series analysis is to forecast (predict) the values of the time series at times in the future. When forecasting, we ideally should evaluate the

More information

Lesson 9: Autoregressive-Moving Average (ARMA) models

Lesson 9: Autoregressive-Moving Average (ARMA) models Lesson 9: Autoregressive-Moving Average (ARMA) models Dipartimento di Ingegneria e Scienze dell Informazione e Matematica Università dell Aquila, umberto.triacca@ec.univaq.it Introduction We have seen

More information

Covariance Stationary Time Series. Example: Independent White Noise (IWN(0,σ 2 )) Y t = ε t, ε t iid N(0,σ 2 )

Covariance Stationary Time Series. Example: Independent White Noise (IWN(0,σ 2 )) Y t = ε t, ε t iid N(0,σ 2 ) Covariance Stationary Time Series Stochastic Process: sequence of rv s ordered by time {Y t } {...,Y 1,Y 0,Y 1,...} Defn: {Y t } is covariance stationary if E[Y t ]μ for all t cov(y t,y t j )E[(Y t μ)(y

More information

Stochastic processes: basic notions

Stochastic processes: basic notions Stochastic processes: basic notions Jean-Marie Dufour McGill University First version: March 2002 Revised: September 2002, April 2004, September 2004, January 2005, July 2011, May 2016, July 2016 This

More information

Gaussian processes. Basic Properties VAG002-

Gaussian processes. Basic Properties VAG002- Gaussian processes The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity

More information

Stationary Stochastic Time Series Models

Stationary Stochastic Time Series Models Stationary Stochastic Time Series Models When modeling time series it is useful to regard an observed time series, (x 1,x,..., x n ), as the realisation of a stochastic process. In general a stochastic

More information

Econometric Forecasting

Econometric Forecasting Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna October 1, 2014 Outline Introduction Model-free extrapolation Univariate time-series models Trend

More information

Lecture 1: Stationary Time Series Analysis

Lecture 1: Stationary Time Series Analysis Syllabus Stationarity ARMA AR MA Model Selection Estimation Lecture 1: Stationary Time Series Analysis 222061-1617: Time Series Econometrics Spring 2018 Jacek Suda Syllabus Stationarity ARMA AR MA Model

More information

Review Session: Econometrics - CLEFIN (20192)

Review Session: Econometrics - CLEFIN (20192) Review Session: Econometrics - CLEFIN (20192) Part II: Univariate time series analysis Daniele Bianchi March 20, 2013 Fundamentals Stationarity A time series is a sequence of random variables x t, t =

More information

Moving Average (MA) representations

Moving Average (MA) representations Moving Average (MA) representations The moving average representation of order M has the following form v[k] = MX c n e[k n]+e[k] (16) n=1 whose transfer function operator form is MX v[k] =H(q 1 )e[k],

More information

STAT 720 TIME SERIES ANALYSIS. Lecture Notes. Dewei Wang. Department of Statistics. University of South Carolina

STAT 720 TIME SERIES ANALYSIS. Lecture Notes. Dewei Wang. Department of Statistics. University of South Carolina STAT 720 TIME SERIES ANALYSIS Spring 2015 Lecture Notes Dewei Wang Department of Statistics University of South Carolina 1 Contents 1 Introduction 1 1.1 Some examples........................................

More information

X t = a t + r t, (7.1)

X t = a t + r t, (7.1) Chapter 7 State Space Models 71 Introduction State Space models, developed over the past 10 20 years, are alternative models for time series They include both the ARIMA models of Chapters 3 6 and the Classical

More information

Stochastic Modelling Solutions to Exercises on Time Series

Stochastic Modelling Solutions to Exercises on Time Series Stochastic Modelling Solutions to Exercises on Time Series Dr. Iqbal Owadally March 3, 2003 Solutions to Elementary Problems Q1. (i) (1 0.5B)X t = Z t. The characteristic equation 1 0.5z = 0 does not have

More information