Chapter 7: State Space Models

7.1 Introduction

State space models, developed over the past 10-20 years, are alternative models for time series. They include both the ARIMA models of Chapters 3-6 and the Classical Decomposition Model of Chapter 2 as special cases, but go well beyond both. They are important because (i) they provide a rich family of naturally interpretable models for data, and (ii) they lead to highly efficient estimation and forecasting algorithms, through the Kalman recursions (see §7.3). They are widely used and, perhaps in consequence, are known under several different names: structural models (econometrics), dynamic linear models (statistics), Bayesian forecasting models (statistics), linear system models (engineering), Kalman filtering models (control engineering). The essential idea is that behind the observed time series X_t there is an underlying process S_t which itself is evolving through time in a way that reflects the structure of the system being observed.

7.2 The Model

Example 1: The Random Walk plus Noise Model

For a series X_t with trend, but no seasonal or cyclic variation, the Classical Decomposition of §2.3 is based on

    X_t = a_t + r_t,                                            (7.1)

where a_t represents deterministic trend (the underlying general level, or signal, at t) and r_t represents random variation or noise. To make this into a more precisely specified model we might suppose that r_t is white noise WN(0, σ²_r). Also, instead of supposing that a_t is deterministic we could take it to be random as well, but not changing much over time. A way of representing this would be to suppose

    a_t = a_{t-1} + η_t,                                        (7.2)

where η_t is white noise WN(0, σ²_η), uncorrelated with the r_t. Equations (7.1) and (7.2) together define the Random Walk plus Noise model, also known as the Local Level model. A realization from this model, with a_0 = 0 and σ²_r = 6, σ²_η = 3, is given below.

[Figure: Realization from Local Level Model; X_t plotted against Time, t = 0, ..., 100.]

The model might be suitable for an industrial process, with process level a_t intended to be within design limits, but not directly observable itself. The sizes of σ²_r and σ²_η will determine local and large-scale smoothness respectively.

Review Question: How will σ²_r and σ²_η affect local and large-scale smoothness? How would the graph of a process for which σ²_r/σ²_η is large differ from that of one for which it is small?
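A realization like the one above can be generated directly from (7.1) and (7.2). The short Python sketch below (the function and variable names are illustrative, not part of the notes) simulates 100 observations with the same variances as in the plot:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_local_level(n, sigma2_r=6.0, sigma2_eta=3.0, a0=0.0):
    """Simulate the Random Walk plus Noise model (7.1)-(7.2)."""
    a = np.empty(n)          # unobserved level a_t
    x = np.empty(n)          # observed series X_t
    level = a0
    for t in range(n):
        level = level + rng.normal(0.0, np.sqrt(sigma2_eta))  # a_t = a_{t-1} + eta_t
        a[t] = level
        x[t] = level + rng.normal(0.0, np.sqrt(sigma2_r))     # X_t = a_t + r_t
    return x, a

x, a = simulate_local_level(100)
```

Re-running this with a larger σ²_η makes the level wander more (less large-scale smoothness), while a larger σ²_r adds local jitter around the level, which is one empirical way into the review question.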

Example 2: A Seasonal Model with Noise

In the Classical Decomposition Method the seasonal/cyclic components s_t, with a period c say, were taken to be numbers which repeated themselves every c time units and summed to 0 over the c seasons: that is, for each t,

    s_{t+c} = s_t   and   Σ_{j=1}^{c} s_{t+j} = 0.

Given any c-1 values s_1, ..., s_{c-1}, we can generate such a sequence by setting

    s_c = -s_1 - ... - s_{c-1}                                  (7.3)

and

    s_{t+c} = s_t for all t = 1, 2, ....                        (7.4)

The result is a pattern exactly reproducing itself every c time units. Note that (7.3) and (7.4) together amount to saying that s_t can be found successively from

    s_t = -s_{t-c+1} - s_{t-c+2} - ... - s_{t-1},   t = c, c+1, ....   (7.5)

We can introduce some variability into the seasonal pattern by adding a white noise perturbation to (7.5):

    s_t = -s_{t-c+1} - s_{t-c+2} - ... - s_{t-1} + η_t,

where η_t is, say, WN(0, σ²_η). On average (in expectation) the s_t's will still sum to 0 over the seasons, but individual variability is now possible. If, as in the previous example, we suppose that actual observations on the process are subject to error, we get

    X_t = s_t + r_t                                             (7.6)
    s_t = -Σ_{j=1}^{c-1} s_{t-j} + η_t,                         (7.7)

where the observation error r_t might be taken to be WN(0, σ²_r). The two equations (7.6) and (7.7) define the model. A realization with c = 12 is shown below.

[Figure: Realization from Seasonal Model with c = 12; X_t plotted against Time in Cycles.]

If we write (s_t, ..., s_{t-c+2})' as a vector, S_t say, then (7.7) can be written in matrix form as

    [ s_t       ]   [ -1 -1 ... -1 -1 ] [ s_{t-1}   ]   [ η_t ]
    [ s_{t-1}   ]   [  1  0 ...  0  0 ] [ s_{t-2}   ]   [  0  ]
    [ s_{t-2}   ] = [  0  1 ...  0  0 ] [ s_{t-3}   ] + [  0  ]
    [   ...     ]   [       ...       ] [   ...     ]   [ ... ]
    [ s_{t-c+2} ]   [  0  0 ...  1  0 ] [ s_{t-c+1} ]   [  0  ]

that is,

    S_t = F S_{t-1} + V_t,                                      (7.8)

say, where F is the (c-1) x (c-1) matrix above and V_t is the random vector V_t = (η_t, 0, ..., 0)'.
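The transition matrix F and the noise vector V_t in (7.8) are easy to construct programmatically. A minimal numpy sketch (function names and the generator argument are my own choices) for a general period c:

```python
import numpy as np

def seasonal_transition(c):
    """Build the (c-1) x (c-1) matrix F of (7.8) for seasonal period c."""
    m = c - 1
    F = np.zeros((m, m))
    F[0, :] = -1.0                 # first row: s_t = -s_{t-1} - ... - s_{t-c+1} (+ eta_t)
    F[1:, :-1] = np.eye(m - 1)     # sub-diagonal: shift the remaining components down
    return F

def seasonal_noise(c, sigma2_eta, rng):
    """Draw V_t = (eta_t, 0, ..., 0)' with eta_t ~ N(0, sigma2_eta)."""
    V = np.zeros(c - 1)
    V[0] = rng.normal(0.0, np.sqrt(sigma2_eta))
    return V

F = seasonal_transition(12)        # c = 12, as in the realization above
```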

The other equation defining the model, (7.6), may be written in terms of S_t as

    X_t = (1, 0, ..., 0) S_t + r_t.                             (7.9)

General Form of State Space Model

In the examples each model consists of two parts: an underlying process (a_t or S_t above), called in general the state process of the system, whose evolution is governed by one equation ((7.2) and (7.8) respectively), and another process, the observed time series X_t itself, called in general the observation process, which is related to the state process by another equation ((7.1) and (7.9) respectively).

In general a state space model consists of a pair of random quantities X_t and S_t whose evolution and relationship are described by the equations

    X_t = G_t S_t + ε_t                                         (7.10)
    S_t = F_t S_{t-1} + V_t                                     (7.11)

where S_t denotes the state at time t, G_t and F_t are known matrices, possibly depending on time, ε_t is WN(0, σ²_ε), and V_t is a vector of white noise processes, each uncorrelated with the ε_t process. Equation (7.10) is called the observation equation, and equation (7.11) the state or system equation. The component white noise processes of V_t may be correlated with each other, though components of V_t and V_s for t ≠ s are taken to be uncorrelated. We use the notation V_t ~ WN(0, {Q_t}) to mean that the random vector V_t consists of univariate WN components (so that it has mean 0) and has variance-covariance matrix Q_t, that is: E(V_t V_t') = Q_t.
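Equations (7.10) and (7.11) can be read directly as a recipe for simulation: propagate the state with F_t and add state noise, then observe through G_t and add observation noise. A small Python sketch along these lines, assuming for simplicity that G, F and Q do not depend on time and that X_t is scalar (the function and argument names are mine):

```python
import numpy as np

def simulate_state_space(G, F, Q, sigma2_eps, S1, n, rng=None):
    """Simulate (X_t, S_t), t = 1..n, from X_t = G S_t + eps_t,
    S_t = F S_{t-1} + V_t, with V_t ~ WN(0, Q) and eps_t ~ WN(0, sigma2_eps)."""
    rng = rng or np.random.default_rng()
    d = len(S1)
    S = np.empty((n, d))
    X = np.empty(n)
    s = np.asarray(S1, dtype=float)
    for t in range(n):
        if t > 0:
            s = F @ s + rng.multivariate_normal(np.zeros(d), Q)  # state equation (7.11)
        S[t] = s
        X[t] = G @ s + rng.normal(0.0, np.sqrt(sigma2_eps))      # observation equation (7.10)
    return X, S

# Example 1 as a special case: scalar state, G = F = 1, Q = sigma2_eta.
X, S = simulate_state_space(G=np.array([1.0]), F=np.array([[1.0]]),
                            Q=np.array([[3.0]]), sigma2_eps=6.0,
                            S1=np.array([0.0]), n=100)
```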

In Example 1 S_t can be identified directly with a_t, and ε_t with r_t. G_t and F_t are both equal to the degenerate 1-dimensional unit matrix, and V_t is the 1-dimensional vector with component η_t. The covariance matrix Q_t is therefore simply σ²_η. In Example 2 the matrix G_t = (1, 0, ..., 0) as in (7.9), ε_t is r_t, the matrix F_t and the vector V_t are as in (7.8), and Q_t is the (c-1) x (c-1) matrix with all entries zero except the top left hand one, which is equal to σ²_η.

Example 3: AR(1) Model

The stationary AR(1) process given by

    X_t = αX_{t-1} + η_t                                        (7.12)

is another example of a state space model. Identify the state S_t with X_t itself, so that the state equation can be taken to be (7.12) if we set F = α and V_t = η_t; and the observation equation is just X_t = S_t, which has the form (7.10) with G = 1 and ε = 0.

Notes

(a) By iterating the state equation (7.11) we get

    S_t = F_t S_{t-1} + V_t
        = F_t (F_{t-1} S_{t-2} + V_{t-1}) + V_t
        = (F_t F_{t-1} ... F_2) S_1 + (F_t ... F_3) V_2 + ... + F_t V_{t-1} + V_t   (7.13)
        = f_t(S_1, V_2, ..., V_t)

for a function f_t. From the observation equation therefore

    X_t = G_t f_t(S_1, ...) + ε_t = g_t(S_1, V_2, ..., V_t, ε_t)

for a function g_t. Thus the process is driven (through the G_t and F_t) by the white noise terms and the initial state S_1.

(b) It turns out to be possible to put a large number of time series models, including, for example, all ARIMA models, into a state space form. An advantage of doing so is that the state equation gives a simple way of analysing the process S_t, and from that it is easy, via the observation equation, to find out about the observation process X_t. If S_1 and V_2, ..., V_t are independent (as opposed to just being uncorrelated) then S_t has the Markov property, that is, the distribution of S_t given S_{t-1}, S_{t-2}, ..., S_1 is the same as the distribution of S_t given S_{t-1} alone.

7.3 The Kalman Recursions

7.3.1 Filtering, Prediction and Smoothing

In state space models the state is generally the aspect of greatest interest, but it is not usually observed directly. What are observed are the X_t's. So we'd like to have methods for estimating S_t from the observations. Three scenarios are:

Prediction Problem: Estimate S_t from X_{t-1}, X_{t-2}, ....
Filtering Problem: Estimate S_t from X_t, X_{t-1}, ....
Smoothing Problem: Estimate S_t from X_n, X_{n-1}, ..., where n > t.

A further problem, which turns out to have an answer useful for other things too, is

X-Prediction Problem: Estimate X_t from X_{t-1}, X_{t-2}, ....

7.3.2 General Approach

Note that equation (7.13) above shows that S_t and X_t are linear combinations of the initial state S_1 and the white noise processes V_t and ε_t. If these are Gaussian, then both S_t and X_t will be Gaussian too for every t, and their distributions will be completely determined by their means and covariances. Thus the whole evolution of the model will be known if the means and covariances can be calculated. The Kalman recursions give a highly efficient way of computing these means and covariances by building them up successively from earlier values. The recursions lead to algorithms for the problems above and for fitting the models to data. They are an enormously powerful tool for handling a wide range of time series models. The basis of the Kalman recursions is the following simple result about multivariate Normal distributions.

7.3.3 Conditioning in a Multivariate Normal Distribution

Let Z and W denote random vectors with Normal distributions

    Z ~ N(µ_z, Σ_zz),    W ~ N(µ_w, Σ_ww),

and with covariance matrix

    E((Z - µ_z)(W - µ_w)') = Σ_zw,

so that the distribution of the vector obtained by stacking Z on W is

    ( Z )        ( ( µ_z )   ( Σ_zz  Σ_zw ) )
    ( W )  ~   N ( ( µ_w ) , ( Σ_wz  Σ_ww ) ).

Then

    Z | W ~ N( µ_z + Σ_zw Σ_ww^{-1} (W - µ_w),  Σ_zz - Σ_zw Σ_ww^{-1} Σ_wz ).   (7.14)

For a proof, write down the ratio of the probability densities of (Z, W) and of W, and complete the square in the exponent term.
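Formula (7.14) is also easy to check numerically. A small sketch of the computation (the function name and the bivariate example numbers are mine):

```python
import numpy as np

def conditional_normal(mu_z, mu_w, S_zz, S_zw, S_ww, w):
    """Mean and covariance of Z | W = w for jointly Gaussian (Z, W), as in (7.14)."""
    S_ww_inv = np.linalg.inv(S_ww)
    mean = mu_z + S_zw @ S_ww_inv @ (w - mu_w)
    cov = S_zz - S_zw @ S_ww_inv @ S_zw.T
    return mean, cov

# Bivariate example: Z and W scalar with unit variances and covariance 0.8.
mean, cov = conditional_normal(np.array([0.0]), np.array([0.0]),
                               np.array([[1.0]]), np.array([[0.8]]),
                               np.array([[1.0]]), w=np.array([2.0]))
# mean = 0.8 * 2 = 1.6, cov = 1 - 0.8**2 = 0.36
```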

7.3.4 The Recursions

Suppose after data D_{t-1} = {X_1, ..., X_{t-1}} have been observed we know, by some means, that the state S_{t-1} has mean m_{t-1} and covariance matrix P_{t-1}, so that

    S_{t-1} | D_{t-1} ~ N(m_{t-1}, P_{t-1}).

The recursions are built on relating to this the distributions of

(a) S_t | D_{t-1}, and
(b) S_t | {X_t, D_{t-1}}.

For (a), because S_t is related to S_{t-1} through the state equation (7.11) (S_t = F_t S_{t-1} + V_t), it follows that, given D_{t-1}, S_t is also Normally distributed,

    S_t | D_{t-1} ~ N(m_{t|t-1}, P_{t|t-1}),                    (7.15)

and its mean vector and covariance matrix are

    m_{t|t-1} = E(S_t | D_{t-1}) = E(F_t S_{t-1} | D_{t-1}) + E(V_t | D_{t-1}) = F_t m_{t-1}   (7.16)

and

    P_{t|t-1} = E((S_t - m_{t|t-1})(S_t - m_{t|t-1})' | D_{t-1}) = F_t P_{t-1} F_t' + Q_t.      (7.17)

For (b), from (7.15) and the observation equation (7.10) (X_t = G_t S_t + ε_t) we find

    E(X_t | D_{t-1}) = G_t m_{t|t-1},                           (7.18)

so that

    X_t - E(X_t | D_{t-1}) = G_t (S_t - m_{t|t-1}) + ε_t

and hence

    Var(X_t | D_{t-1}) = G_t P_{t|t-1} G_t' + σ²_ε.             (7.19)

Similarly

    Cov(X_t, S_t | D_{t-1}) = G_t P_{t|t-1},    Cov(S_t, X_t | D_{t-1}) = P_{t|t-1} G_t'.

Thus, given the data D_{t-1}, S_t and X_t have the joint distribution

    ( S_t )                  ( ( m_{t|t-1}     )   ( P_{t|t-1}       P_{t|t-1} G_t'            ) )
    ( X_t ) | D_{t-1}  ~   N ( ( G_t m_{t|t-1} ) , ( G_t P_{t|t-1}   G_t P_{t|t-1} G_t' + σ²_ε ) ).

It follows from the result in §7.3.3 that the conditional distribution of S_t given the new observation X_t in addition to D_{t-1} (that is, the distribution of S_t | D_t) is

    S_t | {X_t, D_{t-1}} ~ N(m_t, P_t),

where

    m_t = m_{t|t-1} + P_{t|t-1} G_t' Var(X_t | D_{t-1})^{-1} (X_t - G_t m_{t|t-1})    (7.20)
    P_t = P_{t|t-1} - P_{t|t-1} G_t' Var(X_t | D_{t-1})^{-1} G_t P_{t|t-1}.           (7.21)

The three equations (7.19), (7.20) and (7.21), called the updating equations, together with (7.16) and (7.17), called the prediction equations, are collectively referred to as the Kalman Filter equations. Given starting values m_0 and P_0 they can be applied successively to calculate the distribution of the state vector as each new observation becomes available. At any time they give values which contain all the information needed to make optimal predictions of future values of both the state and the observations, as follows.
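The prediction equations (7.16)-(7.17) and updating equations (7.19)-(7.21) translate almost line for line into code. Below is a minimal Python sketch of the recursions for a scalar observation series and time-invariant G, F and Q (the function name and these simplifications are mine, not part of the notes):

```python
import numpy as np

def kalman_filter(x, G, F, Q, sigma2_eps, m0, P0):
    """Run the Kalman recursions over observations x[0], ..., x[n-1].
    Returns filtered means m_t, one-step means m_{t|t-1}, innovations e_t
    and their variances phi_t."""
    n, d = len(x), len(m0)
    m, P = np.asarray(m0, float), np.asarray(P0, float)
    m_filt = np.empty((n, d))
    m_pred = np.empty((n, d))
    e = np.empty(n)
    phi = np.empty(n)
    for t in range(n):
        # Prediction step: (7.16) and (7.17).
        m_p = F @ m
        P_p = F @ P @ F.T + Q
        # Innovation and its variance: (7.18), (7.19), (7.22).
        e[t] = x[t] - G @ m_p
        phi[t] = G @ P_p @ G + sigma2_eps
        # Updating step: (7.20) and (7.21).
        K = P_p @ G / phi[t]          # P_{t|t-1} G' Var(X_t | D_{t-1})^{-1}
        m = m_p + K * e[t]
        P = P_p - np.outer(K, G @ P_p)
        m_pred[t], m_filt[t] = m_p, m
    return m_filt, m_pred, e, phi
```

For the Local Level model of Example 1 this reduces to the familiar scalar filter: take G = np.array([1.0]) and F, Q, P0 as 1 x 1 matrices.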

7.3.5 The Prediction and Filtering Problems

By the general result about minimum mean square error forecasts in §6.1, the conditional mean of S_t given D_{t-1}, m_{t|t-1}, is the minimum mean square error estimate of the state S_t given observations up to and including time t-1. The covariance matrix P_{t|t-1} gives the estimation error variances and covariances. Thus the prediction equations (7.16) and (7.17) give the means to solve the Prediction Problem of §7.3.1. In the same way the conditional mean of S_t given D_t, m_t, is the solution to the Filtering Problem of §7.3.1, and the variances and covariances of the error in estimating S_t by m_t are given by P_t.

7.3.6 The X-Prediction Problem

The minimum mean square error forecast of X_t given observations up to time t-1, that is, given D_{t-1}, is simply

    X̂_t = G_t m_{t|t-1},

by (7.18). The prediction error, which we will denote by e_t, is therefore

    e_t = X_t - X̂_t = X_t - G_t m_{t|t-1} = G_t (S_t - m_{t|t-1}) + ε_t.

e_t is also known as the innovation at time t, since it consists of the new information in the observation at t. From the updating equation (7.20) it can be seen that the innovations play a key part in the updating of the estimate of S_{t-1} to S_t. The further e_t is from the zero vector, the greater the correction in the estimator of S_{t-1}. The innovations have means E(e_t) = 0, and variances, which we will denote by φ_t, given by

    φ_t = Var(e_t) = E(X_t - G_t m_{t|t-1})² = G_t P_{t|t-1} G_t' + σ²_ε,    (7.22)

from (7.19) and (7.21). The φ_t can be calculated straightforwardly from the Kalman filter equations.

7.3.7 Likelihood

The likelihood function, L say, for any model is the probability density (or probability in the discrete case) of the observed data, taken as a function of the unknown parameters, θ say. For a state space model therefore, if data X_1 = x_1, ..., X_t = x_t have been observed, and if p is the joint probability density function of X_1, ..., X_t,

    L(θ; x) = p(x; θ) = p(x_1 | θ) Π_{s=2}^{t} p(x_s | D_{s-1}; θ),

where the density function p(x_s | D_{s-1}) is that of the Normal distribution (of X_s given D_{s-1}) with mean E(X_s | D_{s-1}) = X̂_s = G_s m_{s|s-1} and variance φ_s given by (7.22). Thus

    log L = const + log p(x_1 | θ) - (1/2) Σ_{s=2}^{t} log φ_s - (1/2) Σ_{s=2}^{t} (x_s - X̂_s)²/φ_s
          = const + log p(x_1 | θ) - (1/2) Σ_{s=2}^{t} log φ_s - (1/2) Σ_{s=2}^{t} e_s²/φ_s,

which is easily calculated from the innovations and their variances, and p(x_1 | θ) if necessary. Standard methods of numerical maximization may then be used to estimate the unknown parameters θ. This is the approach described in §5.3.1.
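Given the innovations e_s and variances φ_s returned by the filter sketch above, the Gaussian log-likelihood is a few lines of Python. As before the names are mine, and the first observation is handled crudely here by simply conditioning on x_1 (dropping the p(x_1 | θ) term):

```python
import numpy as np

def state_space_loglik(e, phi, skip_first=True):
    """Gaussian log-likelihood from innovations e_s and variances phi_s
    via the prediction error decomposition; p(x_1 | theta) is omitted,
    i.e. the likelihood is conditional on the first observation."""
    s0 = 1 if skip_first else 0
    e, phi = np.asarray(e)[s0:], np.asarray(phi)[s0:]
    return -0.5 * np.sum(np.log(2.0 * np.pi * phi) + e**2 / phi)
```

Maximizing this over the unknown parameters (for Example 1, over σ²_r and σ²_η), for instance by applying a numerical optimizer such as scipy.optimize.minimize to the negative log-likelihood, implements the fitting approach described above.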

Summary of Ideas in Chapter 7

State space models are specified by
- state variables and observation variables, and by
- state equations, describing the evolution of states, and observation equations, describing the relationship of observations to states.

Special cases include the ARIMA models studied in earlier chapters, and the models underlying the Decomposition Method of §2.3 of Chapter 2.

Various problems in relation to forecasting in state space models may be specified:
- prediction
- filtering
- smoothing
- X-prediction

Solutions are based on the fact that if variables are Gaussian, then to describe the whole evolution of the system all that's needed are the means and variances-covariances of the variables through time. These can be calculated recursively by the Kalman recursions.

The Kalman recursions yield immediately
- forecasts and their error variances
- efficient computation of the likelihood function, and therefore a powerful way of fitting models.