Markov Switching Models


Applications with R

Tsarouchas Nikolaos-Marios

Supervisor: Professor Sophia Dimelis

A thesis presented for the MSc degree in Business Mathematics
Department of Informatics
Athens University of Economics and Business
Athens, 29 September 2015

Contents

Abstract
Acknowledgements
1 Intro
  1.1 Introduction
2 Classical Approach
  2.1 Classical Analysis
    Maximum Likelihood Estimation
  2.2 Markov Chains
    Classification of States
    Reducible Markov Chains
    Ergodic Markov Chains
  2.3 Statistical Analysis of i.i.d. Mixture Distributions
    Inference About the Unobserved Regime
  2.4 Time Series Models with Regime Switching
    Description of the Process
    Evaluation of the Likelihood Function
    Forecasts for the Regimes
    Forecasts for the Observed Variables
    Maximum Likelihood Estimation of Parameters
    EM Algorithm
  2.5 Markov Switching Model of Conditional Mean
    A Simple AR Model
    Markov Trend
  2.6 Markov Switching Model of Conditional Variance
    Markov Switching Model of Conditional Mean and Conditional Variance
  2.7 Hypothesis Testing
    Linearity Test for Markov Switching Model
    Determining the Number of States
    Testing Other Hypotheses
  2.8 State Space Models and the Kalman Filter
    State-Space Models
    The Kalman Filter in State-Space Format
    Specification of the Markov Switching in State-Space Format
    Estimation of the Model
3 Bayesian Approach
  Bayesian Analysis
  Estimation Methods for Bayesian Approach
  Markov Switching Model - Bayesian Approach
  State Space Models with Markov Switching - Bayesian Approach
4 Applications
  Applications of the Markov Switching Model with R-programming
  An Application for Indian GDP
  An Application for DJIA Index
Conclusions
Appendix
  A. Appendix for GDP Example
    A.1 White Neural Network Test for Non-linearity
    A.2 Simple Linear Model of GDP
    A.3 Q-Q Plots
    A.4 ACF-PACF
    A.5 Smooth Probabilities
    A.6 Q-Q Plots for the lagged model
    A.7 ACF-PACF for the lagged model
  B. Appendix for DJIA Index Example
    B.1 OLS-Based MOSUM Tests
    B.2 ACF-PACF
    B.3 Q-Q Plots
    B.4 Regime Residuals
  C. Dirichlet Process
Bibliography

List of Figures

4.1 Plot of the log GDP of India
4.2 Plot of the smoothed probabilities for Regime 1
4.3 Plot of the smoothed probabilities for Regime 2
4.4 Plot of the smoothed probabilities for Regime 1
4.5 Plot of the smoothed probabilities for Regime 2
4.6 Plot of the log Dow Jones index
4.7 Plot of the Recursive CUSUM test
4.8 Normal Q-Q plot of pooled residuals for the MSM-AR
4.9 Plot of the smoothed probabilities
4.10 Dependent variable vs. smoothed probabilities for low volatility regime
4.11 Dependent variable vs. smoothed probabilities for high volatility regime
A.1 Q-Q plot for Regime 1 Residuals
A.2 Q-Q plot for Regime 2 Residuals
A.3 ACF-PACF Plot of Regime 1 Residuals
A.4 ACF-PACF Plot of Regime 2 Residuals
A.5 Plot of the smoothed probabilities
A.6 Q-Q plot for Regime 1 Residuals
A.7 Q-Q plot for Regime 2 Residuals
A.8 ACF-PACF Plot of Regime 1 Residuals
A.9 ACF-PACF Plot of Regime 2 Residuals
B.1 Plot of the OLS-Based MOSUM
B.2 ACF-PACF Plot of Regime 1 Residuals
B.3 ACF-PACF Plot of Regime 2 Residuals
B.4 Q-Q plot for Regime 1 Residuals
B.5 Q-Q plot for Regime 2 Residuals
B.6 Plot of the Regime Residuals

Forecasting is the art of saying what will happen, and then explaining why it didn't!

ANONYMOUS

Abstract

This thesis presents Hamilton's Markov Switching model in both its simple and its state-space form, and applies the model to India's GDP and to the DJIA index using R. The thesis is organised in three chapters on Markov Switching models. The first chapter covers the classical approach, in which the parameters are estimated from the data sample alone and inferences are made conditional on those data. The presentation consists of two parts: the first treats the simple form of the Markov Switching model, which can be estimated by the EM algorithm; the second treats the state-space form, which can be estimated by the Kalman filter. The second chapter presents the Bayesian approach, in which the parameters are treated as random variables with their own prior distributions, determined by the researcher's beliefs or by a Dirichlet process, before the posterior distribution is obtained from the sample data. As in the first chapter, both forms of the model are presented; in the Bayesian approach the parameters are estimated with Markov Chain Monte Carlo methods such as Gibbs sampling. In the last chapter a two-state Markov Switching model is applied to India's real GDP and to the DJIA index. The results of the implementation show that the Markov Switching model can fit financial data well and can detect the regimes effectively.

Acknowledgements

First of all, I would like to thank my supervisor, Professor Sophia Dimelis. I would never have been able to finish my thesis without her guidance, interest and encouragement. I would like to express my sincere gratitude to the Athens University of Economics and Business, Department of Informatics, for giving me the opportunity to carry out this research. Finally, I feel infinite love and gratitude for my family, and especially my mother, who shows me what is worth fighting for. I owe everything I have to them.

Acronyms

AIC     Akaike Information Criterion
AR      Autoregressive
ARCH    Autoregressive Conditional Heteroskedasticity
ARMA    Autoregressive Moving Average
EM      Expectation Maximization
GARCH   Generalized Autoregressive Conditional Heteroskedasticity
MA      Moving Average
MCMC    Markov Chain Monte Carlo
MLE     Maximum Likelihood Estimation
MSM     Markov Switching Model
MSM-AR  Markov Switching Model Autoregressive
OLS     Ordinary Least Squares
VAR     Vector Autoregressive

1 Intro

1.1 Introduction

In modern econometrics it is very common to employ various time series models to analyse the dynamics of economic and financial variables. When we face linear behaviour we use linear models such as the autoregressive (AR), the moving average (MA) and the mixed autoregressive moving average (ARMA) model. Linear time series models are widely known; as a matter of fact, every econometric and statistical package has ready-made routines for them. Even though these models are quite useful and successful in numerous applications, they are unable to capture complicated non-linear dynamic patterns. An example of non-linear behaviour is stock index prices, which typically fluctuate around higher levels and are more stable during expansions, but stay at relatively lower levels and are less persistent during contractions. Data of this kind behave non-linearly and cannot be described by a single linear model. In the last two decades we have witnessed a rapid growth in the development of non-linear time series models. Non-linear time series models have their own limitations, however. First of all, implementing them is demanding; for instance, non-linear optimization algorithms easily get stuck at a local optimum in the parameter space. Secondly, these models are designed to capture certain dynamic patterns of the data and are not flexible enough to describe other behaviour. The Markov Switching model of Hamilton (1989, 1990), also known as the regime-switching model, is one of the most popular non-linear time series models in the literature. The MSM employs multiple structures which replicate the behaviour of the time series in different regimes. By switching between

these structures the model captures more complex dynamic patterns. The innovative feature of the Markov Switching model is that the switching mechanism is governed by an unobservable state (regime) variable that follows a Markov process. More specifically, the Markovian property means that the state variable depends only on its immediate past value; the structure changes whenever a switch takes place. This is in sharp contrast with the random switching model of Quandt (1972), where the state variables are completely independent over time. Another regime-switching model is the Threshold Autoregressive model developed by Tong (1983, 1990), which allows a switch when a threshold is crossed in the sample. In this document we present the Markov Switching model originally developed by Hamilton (1989). Many time series exhibit dramatic breaks in their behaviour, caused by events such as financial crises or abrupt changes in government policy, as well as abrupt changes in the attributes of financial data such as asset prices and indexes. Dramatic changes indicate the existence of different regimes and the need for a model that can replicate such dynamic patterns, like the MSM. Markov Switching regressions were introduced in econometrics by Goldfeld and Quandt (1973). A simple Markov Switching model with a switch in the mean, without any autoregressive element and with a simple time-invariant Markov chain (one whose transition probability depends only on the most recent regime), appears to have been first analysed by Lindgren (1978) and Baum et al. (1980). Models with autoregressive elements were described by Poritz (1982), Juang and Rabiner (1985) and Rabiner (1989); these models were named Hidden Markov models. It was Hamilton (1989, 1990) who, in his famous papers, developed the Markov Switching model in its general form. After that, many extensions were made; the most important are the Bayesian approach to the model and the state-space form with time-varying parameters introduced by Kim and Nelson (1999).

2 Classical Approach

2.1 Classical Analysis

In the classical approach to statistics we estimate the unknown parameters taking into consideration only the sample data; hence we construct a distribution for the parameters from the observed data alone. The main method for estimating the parameters is Maximum Likelihood. A related method, widely known and practical, is Least Squares. The Maximum Likelihood method incorporates all the information in a model by using the complete joint distribution of the observations, whereas Least Squares uses only the first two moments.

2.1.1 Maximum Likelihood Estimation

To present Maximum Likelihood we assume a model with a vector of parameters $\theta$ and a vector of observations $\tilde{y}_T = [y_1, y_2, \ldots, y_T]$. The observation vector is the sample data, which is the only information we have for estimating the unknown parameters $\theta$. The likelihood function is defined as

L(\theta \mid \tilde{y}_T)    (2.1.1)

The likelihood function specifies the plausibility, or likelihood, of the data given the parameter vector $\theta$. Maximum Likelihood Estimates (MLE) are the parameter estimates obtained by maximizing the probability of having generated the observed sample.

To simplify matters we usually maximize the log of the likelihood function when estimating the parameters $\theta$:

\hat{\theta}_{MLE} = \arg\max_\theta \, \ln L(\theta \mid \tilde{y}_T)

With the log likelihood it is possible to summarize the amount of information in the sample by computing the information matrix $I(\theta)$:

I(\theta) = -E\left[ \frac{\partial^2 \ln L(\theta \mid \tilde{y}_T)}{\partial\theta\,\partial\theta'} \right]    (2.1.2)

The inverse of the information matrix gives the covariance matrix of an unbiased estimator $\hat{\theta}$. Setting $\theta = \hat{\theta}_{MLE}$, we get the covariance matrix of $\hat{\theta}_{MLE}$:

Cov(\hat{\theta}_{MLE}) = \left[ -\frac{\partial^2 \ln L(\theta \mid \tilde{y}_T)}{\partial\theta\,\partial\theta'} \Big|_{\theta=\hat{\theta}_{MLE}} \right]^{-1}    (2.1.3)

If the observations are independent, the likelihood function is the product of the marginal densities $p(y_t \mid \theta)$ of the individual observations:

L(\theta \mid \tilde{y}_T) = \prod_{t=1}^{T} p(y_t \mid \theta)    (2.1.4)

(Otherwise it factors as $L(\theta \mid \tilde{y}_T) = p(y_1 \mid \theta)\prod_{t=2}^{T} p(y_t \mid \tilde{y}_{t-1}, \theta)$.)

Example 1: Normal Distribution

If $[y_1, y_2, \ldots, y_T]$ are i.i.d. $N(\mu, \sigma^2)$ random variables and the parameters we want to estimate are the mean and variance, $\theta = (\mu, \sigma)$, the density of each observation is

f(y_t \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left[\frac{y_t - \mu}{\sigma}\right]^2 \right), \quad t = 1, 2, \ldots, T

To estimate the two parameters $\theta = [\mu, \sigma]$ we need the log likelihood function

l(\mu, \sigma \mid \tilde{y}_T) = \ln L(\mu, \sigma \mid \tilde{y}_T) = \ln \prod_{t=1}^{T} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left[\frac{y_t - \mu}{\sigma}\right]^2 \right)

where $L(\mu, \sigma \mid \tilde{y}_T)$ is the likelihood function. After some computation,

l(\mu, \sigma \mid \tilde{y}_T) = -T\ln\sigma - \frac{T}{2}\ln 2\pi - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(y_t - \mu)^2

To make Maximum Likelihood Estimates we maximize the log likelihood with respect to the parameters $\mu, \sigma$. The estimates $\hat{\theta}_{MLE}$ can be obtained by setting the first derivatives of the log likelihood equal to 0:

\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{t=1}^{T}(y_t - \mu) = 0, \qquad \frac{\partial l}{\partial \sigma} = -\frac{T}{\sigma} + \frac{1}{\sigma^3}\sum_{t=1}^{T}(y_t - \mu)^2 = 0

Solving these equations gives the Maximum Likelihood Estimates of $\mu$ and $\sigma^2$. Finally, to show that these solutions maximize the likelihood, we compute the second derivatives of $l(\theta \mid \tilde{y}_T)$ with respect to $\mu$ and $\sigma^2$ and verify that they are negative.
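As a quick illustration, the following R sketch (names and simulated data are ours, not from the thesis) maximizes this Normal log likelihood numerically with optim() and recovers the covariance of the estimates from the inverse Hessian, as in (2.1.3):

```r
# A minimal sketch: numerical MLE for an i.i.d. Normal sample.
set.seed(1)
y <- rnorm(200, mean = 2, sd = 1.5)   # simulated sample

negloglik <- function(par, y) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)         # keep sigma in the valid region
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 1), negloglik, y = y, hessian = TRUE)
fit$par             # MLEs of mu and sigma
solve(fit$hessian)  # approximate covariance of the MLEs, as in (2.1.3)
```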

Example 2: Poisson Distribution

Assume that $[y_1, y_2, \ldots, y_T]$ are i.i.d. Poisson random variables, so the probability function is

P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}

where $\lambda$ is the unknown Poisson parameter, thus $\theta = [\lambda]$. The log likelihood is

l(\lambda \mid \tilde{y}_T) = \ln L(\lambda \mid \tilde{y}_T) = \sum_{t=1}^{T} (y_t \ln\lambda - \lambda - \ln y_t!) = \ln\lambda \sum_{t=1}^{T} y_t - T\lambda - \sum_{t=1}^{T} \ln y_t!

For the maximum of the log likelihood we set the first derivative equal to 0:

\frac{\partial l}{\partial \lambda} = \frac{1}{\lambda}\sum_{t=1}^{T} y_t - T = 0

Solving, we get $\hat{\lambda}_{MLE} = \bar{Y}$, and we know this is the maximum since the function $l$ is concave.

2.2 Markov Chains

A Markov process is a random process in which the future state depends on the present state only; it has no memory of how the present state was reached. Let $s_t$ be a random variable that can assume only an integer value in $\{1, 2, 3, \ldots, N\}$. The Markov property means that the probability that $s_t$ equals $j$ depends only on the most recent past value $s_{t-1}$:

P[s_t = j \mid s_{t-1} = i, s_{t-2} = k, \ldots] = P[s_t = j \mid s_{t-1} = i] = p_{ij}    (2.2.1)

Such a process is described as an N-state Markov chain with transition probabilities $[p_{ij}]_{i,j=1,2,3,\ldots,N}$. The transition probability $p_{ij}$ gives the probability of moving to state $j$ given that we are already in state $i$. Note that

p_{i1} + p_{i2} + \cdots + p_{iN} = 1    (2.2.2)

We collect the transition probabilities in an $(N \times N)$ transition matrix $P$:

P = \begin{bmatrix} p_{11} & p_{21} & \cdots & p_{N1} \\ p_{12} & p_{22} & \cdots & p_{N2} \\ \vdots & & & \vdots \\ p_{1N} & p_{2N} & \cdots & p_{NN} \end{bmatrix}    (2.2.3)

The row $j$, column $i$ element of $P$ is the transition probability $p_{ij}$; for example, the row 2, column 1 element of $P$ is $p_{12}$, the probability that state 1 will be followed by state 2. The matrix (2.2.3) gives the probabilities of passing from a state in $\{1, 2, 3, \ldots, N\}$ at time $t$ to another state at time $t+1$; in other words, it holds the one-period transition probabilities. We may then consider the question of determining the probability that, given the chain is in state $i$ at time $t$, it will be in state $j$ at time $t + m$.
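A minimal R sketch (our own names and convention) of simulating such a chain; note that here the rows of the matrix index the current state, i.e. the sketch works with the transpose of the display in (2.2.3):

```r
# Simulate an N-state Markov chain; row i of P holds P[s_t = j | s_{t-1} = i].
simulate_chain <- function(P, T, s1 = 1) {
  s <- integer(T); s[1] <- s1
  for (t in 2:T) s[t] <- sample(ncol(P), 1, prob = P[s[t - 1], ])
  s
}

P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
s <- simulate_chain(P, 1000)
prop.table(table(s))   # empirical fraction of time spent in each state
```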

2.2.1 Classification of States

Definition 2.2.1: State $j$ is accessible from $i$ if $P^k_{ij} > 0$ for some $k \ge 0$.

Definition 2.2.2: States $i$ and $j$ communicate if they are accessible from each other. This is written $i \leftrightarrow j$ and is an equivalence relation: it is reflexive ($i \leftrightarrow i$), symmetric (if $i \leftrightarrow j$ then $j \leftrightarrow i$) and transitive (if $i \leftrightarrow j$ and $j \leftrightarrow k$ then $i \leftrightarrow k$).

Definition 2.2.3: A state is called transient if, starting from that state, there is positive probability of never returning to it. If $i \to j$ but not vice versa, then state $i$ is transient.

Definition 2.2.4: A state is called recurrent if, starting from that state, the probability of returning to it equals 1.

Definition 2.2.5: An absorbing state is a state that once entered cannot be left; its transition probability satisfies $p_{ii} = 1$.

Definition 2.2.6: The period of a state $i$ is the greatest integer $t$ satisfying $P^{(n)}_{ii} = 0$ whenever $n \neq t, 2t, 3t, 4t, \ldots$. If the period equals 1 the state is called aperiodic.

2.2.2 Reducible Markov Chains

Suppose we have a two-state Markov chain with transition matrix

P = \begin{bmatrix} p_{11} & 1 - p_{22} \\ 1 - p_{11} & p_{22} \end{bmatrix}    (2.2.4)

We have two states $\{1, 2\}$; in the transition matrix, $p_{11}$ gives the probability of moving from state 1 to state 1. Suppose that $p_{11} = 1$, so that $P$ is upper triangular. In that case, once the process enters state 1 there is no possibility of moving to another state: state 1 is an absorbing state and the Markov chain is reducible. Generalizing, an N-state Markov chain is said to be reducible if its transition matrix can be written in the form

P = \begin{bmatrix} B & C \\ 0 & D \end{bmatrix}    (2.2.5)

where $B$ denotes a $(K \times K)$ matrix for some $1 \le K < N$. If $P$ is upper block-triangular, then so is $P^m$ for any $m$. Thus, once such a process enters a state $j$ with $j \le K$, there is no possibility of ever reaching one of the states $K+1, K+2, \ldots, N$. A Markov chain that is not reducible is said to be irreducible; a Markov chain is irreducible if all its states communicate with each other. For example, the two-state chain (2.2.4) is irreducible if $p_{11} < 1$ and $p_{22} < 1$.

2.2.3 Ergodic Markov Chains

Definition 2.2.7: A Markov chain with finite state space is called a regular chain if some power of the transition matrix has only positive elements. A Markov chain is called ergodic if every state is accessible from every other state in one or more moves.

Theorem 2.2.1 (Fundamental Limit Theorem): Let $P$ be the transition matrix of a regular Markov chain with finite state space; for simplicity take $S = \{1, 2, 3, \ldots, s\}$. Then

\lim_{n\to\infty} P^n = W

where $W$ is an $s \times s$ matrix all of whose rows equal the same positive probability vector $w = [p(1), \ldots, p(s)]$, with $p(x) > 0$ for $x = 1, 2, \ldots, s$ and $\sum_{x=1}^{s} p(x) = 1$. That is, for all $z$,

\lim_{n\to\infty} P^n(x, z) = p(z)

independent of $x$. More specifically, we have the estimate

\max\{\, |P^n(x, z) - p(z)| : x, z \in S \,\} \le C e^{-Dn}

where $C, D$ are positive finite constants independent of $n$.

Once $\lim_{n\to\infty} P^n = W$ is known, where $W$ has all rows the same, we can use this fact to compute $W$ and to interpret its common row vector. The next theorem covers this interpretation.

Theorem 2.2.2: Let $P$ be a regular transition matrix for a finite state space Markov chain with states $S = \{1, 2, 3, \ldots, s\}$, and assume that $\lim_{n\to\infty} P^n = W$ with common row $w$. Then the $s \times s$ system of linear equations $xP = x$ has a unique probability row vector solution, namely the common row $w$. In addition, if $v$ is an arbitrary probability row vector of length $s$, then

\lim_{n\to\infty} v P^n = w

where $w$ is the common row vector of $W$. Thus the long-run probability of being in state $z$,

\sum_{x=1}^{s} v_x P^n(x, z),

is approximately $w_z$ for all $z = 1, 2, \ldots, s$, no matter what initial probability distribution $v = [v_1, v_2, \ldots, v_s]$ we used. Furthermore, if $w$ is the common row vector of $W$, then $wP^n = w$ for any $n \ge 0$: the probability of being in state $x$ is $w_x$ for all times $n = 0, 1, 2, \ldots$, and the chain is in equilibrium if we start with initial distribution $w$. Since $w$ is the unique probability vector satisfying $x = xP$ when $P$ is regular and finite, we can solve this system for $w$ and hence compute $\lim_{n\to\infty} P^n(u, v)$ for all $u, v = 1, 2, \ldots, s$, since the limit equals $w_v$, the $v$-th entry of $w$. That is, $x = xP$ is an $s \times s$ linear system in the variables $x_1, x_2, \ldots, x_s$, and the solution is the probability vector $w$.

Example 1: Consider a Markov chain with state space $S = \{1, 2, 3\}$ and transition matrix

P = \begin{bmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{bmatrix}

Taking $P \cdot P = P^2$ we get

P^2 = \begin{bmatrix} 7/16 & 3/16 & 3/8 \\ 3/8 & 1/4 & 3/8 \\ 3/8 & 3/16 & 7/16 \end{bmatrix}

so $P$ is regular and $\lim_{n\to\infty} P^n = W$ exists. To find the unique probability vector $w$ we solve the system $x = xP$ subject to $x$ being a probability vector, i.e. $x_1 + x_2 + x_3 = 1$ and $x_i \ge 0$. Writing out $x = xP$ gives

x_1/2 + x_2/2 + x_3/4 = x_1
x_1/4 + 0 \cdot x_2 + x_3/4 = x_2
x_1/4 + x_2/2 + x_3/2 = x_3

Solving this linear system we get $w = x = [2/5, 1/5, 2/5]$, and hence $\lim_{n\to\infty} P^n(x, z) = 2/5$ for $z = 1, 3$ and $\lim_{n\to\infty} P^n(x, z) = 1/5$ for $z = 2$.
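The same calculation in R (a minimal sketch; the eigen-decomposition route is our choice, not the thesis'): the stationary vector is the left eigenvector of P for eigenvalue 1, and powers of P illustrate the fundamental limit theorem.

```r
P <- matrix(c(1/2, 1/4, 1/4,
              1/2, 0,   1/2,
              1/4, 1/4, 1/2), nrow = 3, byrow = TRUE)

# Left eigenvector of P for eigenvalue 1, normalised to sum to one,
# solves w P = w, i.e. the system x = xP of the example.
e <- eigen(t(P))
w <- Re(e$vectors[, which.min(abs(e$values - 1))])
w <- w / sum(w)
w      # 0.4 0.2 0.4, i.e. w = (2/5, 1/5, 2/5)

# Fundamental limit theorem: every row of P^n approaches w.
Pn <- diag(3)
for (k in 1:50) Pn <- Pn %*% P
Pn
```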

Remark 1: The fundamental limit theorem may fail for ergodic chains that are not regular. Consider the transition matrix

P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}

The chain is irreducible, hence ergodic by the definition above, but it is periodic: $P^n = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ for $n$ even and $P^n = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ for $n$ odd, so $\lim_{n\to\infty} P^n$ does not exist. However, ergodic Markov chains with finite state space still have a unique probability vector $w$ such that $wP = w$.

Theorem 2.2.3: Let $P$ be the transition matrix of an ergodic Markov chain. Then there is a unique probability vector $w$ such that $wP = w$. Hence, using $w$ as the initial distribution of the chain, the chain has the same distribution at all times, since $w = wP^n$ for any $n \ge 1$. For a regular Markov chain, the initial distribution $w$ satisfying $wP^n = w$ can be interpreted as the long-run probability vector:

\lim_{n\to\infty} P^n(i, j) = w_j \quad \text{for } j = 1, 2, \ldots, s

where $w = [w_1, w_2, \ldots, w_s]$. However, as mentioned before, the limits of the individual n-step probabilities do not necessarily exist for ergodic chains. Nonetheless, the following averages converge:

\lim_{n\to\infty} \frac{\sum_{k=0}^{n} P^k(i, j)}{n + 1} = w_j \quad \text{for } i, j = 1, 2, 3, \ldots, s

where $w = [w_1, w_2, \ldots, w_s]$ is the stationary probability vector of $P$. This is proved by the next theorem, which is a weak law of large numbers for Markov chains.

Theorem 2.2.4: Let $w$ be the initial stationary distribution of an ergodic Markov chain. For $m = 0, 1, 2, \ldots$, let $Y_m = 1$ if the $m$-th step is in state $j$ and 0 otherwise, and let

H^n_j = (Y_0 + \cdots + Y_n)/(n + 1)

be the average count of times in state $j$ during the first $n + 1$ steps. Then, for every $e > 0$ and independently of the initial distribution,

\lim_{n\to\infty} P(|H^n_j - w_j| > e) = 0

2.3 Statistical Analysis of i.i.d. Mixture Distributions

In this section we introduce the statistical analysis of i.i.d. mixture distributions. Let the regime that the process is in at time $t$ be indexed by the unobserved variable $s_t$, where there are $N$ possible regimes ($s_t = 1, 2, \ldots, N$). When the process is in regime 1, the observed variable $y_t$ is presumed to have been drawn from a Gaussian distribution $N(\mu_1, \sigma_1^2)$; if the process is in regime 2, $y_t$ is drawn from another Gaussian distribution $N(\mu_2, \sigma_2^2)$, and so on for all the regimes. Hence the density of $y_t$ conditional on the random variable $s_t$ taking the value $i$ is

f(y_t \mid s_t = i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(y_t - \mu_i)^2}{2\sigma_i^2} \right\}    (2.3.1)

for $i = 1, 2, \ldots, N$, where $\theta$ is a vector of population parameters that includes $\mu_1, \ldots, \mu_N$ and $\sigma_1^2, \ldots, \sigma_N^2$. The unobserved regime $s_t$ is presumed to have been generated by some probability distribution; the unconditional probability that $s_t$ takes the value $i$ is denoted $\pi_i$:

P(s_t = i; \theta) = \pi_i \quad \text{for } i = 1, 2, \ldots, N    (2.3.2)

Since the regime is unobserved, the probabilities $\pi_1, \ldots, \pi_N$ are also included in $\theta$, which is given by

\theta = (\mu_1, \ldots, \mu_N, \sigma_1^2, \ldots, \sigma_N^2, \pi_1, \ldots, \pi_N)

The joint density-distribution function of $y_t$ and $s_t$ is

p(y_t, s_t = i; \theta) = f(y_t \mid s_t = i; \theta)\,P(s_t = i; \theta)    (2.3.3)

(For example, the joint probability that $s_t = i$ and that $y_t$ falls in the interval $[a, b]$ is found by integrating (2.3.3) over all values of $y_t$ between $a$ and $b$.) From equations (2.3.1) and (2.3.2) this function can be written as

p(y_t, s_t = i; \theta) = \frac{\pi_i}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(y_t - \mu_i)^2}{2\sigma_i^2} \right\}    (2.3.4)

Summing (2.3.4) over all possible values of $i$ gives the unconditional density of $y_t$:

f(y_t; \theta) = \sum_{i=1}^{N} p(y_t, s_t = i; \theta) = \frac{\pi_1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{(y_t - \mu_1)^2}{2\sigma_1^2} \right\} + \cdots + \frac{\pi_N}{\sqrt{2\pi}\,\sigma_N} \exp\left\{ -\frac{(y_t - \mu_N)^2}{2\sigma_N^2} \right\}    (2.3.5)

Equation (2.3.5) is the density of the observed data $y_t$, since $s_t$ is unobserved. If the regime variable $s_t$ is distributed i.i.d. across different times $t$, then the log likelihood of the observed data is calculated from (2.3.5) as

\mathcal{L}(\theta) = \sum_{t=1}^{T} \log f(y_t; \theta)    (2.3.6)

The maximum likelihood estimate of $\theta$ is obtained by maximizing (2.3.6) subject to the conditions $\pi_1 + \pi_2 + \cdots + \pi_N = 1$ and $\pi_i \ge 0$ for $i = 1, 2, \ldots, N$.

2.3.1 Inference About the Unobserved Regime

Once we have estimates of $\theta$, obtained by methods such as the EM algorithm, we can make an inference about the regime responsible for producing the date-$t$ observation $y_t$. Recall from statistics the definition of a conditional probability: for any events $A$ and $B$,

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

where the probability $P(B)$ has to be greater than 0; rewritten for the joint probability of $A$ and $B$ occurring together, $P(A \cap B) = P(A \mid B)\,P(B)$. From this definition it follows that

P(s_t = i \mid y_t; \theta) = \frac{p(y_t, s_t = i; \theta)}{f(y_t; \theta)} = \frac{\pi_i f(y_t \mid s_t = i; \theta)}{f(y_t; \theta)}    (2.3.7)

which can be computed from (2.3.1) and (2.3.5) for each observation $y_t$ in the sample. Equation (2.3.7) gives the probability of being in regime $i$ for observation $t$. Assume as an example a mixture of two Gaussian distributions with $y_t \mid s_t = 1 \sim N(0, 1)$ and $y_t \mid s_t = 2 \sim N(2, 1)$, with $P(s_t = 1) = 0.6$. If we observe $y_t = 0$, this observation almost surely comes from the $N(0, 1)$ distribution rather than the $N(2, 1)$ distribution, so $P(s_t = 1 \mid y_t; \theta)$ for that time $t$ is close to 1.
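A minimal R sketch of (2.3.5)-(2.3.7) for this two-component example (function names are ours):

```r
# Mixture log likelihood (2.3.5)-(2.3.6); pi, mu, sigma are length-N vectors.
mixture_loglik <- function(y, pi, mu, sigma) {
  dens <- sapply(seq_along(pi),
                 function(i) pi[i] * dnorm(y, mu[i], sigma[i]))
  sum(log(rowSums(dens)))   # y is assumed to be a vector of observations
}

# Posterior regime probability (2.3.7) for the example:
# y | s=1 ~ N(0,1), y | s=2 ~ N(2,1), P(s=1) = 0.6.
posterior_regime1 <- function(y, pi1 = 0.6) {
  num <- pi1 * dnorm(y, 0, 1)
  num / (num + (1 - pi1) * dnorm(y, 2, 1))
}
posterior_regime1(0)   # about 0.92: y = 0 points strongly to regime 1
posterior_regime1(2)   # about 0.17: y = 2 points to regime 2
```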

2.4 Time Series Models with Regime Switching

2.4.1 Description of the Process

Developing a time series model with regime switching means that the model allows the variables to follow different processes over different sub-samples. Consider, for example, a simple first-order autoregressive process in which both the constant term and the autoregressive coefficient switch with the regime:

y_t = c_{s_t} + \phi_{s_t} y_{t-1} + \varepsilon_t    (2.4.1)

where $\varepsilon_t \sim N(0, \sigma^2)$. We assume that the regime $s_t$ comes from an N-state Markov chain with $s_t$ independent of $\varepsilon_t$ for all $t$. Markov chains can represent such processes well: a regime with a permanent change, for instance, can be represented as an absorbing state of the chain. We might also want a regime-switching time series model that captures short-lived events such as a war (e.g. a world war). It is possible to choose parameters for a Markov chain such that, in 200 years of data, there is one event lasting 4 years, like a war; this means having a regime in the model that represents the war period, and given another 200 years of data the model would allow a similar event to occur again. The essence of the scientific method is the presumption that the past will be reproduced. Another important fact is that Markovian processes have the advantage of flexibility: a Markov chain can specify a probability law consistent with a broad range of different outcomes, with the particular parameters within that class chosen on the basis of the data alone.

In this section we investigate the following model. Let $y_t$ be an $(n \times 1)$ vector of observed endogenous variables and $x_t$ a $(k \times 1)$ vector of observed exogenous variables, and let $Y_t = (y_t', y_{t-1}', \ldots, y_{-m}', x_t', x_{t-1}', \ldots, x_{-m}')'$ be a vector that contains all the observations obtained through time $t$. If the process is governed by regime $s_t = i$ at time $t$, the conditional density of $y_t$ is given by

f(y_t \mid s_t = i, x_t, Y_{t-1}; \alpha)    (2.4.2)

where $\alpha$ is a vector of parameters characterizing the conditional density. Equation (2.4.2) represents the density of $y_t$ for the $N$ different regimes. For $i = 1, 2, \ldots, N$ we collect these densities in an $(N \times 1)$ vector $\eta_t$. For example, for $N = 2$ regimes the vector $\eta_t$ is shown in the following equation:

\eta_t = \begin{bmatrix} f(y_t \mid s_t = 1, y_{t-1}; \alpha) \\ f(y_t \mid s_t = 2, y_{t-1}; \alpha) \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - c_1 - \phi_1 y_{t-1})^2}{2\sigma^2} \right\} \\ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - c_2 - \phi_2 y_{t-1})^2}{2\sigma^2} \right\} \end{bmatrix}

In writing (2.4.2) we have made the assumption that the conditional density depends only on the current regime $s_t$ and not on previous regimes:

f(y_t \mid x_t, Y_{t-1}, s_t = i; \alpha) = f(y_t \mid x_t, Y_{t-1}, s_t = i, s_{t-1} = j, \ldots; \alpha)    (2.4.3)

This is not really restrictive: the conditional density can also be made to depend on previous regimes, not just the current one. Consider, for example, a conditional density of $y_t$ that depends on both $s_t$ and $s_{t-1}$, where $s_t$ is described by a two-state Markov chain; with this device we add more memory to the model. We can define a new variable $s^*_t$ that characterizes the regime at time $t$ as follows:

s^*_t = 1 if s_t = 1 and s_{t-1} = 1
s^*_t = 2 if s_t = 2 and s_{t-1} = 1
s^*_t = 3 if s_t = 1 and s_{t-1} = 2
s^*_t = 4 if s_t = 2 and s_{t-1} = 2

With $p_{ij}$ denoting $P\{s_t = j \mid s_{t-1} = i\}$, the new variable $s^*_t$ follows a four-state Markov chain with transition matrix

P = \begin{bmatrix} p_{11} & 0 & p_{11} & 0 \\ p_{12} & 0 & p_{12} & 0 \\ 0 & p_{21} & 0 & p_{21} \\ 0 & p_{22} & 0 & p_{22} \end{bmatrix}

Hence $\eta_t$ can be represented in the same way as in the previous equation, now with four states:

f(y_t \mid y_{t-1}, s^*_t = 1; \alpha) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - c_1 - \phi_1 y_{t-1})^2}{2\sigma^2} \right\}
f(y_t \mid y_{t-1}, s^*_t = 2; \alpha) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - c_2 - \phi_2 y_{t-1})^2}{2\sigma^2} \right\}
f(y_t \mid y_{t-1}, s^*_t = 3; \alpha) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - c_3 - \phi_3 y_{t-1})^2}{2\sigma^2} \right\}
f(y_t \mid y_{t-1}, s^*_t = 4; \alpha) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - c_4 - \phi_4 y_{t-1})^2}{2\sigma^2} \right\}
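A small R sketch of this state expansion (our own helper, using the text's convention that columns index the state at $t-1$ and rows the state at $t$):

```r
# Expand a two-state chain with transition probabilities p11, p22 into
# the four-state chain for s*_t defined above.
expand_states <- function(p11, p22) {
  p12 <- 1 - p11   # P(s_t = 2 | s_{t-1} = 1)
  p21 <- 1 - p22   # P(s_t = 1 | s_{t-1} = 2)
  matrix(c(p11, 0,   p11, 0,
           p12, 0,   p12, 0,
           0,   p21, 0,   p21,
           0,   p22, 0,   p22), nrow = 4, byrow = TRUE)
}
expand_states(0.9, 0.8)   # columns sum to one, as transition matrices must
```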

It is assumed that $s_t$ follows a Markov chain that is independent of current or past observations of $y_t$ and $x_t$. The transition probability can thus be written more generally as

P\{s_t = j \mid s_{t-1} = i, s_{t-2} = m, \ldots, x_t, Y_{t-1}\} = P\{s_t = j \mid s_{t-1} = i\} = p_{ij}    (2.4.4)

2.4.2 Evaluation of the Likelihood Function

The population parameters consist of $\alpha$ and the transition probabilities $p_{ij}$; we collect these parameters in a vector $\theta$. Unlike the i.i.d. case, the inference about the value of $s_t$ now depends not only on the value of $y_t$ but on all the observations available, $Y_t$. Let $P\{s_t = i \mid Y_t; \theta\}$ denote the inference about the value of $s_t$ based on all observations up to $t$ and the parameters $\theta$; we collect the conditional probabilities $P\{s_t = i \mid Y_t; \theta\}$ for $i = 1, 2, \ldots, N$ in an $(N \times 1)$ vector $\hat{\xi}_{t|t}$. It is also possible to forecast the regime in period $t+1$ given all the observations up to $t$; we collect these forecasts in an $(N \times 1)$ vector $\hat{\xi}_{t+1|t}$ whose $j$-th element is $P\{s_{t+1} = j \mid Y_t; \theta\}$. The optimal inference and forecast for every time $t$ in the sample can be found by iterating on the pair of equations

\hat{\xi}_{t|t} = \frac{\hat{\xi}_{t|t-1} \odot \eta_t}{\mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)}    (2.4.5)

\hat{\xi}_{t+1|t} = P \, \hat{\xi}_{t|t}    (2.4.6)

where $\eta_t$ is the $(N \times 1)$ vector of conditional densities $f(y_t \mid s_t = i, x_t, Y_{t-1}; \alpha)$ for $i = 1, 2, \ldots, N$, $P$ is the $(N \times N)$ transition matrix, $\odot$ denotes element-by-element multiplication and $\mathbf{1}$ is the $(N \times 1)$ vector of 1s. Given a starting value $\hat{\xi}_{1|0}$ and initial values for the population parameter vector $\theta$, we iterate on (2.4.5) and (2.4.6) for $t = 1, 2, \ldots, T$ to calculate $\hat{\xi}_{t|t}$ and $\hat{\xi}_{t+1|t}$ for each time $t$ in the sample. The log likelihood function for the observed data $Y_T$ is

\mathcal{L}(\theta) = \sum_{t=1}^{T} \log f(y_t \mid x_t, Y_{t-1}; \theta)    (2.4.7)

where

f(y_t \mid x_t, Y_{t-1}; \theta) = \mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)    (2.4.8)
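A minimal R sketch of this filter (our own names; for convenience the transition matrix here is stored with rows indexing the state at $t-1$, the transpose of the text's convention, so the update (2.4.6) becomes multiplication by the transpose):

```r
# Filter of (2.4.5)-(2.4.8). eta: T x N matrix of conditional densities
# f(y_t | s_t = i, ...); P: N x N matrix with P[i, j] = p_ij; xi1: the
# starting vector xi_{1|0}. Returns filtered and predicted probabilities
# plus the log likelihood (2.4.7).
hamilton_filter <- function(eta, P, xi1) {
  T <- nrow(eta); N <- ncol(eta)
  xi_filt <- matrix(0, T, N)   # xi_{t|t}
  xi_pred <- matrix(0, T, N)   # xi_{t|t-1}
  pred <- xi1
  loglik <- 0
  for (t in 1:T) {
    xi_pred[t, ] <- pred
    joint <- pred * eta[t, ]            # numerator of (2.4.5)
    denom <- sum(joint)                 # f(y_t | ...) as in (2.4.8)
    xi_filt[t, ] <- joint / denom
    loglik <- loglik + log(denom)       # accumulates (2.4.7)
    pred <- as.vector(t(P) %*% xi_filt[t, ])   # (2.4.6)
  }
  list(xi_filt = xi_filt, xi_pred = xi_pred, loglik = loglik)
}
```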

To explain the derivation of equations (2.4.5) through (2.4.8), recall the assumption that $x_t$ is exogenous, meaning it contains no information about $s_t$ beyond that contained in $Y_{t-1}$. Hence the $i$-th element of $\hat{\xi}_{t|t-1}$ can be described as $P\{s_t = i \mid x_t, Y_{t-1}; \theta\}$, and the $i$-th element of $\eta_t$ is $f(y_t \mid s_t = i, x_t, Y_{t-1}; \theta)$. Element-by-element multiplication of these two vectors, $\hat{\xi}_{t|t-1} \odot \eta_t$, gives an $(N \times 1)$ vector whose elements are the conditional joint density-distribution of $y_t$ and $s_t$:

P\{s_t = i \mid x_t, Y_{t-1}; \theta\}\,f(y_t \mid s_t = i, x_t, Y_{t-1}; \theta) = p(y_t, s_t = i \mid x_t, Y_{t-1}; \theta)    (2.4.9)

The density of $y_t$ conditional on the past observables is the sum of (2.4.9) over $i = 1, 2, \ldots, N$, which in vector notation is

f(y_t \mid x_t, Y_{t-1}; \theta) = \mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)

as stated in equation (2.4.8). For the conditional distribution of $s_t$ we divide the joint density-distribution by the density of $y_t$:

\frac{p(y_t, s_t = i \mid x_t, Y_{t-1}; \theta)}{f(y_t \mid x_t, Y_{t-1}; \theta)} = P\{s_t = i \mid y_t, x_t, Y_{t-1}; \theta\} = P\{s_t = i \mid Y_t; \theta\}

Hence, using equation (2.4.8) to replace the conditional density of $y_t$ with $\mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)$, we get

P\{s_t = i \mid Y_t; \theta\} = \frac{p(y_t, s_t = i \mid x_t, Y_{t-1}; \theta)}{\mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)}    (2.4.10)

Because of equation (2.4.9), the numerator in (2.4.10) is the $i$-th element of the vector $\hat{\xi}_{t|t-1} \odot \eta_t$, while the left side of (2.4.10) is the $i$-th element of the vector $\hat{\xi}_{t|t}$. Thus, collecting the equations (2.4.10) for $i = 1, 2, \ldots, N$ in an $(N \times 1)$ vector yields equation (2.4.5). For the derivation of $\hat{\xi}_{t+1|t} = P\hat{\xi}_{t|t}$, write the state-indicator recursion $\xi_{t+1} = P\xi_t + v_{t+1}$ and take expectations conditional on the information $Y_t$:

E(\xi_{t+1} \mid Y_t) = P\,E(\xi_t \mid Y_t) + E(v_{t+1} \mid Y_t)    (2.4.11)

Since $v_{t+1}$ is a martingale difference sequence with respect to $Y_t$, we have $E(v_{t+1} \mid Y_t) = 0$, and we conclude that $\hat{\xi}_{t+1|t} = P\hat{\xi}_{t|t}$. To start the algorithm we need an initial value for $\hat{\xi}_{1|0}$. One approach is to set $\hat{\xi}_{1|0} = \rho$, where $\rho$ is an $(N \times 1)$ vector of non-negative constants.

2.4.3 Forecasts for the Regimes

Consider the $(N \times 1)$ vector $\hat{\xi}_{t|\tau}$ whose $i$-th element is $P\{s_t = i \mid Y_\tau; \theta\}$. For $t > \tau$ this represents a forecast of the regime at a future time, while for $t < \tau$ it represents the smoothed inference about the regime.

By smoothed inference we mean the inference obtained from the density conditioned on the whole information in the sample. The n-period-ahead forecast $\hat{\xi}_{t+n|t}$ can be found by taking expectations of the state recursion on both sides, conditional on information up to time $t$, $E(\xi_{t+n} \mid Y_t) = P^n E(\xi_t \mid Y_t)$; thus

\hat{\xi}_{t+n|t} = P^n \hat{\xi}_{t|t}    (2.4.12)

where $\hat{\xi}_{t|t}$ can be calculated from equation (2.4.5). The smoothed inferences, which condition on the whole information sample, can be calculated using the algorithm developed by Kim (1994). This algorithm can be written in vector form as

\hat{\xi}_{t|T} = \hat{\xi}_{t|t} \odot \left\{ P' \left[ \hat{\xi}_{t+1|T} \,(\div)\, \hat{\xi}_{t+1|t} \right] \right\}    (2.4.13)

where $(\div)$ denotes element-by-element division. Iterating this equation backward for $t = T-1, T-2, \ldots, 1$ gives the smoothed probabilities $\hat{\xi}_{t|T}$. The algorithm works under some assumptions: that $s_t$ follows a first-order Markov chain, that the conditional density of $y_t$ in (2.4.2) depends only on the current state $s_t$, and that $x_t$ is independent of $s_\tau$ for all $t$ and $\tau$.
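A minimal R sketch of (2.4.13), consuming the output of the hamilton_filter sketch above (again with the row-indexed transition matrix, which makes the text's $P'$ a plain multiplication by $P$):

```r
# Kim's smoothing algorithm (2.4.13), run backwards from t = T-1 to 1.
# xi_filt and xi_pred are the T x N matrices returned by hamilton_filter();
# P is the row-indexed transition matrix used there.
kim_smoother <- function(xi_filt, xi_pred, P) {
  T <- nrow(xi_filt)
  xi_smooth <- xi_filt                  # row T is already xi_{T|T}
  for (t in (T - 1):1) {
    ratio <- xi_smooth[t + 1, ] / xi_pred[t + 1, ]
    xi_smooth[t, ] <- xi_filt[t, ] * as.vector(P %*% ratio)
  }
  xi_smooth
}
```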

2.4.4 Forecasts for the Observed Variables

From the conditional density of $y_t$ in (2.4.2) we can forecast $y_{t+1}$ if we know $Y_t$, $x_{t+1}$ and $s_{t+1}$. For the sake of simplicity we use the same AR(1) model, $y_t = c_{s_t} + \phi_{s_t} y_{t-1} + \varepsilon_t$. Taking expectations conditional on $Y_t$ and $s_{t+1}$ gives

E(y_{t+1} \mid s_{t+1} = i, Y_t; \theta) = \hat{y}_{t+1} = c_i + \phi_i y_t    (2.4.14)

For each state we have a different forecast; thus there are $N$ different conditional forecasts associated with the possible values of the future state $s_{t+1}$. Taking expectations to form forecasts based only on observable variables shows the relation to the conditional forecasts based on the unobservable state:

E(y_{t+1} \mid x_{t+1}, Y_t; \theta) = \int y_{t+1} f(y_{t+1} \mid x_{t+1}, Y_t; \theta)\,dy_{t+1}
= \int y_{t+1} \left\{ \sum_{i=1}^{N} p(y_{t+1}, s_{t+1} = i \mid x_{t+1}, Y_t; \theta) \right\} dy_{t+1}
= \int y_{t+1} \left\{ \sum_{i=1}^{N} f(y_{t+1} \mid s_{t+1} = i, x_{t+1}, Y_t; \theta)\,P\{s_{t+1} = i \mid x_{t+1}, Y_t; \theta\} \right\} dy_{t+1}
= \sum_{i=1}^{N} P\{s_{t+1} = i \mid x_{t+1}, Y_t; \theta\} \int y_{t+1} f(y_{t+1} \mid s_{t+1} = i, x_{t+1}, Y_t; \theta)\,dy_{t+1}
= \sum_{i=1}^{N} P\{s_{t+1} = i \mid Y_t; \theta\}\,E(y_{t+1} \mid s_{t+1} = i, x_{t+1}, Y_t; \theta)    (2.4.15)

Thus the forecast appropriate to the $i$-th regime is simply multiplied by the probability of being in the $i$-th regime, and the resulting $N$ products are added together. Collecting the forecasts $E(y_{t+1} \mid s_{t+1} = i, Y_t; \theta) = \hat{y}_{t+1}$ for $i = 1, 2, \ldots, N$ in a $(1 \times N)$ vector $b_t'$, we get

E(y_{t+1} \mid Y_t; \theta) = b_t' \hat{\xi}_{t+1|t}    (2.4.16)

Note that the optimal forecast of $y_{t+1}$ is a non-linear function of the observable variables, since $\hat{\xi}_{t|t}$ depends non-linearly on $Y_t$. Markov chains are also well suited to multi-period forecasts; for further discussion see Hamilton (1989, 1993b, 1993c).

2.4.5 Maximum Likelihood Estimation of Parameters

In the previous sections we discussed the derivation and the evaluation of the likelihood function; here we compute the value of $\theta$ that maximizes the log likelihood. If the transition probabilities satisfy only the conditions $p_{ij} \ge 0$ and $p_{i1} + p_{i2} + \cdots + p_{iN} = 1$ for all $i, j$, and the initial value $\hat{\xi}_{1|0}$ is taken as a fixed $\rho$, then Hamilton (1990, "Analysis of Time Series Subject to Changes in Regime", Journal of Econometrics 45) showed that the maximum likelihood estimates of the transition probabilities satisfy

\hat{p}_{ij} = \frac{\sum_{t=2}^{T} P\{s_t = j, s_{t-1} = i \mid Y_T; \hat{\theta}\}}{\sum_{t=2}^{T} P\{s_{t-1} = i \mid Y_T; \hat{\theta}\}}    (2.4.17)

where $\hat{\theta}$ denotes the vector of maximum likelihood estimates. Hence, the estimate of the transition probability $p_{ij}$ is the number of times state $j$ follows state $i$, divided by the number of times the process was in state $i$.

If $\rho$ satisfies $\mathbf{1}'\rho = 1$ and $\rho \ge 0$ and is regarded as a separate parameter vector, the maximum likelihood estimate of $\rho$ is the smoothed inference about the initial state:

\hat{\rho} = \hat{\xi}_{1|T}    (2.4.18)

The maximum likelihood estimate of the vector $\alpha$ that governs the conditional density $f(y_t \mid s_t = i, x_t, Y_{t-1}; \alpha)$ is characterized by

\sum_{t=1}^{T} \left( \frac{\partial \log \eta_t}{\partial \alpha} \right)' \hat{\xi}_{t|T} = 0    (2.4.19)

where $\eta_t$ is the $(N \times 1)$ vector produced by vertically stacking the densities of $y_t$ for $i = 1, 2, \ldots, N$, and $\partial \log \eta_t / \partial \alpha$ is an $(N \times k)$ matrix of derivatives of the log densities. An example is a Markov-switching regression model of the form

y_t = z_t'\beta_{s_t} + \epsilon_t    (2.4.20)

where $z_t$ is a vector of explanatory variables (possibly containing lagged values of $y$) and $\epsilon_t \sim$ i.i.d. $N(0, \sigma^2)$. The regression coefficient for this process is the vector $\beta_{s_t}$, which switches along with the state. The vector $\eta_t$ is

\eta_t = \begin{bmatrix} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - z_t'\beta_1)^2}{2\sigma^2} \right\} \\ \vdots \\ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_t - z_t'\beta_N)^2}{2\sigma^2} \right\} \end{bmatrix}

For $\alpha$ of the form $\alpha = (\beta_1', \beta_2', \ldots, \beta_N', \sigma^2)'$, the condition (2.4.19) becomes

\sum_{t=1}^{T} (y_t - z_t'\hat{\beta}_i)\,z_t\,P\{s_t = i \mid Y_T; \hat{\theta}\} = 0 \quad \text{for } i = 1, 2, \ldots, N    (2.4.21)

\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} (y_t - z_t'\hat{\beta}_i)^2\,P\{s_t = i \mid Y_T; \hat{\theta}\}    (2.4.22)

Equation (2.4.21) describes $\hat{\beta}_i$ as a weighted OLS orthogonality condition, in which observations are weighted by the regime probability. More specifically, the estimate $\hat{\beta}_i$ can be found from an OLS regression of $\tilde{y}_t(i)$ on $\tilde{z}_t(i)$:

\hat{\beta}_i = \left[ \sum_{t=1}^{T} \tilde{z}_t(i)\,\tilde{z}_t(i)' \right]^{-1} \left[ \sum_{t=1}^{T} \tilde{z}_t(i)\,\tilde{y}_t(i) \right]    (2.4.23)

where

\tilde{y}_t(i) = y_t \sqrt{P\{s_t = i \mid Y_T; \hat{\theta}\}}, \qquad \tilde{z}_t(i) = z_t \sqrt{P\{s_t = i \mid Y_T; \hat{\theta}\}}    (2.4.24)

For further information see Hamilton (1989, 1990).
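A minimal R sketch of the weighted OLS step (2.4.23)-(2.4.24) (names are ours; prob_i holds the smoothed probabilities $P\{s_t = i \mid Y_T; \hat{\theta}\}$ for $t = 1, \ldots, T$):

```r
# Regime-i regression coefficients by OLS on sqrt(probability)-weighted data.
weighted_beta <- function(y, Z, prob_i) {
  w  <- sqrt(prob_i)
  Zi <- Z * w     # each row of Z scaled by the sqrt of its weight
  yi <- y * w
  solve(crossprod(Zi), crossprod(Zi, yi))   # (2.4.23)
}
```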

2.4.6 EM Algorithm

Another method for maximizing the likelihood function with unobserved variables is the EM algorithm, originally developed by Dempster, Laird and Rubin [9] (1977). The EM algorithm consists of two steps at the $k$-th iteration: the expectation step and the maximization step. For the presentation we assume the following Markov Switching model with structural breaks in the parameters, with two regimes $s_t = 0$ or $1$ governed by a two-state Markov process:

y_t = \beta_{s_t}' x_t + \epsilon_t, \quad t = 1, 2, \ldots, T

with

\beta_{s_t} = (1 - s_t)\beta_0 + s_t\beta_1
\sigma^2_{s_t} = (1 - s_t)\sigma^2_0 + s_t\sigma^2_1
\epsilon_t \sim N(0, \sigma^2_{s_t})

and transition probabilities

P[s_t = 0 \mid s_{t-1} = 0] = p_{00}, \qquad P[s_t = 1 \mid s_{t-1} = 1] = p_{11}

Under regime 1, with $s_t = 0$, the parameters are given by $\beta_0, \sigma^2_0$; under regime 2, with $s_t = 1$, the parameters are given by $\beta_1, \sigma^2_1$. As mentioned above, the EM algorithm has two steps.

Step 1: Given the estimates of the parameter vector $\theta$ obtained at the $(k-1)$-th iteration, we make an inference about the unobserved variables $s_t$. This can be done with Kim's smoothing algorithm (Kim [30], 1994).

Kim's Smoothing Algorithm: given the parameter estimates, we want to make inference on the state variable $s_t$ conditional on the whole information in the sample, that is, $P[s_t \mid \psi_T]$ for $t = 1, 2, \ldots, T$.

Setting $s_t = j$ and $s_{t+1} = m$, we derive the joint probability of $s_t, s_{t+1}$ conditional on the whole information:

P[s_t = j, s_{t+1} = m \mid \psi_T] = P[s_t = j \mid s_{t+1} = m, \psi_T] \cdot P[s_{t+1} = m \mid \psi_T]
= P[s_t = j \mid s_{t+1} = m, \psi_t] \cdot P[s_{t+1} = m \mid \psi_T]
= \frac{P[s_t = j, s_{t+1} = m \mid \psi_t]}{P[s_{t+1} = m \mid \psi_t]} \cdot P[s_{t+1} = m \mid \psi_T]
= \frac{P[s_{t+1} = m \mid s_t = j] \cdot P[s_t = j \mid \psi_t]}{P[s_{t+1} = m \mid \psi_t]} \cdot P[s_{t+1} = m \mid \psi_T]    (2.4.25)

and

P[s_t = j \mid \psi_T] = \sum_{m=1}^{N} P[s_t = j, s_{t+1} = m \mid \psi_T]    (2.4.26)

Step 2: Having estimated the unobserved variables, we maximize the likelihood function with respect to the parameters $\theta^k$ of the model, the parameter estimates obtained at the $k$-th iteration. Every iteration of this algorithm gives a higher value for the likelihood function. We begin with arbitrary initial values $\theta^0$ for the model parameters and iterate these two steps until $\theta^k$ converges.

The maximization step of the EM algorithm, as applied to the above Markov Switching model, was discussed in Hamilton (1990) and applied in Engel and Hamilton (1990) and Nelson et al. (1989). The parameters of the model are $\theta = (\beta_0', \beta_1', \sigma^2_0, \sigma^2_1, p_{00}, p_{11})'$. We denote $\tilde{y}_T = (y_1, y_2, \ldots, y_T)$ and $\tilde{s}_T = (s_1, s_2, \ldots, s_T)$. We can split the parameter vector into two sets, $\theta_1 = (\beta_0', \beta_1', \sigma^2_0, \sigma^2_1)'$ and $\theta_2 = (p_{00}, p_{11})'$, so the parameter vector is rewritten as $\theta = (\theta_1', \theta_2')'$. The complete-data likelihood can then be factored as

p(\tilde{y}_T, \tilde{s}_T; \theta) = p(\tilde{y}_T \mid \tilde{s}_T; \theta_1)\,p(\tilde{s}_T; \theta_2) = \prod_{t=1}^{T} p(y_t \mid s_t; \theta_1) \prod_{t=1}^{T} p(s_t \mid s_{t-1}; \theta_2)    (2.4.27)

The log of this equation is

\ln[p(\tilde{y}_T, \tilde{s}_T; \theta)] = \sum_{t=1}^{T} \ln[p(y_t \mid s_t; \theta_1)] + \sum_{t=1}^{T} \ln[p(s_t \mid s_{t-1}; \theta_2)]    (2.4.28)

At this stage of the algorithm we have two options, depending on our knowledge of $\tilde{s}_T$. If $\tilde{s}_T$ is observed, then the log likelihood function is maximized with respect to $\theta_1$ only; thus we have

\frac{\partial \ln[p(\tilde{y}_T, \tilde{s}_T; \theta)]}{\partial \theta_1} = \sum_{t=1}^{T} \frac{\partial \ln[p(y_t \mid s_t; \theta_1)]}{\partial \theta_1} = 0

If $\tilde{s}_T$ is unobserved, we have to maximize the log likelihood with respect to the whole parameter vector $\theta$. In this situation we can maximize the expected log likelihood, defined as $Z(\theta; \tilde{y}_T, \theta^{k-1})$, where the expectation is formed conditional on $\theta^{k-1}$:

Z(\theta; \tilde{y}_T, \theta^{k-1}) = \sum_{\tilde{s}_T} \ln[p(\tilde{y}_T, \tilde{s}_T; \theta)]\,p(\tilde{y}_T, \tilde{s}_T; \theta^{k-1})
= \sum_{\tilde{s}_T} \ln[p(\tilde{y}_T \mid \tilde{s}_T; \theta_1)\,p(\tilde{s}_T; \theta_2)]\,p(\tilde{y}_T, \tilde{s}_T; \theta^{k-1})    (2.4.29)

Having formed the expected log likelihood, we maximize $Z$ with respect to $\theta_1$. As above, we have the condition

\frac{\partial Z(\theta; \tilde{y}_T, \theta^{k-1})}{\partial \theta_1} = \sum_{\tilde{s}_T} \frac{\partial \ln[p(\tilde{y}_T \mid \tilde{s}_T; \theta_1)]}{\partial \theta_1}\,p(\tilde{y}_T, \tilde{s}_T; \theta^{k-1}) = 0

Dividing both sides by $p(\tilde{y}_T; \theta^{k-1})$ we get

\sum_{\tilde{s}_T} \frac{\partial \ln[p(\tilde{y}_T \mid \tilde{s}_T; \theta_1)]}{\partial \theta_1} \frac{p(\tilde{y}_T, \tilde{s}_T; \theta^{k-1})}{p(\tilde{y}_T; \theta^{k-1})} = 0
\;\Rightarrow\; \sum_{\tilde{s}_T} \frac{\partial \ln[p(\tilde{y}_T \mid \tilde{s}_T; \theta_1)]}{\partial \theta_1}\,p(\tilde{s}_T \mid \tilde{y}_T; \theta^{k-1}) = 0
\;\Rightarrow\; \sum_{t=1}^{T} \sum_{s_t=0}^{1} \frac{\partial \ln[p(y_t \mid s_t; \theta_1)]}{\partial \theta_1}\,p(s_t \mid \tilde{y}_T; \theta^{k-1}) = 0    (2.4.30)

This equation provides closed-form solutions for $\theta_1^k$, the estimate of $\theta_1$ at the $k$-th iteration. Given $s_t = j$ we have

\ln[p(y_t \mid s_t = j; \theta_1)] = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2_j) - \frac{1}{2}\frac{(y_t - x_t'\beta_j)^2}{\sigma^2_j}    (2.4.31)

and, with the smoothed probabilities of $s_t$ in hand, the computation of the estimates is straightforward. To complete the algorithm we evaluate (2.4.30) using (2.4.31), differentiating with respect to $\beta_j$ and $\sigma^2_j$ for $j = 0, 1$. Thus we have

\sum_{t=1}^{T} \sum_{s_t=0}^{1} \frac{\partial \ln[p(y_t \mid s_t; \theta_1)]}{\partial \beta_j}\,p(\tilde{s}_T \mid \tilde{y}_T; \theta^{k-1}) = 0    (2.4.32)

and

\sum_{t=1}^{T} \frac{x_t(y_t - x_t'\beta_j)}{\sigma^2_j}\,p(s_t = j \mid \tilde{y}_T; \theta^{k-1}) = 0    (2.4.33)

Hence we also have

\sum_{t=1}^{T} \left\{ -\frac{1}{\sigma^2_j} + \frac{(y_t - x_t'\beta_j)^2}{\sigma^4_j} \right\} p(s_t = j \mid \tilde{y}_T; \theta^{k-1}) = 0    (2.4.34)

Solving equations (2.4.33) and (2.4.34) we get the solutions for $\beta^k_j$ and $\sigma^{2k}_j$:

\beta^k_j = \left( \sum_t x_t x_t'\,p(s_t = j \mid \tilde{y}_T; \theta^{k-1}) \right)^{-1} \left( \sum_t x_t y_t\,p(s_t = j \mid \tilde{y}_T; \theta^{k-1}) \right), \quad j = 0, 1    (2.4.35)

\sigma^{2k}_j = \frac{\sum_t (y_t - x_t'\beta^k_j)^2\,p(s_t = j \mid \tilde{y}_T; \theta^{k-1})}{\sum_t p(s_t = j \mid \tilde{y}_T; \theta^{k-1})}, \quad j = 0, 1    (2.4.36)

Differentiating the log likelihood with respect to the transition probabilities gives the solution for $p^k_{jj}$:

p^k_{jj} = \frac{\sum_t p(s_t = j, s_{t-1} = j \mid \tilde{y}_T; \theta^{k-1})}{\sum_t p(s_{t-1} = j \mid \tilde{y}_T; \theta^{k-1})}, \quad j = 0, 1    (2.4.37)

Thus we see that the EM algorithm gives us closed-form solutions.
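A minimal R sketch of these closed-form M-step updates for one regime $j$ (our own names; p holds the smoothed probabilities $p(s_t = j \mid \tilde{y}_T; \theta^{k-1})$ for $t = 1, \ldots, T$, and p2 the pairwise probabilities $p(s_t = j, s_{t-1} = j \mid \cdot)$ for $t = 2, \ldots, T$):

```r
# One M-step of the EM algorithm, equations (2.4.35)-(2.4.37).
m_step <- function(y, X, p, p2) {
  beta   <- solve(t(X) %*% (X * p), t(X) %*% (y * p))   # (2.4.35)
  res    <- y - as.vector(X %*% beta)
  sigma2 <- sum(res^2 * p) / sum(p)                     # (2.4.36)
  pjj    <- sum(p2) / sum(p[-length(p)])                # (2.4.37): lagged denominator
  list(beta = beta, sigma2 = sigma2, pjj = pjj)
}
```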

2.5 Markov Switching Model of Conditional Mean

In the previous section we introduced time series models with regime switching. Plenty of empirical evidence has shown that different patterns over time are quite common for economic and financial variables; hence a model that can capture all the patterns at once is needed. The Markov switching model is one of the best-known models of this kind in the literature; it is constructed by combining N different dynamic models via a Markovian switching mechanism. In this section we present a simple Markov switching AR model, following again Hamilton (1989, 1994).

2.5.1 A Simple AR Model

Let $s_t$ denote an unobservable variable that represents the state of the process. Assume that $s_t$ has only two states, so it takes only the two values $\{0, 1\}$: 0 for state 1 and 1 for state 2. A simple switching model for the variable $y_t$ is given by two AR specifications:

y_t = \begin{cases} \alpha_0 + \beta y_{t-1} + \epsilon_t, & s_t = 0 \\ \alpha_0 + \alpha_1 + \beta y_{t-1} + \epsilon_t, & s_t = 1 \end{cases}    (2.5.1)

where $\epsilon_t \sim N(0, \sigma^2_\epsilon)$ and $|\beta| < 1$. When we are in state 1, with $s_t = 0$, the AR(1) is stationary with mean $\alpha_0/(1-\beta)$; when the process switches to $s_t = 1$ we have another stationary AR(1), since $|\beta| < 1$, with mean $(\alpha_0 + \alpha_1)/(1-\beta)$. If $\alpha_1 \neq 0$ we have two different dynamic structures, depending on the unobserved variable $s_t$. Hence $y_t$ is governed by two distributions with distinct means, and $s_t$ is responsible for the switching between these two distributions. For a Markov Switching model we assume that $s_t$ satisfies the Markovian property, here following a first-order Markov chain with transition matrix

P = \begin{bmatrix} P(s_t = 0 \mid s_{t-1} = 0) & P(s_t = 1 \mid s_{t-1} = 0) \\ P(s_t = 0 \mid s_{t-1} = 1) & P(s_t = 1 \mid s_{t-1} = 1) \end{bmatrix} = \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix}    (2.5.2)

where $p_{ij}$ denotes the transition probability of $s_t = j$ given that the previous state was $s_{t-1} = i$. As mentioned before, the transition probabilities satisfy $p_{i0} + p_{i1} = 1$. The transition matrix governs the behaviour of the states and drives the switching between the regimes over time; it contains only two free parameters, $(p_{00}, p_{11})$, since $s_t$ follows a first-order Markov chain. In the Markov Switching model the properties of $y_t$ are determined by the random characteristics of the driving innovations $\epsilon_t$ and of the state variable $s_t$. In particular, the Markovian state variable yields random and frequent changes of the model structure, and its transition probabilities determine the persistence of each regime. The Markov Switching model is relatively easy to specify because there is no need to choose the switching mechanism a priori, unlike the threshold model, which requires an a priori threshold variable $\lambda_t$ responsible for the switching mechanism; in addition, the regime classification in the MSM is probabilistic and determined by the data. One difficulty of the Markov Switching model is that it is not always easy to estimate, because the state variables are unobservable.

We can extend the previous model to allow for more general dynamic structures. Extending equation (2.5.1):

y_t = \alpha_0 + \alpha_1 s_t + \beta_1 y_{t-1} + \cdots + \beta_k y_{t-k} + \epsilon_t    (2.5.3)

where $s_t$ again takes the two values $\{0, 1\}$, the transition matrix is the same as before and $\epsilon_t \sim N(0, \sigma^2_\epsilon)$. This model has a general AR(k) dynamic structure. We can also rewrite the model for a d-dimensional time series $y_t$:

y_t = \alpha_0 + \alpha_1 s_t + B_1 y_{t-1} + \cdots + B_k y_{t-k} + \epsilon_t    (2.5.4)

where $s_t$ stands for the state variable as before with the same transition matrix, $B_i$ for $i = 1, 2, \ldots, k$ are $(d \times d)$ matrices of parameters, and the $\epsilon_t$ are i.i.d. random vectors with mean zero and variance-covariance matrix $\Sigma_0$; (2.5.4) is a VAR model with switching intercepts. We can generalize further by extending the previous model to allow the state variable to take m values, giving the m-state Markov Switching model; the only difference from the models above is that the transition matrix P is expanded accordingly.
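A minimal R sketch simulating (2.5.1)-(2.5.2), reusing the simulate_chain() helper from Section 2.2 (the parameter values are ours, purely illustrative):

```r
set.seed(42)
P <- matrix(c(0.95, 0.05,
              0.10, 0.90), nrow = 2, byrow = TRUE)   # rows index s_{t-1}
s <- simulate_chain(P, 500) - 1L                     # recode states to {0, 1}

alpha0 <- 0; alpha1 <- 2; beta <- 0.6; sigma <- 1
y <- numeric(500)
for (t in 2:500) {
  y[t] <- alpha0 + alpha1 * s[t] + beta * y[t - 1] + rnorm(1, 0, sigma)
}
plot(y, type = "l")   # mean shifts trace the hidden regime path
```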

2.5.2 Markov Trend

In this section we present the form of a Markov trend in a Markov Switching model. Let $x_t$ be the observed time series, and suppose it contains a unit root. The existence of a unit root means that the time series is not stationary, and for that reason we take the first difference: we apply the Markov Switching model to $y_t = \Delta x_t = x_t - x_{t-1}$. If $x_t$ consists of quarterly data containing a seasonal unit root, we apply the model to the seasonally differenced series $y_t = \Delta_4 x_t = x_t - x_{t-4}$. When the time series $x_t$ has a unit root, the switching intercept in $y_t$ produces a deterministic trend with breaks in $x_t$. Assuming that $y_t$ has the form (2.5.3), $x_t$ can be presented as

x_t = \left( \alpha_0 t + \alpha_1 \sum_{i=1}^{t} s_i \right) + \beta_1 y_{t-1} + \cdots + \beta_k y_{t-k} + \sum_{i=1}^{t} \epsilon_i    (2.5.5)

where the term in parentheses is a trend function with changes, the $\beta_i y_{t-i}$ terms are the dynamic component of the model, and $\sum_{i=1}^{t} \epsilon_i$ is the stochastic trend. The resulting trend is widely known in the literature as a Markov trend. The trend function evidently depends on $s_t$: its slope is characterized by $\alpha_0$, and when $s_i = 1$ the trend function moves upward if $\alpha_1 > 0$ and downward if $\alpha_1 < 0$. So when $s_i = 1$ we have a slope change in the trend function, and the function resumes the original slope, without any change, when $s_i = 0$.

2.6 Markov Switching Model of Conditional Variance

In addition to the Markov Switching model of the conditional mean, it is very useful to introduce a Markov Switching mechanism into conditional variance models; we illustrate with the GARCH model with Markov Switching. A simple GARCH(p, q) model is $y_t = \sqrt{z_t}\,\epsilon_t$, where

z_t = c + \sum_{i=1}^{q} \alpha_i y^2_{t-i} + \sum_{i=1}^{p} \beta_i z_{t-i}    (2.6.1)

is the conditional variance of $y_t$ given all the information up to time $t-1$, and the $\epsilon_t$ are i.i.d. random variables with mean zero and variance 1. If $z_t$ does not depend on its lagged values, the model above reduces to an ARCH(q) model. Setting $p = q = 1$ we have the GARCH(1,1) model

z_t = c + \alpha_1 y^2_{t-1} + \beta_1 z_{t-1}
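A minimal R sketch of this GARCH(1,1) variance recursion (our own helper; it initialises at the unconditional variance, assuming $\alpha_1 + \beta_1 < 1$):

```r
# Conditional variance path of a GARCH(1,1): z_t = c + a1*y_{t-1}^2 + b1*z_{t-1}.
garch11_var <- function(y, c, a1, b1) {
  T <- length(y)
  z <- numeric(T)
  z[1] <- c / (1 - a1 - b1)   # unconditional variance as starting value
  for (t in 2:T) z[t] <- c + a1 * y[t - 1]^2 + b1 * z[t - 1]
  z
}
```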

The GARCH model is widely known, and many empirical studies have shown that the GARCH(1,1) model captures the volatility patterns of many time series. Interestingly, the sum of the estimated $\alpha_1$ and $\beta_1$ coefficients is typically close to 1. From equation (2.6.1) we can write $y^2_t$ as an ARMA(1,1) model:

y^2_t = z_t \epsilon^2_t = c + (\alpha_1 + \beta_1) y^2_{t-1} - \beta_1 (y^2_{t-1} - z_{t-1}) + (y^2_t - z_t)    (2.6.2)

where $y^2_t - z_t$ is an innovation with mean zero. Thus, when $\alpha_1 + \beta_1 = 1$, $y^2_t$ has a unit root, and the resulting $z_t$ are highly persistent; in this case $z_t$ is said to be an integrated GARCH (IGARCH) process. This model is problematic because, as Lamoureux and Lastrapes (1990) pointed out, the IGARCH pattern has no theoretical motivation and may simply reflect ignored parameter changes in the GARCH model. Let $\Phi_{t-1}$ denote the information set up to time $t-1$ and $z_{i,t} = var(y_t \mid s_t = i, \Phi_{t-1})$. Consider an ARCH(q) model with switching intercepts, $y_t = \sqrt{z_{i,t}}\,\epsilon_t$, where

z_{i,t} = \alpha_0 + \alpha_1 i + \sum_{j=1}^{q} \alpha_j y^2_{t-j}, \quad i = 0, 1    (2.6.3)

which was the proposal of Cai (1994). Hamilton and Susmel (1994) proposed the SWARCH(q) model, $y_t = \sqrt{z_{i,t}}\,\epsilon_t$, where

z_{i,t} = \lambda_i \eta_t = \lambda_i \left( c + \sum_{j=1}^{q} \alpha_j \zeta^2_{t-j} \right), \quad i = 0, 1    (2.6.4)

These are two proposals for the conditional variance in the two regimes $\{1, 2\}$: in (2.6.3) the regimes differ by level shifts, whereas in (2.6.4) they differ by scale. Extending the two models (2.6.3) and (2.6.4) with lagged conditional variances is not straightforward, however. To see this, observe that when the conditional variance $z_{i,t}$ depends on $z_{i,t-1}$, it is determined not only by $s_t$ but also by $s_{t-1}$, due to the presence of $z_{i,t-1}$; the dependence of $z_{i,t-1}$ on $z_{i,t-2}$ implies that $z_{i,t}$ must also be affected by the value of $s_{t-2}$, and so on. Consequently, the conditional variance at time $t$ is in effect determined by the whole history $(s_t, s_{t-1}, \ldots, s_1)$, which has $2^t$ possible values. This path dependence leads to a very complex model that is difficult to estimate. The problem was solved by Gray (1996), who postulated that $z_{i,t}$ depends on $z_t = E(y^2_t \mid \Phi_{t-1})$, the sum of the $z_{i,t}$ weighted by the prediction probabilities of the states, $P(s_t = i \mid \Phi_{t-1})$. We have $y_t = \sqrt{z_{i,t}}\,\epsilon_t$ and the conditional variance is

z_{i,t} = c_i + \sum_{j=1}^{q} \alpha_{i,j} y^2_{t-j} + \sum_{j=1}^{p} \beta_{i,j} z_{t-j}, \quad i = 0, 1    (2.6.5)

z_t = z_{0,t} P(s_t = 0 \mid \Phi_{t-1}) + z_{1,t} P(s_t = 1 \mid \Phi_{t-1})

Now $z_{i,t}$ is not path dependent, because $z_{0,t-j}$ and $z_{1,t-j}$ have been collapsed into $z_{t-j}$; there is no need to consider all the possible values of $s_i$ in the computation of the model.
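A minimal R sketch of one step of this recursion for the (p, q) = (1, 1) case (our own helper; p_pred holds the prediction probabilities $P(s_t = i \mid \Phi_{t-1})$, which in a full model come from the filter of Section 2.4.2):

```r
# One step of Gray's (1996) path-independent MS-GARCH(1,1) recursion (2.6.5).
gray_step <- function(y_lag, z_lag, c, a, b, p_pred) {
  z0 <- c[1] + a[1] * y_lag^2 + b[1] * z_lag   # regime-0 conditional variance
  z1 <- c[2] + a[2] * y_lag^2 + b[2] * z_lag   # regime-1 conditional variance
  z  <- p_pred[1] * z0 + p_pred[2] * z1        # collapsed z_t carried to t+1
  list(z0 = z0, z1 = z1, z = z)
}
```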

2.6.1 Markov Switching Model of Conditional Mean and Conditional Variance

We can generalize the models by allowing both the conditional mean and the conditional variance to switch. Let $\mu_{i,t}$ denote the conditional mean $E(y_t \mid s_t = i, \Phi_{t-1})$ and write

y_t = \mu_{i,t} + v_{i,t}, \qquad v_{i,t} = \sqrt{z_{i,t}}\,\epsilon_t
z_{i,t} = c_i + \sum_{j=1}^{q} \alpha_{i,j} v^2_{t-j} + \sum_{j=1}^{p} b_{i,j} z_{t-j}    (2.6.6)

We must compute the two weighted aggregates $z_t$, $v_t$ as follows:

z_t = E(y^2_t \mid \Phi_{t-1}) - [E(y_t \mid \Phi_{t-1})]^2
v_t = y_t - E(y_t \mid \Phi_{t-1})

where $E(y_t \mid \Phi_{t-1})$ and $E(y^2_t \mid \Phi_{t-1})$ are calculated as

E(y_t \mid \Phi_{t-1}) = \mu_{0,t} P(s_t = 0 \mid \Phi_{t-1}) + \mu_{1,t} P(s_t = 1 \mid \Phi_{t-1})
E(y^2_t \mid \Phi_{t-1}) = (\mu^2_{0,t} + z_{0,t}) P(s_t = 0 \mid \Phi_{t-1}) + (\mu^2_{1,t} + z_{1,t}) P(s_t = 1 \mid \Phi_{t-1})

Thus, under this specification, neither $z_t$ nor $v_t$ is path dependent. The same can be written in an extended version: when the state variable takes k values, let $M_t$ denote the vector whose $i$-th element is $\mu_{i,t}$, $Z_t$ the vector whose $i$-th element is $z_{i,t}$, and $S_{t|t-1}$ the vector whose $i$-th element is $P(s_t = i \mid \Phi_{t-1})$. Then, similarly to the simple two-state model, we can combine the conditional mean and conditional variance across states as

z_t = (M_t \odot M_t + Z_t)' S_{t|t-1} - (M_t' S_{t|t-1})^2
v_t = y_t - M_t' S_{t|t-1}

where $\odot$ denotes the element-by-element product.

2.7 Hypothesis Testing

Constructing a Markov Switching model is only the first step; we then have to put the model to the test. To justify whether the Markov Switching model is appropriate we have to test several hypotheses: whether switching really occurs (a linearity test), whether the state variables are independent, how many states the model has, and the statistical significance of the parameters. Testing the linearity of the model shows whether switching really occurs, or whether the parameters are identical so that there is no switch in the model. Testing the independence of the state variables tests the existence of the Markovian property. We also need statistical tests for how many states the model has and, finally, tests of the statistical significance of the parameters, to assess the robustness of the model.

2.7.1 Linearity Test for Markov Switching Model

Linearity testing is designed to indicate whether switching occurs between the parameters of the model. Testing of this kind in the context of Markov Switching models is extremely complicated, because the standard regularity conditions of the likelihood function are violated: since we have to deal with parameters that are unidentified under the null hypothesis (under the null of linearity the transition probabilities are unidentified), the usual likelihood ratio cannot be defined, and the asymptotic distribution of the test is not the standard $\chi^2$ distribution. We present the Hansen test for linearity under the bootstrap approach.

Hansen Test

For the presentation of the Hansen test (see Hansen 1992, 1996 for further reading) we assume a simple Markov Switching autoregressive model, an AR(1) with switching only in the mean:

y_t = \mu_{S_t} + \beta_1 y_{t-1} + \epsilon_t, \quad \epsilon_t \sim N(0, 1)

for $t = 1, 2, \ldots, T$. To overcome the problem of unidentified parameters, Hansen proposed a theory which allows a test of linearity in the presence of nuisance parameters (assumed zero under the null hypothesis). The likelihood function is treated as a function of the unknown parameters, and empirical process theory is used to derive the asymptotic distribution of a standardized LR statistic. To this end, the parameter vector $\theta$ is split into two sub-vectors, one with the parameters of interest, $i$, and one with the nuisance parameters, $n$; the vector $i$ is further partitioned into $i_1$ and $i_2$. For the above model these vectors are

\theta = \{\mu^{(1)}, \mu^{(2)}, p_{11}, p_{22}, \beta_1, \sigma^2\}
i = \{\mu^{(1)} - \mu^{(2)}, p_{11}, p_{22}\}
n = \{\beta_1, \sigma^2, \mu^{(1)}\}
i_1 = \{\mu^{(1)} - \mu^{(2)}\}
i_2 = \{p_{11}, p_{22}\}

The hypothesis we want to test is linearity versus a two-state Markov Switching model:

H_0: \mu^{(1)} = \mu^{(2)} (linearity)
H_1: \mu^{(1)} \neq \mu^{(2)} (Markov Switching)

For the construction of the test statistic we start with the conditional log-likelihood of the $t$th observation given the vectors $i$, $n$. Let $f_t(i, n)$ be the conditional log-likelihood,
$$f_t(i, n) = \log f(y_t \mid I_{t-1};\, i, n)$$
where $I_{t-1}$ is the $\sigma$-algebra that denotes the information from the sample data up to time $t-1$. Maximizing the log-likelihood with respect to the nuisance parameters for any given value of the interest parameters $i$ we obtain a value of $n$; let $\hat n(i)$ be that value. Hence $f_t(i, \hat n(i))$ denotes the conditional log-likelihood given $i$ and $\hat n(i)$. We define $p_t(i)$ for the $t$th observation as
$$p_t(i) = f_t(i, \hat n(i)) - f_t(i_0, \hat n(i_0))$$
where $i_0$ is the value of $i$ under the null hypothesis of linearity. Taking the mean over all the observations we get
$$\bar p(i) = \frac{1}{T}\sum_{t=1}^{T} p_t(i)$$
The LR test of the null hypothesis that $i = i_0$ can be represented as $T\bar p(i)$. Following Hansen, who suggested a standardized LR test, we present the statistic in which $T\bar p(i)$ appears:
$$\hat H = \max_{i \in G}\left\{ T\,\bar p(i)\Big(\sum_{t=1}^{T}\big[p_t(i) - \bar p(i)\big]^2\Big)^{-1/2}\right\}$$
where $G$ is a grid of possible values of $i$.

The computation of the statistic as described above has two shortcomings. The first stems from the computational step: we must set a grid for every element of the vector $i$ and, for each grid point, optimize the likelihood function with respect to the nuisance parameters, which results in really complex computations. Second, this process yields a bound for the LR statistic rather than a critical value, which means that the test can be conservative.

We now want to approximate the distribution of the LR test statistic under the null hypothesis. To do so we use the bootstrap algorithm (for further reading see McLachlan, 1987) to compute approximations of the $p$-value.

Bootstrap Algorithm

The algorithm allows bootstrapping the LR test statistic for the number of components of a Markov Switching model in a normal mixture. The bootstrap algorithm consists of six steps.

Step 1: We estimate the coefficients of the model under the null hypothesis, $\hat\theta_0 = (\hat\mu, \hat\beta_{1\tau}, \hat\sigma^2)$, $\tau = 1, 2, \ldots, k$. These estimates can be obtained by MLE.

Step 2: We compute the residuals from the estimates under the null hypothesis as follows:
$$\hat\epsilon_t = y_t - \hat\mu - \sum_{\tau=1}^{k}\hat\beta_{1\tau}\big[y_{t-\tau} - \hat\mu\big], \qquad t = k+1, k+2, \ldots, T$$

Step 3: We estimate the model under $H_1$ and compute the LR statistic as
$$LR = 2\big\{L(\hat\theta \mid I_T) - L(\hat\theta_0 \mid I_T)\big\}$$
where $\hat\theta_0$ denotes the MLE estimates under $H_0$ and $\hat\theta$ the MLE estimates under $H_1$.

Step 4: We generate the bootstrap errors $\epsilon_t^*$ for $t = k+1, k+2, \ldots, T$ by sampling with replacement from the estimated residuals $\hat\epsilon_t$ of Step 2. The construction of the bootstrap sample follows
$$y_t^* = \hat\mu + \sum_{\tau=1}^{k}\hat\beta_\tau\big[y_{t-\tau}^* - \hat\mu\big] + \epsilon_t^*$$
with the initial values $(y_0^*, y_{-1}^*, \ldots, y_{-k+1}^*)$ that are needed for the computation. The distribution of $y_t^*$ is the bootstrap distribution of the data.

Step 5: We use the bootstrap sample $y_t^*$ to calculate the LR statistic. Let $LR_b$ denote the statistic computed from the bootstrap sample.

Step 6: We repeat the above steps $N$ times and compute the bootstrap $p$-value, the fraction of $LR_b$ values that are greater than the observed value $LR$:
$$p_b = \frac{\mathrm{card}(LR_b \geq LR)}{N}$$
A minimal R sketch of this bootstrap procedure follows.
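Below is a minimal R sketch of the six steps for the case $k = 1$ (an AR(1) under the null). The helper lrStatistic(), which would fit both the linear and the two-state switching model on a series and return the LR statistic, is hypothetical and passed in as an argument; everything else uses base R.

bootstrapPValue <- function(y, N = 500, lrStatistic) {
  # Step 1: estimate (mu, beta1, sigma^2) under H0 by MLE
  fit0  <- arima(y, order = c(1, 0, 0))
  mu    <- coef(fit0)["intercept"]
  beta1 <- coef(fit0)["ar1"]
  # Step 2: residuals under H0
  res <- residuals(fit0)
  # Step 3: LR statistic on the observed data (hypothetical helper)
  lr_obs <- lrStatistic(y)
  # Steps 4-5: resample residuals with replacement, rebuild the series,
  # and recompute the LR statistic on each bootstrap sample
  lr_boot <- replicate(N, {
    eps   <- sample(res, length(y), replace = TRUE)
    ystar <- numeric(length(y))
    ystar[1] <- y[1]                         # initial value taken from the data
    for (t in 2:length(y))
      ystar[t] <- mu + beta1 * (ystar[t - 1] - mu) + eps[t]
    lrStatistic(ystar)
  })
  # Step 6: fraction of bootstrap statistics at least as large as the observed one
  mean(lr_boot >= lr_obs)
}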

Determining the Number of States

When our model is a Markov Switching model, one of the most important questions one would want to ask is how many different regimes occur in the process. This question has to be cast as a hypothesis test. Unfortunately, it is very difficult to construct a working hypothesis test under the likelihood ratio approach, because that approach needs an asymptotic $\chi^2$ distribution, which requires complete information: the information matrix (the expectation of the second derivatives of the log-likelihood with respect to the parameters) must be non-singular. The idea of an asymptotic $\chi^2$ distribution fails if we try to fit an $N$-state model when we actually have $N - 1$ states. That is exactly the same problem as described for the linearity test, and we overcome it in the same way, with the Hansen test. The only difference is that we put two Markov Switching models to the test, for example a two-state versus a three-state model. For more tests on the determination of the number of states see White (1993), Ploberger (1992), Hamilton (1996) and Garcia (1998).

Testing Other Hypotheses

Independence Test

Another test for the Markov Switching model is the test of independence of the state variables. Assume that we have a model governed by a two-state Markov process with the following transition matrix $P$:
$$P = \begin{pmatrix} p_{00} & p_{01}\\ p_{10} & p_{11} \end{pmatrix}$$
For testing the independence of the state variables we state the null hypothesis
$$H_0: p_{00} + p_{11} = 1$$
This hypothesis represents independence of the states: since $p_{00} + p_{01} = 1$ and $p_{10} + p_{11} = 1$, the null implies $p_{00} = p_{10}$ and $p_{01} = p_{11}$, so the variable has exactly the same probability of being in state 0 or in state 1 regardless of the previous state. This means that the state variables are independent and there is no Markov process. The hypothesis of independence as presented can be tested by a Wald test; a minimal sketch follows.
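The following R sketch shows the Wald test of $H_0: p_{00} + p_{11} = 1$; the estimated transition probabilities and their variances and covariance are assumed to come from the fitted model's covariance matrix.

waldIndependence <- function(p00_hat, p11_hat, var00, var11, cov01 = 0) {
  g <- p00_hat + p11_hat - 1                # restriction evaluated at the estimates
  W <- g^2 / (var00 + var11 + 2 * cov01)    # Wald statistic, chi-squared(1) under H0
  c(statistic = W, p.value = pchisq(W, df = 1, lower.tail = FALSE))
}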

2.8 State Space Models and the Kalman Filter

In many applications, the driving forces behind the evolution of economic variables are not observable or measurable. State-space models, which deal with dynamic time series models with unobserved variables, can be applied to Markov Switching models. In econometric history many researchers have applied state-space models to various variables: Engle and Watson (1981) apply them to modelling the behaviour of wage rates, Wall, Burmeister and Hamilton (1986) apply them to estimating expected inflation, Kim and Nelson (1989) apply them to modelling a time-varying monetary reaction function of the Federal Reserve, and Stock and Watson (1991) apply them to a dynamic factor model of coincident economic indicators. When explanatory variables are not observable, standard models like VARs cannot simulate this kind of process, and to analyse a framework with unobserved variables we need state-space models. The basic tool we need to apply and compute standard state-space models is the Kalman filter. The Kalman filter is a recursive procedure for computing the estimators of the unobserved component of the state vector at time $t$, based on the information up to time $t$, which is the $\sigma$-algebra $F_t$. The Kalman filter allows us to apply likelihood-based inference, since we can construct the likelihood function associated with a state-space model. For further reading about surveys and the applicability of state-space models, refer to Engle and Watson (1987); Harvey (1965, 1989, 1990); and Hamilton (1994a, 1994b).

State-Space Models

State-space models were originally developed by Kalman (1960) and they are very useful tools for expressing dynamic systems that involve unobserved state variables. A state-space model consists of two equations, the measurement equation and the transition (or state) equation.

Measurement equation: an equation that describes the relation between the observed variables and the unobserved state variables.

Transition equation: an equation that describes the dynamics of the state variables.

Consider the following presentation. Suppose that we have a state variable $s_t$ and an observed variable $y_t$, and let $Y_{t-1}$ denote all observed variables $\{y_1, \ldots, y_{t-1}\}$ up to time $t-1$. The state-space formulation is

Measurement equation: $f(y_t \mid s_t, Y_{t-1})$
Transition equation: $F(s_t \mid s_{t-1}, Y_{t-1})$

These two equations allow us to write the joint likelihood of the observed variables $y_t$; there may be some unknown parameter $\theta$:
$$f(y_1, \ldots, y_T; \theta) = f(y_1; \theta)\prod_{t=2}^{T} f(y_t \mid y_{t-1}, \ldots, y_1; \theta) = f(y_1; \theta)\prod_{t=2}^{T} f(y_t \mid Y_{t-1}; \theta)$$
Hence, to compute the likelihood, we first of all need to find $f(y_t \mid Y_{t-1})$. To do so we must follow the next three general steps, known as filtering. We construct $f(y_t \mid Y_{t-1})$ as
$$f(y_t \mid Y_{t-1}) = \int f(y_t \mid s_t, Y_{t-1})\, f(s_t \mid Y_{t-1})\, ds_t$$
The prediction equation:
$$f(s_t \mid Y_{t-1}) = \int F(s_t \mid s_{t-1}, Y_{t-1})\, f(s_{t-1} \mid Y_{t-1})\, ds_{t-1}$$
The updating equation:
$$f(s_t \mid Y_t) = \frac{f(y_t \mid s_t, Y_{t-1})\, f(s_t \mid Y_{t-1})}{f(y_t \mid Y_{t-1})}$$

This process is straightforward: we start from $f(s_1 \mid Y_0)$, obtain $f(y_1 \mid Y_0)$, then $f(s_1 \mid Y_1)$, then $f(y_2 \mid Y_1)$, and so on, until we finally have all the conditional likelihoods $f(y_t \mid Y_{t-1})$. Computing these distributions in a continuous state space is difficult as a consequence of the integrals; in a discrete state space the integrals are replaced by sums. With normal distributions, any sub-vector of a normal vector is normal, and so are the conditionals. When this procedure is carried out for normal distributions it is called the Kalman filter.

The Kalman Filter

Suppose we have a state equation
$$s_t = T s_{t-1} + R\eta_t \qquad (2.8.1)$$
and a measurement equation
$$y_t = Z s_t + S\xi_t \qquad (2.8.2)$$
with
$$\begin{pmatrix}\eta_t\\ \xi_t\end{pmatrix} \overset{i.i.d.}{\sim} N\left(0,\; \begin{pmatrix}Q & 0\\ 0 & H\end{pmatrix}\right)$$
so that
$$F(s_t \mid s_{t-1}) \sim N(T s_{t-1},\, RQR'), \qquad f(y_t \mid s_t, Y_{t-1}) \sim N(Z s_t,\, SHS')$$
We have mentioned that we want normally distributed errors for simpler computation. If the initial state is normal, then, since the $s_t$'s and $y_t$'s are linear combinations of normal errors, the whole vector $(s_1, \ldots, s_T, y_1, \ldots, y_T)$ is normally distributed. Writing a more general form: if
$$\begin{bmatrix}x_1\\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix},\; \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right)$$
then $x_1 \mid x_2 \sim N(\bar\mu, \bar\Sigma)$ where
$$\bar\mu = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \bar\Sigma = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
We can see that these conditional distributions in the general form above are normal.

Notation:
$$s_t \mid Y_{t-1} \sim N(s_{t|t-1}, P_{t|t-1}), \qquad s_t \mid Y_t \sim N(s_{t|t}, P_{t|t}), \qquad y_t \mid Y_{t-1} \sim N(y_{t|t-1}, F_t)$$
We shall see how the filtering mechanism works. From equation (2.8.1) we take
$$s_{t|t-1} = T s_{t-1|t-1} \qquad (2.8.3)$$
$$P_{t|t-1} = E\big((s_t - s_{t|t-1})(s_t - s_{t|t-1})' \mid Y_{t-1}\big) = T P_{t-1|t-1} T' + RQR' \qquad (2.8.4)$$

Now we use the prediction equation to obtain, from the above equations,
$$y_{t|t-1} = Z s_{t|t-1} \qquad (2.8.5)$$
$$F_t = E\big((y_t - y_{t|t-1})(y_t - y_{t|t-1})' \mid Y_{t-1}\big) = Z P_{t|t-1} Z' + SHS' \qquad (2.8.6)$$
The last step is the updating step, where we need to use normality:
$$\begin{pmatrix}s_t\\ y_t\end{pmatrix}\Big|\, Y_{t-1} \sim N\left(\begin{pmatrix}s_{t|t-1}\\ y_{t|t-1}\end{pmatrix},\; \begin{pmatrix}P_{t|t-1} & C\\ C' & F_t\end{pmatrix}\right)$$
where
$$C = E\big((s_t - s_{t|t-1})(y_t - y_{t|t-1})' \mid Y_{t-1}\big) = E\big((s_t - s_{t|t-1})(s_t - s_{t|t-1})' \mid Y_{t-1}\big)Z' = P_{t|t-1}Z'$$
Using the general conditional-normal form above and the latter equation, we can write the posterior density of $s_t$ given $Y_t$ as
$$s_t \mid Y_t = s_t \mid y_t, Y_{t-1} \sim N(s_{t|t}, P_{t|t}) \sim N\big(s_{t|t-1} + P_{t|t-1}Z'F_t^{-1}(y_t - y_{t|t-1}),\; P_{t|t-1} - P_{t|t-1}Z'F_t^{-1}ZP_{t|t-1}\big) \qquad (2.8.7)$$
For clarity we should explain this process. Starting from initial values $s_{1|0}$, $P_{1|0}$, we use equations (2.8.5) and (2.8.6) to get the values of $y_{1|0}$ and $F_1$, where $F_t$ is the variance of the conditional density of $y_t$. Next we use equation (2.8.7) to get the values of $s_{1|1}$, $P_{1|1}$. Afterwards we use equations (2.8.3) and (2.8.4) to get the values of $s_{2|1}$, $P_{2|1}$, and so on; repeating this process we compute the entire likelihood. After computing the likelihood we can estimate the parameters by MLE. A minimal code sketch of these recursions follows.
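The R sketch below implements the recursions (2.8.3)-(2.8.7) for a univariate $y_t$; the system matrices and initial moments are assumed known, and the function returns the log-likelihood built from the prediction error decomposition.

kalmanLoglik <- function(y, Tm, R, Q, Z, S, H, s10, P10) {
  s <- s10; P <- P10; loglik <- 0           # start from s_{1|0}, P_{1|0}
  for (t in seq_along(y)) {
    # (2.8.5)-(2.8.6): forecast of the observation and its variance
    y_pred <- Z %*% s
    Fm     <- Z %*% P %*% t(Z) + S %*% H %*% t(S)
    # accumulate the Gaussian log-likelihood of the prediction error
    loglik <- loglik + dnorm(y[t], as.numeric(y_pred), sqrt(as.numeric(Fm)), log = TRUE)
    # (2.8.7): updating step
    K <- P %*% t(Z) %*% solve(Fm)
    s_upd <- s + K %*% (y[t] - y_pred)
    P_upd <- P - K %*% Z %*% P
    # (2.8.3)-(2.8.4): prediction step for the next period
    s <- Tm %*% s_upd
    P <- Tm %*% P_upd %*% t(Tm) + R %*% Q %*% t(R)
  }
  loglik
}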

The Kalman filter uses the information from the data up to time $t$, $Y_t$, to make predictions about $s_t$; we want to predict $s_t$ because it is necessary for computing the likelihood. If we want to estimate the unobserved state variable $s_t$ using the whole data set, that is, the $\sigma$-algebra that contains the information up to time $T$, we use the Kalman smoother. Let $E(s_t \mid Y_T) = s_{t|T}$. Since $(s_t, s_{t+1}) \mid Y_t$ is normal, we have
$$E(s_t \mid s_{t+1}, Y_t) = s_{t|t} + E\big((s_t - s_{t|t})(s_{t+1} - s_{t+1|t})' \mid Y_t\big)\, P_{t+1|t}^{-1}(s_{t+1} - s_{t+1|t}) = s_{t|t} + J_t(s_{t+1} - s_{t+1|t})$$
where
$$E\big((s_t - s_{t|t})(s_{t+1} - s_{t+1|t})' \mid Y_t\big) = E\big((s_t - s_{t|t})(T(s_t - s_{t|t}) + R\eta_{t+1})' \mid Y_t\big) = P_{t|t}T'$$
so that $J_t = P_{t|t}T'P_{t+1|t}^{-1}$. The knowledge we have about $y_{t+j}$ for $j > 0$ would not give us more information if we already know $s_{t+1}$; hence $E(s_t \mid s_{t+1}, Y_T)$ equals $E(s_t \mid s_{t+1}, Y_t)$:
$$E(s_t \mid s_{t+1}, Y_T) = s_{t|t} + J_t(s_{t+1} - s_{t+1|t})$$
Taking the expectation of $s_t$ conditional on $Y_T$ and using the law of iterated expectations we get
$$E(s_t \mid Y_T) = s_{t|t} + J_t(s_{t+1|T} - s_{t+1|t})$$
Finally, we can see that the Kalman smoother mechanism works backwards: starting from $t = T$ we compute $s_{T|T}$, next $s_{T-1|T}$, and so on. The Kalman filter, on the other hand, works forwards.

We are going to present some examples of time series models in state-space form; a short code sketch of the system matrices for the first example follows the examples.

Example 1: Assume that we have an AR(2) model
$$y_t = \alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \epsilon_t, \qquad \epsilon_t \overset{iid}{\sim} N(0, \sigma^2)$$
We write this model in state-space form with state vector $s_t = (y_t, y_{t-1})'$. The measurement equation is
$$y_t = \begin{bmatrix}1 & 0\end{bmatrix}s_t$$
and the transition equation is
$$s_t = \begin{bmatrix}\alpha\\ 0\end{bmatrix} + \begin{bmatrix}\beta_1 & \beta_2\\ 1 & 0\end{bmatrix}s_{t-1} + \begin{bmatrix}\epsilon_t\\ 0\end{bmatrix}$$

Example 2: Assume that we have the following MA(1) model:
$$y_t = \epsilon_t + \theta\epsilon_{t-1}$$
The state-space presentation of the model, with state vector $s_t = (\epsilon_t, \epsilon_{t-1})'$, is as follows. The measurement equation is
$$y_t = \begin{bmatrix}1 & \theta\end{bmatrix}s_t$$
and the transition equation is
$$s_t = \begin{bmatrix}0 & 0\\ 1 & 0\end{bmatrix}s_{t-1} + \begin{bmatrix}\epsilon_t\\ 0\end{bmatrix}$$

Example 3: Assume that we have an ARMA(2,1) model:
$$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \epsilon_t + \theta\epsilon_{t-1}$$
With state vector $s_t = (s_{1,t}, s_{2,t})'$, the measurement equation is
$$y_t = \begin{bmatrix}1 & \theta\end{bmatrix}\begin{bmatrix}s_{1,t}\\ s_{2,t}\end{bmatrix}$$
and the transition equation is
$$\begin{bmatrix}s_{1,t}\\ s_{2,t}\end{bmatrix} = \begin{bmatrix}\beta_1 & \beta_2\\ 1 & 0\end{bmatrix}\begin{bmatrix}s_{1,t-1}\\ s_{2,t-1}\end{bmatrix} + \begin{bmatrix}\epsilon_t\\ 0\end{bmatrix}$$
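The following sketch shows the system matrices of Example 1 with illustrative parameter values; with a small extension for the transition intercept, they plug directly into the filter sketched earlier.

# state vector s_t = (y_t, y_{t-1})'; alpha, beta1, beta2, sigma2 are illustrative
alpha <- 0.5; beta1 <- 0.6; beta2 <- 0.2; sigma2 <- 1
Z  <- matrix(c(1, 0), nrow = 1)                   # measurement: y_t = Z s_t
Tm <- matrix(c(beta1, beta2,
               1,     0), nrow = 2, byrow = TRUE) # companion-form transition matrix
cT <- c(alpha, 0)                                 # transition intercept
R  <- matrix(c(1, 0), ncol = 1)                   # shock loading, (eps_t, 0)'
Q  <- matrix(sigma2)                              # Var(eps_t)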

2.9 The Markov Switching Model in State-Space Format

In the previous section we introduced the state-space format. Now we present the Markov Switching model in state-space format. This presentation was made by Kim (1994), with the purpose of extending Hamilton's model (1989). Many authors have investigated the state-space presentation of regime-switching models: Harrison and Stevens (1976), Bar-Shalom (1978), Harvey (1989), Shumway and Stoffer (1991).

Specification of the Markov Switching in State-space Format

Let us assume the following measurement and transition equations.

Measurement equation:
$$y_t = H_{s_t}\beta_t + A_{s_t}z_t + \epsilon_t$$
Transition equation:
$$\beta_t = \mu_{s_t} + F_{s_t}\beta_{t-1} + G_{s_t}v_t$$
with
$$\begin{pmatrix}\epsilon_t\\ v_t\end{pmatrix} \sim N\left(0,\; \begin{pmatrix}R_{s_t} & 0\\ 0 & Q_{s_t}\end{pmatrix}\right)$$
where $H_{s_t}$ is an $(N \times J)$ matrix, $A_{s_t}$ an $(N \times K)$ matrix, $F_{s_t}$ a $(J \times J)$ matrix and $G_{s_t}$ a $(J \times L)$ matrix. The measurement equation describes the evolution of the observed time series as a function of the unobserved state vector and the exogenous or lagged dependent variables $z_t$. The transition equation describes the dynamics of the unobserved state vector as a function of the shocks $v_t$ and the lagged unobserved state vector. All the matrices in the measurement and transition equations, as well as the term $\mu$, depend on the unobserved state variable. For simplification we assume that this unobserved state variable follows a discrete-time Markov process; hence it is an $M$-state Markov-switching variable $s_t$. Since it

has $M$ states, it takes values $s_t = i$, $i \in \{1, 2, \ldots, M\}$, with the following transition matrix:
$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1M}\\ p_{21} & p_{22} & \cdots & p_{2M}\\ \vdots & \vdots & \ddots & \vdots\\ p_{M1} & p_{M2} & \cdots & p_{MM} \end{pmatrix}$$
where $p_{ij} = P[s_t = j \mid s_{t-1} = i]$ with $\sum_{j=1}^{M} p_{ij} = 1$ for all $i$.

Estimation of the Model

For the sake of simplicity, assume that the parameters of the model in the previous section are known. For the estimation of the state-space model with Markov switching we use the Kalman filter. As we have mentioned in the previous section, the Kalman filter is a recursive procedure for computing the optimal estimate of the unobserved state vector $\beta_t$, $t \in \{1, 2, \ldots, T\}$. Depending on the information set used, we have the basic filter and the smoothing filter: the first refers to the estimation of the state vector $\beta_t$ based on the information available up to time $t$, and the second to the estimation based on the information up to time $T$, which is the whole sample. For further reading see Kim et al. (1999).

The notation we are going to use is as follows: $\psi_t$ is the information set up to time $t$; $\beta_{t|t-1} = E(\beta_t \mid \psi_{t-1})$ is the expectation of $\beta_t$ conditional on information up to time $t-1$; $P_{t|t-1} = E[(\beta_t - \beta_{t|t-1})(\beta_t - \beta_{t|t-1})']$ is the covariance matrix of $\beta_t$ conditional on information up to time $t-1$; $\beta_{t|t} = E(\beta_t \mid \psi_t)$ and $P_{t|t} = E[(\beta_t - \beta_{t|t})(\beta_t - \beta_{t|t})']$ are the expectation and covariance matrix of $\beta_t$ conditional on information up to time $t$; $y_{t|t-1} = E(y_t \mid \psi_{t-1})$ is the forecast of the observed variable $y_t$ conditional on information up to time $t-1$; $\eta_{t|t-1} = y_t - y_{t|t-1}$ is the prediction error; $f_{t|t-1} = E(\eta_{t|t-1}^2)$ is the conditional variance of the prediction error; and $\beta_{t|T} = E(\beta_t \mid \psi_T)$ and $P_{t|T} = E[(\beta_t - \beta_{t|T})(\beta_t - \beta_{t|T})']$ are the expectation and covariance matrix of $\beta_t$ conditional on the information up to time $T$, which is the whole sample.

We write the Kalman filter procedure as it was presented by Kim and Nelson (1990). As we have mentioned, the basic filter has two steps: the prediction step, where we compute $\beta_{t|t-1}$, $P_{t|t-1}$, $\eta_{t|t-1}$, $f_{t|t-1}$ (taking expectations and applying the properties of the expectation operator yields closed-form equations), and the updating step, where we recompute the state estimate taking into consideration the new information contained in the prediction error.

Updating:
$$\beta_{t|t} = \beta_{t|t-1} + K_t\eta_{t|t-1}$$
where $K_t$ is the Kalman gain, which determines the weight assigned to the new information about $\beta_t$ contained in the prediction error.

In the state-space model with Markov switching, the goal is to form a forecast of $\beta_t$ based not only on the information up to time $t-1$ but also conditional on the random variable $s_t$ taking the value $j$ and $s_{t-1}$ taking the value $i$:
$$\beta^{(i,j)}_{t|t-1} = E(\beta_t \mid \psi_{t-1}, s_t = j, s_{t-1} = i)$$
Hence the covariance matrix is
$$P^{(i,j)}_{t|t-1} = E\big[(\beta_t - \beta_{t|t-1})(\beta_t - \beta_{t|t-1})' \mid \psi_{t-1}, s_t = j, s_{t-1} = i\big]$$
Now we implement the Kalman filter on the model. Assuming $s_{t-1} = i$ and $s_t = j$, the implementation of the Kalman filter is as follows.

Prediction:
$$\beta^{(i,j)}_{t|t-1} = \mu_j + F_j\beta^{i}_{t-1|t-1}$$
$$P^{(i,j)}_{t|t-1} = F_j P^{i}_{t-1|t-1}F_j' + G_j Q_j G_j'$$
$$\eta^{(i,j)}_{t|t-1} = y_t - H_j\beta^{(i,j)}_{t|t-1} - A_j z_t$$
$$f^{(i,j)}_{t|t-1} = H_j P^{(i,j)}_{t|t-1}H_j' + R_j$$

Updating:
$$\beta^{(i,j)}_{t|t} = \beta^{(i,j)}_{t|t-1} + P^{(i,j)}_{t|t-1}H_j'\big[f^{(i,j)}_{t|t-1}\big]^{-1}\eta^{(i,j)}_{t|t-1}$$
$$P^{(i,j)}_{t|t} = \Big(I - P^{(i,j)}_{t|t-1}H_j'\big[f^{(i,j)}_{t|t-1}\big]^{-1}H_j\Big)P^{(i,j)}_{t|t-1}$$
where $\beta^{(i,j)}_{t|t-1}$ is an inference about $\beta_t$ based on the information from the sample up to time $t-1$, given $s_{t-1} = i$ and $s_t = j$; $P^{(i,j)}_{t|t-1}$ is the covariance matrix of $\beta^{(i,j)}_{t|t-1}$ based on information up to time $t-1$, conditional on $s_t = j$ and $s_{t-1} = i$; $\eta^{(i,j)}_{t|t-1}$ is the prediction error of $y_t$ conditional on information up to time $t-1$, given $s_t = j$ and $s_{t-1} = i$; and finally $f^{(i,j)}_{t|t-1}$ is the conditional variance of the prediction error $\eta^{(i,j)}_{t|t-1}$.

Clearly, we would have to consider a great number of different cases in every iteration of the Kalman filter, even with only two regimes. Approximations to deal with this were proposed by Gordon and Smith (1988), Harrison et al. (1976), Highfield (1990) and Smith and Makov (1980). The idea behind the approximation is to reduce the number of possible cases in each iteration of the above Kalman filter. Consider the simplest Markov Switching model with only two regimes: at $t = 10$ we would already have more than 1000 different cases for $\beta^{(i,j)}_{t|t}$ and $P^{(i,j)}_{t|t}$ (the number of regime histories doubles every period), so an approximation that reduces the number of possible cases makes the implementation approachable. In addition, denoting by $N$ the number of states, each step produces $N \times N$ posteriors $\beta^{(i,j)}_{t|t}$ and $P^{(i,j)}_{t|t}$, and with the approximation we want to reduce the posteriors to $N$, namely $\beta^{j}_{t|t}$, $P^{j}_{t|t}$. The approximation of the updating step is as follows.

The expectation of $\beta_t$ conditional on information up to time $t$, given $s_t$ and $s_{t-1}$, is given by the Kalman filter above:
$$\beta^{(i,j)}_{t|t} = \beta^{(i,j)}_{t|t-1} + P^{(i,j)}_{t|t-1}H_j'\big[f^{(i,j)}_{t|t-1}\big]^{-1}\eta^{(i,j)}_{t|t-1} = E[\beta_t \mid s_{t-1} = i, s_t = j, \psi_t]$$
Taking expectations, it is straightforward to show that
$$\beta^{j}_{t|t} = \frac{\sum_{i=1}^{N} P(s_{t-1} = i, s_t = j \mid \psi_t)\,\beta^{(i,j)}_{t|t}}{P(s_t = j \mid \psi_t)} \qquad (2.9.1)$$
where $\beta^{j}_{t|t}$ represents $E[\beta_t \mid s_t = j, \psi_t]$. For the derivation of $P^{j}_{t|t}$ we denote the weight
$$w^{(i,j)}_t = \frac{P(s_{t-1} = i, s_t = j \mid \psi_t)}{P(s_t = j \mid \psi_t)}$$
Hence the derivation is
$$\begin{aligned} P^{j}_{t|t} &= E\big[(\beta_t - E[\beta_t \mid s_t = j, \psi_t])(\beta_t - E[\beta_t \mid s_t = j, \psi_t])' \mid s_t = j, \psi_t\big] = E\big[(\beta_t - \beta^{j}_{t|t})(\beta_t - \beta^{j}_{t|t})' \mid s_t = j, \psi_t\big]\\ &= \sum_{i=1}^{N} w^{(i,j)}_t\, E\big[(\beta_t - \beta^{j}_{t|t})(\beta_t - \beta^{j}_{t|t})' \mid s_{t-1} = i, s_t = j, \psi_t\big]\\ &= \sum_{i=1}^{N} w^{(i,j)}_t\, E\big[(\beta_t - \beta^{(i,j)}_{t|t} + \beta^{(i,j)}_{t|t} - \beta^{j}_{t|t})(\beta_t - \beta^{(i,j)}_{t|t} + \beta^{(i,j)}_{t|t} - \beta^{j}_{t|t})' \mid s_{t-1} = i, s_t = j, \psi_t\big]\\ &= \sum_{i=1}^{N} w^{(i,j)}_t\,\Big\{ E\big[(\beta_t - \beta^{(i,j)}_{t|t})(\beta_t - \beta^{(i,j)}_{t|t})' \mid s_{t-1} = i, s_t = j, \psi_t\big] + (\beta^{j}_{t|t} - \beta^{(i,j)}_{t|t})(\beta^{j}_{t|t} - \beta^{(i,j)}_{t|t})'\Big\} \end{aligned}$$
since the cross terms
$$\sum_{i=1}^{N} w^{(i,j)}_t\,\big(E[\beta_t \mid s_{t-1} = i, s_t = j, \psi_t] - \beta^{(i,j)}_{t|t}\big)\big(\beta^{(i,j)}_{t|t} - \beta^{j}_{t|t}\big)'$$
and its transpose vanish, because $E[\beta_t \mid s_{t-1} = i, s_t = j, \psi_t] = \beta^{(i,j)}_{t|t}$. If $E\big[(\beta_t - \beta^{(i,j)}_{t|t})(\beta_t - \beta^{(i,j)}_{t|t})' \mid s_{t-1} = i, s_t = j, \psi_t\big]$ is written as $P^{(i,j)}_{t|t}$, then we can rewrite $P^{j}_{t|t}$ as
$$P^{j}_{t|t} = \sum_{i=1}^{N}\frac{P[s_{t-1} = i, s_t = j \mid \psi_t]}{P[s_t = j \mid \psi_t]}\Big\{P^{(i,j)}_{t|t} + (\beta^{j}_{t|t} - \beta^{(i,j)}_{t|t})(\beta^{j}_{t|t} - \beta^{(i,j)}_{t|t})'\Big\} \qquad (2.9.2)$$

To that end we have managed to reduce the $N \times N$ posteriors for $\beta^{(i,j)}_{t|t}$, $P^{(i,j)}_{t|t}$ to $N$ posteriors, making the filter operable; with the use of the approximations we derived equations (2.9.1) and (2.9.2). To complete the Kalman filter we need to make inferences about the probability terms that appear in these two equations. To do so we follow three steps; a minimal code sketch of one iteration follows them.

Step 1: At the beginning of the $t$th iteration, given the values of $P[s_{t-1} = i \mid \psi_{t-1}]$ for $i = 1, 2, \ldots, N$, we can calculate the joint probabilities
$$P[s_t = j, s_{t-1} = i \mid \psi_{t-1}] = P[s_t = j \mid s_{t-1} = i]\, P[s_{t-1} = i \mid \psi_{t-1}]$$
for $i, j = 1, 2, \ldots, N$, where $P[s_t = j \mid s_{t-1} = i] = p_{ij}$ is the transition probability.

Step 2: After computing $P[s_t, s_{t-1} \mid \psi_{t-1}]$ we compute the conditional density of $y_t$. Starting with the joint density of $y_t, s_t, s_{t-1}$,
$$f(y_t, s_t = j, s_{t-1} = i \mid \psi_{t-1}) = f(y_t \mid s_t = j, s_{t-1} = i, \psi_{t-1})\, P[s_t = j, s_{t-1} = i \mid \psi_{t-1}]$$
from this equation we can take the marginal density of $y_t$:
$$f(y_t \mid \psi_{t-1}) = \sum_{j=1}^{N}\sum_{i=1}^{N} f(y_t, s_t = j, s_{t-1} = i \mid \psi_{t-1}) = \sum_{j=1}^{N}\sum_{i=1}^{N} f(y_t \mid s_t = j, s_{t-1} = i, \psi_{t-1})\, P[s_t = j, s_{t-1} = i \mid \psi_{t-1}]$$
Here we need the marginal density of $y_t$ conditional on $s_t$, $s_{t-1}$, $\psi_{t-1}$, which we obtain from the prediction error decomposition:
$$f(y_t \mid s_{t-1} = i, s_t = j, \psi_{t-1}) = (2\pi)^{-N/2}\,\big|f^{(i,j)}_{t|t-1}\big|^{-1/2}\exp\Big(-\tfrac{1}{2}\big(\eta^{(i,j)}_{t|t-1}\big)'\big(f^{(i,j)}_{t|t-1}\big)^{-1}\eta^{(i,j)}_{t|t-1}\Big)$$
for $i, j = 1, 2, \ldots, N$.

Step 3: When we observe $y_t$ at the end of time $t$, we should update $P[s_t, s_{t-1} \mid \psi_{t-1}]$ to get
$$P[s_t = j, s_{t-1} = i \mid \psi_t] = P[s_t = j, s_{t-1} = i \mid \psi_{t-1}, y_t] = \frac{f(y_t \mid s_t = j, s_{t-1} = i, \psi_{t-1})\, f(s_t = j, s_{t-1} = i \mid \psi_{t-1})}{f(y_t \mid \psi_{t-1})}$$
for $i, j = 1, 2, \ldots, N$, with
$$P[s_t = j \mid \psi_t] = \sum_{i=1}^{N} P[s_t = j, s_{t-1} = i \mid \psi_t]$$
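A minimal R sketch of one iteration of these three steps for $N$ regimes follows; prob_prev holds $P[s_{t-1} = i \mid \psi_{t-1}]$, P is the transition matrix with entries $p_{ij}$, and dens holds the conditional densities $f(y_t \mid s_{t-1} = i, s_t = j, \psi_{t-1})$ computed from the prediction error decomposition.

probabilityStep <- function(prob_prev, P, dens) {
  joint_pred <- P * prob_prev            # Step 1: P[s_{t-1}=i, s_t=j | psi_{t-1}]
  fy         <- sum(dens * joint_pred)   # Step 2: marginal density f(y_t | psi_{t-1})
  joint_post <- dens * joint_pred / fy   # Step 3: P[s_{t-1}=i, s_t=j | psi_t]
  list(prob  = colSums(joint_post),      # P[s_t = j | psi_t]
       joint = joint_post,
       logf  = log(fy))                  # contribution to the log-likelihood (2.9.3)
}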

To sum up, we begin by setting initial values $\beta^{j}_{0|0}$, $P^{j}_{0|0}$, $P[s_0 = j \mid \psi_0]$. After setting the initial values we follow these steps:

- We run the Kalman filter for $i, j = 1, 2, \ldots, N$, using the equations for $\beta^{(i,j)}_{t|t-1}$, $P^{(i,j)}_{t|t-1}$, $\eta^{(i,j)}_{t|t-1}$, $f^{(i,j)}_{t|t-1}$, $\beta^{(i,j)}_{t|t}$, $P^{(i,j)}_{t|t}$.
- We calculate, for $i, j = 1, 2, \ldots, N$, the probability terms $P[s_t, s_{t-1} \mid \psi_t]$ and $P[s_t \mid \psi_t]$.
- We use the probability terms we computed and, with the help of the approximations (2.9.1) and (2.9.2), the posteriors collapse to $N \times 1$ instead of $N \times N$.

Finally, the last step is to construct the likelihood function in order to estimate the parameters of the Markov switching model. For maximum likelihood estimation (MLE) we can use a non-linear optimization procedure to maximize the log-likelihood (such a method is the EM algorithm). We have already computed the density of $y_t$ conditional on past information, $f(y_t \mid \psi_{t-1})$ for $t = 1, 2, \ldots, T$; hence the approximate log-likelihood is
$$L(\theta) = \ln[f(y_1, y_2, \ldots, y_T)] = \sum_{t=1}^{T}\ln[f(y_t \mid \psi_{t-1})] \qquad (2.9.3)$$

Smoothing

Smoothing is the method of drawing inferences using all the information of the sample, up to time $T$. Once we have estimated the parameters of the model we want smoothed inferences about $s_t$ and $\beta_t$. For further reading see Kitagawa (1987), Hamilton (1994) and Kim et al. (1999). For the smoothing we use Kim's algorithm.

Kim's algorithm: The derivation of the joint probability of $s_t = j$ and $s_{t+1} = m$ based on the whole sample is
$$\begin{aligned} P[s_t = j, s_{t+1} = m \mid \psi_T] &= P[s_t = j \mid s_{t+1} = m, \psi_T]\; P[s_{t+1} = m \mid \psi_T]\\ &\approx P[s_t = j \mid s_{t+1} = m, \psi_t]\; P[s_{t+1} = m \mid \psi_T]\\ &= \frac{P[s_t = j, s_{t+1} = m \mid \psi_t]}{P[s_{t+1} = m \mid \psi_t]}\; P[s_{t+1} = m \mid \psi_T] \end{aligned}$$
Finally we get
$$P[s_t = j, s_{t+1} = m \mid \psi_T] = \frac{P[s_{t+1} = m \mid \psi_T]\; P[s_t = j \mid \psi_t]\; P[s_{t+1} = m \mid s_t = j]}{P[s_{t+1} = m \mid \psi_t]} \qquad (2.9.4)$$
and
$$P[s_t = j \mid \psi_T] = \sum_{m=1}^{N} P[s_t = j, s_{t+1} = m \mid \psi_T] \qquad (2.9.5)$$

Notice that in the second line we made an approximation, assuming that $P[s_t = j \mid s_{t+1}, \psi_T]$ equals $P[s_t = j \mid s_{t+1}, \psi_t]$. Now we turn to the derivation of the smoothing algorithm for $\beta_t$. Given $s_t = j$ and $s_{t+1} = m$ we get
$$\beta^{(j,m)}_{t|T} = \beta^{j}_{t|t} + \tilde P^{(j,m)}_t\big(\beta^{m}_{t+1|T} - \beta^{(j,m)}_{t+1|t}\big) \qquad (2.9.6)$$
$$P^{(j,m)}_{t|T} = P^{j}_{t|t} + \tilde P^{(j,m)}_t\big(P^{m}_{t+1|T} - P^{(j,m)}_{t+1|t}\big)\tilde P^{(j,m)\prime}_t \qquad (2.9.7)$$
where $\tilde P^{(j,m)}_t = P^{j}_{t|t}F_m'\big[P^{(j,m)}_{t+1|t}\big]^{-1}$, $\beta^{(j,m)}_{t|T}$ is the smoothed inference about $\beta_t$, $P^{(j,m)}_{t|T}$ is the covariance matrix of $\beta^{(j,m)}_{t|T}$, and $\beta^{j}_{t|t}$, $P^{j}_{t|t}$ are given by equations (2.9.1) and (2.9.2).

All things considered, we can sum up in four steps how Kim's algorithm works and how we get the smoothed inferences.

Step 1: We run the basic filter for $t = 1, 2, \ldots, T$, using the previous equations for $\beta^{(i,j)}_{t|t-1}$, $P^{(i,j)}_{t|t-1}$, $\beta^{j}_{t|t}$, $P^{j}_{t|t}$, $P[s_t = j \mid \psi_{t-1}]$ and $P[s_t = j \mid \psi_t]$.

Step 2: We get the smoothed joint probabilities $P[s_t = j, s_{t+1} = m \mid \psi_T]$ and $P[s_t = j \mid \psi_T]$ from equations (2.9.4) and (2.9.5) for $t = T-1, T-2, \ldots, 1$, with the value from the final iteration of the basic filter as the starting value for the smoothing.

Step 3: We use the smoothed probabilities from the previous step to collapse the $N \times N$ elements of $\beta^{(j,m)}_{t|T}$ and $P^{(j,m)}_{t|T}$ into $N$, by taking weighted averages over the state $s_{t+1}$:
$$\beta^{j}_{t|T} = \frac{\sum_{m=1}^{N} P[s_t = j, s_{t+1} = m \mid \psi_T]\,\beta^{(j,m)}_{t|T}}{P[s_t = j \mid \psi_T]} \qquad (2.9.8)$$
$$P^{j}_{t|T} = \frac{\sum_{m=1}^{N} P[s_t = j, s_{t+1} = m \mid \psi_T]\,\Big\{P^{(j,m)}_{t|T} + (\beta^{j}_{t|T} - \beta^{(j,m)}_{t|T})(\beta^{j}_{t|T} - \beta^{(j,m)}_{t|T})'\Big\}}{P[s_t = j \mid \psi_T]} \qquad (2.9.9)$$

Step 4: From the previous step we can see that the smoothed value $\beta^{j}_{t|T}$ depends only on the state at time $t$. If we take a weighted average over the states at time $t$ we get
$$\beta_{t|T} = \sum_{j=1}^{N} P[s_t = j \mid \psi_T]\,\beta^{j}_{t|T} \qquad (2.9.10)$$
A minimal code sketch of one backward step of the probability smoothing follows.
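The sketch below implements one backward step of equations (2.9.4) and (2.9.5); prob_filt holds $P[s_t = j \mid \psi_t]$, prob_pred holds $P[s_{t+1} = m \mid \psi_t]$, prob_next holds the already smoothed $P[s_{t+1} = m \mid \psi_T]$, and P is the transition matrix with entries $p_{jm}$.

smoothStep <- function(prob_filt, prob_pred, prob_next, P) {
  joint <- P * prob_filt                                # p_jm * P[s_t=j | psi_t]
  joint <- sweep(joint, 2, prob_next / prob_pred, `*`)  # equation (2.9.4)
  list(joint = joint,                                   # P[s_t=j, s_{t+1}=m | psi_T]
       prob  = rowSums(joint))                          # equation (2.9.5)
}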

3 Bayesian Approach

3.1 Bayesian Analysis

Bayesian statistics rests on a simple idea: inference under uncertainty can be successfully expressed only through probability measures. In econometrics the application of Bayesian methods is very old (for further information see Zellner, 1971). The power of Bayesian statistics lies in the prior distribution, which the analyst chooses according to his prior information about the variable. To make clear the difference between the Classical and the Bayesian framework in statistics, we present the Classical linear regression model:
$$Y = X\beta + e, \qquad e \sim N(0, \sigma^2 I_N)$$
where the model is in matrix form, so $Y$ and $e$ are $(N \times 1)$ vectors and $X$ is an $(N \times K)$ matrix; $K$ denotes the number of independent variables. The parameters we have to estimate in this model are the vector $\beta$ and the variance of the model, $\sigma^2$. In the Classical framework $\beta$ and $\sigma^2$ are treated as unknown constants and are commonly estimated by the least squares method, which provides the best linear unbiased estimator of $\beta$. Applying the OLS method we get
$$\hat\beta = (X'X)^{-1}X'Y$$
Taking the estimation errors $\hat e = Y - X\hat\beta$ we can compute the unbiased estimator of the variance:
$$\hat\sigma^2 = \frac{\hat e'\hat e}{N - K}$$

where $N - K$ is the degrees of freedom. In the Classical framework these estimators are treated as random variables, which means that they follow these distributions:
$$\hat\beta \sim N\big(\beta,\, \sigma^2(X'X)^{-1}\big), \qquad \frac{(N-K)\,\hat\sigma^2}{\sigma^2} \sim \chi^2(N-K)$$
The estimates of $\beta$ and $\sigma^2$ are made from the sample data we have on $X$ and $Y$ only. If we had $I$ sets of sample data on $X$, $Y$ we could make $I$ estimates of $\beta$, $\sigma^2$; thus we could have $[(\hat\beta)_1, (\hat\beta)_2, \ldots, (\hat\beta)_I]$ and $[(\hat\sigma^2)_1, (\hat\sigma^2)_2, \ldots, (\hat\sigma^2)_I]$. The behaviour of the OLS estimator of $\beta$ across such repeated samples follows from the property, based on asymptotic theory,
$$\mathrm{plim}_{I \to \infty}\; \frac{1}{I}\sum_{i=1}^{I}(\hat\beta)_i = \beta$$
which means that the OLS estimator $\hat\beta$ gives us on average the correct answer as $I$ tends to infinity. This was the presentation of the estimation of the unknown parameters $\beta$, $\sigma^2$ in the Classical framework.

In contrast with the Classical framework, within a Bayesian framework the parameters of the model we want to estimate, $\theta = [\beta, \sigma^2]$, are treated as random variables. If we treat the parameters as random variables, we have probability distributions for both $\beta$ and $\sigma^2$ without considering the sample data on $Y$ and $X$. These distributions express the knowledge the researcher has about the model's parameters. They can differ substantially from researcher to researcher and depend not only on the existing information about the parameters but also on what the researcher wants to examine and achieve with the model being estimated. As mentioned before, the power of the Bayesian framework is the prior distribution. The prior distribution is the distribution of $\beta$, $\sigma^2$ before any sample data on $Y$, $X$ are observed; this prior is denoted $h(\theta)$. The Bayesian framework treats the parameters $\theta$ as random variables with their own a priori distribution, but this does not mean that the sample data are out of use. Once the sample data $Y$ are observed, we take into consideration the new information from the data and revise the distribution of the parameters. This new distribution is the posterior one, conditioned on the sample data and the prior distribution. The derivation of the posterior distribution, combining the prior distribution with the new information from the sample data $Y$, is possible with the use of Bayes' theorem. First of all, we denote the distribution of the sample observations $Y$ given the parameters $\theta$ by $f(Y \mid \theta)$; secondly, we denote the marginal distribution of the data by $f(Y)$; and thirdly, the joint distribution of the parameters and the data by $j(\theta, Y)$. We want to derive the posterior distribution of the parameters given the data, which is denoted $p(\theta \mid Y)$. For the derivation we start from the joint density of the parameters and the data:
$$j(\theta, Y) = f(Y \mid \theta)h(\theta) = p(\theta \mid Y)f(Y) \qquad (3.1.1)$$

from which we can obtain Bayes' theorem:
$$p(\theta \mid Y) = \frac{f(Y \mid \theta)h(\theta)}{f(Y)}$$
Since the density of the data has no operational significance here, this can be rewritten as
$$p(\theta \mid Y) \propto f(Y \mid \theta)h(\theta) \qquad (3.1.2)$$
Finally, we write the posterior distribution of the parameters in a form that involves the likelihood function $L(\theta \mid Y)$. This can be done by noting the functional equivalence of $f(Y \mid \theta)$ and the likelihood function. Thus we get
$$p(\theta \mid Y) \propto L(\theta \mid Y)h(\theta) \qquad (3.1.3)$$
We are going to present an example.

Example 1: Bayesian inference for $\beta$ and $\sigma^2$. First we want the joint prior distribution for the parameters $\beta$, $\sigma^2$. Assuming a Gamma distribution for the marginal prior of $1/\sigma^2$ and a Gaussian distribution for the conditional prior of $\beta \mid \tfrac{1}{\sigma^2}$, the joint prior density is
$$h\Big(\beta, \frac{1}{\sigma^2}\Big) = h\Big(\beta \,\Big|\, \frac{1}{\sigma^2}\Big)\, h\Big(\frac{1}{\sigma^2}\Big) \qquad (3.1.4)$$
where the distributions of $1/\sigma^2$ and $\beta \mid 1/\sigma^2$ are assumed to be
$$\frac{1}{\sigma^2} \sim \Gamma\Big(\frac{\nu_1}{2}, \frac{\delta_1}{2}\Big), \qquad \beta \,\Big|\, \frac{1}{\sigma^2} \sim N(\beta_1, \Sigma_1)$$
The joint posterior density of $\beta$ and $1/\sigma^2$, in likelihood form, is
$$p\Big(\beta, \frac{1}{\sigma^2} \,\Big|\, Y\Big) \propto L\Big(\beta, \frac{1}{\sigma^2} \,\Big|\, Y\Big)\, h\Big(\beta, \frac{1}{\sigma^2}\Big) = p\Big(\beta \,\Big|\, \frac{1}{\sigma^2}, Y\Big)\, p\Big(\frac{1}{\sigma^2} \,\Big|\, Y\Big) \qquad (3.1.5)$$
Hence, the posterior distribution of $\beta$ conditional on $1/\sigma^2$ and the data $Y$ is
$$\beta \,\Big|\, \frac{1}{\sigma^2}, Y \sim N(\beta_2, \Sigma_2)$$
and the marginal posterior distribution of $1/\sigma^2$ is
$$\frac{1}{\sigma^2} \,\Big|\, Y \sim \Gamma\Big(\frac{\nu_2}{2}, \frac{\delta_2}{2}\Big)$$
where $\beta_1, \beta_2, \Sigma_1, \Sigma_2, \nu_1, \nu_2, \delta_1, \delta_2$ are known: the quantities with subscript 1 come from the prior assumptions and those with subscript 2 by computation. Finally, to make Bayesian inference on $\beta$ we should obtain its marginal posterior density:
$$h(\beta \mid Y) = \int_0^{\infty} h\Big(\beta, \frac{1}{\sigma^2} \,\Big|\, Y\Big)\, d\Big(\frac{1}{\sigma^2}\Big) \qquad (3.1.6)$$
A minimal numerical sketch of these conjugate updates follows.
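Below is a minimal numerical sketch of the conjugate updates in Example 1, with illustrative prior hyperparameters and one assumption worth flagging: the prior covariance of $\beta$ is taken to be scaled by $\sigma^2$, a common conjugate convention that the text leaves implicit. It requires the MASS package for mvrnorm().

library(MASS)
set.seed(1)
n <- 100; X <- cbind(1, rnorm(n)); y <- X %*% c(1, 2) + rnorm(n)

# prior: 1/sigma^2 ~ Gamma(nu1/2, delta1/2), beta | 1/sigma^2 ~ N(b1, sigma^2 * S1)
nu1 <- 2; delta1 <- 2; b1 <- c(0, 0); S1 <- diag(10, 2)

# posterior hyperparameters: "the 1's from assumptions and the 2's by computation"
S2  <- solve(solve(S1) + crossprod(X))
b2  <- S2 %*% (solve(S1) %*% b1 + crossprod(X, y))
nu2 <- nu1 + n
delta2 <- delta1 + sum(y^2) + t(b1) %*% solve(S1) %*% b1 - t(b2) %*% solve(S2) %*% b2

# one posterior draw: first 1/sigma^2, then beta given sigma^2
prec      <- rgamma(1, shape = nu2 / 2, rate = as.numeric(delta2) / 2)
sigma2    <- 1 / prec
beta_draw <- mvrnorm(1, as.numeric(b2), sigma2 * S2)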

The obvious conclusion to be drawn from this example is that with this process we can derive Bayesian inference. The most important steps are to set the prior density and to compute the marginal posterior density. The computation of the marginal posterior distribution usually requires integration with difficult and tricky computations.

Estimation Methods for the Bayesian Approach

The Bayesian approach treats the parameters as random variables, which means that before we observe the sample data these parameters already have a prior distribution, chosen according to the researcher's information and beliefs about the variable. The main idea of this approach lies in the posterior distributions of the parameters, which are derived by combining the prior distribution of each parameter with the information contained in the sample, through the use of Bayes' theorem. As we have seen, we need to compute the marginal posterior distributions of the individual parameters, which involves the integration of the joint posterior distribution of all unknown parameters. For this kind of computation there are many numerical methods in the literature, such as approximating methods (for example, the approximation of the posterior mean by the posterior mode, and the Tierney and Kadane approximation) and Markov Chain Monte Carlo (MCMC) methods (the Metropolis algorithm, Gibbs sampling, slice sampling, multiple-try Metropolis and reversible jump, among others). We are going to look analytically at the Markov Chain Monte Carlo method.

Markov Chain Monte Carlo

One of the most common Markov Chain Monte Carlo methods in the literature is Gibbs sampling. In this section we describe the basic idea and the methodology of Gibbs sampling (for further reading see Geman and Geman, 1984; Gelfand and Smith, 1990; Casella and George, 1993). The basic idea of this method is the approximation of the joint and marginal distributions by sampling from conditional distributions. Assume that we have $n$ different random variables and that the joint density of these variables is known. For Bayesian inference we are interested in the marginal density of every random variable. Denoting the $n$ random variables $[x_1, x_2, \ldots, x_n]$, the equation that gives us the marginal distribution of each random variable is
$$f(x_t) = \int\cdots\int f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_{t-1}\, dx_{t+1} \cdots dx_n \qquad (3.1.7)$$
where $f(x_1, x_2, \ldots, x_n)$ is the known joint density of the $n$ different random variables. Note that in the computation of the marginal density of each variable, $f(x_t)$ for $t = 1, 2, \ldots, n$, we do not integrate with respect to $dx_t$, as can be seen in the above equation. Then again, the joint density of the random variables may be unknown.

The Gibbs sampling method gives us the opportunity to generate a sample from the joint density even if it is unknown and even if we do not know the marginal density $f(x_t)$, $t = 1, 2, \ldots, n$, of each variable. The only densities we need to complete the Gibbs sampling are the conditional densities of the random variables, denoted $f(x_t \mid x_{-t})$ for $t = 1, 2, \ldots, n$, with $x_{-t} = [x_1, x_2, \ldots, x_{t-1}, x_{t+1}, \ldots, x_n]$. The methodology is as follows.

Gibbs Sampling algorithm

To explain the idea of Gibbs sampling we present the following steps:

Step 1: First we set arbitrary initial values $(x_2^0, x_3^0, \ldots, x_n^0)$.
Step 2: We draw $x_1^1$ from the known conditional density $f(x_1 \mid x_2^0, x_3^0, \ldots, x_n^0)$, using the initial values we set in the previous step.
Step 3: We draw $x_2^1$ from the conditional density $f(x_2 \mid x_1^1, x_3^0, \ldots, x_n^0)$.
Step 4: We draw $x_3^1$ from the conditional density $f(x_3 \mid x_1^1, x_2^1, x_4^0, \ldots, x_n^0)$.
...
Step n+1: We draw $x_n^1$ from the conditional density $f(x_n \mid x_1^1, x_2^1, \ldots, x_{n-1}^1)$.

With these steps we complete the first iteration of Gibbs sampling. The first step sets the arbitrary initial values for the random variables (the initialization step is needed only for the first iteration); the next $n$ steps complete a Gibbs iteration and generate a sample $x_1^i, x_2^i, \ldots, x_n^i$. Repeating the last $n$ steps $i$ times we get $i$ different samples. The question is how many iterations are needed to converge to the joint and marginal distributions of $(x_1, x_2, \ldots, x_n)$. It has been proved (see Geman and Geman, 1984) that we have convergence as $i \to \infty$. For more sufficient precision about the joint and marginal distributions of the random variables $(x_1, x_2, \ldots, x_n)$ we can run $L$ iterations, where $L$ is big enough for the Gibbs sampler to have converged; we then set the final values of the $L$th iteration as initial values and continue the algorithm (there are many suggestions about how best to assess the convergence of the Gibbs sampler, such as plotting the estimates of the posterior densities over the Gibbs iterations, as proposed by McCulloch and Rossi (1994), and the proposition of Gelman and Rubin (1992) to try several different sets of starting values for the Gibbs sampler). The number of final iterations, denoted $N$, is chosen by the researcher, since the Gibbs sampler has already converged; we then draw new values of the random variables $(x_1^i, x_2^i, \ldots, x_n^i)$ for $i = L+1, L+2, \ldots, L+N$. A minimal two-variable sketch of the algorithm follows.
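The following R sketch shows the algorithm for $n = 2$: a Gibbs sampler for a bivariate standard normal with correlation $\rho$, a case where both full conditionals are known in closed form.

set.seed(1)
rho <- 0.8
L <- 1000                                     # iterations treated as burn-in
N <- 5000                                     # iterations retained after convergence
draws <- matrix(NA_real_, nrow = L + N, ncol = 2)
x2 <- 0                                       # Step 1: arbitrary initial value
for (i in 1:(L + N)) {
  x1 <- rnorm(1, rho * x2, sqrt(1 - rho^2))   # draw from f(x1 | x2)
  x2 <- rnorm(1, rho * x1, sqrt(1 - rho^2))   # draw from f(x2 | x1)
  draws[i, ] <- c(x1, x2)
}
draws <- draws[-(1:L), ]                      # keep only the last N iterations
colMeans(draws); cor(draws)                   # approximate marginal means and rho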

3.2 Markov Switching Model: Bayesian Approach

In previous sections we presented the Markov Switching model under the Classical approach. Under the Classical framework we estimated the parameters of the model relying only on the sample data; the estimation method we used is a modified maximum likelihood method, the EM algorithm. In this section we present an alternative approach to estimating the Markov Switching model in the Bayesian framework via MCMC methods; the method we use is Gibbs sampling. In the Bayesian analysis both the parameters of the model and the unobserved variable $s_t$, $t = 1, 2, \ldots, T$, are treated as random variables.

Consider a model with Markov switching in the mean and the variance. Let the state variable $s_t$ be governed by a two-state Markov switching process taking values $s_t \in \{0, 1\}$, where $s_t = 0$ in regime 1 and $s_t = 1$ in regime 2. The presentation of the model follows:
$$y_t = \mu_{s_t} + \epsilon_t = \begin{cases} \mu_0 + \epsilon^{(0)}_t, & \epsilon^{(0)}_t \sim N(0, \sigma^2_{(s_t=0)}), \text{ when } s_t = 0\\ \mu_0 + \mu_1 + \epsilon^{(1)}_t, & \epsilon^{(1)}_t \sim N(0, \sigma^2_{(s_t=1)}), \text{ when } s_t = 1 \end{cases} \qquad (3.2.1)$$
where $\mu_{s_t} = \mu_0 + \mu_1 s_t$ and $\sigma^2_{s_t} = \sigma^2_{(s_t=0)}(1 - s_t) + \sigma^2_{(s_t=1)}s_t = \sigma^2_{(s_t=0)}(1 + k_1 s_t)$. Let $P[s_t = 0 \mid s_{t-1} = 0] = q$ and $P[s_t = 1 \mid s_{t-1} = 1] = p$. In this approach we treat all of $\tilde s_T, \mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}, q, p$ as random variables. For Bayesian inference we need to derive the joint posterior density of these random variables:
$$\begin{aligned} f(\tilde s_T, \mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}, q, p \mid \tilde y_T) &= f(\mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}, q, p \mid \tilde y_T, \tilde s_T)\, f(\tilde s_T \mid \tilde y_T)\\ &= f(\mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)} \mid \tilde y_T, \tilde s_T)\, f(q, p \mid \tilde y_T, \tilde s_T)\, f(\tilde s_T \mid \tilde y_T)\\ &= f(\mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)} \mid \tilde y_T, \tilde s_T)\, f(q, p \mid \tilde s_T)\, f(\tilde s_T \mid \tilde y_T) \end{aligned} \qquad (3.2.2)$$
where $\tilde y_T = [y_1, y_2, \ldots, y_T]$ are the observed data and $\tilde s_T = [s_1, s_2, \ldots, s_T]$ the unobserved states. To implement the Gibbs sampling we set arbitrary initial values for the parameters and repeat the following steps until convergence occurs.

Step 1: We generate $s_t$ from $f(s_t \mid \mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}, q, p, \tilde y_T, \tilde s_{-t})$ for $t = 1, 2, \ldots, T$, where $\tilde s_{-t} = [s_1, \ldots, s_{t-1}, s_{t+1}, \ldots, s_T]$. We construct the conditional distribution from which $s_t$ is to be generated:

$$\begin{aligned} f(s_t \mid \tilde y_T; \tilde s_{-t}) &= f(s_t \mid \tilde y_t, y_{t+1}, \ldots, y_T; \tilde s_{-t}) = \frac{f(s_t, y_{t+1}, \ldots, y_T \mid \tilde y_t, \tilde s_{-t})}{f(y_{t+1}, \ldots, y_T \mid \tilde y_t, \tilde s_{-t})}\\ &= f(s_t \mid \tilde y_t, \tilde s_{-t}) = f(s_t \mid \tilde y_{t-1}, y_t, \tilde s_{t-1}, s_{t+1}, \ldots, s_T)\\ &= \frac{f(s_t, y_t, s_{t+1}, \ldots, s_T \mid \tilde y_{t-1}, \tilde s_{t-1})}{f(y_t, s_{t+1}, \ldots, s_T \mid \tilde y_{t-1}, \tilde s_{t-1})}\\ &\propto f(s_t, y_t, s_{t+1}, \ldots, s_T \mid \tilde y_{t-1}, \tilde s_{t-1})\\ &= f(s_t \mid \tilde s_{t-1}, \tilde y_{t-1})\, f(y_t, s_{t+1}, \ldots, s_T \mid s_t, \tilde s_{t-1}, \tilde y_{t-1})\\ &= f(s_t \mid s_{t-1})\, f(y_t, s_{t+1}, \ldots, s_T \mid s_t, \tilde s_{t-1}, \tilde y_{t-1}) \end{aligned} \qquad (3.2.3)$$
where the last equality uses the Markov property. But
$$\begin{aligned} f(y_t, s_{t+1}, \ldots, s_T \mid s_t, \tilde s_{t-1}, \tilde y_{t-1}) &= f(y_t \mid s_t, \tilde s_{t-1}, s_{t+1}, \ldots, s_T, \tilde y_{t-1})\, f(s_{t+1}, \ldots, s_T \mid s_t, \tilde s_{t-1}, \tilde y_{t-1}, y_t)\\ &= f(y_t \mid s_t)\, f(s_{t+1} \mid s_t, \tilde s_{t-1}, \tilde y_{t-1})\, f(s_{t+2}, \ldots, s_T \mid s_{t+1}, s_t, \tilde s_{t-1}, \tilde y_{t-1}, y_t)\\ &= f(y_t \mid s_t)\, f(s_{t+1} \mid s_t)\, f(s_{t+2}, \ldots, s_T \mid s_{t+1})\\ &\propto f(y_t \mid s_t)\, f(s_{t+1} \mid s_t) \end{aligned} \qquad (3.2.4)$$
From these equations we derive
$$f(s_t \mid \tilde y_T; \tilde s_{-t}) \propto f(s_t \mid s_{t-1})\, f(y_t \mid s_t)\, f(s_{t+1} \mid s_t) \qquad (3.2.5)$$
where $f(s_t \mid s_{t-1})$ and $f(s_{t+1} \mid s_t)$ are given by the transition probabilities $q$, $p$, and
$$f(y_t \mid s_t) = \frac{1}{\sqrt{2\pi\sigma^2_{s_t}}}\exp\Big(-\frac{(y_t - \mu_{s_t})^2}{2\sigma^2_{s_t}}\Big)$$
Hence, using the above equations, we get
$$P[s_t = j \mid \tilde y_T, \tilde s_{-t}] = \frac{f(s_t = j \mid \tilde y_T, \tilde s_{-t})}{\sum_{j=0}^{1} f(s_t = j \mid \tilde y_T, \tilde s_{-t})} \qquad (3.2.6)$$

Step 2: We generate the transition probabilities conditional on $\tilde s_T = [s_1, s_2, \ldots, s_T]$ generated in the previous step. The transition probabilities are independent of $\tilde y_T$ and the parameters of the model. For the prior distributions of $q$, $p$ we assume independent beta distributions (a beta distribution, $x \sim \mathrm{beta}(\alpha_0, \alpha_1)$, depends on two positive parameters; its density is proportional to $x^{\alpha_0 - 1}(1 - x)^{\alpha_1 - 1}$ for $0 < x < 1$ and zero otherwise, with $E(x) = \alpha_0/(\alpha_0 + \alpha_1)$ and $\mathrm{Var}(x) = \alpha_0\alpha_1/[(\alpha_0 + \alpha_1)^2(\alpha_0 + \alpha_1 + 1)]$).

Prior:
$$q \sim \mathrm{beta}(\alpha_{00}, \alpha_{01}) \qquad (3.2.7)$$

$$p \sim \mathrm{beta}(\alpha_{11}, \alpha_{10}) \qquad (3.2.8)$$
The joint prior distribution of the transition probabilities $q$, $p$ is
$$f(q, p) \propto q^{\alpha_{00}-1}(1-q)^{\alpha_{01}-1}\, p^{\alpha_{11}-1}(1-p)^{\alpha_{10}-1} \qquad (3.2.9)$$
Hence, the likelihood function for $q$ and $p$ is
$$L(q, p \mid \tilde s_T) = q^{v_{00}}(1-q)^{v_{01}}\, p^{v_{11}}(1-p)^{v_{10}} \qquad (3.2.10)$$
where $v_{ij}$ denotes the number of transitions from state $i$ to state $j$, which can be counted for a given $\tilde s_T = [s_1, s_2, \ldots, s_T]$. With the prior distribution and the likelihood function we can derive the posterior distribution.

Posterior:
$$\begin{aligned} p(q, p \mid \tilde s_T) &\propto f(q, p)\, L(q, p \mid \tilde s_T)\\ &\propto p^{\alpha_{11}-1}(1-p)^{\alpha_{10}-1}\, q^{\alpha_{00}-1}(1-q)^{\alpha_{01}-1}\, q^{v_{00}}(1-q)^{v_{01}}\, p^{v_{11}}(1-p)^{v_{10}}\\ &= q^{\alpha_{00}+v_{00}-1}(1-q)^{\alpha_{01}+v_{01}-1}\, p^{\alpha_{11}+v_{11}-1}(1-p)^{\alpha_{10}+v_{10}-1} \end{aligned} \qquad (3.2.11)$$
Thus, the posterior distribution for the transition probabilities given $\tilde s_T$ is given by two independent beta distributions, from which we draw $q$ and $p$:
$$q \mid \tilde s_T \sim \mathrm{beta}(\alpha_{00}+v_{00},\; \alpha_{01}+v_{01}) \qquad (3.2.12)$$
$$p \mid \tilde s_T \sim \mathrm{beta}(\alpha_{11}+v_{11},\; \alpha_{10}+v_{10}) \qquad (3.2.13)$$
A minimal code sketch of this step follows.
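The sketch below illustrates Step 2 with an illustrative state path and illustrative prior hyperparameters; in the sampler the path would be the current draw of $s_1, \ldots, s_T$ from Step 1.

set.seed(1)
s <- rbinom(200, 1, 0.5)                       # placeholder for the generated path
trans <- table(factor(head(s, -1), 0:1), factor(tail(s, -1), 0:1))
v00 <- trans["0", "0"]; v01 <- trans["0", "1"] # transition counts out of state 0
v10 <- trans["1", "0"]; v11 <- trans["1", "1"] # transition counts out of state 1
a00 <- a01 <- a10 <- a11 <- 2                  # weakly informative beta priors
q <- rbeta(1, a00 + v00, a01 + v01)            # draw from (3.2.12)
p <- rbeta(1, a11 + v11, a10 + v10)            # draw from (3.2.13)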

Step 3: We generate $\mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}$ from $f(\mu_0, \mu_1, \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)} \mid \tilde s_T, \tilde y_T)$.

We first generate $\mu_0, \mu_1$ conditional on $\sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}, \tilde s_T, \tilde y_T$. Rewriting the equation $y_t = \mu_0 + \mu_1 s_t + \epsilon_t$, with each observation divided by $\sigma_{s_t}$, in matrix notation:
$$Y^* = X^*\mu + E \qquad (3.2.14)$$
where $E \sim N(0, I_T)$ and $\mu = [\mu_0, \mu_1]'$.

Prior:
$$\mu \mid \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)} \sim N(m_0, M_0) \qquad (3.2.15)$$
where we assume that the matrices $m_0$, $M_0$ are known.

Posterior:
$$\mu \mid \sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}, \tilde s_T, \tilde y_T \sim N(m_1, M_1) \qquad (3.2.16)$$
where
$$m_1 = (M_0^{-1} + X^{*\prime}X^*)^{-1}(M_0^{-1}m_0 + X^{*\prime}Y^*), \qquad M_1 = (M_0^{-1} + X^{*\prime}X^*)^{-1}$$
From this posterior distribution we can draw $\mu$.

Next we generate $\sigma^2_{(s_t=0)}, \sigma^2_{(s_t=1)}$ conditional on $\mu_0, \mu_1, \tilde s_T, \tilde y_T$. We write the variance as
$$\sigma^2_{s_t} = \sigma^2_{(s_t=0)}(1 - s_t) + \sigma^2_{(s_t=1)}s_t = \sigma^2_{(s_t=0)}(1 + k_1 s_t) \qquad (3.2.17)$$
where $k_1 > 0$. First we generate $\sigma^2_{(s_t=0)}$ conditional on $k_1$, and then $\bar k_1 = 1 + k_1$ conditional on $\sigma^2_{(s_t=0)}$. To do so we divide $y_t = \mu_0 + \mu_1 s_t + \epsilon_t$ by $\sqrt{1 + k_1 s_t}$ and get
$$y^*_t = \mu_0 z_{0t} + \mu_1 z_{1t} + \epsilon^*_t, \qquad \epsilon^*_t \overset{i.i.d.}{\sim} N(0, \sigma^2_{(s_t=0)}) \qquad (3.2.18)$$
where
$$y^*_t = \frac{y_t}{\sqrt{1 + k_1 s_t}}, \qquad z_{0t} = \frac{1}{\sqrt{1 + k_1 s_t}}, \qquad z_{1t} = \frac{s_t}{\sqrt{1 + k_1 s_t}}, \qquad \epsilon^*_t = \frac{\epsilon_t}{\sqrt{1 + k_1 s_t}}$$
For the prior of $\sigma^2_{(s_t=0)}$ we assume an inverted Gamma distribution.

Prior:
$$\sigma^2_{(s_t=0)} \mid k_1, \mu_0, \mu_1 \sim IG\Big(\frac{b_0}{2}, \frac{d_0}{2}\Big) \qquad (3.2.19)$$
with the assumption that $b_0$, $d_0$ are known.

Posterior:
$$\sigma^2_{(s_t=0)} \mid k_1, \mu_0, \mu_1, \tilde s_T, \tilde y_T \sim IG\Big(\frac{b_1}{2}, \frac{d_1}{2}\Big) \qquad (3.2.20)$$

where $b_1 = b_0 + T$ and $d_1 = d_0 + \sum_{t=1}^{T}(y^*_t - \mu_0 z_{0t} - \mu_1 z_{1t})^2$.

We next want to generate $\bar k_1 = 1 + k_1$ conditional on $\sigma^2_{(s_t=0)}$, and to do so we divide the equation $y_t = \mu_0 + \mu_1 s_t + \epsilon_t$ by $\sigma_{(s_t=0)}$ and get
$$y^{**}_t = \mu_0 z^{**}_{0t} + \mu_1 z^{**}_{1t} + \epsilon^{**}_t, \qquad \epsilon^{**}_t \sim N(0,\, 1 + k_1 s_t) \qquad (3.2.21)$$
where
$$y^{**}_t = \frac{y_t}{\sigma_{(s_t=0)}}, \qquad z^{**}_{0t} = \frac{1}{\sigma_{(s_t=0)}}, \qquad z^{**}_{1t} = \frac{s_t}{\sigma_{(s_t=0)}}, \qquad \epsilon^{**}_t = \frac{\epsilon_t}{\sigma_{(s_t=0)}}$$
Now we shall provide an appropriate posterior distribution for $\bar k_1$. Let the prior distribution of $\bar k_1$ be the following.

Prior:
$$\bar k_1 \mid \sigma^2_{(s_t=0)}, \mu_0, \mu_1 \sim IG\Big(\frac{b_2}{2}, \frac{d_2}{2}\Big) \qquad (3.2.22)$$
where $b_2$, $d_2$ are known.

Posterior:
$$\bar k_1 \mid \sigma^2_{(s_t=0)}, \mu_0, \mu_1, \tilde s_T, \tilde y_T \sim IG\Big(\frac{b_3}{2}, \frac{d_3}{2}\Big) \qquad (3.2.23)$$
where $b_3 = b_2 + T$ and $d_3 = d_2 + \sum_{t \in T_1}(y^{**}_t - \mu_0 z^{**}_{0t} - \mu_1 z^{**}_{1t})^2$, with $T_1 = \{t : s_t = 1\}$. Finally, after generating $\bar k_1$ from the above posterior, we can calculate $\sigma^2_{(s_t=1)} = \sigma^2_{(s_t=0)}(1 + k_1)$.

Example

For the sake of simplicity we present a simple example of how Gibbs sampling works in a Markov Switching model, passing over most of the complex processes discussed above. The Markov Switching model is an AR(2) with a switch in the mean:
$$y_t = \gamma_1 + \gamma_2 s_t + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \epsilon_t \qquad (3.2.24)$$

where $s_t$ is the state, governed by a two-state Markov process with the transition matrix
$$P = \begin{pmatrix} p_{00} & p_{01}\\ p_{10} & p_{11} \end{pmatrix}$$
and the $\epsilon_t$ are i.i.d. random variables with mean zero and variance $\sigma^2_\epsilon$. The state variable takes the value $s_t = 0$ when regime 1 occurs and $s_t = 1$ when regime 2 occurs. The vector of parameters of model (3.2.24) is
$$\theta = (\gamma_1, \gamma_2, \beta_1, \beta_2, \sigma^2_\epsilon, p_{00}, p_{11}, \tilde s_T)$$
Let $I_t = \{i_1, i_2, \ldots, i_t\}$ denote the information from all the observed variables up to time $t$, which is the information set; thus $I_T = \{i_1, i_2, \ldots, i_T\}$ is the information set based on the full sample. The parameter vector $\theta$ can be classified into $k$ groups:
$$\theta = (\theta_1', \theta_2', \ldots, \theta_k')'$$
Assuming that the observed data of the whole sample, $I_T$, are given, let $p(\theta_j \mid I_T, \theta_m)$, with $j = 1, 2, \ldots, k$ and $m \neq j$, denote the conditional posterior distribution of $\theta_j$ in the Bayesian analysis. To derive the conditional posterior distributions we need the prior distributions of the parameters (which can be drawn from a Dirichlet process) and the likelihood functions. We set random initial values for the $k$ groups:
$$\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_k^{(0)})$$
The $j$th realization of the parameter vector $\theta$ is obtained via Gibbs sampling with the following procedure:

1. We draw randomly a realization of $\theta_1$ from the conditional posterior distribution $\theta_1^{(j)} \sim p(\theta_1 \mid I_T, \theta_2^{(j-1)}, \ldots, \theta_k^{(j-1)})$.
2. We draw randomly a realization of $\theta_2$ from the conditional posterior distribution $\theta_2^{(j)} \sim p(\theta_2 \mid I_T, \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_k^{(j-1)})$.
3. We draw randomly a realization of $\theta_3$ from the conditional posterior distribution $\theta_3^{(j)} \sim p(\theta_3 \mid I_T, \theta_1^{(j)}, \theta_2^{(j)}, \theta_4^{(j-1)}, \ldots, \theta_k^{(j-1)})$.
4. We continue the process to draw $\theta_4^{(j)}, \ldots, \theta_k^{(j)}$.

The $j$th realization of the vector $\theta$ is
$$\theta^{(j)} = (\theta_1^{(j)}, \theta_2^{(j)}, \ldots, \theta_k^{(j)})$$
With this procedure we have completed one iteration of the Gibbs sampler. Repeating this procedure $N$ times we get the Gibbs sequence $(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)})$. Using the Gibbs sequence we can compute $N$ draws from the conditional posterior distributions of each $\theta_j$ (for example, the conditional posterior distributions of $\theta_1$ are $p(\theta_1 \mid I_T, \theta_2^{(j)}, \ldots, \theta_k^{(j)})$, $j = 1, 2, \ldots, N$). Commonly, for better convergence, we drop the first $N_1$ estimates of the Gibbs sequence and keep the remaining $N_2$ estimates, where $N_1 + N_2 = N$. The Gibbs sequence converges exponentially fast to the true distribution of $\theta$:
$$\theta^{(N)} \overset{D}{\to} p(\theta \mid I_T) \quad \text{as } N \to \infty$$
In exactly the same way each sub-vector converges in distribution to the true marginal distribution of $\theta_j$, $\theta_j^{(N)} \overset{D}{\to} \theta_j$. Also, for every measurable function $f$,
$$\frac{1}{N}\sum_{j=1}^{N} f(\theta^{(j)}) \overset{a.s.}{\to} E[f(\theta)]$$
where we have almost sure convergence (for further reading on Gibbs sampling in Markov Switching models see Gelfand and Smith (1990), Greenberg et al. (1996) and Kim (1996)). The sample average of the Gibbs sequence is the desired estimate of the unknown parameters.

3.3 State Space Models with Markov Switching: Bayesian Approach

We have seen in previous sections the state-space model framework for Markov switching under the Classical approach. Now we briefly present the Bayesian approach to state-space models with Markov switching (for further reading see Kim and Nelson, 1998). First, consider a simple state-space model with Markov switching as follows.

Measurement equation:
$$y_t = H_{s_t}b_t + A_{s_t}z_t + \epsilon_t \qquad (3.3.1)$$
Transition equation:
$$b_t = \mu_{s_t} + F_{s_t}b_{t-1} + v_t \qquad (3.3.2)$$

with
$$\epsilon_t \overset{i.i.d.}{\sim} N(0, R_{s_t}), \qquad v_t \overset{i.i.d.}{\sim} N(0, Q_{s_t}), \qquad E(\epsilon_t v_t' \mid s_t) = 0$$
where $H_{s_t}, A_{s_t}, R_{s_t}, \mu_{s_t}, F_{s_t}, Q_{s_t}$ are the hyper-parameters of the model, all dependent on the unobserved variable $s_t$, which is governed by a Markov process; $y_t$ is a vector of observed time series, $b_t$ a vector of unobserved state variables and $z_t$ a vector of exogenous or lagged dependent variables. The conditional structure of the model allows inference via Gibbs sampling (an MCMC method) with the following steps:

Step 1: We set random initial values for the hyper-parameters $H_{s_t}, A_{s_t}, R_{s_t}, \mu_{s_t}, F_{s_t}, Q_{s_t}$ of the model.

Step 2: We generate the unobserved vectors $\tilde b_T = (b_1, b_2, \ldots, b_T)$ from
$$p(\tilde b_T \mid \tilde y_T, \tilde s_T) = p(b_T \mid \tilde y_T, \tilde s_T)\prod_{t=1}^{T-1} p(b_t \mid \tilde y_t, \tilde s_t, b_{t+1})$$
Thus, we generate $\tilde b_T$ conditional on $\tilde y_T = [y_1, y_2, \ldots, y_T]$ and $\tilde s_T = [s_1, s_2, \ldots, s_T]$.

Step 3: We generate $\tilde s_T = [s_1, s_2, \ldots, s_T]$ from
$$p(\tilde s_T \mid \tilde y_T, \tilde b_T) = p(s_T \mid \tilde y_T, \tilde b_T)\prod_{t=1}^{T-1} p(s_t \mid \tilde y_t, \tilde b_t, s_{t+1})$$

Step 4: We generate the hyper-parameters $H_{s_t}, A_{s_t}, R_{s_t}, \mu_{s_t}, F_{s_t}, Q_{s_t}$ of the model conditional on the observed data, $\tilde b_T$ and $\tilde s_T$.

The above densities are derived from a bivariate regression model with a common Markov-switching variable.

4 Applications

4.1 Applications of the Markov Switching Model with R-programming

In this section we implement the Markov Switching model. We have two applications to present: the first is an implementation of a simple Markov Switching model of mean and variance on India's real GDP, and the second is the implementation of a Markov Switching autoregressive model on the Dow Jones Index.

4.1.1 An Application for Indian GDP

We present a simple example of the Markov Switching model fitted to the logarithm of India's GDP. The data we used were taken from FRED (series NAEXKP01INQ657S) and are quarterly, from 01/04/2005 to 01/07/2014. The time series plot of the logarithmic GDP of India follows.

Fig. 4.1: Plot of the log GDP of India

Looking at the plot we can see a tremendous structural break in the GDP time series. This break is the effect of the global financial crisis of 2008. We are going to implement a simple Markov Switching model with switches in the mean and the variance. The theoretical model is
$$y_t = c_{s_t} + \epsilon_{s_t,t}$$
where $y_t$ is the log GDP, $c_{s_t}$ is the mean of log GDP governed by a Markov switching process, and $s_t$ denotes the unobserved state variable.

Before implementing the Markov Switching model we ran a linearity test for the mean of log GDP, the Teraesvirta neural network test [41]. This test uses a Taylor series expansion of the activation function to arrive at a suitable test statistic. The hypotheses of the test are:

H0: linearity in the log GDP mean
H1: non-linearity in the log GDP mean

The results from R:

Teraesvirta Neural Network Test
data: tsgdp
X-squared = , df = 2, p-value =

From the results, the p-value indicates rejection of the null hypothesis of linearity at the 0.01 and 0.05 significance levels, since the p-value is smaller than both. Thus we have serious evidence (A.1) of non-linearity in the mean. A hedged sketch of the test call follows.
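The test output above is consistent with the terasvirta.test() function from the tseries package; the sketch below assumes tsgdp holds the quarterly log GDP as a ts object, since the exact script is not reproduced in the text.

library(tseries)
terasvirta.test(tsgdp)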

We fit the data to a Markov Switching model using R. We use the EM algorithm for the estimation of the drift term and the unobserved variable, and the Hansen test for the determination of the number of states; the Hansen test showed evidence for a two-state model against the three-state one. The results we obtained from R follow.

Markov Switching Model

Call: msmFit(object = model, k = 2, sw = c(TRUE, TRUE))

     AIC     BIC     logLik

Coefficients:
          (Intercept)(S)   Std(S)
Model 1
Model 2

Transition probabilities:
           Regime 1   Regime 2
Regime 1
Regime 2

The estimated model is
$$y_t = \begin{cases} -0.22, & \text{for } s_t = 0 \text{ (Regime 1)}\\ \phantom{-}0.55, & \text{for } s_t = 1 \text{ (Regime 2)} \end{cases} \qquad (4.1.1)$$
where $s_t$ is the state variable governed by a Markov switching process with the following transition matrix:
$$P = (\;\;) \qquad (4.1.2)$$
A hedged sketch of the fitting code follows.
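The sketch below assumes the MSwM package and a data frame gdp whose column lngdp holds the log GDP; the printed output above is consistent with msmFit() applied to an intercept-only linear model, but the exact script is not reproduced in the text.

library(MSwM)
base  <- lm(lngdp ~ 1, data = gdp)                # intercept-only linear model
msmod <- msmFit(base, k = 2, sw = c(TRUE, TRUE))  # switching mean and variance
summary(msmod)
plotProb(msmod, which = 2)                        # smoothed regime probabilities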

We computed several statistics in R; the summary follows:

Markov Switching Model

Call: msmFit(object = model, k = 2, sw = c(TRUE, TRUE))

     AIC     BIC     logLik

Coefficients:

Regime 1
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)(S)

Residual standard error:
Multiple R-squared: 0

Standardized Residuals:
   Min    Q1    Med    Q3    Max

Regime 2
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)(S)                                    e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:
Multiple R-squared: 0

Standardized Residuals:
   Min    Q1    Med    Q3    Max

The R code detects two different regimes, one with high volatility and low mean (Regime 1) and one with low volatility and higher mean (Regime 2). In Regime 1 we have a negative mean for the log GDP of India, with no statistical significance. The reason for the negative value of the mean is that we took logarithms of the GDP and, as we can see from figure (4.1), we have abrupt changes in the values of GDP due to the financial crisis period; the values of India's GDP in the crisis period were extremely low, so real GDP in the crisis period was close to zero. Looking at the results for Regime 2, we have a positive mean which is statistically significant at every significance level $\alpha$. We could say that Regime 1 represents the recessions of India's economy in GDP terms and Regime 2 the neutral periods of India's economy in GDP terms. In contrast, the simple linear model of log GDP presented in Appendix (A.2) has a positive coefficient of 0.42, which is very close to the Regime 2 coefficient of 0.55; hence the linear model fails to capture the recessions of India's economy. The following plots show how the MSM detects the switching.

Fig. 4.2: Plot of the smoothed probabilities for Regime 1

Fig. 4.3: Plot of the smoothed probabilities for Regime 2

As we can see in figure (4.2), Regime 1 represents the period of the financial crisis, and Regime 2 the period of lower volatility in the economy, as

shown in figure (4.3). To check the consistency of the Markov Switching model we ran some diagnostic tests in R. The Q-Q plots are in Appendix (A.3) and the ACF-PACF plots in Appendix (A.4), for both regimes. The results do not indicate strong autocorrelation in the residuals (the first-order autocorrelation is very close to the statistical significance levels and may be present); thus the Markov Switching model with a switch in the mean and variance fits the data much better than the simple linear model, but it is not the best MSM for interpreting the data. Since we have evidence of autocorrelation in the log GDP, we should apply a Markov Switching model that allows autoregressive terms. The results we obtained from R for the Markov Switching model with autoregressive terms (we chose two lags using the AIC criterion) are:

Markov Switching Model

Call: msmFit(object = model, k = 2, sw = c(TRUE, TRUE, TRUE, TRUE), p = 2)

     AIC     BIC     logLik

Coefficients:
          (Intercept)(S)   lngdp_1(S)   lngdp_2(S)   Std(S)
Model 1
Model 2

Transition probabilities:
           Regime 1   Regime 2
Regime 1
Regime 2

Thus the new model for the log GDP of India has the form
$$y_t = \begin{cases} c_1 + \phi_{1,1}\, y_{t-1} + \phi_{2,1}\, y_{t-2}, & \text{for } s_t = 0 \text{ (Regime 1)}\\ c_2 + \phi_{1,2}\, y_{t-1} + \phi_{2,2}\, y_{t-2}, & \text{for } s_t = 1 \text{ (Regime 2)} \end{cases} \qquad (4.1.3)$$
with the regime-specific coefficients as reported in the output above, where $y_t$ is the log GDP of India. The transition matrix has changed and is
$$P = (\;\;) \qquad (4.1.4)$$
where Regime 1 is the high-volatility and Regime 2 the low-volatility regime. From the transition matrix we can see the high probability of staying in Regime 2, given the value of $p_{22}$. It is logical to stay longer in Regime 2

(the low-volatility regime) because of the stability of the log GDP, except for the period of the 2008 financial crisis (Regime 1), where we have abrupt changes in the values of GDP. The statistics we obtained from R for the model follow:

Markov Switching Model

Call: msmFit(object = model, k = 2, sw = c(TRUE, TRUE, TRUE, TRUE), p = 2)

     AIC     BIC     logLik

Coefficients:

Regime 1
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)(S)                                    e-05 ***
lngdp_1(S)                                        e-11 ***
lngdp_2(S)                                              **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:
Multiple R-squared:

Standardized Residuals:
   Min    Q1    Med    Q3    Max

Regime 2
               Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)(S)                                    e-06 ***
lngdp_1(S)                                    < 2.2e-16 ***
lngdp_2(S)                                        e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:
Multiple R-squared:

Standardized Residuals:
   Min    Q1    Med    Q3    Max

Looking at the results provided by R, we can see that the coefficients of Regime 2 are all statistically significant at every significance level α. This differs from the previous MSM with a switch only in the mean and the variance, where the Regime 1 coefficient was not statistically significant. Now R² = 0.79, which indicates that for Regime 2 the model explains the volatility quite well. For Regime 1 both coefficients are again statistically significant at every significance level α, and R² = 0.88, which is even higher than for Regime 2. Overall, the Markov Switching model seems to fit the data of India's GDP quite well. With the inclusion of the lagged log GDP in the model there is a slight change in the detection of the regimes. We can see that change in the following figures:

Fig. 4.4: Plot of the smoothed probabilities for Regime 1

Fig. 4.5: Plot of the smoothed probabilities for Regime 2

We present a further figure of the smoothed probabilities for each regime in Appendix (A.5). Finally, we ran diagnostics for model (4.1.3); the Q-Q plots (A.6) and the ACF-PACF plots (A.7) of the residuals of each regime test whether they satisfy white-noise properties. The ACF-PACF results from R show evidence of first-order autocorrelation in the residuals of both regimes, but beyond the first lag the autocorrelation disappears. The Q-Q plots indicate that the residuals follow the Normal distribution. Thus, we may treat the residuals of the MSM as white noise.

Conclusions

In this section we implemented a simple form of the Markov Switching model, with the mean and the variance as switching parameters, on the logarithm of India's real GDP. The simple linear model with only a constant could not capture the movement of the log GDP because of abrupt changes, such as those due to the financial crisis of 2008, which caused radical changes in India's GDP. These changes indicate a non-linear process for the GDP, which a simple linear model cannot capture. Thus, we used the simple form of the Markov Switching model of the mean and the variance. The results from R show that the model identifies the two regimes of low and high volatility well enough; likewise, the MSM fits the data much better than the linear model. The reason we chose this simple example is to show how the model fits and interprets the data. Another key thing to remember is that a simple model whose only parameter is the constant (the mean) can hardly capture the volatility of the log GDP; for that reason, after implementing the Markov Switching model of the mean and the variance, we chose a slightly more complicated model that involves lagged values of the log GDP.

The number of lagged terms was chosen using the AIC criterion, which indicates two lags. All things considered, the last MSM we implemented fits the data in a more flexible way, and the interpretation of the volatility of the log GDP is improved.

An Application for DJIA Index

In this example we fit a Markov Switching model to the Dow Jones Industrial Average Index. The data were taken from FRED 2 and are weekly, from 09/02/2005 to 08/21/2015. We took the logarithm of the index and created a time series plot.

Fig. 4.6: Plot of the log Dow Jones index

As we can see in figure (4.6), the logarithm of the Dow Jones Index seems to show large variations in its values through time. The model we choose to investigate the time series is an AR(2); this selection was based on the AIC criterion 27. The model is as follows:

    dj_t = c + β_1 dj_{t-1} + β_2 dj_{t-2} + ε_t    (4.1.5)

where dj_t is the logarithm of the Dow Jones index.

2 The data are obtained from FRED:
27 AIC denotes the Akaike Information Criterion, which is a measure of the relative quality of econometric models for a given data sample. AIC estimates the information that a model loses; accordingly, we prefer the model with the smallest AIC value. The formula of the Akaike Information Criterion is AIC = 2k − 2 ln(L), where k is the number of estimated parameters in the model and L is the maximized value of the likelihood function.
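As an illustration of this lag selection, the following R sketch compares AR(p) fits by AIC; the series name logdj is an assumption.

    # Sketch: compare AR(p) fits for the weekly log Dow Jones series by AIC
    aics <- sapply(1:4, function(p) AIC(arima(logdj, order = c(p, 0, 0))))
    names(aics) <- paste0("AR(", 1:4, ")")
    print(aics)   # the smallest AIC indicates the preferred lag order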

The idea of the Markov Switching model is to describe a series with radical structural breaks that occur more than once; the Markov process accounts for these switches. Before implementing the MSM we should test whether there are any structural breaks in the AR(2) model. For this hypothesis test we used the standard CUSUM test 28. The hypotheses we put to the test are:

    H_0: β stable over time t = (1, 2, ..., T)
    H_1: β not stable over time t = (1, 2, ..., T),

where β = (β_1, β_2). The statistic we implement in R is the following:

    CUSUM_t = ( Σ_{i=k+1}^{t} ŵ_i ) / σ̂_w,    σ̂²_w = (1/(T − k)) Σ_{t=k+1}^{T} (ŵ_t − w̄)²,

where t is the time at which we compute the recursive least squares (RLS) estimate β̂_t; hence we have the T − k estimates (β̂_{k+1}, ..., β̂_T) for t = (k + 1, ..., T). ŵ_t denotes the recursive residual 29 of the above regression and is defined as follows:

    ŵ_t = v_t / √Z_t = ( dj_t − x_t' β̂_{t-1} ) / √Z_t

where x_t = [dj_{t-1}, dj_{t-2}]' is the vector of regressors and Z_t = σ̂² [ 1 + x_t' (X_{t-1}' X_{t-1})^{-1} x_t ], with X_{t-1} the matrix of regressors up to time t − 1. We applied the standard CUSUM test to model (4.1.5) in R and obtained the following results. To test the hypothesis H_0 we create an empirical fluctuation process, which gives the plot in figure (4.7). As we can see in that plot, there are many structural breaks in model (4.1.5) with respect to the parameters β = [β_1, β_2]. After the plot we run the Recursive CUSUM test, and the results from R are as follows:

    Recursive CUSUM test
    data: logdj.cus
    S = , p-value = 1.5e-11

With these results, since the p-value = 1.5e-11, we can reject the null hypothesis H_0 at any significance level 30. Thus we have evidence of structural breaks in the model through time.

28 The CUSUM (cumulative sum) test is a sequential analysis technique for detecting structural breaks in the parameters of the model.
29 The ŵ_t are recursive Chow forecast t-statistics. For further reading on the recursive CUSUM test used here, see Brown, Durbin and Evans (1975).
30 The significance levels we test are α = 0.01, ...
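Both this test and the OLS-based MOSUM variant used below are available in the strucchange package; a minimal sketch, assuming a data frame dj.data with columns dj, dj1 and dj2 holding the log index and its two lags:

    # Sketch: recursive CUSUM and OLS-based MOSUM tests with strucchange
    library(strucchange)
    logdj.cus <- efp(dj ~ dj1 + dj2, type = "Rec-CUSUM", data = dj.data)
    plot(logdj.cus)     # empirical fluctuation process, cf. figure (4.7)
    sctest(logdj.cus)   # test statistic S and p-value

    logdj.mos <- efp(dj ~ dj1 + dj2, type = "OLS-MOSUM", data = dj.data)
    plot(logdj.mos)     # cf. Appendix (B.1)
    sctest(logdj.mos)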

Fig. 4.7: Plot of the Recursive CUSUM test

These structural breaks are caused by changes in the parameters β = [β_1, β_2]. The last thing we want to know before fitting the Markov Switching model is whether there are also breaks caused by the drift term c. We ran the Recursive CUSUM test with the hypotheses:

    H_0: c stable over time t = (1, 2, ..., T)
    H_1: c not stable over time t = (1, 2, ..., T).

The results we get from R were:

    Recursive CUSUM test
    data: logdj1.cus
    S = , p-value < 2.2e-16

As before, since the p-value < 2.2e-16, we reject H_0 at any significance level. Thus, we also have structural breaks caused by c. We run the same hypothesis test with the OLS-based MOSUM (see the sketch above), and the results are presented in Appendix (B.1). Taking the results of the hypothesis tests into consideration, we apply the Markov Switching model with switching in all the parameters of the model plus the variance. The fitted Markov Switching model 31 is:

    dj_t = c_{s_t} + β_{1,s_t} dj_{t-1} + β_{2,s_t} dj_{t-2} + ε_t,  for t = (1, 2, ..., T)    (4.1.6)

where s_t denotes the state.

31 The model is based on the Markov Switching presentation of Hamilton (1989, 1990).
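A sketch of how model (4.1.6) can be handed to MSwM follows; the lag construction and the object names are illustrative assumptions.

    # Sketch: two-state switching AR(2) for the log index, switching in the
    # intercept, both slope coefficients and the variance
    library(MSwM)
    n <- length(logdj)
    dj.data <- data.frame(dj  = logdj[3:n],
                          dj1 = logdj[2:(n - 1)],
                          dj2 = logdj[1:(n - 2)])
    model  <- lm(dj ~ dj1 + dj2, data = dj.data)
    msm.dj <- msmFit(model, k = 2, sw = c(TRUE, TRUE, TRUE, TRUE))
    summary(msm.dj)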

The results we get from R are the following:

    Markov Switching Model

    Call: msmFit(object = model, k = 2, sw = c(T, T, T, T))

    [AIC, BIC and logLik values, followed by the coefficients (Intercept)(S), logdj1(S), logdj2(S) and Std(S) for Model 1 and Model 2, and the transition probability matrix over Regimes 1 and 2]

The R code we run detects two different regimes 32, which can be described as the low volatility Regime 1 (s_t = 0) and the high volatility Regime 2 (s_t = 1). All estimates are obtained under the classical approach with the EM algorithm. Thus, the estimated model is:

    dj_t = c^(0) + β_1^(0) dj_{t-1} + β_2^(0) dj_{t-2} + ε_t^(0),  σ^(0) = 0.012,  for s_t = 0
    dj_t = c^(1) + β_1^(1) dj_{t-1} + β_2^(1) dj_{t-2} + ε_t^(1),  σ^(1) = 0.037,  for s_t = 1    (4.1.7)

with the transition matrix:

    P = ( p_11  p_12 )
        ( p_21  p_22 )    (4.1.8)

We can see from the transition probabilities that once we enter one of the two regimes it is difficult to exit from it, since the probabilities p_11 and p_22 are very high. To check the consistency of the model we apply several tests in R. The results are the following:

32 The likelihood ratio test of Hansen (1992) provides evidence in favour of a two-state Markov Switching model. The three-state model fails the test against the two-state one; thus we choose to use the two-state model as detected by R.

    Markov Switching Model

    Call: msmFit(object = model, k = 2, sw = c(T, T, T, T))

    [Regime 1 coefficient table: (Intercept)(S) significant at the 1% level (**), logdj1(S) and logdj2(S) significant with p-values < 2.2e-16 (***)]

    [Regime 2 coefficient table: (Intercept)(S) not significant, logdj1(S) significant with a p-value < 2e-16 (***), logdj2(S) significant at the 5% level (*)]

    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    [Residual standard errors, Multiple R-squared values and standardized-residual quantiles as reported by R for each regime, followed by the transition probabilities over Regimes 1 and 2]

From the results we can see that the fitted Markov Switching model captures the volatility of the data quite well, since R² is close to 1 for both regimes. More specifically, for the low volatility regime (Regime 1) the drift term is statistically significant at the 1% level, and the other two parameters are statistically significant at any level of α, since their p-values are < 2.2e-16. For the high volatility regime (Regime 2) the coefficient of dj_{t-1} is statistically significant at every level of α, since its p-value is < 2e-16, and the coefficient of dj_{t-2} is statistically significant at the 5% level.
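The persistence of each regime can be quantified by its expected duration, 1/(1 − p_ii); a sketch, assuming the fitted object msm.dj stores the transition matrix in the transMat slot (an MSwM convention):

    # Sketch: expected regime durations from the estimated transition matrix
    P <- msm.dj@transMat                      # 2 x 2 matrix of transition probabilities
    expected.duration <- 1 / (1 - diag(P))    # expected number of weeks spent in each regime
    print(expected.duration)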

We create the Q-Q plot to check the pooled residuals of the complete Markov Switching model. The residuals appear to be white noise 33 and they fit the Normal distribution, since there is a linear relation between the sample and the theoretical quantiles and the line is very close to y = x.

Fig. 4.8: Normal Q-Q plot of pooled residuals for the MSM-AR

33 See Appendix (B.2) for the ACF-PACF plots of the residuals and the squared residuals of both regimes, and (B.3) for the Q-Q plots of both regimes.
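Plots of this kind can be produced along the following lines; the residual slot Fit@error and the plotDiag helper are MSwM conventions, assumed here to apply to this fit.

    # Sketch: Q-Q plot of pooled residuals and per-regime diagnostics
    res <- msm.dj@Fit@error   # pooled residuals of the fitted MSM (assumed slot)
    qqnorm(res, main = "Normal Q-Q plot of pooled residuals")
    qqline(res)
    plotDiag(msm.dj, which = 2)   # Q-Q plots by regime, cf. Appendix (B.3)
    plotDiag(msm.dj, which = 3)   # ACF/PACF of residuals and squared residuals, cf. (B.2)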

Fig. 4.9: Plot of the smoothed probabilities

In figure (4.9) we can see the comparison of the smoothed probabilities between the two regimes (low volatility, high volatility). Then we present, separately for each regime, the plot of the dependent variable versus the smoothed probabilities.

Fig. 4.10: Dependent variable vs. smoothed probabilities for low volatility regime
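These figures correspond to MSwM's probability plots; a sketch, again assuming msm.dj is the fitted object:

    # Sketch: smoothed-probability plots for the DJIA fit
    plotProb(msm.dj, which = 1)   # smoothed probabilities of both regimes, cf. figure (4.9)
    plotProb(msm.dj, which = 2)   # dependent variable vs. smoothed probabilities for regime 1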
