Multivariate GARCH models: Inference and Evaluation


Alessandro Palandri

February 9, 2004

Abstract

This paper presents a preliminary evaluation, on exchange rate data, of the main multivariate GARCH models in the literature. It shows how the Dynamic Conditional Correlation model of Engle (2001) outperforms the other models and proceeds to use it in the estimation of US bond data. Results with respect to the latter are mixed: the model proved extremely hard to estimate, yet the non-converged estimates seemed to allow the model to fit the data quite accurately. I also suggest how to carry out the estimation of a large scale GARCH: through a modified version of the DCC and by proposing the H-GARCH and the very fast estimation methodology associated with it. Empirical results for the H-GARCH suggest that smoothed realizations are a better proxy for the realization of the variance-covariance matrix than the simple outer product of the residuals.

Keywords: multivariate GARCH, large scale GARCH, BEKK, DCC, R-GARCH, H-GARCH, conditional correlations, climber.

JEL classification: G12, C51, C52.

Department of Economics, Duke University, Box 90097, Durham NC 27708. E-mail: palandri@econ.duke.edu.

1 Introduction

Modeling the temporal dependence in the second order moments and forecasting future volatility are of key relevance in many financial-econometric issues such as risk evaluation and derivatives pricing. Because it is relatively simple to estimate and because it seems to track volatility in a satisfactory way, the GARCH(1,1) model has become the most popular specification among theorists and practitioners. This is true for GARCH(1,1) models of all flavors: GARCH, AVGARCH, NARCH, EGARCH, ZARCH, GJR-GARCH, APARCH, AGARCH, and NAGARCH.

However, in the recent past there has been some debate (Cumby et al. (1993), Figlewski (1997), and Jorion (1995, 1996)) on whether simple averages of historical volatility can predict future volatility better than GARCH models. In response, Andersen and Bollerslev (1998) proposed integrated volatility as a significantly more accurate measure than simple squared or absolute realized returns, and through it showed that volatility models provide strikingly accurate volatility forecasts.

With the debate on volatility models settled, attention turned to the fact that volatilities of assets and markets tend to move together over time. This widely accepted feature of the data has led to the multivariate modeling of volatilities. In recent years the number of different multivariate GARCH, henceforth MGARCH, models has grown steadily. With each new model the authors sought to overcome one, and possibly more, of the undesirable features of the models present in the literature at the time. However, the trade-off faced by every model is the one between parsimony and richness in the description of the second order dynamics. In fact, an M-variate model will comprise M(M+1)/2 equations, each of which contains a set of parameters describing the evolution of the dependent variable. The number of parameters of a fairly rich volatility model soon becomes large enough, in some cases even for M < 6, to make estimation infeasible.

This paper has three main goals. The first is to evaluate different MGARCH models for Maximum Likelihood estimation (MLE or QMLE) of fully specified densities, by measuring how well they track volatilities and correlations, and possibly to identify the best model for a given class of data-sets. Preliminary results on exchange rates suggest that the benchmark model is the DCC of Engle (2001) with no interactions among the volatilities and correlations. The second goal is to push forward the limit of the dimensions that can be handled by an MGARCH model without degrading its forecasting power. This will be done following the mainstream approach
in the literature of a multi-step estimation procedure. In this regard, I will propose a very fast and easy-to-estimate VEC-like MGARCH and suggest the path along which the DCC should evolve in order to handle bigger sets of data. The third goal of this paper is to estimate multivariate stochastic volatility models using Indirect Estimation procedures based on MGARCH(1,1) auxiliary models.

2 Multivariate GARCH models for fully specified densities

For those applications where the main objective of the econometrician is to make inference on the model's parameter values, it is of primary importance to be able to compute the correct asymptotic distribution of such parameters. This requires the estimation of the various components of the model to be carried out in one step. In this setting it means that the VARMA, the volatilities, and the correlations (or the covariances, depending on the model) cannot be estimated separately. Unfortunately, the maximization of such a likelihood function seems to be feasible only for a relatively small number of series M. Estimation for M = 3 does not seem to present major problems for most data-sets, while M = 6 seems to be the limit for the most well behaved series of data. Therefore, given the state of the art, if the main interest is estimating a VARMA-MGARCH model with correct asymptotics then the number of series M will necessarily be quite small.

While the residuals of a VARMA-MGARCH model are non-Gaussian for most financial time series data, if the only concern is to consistently estimate the model's parameters and their standard errors, the assumption of Gaussianity will still produce the desired results. The Quasi-Maximum-Likelihood estimates (QMLE) will be consistent and the associated asymptotic variance-covariance matrix I^{-1} J I^{-1} will guarantee the consistency of the standard errors.

When the correct specification of the error density function is also at stake, the latter can no longer be assumed to be Gaussian. In general the density function employed to carry out MLE is required to match higher order moments of the data, particularly skewness and kurtosis. To match higher order moments and possibly obtain more stable estimates and forecasts, Bollerslev (1987) proposed the use of a t-distribution with estimated degrees of freedom, and Nelson (1991) suggested a generalized error density, a function of the gamma distribution, with a parameter controlling for tail thickness. Other approaches to the problem are Pearson fitting, through which it is possible to construct a density function that is consistent with the first four moments of the data, and the Semi-Non-Parametric (SNP) approximation of the density of Gallant and Tauchen (1989), based on an expansion in Hermite functions. While these densities provide a better fit to the data in terms of higher order moments, allow the simulation of paths that are more similar to those of the observations, and theoretically produce tighter confidence intervals, their use should not
be taken lightly. In fact, unless the approximated density is the true density, in which case we have MLE, the parameter estimates will be QML. In the latter case the necessary and sufficient condition for the consistency of the location and scale parameters is that the approximated density belongs to the quadratic exponential family (Gourieroux and Monfort (1984)):

    l(y_t; \mu_t, H_t) = \exp\{ A(\mu_t, H_t) + B(y_t) + C(\mu_t, H_t) y_t + y_t' D(\mu_t, H_t) y_t \}    (1)

where \mu_t = \mu(I_{t-1}; \theta) and H_t = H(I_{t-1}; \theta). In this framework I will estimate a variety of MGARCH models on a wide collection of data-sets and try to establish which model is best at describing the evolution of the conditional variance-covariance matrix according to a suitable measure.

2.1 Competing models for fully specified densities

In general, for an M-dimensional vector stochastic process y_t, the location can be described by:

    y_t = \mu_t(\theta) + \varepsilon_t    (2)

where \mu_t(\theta) is a function of the vector of parameters \theta describing the evolution of the mean vector conditional on I_{t-1}. The scale will be described as:

    \varepsilon_t = H_t^{1/2}(\theta) \eta_t    (3)

where H_t^{1/2}(\theta) is a positive definite matrix and \eta_t satisfies the following moment conditions:

    E(\eta_t) = 0    (4)
    Var(\eta_t) = I_M    (5)

H_t^{1/2} is such that H_t is the conditional variance-covariance matrix of the process y_t, while its unconditional variance-covariance matrix is given by \Sigma = E(H_t). The way in which the functions \mu_t(\theta) and H_t(\theta) are specified depends on the model we are interested in. Since a discussion of the most appropriate model for the location is beyond the scope of this paper, I will follow the mainstream literature and set \mu_t(\theta) to be a VARMA. The conditional variance-covariance matrix H_t will be modeled as an MGARCH, of which I will describe and evaluate the different specifications.
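To make the scale equation (3) concrete, here is a minimal Python sketch (an illustration added for this write-up, not part of the original estimation code) that draws one innovation \varepsilon_t = H_t^{1/2} \eta_t for an assumed conditional variance-covariance matrix H_t, using the Cholesky factor as one admissible square root.

```python
import numpy as np

def draw_innovation(H_t, rng):
    """Draw eps_t = H_t^{1/2} eta_t with eta_t ~ N(0, I_M), as in eqs. (3)-(5)."""
    M = H_t.shape[0]
    H_half = np.linalg.cholesky(H_t)      # one admissible choice of H_t^{1/2}
    eta = rng.standard_normal(M)          # E(eta_t) = 0, Var(eta_t) = I_M
    return H_half @ eta

rng = np.random.default_rng(0)
H_t = np.array([[1.0, 0.3],
                [0.3, 2.0]])              # hypothetical positive definite H_t (M = 2)
print(draw_innovation(H_t, rng))
```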

2.2 The VEC(p,q)

The VEC specification of Bollerslev, Engle, and Wooldridge (1988) is fairly general, as it allows each element of the conditional variance-covariance matrix to be a function of every element of the lagged conditional variance-covariance matrices and of the outer products of lagged realizations. For a VEC(p,q) we have:

    h_t = K + \sum_{i=1}^{p} \Delta_i h_{t-i} + \sum_{j=1}^{q} \Theta_j \epsilon_{t-j}    (6)
    h_t = vech(H_t)    (7)
    \epsilon_t = vech(\varepsilon_t \varepsilon_t')    (8)

The vech(·) operator stacks the lower triangular part of an M × M matrix into an M(M+1)/2 × 1 vector. K is an M(M+1)/2 × 1 coefficient vector, while \Delta_i and \Theta_j are square coefficient matrices of dimension M(M+1)/2. This specification, while rather flexible, has two major drawbacks. The first is that it is difficult to guarantee the positivity of H_t without imposing strong parameter restrictions. In other words, in its unrestricted version the VEC(p,q) allows interactions among all elements of the conditional variance-covariance matrix but does not guarantee that the latter will be positive definite. The second major drawback of this model is the rate at which the number of parameters grows with the dimensionality of the model: it is of order M^4. Since fully specified densities require the simultaneous estimation of all the likelihood's parameters, the VEC model will not be included in this study, as a simple 3-variate VAR(1)+VEC(1,1) would require the estimation of 90 parameters. Furthermore, these models do not seem to perform better than more parsimonious MGARCH(1,1) specifications, probably because allowing every element of h_t to be a function of each element of h_{t-1} and vech(\varepsilon_{t-1} \varepsilon_{t-1}') makes the parameter estimates very noisy and hence generates very noisy forecasts.

2.3 The R-GARCH(p,q)

This specification, due to Gallant and Tauchen (2002), solves the VEC's problem of the positive definiteness of H_t by modeling its Cholesky decomposition:

    H_t = R_t R_t'    (9)
    vech(R_t) = K + \sum_{i=1}^{p} \Delta_i vech(R_{t-i}) + \sum_{j=1}^{q} \Theta_j |\varepsilon_{t-j}|    (10)
where K is a vector of dimension M(M+1)/2, \Delta_i is a square matrix of dimension M(M+1)/2, and \Theta_j is a rectangular matrix of dimension M(M+1)/2 × M. This specification achieves the positive definiteness of H_t without sacrificing the richness of the VEC's dynamics. A problem with the R-GARCH is that, because the residuals \varepsilon_{t-j} enter through an absolute value, the covariances lose information on the sign of the realizations. Even though this problem could be overcome by adding lagged residuals without taking their absolute value, the rich dynamics it shares with the VEC make it also share the VEC's problem of dimensionality: the number of parameters to be estimated is of order M^4.

2.4 The BEKK(p,q)

This parametrization of H_t, proposed by Engle and Kroner (1995), can easily guarantee its positivity and significantly reduces the number of parameters to be estimated. The BEKK(p,q) is defined as:

    H_t = K K' + \sum_{i=1}^{p} \Delta_i H_{t-i} \Delta_i' + \sum_{j=1}^{q} \Theta_j \varepsilon_{t-j} \varepsilon_{t-j}' \Theta_j'    (11)

where K, \Delta_i, and \Theta_j are M × M matrices and K is triangular for identification purposes. (A short code sketch of this recursion is given below, just before the factor model derivation.) For this MGARCH specification the number of parameters to be estimated is of order M^2. To obtain this result it gives up the flexibility of the VEC in modeling the single components of H_t. For instance, in a BEKK(1,1) the parameters governing the dynamics of the covariances are the products of the corresponding parameters of the two variance equations (the subscripts l and k denote the corresponding rows of K, \Delta, and \Theta):

    H_{ll,t} = K_l K_l' + \Delta_l H_{t-1} \Delta_l' + \Theta_l \varepsilon_{t-1} \varepsilon_{t-1}' \Theta_l'    (12)
    H_{kk,t} = K_k K_k' + \Delta_k H_{t-1} \Delta_k' + \Theta_k \varepsilon_{t-1} \varepsilon_{t-1}' \Theta_k'    (13)
    H_{lk,t} = K_l K_k' + \Delta_l H_{t-1} \Delta_k' + \Theta_l \varepsilon_{t-1} \varepsilon_{t-1}' \Theta_k'    (14)

2.5 The F-GARCH(p,q,r)

To reduce the number of parameters involved in the estimation of VEC and even BEKK models, Diebold and Nerlove (1986) and Engle, Ng, and Rothschild (1990) propose a parametrization of H_t that imposes a common dynamic structure on all its elements. The idea is that co-movements of the data are driven by a small number of common underlying factors.
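Since the derivation that follows treats the F-GARCH as a restricted BEKK, it may help to see the unrestricted BEKK(1,1) recursion (11) spelled out in code. The Python sketch below is illustrative only; the parameter values are assumptions chosen for the example, not estimates from the paper, and the final line simply checks that every filtered H_t is positive definite by construction.

```python
import numpy as np

def bekk11_filter(eps, K, Delta, Theta, H0):
    """Iterate H_t = K K' + Delta H_{t-1} Delta' + Theta eps_{t-1} eps_{t-1}' Theta' (eq. 11)."""
    T, M = eps.shape
    H = np.empty((T, M, M))
    H_prev = H0
    for t in range(T):
        e = eps[t - 1] if t > 0 else np.zeros(M)
        H[t] = K @ K.T + Delta @ H_prev @ Delta.T + Theta @ np.outer(e, e) @ Theta.T
        H_prev = H[t]
    return H

# Hypothetical parameter values for a bivariate example.
K = np.array([[0.3, 0.0],
              [0.1, 0.2]])                 # lower triangular, as required for identification
Delta, Theta = 0.95 * np.eye(2), 0.25 * np.eye(2)
eps = np.random.default_rng(1).standard_normal((500, 2))
H = bekk11_filter(eps, K, Delta, Theta, H0=np.eye(2))
print(all(np.all(np.linalg.eigvalsh(Ht) > 0) for Ht in H))   # True: H_t stays positive definite
```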

Following Bauwens, Laurent, and Rombouts (2003) and Lin (1992), I will start by showing how the F-GARCH can be seen as a particular case of a BEKK:

    H_t = K K' + \Delta H_{t-1} \Delta' + \Theta \varepsilon_{t-1} \varepsilon_{t-1}' \Theta'    (15)

If the coefficient matrices \Delta and \Theta have rank one and are proportional to each other (have the same left and right eigenvectors but different eigenvalues) then:

    \Delta = C \Lambda_\delta C^{-1} = \lambda_\delta C_1 C^1    (16)
    \Theta = C \Lambda_\theta C^{-1} = \lambda_\theta C_1 C^1    (17)

where C_1 (the first column of C) and C^1 (the first row of C^{-1}) satisfy the following restrictions:

    C^1 C_1 = 1    (18)
    \sum_{m=1}^{M} C_{m1} = 1    (19)

Substituting into the BEKK equation yields the one-factor GARCH:

    H_t = K K' + C_1 C_1' \left( \lambda_\delta^2 C^1 H_{t-1} C^{1\prime} + \lambda_\theta^2 C^1 \varepsilon_{t-1} \varepsilon_{t-1}' C^{1\prime} \right)    (20)

Notice that the time-varying component has rank 1 but H_t is still full rank because of K K'. In the F-GARCH terminology, C_1 is referred to as the factor loading and f_t \equiv C^1 \varepsilon_t as the factor. The factor f_t is obtained by extracting the information in \varepsilon_t through the vector C^1. Pre-multiplying H_t by C^1 and post-multiplying by C^{1\prime} yields:

    h_t = \sigma + \lambda_\delta^2 h_{t-1} + \lambda_\theta^2 f_{t-1}^2    (21)

where h_t = C^1 H_t C^{1\prime} and \sigma = C^1 K K' C^{1\prime}. As can be seen, h_t follows a univariate GARCH(1,1) process. Substituting h_t into the expression for H_t we easily get:

    H_t = \Omega + C_1 C_1' h_t    (22)

where \Omega = K K' - C_1 C_1' \sigma. This equation clearly shows how all the elements of H_t share the same dynamics. In particular, they are all linear functions of the GARCH(1,1) process h_t. Therefore, if H_t is an F-GARCH(1,1,1) then there exists a linear combination of all its elements that follows a univariate GARCH(1,1). For a general F-GARCH(p,q,r) we have:

    H_t = K K' + \sum_{l=1}^{r} C_l C_l' \left( \sum_{i=1}^{p} \lambda_{\delta i}^2 C^l H_{t-i} C^{l\prime} + \sum_{j=1}^{q} \lambda_{\theta j}^2 C^l \varepsilon_{t-j} \varepsilon_{t-j}' C^{l\prime} \right)    (23)
In this case the time-varying component of H_t has rank r but the whole matrix is still full rank because of K K'. An interesting implication of this setup is that, for models with more than one factor, any pair of factors has time-invariant (constant) conditional covariance:

    E_{t-1}(f_{at} f_{bt}) = C^a K K' C^{b\prime}    (24)

A particular type of F-GARCH is the O-GARCH (orthogonal GARCH) of Alexander and Chibumba (1997) and Alexander (2001). A critical drawback of this model is that it implies a reduced rank conditional variance-covariance matrix, which in turn is a major problem in many applications.

2.6 Conditional Correlations Models

Somewhat conceptually different from the BEKK and F-GARCH models are the Conditional Correlations models. Models that do not belong to this class provide dynamics for variances and covariances, whereas the Conditional Correlations models provide dynamics for the variances and the correlations. This, however, is not the only difference between the two philosophies: the C.C. models usually constrain every conditional variance h_{i,t} to be a function of its own lags h_{i,t-j} and realizations \varepsilon_{i,t-j}^2. This feature keeps the number of parameters of these models in line with those of a BEKK or an F-GARCH: fewer parameters for the variances are compensated by the extra parameters for the dynamics of the correlations.

2.6.1 The CCC

Bollerslev (1990) introduced a new class of MGARCH models (Constant Conditional Correlations or CCC) for which variances and covariances can be modeled separately. Each of the M variances can be modeled with a univariate GARCH model allowing for different specifications and lags. The whole conditional variance-covariance matrix is then constructed from these univariate processes and a correlation matrix. In the CCC the conditional correlations are constant and thus the conditional covariances are proportional to the product of the corresponding standard deviations:

    H_t = D_t R D_t    (25)

where D_t is a diagonal matrix whose squared elements follow univariate GARCH(p,q) processes, and R is the constant correlation matrix with R_{ii} = 1 for all i. In general the univariate GARCHes do not need to have the same number of lags
and can follow different specifications: GARCH (Bollerslev (1986)), AVGARCH (Taylor (1986)), NARCH (Higgins and Bera (1992)), EGARCH (Nelson (1991)), ZARCH (Zakoian (1994)), GJR-GARCH (Glosten, Jagannathan, and Runkle (1993)), APARCH (Ding, Engle, and Granger (1993)), AGARCH (Engle (1990)), and NAGARCH (Engle and Ng (1993)).

The VEC and the BEKK allow every element of H_t to be a function of each element of H_{t-i} and \varepsilon_{t-j} \varepsilon_{t-j}'. The CCC, on the other hand, constrains the dynamics of the conditional variances to be functions only of their own lags and lagged realizations. Empirically this does not seem to be a problem. On the contrary, this parsimonious parametrization of the variances leads to less noisy parameter estimates and better forecasts. The only critique that has been raised against this approach is that the assumption of constant conditional correlations may be unrealistic in many empirical applications. As pointed out in Cappiello, Engle, and Sheppard (2003): "...a number of studies documents that correlation between equity returns increases during bear markets and decreases when stock exchanges rally, indicating that correlation is dynamic and varies over time."

2.6.2 The DCC_E

Engle (2001) proposed a generalization of the CCC model by providing dynamics for the conditional correlation matrix. Relaxing the constraint that correlations are constant creates the problem of how to impose positivity on the conditional correlation matrix. The solution of Engle (2001) and Engle and Sheppard (2002) is to model them as a BEKK. The conditional variance-covariance matrix is defined as:

    H_t = D_t R_t D_t    (26)

where D_t is a diagonal matrix and the square of each element follows a univariate GARCH(p,q) process. The correlation matrix Q_t obeys the following dynamics:

    Q_t = ( \bar{Q} - \Delta \bar{Q} \Delta' - \Theta \bar{Q} \Theta' ) + \Delta Q_{t-1} \Delta' + \Theta \eta_{t-1} \eta_{t-1}' \Theta'    (27)

where \eta_{i,t} = \varepsilon_{i,t} / \sqrt{h_{i,t}} and \bar{Q} = T^{-1} \sum_{t=1}^{T} \eta_t \eta_t'. In Engle (2001) the matrices \Delta and \Theta are respectively equal to \delta^2 I and \theta^2 I. In Sheppard (2002) \Delta and \Theta are diagonal matrices. If the coefficient matrices were full, then the equations describing the correlation dynamics of the DCC would have as many parameters as a full BEKK. Therefore it seems natural, in the spirit of
parameter parsimony, to concentrate on the specification of Sheppard (2002) which, w.r.t. Engle (2001), allows for asset-specific news impact parameters. While using a BEKK for the conditional correlations guarantees that Q_t is positive definite, it does not constrain it to be a correlation matrix. Therefore, rather than directly using Q_t to compute the conditional variance-covariance matrix, the following transformation must first be made:

    R_t = diag(Q_t)^{-1/2} Q_t diag(Q_t)^{-1/2}    (28)

R_t is a correlation matrix with ones on the main diagonal and every other element less than one in absolute value. Its positive definiteness is guaranteed by the positive definiteness of Q_t. The conditional variance-covariance matrix will then be given by:

    H_t = D_t R_t D_t    (29)

With correlation targeting and a diagonal BEKK(1,1), the DCC_E has only 2M parameters to be estimated for the correlation dynamics. Even when richly parametrized GARCHes(1,1) are used to model the variances, such as the EGARCH(1,1) with asymmetries and level effects, the number of parameters per variance equation is no more than 5. This means that at most a DCC will have 7M parameters to be estimated. The difference between the number of parameters of a DCC and that of a VEC (order M^4) or a BEKK (order M^2) is striking!

2.6.3 The DCC_T

The model of Tse and Tsui (2002) differs from the DCC_E in the way it models the conditional correlations and the way it uses the information contained in the lagged outer products of the residuals. The conditional variance-covariance matrix is defined as:

    H_t = D_t Q_t D_t    (30)

where D_t is a diagonal matrix and the square of each element follows a univariate GARCH(p,q) process. Q_t is the correlation matrix, which obeys the following dynamics:

    Q_t = (1 - \theta_1 - \theta_2) \bar{Q} + \theta_1 Q_{t-1} + \theta_2 \Psi_{t-1}    (31)

where \theta_1 and \theta_2 are non-negative scalars satisfying \theta_1 + \theta_2 < 1, \bar{Q} is a symmetric M × M positive definite correlation matrix, and \Psi_{t-1} is the lag-S sample correlation
matrix given by:

    \Psi_{ij,t-1} = \frac{ \sum_{s=1}^{S} \eta_{i,t-s} \eta_{j,t-s} }{ \sqrt{ \left( \sum_{s=1}^{S} \eta_{i,t-s}^2 \right) \left( \sum_{s=1}^{S} \eta_{j,t-s}^2 \right) } }    (32)

where \eta_{i,t} = \varepsilon_{i,t} / \sqrt{h_{ii,t}}. A necessary condition for the positivity of \Psi_{t-1}, and consequently that of Q_t, is that S \geq M. If \bar{Q}, Q_{t-1}, and \Psi_{t-1} are correlation matrices then Q_t will also be a correlation matrix. This formulation allows for flexible variance modeling by allowing any univariate GARCH structure to govern the dynamics of these components. These models provide the conditional variances h_{ii,t-1}, which are then used to standardize the lagged residuals \varepsilon_{i,t-1} and from there compute the lagged sample correlation matrix \Psi_{t-1}. A drawback of this setting is that \theta_1 and \theta_2 must be scalars for Q_t to be positive definite and a correlation matrix. Hence, all the conditional correlations will obey the same dynamics.

2.7 Evaluation

A natural way to evaluate different MGARCH specifications would be to compare the model's predictions with the realizations. Unfortunately, a direct in-sample measurement presents some difficulties. First and foremost, the realization of the conditional variance-covariance matrix is not observable. The outer product \varepsilon_t \varepsilon_t' = H_t^{1/2} \eta_t \eta_t' H_t^{1/2\prime}, while an unbiased estimator of the quantity of interest, will generally yield very noisy measurements because of the idiosyncratic term \eta_t \eta_t'. Andersen and Bollerslev (1998) propose the use of integrated volatility, which consists of cumulative cross-products of intra-day residuals over the forecast horizon, as a better measure of the realized conditional variance-covariance matrix. It is also worth noticing, particularly in the Conditional Correlation framework, that \varepsilon_t \varepsilon_t' will always imply that the realized correlations are equal to one! The integrated volatility approach will also solve this problem, making it possible to meaningfully compare predictions and realizations of models such as the DCC.

In a setting where high frequency data is not available, a less precise but still useful proxy of the integrated volatility can be employed, as in Ledoit, Santa-Clara, and Wolf (2002). For example, if the model is estimated on daily data it will then be used to produce forecasts at horizons of 1, 2, and 4 weeks. This is done by conditioning at time t = τ and making forecasts up to τ+5, τ+10, and τ+20 respectively. Cumulating the results will yield the forecast volatilities for the next periods of 5, 10,
and 20 days. These forecasts will then be compared to the cumulative cross-products of the residuals. Such comparison will involve two measures of the deviation from the realized standard deviations and correlations: the mean absolute deviation (MAD) and the mean square error (MSE). To get an idea not only of how the MGARCH models perform w.r.t. one another, but also of how they perform in general, they will be compared to a Random Walk. The meaning of the latter, in this setting, is that the cumulative volatilities and correlations of the next 5, 10, and 20 days will be predicted by the realizations of the last 5, 10, and 20 days respectively.

2.8 Optimization of the likelihood function

Maximum likelihood estimation requires the optimization of the likelihood function w.r.t. the parameters of interest. Estimation of MGARCH models is non-trivial because of the shape of their likelihood, which makes life hard for the optimizer. An MGARCH software survey by Brooks, Burke, and Persand (2002) shed some light on optimization problems. In their paper, they estimate a bivariate diagonal-VEC(1,1) on 3580 daily observations on the FTSE 100 stock index and stock index futures contract from 1/1/85 to 4/9/99. To estimate their model they selected 4 packages that contain pre-programmed routines for MGARCH estimation: GAUSS-FANPAC, RATS, SAS, and S-Plus FinMetrics. Using gradient methods based on numerical derivatives (BFGS and BHHH), each package reported convergence. Unfortunately the parameter estimates show that no convergence was actually achieved: for identical models and data-set each package produced a different set of parameter estimates. While on one hand this can be due to the convergence tolerance being set very loose, it also indicates that all but at most one of the software producers felt it infeasible to set it tighter. The standard error estimates also present significant differences, and so do the t-ratios, which can reach differences of 9,900%. However, when the different estimates were used to compute the standard deviations of portfolio returns these turned out to be almost identical. This result would suggest that the DVEC is not identified: different parameter values may yield the same likelihood value, or alternatively that the top of the likelihood is flat. However, because the DVEC's performance is the same as that of an OLS hedge, unless we accept the hypothesis that hedge ratios are time-invariant, the only reasonable conclusion seems to be that none of the packages managed to optimize the likelihood function.

While the paper of Brooks, Burke, and Persand (2002) provided more than an
overview of statistical packages: it indirectly provided an overview of gradient-based optimization algorithms. Connecting this finding with the quite common opinion among researchers and practitioners that MGARCH models are difficult to estimate, it seemed natural to study the optimization problem related to them. Since a recurring issue in the estimation of GARCH models with gradient methods is that of initial conditions and possible local maxima, I moved toward optimization schemes that do not use derivatives. Among these, the two that seemed most promising were the MCMC and Simulated-Annealing algorithms. While both proved to be powerful and indispensable tools in their own fields, when applied to the estimation of MGARCH models they needed to be re-thought and adjusted.

The objective of an MGARCH estimation is to obtain a set of parameters for which the value of the likelihood is maximized, in as little time as possible. The MCMC algorithm allows sampling from the posterior distribution and, even though in probability it will stay closer to the mode than to any other point, there is no clear direction in which it moves. Moreover, because it needs to sample from the whole distribution, sooner or later the MCMC chain will hit any point on the support with probability one. These features are not very appealing when the main objective of the optimizer should be to reach the maximum of the function as quickly as possible. The Simulated-Annealing algorithm, on the other hand, searches the likelihood but will ultimately converge. As in the MCMC, a new state of the system is chosen by randomly displacing one parameter and computing the new value of the likelihood. If the likelihood has increased the system remains in the new state; otherwise the probability of remaining in the new state is given by a function of the Annealing Temperature. The latter is chosen to be high at initialization, to allow the algorithm to perform random searches, and will gradually decrease to zero. The tuning of the Annealing Temperature is crucial: a system that cools too slowly will never converge but will keep wandering around the support, while a system that cools too quickly will freeze without reaching the maximum.

In an MGARCH setting, where the likelihood function needs to be optimized along no fewer than 50 dimensions, both MCMC and Simulated-Annealing do not seem appropriate because of their tendency to get distracted and start wandering around instead of maximizing the likelihood. To correct this behavior, which is inefficient in this setting, both algorithms can be forced to accept the new state of the system only if the associated level of the likelihood has not decreased w.r.t. the previous state. In
this way the optimizer will be nothing else but a climber:

    l(\theta^{(n+1)}) \geq l(\theta^{(n)}) \;\Rightarrow\; \text{accept } \theta^{(n+1)}    (33)
    l(\theta^{(n+1)}) < l(\theta^{(n)}) \;\Rightarrow\; \text{reject } \theta^{(n+1)}    (34)

The advantage of this method w.r.t. MCMC and Simulated-Annealing is that it is faster at finding the mode, since through the iterations the value of the likelihood is non-decreasing. The fact that the search algorithm always moves in the direction of higher values of the likelihood binds it to converge to the first maximum it encounters. To make sure that the climber has converged to a global rather than a local maximum, standard counter-measures can be employed, i.e. random perturbations of the whole vector of parameters. Guidance on how a new state should be proposed can be found in the MCMC as well as the Simulated-Annealing literature. In both, it seems that the best results are obtained when the proposal is a random walk. Therefore the climber will propose accordingly:

    \theta_i^{(n+1)} = \theta_i^{(n)} + \delta_i z_i^{(n+1)}    (35)

where \delta_i is the step size and z_i^{(n+1)} is sampled from a standardized, generally symmetric, distribution. The distribution from which the random perturbations are sampled does play a role in the efficiency of the Simulated-Annealing algorithm: Ingber (1989) shows how a Fast-Annealing is obtained by sampling from a Cauchy rather than a Gaussian distribution. In that setting, however, the efficiency of the algorithm is also a measure of the portion of the surface of the likelihood it can explore in the smallest amount of time. Since the climber will not explore, but will try to reach the highest point from the current position, sampling from a Gaussian distribution seems a fair choice. The parameter \delta_i will be relatively large at the start of the algorithm and go to zero as convergence is achieved. This is necessary to avoid steps that are too big and that would persistently make the climber fall off the peak of the likelihood.

2.9 Empirical results

2.9.1 Evaluation of MGARCH(1,1) models

A preliminary and incomplete comparison of the main MGARCH(1,1) models in the literature has been conducted for a BEKK(1,1) with Gaussian and t densities,
the R-GARCH, a diagonal DCC_E(1,1), and the H-GARCH(1,1). For all models the location was modeled as a VAR(1). The data-set consists of three exchange rate time series from DATASTREAM: Japanese Yen/US Dollar, British Pound/US Dollar, and Swiss Franc/US Dollar. The sample size is 4606 daily observations from 1/3/1986 to 8/29/2003.

The full BEKK(1,1) and the VAR(1) have been estimated jointly with a Gaussian density and with a t distribution whose 3 degrees-of-freedom parameters were treated as parameters of the likelihood. The VAR(1) + Gaussian BEKK(1,1) had a total of 36 parameters while the VAR(1) + Student-t BEKK(1,1) had 39. Needless to say, the latter model was a bit slower to converge than the first: not only because of the extra parameters that had to be estimated but mostly because of their nature.

The VAR(1) + R-GARCH has been estimated using the SNP package of Gallant and Tauchen (2002). Following the authors' proposed estimation strategy, the first model fit to the data was a Gaussian VAR(1), followed by an unrestricted ARCH, and finally an unrestricted GARCH (in SNP language), for a total of 72 parameters. The NPSOL optimizer had problems in maximizing the likelihood function: it often did not move the parameter values. Convergence was achieved through the extensive use of random perturbations whenever the optimizer either did not make any move or diverged.

The VAR(1) + DCC_E(1,1) was estimated using a Gaussian error density and adopting a specification as close as possible to that of the BEKK(1,1). At this stage it seemed important to compare the different modeling philosophies, and therefore the flexibility and potential of the DCC_E(1,1) were not fully exploited. The univariate GARCHes were chosen to be those of Bollerslev (1986) because they are most similar to the way that variances are modeled by the BEKK. An attempt that was not made at this stage, however, was to estimate the DCC in one step. Therefore, following Engle (2001), the whole model was estimated in 3 steps: first the VAR(1), then the univariate GARCH(1,1) processes, and last the conditional correlation matrix. The latter was modeled as a diagonal BEKK(1,1) with targeted unconditional correlation matrix. Strictly speaking, the multi-step estimation of the DCC_E(1,1) should not be part of this analysis, as it cannot deliver correct standard errors for the parameters. The comparison with the BEKK should be taken as a mere indication of the potential of this model in its most congenial setting: as will be discussed in the next section, the DCC_E(1,1) turns out not to be easy to estimate in one step.

The log-likelihoods of these models have been maximized using the climber which,
in this setting, proved to be quite efficient. In fact, convergence was not sensitive either to starting values (both random and deterministic) or to random perturbations of the parameters in between the optimizer's iterations. However, what did play an important role in the estimation procedure were the starting values for the conditional variances and correlations. In general, setting H_1 = \varepsilon_1 \varepsilon_1' turned out to be a bad choice: the inevitably noisy starting point seemed to induce a bias in the parameter estimates due to the need for fast mean reversion. While the best choice seemed to be the estimation of the initial conditions, other alternatives produced similar results and did not require extra parameters. The first alternative is to follow the indications of the univariate GARCH literature and set the initial conditions equal to the sample unconditional variances and correlations. The second is to invert the variance targeting equation and express the unconditional variances and correlations as functions of other parameters of the model. The latter seemed to produce the second best results: it did not increase the number of parameters and it helped stabilize the estimates of the other parameters of the model. Still, estimating the initial conditions rather than setting them equal to the unconditional mean seems the best choice, whenever feasible.

The H-GARCH(1,1) model clearly does not belong here. It is a model for large scale estimation, not suitable to be estimated along with a fully specified density. In such a case it would just be a VEC(1,1) parametrization that guarantees positivity of the conditional variance-covariance matrix, with all the drawbacks of VEC modeling. It has been included in this comparison simply to get a feeling of its performance w.r.t. the other MGARCHes. An alternative specification of the H-GARCH(1,1), based on smoothed realized variance-covariance matrices, was also estimated.

The preliminary results of Table 1 clearly indicate that the Gaussian DCC_E(1,1) outperforms all the other models and tracks variances and covariances better than the Random Walk. The g-BEKK(1,1) performs uniformly better than the t-BEKK(1,1) w.r.t. the MSE and still better than the R.W. The t-BEKK(1,1), on the other hand, has smaller MAD 5-days and MAD 10-days than the g-BEKK(1,1). However, its MAD 20-days is not only higher than that of the g-BEKK(1,1) but also higher than that of the R.W. The feeling is that the t-BEKK(1,1) is overlooking some information in the data because it is interpreting it as draws from the tails: the estimated parameters determine very smooth dynamics of the conditional variance-covariance matrix that are not able to explain most of the movements in the data.

The H-GARCH performs very poorly: it is the MGARCH with the largest MSE and MAD, and it performs slightly better than the R.W. only for the 5- and 10-day measures. The smoothed H-GARCH, however, performs better than the two BEKKs w.r.t. the MAD and better than the t-BEKK(1,1) w.r.t. the MSE. The big improvement of the smoothed H-GARCH is due to the fact that smoothing the residuals, before modeling their conditional second moments, greatly reduces the noise and hence improves the point estimates and the forecasts.

The DCC_E performs better than all the other MGARCHes and the R.W., especially for 20-day forecasts. This result seems to suggest that there is no loss of information in modeling the conditional variances as functions of their own lags and lagged realizations only. Furthermore, eliminating cross-dependencies among conditional variances reduces the tendency of richly parametrized models to over-fit the data and therefore the noise.

The 20-day averages of the realized and fitted standard deviations for the exchange rates of Japanese Yen/US Dollar, British Pound/US Dollar, and Swiss Franc/US Dollar are plotted in Figure 1, Figure 2, and Figure 3 respectively. The Gaussian BEKK, even though very smooth w.r.t. the data, seems to capture the big movements in volatility and delivers values up to 1.5 and 2.0. On the other hand, its inability to capture movements below 0.5 in Figure 1, 0.6 in Figure 2, and 0.7 in Figure 3 is striking. The t-BEKK looks pretty much like a smoothed version of its Gaussian counterpart. It is interesting to notice how the conditional standard deviations for the two BEKKs are centered around different values, with the Gaussian BEKK always above the t-BEKK. This should be due to the fact that big movements in volatility are filtered by the t-distribution and therefore seen as much smaller by its BEKK, which will consequently lower the estimates of the intercepts. The Gaussian R-GARCH seems to lie in between the two BEKKs in terms of smoothness. Even though it does not seem to capture the high levels of volatility as the Gaussian BEKK does, at least for the British Pound/US Dollar series it seems to be doing a much better job, especially in the years after 1992.

Often, the graph of the smoothed H-GARCH standard deviations looks like a photocopy of the data. Unfortunately this is not because it is actually tracking the volatilities but rather because it is behaving as a Random Walk. The results in Table 1 provide a more realistic measure of the performance of this model, which could be
easily overestimated by a simple look at its plots. The DCC_E provides very smooth forecasts that seem to track the volatilities quite accurately. This model appears to have all the positive features of the other models without the negative ones: it is smooth but not flat like the t-BEKK, and it captures the movements in the standard deviations without the noise and the spikes of the Gaussian BEKK, smoothed H-GARCH, and R-GARCH.

In Figure 4, Figure 5, and Figure 6 it is possible to appreciate the behavior of the models w.r.t. the correlations. It is quite an unfair comparison for the R-GARCH, given that this model does not observe the signs of the realizations. The plot in Figure 5 clearly shows the difficulty of this model in tracking the correlations. The BEKK models do not show the same differences in terms of smoothing and centering as they did for the standard deviations. Because correlations are ratios between covariances and standard deviations, scale effects cancel out, delivering almost identical results to the eye. The H-GARCH, at least visually, is performing poorly. The cause of such wandering around is that this model absorbs the noise as if it were information. Hence, the components of the conditional variance-covariance matrix will be very noisy and so will the corresponding correlations. The DCC_E, exploiting the advantage of being the only MGARCH to explicitly model the correlations, provides the best fit to the data. Again, it appears to have filtered out most of the noise while the other GARCHes have not.

2.9.2 Estimation of a richly parametrized DCC_E(1,1)

The data employed for this study of the second moment behavior of US bond returns with different maturities consists of the H.15 Selected Interest Rates from the Board of Governors of the Federal Reserve System. In particular, the following 3 returns have been modeled: the 3-month Treasury bill (secondary market rate), the 1-year Treasury constant maturity rate, and the 10-year Treasury constant maturity rate. The sample size is 10,912 daily observations from 2/1/1962 to 11/28/2003. All these series showed very similar paths, as depicted in Figure 7, and therefore a preliminary investigation was conducted to search for the best VARMA parametrization that would yield uncorrelated residuals. The results, which could not be subjected to hypothesis testing because the conditional variance-covariance matrix had not been modeled yet, indicated that a suitable model for the location could be a VARMA(2,1). There
is no claim that this is the model for the location of the process. It is a choice aimed at preventing auto-correlation and cross-correlation in the residuals from being modeled as correlations in the second moments. The drawback of this safe choice is that the location process itself will require 30 parameters to be estimated.

The DCC_E allows the conditional variances to be modeled as univariate GARCH processes. In a study by Cappiello, Engle, and Sheppard (2003) for a constructed 5-year average maturity US bond, the best model to fit the data turned out to be a GJR-GARCH. The latter is a variation of Bollerslev's (1986) GARCH that allows for threshold effects. In this study I wanted to estimate an asymmetric GARCH with level effects and therefore, given the results of the work of Engle et al. (2003), the natural choice seemed to be an EGARCH specification:

    \ln(h_{i,t}) = \omega + \alpha \ln(h_{i,t-1}) + \beta \frac{|\varepsilon_{i,t-1}|}{\sqrt{h_{i,t-1}}} + \gamma \frac{\varepsilon_{i,t-1}}{\sqrt{h_{i,t-1}}} + \delta y_{i,t-1}    (36)

which ensures the positivity of h_{i,t}. Notice that the GJR-GARCH, on the other hand, needs to be constrained in its parameters to ensure the positivity of the conditional variances, and the introduction of the level effect would have caused further problems in this direction. Also, the GJR-GARCH presents the computational inconvenience of having to deal with the indicator function I[\varepsilon_{t-1} < 0] in order to model the asymmetry in the volatility. As in Engle et al. (2003), I modeled the correlations of the standardized residuals \eta_{i,t} = \varepsilon_{i,t} / \sqrt{h_{i,t}} with a BEKK(1,1):

    Q_t = (K K' - \Delta K K' \Delta' - \Theta K K' \Theta') + \Delta Q_{t-1} \Delta' + \Theta \eta_{t-1} \eta_{t-1}' \Theta'    (37)

where K is a lower triangular matrix and \Delta and \Theta are diagonal. In their paper the matrix K K' was obtained by targeting the correlations. In this work, however, because the estimation will be carried out in one step in order to obtain ML estimates and correct standard errors, correlation targeting cannot be applied, as the matrix K K' so obtained does not necessarily coincide with the ML estimate. The Cholesky factorization K K' is necessary for the BEKK specification to guarantee the positivity of Q_t. Clearly there is no reason why the latter, even though positive definite, should be a correlation matrix. Thus, instead of using Q_t directly, the DCC literature suggests the use of R_t, where:

    R_{ij,t} = \frac{Q_{ij,t}}{\sqrt{Q_{ii,t} Q_{jj,t}}}    (38)
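To make the mechanics of equations (37)-(38) concrete before turning to the identification issue, the following Python sketch (illustrative only; the parameter values are assumptions, not the estimates discussed in this section) iterates the diagonal BEKK on the standardized residuals and rescales each Q_t into a correlation matrix R_t.

```python
import numpy as np

def dcc_correlations(eta, K, Delta, Theta, Q0):
    """Iterate eq. (37) on standardized residuals eta and rescale as in eq. (38)."""
    T, M = eta.shape
    KK = K @ K.T
    intercept = KK - Delta @ KK @ Delta.T - Theta @ KK @ Theta.T
    Q, R = Q0.copy(), np.empty((T, M, M))
    for t in range(T):
        e = eta[t - 1] if t > 0 else np.zeros(M)
        Q = intercept + Delta @ Q @ Delta.T + Theta @ np.outer(e, e) @ Theta.T
        d = 1.0 / np.sqrt(np.diag(Q))
        R[t] = Q * np.outer(d, d)          # R_{ij,t} = Q_{ij,t} / sqrt(Q_{ii,t} Q_{jj,t})
    return R

# Hypothetical values: K is the identity (so K_11 = 1, the normalization used below),
# Delta and Theta are diagonal with typical GARCH-like magnitudes.
M = 3
K = np.eye(M)
Delta = np.diag([0.97, 0.96, 0.95])
Theta = np.diag([0.15, 0.18, 0.20])
eta = np.random.default_rng(2).standard_normal((1000, M))
print(dcc_correlations(eta, K, Delta, Theta, Q0=np.eye(M))[-1])
```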

Given that I am not targeting the correlations, equation (38) might cause an identification problem for the BEKK. Consider the following model:

    Q_t = \Omega + \Delta Q_{t-1} \Delta' + \Theta \eta_{t-1} \eta_{t-1}' \Theta'    (39)

and multiply both sides by \lambda > 0:

    \lambda Q_t = \lambda \Omega + \Delta (\lambda Q_{t-1}) \Delta' + (\lambda^{1/2} \Theta) \eta_{t-1} \eta_{t-1}' (\lambda^{1/2} \Theta)'    (40)

Defining \tilde{Q}_t \equiv \lambda Q_t, \tilde{\Omega} \equiv \lambda \Omega, \tilde{\Theta} \equiv \lambda^{1/2} \Theta and substituting:

    \tilde{Q}_t = \tilde{\Omega} + \Delta \tilde{Q}_{t-1} \Delta' + \tilde{\Theta} \eta_{t-1} \eta_{t-1}' \tilde{\Theta}'    (41)

In spite of their differences, the matrices Q_t and \tilde{Q}_t will yield the same correlation matrix R_t. Hence, the parameter matrices \Omega and \Theta are not identified. In order to correct this problem it is sufficient to pin down one element of the matrix \Omega or, in this specific case, of the matrix K. In this study I normalized the first element on the main diagonal of the Cholesky factor: K_{11} = 1.

The initial values of the univariate EGARCHes were set equal to the unconditional variances of the VARMA(2,1) residuals. The initial values of the BEKK on the correlations, instead, were set equal to the unconditional correlations of the de-GARCHed VARMA(2,1) residuals. This solution seemed to be the best in the sense that it helped contain the total number of parameters of the model. For instance, estimating the initial conditions of the EGARCHes and the BEKK would have required 9 extra parameters.

The maximization of the Gaussian likelihood function w.r.t. the 56 parameters of the model was carried out using the climber algorithm, with the same starting value for every parameter of the model. While convergence to reasonable values was rather quick, it soon became clear that achieving convergence to the maximum would not be that fast. A good step size for the climber, in this setting, turned out to be one whose magnitude is proportional to the parameter's value:

    \theta_i^{(n+1)} = \theta_i^{(n)} + \theta_i^{(n)} \delta z_i^{(n+1)}    (42)

A drawback of this formulation is that it does not allow the parameters to change sign during the optimization: as \theta_i approaches zero so does the step size, making the algorithm freeze at a non-optimal point. To avoid this side-effect the algorithm was monitored and the step size was allowed to be independent of the magnitude of the parameter whenever the latter fell below a given threshold in absolute value.
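The following Python sketch shows one way the climber with the proportional step-size rule (42) could be implemented; it is a simplified illustration, not the code used for the estimation. The objective function, the number of iterations, the step-size constant, and the threshold below which the step becomes magnitude-independent are all placeholder assumptions.

```python
import numpy as np

def climber(loglik, theta0, n_iter=20000, delta=0.05, floor=1e-4, seed=0):
    """Random-walk hill climber: a proposal is accepted only if the log-likelihood
    does not decrease (eqs. 33-34); the step is proportional to the parameter's
    value (eq. 42) unless the parameter is below `floor` in absolute value."""
    rng = np.random.default_rng(seed)
    theta, best = theta0.astype(float).copy(), loglik(theta0)
    for _ in range(n_iter):
        i = rng.integers(theta.size)                     # displace one parameter at a time
        z = rng.standard_normal()
        if abs(theta[i]) > floor:
            step = abs(theta[i]) * delta * z             # proportional step, eq. (42)
        else:
            step = delta * z                             # magnitude-independent step near zero
        proposal = theta.copy()
        proposal[i] += step
        value = loglik(proposal)
        if value >= best:                                # keep only non-decreasing moves
            theta, best = proposal, value
    return theta, best

# Toy usage on a concave surrogate objective standing in for the MGARCH log-likelihood.
target = np.array([0.3, 0.9, -0.2])
def toy_loglik(th):
    return -np.sum((th - target) ** 2)

theta_hat, value = climber(toy_loglik, theta0=np.full(3, 0.1))
print(theta_hat, value)
```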

Because of the slow pace at which the parameters moved at every iteration of the climber algorithm, and because the numerical first derivatives of the log-likelihood had a magnitude not less than 0.01, the optimization algorithm was temporarily switched to steepest ascent. While iterating the latter did not improve the convergence, it helped shed light on some aspects of the objective function. First of all, the steepest-ascent algorithm with tuned relaxation parameter was much slower at increasing the log-likelihood than the climber algorithm with tuned step size. Looking at the individual derivatives showed that the parameters were not moved consistently in one direction: for example, one parameter might be increasing from one iteration to the next and then start decreasing, and so on. An interesting thing to notice is that, in spite of the magnitude of the first order conditions and the fact that the log-likelihood would increase at every iteration, the parameter values did not change dramatically overall but only by small amounts. This seems to suggest that the multivariate VARMA+MGARCH model is behaving, w.r.t. the optimization process, like a polynomial where very small changes in one parameter may produce huge movements in the value of the function.

The difficulty in maximizing the log-likelihood of this richly parametrized, but only 3-dimensional, model seems to arise from the many feedbacks among the different components. Quite apart from the fact that a VARMA(2,1) is not an easy model to estimate by itself and that GARCH and MGARCH models are often difficult to fit, estimating them at the same time complicates things even further. A change in one VARMA parameter is going to produce, through the residuals, effects on all the MGARCH parameters, which in turn will affect all those of the VARMA. While this problem is common to every VARMA+MGARCH model, the DCC specification further increases the feedbacks among parameters. Here a change in one VARMA parameter affects the residuals and therefore the univariate GARCHes. Since the BEKK on the correlations is based on the de-GARCHed residuals, it will be affected by changes in both the VARMA and the univariate GARCHes. The latter and the BEKK will then affect all the VARMA parameters. Such perverse behavior is eliminated at once if the estimation is carried out in multiple steps: eliminating all the feedbacks by conditioning each estimation on the results of the preceding ones renders each step very stable. Furthermore, the number of parameters to be estimated at each step is only a fraction of that of the whole model, making the optimization at each step a much easier task.

The optimization of the DCC_E was stopped when the absolute value of every
element of the score dropped below a preset tolerance, and the corresponding parameter estimates were used to produce the plots in Figure 8 and Figure 9. The former shows the 20-day averages of the standard deviations of the VARMA(2,1) residuals and the averages of the DCC_E fit. All three series show very similar movements in volatility except for the points in time at which the spikes occur. Nevertheless, the MGARCH model seems to track them well (there is no MAD or MSE to confirm these impressions at this time) and is able to reproduce most of the spikes with approximately the correct magnitude. The correlations observed in the data, in Figure 9, are very noisy (an average over a longer time span would probably give a better picture of the underlying movements) and the performance of the DCC_E can only be guessed from how it tracks the local trends. For example, the correlation plot of the 1-Year and 10-Year bonds shows a consistent drop starting in the late 1960s. Such a movement would appear unjustified by the data plot, and yet it is there. In fact, at that time the 10-Year Treasury Constant Maturity Rate did exhibit a consistent jump in volatility while the same did not happen to the 1-Year Treasury Constant Maturity Rate, with the consequence that the correlation must have dropped in the data as well. The latter is hard to detect because the data plot exhibits erratic movements of the same magnitude.

Overall, even with parameter estimates that have not fully converged, the diagonal DCC_E with variances modeled by EGARCHes(1,1) with asymmetries and level effects appears to capture the movements in the standard deviations and correlations. This suggests that the point estimates obtained so far are reasonably good. Nevertheless, the causes that led to non-convergence should be investigated and possibly removed. While this might not be a relevant problem if the main interest of the econometrician is to use the model to make forecasts, it will certainly prevent further analysis, i.e. the computation of the standard errors and hypothesis testing.

3 Large Scale GARCH Models

In order to feasibly estimate large scale MGARCH models the number of parameters must be contained. A multi-step approach allows the problem to be broken down into smaller components that can be handled: estimating first the VARMA, then the conditional variances, and finally the conditional correlations is the philosophy behind the DCC. However, when M is big the estimation of the conditional correlations might become infeasible when they are modeled with a diagonal BEKK. I will suggest a possible alternative based on a scalar BEKK that may allow the conditional correlations not to be too constrained and at the same time provide better fit and forecasts.

An alternative to separating the estimation of the conditional variances from that of the conditional correlations is to separate the estimation of the GARCH component from that of the ARCH component. Following this observation, I will propose the H-GARCH, which is a VEC-like model on the square root of the conditional variance-covariance matrix. Associated with it, I am also proposing an estimation technique for large scale VARMA models that allows the separate estimation of the AR and MA parts using nothing more than iterated OLS.

3.1 A large scale conditional correlation model

The DCC models of Engle (2001) and Tse and Tsui (2002) allow for complicated dynamics of the volatilities, are parsimonious, and can be estimated in two steps. These characteristics, together with the fact that they perform better than other MGARCH models at forecasting, have made them the benchmark for every researcher and practitioner. An interesting aspect of the DCC_T is the way that sample correlations enter the conditional correlation equation. Instead of simply using the outer product of the de-GARCHed residuals, as in the DCC_E, Tse and Tsui (2002) use an arithmetic average over a time span of at least M observations. In the DCC_T this way of proceeding is necessary in order to have a next-period conditional correlation matrix which actually is a correlation matrix, i.e. elements on the main diagonal equal to one and off-diagonal elements smaller than one in modulus. In the DCC_E this is not a concern, since the next-period conditional correlation matrix is not the one arising directly from the model but rather its standardized version. Between the two
approaches, the one of Tse and Tsui certainly presents an advantage: the sample correlations that enter their model are less noisy than those entering the DCC_E. This reduction of noise, due to the averaging of the idiosyncratic component, should strengthen the signal. Moreover, the estimation itself should be easier, since averaging the outer products will smooth the regressors and thus eliminate spikes. A likelihood which is a function of smoother regressors should present a smoother surface and hence be easier to optimize. The model for the conditional correlations will then be:

    Q_t = ( \bar{Q} - \Delta \bar{Q} \Delta' - \Theta \bar{Q} \Theta' ) + \Delta Q_{t-1} \Delta' + \Theta \Psi_{t-1} \Theta'    (43)

where \Delta and \Theta are diagonal coefficient matrices, \bar{Q} is the unconditional correlation matrix, and

    \Psi_{t-1} = \frac{1}{L} \sum_{l=1}^{L} \eta_{t-l} \eta_{t-l}'    (44)

Q_t is positive definite because of the BEKK structure of the model. Taking the expression for a generic element of Q_t, it is possible to check that it is also a correlation matrix:

    Q_{ii,t} = (1 - \delta_i^2 - \theta_i^2) \bar{Q}_{ii} + \delta_i^2 Q_{ii,t-1} + \theta_i^2 \Psi_{ii,t-1}    (45)

\bar{Q} and \Psi_{t-1} are correlation matrices and, if the same is true for Q_{t-1}, then the previous equation reduces to:

    Q_{ii,t} = (1 - \delta_i^2 - \theta_i^2) + \delta_i^2 + \theta_i^2 = 1    (46)

Hence every element on the main diagonal of Q_t is equal to one. The next step is to verify that all the off-diagonal elements are less than one in modulus:

    Q_{ij,t} = (1 - \delta_i \delta_j - \theta_i \theta_j) \bar{Q}_{ij} + \delta_i \delta_j Q_{ij,t-1} + \theta_i \theta_j \Psi_{ij,t-1}    (47)

From the stationarity conditions of the BEKK we have:

    \delta_i \delta_j + \theta_i \theta_j < 1 \quad \forall i, j    (48)
    \delta_i^2 + \theta_i^2 < 1 \;\Rightarrow\; |\delta_i| < 1, \; |\theta_i| < 1 \quad \forall i    (49)

and assuming that all the coefficients of the model are positive:

    \delta_i \geq 0, \; \theta_i \geq 0 \quad \forall i    (50)

yields \delta_i \in [0, 1) and \theta_i \in [0, 1) for all i. Hence, the conditional correlations are a convex combination of \bar{Q}_{ij}, Q_{ij,t-1}, and \Psi_{ij,t-1} and therefore:

    -1 \leq \min\{ \bar{Q}_{ij}, Q_{ij,t-1}, \Psi_{ij,t-1} \} \leq Q_{ij,t} \leq \max\{ \bar{Q}_{ij}, Q_{ij,t-1}, \Psi_{ij,t-1} \} \leq 1    (51)

Having proven that Q_t is indeed a correlation matrix under the positivity assumption, the question that arises is whether this assumption is plausible or not. The diagonal BEKK on the conditional correlations implies, as we have seen, that each element of Q_t is a function of its own lag and smoothed realization. Therefore, if we believe that a high correlation today does not imply a low correlation tomorrow, and that a high realized correlation today does not imply a low correlation tomorrow, we will not have problems accepting the positivity assumption. Also, in most applications the signs of the parameters of the BEKK, like those of a univariate GARCH, turn out to be positive. When implementing the above model it might be convenient to re-parametrize it in terms of \delta_i^{1/2} and \theta_i^{1/2} in order to impose the positivity of \delta_i and \theta_i. Even if the positivity assumption on the parameters seems plausible, it is always best to make sure that it is supported by the data; hence, if some parameters turn out to be very close to zero, it might be appropriate to remove this assumption, let the matrix Q_t not be a correlation matrix, and proceed as in Cappiello, Engle, and Sheppard (2003). From a theoretical point of view, the flexibility of the model I propose stands in between the DCC_T, in which the parameter matrices are replaced with positive scalars, and the DCC_E, in which the parameters are not constrained to be positive.

The advantages of having Q_t be a correlation matrix without further transformations are several. First of all, it eliminates the unpleasant feature of the DCC_E that makes Q_t depend on non-correlation matrices such as Q_{t-1} and \eta_{t-1} \eta_{t-1}'. Second, by dropping the normalization R_{ij,t} = Q_{ij,t} / \sqrt{Q_{ii,t} Q_{jj,t}}, it eliminates the unwanted interactions (unwanted in the sense that they are not explicitly modeled) among the elements of Q_t. Third, it makes the interpretation of the parameters of the BEKK straightforward.

For large scale MGARCH models, with M of the order of 100, even the diagonal DCC will turn out to be infeasible to estimate because of its 200 parameters (with correlation targeting). In this case the only feasible solution seems to be to revert to Engle's (2001) specification with scalar parameter matrices. The main advantage would be that the model's parameters would reduce to 2. The disadvantage is, obviously, that all the conditional correlations are constrained to have the same dynamics. The latter, while theoretically unappealing, might have some empirical justification in the fact that GARCH and correlation models usually have coefficients close to 0.9 for the lagged conditional variance and 0.1 for the lagged realizations. Using smoothed lagged realizations might help mitigate this inconvenience by shifting weight from Q_{t-1} to \Psi_{t-1}. This way the dynamics of Q_t will be more dependent on \Psi_{t-1} and
therefore more reactive to the realizations. A scalar DCC with smoothed realizations \Psi_{t-1}, while constrained in its dynamics relative to a diagonal DCC, should be expected to give more weight to the realizations and therefore let the latter govern its dynamics. Hence, even though every conditional correlation will share the same parameters, their paths will still be allowed to differ, as they will now depend heavily on \Psi_{t-1}.

How to choose the number of lags L over which the realizations should be smoothed in order to obtain \Psi_{t-1} is also part of the research. The optimal \Psi_{t-1} should be computed allowing each element to be smoothed over a different time interval and possibly using different exponential smoothing parameters. In practice, however, this would require too many parameters to be estimated: M for the lags over which to smooth and M for the smoothing parameters. Hence, the correlation model would end up having as many parameters as the standard diagonal DCC. To avoid this, either the smoothing parameters \beta_i or the lags L_i of the correlations will have to be constrained to be equal. For M of the order of 100 both will need to be constrained:

\Psi_{t-1} = \frac{1-\beta}{1-\beta^L} \sum_{l=1}^{L} \beta^{l-1} \eta_{t-l} \eta_{t-l}'   (52)

for \beta < 1, or:

\Psi_{t-1} = \frac{1}{L} \sum_{l=1}^{L} \eta_{t-l} \eta_{t-l}'   (53)

for \beta = 1.

From a computational point of view, the estimation of \beta will not present major problems, as it is a continuous variable. The same is not true for the discrete variable L, the number of lags. In light of this it is convenient to define the following mean over a non-integer number of periods:

\Psi_{t-1}(L+\delta) = \frac{L}{L+\delta} \Psi_{t-1}(L) + \frac{\delta}{L+\delta} \eta_{t-L-1} \eta_{t-L-1}'   (54)

It is straightforward to check that \Psi_{t-1}(L+\delta)|_{\delta=0} = \Psi_{t-1}(L) and \Psi_{t-1}(L+\delta)|_{\delta=1} = \Psi_{t-1}(L+1), while for \delta \in (0,1) we have a linear interpolation between these two values. This version of the DCC will have 1 parameter for the lagged conditional correlations, 1 parameter for the lagged smoothed realizations, 1 smoothing parameter, and 1 parameter that determines the number of lags over which to smooth, for a total of 4 parameters regardless of the dimension M. Another option that deserves to be investigated is the introduction of realizations smoothed over different fixed lags, i.e. \Psi_{t-1}(20), \Psi_{t-1}(125), \Psi_{t-1}(250).
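A small Python sketch of equations (52)-(54), assuming a matrix eta of standardized residuals (rows indexed by time) and purely illustrative values of beta, L, and delta; it computes the smoothed realization and checks the interpolation over a non-integer number of lags:

    import numpy as np

    def psi_smoothed(eta, t, L, beta=1.0):
        # Psi_{t-1}: smoothed outer products of eta[t-1], ..., eta[t-L]
        # (eq. 52 for beta < 1, eq. 53 for beta = 1)
        M = eta.shape[1]
        if beta < 1.0:
            weights = (1 - beta) / (1 - beta**L) * beta**np.arange(L)
        else:
            weights = np.full(L, 1.0 / L)
        Psi = np.zeros((M, M))
        for l in range(1, L + 1):
            Psi += weights[l - 1] * np.outer(eta[t - l], eta[t - l])
        return Psi

    def psi_fractional(eta, t, L, delta):
        # Psi_{t-1}(L + delta): linear interpolation between Psi(L) and Psi(L+1) (eq. 54)
        Psi_L = psi_smoothed(eta, t, L, beta=1.0)
        extra = np.outer(eta[t - L - 1], eta[t - L - 1])
        return L / (L + delta) * Psi_L + delta / (L + delta) * extra

    # hypothetical data: 500 observations of 3 standardized residual series
    rng = np.random.default_rng(1)
    eta = rng.standard_normal((500, 3))
    print(np.allclose(psi_fractional(eta, 300, 20, 0.0), psi_smoothed(eta, 300, 20)))  # delta = 0
    print(np.allclose(psi_fractional(eta, 300, 20, 1.0), psi_smoothed(eta, 300, 21)))  # delta = 1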

3.2 The H-GARCH

The unconstrained VEC does not ensure positivity of the conditional variance covariance matrix H_t. Two main alternatives solve this problem: constrain the coefficients in order to achieve positive definiteness, as in the BEKK, or model H_t^{1/2} rather than H_t, as in the R-GARCH. Following the second alternative, the VEC(p,q) takes the following expression:

h_t = K + \sum_{i=1}^{p} \Delta_i h_{t-i} + \sum_{j=1}^{q} \Theta_j u_{t-j}   (55)

where:

h_t = vech(H_t^{1/2})   (56)

u_t = vech((\varepsilon_t \varepsilon_t')^{1/2})   (57)

It is straightforward to check that the conditional variance-covariance matrix is positive definite. Assume that the symmetric matrix \varepsilon_t \varepsilon_t' is equal to C_t \Lambda_t C_t', where C_t is the matrix of eigenvectors and \Lambda_t is the diagonal matrix of eigenvalues. Then its square root (\varepsilon_t \varepsilon_t')^{1/2} must be equal to C_t \Lambda_t^{1/2} C_t'. But \varepsilon_t \varepsilon_t' is a rank-one matrix with 1 positive and M-1 zero eigenvalues: \lambda_{1,t} > 0, \lambda_{i \neq 1,t} = 0. Hence, the first element of \Lambda_t^{1/2} will be equal to \lambda_{1,t}^{1/2}, while all the others will be equal to zero. Alternatively:

\Lambda_t^{1/2} = \Lambda_t / \lambda_{1,t}^{1/2}   (58)

The fact that \varepsilon_t \varepsilon_t' is a rank-one matrix implies that its trace is equal to the only positive eigenvalue:

\lambda_{1,t} = trace(\varepsilon_t \varepsilon_t') = \varepsilon_t' \varepsilon_t   (59)

Therefore:

\Lambda_t^{1/2} = \Lambda_t / (\varepsilon_t' \varepsilon_t)^{1/2}   (60)

from which it is straightforward that:

(\varepsilon_t \varepsilon_t')^{1/2} = \varepsilon_t \varepsilon_t' / (\varepsilon_t' \varepsilon_t)^{1/2}   (61)

and immediate to check that:

((\varepsilon_t \varepsilon_t')^{1/2})^2 = \varepsilon_t (\varepsilon_t' \varepsilon_t) \varepsilon_t' / (\varepsilon_t' \varepsilon_t) = \varepsilon_t \varepsilon_t'   (62)
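A quick numerical check of (61)-(62), with a hypothetical residual vector: the square root of the rank-one outer product is simply the outer product rescaled by the Euclidean norm of \varepsilon_t, which is what makes the H-GARCH inputs cheap to compute:

    import numpy as np

    def sqrt_outer(eps):
        # (eps eps')^{1/2} = eps eps' / (eps' eps)^{1/2} for a rank-one outer product (eq. 61)
        return np.outer(eps, eps) / np.sqrt(eps @ eps)

    eps = np.array([0.8, -1.3, 0.4])                 # hypothetical residual vector
    S = sqrt_outer(eps)
    print(np.allclose(S @ S, np.outer(eps, eps)))    # squares back to the outer product (eq. 62)
    u_t = S[np.tril_indices_from(S)]                 # the distinct elements of S, as in u_t of eq. (57)
    print(u_t)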

This way it is possible to easily compute the square root of the outer-product of the realizations for use in the H-GARCH specification. The latter is then an alternative specification to the R-GARCH, with the possible advantage that, because the signs of the realizations are preserved, it should provide a better tracking of the covariances and correlations.

An interesting feature of this model is that it does not necessarily need to be estimated through maximum likelihood. In fact, as with the VEC(p,q), it can be rewritten and estimated as a VARMA conditional on the location parameters of the model. When the ML estimation of an unrestricted VEC(p,q) becomes infeasible because of the dimensionality of the model, it is always possible to switch to an H-GARCH(p,q), rewrite it as a VARMA(max(p,q),p), and easily obtain consistent but inefficient estimates. Defining u_t = h_t + \eta_t, where \eta_t is a zero-mean error term, the H-GARCH(p,q) takes the following VARMA form:

u_t = K + \sum_{i=1}^{\max(p,q)} A_i u_{t-i} + \sum_{j=1}^{p} B_j \eta_{t-j} + \eta_t   (63)

where:

A_i = I[i \leq p] \Delta_i + I[i \leq q] \Theta_i   (64)

B_i = -\Delta_i   (65)

VARMA for Large Scale Models

Maximum likelihood estimation of large scale models is generally infeasible, and large VARMA models are no exception. The simultaneous equations literature, for example, has proposed alternative estimators to FIML: in general these estimators are consistent but less efficient (the iterated FIVE is also efficient). VARMA models can be estimated using techniques developed for simultaneous equations with minor adjustments. Given the general VARMA(p,q):

y_t = K + \sum_{i=1}^{p} A_i y_{t-i} + \sum_{j=1}^{q} B_j \varepsilon_{t-j} + \varepsilon_t   (66)

the first thing to achieve is the separation between the AR(p) and the MA(q) components. Given a consistent estimate of the AR(p) coefficients, it is possible to carry out a consistent estimation of the MA(q) coefficients, and vice-versa. The separation of the autoregressive and moving average components of the model still leaves the problem of how to handle their estimation.
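Before turning to that, the coefficient mapping in (63)-(65) can be sketched as follows in Python, with hypothetical \Delta_i and \Theta_j matrices; this is only an illustration of the bookkeeping, not estimation code:

    import numpy as np

    def hgarch_to_varma(Delta, Theta):
        # Map H-GARCH(p,q) coefficients to the VARMA form of eqs. (63)-(65):
        # A_i = 1[i <= p] Delta_i + 1[i <= q] Theta_i,  B_i = -Delta_i
        p, q = len(Delta), len(Theta)
        dim = (Delta[0] if p else Theta[0]).shape[0]
        A = [np.zeros((dim, dim)) for _ in range(max(p, q))]
        for i in range(max(p, q)):
            if i < p:
                A[i] = A[i] + Delta[i]
            if i < q:
                A[i] = A[i] + Theta[i]
        B = [-D for D in Delta]
        return A, B

    # hypothetical H-GARCH(1,2) on a bivariate system (vech dimension 3)
    Delta = [0.90 * np.eye(3)]
    Theta = [0.06 * np.eye(3), 0.03 * np.eye(3)]
    A, B = hgarch_to_varma(Delta, Theta)
    print(len(A), len(B))   # max(p,q) = 2 AR matrices, p = 1 MA matrix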

The preferred alternative is to estimate each equation separately or, in other words, each row of the AR(p) and MA(q) at a time. Estimating the coefficients of the AR(p) equation by equation with OLS yields inconsistent estimates, since the lagged dependent variables treated as exogenous regressors are in fact correlated with the error term. However, carrying out such an estimation yields:

y_t = \tilde{K} + \sum_{i=1}^{p} \tilde{A}_i y_{t-i} + \eta_t   (67)

where the matrices \tilde{K} and \tilde{A}_i are constructed from the coefficients of the equation-by-equation OLS regressions. While none of these matrices converges to the corresponding K or A_i, the model as a whole is still good at fitting the observed values:

\tilde{y}_t = \tilde{K} + \sum_{i=1}^{p} \tilde{A}_i y_{t-i}   (68)

from which we can expect that E(y_t | I_{t-1}) = \tilde{y}_t. Therefore, while the parameter estimates are inconsistent, the fitted values are unbiased and can be used as instrumental variables. Because the current dependent variable y_t is correlated with lagged error terms up to \varepsilon_{t-q}, the associated IV-regressors should not be based on any information past I_{t-q-1}. The first step is to compute:

\tilde{y}_{t-q} = \tilde{K} + \sum_{i=1}^{p} \tilde{A}_i y_{t-q-i}   (69)

and, by recursion:

\tilde{y}_{t-1} = \tilde{K} + \sum_{i=1}^{p} \tilde{A}_i \tilde{y}_{t-1-i}   (70)

At this point the whole set of IV-regressors {\tilde{y}_{t-1}, \tilde{y}_{t-2}, ..., \tilde{y}_{t-p}} has been computed and is ready to be used for a consistent estimation of the AR(p) parameters. Once more, estimating the coefficients equation by equation using OLS and the IV-regressors yields:

y_t = \hat{K} + \sum_{i=1}^{p} \hat{A}_i \tilde{y}_{t-i} + \hat{\eta}_t   (71)

Due to the IV approach, plim \hat{K} = K and plim \hat{A}_i = A_i. Hence, the AR(p) component of the VARMA(p,q) can be consistently estimated through an IV approach that only accounts for the number of lags q of the MA component. Once the parameters of the AR(p) have been estimated, it is straightforward to rewrite the VARMA(p,q) in the following way:

\hat{z}_t = \sum_{j=1}^{q} B_j \varepsilon_{t-j} + \varepsilon_t   (72)

where:

\hat{z}_t = y_t - \hat{K} - \sum_{i=1}^{p} \hat{A}_i y_{t-i}   (73)

Notice that \hat{z}_t is a function of y_t and its lagged values, and not of \tilde{y}_{t-i} or \hat{y}_{t-i}.
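One way to implement the construction of the IV-regressors in (67)-(70) is sketched below in Python, assuming a data matrix y with rows indexed by time and hypothetical orders p and q (with q at least as large as p, so that the recursion produces the full instrument set); it is an illustration of the recursion, not the author's code:

    import numpy as np

    def ols_ar(y, p):
        # Eq.-by-eq. OLS of y_t on a constant and y_{t-1}, ..., y_{t-p} (eq. 67);
        # the coefficients are inconsistent but the fitted values can serve as instruments.
        T, M = y.shape
        X = np.hstack([np.ones((T - p, 1))] + [y[p - i:T - i] for i in range(1, p + 1)])
        coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)   # (1 + M p) x M
        return coef

    def fitted(coef, lags):
        # K~ + sum_i A~_i (i-th lag) for a single time point; lags = [lag 1, ..., lag p]
        return np.concatenate([[1.0]] + list(lags)) @ coef

    def iv_regressors(y, coef, t, p, q):
        # Build y~_{t-q} from actual lags (eq. 69), then iterate the fitted recursion
        # forward (eq. 70); actual lags are gradually replaced by fitted ones.
        hist = [y[t - q - i] for i in range(1, p + 1)]
        tilde = []
        for s in range(t - q, t):
            y_s = fitted(coef, hist)
            tilde.append(y_s)
            hist = [y_s] + hist[:-1]
        return tilde[-p:]          # y~_{t-p}, ..., y~_{t-1} (oldest first; assumes q >= p)

    # hypothetical use on simulated data with p = q = 2
    rng = np.random.default_rng(2)
    y = rng.standard_normal((300, 3))
    p, q = 2, 2
    coef = ols_ar(y, p)
    instruments = iv_regressors(y, coef, t=250, p=p, q=q)
    print(len(instruments), instruments[0].shape)

The second pass of equation (71) then regresses y_t, equation by equation, on a constant and these instruments.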

The next problem is how to estimate a vector MA(q) model in a feasible manner. Again, because of the dimensionality of the model, ML will not be feasible. However, an equation-by-equation estimation procedure can still be applied by iterating OLS estimations. Let \hat{B}_j^{(n)} be the estimate of B_j at the n-th step. Given \{\hat{z}_t\}_{t=1+p}^{T} and \{\hat{B}_j^{(n)}\}_{j=1}^{q}, conditioning on \hat{\varepsilon}_1^{(n)} = \hat{\varepsilon}_2^{(n)} = ... = \hat{\varepsilon}_q^{(n)} = 0 it is possible to compute all subsequent values of \hat{\varepsilon}_t^{(n)}:

\hat{\varepsilon}_t^{(n)} = \hat{z}_t - \sum_{j=1}^{q} \hat{B}_j^{(n)} \hat{\varepsilon}_{t-j}^{(n)}   (74)

in the same way that the residuals are extracted in the maximum likelihood approach. The (n+1)-step estimates \hat{B}_j^{(n+1)} are then computed through equation-by-equation OLS on the following system, for l from 1 to M:

\hat{z}_{l,t} = \sum_{j=1}^{q} \sum_{m=1}^{M} \hat{B}_{l,m,j}^{(n+1)} \hat{\varepsilon}_{m,t-j}^{(n)} + e_t^{(n+1)}   (75)

At convergence, \| \hat{B}_j^{(n+1)} - \hat{B}_j^{(n)} \| \leq \epsilon \; \forall j, the matrices \hat{B}_j are consistent, even though inefficient, estimates of B_j. This VARMA(p,q) estimator for large scale models pays the price of inefficiency to achieve feasibility whenever ML or one-step approaches become computationally infeasible. This is the framework in which it should be employed and evaluated: where other methods fail. In spite of the inefficiency of the point estimates, and the fact that standard errors might not be consistent, forecasts based on these estimates should still turn out to be quite accurate. Even though the consistency of the large scale VARMA(p,q) estimator has only been verified numerically through Monte Carlo simulations, a rigorous proof should not be very hard to derive. In fact, the consistency of the AR(p) part follows directly from the IV approach, while for the MA(q) part it should be possible to prove convergence to the MLE along the lines of Hausman's (1975) proof that the iterated FIVE converges to maximum likelihood.
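For concreteness, the residual-extraction and OLS iteration of (74)-(75) can be sketched as follows, assuming the filtered series \hat{z}_t of (73) is already available and starting, as one possible choice, from zero MA coefficients; the simulated bivariate MA(1) at the end is purely illustrative:

    import numpy as np

    def extract_residuals(z, B):
        # eps_t = z_t - sum_j B_j eps_{t-j}, with the first q residuals set to zero (eq. 74)
        T, M = z.shape
        q = len(B)
        eps = np.zeros((T, M))
        for t in range(q, T):
            eps[t] = z[t] - sum(B[j] @ eps[t - 1 - j] for j in range(q))
        return eps

    def ols_ma_step(z, eps, q):
        # Eq.-by-eq. OLS of z_t on the lagged extracted residuals (eq. 75)
        T, M = z.shape
        X = np.hstack([eps[q - 1 - j:T - 1 - j] for j in range(q)])   # eps_{t-1}, ..., eps_{t-q}
        coef, *_ = np.linalg.lstsq(X, z[q:], rcond=None)              # (M q) x M
        return [coef[j * M:(j + 1) * M].T for j in range(q)]          # B_1, ..., B_q

    def estimate_ma(z, q, n_iter=50, tol=1e-6):
        M = z.shape[1]
        B = [np.zeros((M, M)) for _ in range(q)]
        for _ in range(n_iter):
            B_new = ols_ma_step(z, extract_residuals(z, B), q)
            done = max(np.max(np.abs(Bn - Bo)) for Bn, Bo in zip(B_new, B)) < tol
            B = B_new
            if done:
                break
        return B

    # hypothetical check: simulate a bivariate MA(1) and recover B_1
    rng = np.random.default_rng(3)
    B_true = np.array([[0.5, 0.1], [0.0, 0.4]])
    e = rng.standard_normal((3000, 2))
    z = e.copy()
    z[1:] += e[:-1] @ B_true.T
    print(estimate_ma(z, q=1)[0])   # approximately B_true for a long enough sample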

4 Future Research Proposal

The natural development of this preliminary version of the paper is to complete, extend, and further investigate the findings on the MGARCH models. The results obtained so far suggest clear directions in which the research effort should be concentrated.

1) The analysis of the various MGARCH specifications for fully specified densities must be completed with respect to the exchange rates data set. Further, it needs to be extended to a wider range of data sets, including the NYSE, NASDAQ, S&P, the FTSE, international stock market indexes, derivatives, more US Bonds with different maturities, foreign bonds, and so on. This is necessary to gather enough empirical evidence to rank the MGARCH models. The measure of how well these specifications perform also needs more work: the one used in this paper was based on comparisons among point estimates. The derivation of their standard errors and tests of the significance of their differences will be part of future research. Even though out-of-sample forecasts tend to be biased toward under-parametrized models, they cannot be overlooked: an out-of-sample comparison among the forecast variance covariance matrices of the various MGARCHes will be included as an alternative measure.

The climber proved to be easy to program and capable of delivering good parameter estimates. However, it is not clear how it performs with respect to gradient methods. Estimating a few MGARCH models on more than one data set with both techniques will be part of future work. This should provide enough information on whether to fine-tune the climber or discard it because it is inefficient or unreliable at finding the global maximum. The experience gained from the estimation of the VARMA(2,1) + asymmetric EGARCH with level effect + diagonal DCC_E will be useful in finding a solution to the problems that afflicted that estimation.

In this context I have proposed the use of smoothed lagged realizations of the variance covariance matrices. The expected results are a better fit of the MGARCHes, due to the conditioning on less noisy information, and easier and faster estimations, because of the smoother shape of the likelihood function. The multivariate GARCH models of the previous analysis will be re-estimated using smoothed realizations (of different flavors) and the results analyzed to determine whether there are significant improvements in the fit, in the forecasts,

and in the estimation.

2) In the literature, large scale MGARCH models have been estimated with dimensions up to approximately M = 40. The next step will be to estimate models with M = 50 and, in the near future, to attempt M = 100. The difference between this setting and the one where densities are fully specified is that here I will follow a multi-step estimation. The starting point will be the DCC_E, which allows the separate estimation of the conditional variances and the correlations. The curse of dimensionality will hit less hard, as it will be confined to the model governing the dynamics of the correlations. A diagonal BEKK on the correlations will not be sustainable for M = 50, but a scalar parametrization of the coefficient matrices will. I will try to make the constrained dynamics more flexible by using smoothed realizations: the weight should then shift from the lagged conditional correlation matrix to the \Psi_{t-1} term, allowing each correlation more freedom to follow a different path.

In this framework the econometrician's main interest is to be able to handle large conditional variance covariance matrices and to make accurate forecasts. Nevertheless, having correct standard errors for a multi-step estimation procedure would allow inference on the parameter values. Even though this paper did not deal with this problem, in the future it will be interesting to study how to compute consistent standard errors for multi-step estimators. The latter yield consistent estimates even though they do not set the score of the whole model to zero in finite samples; exploiting this information might help derive the correct asymptotic distribution.

The H-GARCH that I have proposed, along with the multi-step estimation technique for large scale VARMA models and the use of smoothed residuals, proved to perform better than the Gaussian-BEKK with respect to the MAD and better than the t-BEKK with respect to the MSE. Given the performance of the DCC, and the suggestion that models without interactions among the elements of the variance covariance matrix perform better than those with them, it will be interesting to estimate an H-GARCH on such premises. Even though it is not as elegant as the other MGARCH models, it allows for extremely fast estimation. The encouraging results of Table 1 indicate that it is worth studying and developing.

3) The third and last main topic to be investigated is the use of MGARCH models as auxiliary models for the indirect estimation of stochastic volatility models.

In this respect the desired characteristics of an MGARCH specification are multiple: it should be much easier to estimate than the maintained model (otherwise it might be convenient to estimate the latter directly using MCMC or other approaches), and it should provide an exhaustive description of the data, as it will be the eyes of the maintained model. Nevertheless, the auxiliary model should not overfit the data, as all the noise would then be transferred to the maintained model.

5 Conclusions

The preliminary results of this paper show that the DCC_E is the benchmark model among MGARCHes. This finding suggests that including dependencies across the components of the conditional variance covariance matrix improves neither the fit to the data nor the forecasts. On the other hand, the estimation of the DCC_E in one step on the bond data pointed out its possible limits. Because of the feedbacks among the location function, the univariate GARCHes, the BEKK on the correlations, and their normalization, it proved not to be an exceptionally stable model when estimated in one step. Multi-step estimation, instead, presents no major problems, as each step is conditional on the results of the previous one, thus eliminating all the feedbacks at once.

The H-GARCH proved to be a competitive alternative, but above all it is the estimation technique associated with it that delivered satisfactory results (when compared to the other MGARCH specifications) in no more than 7 seconds. Furthermore, it provided a tangible signal that using smoothed realizations might improve the models' fit and forecasts, as well as smooth the surface of the likelihood, with all the attendant consequences. The paper also contains a detailed outline of the estimation methodology for large scale VARMA models, even though there is no formal proof of the asymptotic properties of the estimator. While no estimation attempt has yet been made, a thorough discussion of which MGARCH model is more suitable for large scale estimation is provided. The necessary changes to the DCC_E are listed and motivated: parsimony is the key element. Multi-step estimation helps reduce the number of parameters that need to be estimated at once and removes perverse feedbacks, but in the end the main problem remains the parametrization of a positive definite matrix.

6 Appendix: Univariate GARCH(1,1) models

GARCH (on variances; no asymmetries; power: 1)
h_t = \omega + \alpha h_{t-1} + \beta \varepsilon_{t-1}^2

AVGARCH (on std. dev.; no asymmetries; power: 1)
h_t^{1/2} = \omega + \alpha h_{t-1}^{1/2} + \beta |\varepsilon_{t-1}|

NARCH (on std. dev.; no asymmetries; power: \lambda)
h_t^{1/2} = [\omega + \alpha h_{t-1}^{\lambda/2} + \beta |\varepsilon_{t-1}|^{\lambda}]^{1/\lambda}

EGARCH (on std. dev.; asymmetries; power: ln)
\ln(h_t) = \omega + \alpha \ln(h_{t-1}) + \beta |\varepsilon_{t-1}| / h_{t-1}^{1/2} + \gamma \varepsilon_{t-1} / h_{t-1}^{1/2}

GJR-GARCH (on variances; asymmetries; power: 1)
h_t = \omega + \alpha h_{t-1} + \beta \varepsilon_{t-1}^2 + \gamma I[\varepsilon_{t-1} < 0] \varepsilon_{t-1}^2

ZARCH (on std. dev.; asymmetries; power: 1)
h_t^{1/2} = \omega + \alpha h_{t-1}^{1/2} + \beta |\varepsilon_{t-1}| + \gamma I[\varepsilon_{t-1} < 0] |\varepsilon_{t-1}|

APARCH (on std. dev.; asymmetries; power: \lambda)
h_t^{1/2} = [\omega + \alpha h_{t-1}^{\lambda/2} + \beta |\varepsilon_{t-1}|^{\lambda} + \gamma I[\varepsilon_{t-1} < 0] |\varepsilon_{t-1}|^{\lambda}]^{1/\lambda}

AGARCH (on variances; asymmetries; power: 1)
h_t = \omega + \alpha h_{t-1} + \beta (\varepsilon_{t-1} + \gamma)^2

NAGARCH (on variances; asymmetries; power: 1)
h_t = \omega + \alpha h_{t-1} + \beta (\varepsilon_{t-1} + \gamma h_{t-1}^{1/2})^2
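As a compact illustration of the recursions listed above, a Python sketch of the GARCH and GJR-GARCH variance updates in the appendix's notation (\alpha on the lagged variance, \beta on the lagged squared shock); the parameter values are hypothetical:

    import numpy as np

    def garch_path(eps, omega, alpha, beta, gamma=0.0, gjr=False):
        # h_t = omega + alpha h_{t-1} + beta eps_{t-1}^2
        # (+ gamma I[eps_{t-1} < 0] eps_{t-1}^2 for the GJR variant), in the appendix's notation
        h = np.empty(len(eps))
        h[0] = np.var(eps)                   # simple initialization
        for t in range(1, len(eps)):
            h[t] = omega + alpha * h[t - 1] + beta * eps[t - 1] ** 2
            if gjr and eps[t - 1] < 0:
                h[t] += gamma * eps[t - 1] ** 2
        return h

    # hypothetical parameters and data
    rng = np.random.default_rng(4)
    eps = rng.standard_normal(1000)
    h = garch_path(eps, omega=0.05, alpha=0.90, beta=0.05, gamma=0.04, gjr=True)
    print(h[:5])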

7 References

Alexander, C. (2001), A Primer on the Orthogonal GARCH Model. ISMA Centre, Mimeo.

Alexander, C. and A. Chibumba (1997), Multivariate Orthogonal Factor GARCH. University of Sussex, Mimeo.

Andersen, T.G. and T. Bollerslev (1998), Answering the Skeptics: Yes, Standard Volatility Models do Provide Accurate Forecasts. International Economic Review 39.

Bauwens, L., S. Laurent and J.V.K. Rombouts (2003), Multivariate GARCH Models: A Survey. CORE Discussion Paper.

Bollerslev, T. (1990), Modeling the Coherence in Short-run Nominal Exchange Rates: A Multivariate Generalized ARCH Model. Review of Economics and Statistics 72.

Bollerslev, T. (1987), A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return. Review of Economics and Statistics 69.

Bollerslev, T. (1986), Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31.

Bollerslev, T., R. Engle, and J. Wooldridge (1988), A Capital Asset Pricing Model with Time Varying Covariances. Journal of Political Economy 96.

Brooks, C., S.P. Burke, and G. Persand (2003), Multivariate GARCH Models: Software Choice and Estimation Issues. Journal of Applied Econometrics 18.

Cappiello, L., R. Engle, and K. Sheppard (2003), Asymmetric Dynamics in the Correlations of Global Equity and Bond Returns. Working Paper 204, Working Paper Series of the European Central Bank.

Cumby, R., S. Figlewski, and J. Hasbrouck (1993), Forecasting Volatility and Correlations with EGARCH Models. Journal of Derivatives, Winter.

Diebold, F.X. and M. Nerlove (1989), The Dynamics of Exchange Rate Volatility: A Multivariate Latent Factor ARCH Model. Journal of Applied Econometrics 1.

Ding, Z., R. Engle, and C.W.J. Granger (1993), A Long Memory Property of Stock Market Returns and a New Model. Journal of Empirical Finance 1.

Engle, R. (2001), Dynamic Conditional Correlation - A Simple Class of Multivariate GARCH Models. Forthcoming in Journal of Business and Economic Statistics.

Engle, R. (1990), Stock Volatility and the Crash of '87. The Review of Financial Studies 3.

Engle, R. and F. Kroner (1995), Multivariate Simultaneous Generalized ARCH. Econometric Theory 11.

Engle, R. and V. Ng (1993), Measuring and Testing the Impact of News on Volatility. Journal of Finance 48.

Engle, R., V. Ng, and M. Rothschild (1990), Asset Pricing with a Factor-ARCH Covariance Structure: Empirical Estimates for Treasury Bills. Journal of Econometrics 45.

Figlewski, S. (1997), Forecasting Volatility. Financial Markets, Institutions and Instruments 6.

Gallant, A.R. and G. Tauchen (2002), SNP: A Program for Nonparametric Time Series Analysis. User's Guide Version 8.8.

Gallant, A.R. and G. Tauchen (1989), Seminonparametric Estimation of Conditionally Constrained Heterogeneous Processes: Asset Pricing Applications. Econometrica 57.

Gilks, W.R., S. Richardson, and D.J. Spiegelhalter (1996), Markov Chain Monte Carlo in Practice. London, Chapman & Hall.

Glosten, L., R. Jagannathan, and D. Runkle (1993), On the Relationship Between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. Journal of Finance 48.

Gouriéroux, C. and A. Monfort (1996), Simulation-Based Econometric Methods. New York, Oxford University Press.

Gouriéroux, C., A. Monfort and A. Trognon (1984), Pseudo-Maximum Likelihood Methods: Theory. Econometrica 52.

Hamilton, J.D. (1994), Time Series Analysis. Princeton, Princeton University Press.

Hausman, J. (1975), An Instrumental Variable Approach to Full-Information Estimators for Linear and Certain Nonlinear Models. Econometrica 43.

Higgins, M.L. and A.K. Bera (1992), A Class of Nonlinear ARCH Models. International Economic Review 33.

Ingber, L. (1989), Very Fast Simulated Re-Annealing. Mathematical Computer Modeling 12.

Jorion, P. (1995), Predicting Volatility in the Foreign Exchange Market. Journal of Finance 50.

Jorion, P. (1996), Risk and Turnover in the Foreign Exchange Market. In J.A. Frankel, G. Galli, and A. Giovannini, eds., The Microstructure of Foreign Exchange Markets. Chicago: The University of Chicago Press.

Kawakatsu, H. (2003), Matrix Exponential GARCH.

Ledoit, O., P. Santa-Clara, and M. Wolf (2002), Flexible Multivariate GARCH Modeling With an Application to International Stock Markets. Forthcoming in The Review of Economics and Statistics.

Lin, W. (1992), Alternative Estimators for Factor GARCH Models - A Monte Carlo Comparison. Journal of Applied Econometrics 7.

Nelson, D.B. (1991), Conditional Heteroskedasticity in Asset Returns: A New Approach. Econometrica 59.

Sheppard, K. (2002), Understanding the Dynamics of Equity Covariance. Manuscript, UCSD.

Taylor, S.J. (1986), Modeling Financial Time Series. John Wiley and Sons Ltd.

Tse, Y. and A. Tsui (2002), A Multivariate GARCH Model with Time-Varying Correlations. Journal of Business and Economic Statistics 20.

Zakoïan, J.-M. (1994), Threshold Heteroskedastic Models. Journal of Economic Dynamics and Control 18.

Table 1: Cumulative variances: MAD and MSE at 5-, 10-, and 20-day horizons for the R.W., R-GARCH, g-BEKK(36), t-BEKK(39), F-GARCH, g-CCC, t-CCC, g-DCC_E(27), t-DCC_E, g-DCC_T, t-DCC_T, H-GARCH(90), and sH-GARCH(90) models.

Figure 1: Japanese Yen/US Dollar standard deviations: 20-day averages (Data, g-BEKK, t-BEKK, R-GARCH, sH-GARCH, DCC_E).

Figure 2: British Pound/US Dollar standard deviations: 20-day averages (Data, g-BEKK, t-BEKK, R-GARCH, sH-GARCH, DCC_E).

Figure 3: Swiss Franc/US Dollar standard deviations: 20-day averages (Data, g-BEKK, t-BEKK, R-GARCH, sH-GARCH, DCC_E).

Figure 4: Japanese Yen/US Dollar and British Pound/US Dollar correlations: 20-day averages (Data, g-BEKK, t-BEKK, R-GARCH, sH-GARCH, DCC_E).

Figure 5: Japanese Yen/US Dollar and Swiss Franc/US Dollar correlations: 20-day averages (Data, g-BEKK, t-BEKK, R-GARCH, sH-GARCH, DCC_E).

Figure 6: British Pound/US Dollar and Swiss Franc/US Dollar correlations: 20-day averages (Data, g-BEKK, t-BEKK, R-GARCH, sH-GARCH, DCC_E).

Figure 7: Bond rates.

Figure 8: Bond standard deviations: 20-day averages (3-Month Treasury Bill, 1-Year and 10-Year Treasury Constant Maturity Rates; VARMA residuals and DCC).

Figure 9: Bond correlations: 20-day averages (3-Month/1-Year, 3-Month/10-Year, 1-Year/10-Year; VARMA residuals and DCC).
