MODELING AND FORECASTING LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE

Size: px

Start display at page:

Download "MODELING AND FORECASTING LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE"

Charles Tate
6 years ago
Views:

1 MODELING AND FORECASTING LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE Laurent A.F. Callot VU University Amsterdam, CREATES, and the Tinbergen Institute Anders B. Kock 1 Aarhus University and CREATES akock@creates.au.dk Marcelo C. Medeiros Pontifical Catholic University of Rio de Janeiro mcm@econ.puc-rio.br Abstract: We consider modeling and forecasting large realized covariance matrices by penalized vector autoregressive models. We consider Lasso-type estimators to reduce the dimensionality and provide strong theoretical guarantees on the forecast capability of our procedure. We show that we can forecast realized covariance matrices almost as precisely as if we had known the true driving dynamics of these in advance. We next investigate the sources of these driving dynamics as well as the performance of the proposed models for forecasting the realized covariance matrices of the 30 Dow Jones stocks. We find that the dynamics are not stable as the data are aggregated from the daily to lower frequencies. Furthermore, we are able beat our benchmark by a wide margin. Finally, we investigate the economic value of our forecasts in a portfolio selection exercise and find that in certain cases an investor is willing to pay a considerable amount in order get access to our forecasts. Keywords: Realized covariance; shrinkage; Lasso; forecasting; portfolio allocation. JEL: C22. Acknowledgements: We would like to thank two anonymous referees and the editor Jonathan Wright for constructive comments which have greatly improved the paper. Parts of the research for this paper were done while the first and second authors were visiting the Pontifical Catholic University of Rio de Janeiro. 1 Corresponding author: Department of Economics and Business Economics - CREATES, Fuglesangs Allé 4, building 2628, 313, 8210 Aarhus V, Denmark, Phone:

2 2 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE Its hospitality is gratefully appreciated. MCM s research is partially supported by the CNPq/Brazil. Financial support by the Danish National Research Foundation (grant DNR78, CREATES) is gratefully acknowledged. 1. Introduction This paper deals with modeling and forecasting large-dimensional time-varying realized measures of covariance matrices of returns on financial assets. By realized measures of a covariance matrix we mean estimates of the integrated covariance matrix based on ultra-high-frequency data as, for example, the composite realized kernel of Lunde et al. (2015). We evaluate the proposed model in terms of its forecasting performance in comparison to benchmark alternatives as well as in terms of its economic value in a conditional mean-variance analysis to assess the value of volatility forecasts to short-horizon investors as in Fleming et al. (2003). Our procedure is able to unveil the dynamics governing the evolution of the covariance matrices and we investigate which sectors are instrumental in driving these dynamics. This is important since we can answer questions such as: are certain sectors uncorrelated with others? Are the variances of stocks mainly described by their own past or are there spillover effects? Are the covariances governed by the same dynamics as the variances? We also investigate whether these dynamics change as the realized measures are aggregated from the daily to the weekly and monthly level as one could expect the driving dynamics to differ from the short to the long run. Modern portfolio selection as well as risk management and empirical asset pricing models strongly rely on precise forecasts of the covariance matrix of the assets involved. For instance, the traditional mean-variance approach of Markowitz (1952) requires the estimation or modeling of all variances and covariances. The evolution of financial markets has increased the number of assets, leading traditional approaches to be less suitable to be used by practitioners. Furthermore, Bollerslev et al. (1988) finds that covariance matrices appear to be quite variable over time with a strong autoregressive structure, and find expected returns to be strongly correlated with these

3 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 3 variations. Typical multivariate ARCH-type models fail to deliver reliable estimates due the curse of dimensionality and large computational burdens. Possible solutions frequently used in practice are: (1) a weighted-average of past squared returns as in the Riskmetrics methodology; (2) vast latent conditional covariance models, or (3) the construction of factor models. In this paper we will take a different route and will consider the estimation of vast vector autoregressive models for realized measures of covariance matrices as in Kock and Callot (2015). To avoid the curse of dimensionality we advocate the use of the Least Absolute Shrinkage and Selection Operator (Lasso) proposed by Tibshirani (1996). An advantage of this approach is that it does not reduce dimensionality by transforming variables, thus keeping the interpretability of the individual variables. Our paper is related to the work of Audrino and Knaus (2014) who applied the Lasso to univariate autoregressive models in order to forecast realized measures of volatility. The contributions of this paper are as follows. First, we put forward a methodology to model and forecast large time-varying realized covariance matrices with a minimum number of restrictions. Second, our method can shed light on the drivers of the dynamics of these realized covariance matrices as the Lasso also performs variable selection. Third, we derive an upper bound on the forecast error which is valid even in finite samples. Fourth, we show how this bound translates into a bound on the forecast error of the time-varying variance of a portfolio constructed from a large number of assets. Fifth, we use our method to forecast the realized covariance matrices at the daily, weekly, and monthly level of aggregation. Finally, we apply our methodology to the selection of a portfolio with mean-variance preferences. We use the estimator of Lunde et al. (2015) to estimate the realized covariance matrix. Several other papers have dealt with the estimation of vast covariance matrices, e.g. Bickel and Levina (2008), Levina et al. (2008), Fan et al. (2008), Wang and Zou (2010), Fan et al. (2011), Fan et al. (2012), Fan et al. (2012), Hautsch et al. (2012), Fan et al. (2013). Our goal is to build an econometric methodology which will be used to construct dynamic models to forecast large covariance

4 4 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE matrices estimated elsewhere. Even with proper estimates of covariance matrices, modeling their dynamics and producing reliable forecasts pose major challenges, especially in high dimensions. In terms of forecasting (realized) covariance matrices, there are some papers related to ours. However, none of them tackles the curse of dimensionality in a general manner. For example, Bauer and Vorkink (2011) propose a multivariate version of the heterogeneous autoregressive (HAR) model. Their approach is based on the log-matrix covariance specification of Chiu et al. (1996). However, their model is only feasible for covariance matrices of low dimension. Furthermore, the parameters of their model are not easily interpretable because of the log-matrix transformation, and thus one can not investigate the main driving forces of the dynamics of the covariance matrix as in our approach. Chiriac and Voev (2011) consider a multivariate ARFIMA model. However, only small covariance matrices can be modeled. Golosnoy et al. (2012) propose a Conditional Autoregressive Wishart (CAW) model. As before, in their application the authors only consider five different assets. The use of high-frequency data for portfolio selection has been studied by many authors. Fleming et al. (2003) measure the economic gains of using high-frequency data in the context of investment decisions. Their results indicate that the economic value of switching from daily to intradaily returns to estimate daily covariance matrices are substantial. They estimate that a risk-averse investor would be willing to pay 50 to 200 basis points per year to capture the observed gains in portfolio performance. However, their analysis is restricted to small sets of assets. Hautsch et al. (2015) introduce the Multi-Scale Spectral Components (MSSC) model, which is a kind of factor specification, for forecasting covariance matrices and they show that high-frequency data models can translate into better portfolio allocation decisions over longer investment horizons. Although, these authors consider the same problem as we do here, their modeling approach is quite different.

5 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 5 The rest of the paper is organized as follows. Section 2 describes the problem setup, defines notation, and presents the Lasso and some key assumptions. In Section 3 we present theoretical performance guarantees of our procedure. The data set and computational issues are discussed in Section 4, variable selection and forecasting results are presented in Section 5, portfolio selection is discussed in Section 6. Finally, Section 7 concludes the paper. All proofs are deferred to the supplementary appendix. 2. Methodology In this section we put forward our methodology and present a finite sample upper bound on the forecast error of our procedure. But first we introduce the following notation Notation. For any x R m, x = m i=1 x2 i, x l 1 = m i=1 x i and x l = max 1 i m x i denote l 2, l 1 and l norms, respectively. For an m m matrix A, A denote the maximum absolute entry of A. For any two real numbers a and b, a b = max(a, b) and a b = min(a, b) The econometric method. Let Σ t denote the n T n T realized measure of the integrated covariance matrix as of time t. Note that n T is indexed by the sample size T implying that n T may be large compared to T and hence standard asymptotics, which take n T as a fixed number, may not accurately reflect the actual performance in finite samples. The entries of the conditional covariance matrix are potentially a complicated function of past entries. If we allowed conditioning on the infinite past, each entry of Σ t would be a function of infinitely many variables. Instead, we assume that Σ t only depends on the p T most recent values of Σ t, namely Σ t 1,..., Σ t pt. As each of these matrices has (n T + 1)n T /2 unique entries we still have that every entry of Σ t can be a function of n T n T +1 2 p T variables (plus a constant for each entry). In the case of the Dow Jones Industrial Average (DJIA) considered in this paper we have n T = 30, and assuming that p T = 10 lags suffice to describe the dynamics of Σ t,

6 6 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE every entry of Σ t can be a function of up to 4,650 variables (plus a constant). This is the case for all the 465 unique entries of Σ 2 t. To make this manageable, we shall assume that every entry of Σ t is a linear function of it past, so that Σ t follows a VAR(p T ) process. This allows for rich dynamics as every entry of Σ t can depend on several thousand variables, leading to equations with many more parameters than observations. Formally, define y t = vech Σ t where vech is the half-vectorization operator returning a vector of length n T (n T + 1)/2 with the unique entries of Σ t. We assume that y t follows a vector autoregression of order p T, i.e., y t = ω + p T i=1 Φ i y t i + ɛ t, t = 1,..., T (1) where Φ i, i = 1,..., p T, are the k T k T dimensional parameter matrices pertaining to the coefficients of the lagged left hand side variables with k T = n T (n T + 1)/2, ω is a k T 1 vector of intercepts allowing for non-zero means of all variances and covariances and ɛ t is the zero-mean error term. We assume that we have p T initial values y pt +1,..., y 0. Henceforth we suppress the dependence of n T, k T and p T on T to simplify notation. The number of parameters per equation, k, increases quadratically in n, as does the number of equations. So, even for Σ t of a moderate dimension, the number of parameters in (1) may be very large. Hence, standard estimation techniques such as least squares may provide very imprecise parameter estimates or even be infeasible if the number of variables is greater than T. To circumvent this problem, we use the Least Absolute Shrinkage and Selection Operator (Lasso) of Tibshirani (1996) which is feasible even when the number of parameters to be estimated is (much) larger than T. Even though each entry of Σ t can depend on every entry of the previous p conditional covariance matrices, it is reasonable to assume that the major stocks play a more important role in describing the dynamics of Σ t. Furthermore, it is likely 2 Actually, we shall work with models including up to 20 lags, thus having 9,300 variables per equation.

7 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 7 that the intrasectoral dynamics are stronger than the intersectoral ones. By this we mean that lags of variables from the same sector as the left hand side variable being modeled are more likely to be significant than variables from other sectors. For these reasons the Φ i, i = 1,..., p, may contain many zeros, i.e. they are sparse matrices. It is exactly in such a setting that we can establish performance guarantees on the Lasso. Thus, the Lasso can be used to unveil and disentangle the potentially complex dynamics of the sequence of realized covariance matrices. If some of the intercepts in ω equal zero this will make the model even sparser. However, this is not a requirement for our theory to be valid. It is important to stress that we do not assume the realized covariance matrices themselves to be sparse. The coefficients Φ i, which describe the dynamics of the realized covariance matrices, are the quantities assumed to be sparse. Let Z t = (1, y t 1,..., y t p) be the (kp + 1) 1 vector of explanatory variables and Z = (Z T,..., Z 1 ) the T (kp + 1) matrix of covariates. Let y i = (y T,i,..., y 1,i ) be the T 1 vector of observations on the ith variable, i = 1,..., k, and ɛ i = (ɛ T,i,..., ɛ 1,i ) the corresponding vector of error terms. Finally, γi = (ωi, βi ) is the (kp + 1) dimensional vector of true parameters for equation i which also implicitly depends on T. Hence, we may write (1) as y i = Zγ i + ɛ i, i = 1,..., k (2) such that each equation in (1) may be modeled separately. In our application modeling the DJIA, every equation has 465 p explanatory variables plus a constant. However, it is likely that many of these variables are irrelevant in explaining the dynamics of y i, leading to many entries of β i being equal to zero. For each i = 1,..., k we denote by s i = {j : β i,j 0} the number of non-zero entries in β i. We shall require below that s i does not grow too fast. It is important to stress that it is the number of non-zero elements of β i per equation that is the important quantity. We do not make any assumptions on ω being sparse, none of the ω i are assumed to be zero but this is not ruled out either.

8 8 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 2.3. The Lasso. The properties of the Lasso have been studied extensively, as for example in Zhao and Yu (2006), Meinshausen and Bühlmann (2006), Bickel et al. (2009), Kock and Callot (2015), and Medeiros and Mendes (2015) to mention just a few. It is known that it only selects the correct model asymptotically under rather restrictive conditions on the dependence structure of the covariates. However, it can still serve as an effective screening device in these cases as it can remove many irrelevant covariates while keeping the relevant ones and estimating the model with high precision. We investigate the properties of the Lasso when applied to each equation separately and we provide finite sample bounds on the system-wise forecasting performance. As we do not assume that any of the intercepts in ω is sparse we shall employ a version of the Lasso which does not penalize these. To be precise, we estimate γ i in (2) by minimizing the following objective function L(γ i ) = 1 T y i Zγ i 2 + 2λ T β i l1 (3) with respect to γ i where λ T is a sequence to be defined exactly below. Let γ i = ( ω i, β i) denote the minimizer. (3) is basically the least squares objective function plus an extra term penalizing entries of β i that are different from zero. The intercepts are not penalized. λ T is often chosen either by cross validation or using an information criterion. The Lasso performs estimation and variable selection simultaneously which is useful in high-dimensional models as the number of specification tests to be carried out after the usual estimation step would be daunting. Furthermore, it is not clear in which order these tests should be carried out and one would necessarily have to resort to ad hoc procedures. Finally, the Lasso is a convex minimization problem, resulting in fast estimation and variable selection Forecasting with the Lasso. Once the γ i have been obtained for equation i, forecasting is done as usual in VAR models. The one-step ahead forecast is given by ŷ i,t +1 = γ iz T. Doing this for all i = 1,..., k results in a forecast Σ T +1 of the matrix Σ T +1. Forecasts at longer horizons are obtained by iterating forward the estimated VAR model.

9 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 9 3. Theoretical Results In this section we provide an upper bound on the maximal forecast error Σ T +1 Σ T +1. It is exactly such a bound which is required for providing a theoretical upper bound on the errors of the variance forecasts of a portfolio of assets. Let w R n denote a set of portfolio weights. The true conditional variance of the portfolio is given by σt 2 +1 = w Σ T +1 w, while the forecasted variance is σ T 2 +1 = w ΣT +1 w. How close are σ 2 T +1 and σ2 T +1 to each other? In the presence of an upper bound on the positions an investor may take in any given asset we can answer this question by providing an upper bound on Σ T +1 Σ T +1. Thus, we need a precision guarantee for the forecasts of the Lasso. Assume that w l1 1 + c for some c 0. Here c indicates the amount of short selling that is allowed. c = 0 corresponds to no short selling, while for any c > 0 the maximum amount of short selling possible is c/2. For any fixed vector of portfolio weights, the next theorem, of which a similar version may be found in Fan et al. (2012), utilizes the short selling constraint to derive an upper bound on the distance between the predicted portfolio variance and the actual portfolio variance. Theorem 1. Assume that w l1 1 + c for some c 0. Then, 2 σ T +1 σt 2 +1 ΣT +1 Σ T +1 (1 + c) 2. Theorem 1 reveals that in the presence of a restriction on the exposure of the portfolio, an upper bound on Σ T +1 Σ T +1 implies an upper bound on the distance between σ T 2 +1 and σ2 T +1. It is sensible that a short-selling constraint is necessary in order to establish such a bound since otherwise the investor could go infinitely long in the least precisely forecasted stock. Note that the bound in Theorem 1 is generic in the sense that the forecast Σ T +1 does not need to come from a specific method. Any matrix Σ T +1 can be plugged in. We shall now give an upper bound on Σ T +1 Σ T +1 based on our VAR approach. Let m i and η 2 i,y denote the mean and variance of y t,i, respectively, and η 2 i,ɛ

10 10 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE be the variance of the mean zero ɛ t,i, 1 i k. Then define m T = max 1 i k m i, η T = max 1 i k (η i,y η i,ɛ ), and let s = max(s 1,..., s k ) denote the maximal number of non-zero coefficients in any equation of the VAR. Theorem 2. Let λ T = 8 ln(1 + T ) 5 ln(1 + k) 4 ln(1 + p) 2 ln(k 2 p)k 2 T /T where K T is defined in the appendix and bounded if m T and η 2 T are bounded. Under regularity conditions made precise in the appendix, one has for all 0 < q < 1 ( ΣT +1 Σ 16 ) T +1 K T qκ sλ 2 T + 1 (4) with high probability (the exact probability, as well as the definition of the constant κ > 0, are given in the appendix) where K T = η2 T (9m2 T +ln(kp+1) ln(t )). ηt Theorem 2 gives an upper bound on the forecast error of Σ T +1 which is valid even in finite samples. Recall that k = n(n + 1)/2. Thus, for any size n of the covariance matrix, Theorem 2 gives an upper bound on the forecast error of the Lasso. Clearly, the larger the covariance matrix to be forecasted is (k large), the larger this upper bound since there are more elements which have to be forecasted. However, K T only increases logarithmically in k. In order to gauge how precise the upper bound in (4) is, it is instructive to compare it to the bound one could have obtained if the true parameters had been known. If the true parameters Φ i, i = 1,..., p were known, one would have the following upper bound: Lemma 1. Assume that Φ i, i = 1,..., p are known. Then, under regularity conditions made precise in the appendix, with η T,ɛ = max 1 i k η i,ɛ, one has ΣT +1 Σ T +1 2ηT,ɛ 2 ln(k) ln(t ) with high probability (the exact one is given in the appendix). Compared to the bound in Theorem 2 the bound in Lemma 1 has roughly removed the term KT sλ T which is the part of the upper bound stemming from the estimation error of the parameters. The last part of the bound in Theorem 2 stems

11 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 11 from the fact that the error terms in the model (1) are unknown such that we must still bound their fluctuations with high probability. Even when the true parameters are known, these unknown error terms are an irremovable source of forecast error. From an asymptotic point of view one will usually requires that sλ T 0 since this is needed to make the estimation error tend to zero in probability. In this case the bound in Theorem 2 is almost as good as the one in Lemma 1. Hence, our forecasts are almost as precise as if we had known the true model from the outset even though we do not assume knowledge of which variables are relevant, nor do we assume to know the values of the non-zero parameters. By combining Theorems 1 and 2 one may achieve the following upper bound. Corollary 1. Under the assumptions of Theorems 1 and 2 one has that 2 σ T +1 σt 2 +1 ( 16 ) K T qκ sλ 2 T + 1 (1 + c) 2 with high probability (the exact one is given in the appendix) and K T as in Theorem 2. Corollary 1 provides a finite sample upper bound on the error of the forecast of the portfolio variance under a short selling constraint. As stressed previously, in the absence of a short selling constraint, corresponding to c, the upper bound in the above display tends to as in the worst case one could go infinitely long in the stock whose volatility is forecasted the least precisely and offsetting this with a short position in, say, the stock whose volatility is forecasted the most precisely. Note also that the bound in Corollary 1 is actually valid uniformly over {w R n : w l1 1 + c}, i.e. all portfolios with a short selling constraint of 1 + c. To be precise, Corollary 2. Under the assumptions of Theorems 1 and 2 one has that 2 sup σ T +1 σ 2 T +1 w R n : w l1 1+c ( 16 ) K T qκ sλ 2 T + 1 (1 + c) 2

12 12 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE with high probability (the exact one is given in the appendix) and K T as in Theorem 2. As ŵ = arg min w R n : w l1 1+c, n i=1 w i=1 w ΣT +1 w satisfies the conditions of Corollary 2 we have in particular that ŵ ΣT +1 ŵ and ŵ Σ T +1 ŵ are close to each other. This means that the forecasted portfolio variance is not far from the actually realized portfolio variance when using the weights ŵ that seek to minimize the out of sample forecast error. Ensuring positive-definiteness of Σ T +1. Σ T +1 are in general not positive definite, except for the no-change forecasts. Furthermore, certain matrices are illconditioned, which renders the estimation of portfolio weights unstable or even infeasible. To overcome these issues, we follow Hautsch et al. (2012) and apply eigenvalue cleaning to every matrix that has eigenvalues smaller than or equal to 0 or a condition number greater than 10n T. Write the spectral decomposition of Σ T +h = V T +h Λ T +h VT +h where V T +h is the matrix of eigenvectors and Λ T +h a diagonal matrix with the n eigenvalues λ it +h on its diagonal. Let λ mpt +h = min{ λ it +h λ it +h > 0}, replace all the λ it +h < λ mpt +h by λ mpt +h and define the diagonal matrix Λ T +h with the cleaned eigenvalues on its diagonal. The regularized forecast matrix Σ T +h = V T +h Λ T +h VT +h is by construction positive definite. Log-matrix transform. An alternative approach relies on properties of the matrix logarithm and the matrix exponential as explained in Chiu et al. (1996) or Bauer and Vorkink (2011). Prior to estimation we transform the data by applying the matrix logarithm, defining Ω t = log(σ t ), and estimating the VAR on vech Ω t. We then construct a forecast Ω T +h and revert the matrix logarithm by applying the matrix exponential to the forecasted covariance matrices to create our forecast Σ T +h = exp( Ω T +h ) which is positive definite by construction. Thus, we now assume that it is vech Ω t which follows a VAR model as in (1). Imposing exactly the same

13 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 13 regularity conditions as in Theorem 2 (but on the VAR for vech Ω t ) one can establish the following upper bound on the forecast error 3. Theorem 3. Let λ T = 8 ln(1 + T ) 5 ln(1 + k) 4 ln(1 + p) 2 ln(k 2 p)k 2 T /T where K T is defined in the appendix and bounded if m T and η 2 T are bounded. Under regularity conditions made precise in the appendix, one has for all 0 < q < 1 ΣT +1 Σ T +1 = +1 e ΩT e Ω T +1 ( ) ( [ ] 16 K T qκ sλ 2 T + 1 e n KT 2 max 1 i k γi +1 + l 1 [ 16 qκ 2 sλ T +1 with high probability (the exact one, as well as the definition of the constant κ > 0, are given in the appendix) where K T = η2 T (9m2 T +ln(kp+1) ln(t )). If Φ ηt 2 1 i, i = 1,..., p are 6 known and η T,ɛ = max 1 i k η i,ɛ, one has under regularity conditions made precise in the appendix ΣT +1 Σ T +1 = +1 e ΩT e Ω T +1 ( [ ] ) η 2T,ɛ log(k) ln(t )en 2 KT max 1 i k γi +1 + KT l 1 with high probability (the exact probability is given in the appendix). ]) The bound in Theorem 3 is very similar to the one in Theorem 2. The only ( [ ] [ ]) difference is the part e n KT 2 max 1 i k γi l 1 qκ 2 sλ T +1 which is the price we pay for translating a bound on the forecast error of the log-matrix transformed data into a bound on the original data. The size of this extra term is related to the modulus of continuity of the matrix exponential. Note that the dimension n enters this term in an exponential manner such that the bound may be rather uninformative when n is large compared to T. However, it is not surprising that one must pay such a price as the matrix exponential is a steep function meaning that small changes in its argument can lead to large changes in its value: small changes in ΩT +1 Ω T +1 can still lead to big changes in ΣT +1 Σ T +1 = +1 e ΩT e Ω T +1. However, we 3 We stress again that all notation and assumptions in Theorem 3 are exactly as in Theorem 2 except for the fact that the VAR structure is assumed for vech Ω t instead of vech Σ t.

14 14 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE shall see that in practice the forecasts based on the log-matrix transform are rather precise. A similar intuition applies when comparing Theorem 3 to the result in Lemma 1. The bounds are similar except for the price paid for translating a bound from the log-matrix world back to our original data. Finally, one should mention that from an economic point of view a drawback of the log-matrix transform is that the parameters no longer have any economic interpretation when the VAR is estimated on vech Ω t instead of vech Σ t. If one is interested in forecasting only, this is of course of no importance. 4. Data and implementation Data and Cleaning. The data consists of 30 stocks from the Dow Jones index from 2006 to 2012 with a total of 1474 daily observations 4. The daily realized covariances are constructed from 5 minutes returns by the method of Lunde et al. (2015). Results pertaining to the weekly and monthly level of aggregation as well as all tables and figures whose references start with sp- can be found in the supplementary material. The stocks can be classified in 8 broad categories highlighted in Table sp-1. When we do not use the log-matrix transformation, the data is transformed by taking the logarithm of the variances ensuring that, after an exponential transformation, all variance forecasts are strictly positive. The sample covered by our data set includes 16 out of the 20 largest intraday point swings of the Dow Jones industrial average, triggered by the financial crisis of 2008 as well as flash-crashes in 2010 and These events lead to many extreme values in the daily realized covariance matrices. In order to mitigate the effect of these extreme observation we perform some light cleaning of the data prior to estimation. To be precise, we flag for censoring every covariance matrix in which more than 25% of the (unique) entries are more than 4 standard errors (of the series corresponding to that entry) away from their sample average up to then. These matrices are replaced by an average of the 4 We are grateful to Asger Lunde for kindly providing us with the data.

15 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 15 nearest five preceding and following non-flagged matrices. Replacing these extreme entries with averages including only preceding days leads to replacement with relatively small values as extreme events in our data seem to appear suddenly but decay slowly. In our application we found that the results are robust to variations in the data cleaning procedure as only very few observations are censored (see below). Using this cleaning scheme, the flagged matrices are concentrated in October The flash crashes of the 6 th of May 2010 and the 9 th of August 2011 are also flagged. In total, the number of cleaned matrices is 20 which corresponds to a very small fraction (1.3%) of the total sample size of All forecasts are computed recursively and all forecast errors are computed based on the de-transformed forecasts. Daily forecasts are computed using a rolling window of 1000 observations leading to 455 forecasts. The first forecast is made on the 6th of February We forecast with VAR, EWMA, DCC, and random walk (No- Change) models. The estimators for the VARs are either the Lasso, the adaptive Lasso (adalasso), using the Lasso as initial estimator, or the post Lasso OLS to be explained in more detail in the next section. Implementation. All the computations 5 are carried out using R and the lassovar package based on the glmnet package which is an implementation of the coordinate descent algorithm of Friedman et al. (2010). When all β are penalized by the same λ (Lasso) the algorithm standardizes the variables and the estimated parameters are adjusted accordingly; variables are not standardized when estimating the adalasso. The intercepts are left unpenalized in all estimations. For all forecasts that do not come from the log-matrix transformed data, positive definiteness is ensured by eigenvalue cleaning. The models are estimated equation by equation, the penalty parameter is chosen by minimizing the Bayesian Information Criterion (BIC). The ( ) BIC for equation i and penalty parameter λ is BIC i (λ) = T log ɛ λ,i ɛ λ,i + ) kp j=1 ( βλ 1 ij 0 log(t ), where ɛ λ,i is the estimated vector of error terms corresponding to a penalty of λ. The results of the Lasso are compared to the ones of the 5 This work was in part carried out on the Dutch national e-infrastructure with the support of SURF Cooperative. Partial replication material can be found at

16 16 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE adalasso. Let J( β i ) = {j R kp : βi,j 0} denote the indices of the coefficients in the i th equation deemed zero by the Lasso and J i ( γ i ) = {1} (J i ( β i ) +1) where the set addition is understood elementwise. The adalasso estimates β i by minimizing the following objective function L(β i ) = 1 y i Z T J( γi ) γ i, J( γ i ) 2 + 2λ T j J( β i ) β i,j, i = 1,..., k. (5) β i,j If the first stage Lasso estimator classifies a parameter as zero it is not included in the second step, resulting in a problem of a much smaller size. If β i,j = 0 then β i,j is likely to be small by Theorem 1 in the supplementary appendix and consistency of the Lasso. Hence, 1/ β i,j is large and the penalty on β i,j is large. Conversely, if β i,j 0, β i,j is not too close to zero and the penalty is small. We also consider post Lasso least squares estimation (Belloni and Chernozhukov (2013)), that is, estimating the model by OLS after using the Lasso for variable selection. 5. Driving dynamics and forecasts This section reports our empirical findings. We begin by studying the dynamic behavior of the sequence of realized covariance matrices by investigating which variables drive its dynamics Driving dynamics of the realized covariance matrix. The results in this section are based on a VAR(1) estimated by the Lasso, each of the 465 equations to be estimated has 465 variables (plus a constant). Table 1 reports the average (across estimation windows) fraction of variables from a given category (in rows) selected in equations for stocks belonging to a given category (in columns) at the daily frequency. As an example, 75% of the lagged variances from the Basic Materials sector are included when modeling variances of the Basic Materials sector while only 17% of the lagged variances of stocks belonging to the Consumer, Non-cyclical sector are included when modeling variances of the Basic Materials sector. The numbers are broken down into variance and covariance terms in order to see whether diagonal and

17 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 17 off-diagonal terms have different driving dynamics. There are 30 variance equations and 435 covariance equations. Assigning categories to covariance terms involving variables from two different sectors requires a subjective choice. We have chosen to assign a covariance between two variables from different sectors to both sectors. Hence, such intersectoral covariance terms enter as explanatory variables in two rows in Table 1. The selected models in Table 1 for the variance terms are quite sparse, and lags of variance terms are much more important than lags of covariance terms. The most important dynamics for 7 out of 8 sectors are the intrasectoral ones (own lags). The Industrial and Energy sectors are particularly dependent on their own past as they always use all lagged variances from their respective categories to explain themselves. The financial sector is also driven strongly by its own past. Figure sp-1 in the appendix shows how often each individual stock was selected at each point in time. The bottom panel of Table 1 investigates the driving forces of the covariance terms. For 6 out of 8 sectors it is lagged variances of that sector which are chosen most often in order to explain the dynamics of the corresponding sector. Off-diagonal terms are now chosen more often than above. However, the fraction of times the off-diagonal terms are chosen is still lower than the corresponding number for the diagonal terms as there are many more off-diagonal terms. Table sp-2 in the supplementary appendix confirms that the above findings remain valid when including the variance of the S&P 500 index as a common factor. To complement the selection frequencies in Table 1 and to give a sense of the total number of parameters selected per equation, Figure 1 plots the average number of variables selected in the variance as well as the covariance equations over time for the VAR(1) and VAR(20) estimated by the Lasso. The dimensions of the diagonal equations are remarkably stable and sparse, only including between 11 and 13 variables out of a possible 465 for the VAR(1) and 20 to 24 out of 9300 for the VAR(20).

18 18 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE The off-diagonal equations are less sparse and their dimension, drops around two of the periods of high volatility in May 2010 and August Figure 2 displays the percentage of parameters changing classification, that is, the percentage of parameters going from being zero to non-zero or from being nonzero to zero, in two consecutive models. We report results for a VAR(1) and a VAR(20) estimated with the Lasso and for a VAR(1) estimated by Lasso where the data has been transformed using the log-matrix function prior to estimation. For both VAR(1) models the number of changes in the variance (diagonal) equations is around 0.1% while the corresponding number for the VAR(20) is around 0.02%.This leads to 0.5 (VAR(1)) and 2 (VAR(20)) parameters changing from zero to non-zero or vice versa on average. The covariance equations are less stable except when the data has been transformed using the log-matrix function prior to estimation. In the left panels of Figure 3 we investigate how the l 2 forecast error of the one-step ahead forecasts evolves throughout the sample, and in particular how it reacts to the periods of extreme volatility. The forecast errors of the covariances react stronger to the periods of extreme volatility than the forecast errors of the variances. The right panels of Figure 3 shows that the decrease in the model size of the covariance equations is associated with an increase in the penalty parameter λ which determines the amount of shrinkage. Tables sp-3 and sp-5 in the supplementary appendix report the selection frequencies at the weekly and monthly levels of aggregation, respectively. The diagonal pattern for the explanatory variables of the variances disappears almost entirely at the monthly frequency. The dynamics of the variance equations are now governed mainly by the financial sector and the energy sector, which we interpret as implying that these sectors are driving the long run volatility of the Dow Jones index. To shed even further light on the findings in Table 1 we also consider the variable selection pattern for 4 individual stocks in more detail. Figure?? contains the results for Alcoa, IBM, JPM Morgan, and Coca-Cola which belong to the Basic Materials, Technology, Financial and Consumer (non-cyclical) sectors, respectively. For each of

19 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 19 these 4 stocks, we indicate in the corresponding plot whether the lagged variance of the stock indicated on the y-axis is selected in the model estimated at the date given on the x-axis. The variable selection pattern is stable over time, with many lagged variances being either selected or left out throughout the entire sample. A common feature for the four stocks is that a relatively sparse set of variables describes their dynamics and that lags of the stock being modeled are often important Forecasts. In this section we forecast the daily realized covariances using the proposed procedures. Results for the weekly and monthly levels of aggregation can be found in the supplementary material. Let ê T +h denote vech ( ΣT +h Σ T +h ) where Σ T +h denotes the forecast of Σ T +h as of time T at horizon h 1. Thus, ê T +h is the vector of forecast errors at horizon h. Let the times of the first and last forecast be denoted by T 1 and T n, respectively. Then, in order to gauge the precision of the procedures above we consider the following measures. 1 (1) Average l 2 -forecast error (l 2 ): Tn T n T 1 +1 T =T 1 ê T +h. We shall employ this measure separately for the whole matrix, the variances, and the covariances. When it is employed for the whole matrix it corresponds to the average forecast error in the Frobenius norm. (2) Average median absolute forecast error (AMedAFE): 1 T n T 1 +1 Tn T =T 1 median ( ê T +h ). This measure is considered for the whole matrix, the variances, and the covariances. It is more robust to outliers than the l 2 -norm. (3) Average maximal absolute forecast error (AMaxAFE): 1 T n T 1 +1 Tn T =T 1 max ( ê T +h ). This measure is considered for the whole matrix, the variances, and the covariances. As the maximal forecast error plays a crucial role in Theorems 1 and 2 it is of interest to investigate this quantity for different procedures. The principal benchmark against which we gauge the VAR forecasts are no-change forecasts (random walk, Σ T +h = Σ T for h 1.) Table 2 reports the value of the above measures for the no-change forecasts on a gray background. The numbers for all other models are the relative forecast errors compared to the no-change forecast.

20 20 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE Thus, numbers less than one indicate more precise forecasts than the no-change forecasts. Table 2 also reports results for two alternative models for comparison purposes. (1) DCC(1,1) 1-step ahead forecasts for the daily and weekly data, the training sample is too short to estimate the DCC model at the monthly level of aggregation. (2) Exponentially Weighted Moving Average (EWMA) with smoothing parameter λ = We report 1-step ahead EWMA forecasts at every level of aggregation. To assess the effect of filtering the data, we report daily no-change forecasts for the uncensored series relative to the censored series in Table 2. The forecast errors are larger for the uncensored data in particular on the AMaxAFE measure, yet the AMaxAFE is even larger for both DCC and EWMA than for the no-change forecasts on uncensored data. Table 2 also shows that the eigenvalue cleaning procedure used to regularize the VAR forecasts (when not using the log-matrix transformation) has a negligible effect on the accuracy of the forecasts. Variances appear to be harder to forecast than covariances as can be seen from the larger median and maximal average forecast errors for the variance equations. The gains in forecasting accuracy from using VARs estimated by the Lasso are largest for the variance series. We believe this is an important and new finding. Forecasts from VAR(1) models estimated using the Lasso consistently deliver lower maximal errors than the benchmark. The improvements are more marginal in terms of median and l 2 norm of the errors, however small improvements in the precision of the forecasts can result in large gains as the portfolio section will demonstrate. The post-lasso appears to provide the most accurate model for h = 1 of the approaches not based on the log-matrix transform and thus it shall be our main focus. At every forecast horizons the post Lasso has a lower median forecast error for the variances than the no-change forecasts, but it is less precise when it comes

21 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 21 to forecasting the covariances. The forecasts from the VAR(1) based on the logmatrix approach are as precise as the ones from the post-lasso for the non-diagonal elements and superior to these as well as to the no-change forecasts for the covariances. The overall median forecast error should be interpreted with care as it mainly reflects the median forecast error of the off-diagonal terms of the covariance matrix. When considering the maximal forecast error, the VAR(1) estimated by the post-lasso greatly outperforms the no-change forecasts for the diagonal terms. In fact, it is around 20 percent more precise when forecasting 20 days ahead. The no-change forecasts are still more precise when it comes to the forecast precision of the covariances. More importantly, the VAR(1) based on the log-matrix transform is superior to the no-change forecasts for variances as well as covariances as it is much more precise than the post Lasso approach for the covariances. The inclusion of the S&P500 in the model does not appear to consistently improve the forecasts which is in line with the findings of the previous section where the S&P500 was shown not to have a particularly strong explanatory power compared to the other variances. Figure 1 showed that more variables were selected in models for the covariance terms than in those for the variances in a VAR(1). For that reason, in order to better capture the dynamics of the off-diagonal terms, we also experimented with gradually including more lags. Results for VAR(20) models in Table 2 show substantial improvement relative to the benchmark for every model and horizon. Specifically, the increased number of lags renders the post Lasso superior to the no-change forecasts for variances as well as covariances. The forecasts of the covariances improve markedly compared to the VAR(1) model. The VAR(20) on the log-matrix transformed data still delivers more precise forecasts for the covariance terms when considering the maximal forecast errors. These forecasts are also much more precise than the no-change forecasts. The lowest l 2 forecast errors are also found for the VAR(20) based on the log-matrix transform.

22 22 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE Finally, Tables sp-4 and sp-6 in the supplementary appendix confirm that the variances are harder to forecast than the covariances, even at the weekly and monthly levels of aggregation. In conclusion, the forecast precision of the VAR models is in general improved by including more lags hinting at the presence of serial dependence, in particular at the daily level of aggregation. The particular version of the Lasso used to estimate the VAR model does not seem overly important, which makes the procedure rather robust. The log-matrix approach performs well overall. Finally, the greatest gains in forecast precision are obtained at the longest forecast horizons irrespective of the level of aggregation of the data. 6. Portfolio selection In this section we use the forecasted covariance matrices to construct investment portfolios. Our analysis is based on Fleming et al. (2003). We consider a risk-averse investor who uses conditional mean-variance analysis to allocate resources across the n = 30 different firms that compose the Dow Jones index. The investor s utility at each point in time is given by U(r pt ) = ( 1 + r pt ) γ ( ) 2, 1 + rpt (6) 2(1 + γ) where r pt is the portfolio return at time t and γ is the investor s risk aversion coefficient. The larger the value of γ, the more risk averse is the investor. Although we have estimated portfolios for daily, weekly and monthly frequencies we will focus on the daily results. The portfolios are rebalanced in each period as follows. Let r t+1, µ t+1, and Σ t+1 denote, respectively, an n 1 vector of stock returns at time t + 1, the expected return at time t + 1 given data up to time t and the forecasted conditional covariance matrix at time t + 1 based on information up to period t. Σ t+1 is computed by the methods described earlier while µ t+1 is computed by a moving average of 100 days. 6 6 We have experimented with different lenghts of the moving average window and the results did not change significantly.

23 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE 23 The investor s problem at t = t 0,..., T 1 is to select a vector of weights for period t + 1 based solely on information up to time t. The investor chooses the weights that minimize the portfolio volatility, subject to a target expected return and several weight constraints: ŵ t+1 = arg min w t+1 w t+1 Σ t+1 w t+1 s.t. w t+1 µ t+1 = µ target, n w it+1 = 1, i=1 n w it+1 I(w it < 0) 0.30 and w it , i=1 (7) where w t+1 is an n 1 vector of portfolio weights on the stocks, µ target is the target expected rate of return from t to t + 1 and I( ) is an indicator function. The optimal weights ŵ t+1 is a function of µ target, µ t+1, and Σ t+1. We impose two additional restrictions. First, we allow the maximum leverage to be 30% (corresponding to c = 0.6 in Section 3). This yields much more stable portfolios. Second, we restrict the maximum weights on individual stocks to be 20%. This imposes diversification in the portfolio. The target expected return is set to 10% per year. As all the stocks considered are very liquid, we set the transactions cost to be 0.1%. Finally, note that we focus on portfolio optimization based on the one-step ahead forecasts even though the Lasso based techniques actually had their biggest advantages in the long-horizon forecasts. We do so since it is likely that the investor will rebalance the portfolio at the same frequency he has chosen to aggregate the data to. Results. Overall, eight different forecasting models have been considered: five Lasso-based VARs and benchmarks based on the DCC, the EWMA and no-change forecasts. The following statistics are considered. (1) Max weight: max t0 +1 t T max 1 i n (ŵ it ) for t = t 0 +1,..., T and = 1,..., N. This is the maximum weight over all assets and all time periods.

24 24 LARGE REALIZED COVARIANCE MATRICES AND PORTFOLIO CHOICE (2) Min weight: min t0 +1 t T min 1 i n (ŵ it ) for t = t 0 + 1,..., T and = 1,..., N. This is the minimum weight over all assets and all time periods. (3) Proportion of leverage: 1 n(t t 0 ) T t=t 0 +1 n i=1 I(ŵ it < 0). The proportion of leverage is the fraction of negative weights computed for all assets and all time periods. (4) Average turnover: 1 n(t t 0 ) T t=t 0 +1 n i=1 ŵ it ŵit hold, where ŵit hold (1+r = ŵ it 1 ) it 1 1+r pt 1. The turnover measures the average change in the portfolio weights. w hold it, i = 1,..., n, are the weights of the hold portfolio. The hold portfolio at period t + 1 is defined as the portfolio resulted from keeping the stocks from period t. (5) Average return: µ p = 1 (T t 0 ) out-of-sample average portfolio return. T t=t 0 +1 r pt = 1 (T t 0 ) T t=t 0 +1 ŵ tr t. This is the (6) Accumulated return: T t=t 0 +1 (1 + r pt). This is the accumulated portfolio return over the out-of-sample period. (7) Standard deviation: σ p = 1 (T t 0 ) T t=t 0 +1 ( r pt 1 (T t 0 ) T t=t 0 +1 r pt) 2. This is the portfolio return standard deviation over the out-of-sample period. (8) Sharpe ratio: µ p σ p. The larger the Sharpe ratio, the better the portfolio as it delivers higher ratios of return over risk. (9) Average diversification ratio: 1 (T t 0 ) T t=t 0 +1 n i=1 ŵitσ it σ pt, where σ pt = ŵ t Σ t ŵ t. This is the average ratio of weighted standard deviations of the individual stocks over portfolio standard deviations. (10) Economic value: the economic value is the value of such that, for different portfolios p 1 and p 2, we have T t=t 0 +1 U(r p 1 t) = T t=t 0 +1 U(r p 2 t ). represents the maximum return the investor would be willing to sacrifice each period in order to get the performance gains associated with switching to the second portfolio. In this comparison our benchmark, p 1, will always be no-change forecasts. We report the value of as an annualized basis point fee. It

Estimation and Forecasting of Large Realized Covariance Matrices and Portfolio Choice. Laurent A. F. Callot, Anders B. Kock and Marcelo C.

Estimation and Forecasting of Large Realized Covariance Matrices and Portfolio Choice Laurent A. F. Callot, Anders B. Kock and Marcelo C. Medeiros CREATES Research Paper 2014-42 Department of Economics