Risk management with the multivariate generalized hyperbolic distribution, calibrated by the multi-cycle EM algorithm


Risk management with the multivariate generalized hyperbolic distribution, calibrated by the multi-cycle EM algorithm

Marcel Frans Dirk Holtslag
Quantitative Economics, University of Amsterdam

A thesis submitted for the degree of Master of Science in the field of Econometrics
May 18, 2011

1st Reviewer: Dr. S.A. Broda
2nd Reviewer: Prof. Dr. H.P. Boswijk

Abstract

This master thesis presents risk assessment using an underlying DCC(1,1)-MGARCH(1,1) portfolio model whose residuals follow a multivariate generalized hyperbolic distribution, calibrated by the multi-cycle Expectation-Maximization algorithm. The study shows that a significant difference exists between the symmetric and asymmetric distributions of the MGHyp class when modeling unpredictable asset returns. Since EM calibration is relatively slow, the calibration time is successfully reduced by utilizing parallel processing.

Keywords: backtesting, expectation-maximization algorithm (EM), multivariate generalized hyperbolic distribution (MGHyp), conditional Value at Risk, parallel processing, risk assessment, MCECM.

Voilà, c'est tout.

Contents

1 Introduction
2 Dynamic Conditional Correlation - Multivariate GARCH Model
  2.1 Multivariate GARCH subclasses
  2.2 Pros and Cons
  2.3 DCC-MGARCH in depth
3 Multivariate generalized hyperbolic distribution
  3.1 Why we should use a (complex) non-normal distribution
  3.2 Parameterization of the multivariate generalized hyperbolic distribution
  3.3 Unconditional expectation and covariance of the MGHyp
  3.4 Special cases
4 Calibrating the multivariate generalized hyperbolic distribution
  4.1 Expectation-Maximization algorithm
    4.1.1 Generalized Expectation-Maximization framework
    4.1.2 Appealing and problematic EM properties
  4.2 Calibration assumption
  4.3 Defining the E-step
  4.4 Defining the M-step
  4.5 Serious EM optimization problems
5 Risk assessment
  5.1 One day ahead risk forecasting
  5.2 Distribution of the univariate portfolio return
  5.3 Performance of risk forecasting
6 Application
  6.1 Developed GUI
  6.2 Empirical Data
  6.3 Results
  6.4 Calibration time improvement
7 Conclusion
References
A Derivation of the conditional density Normal-Mean-Variance-Mixture
B Derivation of the MGHyp probability distribution function
C Derivation of the conditional GIG distribution
  C.0.1 Step 1
  C.0.2 Step 2
  C.0.3 Conditional density function
D Proof of the closed form expressions γ, µ and Σ
E Proof of the alternative maximization function Q2
F GUI layout

1 Introduction

Risk assessment, and the quantitative understanding of how to minimize the market risk of carrying an asset portfolio, is one of the key research areas in present-day risk management. Gradually, the traditional Gaussian distribution for modeling financial returns has been replaced by several other viable distributions capable of capturing the empirically observed heavy tails, kurtosis and peakedness. Although a vast literature describes all sorts of heavy tailed distributions, the paper of Barndorff-Nielsen (1977) stands out for its flexibility. Barndorff-Nielsen developed a generalized hyperbolic distribution comprising at least 10 different subclasses. These subclasses, such as the Variance-Gamma by Madan and Seneta (1990), the skewed Student-t by Aas and Haff (2006) and the hyperbolic distribution by Eberlein and Keller (1995), all outperformed the previously assumed Gaussian distribution. An extensive overview is given by Paolella (2007). The univariate generalized hyperbolic distribution seems adequate to model the observed heavy tail of a single asset. Recently, empirical studies have tried to model a portfolio with residuals following the multivariate generalized hyperbolic distribution. Pioneers like Protassov (2004), McNeil et al. (2005) and Hu (2005) demonstrate that the observed heavy tail, peakedness and asymmetry are described quite accurately by the MGHyp. However, due to the complexity of parameter estimation, the MGHyp density is calibrated, for instance, by the Expectation-Maximization (EM) algorithm of Dempster et al. (1977).

This process relies on optimizing the expectation of the log-likelihood and, although it is quite accurate, it tends to be relatively slow. While current papers assume that the observed asset returns are described by the MGHyp distribution alone, this thesis differentiates itself by assuming an underlying multivariate asset return model with residuals distributed according to the MGHyp density. A vast literature exists on the proper usage of underlying multivariate volatility models such as the VEC, BEKK, DCC, Stochastic Volatility and latent factor models. While they all seem appropriate, they differ in optimization techniques, in the number of parameters to be estimated, and in simplicity versus applicability. An extensive overview is given by Bauwens et al. (2006). This study contributes to the further development of the effectiveness of the multivariate generalized hyperbolic distribution when it is used to forecast the possible next day portfolio loss through risk management analysis. By reviewing the subclasses Normal Inverse Gaussian (NIG), multivariate hyperbolic (Hyp) and the 10-dimensional multivariate hyperbolic distribution (abbreviated KHyp throughout this thesis) following from the MGHyp class, a recommendation is made about the overall performance. Besides the empirical part, this thesis aims to develop a Matlab routine that accommodates the underlying DCC-MGARCH portfolio model jointly with the MGHyp calibration, and to find a suitable method to reduce the calibration time. As far as is known, such a program has not been developed before. To increase the calibration speed, the program uses parallel processing, which is comparable to the supercomputer principle. The program allows future empirical studies to build on this work without having to write the routine themselves, and it offers the possibility to include even more complicated distributions since the program is able to run in a supercomputer environment. Researching the effectiveness of the multivariate generalized hyperbolic distribution forms the objective of this master thesis.

The remainder of this thesis is organized as follows. Chapter 2 introduces the DCC-MGARCH model as the underlying multivariate model for the portfolio return. Before this model is described in depth, a brief summary is given of the differences between the available multivariate GARCH models. Chapter 3 introduces the multivariate generalized hyperbolic distribution; it describes how to derive the density and briefly mentions the subclasses that can be reached by imposing certain parameter constraints. Chapter 4 discusses the theory behind the Expectation-Maximization algorithm and describes in depth the actual calibration of the MGHyp density function using the EM algorithm, as well as possible optimization problems with this approach. Chapter 5 discusses risk assessment by introducing the conditional Value at Risk with corresponding test statistics. It also demonstrates how the assumption of fixed portfolio weights translates into an easy to use univariate backtesting procedure. Chapter 6 presents the results for the selected subclasses; for each subclass the asymmetric as well as the symmetric case is reviewed at both the 95% and 99% coverage levels. Chapter 7 formulates the conclusion and introduces discussion points. Appendices follow the references.

2 Dynamic Conditional Correlation - Multivariate GARCH Model

Over the years a vast literature has been written to empirically describe the unobserved volatility. The famous paper of Engle (1982) on Autoregressive Conditional Heteroskedasticity has been extended by numerous multivariate GARCH variants and has evolved into rather complicated Stochastic Volatility models. This chapter discusses the main advantages and disadvantages of the different multivariate GARCH subclasses. Based on these trade-offs, the choice for the DCC-MGARCH model is supported. Section 2.1 briefly mentions the different subclasses. The trade-off between the advantages and disadvantages of the different subclasses is discussed in section 2.2, supporting the choice for the DCC model of Engle. Finally, section 2.3 describes the chosen model, DCC(1,1)-MGARCH(1,1), in depth.

2.1 Multivariate GARCH subclasses

The vast literature covering all kinds of multivariate GARCH extensions can be divided into three subclasses according to Bauwens et al. (2006). The first subclass covers generalizations of the univariate GARCH model. The main criterion is that volatility must be described using a simple GARCH equation.

Models such as the VEC by Bollerslev et al. (1988) and BEKK by Engle and Kroner (1995) belong to this category, as does the RiskMetrics model of Morgan (1996). Factor models belong to this category if and only if the factor itself is known. The second subclass covers linear combinations of the univariate GARCH model. Models such as the latent factor model, described by Diebold and Nerlove (1989) and Gourieroux (1997), belong to this class if the factor itself is not known. The multi-factor model with orthogonal factors, O-GARCH, and its generalized version GO-GARCH by van der Weide (2002) are also part of this class. The last subclass covers nonlinear combinations of the univariate GARCH model. In principle, these models specify the empirical volatility through time (in)dependent correlation dynamics. The constant conditional correlation model by Bollerslev (1990) as well as both dynamic conditional correlation models by Engle (2002) and Tse and Tsui (2002) belong to this class. All three subclasses differ in their assumptions and estimation methodology, so that it is unfair to compare the different methods purely on which one is superior to the others.

2.2 Pros and Cons

This section describes the trade-off between the advantages and disadvantages of the denoted subclasses.

1. Generalizations of the univariate GARCH model
The VEC and BEKK models, including their generalized diagonal versions DVEC and DBEKK, require a large number of unknown parameters to be estimated: a VEC(1,1) model with k assets has k(k+1)/2 × (1 + k(k+1)) parameters. A VEC model using 3 asset series simultaneously therefore requires the estimation of 78 parameters, while with 5 assets a shocking 465 unknown parameters arise. In comparison, the DCC model only has 17 unknown parameters when 5 asset series are used simultaneously. Therefore, VEC and BEKK models are mostly implemented only for bivariate empirical studies.

2. Linear combinations of the univariate GARCH model
Latent factor models and stochastic volatility models are better suited to describe empirical volatility. However, due to their complexity, parameter estimation remains difficult: not only must the model parameters be estimated, the latent factor must be defined as well. Furthermore, Gourieroux (1997) mentions that due to the unobserved volatility, the only way to maximize the log-likelihood function is to marginalize the entire function, which turns out to be a time consuming procedure if one considers backtesting.

3. Nonlinear combinations of the univariate GARCH model
While the previous subclasses have their complications, the same can be said about the different types of conditional correlation models. Firstly, the assumption that conditional correlations are constant over the entire time period T may seem unrealistic; empirical findings seem to lack evidence supporting the CCC model. Tse and Tsui (2002) and Engle (2002) propose two different time varying conditional correlation matrices. Adopting the notation of Bauwens et al. (2006), Tse and Tsui define the conditional correlation matrix through a GARCH specification:

    R_t = (1 - \theta_1 - \theta_2)\bar{R} + \theta_1 \Psi_{t-1} + \theta_2 R_{t-1}

with \Psi_{t-1} the correlation matrix of the standardized residuals, while Engle assumes a symmetric transformation matrix Q_t with a GARCH specification:

    R_t = \mathrm{diag}\big(q_{11,t}^{-1/2}, \dots, q_{kk,t}^{-1/2}\big)\, Q_t\, \mathrm{diag}\big(q_{11,t}^{-1/2}, \dots, q_{kk,t}^{-1/2}\big)

    Q_t = (1 - \alpha - \beta)\bar{Q} + \alpha\, \epsilon_{t-1}\epsilon_{t-1}' + \beta\, Q_{t-1}

with ɛ_{t-1} the residuals standardized by the univariate GARCH volatility dynamics. The DCC model of Engle is estimated consistently if the number of observations is large enough. According to Bauwens et al. (2006), the more assets contained in the dataset, the more natural the imposed restrictions on the dynamics. However, the correlation dynamics assumes constant coefficients α and β for the entire correlation process.

Since the possibility exists that the correlation relationship changes over time, Engle and Sheppard (2002) introduce a more flexible correlation dynamics that estimates the coefficients α and β at each time i ∈ {1, ..., T}, implying time dependent coefficients α_i and β_i. This procedure does, however, increase the number of unknown parameters considerably. Comparing the different models and their possible shortcomings, this thesis implements the simplified DCC model of Engle with scalars α and β. It offers the flexibility of estimating k > 2 univariate GARCH models while retaining simplicity. It also reduces computing time, which matters because at some point the EM algorithm needs to calibrate the multivariate generalized hyperbolic distribution, a process known to be quite slow. Introducing factor models would drastically increase the computational time, due to marginalizing the log-likelihood function at every iteration of the EM cycle. Which approach, or which subclass, is best suited for risk management in combination with the multivariate generalized hyperbolic distribution is beyond the scope of this thesis, but it is an interesting topic for future research.

2.3 DCC-MGARCH in depth

As aforementioned, this research considers the DCC model specification stipulated by Engle (1999) and Engle and Sheppard (2002). The precise model definition is described in depth in this section. The proposed multivariate GARCH model, DCC(1,1)-MGARCH(1,1), assumes log return series from k assets that are conditionally multivariate normally distributed with constant mean µ and covariance matrix H_t. The information set I_{t-1} consists of all information known up to time t-1:

    r_t \mid I_{t-1} \sim N(\mu, H_t)                                        (2.1)

and

    H_t = D_t R_t D_t                                                        (2.2)

where D_t is a k × k diagonal matrix of the time varying standard deviations √h_{i,t} from k univariate GARCH models:

    D_t = \mathrm{diag}\big(\sqrt{h_{1,t}}, \sqrt{h_{2,t}}, \dots, \sqrt{h_{k,t}}\big)

Each diagonal element of D_t is given by the GARCH(1,1) specification

    h_{i,t} = \omega_i + \alpha_i (r_{i,t-1} - \mu_i)^2 + \beta_i h_{i,t-1}          (2.3)

for i = 1, ..., k, with the restrictions α_i + β_i < 1 and non-negative variances. The lag length for each univariate GARCH specification is preselected as GARCH(1,1); Laplante et al. (2008), among others, found a lack of evidence supporting a higher order lag length when dealing with financial stock data, especially if the DCC parameter estimates and residuals are used for estimating the risk of a portfolio. The k × k dynamic correlation matrix R_t, with ones on the diagonal, is given by

    R_t = \mathrm{diag}\{Q_t\}^{-1}\, Q_t\, \mathrm{diag}\{Q_t\}^{-1}                 (2.4)

Let Q_t be the transformation matrix of the dynamic correlation matrix, such that

    Q_t = (1 - \alpha - \beta)\bar{Q} + \alpha\, \epsilon_{t-1}\epsilon_{t-1}' + \beta\, Q_{t-1}          (2.5)

and let ɛ_t be the standardized residuals, given by

    \epsilon_t = D_t^{-1}(r_t - \mu)                                          (2.6)

The unconditional covariance matrix \bar{Q} is a k × k matrix estimated from the standardized residuals ɛ_t, and diag{Q_t} is the k × k diagonal matrix of square root elements of Q_t:

    \mathrm{diag}\{Q_t\} = \mathrm{diag}\big(\sqrt{q_{11,t}}, \dots, \sqrt{q_{kk,t}}\big)
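As an illustration of how recursions 2.2–2.6 fit together, the Matlab sketch below filters the conditional covariance H_t for given parameter values. The function name and the organization of the inputs are illustrative assumptions, not the thesis routine itself.

    % Minimal sketch of the DCC(1,1)-MGARCH(1,1) filter of eqs. 2.2-2.6.
    % r: T x k return matrix, mu: 1 x k mean, om/aG/bG: 1 x k GARCH
    % coefficients (eq. 2.3), a/b: scalar DCC coefficients (eq. 2.5).
    function [H, ez] = dcc_filter(r, mu, om, aG, bG, a, b)
    [T, k] = size(r);
    e = r - repmat(mu, T, 1);                 % demeaned returns
    h = repmat(var(e), T, 1);                 % conditional variances, eq. 2.3
    for t = 2:T
        h(t,:) = om + aG .* e(t-1,:).^2 + bG .* h(t-1,:);
    end
    ez   = e ./ sqrt(h);                      % standardized residuals, eq. 2.6
    Qbar = cov(ez);                           % unconditional covariance
    Q    = Qbar;  H = zeros(k, k, T);
    for t = 2:T
        Q = (1 - a - b)*Qbar + a*(ez(t-1,:)'*ez(t-1,:)) + b*Q;   % eq. 2.5
        J = diag(1 ./ sqrt(diag(Q)));
        R = J * Q * J;                        % dynamic correlations, eq. 2.4
        D = diag(sqrt(h(t,:)));
        H(:,:,t) = D * R * D;                 % conditional covariance, eq. 2.2
    end
    end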

Estimating the DCC model could potentially be a fairly complicated and dreadfully time consuming procedure. However, a design feature of the DCC model makes it possible to estimate it as a two step optimization problem using the (quasi) maximum log-likelihood method. Using the notation of Engle, the parameter θ is a vector containing all coefficients of the univariate volatility dynamics 2.3, i.e. θ = (ω_1, ..., ω_k, α_1, ..., α_k, β_1, ..., β_k). The parameter φ corresponds to the coefficients of the correlation dynamics 2.5. First, the optimization maximizes the volatility dynamics, denoted L_v(θ | r_t). Second, using the estimated GARCH parameters and the standardized residual series ɛ_t of each of the k return series, the correlation dynamics, denoted L_c(φ, θ), is optimized. The theoretical notation is given below, followed by the actual application. Hence the first optimization step:

    \hat{\theta} = \arg\max_{\theta}\, \{ L_v(\theta \mid r_t) \}                    (2.7)

As indicated by Engle (1999), as long as the first step parameter estimates are consistent, the second step parameter estimates are consistent as well, assuming reasonable regularity conditions and a continuous function in the neighborhood of the true parameter values. Hence the second step:

    \max_{\phi}\, \{ L_c(\phi, \hat{\theta}) \}                                      (2.8)

This results in consistent but inefficient parameter estimates; a proof is given by Newey and McFadden (1994) using the GMM procedure. To describe the log-likelihood function, the conditional distribution function is needed. Fortunately, normality is assumed and with some straightforward matrix algebra an expression is found; if the innovations are non-Gaussian, the procedure remains valid by QML estimation. Hence:

    f(r \mid I_0) = \prod_{t=1}^{T} f(r_t \mid I_{t-1})
                  = \prod_{t=1}^{T} (2\pi)^{-k/2}\, |H_t|^{-1/2} \exp\Big( -\tfrac{1}{2}(r_t-\mu)' H_t^{-1} (r_t-\mu) \Big)          (2.9)

Using the conditional distribution function and straightforward matrix algebra, the full log-likelihood function is characterized by the following expression:

    L(\theta, \phi) = -\tfrac{1}{2} \sum_{t=1}^{T} \Big[ k \log(2\pi) + \log|H_t| + (r_t-\mu)' H_t^{-1} (r_t-\mu) \Big]

Substituting H_t = D_t R_t D_t and ɛ_t = D_t^{-1}(r_t - µ) yields

    L(\theta, \phi) = -\tfrac{1}{2} \sum_{t=1}^{T} \Big[ k \log(2\pi) + 2\log|D_t| + (r_t-\mu)' D_t^{-2} (r_t-\mu) - \epsilon_t'\epsilon_t + \log|R_t| + \epsilon_t' R_t^{-1} \epsilon_t \Big]          (2.10)

It is noticeable that the resulting full log-likelihood 2.10 can be decomposed into two parts, one of which is recognizable as the log-likelihood of k univariate GARCH models:

    L(\theta, \phi) = L_v(\theta \mid r_t) + L_c(\phi, \theta)                       (2.11)

Let L_v(θ | r_t) be the log-likelihood function of the k univariate GARCH models, which are independent and separately optimized:

    L_v(\theta \mid r_t) = -\tfrac{1}{2} \sum_{t=1}^{T} \Big[ k \log(2\pi) + 2\log|D_t| + (r_t-\mu)' D_t^{-2} (r_t-\mu) \Big]
                         = -\tfrac{1}{2} \sum_{t=1}^{T} \sum_{i=1}^{k} \Big[ \log(2\pi) + \log(h_{i,t}) + \frac{(r_{i,t}-\mu_i)^2}{h_{i,t}} \Big]          (2.12)

Once the volatility component is estimated, the correlation component (second stage) is estimated using the full log-likelihood 2.10, conditioned on the estimated parameters of L_v(θ | r_t). Since constants do not affect the maximization of the log-likelihood function, it is easier (and, for algorithmic purposes, quicker) to exclude them and optimize

    L_c(\phi, \hat{\theta}) = -\tfrac{1}{2} \sum_{t=1}^{T} \Big[ \log|R_t| + \epsilon_t' R_t^{-1} \epsilon_t \Big]          (2.13)

When the second step is optimized, the estimated DCC parameters α and β and the standardized residuals are used to calibrate the multivariate generalized hyperbolic distribution.
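A minimal sketch of the second optimization step 2.8, assuming the standardized residuals ez from the first step are available; the use of fminsearch and the starting values are illustrative choices, not necessarily those of the thesis program.

    % Negative correlation log-likelihood L_c of eq. 2.13 for DCC scalars (a, b).
    function nll = dcc_nll_stage2(par, ez)
    a = par(1); b = par(2);
    [T, ~] = size(ez);
    Qbar = cov(ez);  Q = Qbar;  nll = 0;
    for t = 2:T
        Q = (1 - a - b)*Qbar + a*(ez(t-1,:)'*ez(t-1,:)) + b*Q;
        J = diag(1 ./ sqrt(diag(Q)));
        R = J * Q * J;
        nll = nll + 0.5*(log(det(R)) + ez(t,:) / R * ez(t,:)');  % minus eq. 2.13
    end
    end
    % Second step, eq. 2.8, after the k univariate GARCH fits of eq. 2.7:
    % par = fminsearch(@(p) dcc_nll_stage2(p, ez), [0.05; 0.90]);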

3 Multivariate generalized hyperbolic distribution

Choosing the appropriate density function jointly with an underlying multivariate model is still one of the research focus points in risk management. This chapter presents the theoretical framework for calibrating the multivariate generalized hyperbolic distribution. Section 3.1 discusses the observed density of returns. Section 3.2 presents the parameterization. Section 3.3 briefly expresses the first and second moments, and section 3.4 presents the different subclasses.

3.1 Why we should use a (complex) non-normal distribution

Empirical evidence seems to indicate that the hypothesis of normally distributed financial returns, univariate or multivariate, is rejected in favor of non-normality. Pagan (1996) observed that actual financial returns appear (a) to have semi-heavy tails in the empirical distribution; (b) to be dynamic or time varying; (c) to show different sizes of clustering over time.

The more complicated and flexible the density function, the more control one has over the heavy tail, asymmetry and peakedness, but tractability issues arise and the estimation time frame rapidly increases. The question, raised by Pagan, is whether a highly complicated density function is sufficiently better suited than a simpler one.

  "The models are having to be made increasingly complex so as to capture the nature of the conditional density of returns. ... Ultimately one must pay more attention to whether simple economic models are capable of generating the complex behavior that is evident." — Pagan (1996)

In 1977 a new density class was introduced by Barndorff-Nielsen (1977): the univariate generalized hyperbolic distribution, so named because the logarithm of its density traces a hyperbola (whereas the Gaussian log-density is a parabola). What Barndorff-Nielsen achieved is a density function that enables independent modification of the tail behavior, asymmetry and peakedness. While the original paper concentrates on modeling mass-size distributions of aeolian sand deposits, the independent calibration of the third and fourth moments showed potential for modeling financial returns. After the publication of the paper by Barndorff-Nielsen, multiple successful financial applications were developed using the univariate generalized hyperbolic distribution, for instance by Blaesild (1981), Eberlein and Keller (1995) and in the dissertation of Prause (1999). They all found empirical evidence that this distribution is indeed better suited to handle the observed heavy tail and asymmetric behavior. However, they also remarked on two flaws: the high flexibility means many parameters need to be estimated, and this estimation process can become time consuming. While the multivariate case was fully described by Barndorff-Nielsen (1977), the first computerized algorithm was essentially developed by Blaesild and Sorensen (1992). Until 1992 researchers struggled with the multivariate case due to its increased complexity, instability, time consuming computations and insufficient mathematical packages. The algorithm of Blaesild and Sorensen could handle only two or three assets simultaneously.

The second attempt, by Prause (1999), made it possible to use more than two or three assets simultaneously, but only if the resulting density function is symmetric. The final approach originates with Protassov (2004) and McNeil et al. (2005). Both papers exploit the Normal-Mean-Variance-Mixture theorem, stated as definition 1. It enables the modeling of multiple assets simultaneously for symmetric as well as asymmetric density functions, and it has the capability to estimate all function arguments. Although it seems the best method, some care must be taken to overcome singularity problems of the dispersion matrix.

3.2 Parameterization of the multivariate generalized hyperbolic distribution

This section describes in depth the parameterization of the multivariate generalized hyperbolic distribution defined by Protassov (2004) and McNeil et al. (2005). The model of McNeil et al. is favored over the other MGHyp algorithms because it relies neither on fixed preselected calibration parameters nor on the symmetry constraint. The derivation of the MGHyp density depends on the Normal-Mean-Variance-Mixture and the mixture weight W.

Definition 1. Normal-Mean-Variance-Mixture
The k-dimensional random variable X is said to follow the Normal-Mean-Variance-Mixture if

    X \stackrel{d}{=} \mu + W\gamma + \sqrt{W}\, A Z                          (3.1)

where
1. Z ~ N_k(0, I_k), the multivariate Normal distribution of dimension k;
2. W ≥ 0 is a positive, scalar-valued random variable, drawn randomly from a defined density function, independent of Z;
3. A ∈ R^{k×k} with AA' = Σ of dimension k × k;
4. µ and γ are parameter vectors in R^k.

Definition 2. Mixing weight W
The mixing weight W is specified as the scalar value drawn from a predefined distribution or from a density function f(·) calibrated (by EM).
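Definition 1 doubles as a simulation recipe: conditional on W, the draw is Gaussian. The sketch below assumes the positive mixing draws w are already available (core Matlab has no GIG sampler); the function name is illustrative.

    % Draw n variates from the Normal-Mean-Variance-Mixture of eq. 3.1.
    % mu, gam: k x 1 vectors; Sigma: k x k; w: n x 1 positive mixing draws.
    function X = nmvm_rnd(mu, gam, Sigma, w)
    k = numel(mu); n = numel(w);
    A = chol(Sigma, 'lower');                      % A*A' = Sigma
    Z = randn(k, n);                               % Z ~ N_k(0, I_k)
    X = repmat(mu, 1, n) + gam * w' + (A * Z) .* repmat(sqrt(w'), k, 1);
    end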

The complexity and flexibility of the MGHyp density depend mainly on the mixing weight. As will be discussed during the calibration of the MGHyp in section 4.3, the mixing weight distribution usually has missing observation values, which need to be addressed by a missing value estimation process. The process used here, EM, is carefully explained in section 4.1. Using this approach it is possible to exploit the flexibility of the mixture weight. As stated by Paolella (2007), there are currently ten mixing weights that can be reached. Each of these mixing weights results in a different distribution, and none of them, except one, results in the proper multivariate generalized hyperbolic distribution.

Definition 3. Proper MGHyp
The proper MGHyp is defined as in definition 1, in which the mixing variables are drawn independently from a Generalized Inverse Gaussian distribution GIG(λ, χ, ψ).

The joint density function f(x; λ, χ, ψ, µ, Σ, γ) representing the MGHyp probability distribution function is a closed form expression, continuous if and only if Σ is a non-singular matrix with rank k. The complete derivation is given in appendix B.

    f(x) = \frac{ (\sqrt{\chi\psi})^{-\lambda}\, \psi^{\lambda}\, \bar{\psi}^{\,k/2-\lambda} }{ (2\pi)^{k/2}\, |\Sigma|^{1/2}\, K_{\lambda}(\sqrt{\chi\psi}) } \cdot \frac{ K_{\lambda-k/2}\big(\sqrt{\bar{\chi}\bar{\psi}}\big) }{ \big(\sqrt{\bar{\chi}\bar{\psi}}\big)^{\,k/2-\lambda} } \cdot \exp\big\{ (x-\mu)'\Sigma^{-1}\gamma \big\}

where χ̄ and ψ̄ are defined by

    \bar{\chi} = (x-\mu)'\Sigma^{-1}(x-\mu) + \chi
    \bar{\psi} = \gamma'\Sigma^{-1}\gamma + \psi

All six function arguments (λ, χ, ψ, µ, Σ, γ) are in general unknown and must be estimated by means of an algorithm, and each defines a specific part or shape of the density function. As shown in table 3.1, the MGHyp has multiple calibration arguments, which makes the distribution very flexible: it can model the heavy tail independently from the asymmetry and peakedness. However, the more unknown parameters there are to estimate, the less tractable the distribution becomes. The high flexibility of the MGHyp also makes it possible to switch to different distributions, or subclasses, simply by imposing certain constraints on the calibration parameters (λ, χ, ψ, µ, Σ, γ). Some subclasses are reached using a specific mixture variable, for instance the Hyperbolic and Normal Inverse Gaussian distributions, while other subclasses can only be reached as a limiting distribution, like the asymptotic Laplace or the Normal distribution.
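The closed form above is straightforward to evaluate with Matlab's built-in Bessel function besselk; the sketch below is a direct transcription on the log scale (for numerical stability) and should not be mistaken for the thesis routine.

    % Log-density of the proper MGHyp at the rows of x (n x k), definition 3.
    function ll = mghyp_logpdf(x, lam, chi, psi, mu, Sigma, gam)
    [n, k] = size(x);
    xc   = x - repmat(mu', n, 1);            % centered observations
    Sinv = inv(Sigma);
    Q    = sum((xc * Sinv) .* xc, 2);        % (x-mu)' Sigma^{-1} (x-mu)
    cb   = chi + Q;                          % chi-bar, one value per row
    pb   = psi + gam' * Sinv * gam;          % psi-bar, scalar
    lc   = -(lam/2)*log(chi*psi) + lam*log(psi) + (k/2 - lam)*log(pb) ...
           - (k/2)*log(2*pi) - 0.5*log(det(Sigma)) ...
           - log(besselk(lam, sqrt(chi*psi)));        % log normalizing constant
    ll   = lc + log(besselk(lam - k/2, sqrt(cb*pb))) ...
           - ((k/2 - lam)/2)*log(cb*pb) + xc * Sinv * gam;
    end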

Parameter   Range                      Representation
λ           ℝ                          Shape parameter of the density function
χ           ≥ 0                        Peakedness parameter
ψ           ≥ 0                        Difference between the statistical skewness and kurtosis estimates
µ           ℝ^k                        Location vector
Σ           non-singular, symmetric    Dispersion matrix
γ           ℝ^k                        Skewness vector

Table 3.1: Calibration parameters and ranges of the proper MGHyp density function

The complete derivations of all reachable subclasses are stated and well defined by Paolella (2007). A different but useful advantage of the MGHyp is given by De Finetti (1929), who showed that if the used distribution is infinitely divisible, it provides a relation, or building block, for the Lévy processes. This implies independent increments and stationarity in the continuous time frame.

3.3 Unconditional expectation and covariance of the MGHyp

To conclude the parameterization of the multivariate generalized hyperbolic distribution, expressions for the unconditional mean and covariance matrix are defined. First the conditional mean and covariance matrix are defined; then, using some statistical reordering, expressions for the unconditional mean and covariance matrix are found. Let W be known and drawn from a specified density function and let Σ = AA' be a k × k dispersion matrix, such that the conditional distribution is defined as

    X \mid W \sim N_k(\mu + W\gamma,\; W\Sigma)                              (3.2)

Next, the unconditional expectation of the MGHyp follows from the conditional expectation:

    E(X) = E[E(X \mid W)] = \mu + E(W)\gamma                                 (3.3)

The unconditional covariance matrix is denoted by

    \mathrm{cov}(X) = E[\mathrm{cov}(X \mid W)] + \mathrm{cov}[E(X \mid W)]
                    = E[W\Sigma] + \mathrm{cov}[\mu + W\gamma]
                    = E(W)\Sigma + \mathrm{var}(W)\,\gamma\gamma'            (3.4)

where var(µ) and cov(µ, Wγ) vanish since µ is constant. In general it is false to believe that µ and Σ are the mean and covariance matrix of X; this holds only if the MGHyp probability distribution function is symmetric, which is the case if and only if γ = 0. Setting the optimization function for Σ, as denoted by 4.16, equal to zero proves this assertion.
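For the proper MGHyp the moments of W are available in closed form, since for W ~ GIG(λ, χ, ψ) one has E(W^a) = (χ/ψ)^{a/2} K_{λ+a}(√(χψ)) / K_λ(√(χψ)), a standard GIG result. Expressions 3.3 and 3.4 then evaluate in a few lines (Matlab sketch, variable names illustrative):

    % Unconditional mean and covariance of the MGHyp, eqs. 3.3-3.4.
    s   = sqrt(chi * psi);
    EW  = sqrt(chi/psi) * besselk(lam + 1, s) / besselk(lam, s);   % E(W)
    EW2 = (chi/psi)     * besselk(lam + 2, s) / besselk(lam, s);   % E(W^2)
    EX  = mu + EW * gam;                                 % eq. 3.3
    CX  = EW * Sigma + (EW2 - EW^2) * (gam * gam');      % eq. 3.4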

3.4 Special cases

This section briefly mentions the widely known subclasses that can be reached using the GIG distribution while imposing some constraints.

(a) k-dimensional hyperbolic distribution. Set λ = (k+1)/2, with k the number of assets. As Eberlein and Keller (1995) indicate, only when the number of assets equals one is this distribution a univariate hyperbolic distribution.

(b) One-dimensional hyperbolic distribution. Setting λ = 1 reveals a multivariate distribution whose univariate margins are one-dimensional hyperbolic distributions.

(c) Normal Inverse Gaussian distribution. Setting λ = −1/2 reveals the NIG distribution.

(d) Variance Gamma distribution. Let λ > 0 and χ = 0, i.e. take the limiting distribution as χ → 0. This is better known as the generalized Laplace or Variance Gamma distribution.

(e) Skew Student-t distribution. Let λ = −υ/2, χ = υ and ψ = 0, so that the MGHyp translates into the asymmetric or skewed-t distribution. Implementing this distribution requires knowledge of υ, but in empirical studies the true value is unknown. To overcome this problem, the estimation process needs to be changed slightly: instead of fixing λ as a preselected constant (the reason why λ is normally fixed is explained in section 4.2), it is wise to take the limiting distribution as ψ → 0 and leave λ unknown. Although this case is not yet commonly used, Protassov (2004) foresees promising results as a risk management candidate, due to the few parameters to be estimated and the fastest observed calibration speed.

4 Calibrating the multivariate generalized hyperbolic distribution

The estimation of the missing values arising from the mixing weight, and the calibration of the multivariate generalized hyperbolic distribution, are based on the Expectation-Maximization (EM) algorithm of Dempster et al. (1977). This chapter discusses the EM algorithm in depth and shows how to apply the theorem. Section 4.1 discusses the theory behind the Expectation-Maximization algorithm. Section 4.2 states important assumptions. The E-step is explained in section 4.3 and the M-step in section 4.4. Section 4.5 concludes with optimization problems.

4.1 Expectation-Maximization algorithm

As discussed in section 3.2, the mixing weight, distributed according to the Generalized Inverse Gaussian distribution, has unknown or missing observations. Unless a feasible estimation process for the missing values is implemented, the MGHyp is difficult to calibrate. One method to estimate the missing values and calibrate the MGHyp is the EM algorithm. The earliest reference in the spirit of the EM algorithm dates back to 1886, when Newcomb (1886) suggested an iterative reweighting scheme to estimate two univariate normal distributions simultaneously.

Since that moment significant progress has been made, especially after the Second World War with its infusion of technology, computers and relatively fast algorithms. A good reference for a detailed historical profile is given by McLachlan and Peel (2000) and McLachlan and Krishnan (2008). Although multiple theorems, propositions and algorithms were developed during the fifties, sixties and seventies to tackle the missing value problem of incomplete observation data, most propositions relied on different assumptions, making it difficult to understand how to implement the theorem and algorithm and what result to expect. It is the seminal paper of Dempster et al. (1977) that captures and generalizes the ideas of the earlier papers such that the theoretical part is understandable and relatively easy to implement. They called it the Expectation-Maximization algorithm. The basic idea of the Expectation-Maximization algorithm, henceforth EM, is to compute log-likelihood estimates by an iterative procedure when some of the observations are unavailable or incomplete. A complementary feature of this algorithm is that it can actually simplify the log-likelihood estimation if the function has a complex parameterization. The exact procedure is explained in detail in the next section.

4.1.1 Generalized Expectation-Maximization framework

The aim of the EM algorithm is to maximize the conditional expectation of the full model log-likelihood function, such that consistent parameters can be estimated even if the dataset is incomplete. Each iteration of the EM algorithm consists of two steps, called the Expectation step and the Maximization step.

Definition 4. Characterization of the Generalized EM algorithm (Dempster et al., 1977)
Let the observed data be given by y and let the corresponding x not be observed directly but only indirectly through y. Define a family of sampling densities f(x | φ) depending on parameters φ, suppose that φ^{(p)} denotes the current value after p cycles of the algorithm, and assume an initial guess φ^{(0)}. The next cycle can be described in two steps.

E-step: Estimate the complete-data sufficient statistics by evaluating

    Q(\phi \mid \phi^{(p)}) = E\big[ \log f(x \mid \phi) \;\big|\; y, \phi^{(p)} \big]          (4.1)

M-step: Choose φ^{(p+1)} as the value that maximizes Q(φ | φ^{(p)}):

    \phi^{(p+1)} = \arg\max_{\phi}\, Q(\phi \mid \phi^{(p)})                         (4.2)

The algorithm continues until

    L(\phi^{(p+1)}) - L(\phi^{(p)}) < \alpha                                         (4.3)

where the arbitrary value α represents the minimal smallest difference and L represents the log-likelihood function.
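Definition 4 together with stopping rule 4.3 amounts to the generic loop sketched below; estep, mstep and loglik are placeholders for the problem-specific expressions derived in sections 4.3 and 4.4, not functions from the thesis program.

    % Generic EM loop of definition 4; estep/mstep/loglik are function
    % handles supplied by the specific model, phi0 the initial guess.
    function phi = em_loop(estep, mstep, loglik, phi0, alpha, maxit)
    phi = phi0;  L = loglik(phi);
    for p = 1:maxit
        stats = estep(phi);                  % eq. 4.1: conditional expectations
        phi   = mstep(stats);                % eq. 4.2: maximize Q
        Lnew  = loglik(phi);
        if Lnew - L < alpha, break; end      % eq. 4.3: stopping rule
        L = Lnew;
    end
    end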

The EM algorithm has a favorable property, namely the guarantee of a positive log-likelihood increment after each iterative cycle. Not only does this feature allow easy convergence monitoring, it also limits possible undesirable outcomes. Beale (one of the discussants of Dempster et al. (1977)) notes that at least one limit point exists if and only if each of the parameters in the set φ is bounded; this limit must be either a maximum or a stationary value. Even when it is known to be a maximum, it remains unclear whether the limit found is a global or a local maximum. The actual proof of the positive log-likelihood increment is quite straightforward and carefully explained by Dempster et al. (1977). It entails recognizing that the log-likelihood of the E-step with estimated latent variables is always larger than or equal to the E-step with full parameterization; the positive increment then follows by simple reordering and Jensen's inequality.

4.1.2 Appealing and problematic EM properties

The EM algorithm has several appealing properties:
(a) numerical stability (S.J. Haberman, in the discussion of Dempster et al. (1977));
(b) a global convergence point under general assumptions;
(c) complicated and time consuming inverse matrix calculations are unnecessary;
(d) a positive log-likelihood increment after each cycle of the iteration process, which allows easy convergence monitoring.

The algorithm does come with flaws and problems. The EM algorithm
(a) is in general a slow calibration algorithm; a dataset with a fair number of observations and a few assets can easily take 5 to 10 minutes to calibrate, depending on the computer hardware and software used;
(b) does not calculate the standard errors of the estimated parameters;
(c) cannot ensure convergence if the log-likelihood specification has multiple global or local minima;
(d) cannot ensure that the optimum found is indeed the global maximum — a simple remedy is to redo the estimation with different starting values φ^{(0)};
(e) could potentially crash or fail to find an optimal solution if the starting values are chosen incorrectly; Boyle (1983) presents an example of a generalized EM sequence that converges to the unit circle rather than to a single point;
(f) is not always applicable if the log-likelihood function is not continuous.

4.2 Calibration assumption

For the continuation of this thesis, the parameter λ is presumed constant and preselected before the remaining MGHyp parameters are estimated. Firstly, this research focuses on which hyperbolic subclass of the MGHyp distribution is statistically better suited for the Value at Risk calculation; whether an estimated λ is more suitable is a research topic beyond the scope of this thesis. Secondly, introducing λ as an unknown parameter results in a very slow estimation procedure, since each M-step then needs to numerically optimize the derivative of the Bessel function over the index λ.

To keep this EM framework comparable with previous versions, some parameter notations of Protassov (2004), McNeil et al. (2005), Liu and Rubin (1995) and Hu (2005) are also adopted in this thesis.

4.3 Defining the E-step

Let (p), strictly non-negative, denote the current EM cycle and let Θ^{(p)} denote the collection of parameters at cycle (p), such that the E-step is defined as

    Q(\Theta \mid \Theta^{(p)}) = E\big[ \log f(x_{\mathrm{complete}} \mid \Theta) \;\big|\; x_{\mathrm{observed}}, \Theta^{(p)} \big]          (4.4)

Unfortunately, the complete data specification depends not only on the observations x but also on the missing variables w. Estimating the joint density f(x, w) is therefore quite difficult in its present form, but if it were somehow known that f(w | Θ) has been realized, this knowledge provides information on whether f(x | w; Θ) has also been realized. Assume f(w | Θ) > 0, such that

    f(x_{\mathrm{complete}} \mid \Theta) = f(x, w \mid \Theta) = f(x \mid w; \Theta)\, f(w \mid \Theta) = \prod_{i=1}^{T} f(x_i \mid w_i; \Theta)\, f(w_i \mid \Theta)

Let x_i be a vector of dimension k containing the standardized residuals of the DCC(1,1)-MGARCH(1,1) model for k assets at time i, where i ∈ {1, ..., T}, and assume that all observation vectors are collected in (x_1, ..., x_i, ..., x_T). Let the latent variables w = (w_1, ..., w_i, ..., w_T) follow the Generalized Inverse Gaussian distribution GIG(λ, χ, ψ) given by Barndorff-Nielsen (1977), in which the parameters χ and ψ are unknown. Accordingly, the density of w_i is given by

    f(w_i; \lambda, \chi, \psi) = \frac{ \chi^{-\lambda} (\sqrt{\chi\psi})^{\lambda} }{ 2 K_{\lambda}(\sqrt{\chi\psi}) }\; w_i^{\lambda-1} \exp\Big( -\tfrac{1}{2}\big( \chi w_i^{-1} + \psi w_i \big) \Big)          (4.5)

with K_λ(√(χψ)) the modified Bessel function of the third kind with index λ.
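On log scale, density 4.5 is a short transcription with Matlab's built-in besselk; a sketch:

    % Log-density of GIG(lam, chi, psi) at the positive values in w.
    function ll = gig_logpdf(w, lam, chi, psi)
    lc = -lam*log(chi) + (lam/2)*log(chi*psi) ...
         - log(2*besselk(lam, sqrt(chi*psi)));       % log normalizing constant
    ll = lc + (lam - 1).*log(w) - 0.5*(chi ./ w + psi .* w);
    end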

A good reference concerning the Bessel family, its applications and derivations, is the Digital Library of Mathematical Functions (Olver and Maximon, 2010). The integral representation is

    K_{\lambda}\big(\sqrt{\chi\psi}\big) = \tfrac{1}{2} \int_0^{\infty} t^{\lambda-1} \exp\Big( -\tfrac{\sqrt{\chi\psi}}{2}\big(t + t^{-1}\big) \Big)\, dt          (4.6)

The next step entails finding an expression for the conditional density f(x_i | w_i, Θ). The derivation is quite straightforward if one begins with the assumption of a Normal-Mean-Variance-Mixture, imposing definition 1. Furthermore, let µ and γ be vectors of dimension k and let Σ be a symmetric k × k matrix. The exact derivation is given in appendix A; the final result is

    f(x_i \mid w_i, \mu, \Sigma, \gamma) = \frac{1}{(2\pi)^{k/2}\, |\Sigma|^{1/2}\, w_i^{k/2}}\; e^{(x_i-\mu)'\Sigma^{-1}\gamma}\; e^{ -\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2 w_i} }\; e^{ -\frac{w_i}{2}\gamma'\Sigma^{-1}\gamma }          (4.7)

At this point both probability distribution functions are fully parameterized, so that the complete parameter space Θ is defined by the arguments λ, µ, Σ, χ, ψ and γ. This also implies that the quasi log-likelihood function of 4.4 is fully parameterized as

    L(\Theta; x, w) = \sum_{i=1}^{T} \underbrace{ \log f(x_i \mid w_i, \mu, \Sigma, \gamma) }_{L_1(x_i, w_i, \mu, \Sigma, \gamma)} + \sum_{i=1}^{T} \underbrace{ \log f(w_i; \lambda, \chi, \psi) }_{L_2(w_i, \lambda, \chi, \psi)}          (4.8)

It follows that the full quasi log-likelihood function is optimized separately through L_1(·) and L_2(·); neither of the two parts depends on unknown function arguments that show up in both parts simultaneously. Substituting the density functions 4.5 and 4.7 into the quasi log-likelihood 4.8, and substituting that expression into the E-step 4.4, yields

    Q(\Theta \mid \Theta^{(p)}) = Q_1(x_i, \mu, \Sigma, \gamma) + Q_2(\lambda, \chi, \psi)          (4.9)

    Q_1(\cdot) = -\tfrac{T}{2}\log|\Sigma| - \tfrac{k}{2}\sum_{i=1}^{T} E\big[\log w_i \mid x_i, \Theta^{(p)}\big] + \sum_{i=1}^{T} (x_i-\mu)'\Sigma^{-1}\gamma
                 - \tfrac{1}{2}\sum_{i=1}^{T} E\big[w_i^{-1} \mid x_i, \Theta^{(p)}\big]\, (x_i-\mu)'\Sigma^{-1}(x_i-\mu)
                 - \tfrac{\gamma'\Sigma^{-1}\gamma}{2}\sum_{i=1}^{T} E\big[w_i \mid x_i, \Theta^{(p)}\big] - \tfrac{Tk}{2}\log(2\pi)

    Q_2(\cdot) = (\lambda-1)\sum_{i=1}^{T} E\big[\log w_i \mid x_i, \Theta^{(p)}\big] - \tfrac{\chi}{2}\sum_{i=1}^{T} E\big[w_i^{-1} \mid x_i, \Theta^{(p)}\big]
                 - \tfrac{\psi}{2}\sum_{i=1}^{T} E\big[w_i \mid x_i, \Theta^{(p)}\big] - \tfrac{\lambda T}{2}\log\chi + \tfrac{\lambda T}{2}\log\psi - T\log\big[ 2 K_{\lambda}(\sqrt{\chi\psi}) \big]

It can be shown that all three conditional expectations appearing in Q_1(·) and Q_2(·) follow from the moments of the Generalized Inverse Gaussian distribution. Essentially, the proof given in appendix C applies Bayes' rule to the conditional density f(w_i | x_i, Θ) appearing in the integrand of the continuous conditional expectation:

    E[w_i \mid x_i, \Theta] = \int_0^{\infty} w_i\, f(w_i \mid x_i, \Theta)\, dw_i = \frac{ \int_0^{\infty} w_i\, f(x_i \mid w_i, \Theta)\, f(w_i, \Theta)\, dw_i }{ f(x_i, \Theta) }

Furthermore, let f(x_i, Θ) be the MGHyp probability distribution function, as proven in appendix B, and let

    \bar{\chi}^{(p)}_i = \chi^{(p)} + \big(x_i - \mu^{(p)}\big)' \big(\Sigma^{(p)}\big)^{-1} \big(x_i - \mu^{(p)}\big)
    \bar{\psi}^{(p)} = \psi^{(p)} + \big(\gamma^{(p)}\big)' \big(\Sigma^{(p)}\big)^{-1} \gamma^{(p)}

be given, such that, using the standard notation proposed by Protassov (2004) and McNeil et al. (2005), the conditional expectations are denoted by

    \delta^{(p)}_i = E\big[w_i^{-1} \mid x_i, \Theta^{(p)}\big] = \Big( \tfrac{\bar{\chi}^{(p)}_i}{\bar{\psi}^{(p)}} \Big)^{-1/2} \frac{ K_{\lambda-\frac{k}{2}-1}\big(\sqrt{\bar{\chi}^{(p)}_i \bar{\psi}^{(p)}}\big) }{ K_{\lambda-\frac{k}{2}}\big(\sqrt{\bar{\chi}^{(p)}_i \bar{\psi}^{(p)}}\big) }          (4.10)

    \eta^{(p)}_i = E\big[w_i \mid x_i, \Theta^{(p)}\big] = \Big( \tfrac{\bar{\chi}^{(p)}_i}{\bar{\psi}^{(p)}} \Big)^{1/2} \frac{ K_{\lambda-\frac{k}{2}+1}\big(\sqrt{\bar{\chi}^{(p)}_i \bar{\psi}^{(p)}}\big) }{ K_{\lambda-\frac{k}{2}}\big(\sqrt{\bar{\chi}^{(p)}_i \bar{\psi}^{(p)}}\big) }          (4.11)

    \xi^{(p)}_i = E\big[\log w_i \mid x_i, \Theta^{(p)}\big] = \tfrac{1}{2}\log\Big( \tfrac{\bar{\chi}^{(p)}_i}{\bar{\psi}^{(p)}} \Big) + \frac{ \frac{\partial}{\partial\alpha} K_{\lambda-\frac{k}{2}+\alpha}\big(\sqrt{\bar{\chi}^{(p)}_i \bar{\psi}^{(p)}}\big)\big|_{\alpha=0} }{ K_{\lambda-\frac{k}{2}}\big(\sqrt{\bar{\chi}^{(p)}_i \bar{\psi}^{(p)}}\big) }          (4.12)
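The weights 4.10–4.12 are Bessel function ratios and are cheap to evaluate. In the sketch below, cb and pb hold χ̄_i and ψ̄; the central finite difference for the derivative of K over its index is one possible choice for the numerical approximation the text calls for, not necessarily the thesis's.

    % E-step weights of eqs. 4.10-4.12 for one observation.
    nu    = lam - k/2;
    s     = sqrt(cb * pb);                               % sqrt(chibar*psibar)
    delta = sqrt(pb/cb) * besselk(nu - 1, s) / besselk(nu, s);   % eq. 4.10
    eta   = sqrt(cb/pb) * besselk(nu + 1, s) / besselk(nu, s);   % eq. 4.11
    h     = 1e-5;                                        % finite-difference step
    dK    = (besselk(nu + h, s) - besselk(nu - h, s)) / (2*h);   % dK/d(index)
    xi    = 0.5*log(cb/pb) + dK / besselk(nu, s);        % eq. 4.12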

The derivative of the modified Bessel function in 4.12 is taken over the index instead of over the function argument; in order to evaluate ξ^{(p)}, its value has to be approximated by a numerical method. At this point all conditional expectations are given in closed form for each cycle (p). This implies that the maximization function, defined by the quasi log-likelihood 4.9, is able to estimate the unknown parameters Θ^{(p)}.

4.4 Defining the M-step

The updated parameters Θ^{(p+1)} are found by substituting the conditional expectations 4.10, 4.11 and 4.12 into 4.9, such that Q(Θ | Θ^{(p)}) is maximized separately through Q_1(x_i, µ^{(p)}, Σ^{(p)}, γ^{(p)}) and Q_2(χ^{(p)}, ψ^{(p)}, λ):

    \Theta^{(p+1)} = \arg\max_{\Theta}\, Q(\Theta \mid \Theta^{(p)})                 (4.13)

Optimizing Q_1(·) follows the usual approach: set the derivatives of Q_1(·) with respect to µ^{(p)}, γ^{(p)} and Σ^{(p)} equal to zero and solve the system of unknowns. The proof of this derivation is given in appendix D; the final results are

    \gamma^{(p+1)} = \frac{ T^{-1} \sum_{i=1}^{T} \delta^{(p)}_i (\bar{x} - x_i) }{ \bar{\delta}^{(p)} \bar{\eta}^{(p)} - 1 }          (4.14)

    \mu^{(p+1)} = \frac{ T^{-1} \sum_{i=1}^{T} x_i \delta^{(p)}_i - \gamma^{(p+1)} }{ \bar{\delta}^{(p)} }          (4.15)

    \Sigma^{(p+1)} = \frac{1}{T} \sum_{i=1}^{T} \delta^{(p)}_i \big(x_i - \mu^{(p+1)}\big)\big(x_i - \mu^{(p+1)}\big)' - \bar{\eta}^{(p)}\, \gamma^{(p+1)} \big(\gamma^{(p+1)}\big)'          (4.16)

with \bar{\delta}^{(p)} = T^{-1}\sum_{i=1}^{T}\delta^{(p)}_i, \bar{\eta}^{(p)} = T^{-1}\sum_{i=1}^{T}\eta^{(p)}_i and \bar{\xi}^{(p)} = T^{-1}\sum_{i=1}^{T}\xi^{(p)}_i.
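With the weights collected in T × 1 vectors delta and eta and x the T × k residual matrix, the closed forms 4.14–4.16 take only a few matrix operations; a sketch under those assumptions:

    % M-step closed forms, eqs. 4.14-4.16.
    [T, k] = size(x);
    db  = mean(delta);  eb = mean(eta);                  % delta-bar, eta-bar
    xb  = mean(x, 1)';                                   % sample mean, k x 1
    sdx = x' * delta / T;                                % (1/T) sum_i delta_i x_i
    gam = (db*xb - sdx) / (db*eb - 1);                   % eq. 4.14
    mu  = (sdx - gam) / db;                              % eq. 4.15
    xc  = x - repmat(mu', T, 1);
    Sig = (xc' * (xc .* repmat(delta, 1, k))) / T ...
          - eb * (gam * gam');                           % eq. 4.16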

Maximizing the quasi log-likelihood function Q_2(·) with respect to χ and ψ is performed by a numerical maximization method. Although this step is easy to implement, a precaution is necessary: if one optimizes only Q_2(·) instead of the complete log-likelihood, the weights δ^{(p)}_i, η^{(p)}_i and ξ^{(p)}_i have to be recalculated using the updated parameters µ, Σ and γ of the current iteration (p), as indicated by McNeil et al. (2005). This results in the so-called multi-cycle expectation conditional maximization (MCECM) algorithm.

    \max_{\chi,\psi} Q_2(\lambda, \chi, \psi) = \max_{\chi,\psi}\; (\lambda-1)\sum_{i=1}^{T}\xi_i - \frac{\chi}{2}\sum_{i=1}^{T}\delta_i - \frac{\psi}{2}\sum_{i=1}^{T}\eta_i - \frac{\lambda T}{2}\log(\chi) + \frac{\lambda T}{2}\log(\psi) - T\log\big[ 2 K_{\lambda}(\sqrt{\chi\psi}) \big]

For each iteration cycle (p) it is now possible to evaluate µ^{(p+1)}, γ^{(p+1)}, Σ^{(p+1)}, χ^{(p+1)} and ψ^{(p+1)} until the quasi log-likelihood difference between two cycles is small enough. These five parameters are then used as calibration arguments for the multivariate generalized hyperbolic distribution MGHyp(λ, χ, ψ, µ, Σ, γ).

4.5 Serious EM optimization problems

Two serious problems arise when the EM algorithm is implemented for multiple financial assets. Firstly, an identification problem comes to light for Σ due to near singularity. Fortunately, there are several solutions to overcome this problem, though not every solution is appropriate. This thesis uses the proposal of McNeil et al. (2005), for the reasons described below.

1. Preselecting the determinant of Σ as 1
This approach not only reduces the computing time of the EM algorithm; Hu (2005) notes that it also reduces the instability of the algorithm. The approach is feasible since Σ is estimated from standardized residuals, so that for a large sample the determinant should statistically equal one. However, this cannot always be assured for a small sample size. Today's software packages and computers are considerably faster than five years ago, so a strategy that actually estimates Σ is preferred over preselecting its determinant.

2. Scaling the Σ matrix
This procedure has been proposed by McNeil et al. (2005). To address the problem efficiently, scale the matrix Σ using the determinant of the sample covariance matrix:

    \Sigma^{(p+1)} = \frac{ \det\big(\mathrm{cov}(x)\big)^{1/k} }{ \det\big(\Sigma^{(p+1)}\big)^{1/k} }\; \Sigma^{(p+1)}          (4.17)

The advantage of this method is a dispersion matrix Σ with the same desired properties as under a preselected determinant, while it does not seriously increase computing time. The downside is, firstly, that it is fairly easy to forget this step in the actual algorithm program. Secondly, Hu (2005) indicates that if λ is quite large (roughly |λ| ≥ 100) the algorithm can become unstable and produce unreliable results.

3. Preselecting only χ
Protassov (2004) simply preselected χ as 1 and kept it constant during the EM algorithm. This simplifies the algorithm and reduces computing time and possible errors. However, Hu found stability complications if λ is small (roughly λ ∈ [−1, 1]), and it is not clear why Protassov chose χ to be 1.

Secondly, a different optimization problem can occur if one starts with the proposal of Protassov (2004) and Hu (2005) to maximize Q_2(·) through closed form expressions, comparable to the maximization of Q_1(·). The derivatives with respect to χ and ψ both still contain the modified Bessel function, which depends on both parameters simultaneously; it is therefore impossible to find two closed form expressions for updating both parameters. A simple workaround resolves this by introducing, after taking both derivatives, the notation ϑ = √(χψ). It requires some algebraic manipulation, substituting the first order condition for ψ into that for χ; eventually one first order condition remains that depends only on ϑ. Appendix E proves the derivation; the final result is

    \bar{\delta}^{(p)} \bar{\eta}^{(p)}\, \vartheta\, K_{\lambda}^2(\vartheta) + 2\lambda\, K_{\lambda+1}(\vartheta) K_{\lambda}(\vartheta) - \vartheta\, K_{\lambda+1}^2(\vartheta) = 0          (4.18)

Solving 4.18 for ϑ is not as easy as it seems. During numerous simulation runs in this research it was discovered that numerically solving 4.18 results in an unstable program with an imaginary valued ϑ, indicating imaginary calibration parameters for the mixing weight distribution. A simple workaround would be to take only the real part of the imaginary ϑ. Doing so, however, results in calibration parameters χ and ψ that are scalars of the order of 700 or larger: this sets the Bessel function equal to zero without actually solving for ϑ, and therefore the parameter estimates are incorrect. Evidently, if one constrains one or both calibration parameters, the method works better, but that resolves neither the instability nor the problem of how to preselect the variables.
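The scaling workaround 4.17 adopted in this thesis reduces to two lines at the end of each M-step; a sketch, with x holding the standardized residuals:

    % Rescale the dispersion matrix, eq. 4.17.
    c     = det(cov(x))^(1/k) / det(Sigma)^(1/k);
    Sigma = c * Sigma;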

5 Risk assessment

Ultimately, a trader wants to know the risk exposure of carrying a portfolio. This chapter focuses on how to forecast the potential loss for the next trading day and on the accuracy of these estimates. Section 5.1 formulates the risk measure. Section 5.2 describes in depth how to transform the MGHyp density function to the univariate case. Section 5.3 focuses on the accuracy of the forecast portfolio returns.

5.1 One day ahead risk forecasting

Since VaR is a statistical approach that depends on multiple factors, such as the rolling window, the possible usage of portfolio weights and the coverage level, no optimal value exists that truly represents the forecast potential loss. It is therefore important to elaborate the fixed parameter choices.

1. The forecast time horizon is set at one day. Typically, traded assets are considered liquid, so that financial institutions are not forced to hold a loss making position.

2. This thesis uses two nominal coverage levels: the 95% level that banks and investment firms usually use, and the 99% level indicated by the Basel Committee.

3. The rolling window used to train the DCC-MGARCH model as well as the MGHyp density function is set to the previous 500 observations.

A rolling window of the previous 1000 observations was also tested, but no significant differences from the 500-observation window were noticeable; it only doubled the calibration time.

4. Conditional VaR (CVaR) is selected as the risk forecaster. Little empirical evidence justifies the usage of VaR for financial-economic research, as noted in the paper of Leippold (2004), while CVaR is a coherent risk measure.

5. The portfolio weights, defined as equally weighted, remain constant over the entire holding period of the portfolio.

The one day ahead forecast portfolio return is estimated by the underlying DCC-MGARCH portfolio model with residuals following the MGHyp distribution, calibrated on the previous 500 observations. Let x be the vector of asset weights, summing to one, and let H_{t+1} be the forecast DCC-MGARCH covariance matrix, such that the forecast portfolio return is denoted by

    x' r_{t+1} = x'\mu + x' H_{t+1}^{1/2} \epsilon          (5.1)

with ɛ following the MGHyp distribution. The forecast one day ahead potential loss of the weighted portfolio is estimated by the CVaR approach defined in the papers of Artzner et al. (1997) and Artzner et al. (1999). Besides CVaR being a coherent risk measure, it is likely that CVaR can estimate the risk adequately at the given nominal coverage level 1 − β_0, with β_0 the nominal significance level, under the assumption that the underlying model is correctly specified. Utilizing the translation invariance property and following the paper of Hellmich and Kassberger (2009), the CVaR is denoted as:

    CVaR^{\beta_0}_{t+1|t}(x' r_{t+1}) = CVaR^{\beta_0}_{t+1|t}\big( x' H_{t+1}^{1/2}\epsilon \big) + x'\mu
                                       = E\Big[ x' H_{t+1}^{1/2}\epsilon \;\Big|\; x' H_{t+1}^{1/2}\epsilon \ge VaR^{1-\beta_0}_{t+1|t}\big( x' H_{t+1}^{1/2}\epsilon \big) \Big] + x'\mu
                                       = \frac{1}{\beta_0} \int_{VaR^{1-\beta_0}_{t+1|t}}^{\infty} y\, f_{GHyp}(y;\Theta)\, dy + x'\mu          (5.2)

The estimation of the CVaR (5.2) depends on the univariate density function f_{GHyp}(y; λ, χ, ψ, x'H_{t+1}^{1/2}µ, x'H_{t+1}^{1/2}ΣH_{t+1}^{1/2}x, x'H_{t+1}^{1/2}γ). Although the procedure for transforming the MGHyp to the univariate case is quite straightforward using the linear transformation property, it is explained in depth in section 5.2. Since the translation property also holds for VaR, the VaR forecast is estimated as

    VaR^{1-\beta_0}_{t+1|t}(x' r_{t+1}) = VaR^{1-\beta_0}_{t+1|t}\big( x' H_{t+1}^{1/2}\epsilon \big) + x'\mu = F^{-1}_{GHyp}(1-\beta_0) + x'\mu          (5.3)

where the quantile function F^{-1}_{GHyp} is evaluated by standard numerical root finding methods.
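Once the univariate density of section 5.2 is available, 5.2 and 5.3 reduce to one-dimensional numerics: root finding for the VaR quantile and numerical integration for the tail expectation. A sketch in which ghyp_pdf is an assumed (hypothetical) function evaluating the univariate GHyp density, wp and mu_r are illustrative names for the weight and mean vectors, and the integration bounds are illustrative:

    % VaR and CVaR from the univariate GHyp density, eqs. 5.2-5.3.
    pdf  = @(y) ghyp_pdf(y, theta);             % assumed univariate density
    cdf  = @(q) integral(pdf, -40, q);          % numerical CDF on a wide support
    q0   = fzero(@(q) cdf(q) - (1 - b0), 0);    % quantile F^{-1}(1-b0)
    VaR  = q0 + wp' * mu_r;                     % eq. 5.3, with the x'mu shift
    CVaR = integral(@(y) y .* pdf(y), q0, 40) / b0 + wp' * mu_r;   % eq. 5.2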

5.2 Distribution of the univariate portfolio return

The aim of this section is to develop a density function with argument x'H_{t+1}^{1/2}ɛ. The demonstration follows the paper of Hellmich and Kassberger (2009), who used this approach for the simplified portfolio model r_t = a_t with a_t following the MGHyp distribution. Basically, it is a simple matter of applying familiar properties of the linear transformation theorem. Consider a multivariate linear function BX + b, assume that the intercept vector b = 0 and that the matrix B is actually a weighting vector x ∈ R^k whose weights sum to one. Let X be the MGHyp of definition 3, such that the linear transformation yields

    x'X + 0 = x'\mu + W x'\gamma + \sqrt{W}\, x' A Z \sim f_{GHyp}(\lambda, \chi, \psi, x'\mu, x'\Sigma x, x'\gamma)

After the linear transformation, two advantages result. Firstly, the multivariate GHyp has been transformed to the univariate case; this removes the complexity of calculating the multivariate cumulative distribution and enhances the computing speed of the algorithm without losing accuracy. Secondly, the transformation does not affect the choice of the mixing variable W, indicating that all ten different subclasses can still be reached after the linear transformation. At this point the density function of x'H_{t+1}^{1/2}X is easily found by replacing x' with x'H_{t+1}^{1/2} and realizing the similarity of ɛ and X:

    x' H_{t+1}^{1/2} X \sim f_{GHyp}\big( \lambda, \chi, \psi, x'H_{t+1}^{1/2}\mu, x'H_{t+1}^{1/2}\Sigma H_{t+1}^{1/2}x, x'H_{t+1}^{1/2}\gamma \big)          (5.4)

5.3 Performance of risk forecasting

This section concludes with performance tests based on the backtesting framework. First, the violations are tested for independence, and it is tested whether the actual coverage level is statistically equivalent to the nominal coverage level 1 − β_0; both tests were defined by Christoffersen (1998). Secondly, the performance of the different subclasses is ranked by their Mean-Squared-Error value. To evaluate the performance of the CVaR forecasts, a backtesting analysis is defined that counts the number of CVaR violations: the indicator function equals one if the actual loss of the portfolio on the next day is larger than the forecast portfolio return. Using the number of violations found, the actual violation percentage β̃ and the actual coverage level 1 − β̃ are calculated as (note that the summation starts after the specified calibration window):

    \tilde{I} = \sum_{t=501}^{T-1} \mathbf{1}\big\{ x' r_{t+1} > VaR^{1-\beta_0}_{t+1|t}(x' r_{t+1}) \big\}          (5.5)

    \tilde{\beta} = \frac{\tilde{I}}{T}          (5.6)

First, the coverage test of Christoffersen is defined through a log-likelihood ratio test. Assume that β_0 is the nominal violation probability, such that the LR test statistic is

    LR_{coverage} = 2\Big[ \log\Big( (1-\tilde{\beta})^{T-\tilde{I}}\, \tilde{\beta}^{\tilde{I}} \Big) - \log\Big( (1-\beta_0)^{T-\tilde{I}}\, \beta_0^{\tilde{I}} \Big) \Big] \;\overset{a}{\sim}\; \chi^2(1)          (5.7)

If the null is rejected in favor of the alternative, it suggests that the CVaR either underestimates or overestimates the actual risk. The independence test of Christoffersen (1998) is a log-likelihood ratio test of whether the occurred violations are i.i.d. Bernoulli distributed; a consequence of violations that are not i.i.d. is clustering, as mentioned by Angelidis and Degiannakis (2007). Assume that state i is followed by state j for i, j = 0, 1, such that π_{ij} = n_{ij} / Σ_j n_{ij}, where n_{ij} counts the number of times Ĩ_t = j given Ĩ_{t-1} = i. The LR test statistic with one degree of freedom is

    LR_{independence} = 2\big[ \log\big( \pi_{00}^{n_{00}} \pi_{01}^{n_{01}} \pi_{10}^{n_{10}} \pi_{11}^{n_{11}} \big) - \log\big( \pi_1^{n_1} \pi_0^{n_0} \big) \big] \;\overset{a}{\sim}\; \chi^2(1)          (5.8)

Both log-likelihood ratio tests can also be combined into one LR test, using the same approach as described by Christoffersen (1998), with two degrees of freedom:

    LR_{conditional} = LR_{coverage} + LR_{independence} \;\overset{a}{\sim}\; \chi^2(2)          (5.9)
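Statistics 5.7–5.9 are short to code once the violation series is available; a sketch, with I the 0/1 violation vector and b0 the nominal violation probability (the χ²(1) and χ²(2) critical values at 5% are 3.841 and 5.991):

    % Christoffersen coverage, independence and conditional tests, eqs. 5.7-5.9.
    n  = numel(I);  v = sum(I);  bt = v / n;             % beta-tilde, eq. 5.6
    LRcov = 2*( (n-v)*log(1-bt) + v*log(bt) ...
              - (n-v)*log(1-b0) - v*log(b0) );           % eq. 5.7, chi2(1)
    n00 = sum(I(1:end-1)==0 & I(2:end)==0);  n01 = sum(I(1:end-1)==0 & I(2:end)==1);
    n10 = sum(I(1:end-1)==1 & I(2:end)==0);  n11 = sum(I(1:end-1)==1 & I(2:end)==1);
    p01 = n01/(n00+n01);  p11 = n11/(n10+n11);           % transition probabilities
    p1  = (n01+n11)/(n-1);                               % unconditional probability
    LRind = 2*( n00*log(1-p01) + n01*log(p01) + n10*log(1-p11) + n11*log(p11) ...
              - (n00+n10)*log(1-p1) - (n01+n11)*log(p1) );   % eq. 5.8, chi2(1)
    LRcc  = LRcov + LRind;                               % eq. 5.9, chi2(2)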

The performance of the different subclasses is ranked by Mean-Squared Error, simply because it is not valid to compare the subclasses on the p-values of the above test statistics. The MSE is based only on observations where a violation of the forecast loss occurs:

    MSE = \frac{1}{\tilde{I}} \sum_{t=501}^{T-1} \Gamma_{t+1}          (5.10)

with

    \Gamma_{t+1} = \begin{cases} \big( x' r_{t+1} - CVaR^{\beta_0}_{t+1|t}(x' r_{t+1}) \big)^2 & \text{if a violation occurs} \\ 0 & \text{otherwise} \end{cases}          (5.11)

The subclass with the lowest MSE score is the preferred subclass, not only for modeling the financial returns but also for forecasting the possible next day portfolio loss.

6 Application

This chapter presents the results of the empirical study fitting the multivariate generalized hyperbolic distribution by the multi-cycle EM algorithm. Section 6.1 describes the program developed in Matlab. Section 6.2 presents the empirical data. Section 6.3 presents the results of this study. Section 6.4 discusses the calibration speed improvement achieved with parallel processing.

6.1 Developed GUI

For the empirical part of this study, a complete Matlab program was developed to accommodate the underlying DCC(1,1)-MGARCH(1,1) model, the calibration of the MGHyp and the numerical estimation of the VaR and conditional VaR. Since EM optimization is relatively slow, the algorithm uses parallel processing, explained in section 6.4. As far as is currently known, no other freely available program exists that uses parallel processing or is capable of calibrating the MGHyp jointly with the DCC-MGARCH model. The input is handled through a GUI, which enhances the usability of the algorithm and limits possible input errors. The program gives the option to estimate the DCC-MGARCH model, calibrate the MGHyp for the full sample size, or perform the backtesting analysis. If desired, one can supply parameter inputs for the Normal-Mean-Variance-Mixture. Besides offering a user-friendly interface, the GUI ensures that all subroutines are correctly identified by the parallel processing routine.

41 6.2 Empirical Data identified by the parallel processing process. The output GUI presents the coverage, independence and Christoffersen conditional test statistics; it displays the time series analysis and if desired it could present the empirical asset return data. Since the GUI is written as an easy to change program, it could easily be used on a supercomputer environment or with some minor adjustments a completely different distribution could be substituted. The input and output layouts are presented in appendix F. 6.2 Empirical Data The equally weighted portfolio is constructed by the S&P 500 top ten constituents by market cap; Apple Inc (AAPL), Chevron Corp (CVX), General Electric (GE), Intl Business Machines Corp (IBM), JP Morgan Chase & Co (JPM), Microsoft Corp (MSFT), Procter and Gamble (PG), AT&T (T), Wells Fargo & Co (WFC) and Exxon Mobile Corp (XOM). The finite sample T=2,766, for the period 01/01/2000 to 01/01/2011, is formed by taking daily negative log returns of the adjusted daily close price 1. AAPL CVX GE IBM JPM Mean Std Skewness Kurtosis Jarque Bera Ljung-Box Q-test MSFT PG T WFC XOM Mean Std Skewness Kurtosis Jarque Bera Ljung-Box Q-test Table 6.1: Univariate sample statistics. Statistics indicated with are rejected at the 1% significance level. positive mean depicts an average loss. 1 Since negative returns are taken, a 36
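To make the construction of these statistics concrete, the sketch below computes, under stated assumptions, the univariate quantities of table 6.1 and the Mardia coefficients reported in table 6.2 below. The variable prices is a hypothetical matrix of adjusted close prices (columns are assets), and jbtest and lbqtest are assumed to be available from the Statistics and Econometrics Toolboxes.

```matlab
% Minimal sketch, assuming `prices` holds adjusted daily close prices.
R = -diff(log(prices));                     % daily negative log returns (losses positive)
[T, k] = size(R);
mu = mean(R);  sd = std(R);                 % univariate mean and standard deviation
sk = skewness(R);  ku = kurtosis(R);        % sample skewness and kurtosis per asset
for j = 1:k
    [~, pJB(j)] = jbtest(R(:,j));                  % Jarque-Bera normality test
    [~, pLB(j)] = lbqtest(R(:,j).^2, 'lags', 10);  % Ljung-Box Q-test, squared returns
end
% Mardia's multivariate skewness and kurtosis (without the small sample correction)
Xc = bsxfun(@minus, R, mean(R));            % centered returns
S  = cov(R, 1);                             % ML covariance estimate
D  = (Xc / S) * Xc';                        % D(i,j) = (x_i - xbar)' S^{-1} (x_j - xbar)
b1 = mean(D(:).^3);                         % multivariate skewness coefficient
b2 = mean(diag(D).^2);                      % multivariate kurtosis coefficient
skewStat = T*b1/6;                          % ~ chi2(k(k+1)(k+2)/6) under normality
kurtStat = (b2 - k*(k+2)) / sqrt(8*k*(k+2)/T);   % ~ N(0,1) under normality
```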

For each of the selected asset returns, the sample skewness, kurtosis, Jarque-Bera and Ljung-Box Q-test statistics are calculated and presented in table 6.1. It is notable that over the entire time period eight out of ten assets endured an average loss. This is explainable because the data cover the entire liquidity crisis period, in which most assets lost a large percentage of their value. It is also notable that all ten asset returns exhibit heavy tails and asymmetry. Consequently, the Jarque-Bera test gives strong evidence to reject Gaussian distributed returns at the 1% significance level. The Ljung-Box Q-test, using squared returns and an autocorrelation lag length of ten, shows evidence of heteroskedasticity, rendering the randomness assumption invalid: at the 1% significance level the null is rejected for all but two asset returns. Although GE and IBM are not formally rejected at the 1% level, they are at the 5% significance level.

Mardia's test of multivariate normality, which tests both multivariate skewness and multivariate kurtosis, is presented in table 6.2. Mardia's test results in a rejection of the multivariate normality hypothesis at the 1% significance level.

Table 6.2: Mardia's test of multivariate normality, reporting the multivariate skewness coefficient (with a small sample correction) and the multivariate kurtosis coefficient together with their test statistics, degrees of freedom and p-values; the numerical entries are not recoverable from the transcription. Statistics indicated with * are rejected at the 1% significance level.

6.3 Results

In this section the empirical results are presented, based on the proposed theoretical framework. By reviewing the subclasses Normal Inverse Gaussian (NIG), multivariate hyperbolic (Hyp) and 10-dimensional multivariate hyperbolic (KHyp), a comparison is made as to which of the hyperbolic subclasses performs best in handling the observed heavy tails and asymmetry. As indicated in chapter 5, the backtesting analysis uses an equally weighted portfolio and the previous 500 observations to train the multivariate generalized hyperbolic distribution as well as the underlying DCC(1,1)-MGARCH(1,1) portfolio model. The one day ahead portfolio risk is then estimated by the conditional Value at Risk at the nominal 95% and 99% coverage levels. Figures 6.1, 6.2 and 6.3 show the time series analysis, illustrating the risk violations by + markings, while tables 6.3 and 6.4 present the statistical results.

Christoffersen's unconditional coverage test gives strong evidence to reject the null of correct coverage for all three symmetric distributions at the nominal 95% coverage level; the respective p-values¹ are (0.0011) for the NIG, (0.0032) for the Hyp and (0.0084) for the KHyp. Of these three symmetrical distributions, the NIG is the only one also found statistically significant for the Christoffersen conditional test, with p-value (0.0038). While these symmetric distributions all underestimate risk, no significant problems are found when the risk is estimated with the asymmetrical distributions.

At the nominal 99% coverage level, no strong statistical evidence of under- or overestimated risk is found for either the symmetric or the asymmetric NIG, Hyp and KHyp distributions. For both the coverage test (0.0949) and the Christoffersen conditional test (0.1591), the asymmetric KHyp has the lowest observed, statistically insignificant, p-value.

Since the different distributions cannot be ranked on their p-values, the MSE value is calculated to evaluate the performance under both nominal coverage levels. As aforementioned, the three symmetric distributions at the nominal 95% level are found statistically significant for the coverage test, so these three distributions are disregarded. At the nominal 95% level the asymmetric NIG distribution performs slightly better than the asymmetric Hyp. At the nominal 99% coverage level, all three symmetrical distributions are considered superior

¹ p-values are given in parentheses.

Table 6.3: Backtesting test statistics based on the 95% coverage level. The p-values are denoted in parentheses; only the p-values survived the transcription, while the β̂, test statistic and MSE columns are not recoverable. A * (**) indicates rejection at the 1% (5%) significance level. All three symmetric subclasses are rejected by the Kupiec test at the 1% significance level.

                 Coverage    Independence    Conditional
asym NIG         (0.0707)    (0.7696)        (0.1870)
asym HYP         (0.0880)    (0.7668)        (0.2234)
asym 10dim-HYP   (0.1928)    (0.7136)        (0.4004)
sym NIG          (0.0011)    (0.4890)        (0.0038)
sym HYP          (0.0032)    (0.5685)        (0.0110)
sym 10dim-HYP    (0.0084)    (0.6466)        (0.0281)

Table 6.4: Backtesting test statistics based on the 99% coverage level. The p-values are denoted in parentheses; the β̂, test statistic and MSE columns are not recoverable. A * (**) indicates rejection at the 1% (5%) significance level.

                 Coverage    Independence    Conditional
sym NIG          (0.3085)    (0.5812)        (0.5113)
sym HYP          (0.9412)    (0.4828)        (0.7796)
sym 10dim-HYP    (0.2759)    (0.3941)        (0.3842)
asym HYP         (0.1987)    (0.3776)        (0.2967)
asym NIG         (0.6255)    (0.4461)        (0.6641)
asym 10dim-HYP   (0.0949)    (0.3459)        (0.1591)

Figure 6.1: The one day ahead conditional Value at Risk using the NIG distribution for the nominal 95% (blue) and 99% (red) coverage level. Conditional VaR violations are indicated by + markings. Notable is the concentration of violations at the nominal 99% level.

Figure 6.2: The one day ahead conditional Value at Risk using the Hyperbolic distribution for the nominal 95% (blue) and 99% (red) coverage level. Conditional VaR violations are indicated by + markings. Notable is the concentration of violations at the nominal 99% level.

Figure 6.3: The one day ahead conditional Value at Risk using the 10-dimensional hyperbolic distribution for the nominal 95% (blue) and 99% (red) coverage level. Conditional VaR violations are indicated by + markings. Notable is the concentration of violations at the nominal 99% level.

above the asymmetrical ones. Of these symmetric distributions, the NIG slightly outperforms the Hyp.

It seems suspicious that all three symmetrical distributions outperform the asymmetrical distributions on the MSE value, and this result is particularly remarkable. Since the portfolio is heavily skewed according to Mardia's test, the asymmetrical distribution is expected to outperform the symmetrical one. Furthermore, since the asymmetrical distribution nests the symmetrical distribution, one would expect at least one asymmetrical subclass to outperform a symmetrical subclass. Although it could still be a coincidence, it seems that the MSE value presents a biased ranking for this empirical research. Due to the parsimonious behavior of the MSE, more observations are explained by the simpler, less complex symmetrical distribution. This results in lower MSE values and falsely indicates the better suited model. If one compares the MSE values at the nominal 95% coverage level including the rejected symmetrical distributions, it is easily seen that although the symmetrical distributions are rejected, their MSE values are considerably lower than those of their asymmetrical equivalents. This is quite odd, since all three of these distributions underestimate the risk. It probably explains the odd ranking, and it is therefore advised not to rank the models across assumptions based on the MSE value.

While the MSE value thus cannot be used to rank models across the asymmetrical and symmetrical assumptions, it is probably possible to rank models that share the same assumptions, that is, the same nominal coverage level and the same symmetrical or asymmetrical assumption. If this is done, it follows from tables 6.3 and 6.4 that in all four cases the NIG distribution outperforms. This result is not surprising and has been documented in other papers, for instance by McNeil et al. (2005) and Protassov (2004).

6.4 Calibration time improvement

Calibration of the MGHyp density by EM is in general a slow optimization process. If optimizing one cycle of the backtesting analysis takes five minutes, estimating the full backtesting sample (2,266 cycles)¹ takes a shocking eight days to complete. For an empirical study with twelve different distributions this would simply take too much time, namely 96 days. Therefore, a method is introduced to reduce the running time of the full backtesting sample.

The simplest way to reduce the running time would be to recalibrate the MGHyp density every thirty cycles instead of every cycle. However, that method does not use all up-to-date market information. This study instead proposes parallel processing on multicore desktop CPUs, since Mathworks (2011) notes that the Parallel Computing Toolbox is designed specifically for data intensive algorithms based on array type variables. The local host divides its CPU into a maximum of eight clusters, such that eight independent cycles are optimized simultaneously. The efficiency of this method depends on the exact test setup: since all eight clusters need to communicate their progress, an information jam can occur if the test setup is too slow to receive, interpret and send commands to all clusters. The efficient number of clusters is established by simple trial and error.

This study uses an Intel Quad 2.4GHz with 4 GB of internal memory. Matlab 2010a is used as the mathematical package, including the Parallel Computing Toolbox, and the full eight clusters are employed. The effectiveness of the parallel processing unit is clearly noticeable, since it reduces the running time by a factor of roughly 26, a speed-up of about 2600%. Using one of the symmetric distributions results in a running time of merely six hours, while the asymmetric distributions take seven hours to complete. This is perfectly explainable because the symmetric case assumes γ = 0, so that γ is not estimated by the EM algorithm. No recognizable running time differences are found within the symmetrical or asymmetrical subclasses, nor between the different coverage levels.

¹ The full sample reduced by the first 500 calibration observations.
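A minimal sketch of the parallelized backtesting loop is given below, assuming hypothetical routine names (calibrate_mghyp, forecast_cvar) for the calibration and forecasting steps described in the previous chapters; matlabpool is the Matlab 2010a command for opening local workers from the Parallel Computing Toolbox.

```matlab
% Minimal sketch of the parallel backtesting loop; calibrate_mghyp and
% forecast_cvar are hypothetical placeholders for the thesis routines,
% and `returns` is the T-by-10 matrix of negative log returns.
matlabpool(8)                            % open eight local workers (Matlab 2010a)
window  = 500;                           % rolling calibration window
nCycles = size(returns, 1) - window;     % the 2,266 out-of-sample cycles
cvar    = zeros(nCycles, 1);
parfor t = 1:nCycles
    sample  = returns(t:t+window-1, :);  % previous 500 daily return vectors
    model   = calibrate_mghyp(sample);   % DCC-MGARCH + multi-cycle EM calibration
    cvar(t) = forecast_cvar(model);      % one day ahead conditional VaR forecast
end
matlabpool close
```

Because each cycle only reads its own 500-observation window and writes one element of cvar, the iterations are independent, which is exactly what parfor requires to distribute them over the eight clusters.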

7 Conclusion

This study focused on further developing the effectiveness of the multivariate generalized hyperbolic distribution (MGHyp) in modeling the observed heavy tails, asymmetry and peakedness of the portfolio return distribution. It complements recent findings by first introducing an underlying DCC(1,1)-MGARCH(1,1) portfolio model with residuals following the MGHyp density, and secondly by analyzing through backtesting whether this complex model is empirically better suited to forecast the possible next day portfolio loss. Since calibrating the MGHyp is a slow process, this study also proposed the implementation of parallel processing as a programming technique to reduce the backtesting estimation time frame.

Using an equally weighted portfolio and the conditional Value at Risk as the risk measure, a particular result is established. It is assumed that if (i) the underlying model is correctly specified as a dynamic relation, (ii) the forecast horizon is one day and (iii) the risk methodology is based on the CVaR approach, then the true coverage level should be estimated consistently. Remarkably, at the nominal 95% coverage level all three symmetrical subclasses are rejected by the coverage test. This raises the question whether the underlying model is appropriate or whether the empirical data is heavily skewed. Tables 6.1 and 6.2 indeed indicate serious skewness, and since a different, smaller dataset did not show any problems in estimating the true coverage level, it is proposed that the observed coverage rejection is due to the heavily skewed empirical data. This implies, as expected, that a symmetric distribution is unable to handle heavy skewness.

While at the nominal 95% coverage level all asymmetric subclasses outperform the symmetrical distributions, at the nominal 99% coverage level it is exactly the other way around. This result is unexpected, since the asymmetrical distribution nests the symmetrical distribution and the portfolio used is heavily skewed. It appears that, due to the parsimonious behavior of the MSE value, the simpler symmetrical models explain more of the observational data, such that the MSE value falsely indicates them as the superior models. In both cases the NIG distribution is slightly superior to the hyperbolic distribution based on the MSE value.

Using parallel processing leads to a significant time reduction: it reduces the waiting time by a factor of roughly 26, from almost eight days to merely six to seven hours for one backtesting analysis. A one hour time difference exists between the asymmetrical and symmetrical distributions due to the extra parameter estimated in the asymmetrical case. No significant time differences are found within the asymmetrical or symmetrical subclasses, nor between the nominal coverage levels.

This empirical study shows that the NIG distribution is the best performer at both nominal coverage levels and that parallel processing reduces estimation time considerably. It is recommended to extend this research by evaluating all MGHyp subclasses and by using a more detailed test to establish whether the NIG is indeed the best performer. This is easily implemented with the developed Matlab routine. A further benefit of this program is the ability to impose even more complex distributions or to place constraints on some of the MGHyp calibration parameters. It opens a complete new world for (complex) empirical studies in the field of finance.

References

Aas, K. and Haff, I. H. (2006). The generalized hyperbolic skew Student's t-distribution. Journal of Financial Econometrics, 4.

Angelidis, T. and Degiannakis, S. (2007). Backtesting VaR models: an expected shortfall approach. Working paper, University of Crete.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. (1997). Thinking coherently. Risk, 10.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9.

Barndorff-Nielsen, O. (1977). Exponentially decreasing distributions for the logarithm of particle size. Proceedings of the Royal Society of London, Series A, 353.

Bauwens, L., Laurent, S., and Rombouts, J. V. (2006). Multivariate GARCH models: a survey. Journal of Applied Econometrics, 21.

Blaesild, P. and Sorensen, M. (1992). Hyp: a computer program for analyzing data by means of the hyperbolic distribution. Research Report 248, Department of Theoretical Statistics, University of Aarhus.

Blaesild, P. (1981). The two-dimensional hyperbolic distribution and related distributions, with application to Johannsen's bean data. Biometrika, 68.

Bollerslev, T. (1990). Modeling the coherence in short-run nominal exchange rates: a multivariate generalized ARCH model. Review of Economics and Statistics, 72.

Bollerslev, T., Engle, R., and Wooldridge, J. (1988). A capital asset pricing model with time varying covariances. Journal of Political Economy, 96.

Christoffersen, P. F. (1998). Evaluating interval forecasts. International Economic Review, 39.

De Finetti, B. (1929). Sulle funzioni ad incremento aleatorio. Rendiconti della Accademia Nazionale dei Lincei, 10.

Dempster, A. P., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38.

Diebold, F. and Nerlove, M. (1989). The dynamics of exchange rate volatility: a multivariate latent factor ARCH model. Journal of Applied Econometrics, 4:1-21.

Eberlein, E. and Keller, U. (1995). Hyperbolic distributions in finance. Bernoulli, 1.

Efunda and Wolfram Mathematica (2010). The modified Bessel's differential equation.

Engle, R. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50.

Engle, R. (1999). Dynamic conditional correlation - a simple class of multivariate GARCH models. Econometrica, 50.

Engle, R. (2002). Dynamic conditional correlation - a simple class of multivariate GARCH models. Journal of Business and Economic Statistics, 20.

Engle, R. and Kroner, K. (1995). Multivariate simultaneous generalized ARCH. Econometric Theory, 11.

Engle, R. and Sheppard, K. (2002). Theoretical and empirical properties of dynamic conditional correlation multivariate GARCH. NBER Working Paper 8554.

Gourieroux, C. (1997). ARCH Models and Financial Applications. Springer-Verlag, New York.

Hellmich, M. and Kassberger, S. (2009). Efficient and robust portfolio optimization in the multivariate generalized hyperbolic framework. Working paper.

Hu, W. (2005). Calibration of multivariate generalized hyperbolic distributions using the EM algorithm, with applications in risk management, portfolio optimization and portfolio credit risk. PhD dissertation, Florida State University.

Laplante, J., Desrochers, J., and Prefontaine, J. (2008). The GARCH(1,1) model as a risk predictor for international portfolios. International Business and Economic Research Journal, 7.

Leippold, M. (2004). Don't rely on VaR. Euromoney, 1:FA2-FA5.

Liu, C. and Rubin, D. (1995). ML estimation of the t-distribution using EM and its extensions, ECM and ECME. Statistica Sinica, 5.

Madan, D. and Seneta, E. (1990). The variance gamma model for share market returns. Journal of Business, 63.

Mathworks (2011). Parallel Computing Toolbox.

McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions. John Wiley & Sons, Hoboken, New Jersey.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons, Hoboken, New Jersey.

McNeil, A., Frey, R., and Embrechts, P. (2005). Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, Princeton, New Jersey.

Morgan, J.P. (1996). RiskMetrics.

Newcomb, S. (1886). A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8.

Newey, W. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of Econometrics, volume 4. Elsevier, North-Holland.

Olver, F. and Maximon, L. (2010). Chapter 10: Bessel functions. In NIST Handbook of Mathematical Functions.

Pagan, A. (1996). The econometrics of financial markets. Journal of Empirical Finance, 3.

Paolella, M. (2007). Intermediate Probability. John Wiley & Sons, Hoboken, New Jersey.

Prause, K. (1999). The generalized hyperbolic model: estimation, financial derivatives and risk measures. PhD dissertation, Institut für Mathematische Statistik, Albert-Ludwigs-Universität, Freiburg.

Protassov, R. (2004). EM-based maximum likelihood parameter estimation for multivariate generalized hyperbolic distributions with fixed λ. Statistics and Computing, 14.

Tse, Y. and Tsui, A. (2002). A multivariate GARCH model with time-varying correlations. Journal of Business and Economic Statistics, 20.

van der Weide, R. (2002). GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal of Applied Econometrics, 17.

Appendix A

Derivation of the conditional density of the Normal-Mean-Variance-Mixture

This appendix demonstrates the derivation of the conditional probability density f(x_i | w_i) given as 4.7. Let x_i be a k-dimensional vector for i ∈ [1, ..., T] and Z ~ N_k(0, I_k) the multivariate normal distribution of dimension k. The first step is to define the Normal-Mean-Variance-Mixture conditioned on the mixing weight w_i as

$$X_i \mid W_i = w_i \;=\; \mu + w_i\gamma + \sqrt{w_i}\,AZ \;\sim\; N_k\!\left(\mu + w_i\gamma,\; w_i\Sigma\right)$$

with multivariate probability density function

$$f(x_i \mid w_i) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}\,w_i^{k/2}}\, \exp\!\left\{ -\tfrac{1}{2}\,(x_i-\mu-w_i\gamma)'(w_i\Sigma)^{-1}(x_i-\mu-w_i\gamma) \right\} \quad (A.1)$$

With some simple rewriting it is possible to separate the exponent into three components. The first component is the quadratic form in x_i − µ, the second component is the quadratic form in w_iγ, and the third is the cross product between x_i − µ and w_iγ, which is multiplied by 2 because the cross relation occurs both ways. Doing so reveals the quadratic form in x_i − µ as

$$\exp\!\left\{ -\frac{(x_i-\mu)'(w_i\Sigma)^{-1}(x_i-\mu)}{2} \right\} = \exp\!\left\{ -\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2w_i} \right\}$$

the quadratic form in w_iγ as

$$\exp\!\left\{ -\frac{(w_i\gamma)'(w_i\Sigma)^{-1}(w_i\gamma)}{2} \right\} = \exp\!\left\{ -\frac{w_i\,\gamma'\Sigma^{-1}\gamma}{2} \right\}$$

and the cross product as

$$\exp\!\left\{ \frac{2\,(w_i\gamma)'(w_i\Sigma)^{-1}(x_i-\mu)}{2} \right\} = \exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\}$$

Combining the three separate expressions results in the same probability density function as A.1. However, the new expression A.2 simplifies the differentiation of the log-likelihood as well as the theoretical derivation of the density functions given in appendices B and C:

$$f(x_i \mid w_i) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2} w_i^{k/2}}\; e^{(x_i-\mu)'\Sigma^{-1}\gamma}\; e^{-\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2w_i}}\; e^{-\frac{w_i}{2}\gamma'\Sigma^{-1}\gamma} \quad (A.2)$$
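As a quick sanity check of this algebraic separation, the hypothetical Matlab snippet below evaluates both forms of the density at arbitrary test values and confirms they coincide; all numbers are made up for illustration, and mvnpdf is assumed to be available from the Statistics Toolbox.

```matlab
% Hypothetical numerical check that the separated form (A.2) equals the plain
% normal density (A.1); the inputs below are arbitrary test values.
k   = 3;
mu  = [0.1; -0.2; 0.05];  gam = [0.3; 0; -0.1];
Sig = [1 0.2 0.1; 0.2 1 0.3; 0.1 0.3 1];
w   = 0.7;  x = [0.5; -0.4; 0.2];
f1  = mvnpdf(x', (mu + w*gam)', w*Sig);          % direct form (A.1)
q   = (x-mu)' * (Sig \ (x-mu));                  % quadratic form in x - mu
f2  = (2*pi)^(-k/2) * det(Sig)^(-1/2) * w^(-k/2) ...
    * exp((x-mu)'*(Sig\gam)) * exp(-q/(2*w)) * exp(-(w/2)*(gam'*(Sig\gam)));
% f1 and f2 agree up to floating point rounding
```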

Appendix B

Derivation of the MGHyp probability distribution function

This appendix demonstrates the derivation of the multivariate generalized hyperbolic probability density function f(x_i). Let x_i be a k-dimensional vector for i ∈ [1, ..., T]. The first step is to express f(x_i) by integrating out the mixing variable:

$$f(x_i) = \int_0^{\infty} f(x_i, w_i)\,dw_i = \int_0^{\infty} f(x_i \mid w_i)\,f(w_i)\,dw_i \quad (B.1)$$

The conditional density f(x_i | w_i) and the mixing density f(w_i) are already known at this stage. Let f(x_i | w_i) be defined by A.2, such that

$$f(x_i \mid w_i) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2} w_i^{k/2}}\; e^{(x_i-\mu)'\Sigma^{-1}\gamma}\; e^{-\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2w_i}}\; e^{-\frac{w_i}{2}\gamma'\Sigma^{-1}\gamma} \quad (B.2)$$

and let f(w_i) be the Generalized Inverse Gaussian density function

$$f(w_i) = \frac{\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda}}{2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}\, w_i^{\lambda-1} \exp\!\left\{ -\frac{1}{2}\left( \frac{\chi}{w_i} + \psi w_i \right) \right\} \quad (B.3)$$

A straightforward substitution of B.2 and B.3 into B.1, moving all factors that do not depend on w_i out of the integrand, defines the multivariate generalized hyperbolic probability density f(x_i):

$$f(x_i) = \frac{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\} \chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda}}{(2\pi)^{k/2}|\Sigma|^{1/2}\, 2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \int_0^{\infty} w_i^{\lambda-\frac{k}{2}-1} \exp\!\left\{ -\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2w_i} - \frac{w_i\,\gamma'\Sigma^{-1}\gamma}{2} - \frac{1}{2}\left( \frac{\chi}{w_i} + \psi w_i \right) \right\} dw_i$$

The last expression is rewritten such that the arguments within the exponential are organized in a nicer form:

$$f(x_i) = \frac{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\} \chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda}}{(2\pi)^{k/2}|\Sigma|^{1/2}\, 2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \int_0^{\infty} w_i^{\lambda-\frac{k}{2}-1} \exp\!\left\{ -\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)+\chi}{2w_i} - \frac{w_i\left( \gamma'\Sigma^{-1}\gamma + \psi \right)}{2} \right\} dw_i \quad (B.4)$$

Concentrating on the arguments within the exponential, replace both fractions with

$$\chi^* = (x_i-\mu)'\Sigma^{-1}(x_i-\mu) + \chi, \qquad \psi^* = \gamma'\Sigma^{-1}\gamma + \psi$$

such that the exponential part of the integrand of B.4 is rewritten as

$$\exp\!\left\{ -\frac{\chi^*}{2w_i} - \frac{w_i\psi^*}{2} \right\} = \exp\!\left\{ -\frac{1}{2}\sqrt{\chi^*\psi^*}\left[ \frac{1}{w_i}\sqrt{\frac{\chi^*}{\psi^*}} + w_i\sqrt{\frac{\psi^*}{\chi^*}} \right] \right\} \quad (B.5)$$

Expression B.5 demonstrates that one element inside the brackets is exactly the inverse of the other, namely $w_i\sqrt{\psi^*/\chi^*}$. Substituting $t = w_i\sqrt{\psi^*/\chi^*}$ modifies the integrand of B.4 to

$$\left( \sqrt{\frac{\chi^*}{\psi^*}} \right)^{\lambda-\frac{k}{2}-1} \int_0^{\infty} t^{\lambda-\frac{k}{2}-1} \exp\!\left\{ -\frac{1}{2}\sqrt{\chi^*\psi^*}\left[ \frac{1}{t} + t \right] \right\} \left( \sqrt{\frac{\chi^*}{\psi^*}} \right) dt \quad (B.6)$$

Notice that the variable of integration changed from $dw_i$ to $\left(\sqrt{\chi^*/\psi^*}\right)dt$, which is simply the first order differential of $w_i$. The integral is the explicit representation of the modified Bessel function of the third kind, whose solution is known: $\int_0^{\infty} t^{\nu-1}\exp\{-\tfrac{x}{2}(t+\tfrac{1}{t})\}\,dt = 2K_{\nu}(x)$. Substituting this value into B.6 and combining with B.4 results in

$$f(x_i) = \frac{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\} \chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda}}{(2\pi)^{k/2}|\Sigma|^{1/2}\, K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \left( \sqrt{\frac{\chi^*}{\psi^*}} \right)^{\lambda-\frac{k}{2}} K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right) \quad (B.7)$$

Lastly, a small trick is applied to two factors of B.7 to prevent styling issues in the final representation of the density f(x_i):

$$\left( \sqrt{\frac{\chi^*}{\psi^*}} \right)^{\lambda-\frac{k}{2}} = \left( \psi^* \right)^{\frac{k}{2}-\lambda} \left( \sqrt{\chi^*\psi^*} \right)^{\lambda-\frac{k}{2}} \quad (B.8)$$

$$\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda} = \psi^{\lambda}\left(\sqrt{\chi\psi}\right)^{-\lambda} \quad (B.9)$$

Substituting B.8 and B.9 into B.7 yields the probability density function of the multivariate generalized hyperbolic distribution:

$$f(x_i) = \frac{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\} \psi^{\lambda}\left(\sqrt{\chi\psi}\right)^{-\lambda}}{(2\pi)^{k/2}|\Sigma|^{1/2}\, K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \left( \psi^* \right)^{\frac{k}{2}-\lambda} \frac{K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right)}{\left( \sqrt{\chi^*\psi^*} \right)^{\frac{k}{2}-\lambda}} \quad (B.10)$$

where χ* and ψ* are defined as

$$\chi^* = (x_i-\mu)'\Sigma^{-1}(x_i-\mu) + \chi, \qquad \psi^* = \gamma'\Sigma^{-1}\gamma + \psi$$
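The Bessel identity used in this step can be verified numerically. The hypothetical snippet below compares the integral against Matlab's built-in besselk for arbitrary test values; quadgk handles the infinite upper limit.

```matlab
% Hypothetical numerical check of the integral representation used above:
% integral over (0, inf) of t^(nu-1) * exp(-(x/2)*(t + 1/t)) dt = 2*K_nu(x).
nu  = 1.5;  x = 2.0;                       % arbitrary test values
lhs = quadgk(@(t) t.^(nu-1) .* exp(-(x/2)*(t + 1./t)), 0, Inf);
rhs = 2 * besselk(nu, x);                  % modified Bessel function of the third kind
% lhs and rhs agree up to the quadrature tolerance
```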

Appendix C

Derivation of the conditional GIG distribution

This appendix demonstrates the derivation of the conditional density f(w_i | x_i), which will reveal itself as the GIG density function. Let x_i be a k-dimensional vector for i ∈ [1, ..., T]. The first step is to find an explicit expression for f(w_i | x_i) using Bayes' theorem:

$$f(w_i \mid x_i) = \frac{f(x_i \mid w_i)\,f(w_i)}{f(x_i)} \quad (C.1)$$

Luckily, all three probability density functions are known, so the conditional density is found by substitution, simplification and some simple algebraic manipulation. Let us summarize the three functions. The first, f(x_i | w_i), is the conditional pdf based on the normality assumption, as described in appendix A:

$$f(x_i \mid w_i) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2} w_i^{k/2}}\; e^{(x_i-\mu)'\Sigma^{-1}\gamma}\; e^{-\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2w_i}}\; e^{-\frac{w_i}{2}\gamma'\Sigma^{-1}\gamma} \quad (C.2)$$

Let f(w_i) be the Generalized Inverse Gaussian density given as

$$f(w_i) = \frac{\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda}}{2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}\, w_i^{\lambda-1} \exp\!\left\{ -\frac{1}{2}\left( \frac{\chi}{w_i} + \psi w_i \right) \right\} \quad (C.3)$$

and let f(x_i) be the MGHyp density derived in appendix B,

$$f(x_i) = \frac{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\} \psi^{\lambda}\left(\sqrt{\chi\psi}\right)^{-\lambda}}{(2\pi)^{k/2}|\Sigma|^{1/2}\, K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \left( \psi^* \right)^{\frac{k}{2}-\lambda} \frac{K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right)}{\left( \sqrt{\chi^*\psi^*} \right)^{\frac{k}{2}-\lambda}} \quad (C.4)$$

where χ* and ψ* are defined as

$$\chi^* = (x_i-\mu)'\Sigma^{-1}(x_i-\mu) + \chi, \qquad \psi^* = \gamma'\Sigma^{-1}\gamma + \psi$$

Substituting C.2, C.3 and C.4 into C.1 directly leads to a gigantic and opaque expression. This is clearly not the way to go, so some simplifications and algebraic manipulations are considered. The process is decomposed into two steps. Assume that each of the three pdfs can be split into two components, one inside the exponential function and one outside. For instance,

$$f(w_i) = \underbrace{\frac{\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda}}{2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}\, w_i^{\lambda-1}}_{\alpha\ \text{component}}\; \underbrace{\exp\!\left\{ -\frac{1}{2}\left( \frac{\chi}{w_i} + \psi w_i \right) \right\}}_{\beta\ \text{component}}$$

The same applies to the other two pdfs, C.2 and C.4. The first step simplifies the α components by substituting them into C.1 for their respective pdfs; in the same manner, the second step simplifies the β components.

C.0.1 Step 1

Selecting the α components from C.2, C.3 and C.4 and applying the prescribed simplification procedure results in

$$\frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2} w_i^{k/2}} \cdot \frac{\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{\lambda} w_i^{\lambda-1}}{2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \cdot \frac{(2\pi)^{k/2}|\Sigma|^{1/2}\, K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}{\psi^{\lambda}\left(\sqrt{\chi\psi}\right)^{-\lambda} \left( \psi^* \right)^{\frac{k}{2}-\lambda} \left( \sqrt{\chi^*\psi^*} \right)^{\lambda-\frac{k}{2}} K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right)}$$

This clearly indicates that a number of factors cancel each other out. Hence:

$$\frac{\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{2\lambda}}{2\,\psi^{\lambda}}\; \frac{\left( \psi^* \right)^{\lambda-\frac{k}{2}}\, w_i^{\lambda-\frac{k}{2}-1}}{\left( \sqrt{\chi^*\psi^*} \right)^{\lambda-\frac{k}{2}}\, K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right)} \quad (C.5)$$

It might seem that the trick introduced as C.6 is unnecessary, but implementing it shows that the previous expression can be simplified even more. Rewrite

$$\chi^{-\lambda}\left(\sqrt{\chi\psi}\right)^{2\lambda} = \chi^{-\lambda}\,\chi^{\lambda}\,\psi^{\lambda} = \psi^{\lambda} \quad (C.6)$$

such that C.5 changes to

$$\frac{\left( \psi^* \right)^{\lambda-\frac{k}{2}}}{2\,K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right)}\left( \sqrt{\chi^*\psi^*} \right)^{\frac{k}{2}-\lambda} w_i^{\lambda-\frac{k}{2}-1} \quad (C.7)$$

At this point no further successful simplification steps are available for the α components.

C.0.2 Step 2

The remainder of this appendix considers the simplification of the β components of the three pdfs C.2, C.3 and C.4. As implied in the first step, the same starting procedure is applicable:

$$\frac{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\} \exp\!\left\{ -\frac{(x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{2w_i} \right\} \exp\!\left\{ -\frac{w_i}{2}\gamma'\Sigma^{-1}\gamma \right\} \exp\!\left\{ -\frac{\chi}{2w_i} - \frac{\psi w_i}{2} \right\}}{\exp\!\left\{ (x_i-\mu)'\Sigma^{-1}\gamma \right\}}$$

Simplifying this expression is quite straightforward:

$$= \exp\!\left\{ -\frac{1}{2}\left[ \frac{\chi + (x_i-\mu)'\Sigma^{-1}(x_i-\mu)}{w_i} + w_i\left( \gamma'\Sigma^{-1}\gamma + \psi \right) \right] \right\} = \exp\!\left\{ -\frac{1}{2}\left[ \frac{\chi^*}{w_i} + w_i\,\psi^* \right] \right\} \quad (C.8)$$

C.0.3 Conditional density function

Simply rejoining steps 1 and 2 formulates the conditional density function f(w_i | x_i):

$$f(w_i \mid x_i) = \frac{\left( \psi^* \right)^{\lambda-\frac{k}{2}}}{2\,K_{\lambda-\frac{k}{2}}\!\left( \sqrt{\chi^*\psi^*} \right)}\left( \sqrt{\chi^*\psi^*} \right)^{\frac{k}{2}-\lambda} w_i^{\lambda-\frac{k}{2}-1} \exp\!\left\{ -\frac{1}{2}\left[ \frac{\chi^*}{w_i} + w_i\,\psi^* \right] \right\}$$

Using the same trick as before, interchanging χ* and ψ*, and writing λ* = λ − k/2, the final expression resembles the Generalized Inverse Gaussian probability density function:

$$f(w_i \mid x_i) = \frac{\left( \chi^* \right)^{-\lambda^*}\left( \sqrt{\chi^*\psi^*} \right)^{\lambda^*}}{2\,K_{\lambda^*}\!\left( \sqrt{\chi^*\psi^*} \right)}\, w_i^{\lambda^*-1} \exp\!\left\{ -\frac{1}{2}\left[ \frac{\chi^*}{w_i} + w_i\,\psi^* \right] \right\} \quad (C.9)$$

also defined as

$$W_i \mid X_i = x_i \;\sim\; N^{-}\!\left( \lambda^*, \chi^*, \psi^* \right) = N^{-}\!\left( \lambda - \tfrac{k}{2},\; \chi + (x_i-\mu)'\Sigma^{-1}(x_i-\mu),\; \psi + \gamma'\Sigma^{-1}\gamma \right)$$
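Since the E-step of the EM algorithm only needs conditional moments of this GIG distribution, they follow directly from the closed form moment formula of the GIG. The hypothetical sketch below assumes the conditional parameters lamStar, chiStar and psiStar from C.9 have already been computed.

```matlab
% Hypothetical sketch: moments of W | x ~ GIG(lam, chi, psi) follow from
% E[W^a] = (chi/psi)^(a/2) * K_{lam+a}(sqrt(chi*psi)) / K_lam(sqrt(chi*psi)),
% giving the E-step quantities delta_i = E[1/W | x_i] and eta_i = E[W | x_i].
gigmom  = @(a, lam, chi, psi) (chi/psi)^(a/2) ...
        * besselk(lam + a, sqrt(chi*psi)) / besselk(lam, sqrt(chi*psi));
delta_i = gigmom(-1, lamStar, chiStar, psiStar);   % conditional expectation of 1/W
eta_i   = gigmom( 1, lamStar, chiStar, psiStar);   % conditional expectation of W
```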

Appendix D

Proof of the closed form expressions for γ, µ and Σ

This appendix proves the closed form expressions γ^(p+1) (4.14), µ^(p+1) (4.15) and Σ^(p+1) (4.16), found by maximizing Q₁(x_i, µ^(p), Σ^(p), γ^(p)) (4.9). To keep the derivation transparent, the superscript indicating the cycle (p) is not reported from this point on, but the reader should be aware that the result does depend on the current cycle. Let the conditional expectations in 4.9 be replaced by 4.10, 4.11 and 4.12, such that the conditional maximization function Q₁(·) is given as

$$Q_1(x_i, \mu, \Sigma, \gamma) = -\frac{T}{2}\log|\Sigma| - \frac{k}{2}\sum_{i=1}^{T}\xi_i + \sum_{i=1}^{T}(x_i-\mu)'\Sigma^{-1}\gamma - \frac{1}{2}\sum_{i=1}^{T}\delta_i\,(x_i-\mu)'\Sigma^{-1}(x_i-\mu) - \frac{1}{2}\,\gamma'\Sigma^{-1}\gamma\sum_{i=1}^{T}\eta_i$$

To maximize this expression, the derivative is taken with respect to the vectors µ and γ and the matrix Σ. Using vector and matrix differentiation rules, the first order condition for µ is

$$\frac{dQ_1(\cdot)}{d\mu} = -T\gamma'\Sigma^{-1} + \sum_{i=1}^{T}\delta_i\,(x_i-\mu)'\Sigma^{-1} = 0 \;\Longrightarrow\; -T\gamma + \sum_{i=1}^{T}x_i\delta_i - \mu\sum_{i=1}^{T}\delta_i = 0$$

Dividing by T, taking the transpose where needed and using δ̄ = (1/T)Σᵢδᵢ results in the closed form expression for µ^(p+1):

$$\mu^{(p+1)} = \frac{\frac{1}{T}\sum_{i=1}^{T}x_i\delta_i - \gamma}{\bar{\delta}} \quad (D.1)$$

The closed form expression for γ^(p+1) follows after taking the derivative of Q₁(·) with respect to γ:

$$\frac{dQ_1(\cdot)}{d\gamma} = \sum_{i=1}^{T}(x_i-\mu)'\Sigma^{-1} - \gamma'\Sigma^{-1}\sum_{i=1}^{T}\eta_i = 0 \;\Longrightarrow\; \sum_{i=1}^{T}x_i - T\mu - \gamma\sum_{i=1}^{T}\eta_i = 0$$

The vector µ in the last expression is replaced by its closed form D.1. Multiplying through by δ̄ and using η̄ = (1/T)Σᵢηᵢ gives

$$\bar{\delta}\sum_{i=1}^{T}x_i - \sum_{i=1}^{T}x_i\delta_i + T\gamma - T\gamma\,\bar{\delta}\bar{\eta} = 0 \;\Longrightarrow\; \bar{\delta}\sum_{i=1}^{T}x_i - \sum_{i=1}^{T}x_i\delta_i - T\gamma\left( \bar{\delta}\bar{\eta} - 1 \right) = 0$$

The closed form solution now follows by expressing γ^(p+1) in terms of the other arguments, taking the transpose and using x̄ = (1/T)Σᵢxᵢ:

$$\gamma^{(p+1)} = \frac{\bar{\delta}\,\bar{x} - \frac{1}{T}\sum_{i=1}^{T}x_i\delta_i}{\bar{\delta}\bar{\eta} - 1} = \frac{\frac{1}{T}\sum_{i=1}^{T}\delta_i\left( \bar{x} - x_i \right)}{\bar{\delta}\bar{\eta} - 1} \quad (D.2)$$

Lastly, the closed form expression for Σ^(p+1) is at this point quite straightforward. One could substitute µ^(p+1) and γ^(p+1), but that would not make the closed form expression for the dispersion matrix any easier to review. Start with the differentiation:

$$\frac{dQ_1(\cdot)}{d\Sigma} = -\frac{T}{2}\Sigma + \frac{1}{2}\sum_{i=1}^{T}\delta_i\,(x_i-\mu)(x_i-\mu)' - \frac{1}{2}\,\gamma\gamma'\sum_{i=1}^{T}\eta_i = 0$$

The next step is to reorder the expression such that Σ is a function of the other arguments:

$$\Sigma^{(p+1)} = \frac{1}{T}\sum_{i=1}^{T}\delta_i\,(x_i-\mu)(x_i-\mu)' - \bar{\eta}\,\gamma\gamma' \quad (D.3)$$
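For reference, the hypothetical sketch below evaluates these closed form updates in Matlab, under the assumption that X is the T-by-k data matrix and delta and eta are the E-step weight vectors of the current cycle.

```matlab
% Hypothetical sketch of the closed form M-step updates (D.1)-(D.3); X is the
% T-by-k data matrix, delta and eta the T-by-1 E-step weights of this cycle.
[T, k] = size(X);
dbar = mean(delta);  ebar = mean(eta);  xbar = mean(X)';
gam   = (dbar*xbar - X'*delta/T) / (dbar*ebar - 1);            % eq. (D.2)
mu    = (X'*delta/T - gam) / dbar;                             % eq. (D.1)
Xc    = bsxfun(@minus, X, mu');                                % centered observations
Sigma = (Xc' * bsxfun(@times, Xc, delta))/T - ebar*(gam*gam'); % eq. (D.3)
```

Note that γ is computed first because D.1 depends on the updated γ, while D.2 depends only on the data and the weights.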

Appendix E

Proof of the alternative maximization function Q₂

This appendix proves the alternative first order condition 4.18 in ϑ, given as

$$\bar{\delta}^{(p)}\bar{\eta}^{(p)}\,K_{\lambda}^2\!\left(\vartheta^{(p)}\right)\vartheta^{(p)} + 2\lambda\,K_{\lambda+1}\!\left(\vartheta^{(p)}\right)K_{\lambda}\!\left(\vartheta^{(p)}\right) - \vartheta^{(p)}K_{\lambda+1}^2\!\left(\vartheta^{(p)}\right) = 0$$

To keep the derivation readable, the superscript indicating the cycle (p) is not reported from this point on, but the reader should be aware that the result does depend on the current cycle. Let the conditional expectations in 4.9 be replaced by 4.10, 4.11 and 4.12, such that Q₂(λ, χ, ψ) is given as

$$Q_2(\lambda, \chi, \psi) = (\lambda-1)\sum_{i=1}^{T}\xi_i - \frac{\chi}{2}\sum_{i=1}^{T}\delta_i - \frac{\psi}{2}\sum_{i=1}^{T}\eta_i - \frac{\lambda T}{2}\log\chi + \frac{\lambda T}{2}\log\psi - T\log\left[ 2K_{\lambda}\!\left(\sqrt{\chi\psi}\right) \right]$$

Maximizing this expression requires the derivatives with respect to χ and ψ, and therefore the derivative of the modified Bessel function of the third kind with index λ (Efunda and Mathematica, 2010):

$$\frac{d\log K_{\lambda}(x)}{dx} = \frac{\lambda}{x} - \frac{K_{\lambda+1}(x)}{K_{\lambda}(x)} \quad (E.1)$$

Differentiating Q₂(·) with respect to χ and incorporating the derivative of the Bessel function results in the first order condition for χ:

$$\frac{dQ_2(\cdot)}{d\chi} = -\frac{1}{2}\sum_{i=1}^{T}\delta_i - \frac{T\lambda}{2\chi} - T\left[ \frac{\lambda}{\sqrt{\chi\psi}} - \frac{K_{\lambda+1}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} \right]\frac{\psi}{2\sqrt{\chi\psi}} = 0$$

which simplifies to

$$\bar{\delta} + \frac{2\lambda}{\chi} - \sqrt{\frac{\psi}{\chi}}\,\frac{K_{\lambda+1}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} = 0 \quad (E.2)$$

Using the same procedure, differentiating Q₂(·) with respect to ψ results in the first order condition for ψ:

$$\sqrt{\frac{\chi}{\psi}}\,\frac{K_{\lambda+1}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} - \bar{\eta} = 0 \quad (E.3)$$

Both derivatives still contain the modified Bessel function, which depends on both parameters χ and ψ. This implies that it is impossible to find two separate closed form updating expressions. A simple workaround resolves this problem by introducing, after taking both derivatives, ϑ = √(χψ). From E.3,

$$\sqrt{\chi} = \sqrt{\psi}\,\bar{\eta}\,\frac{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda+1}\!\left(\sqrt{\chi\psi}\right)} \;\Longrightarrow\; \chi = \vartheta\,\bar{\eta}\,\frac{K_{\lambda}(\vartheta)}{K_{\lambda+1}(\vartheta)} \quad (E.4)$$

Substituting E.4 into E.2 eventually results, after some simplification, in

$$\bar{\delta} + \frac{2\lambda\,K_{\lambda+1}(\vartheta)}{\vartheta\,\bar{\eta}\,K_{\lambda}(\vartheta)} - \frac{K_{\lambda+1}^2(\vartheta)}{\bar{\eta}\,K_{\lambda}^2(\vartheta)} = 0$$

Multiplying through by $\bar{\eta}\,\vartheta\,K_{\lambda}^2(\vartheta)$ yields a first order condition that no longer depends on χ or ψ separately:

$$\bar{\delta}^{(p)}\bar{\eta}^{(p)}\,K_{\lambda}^2\!\left(\vartheta^{(p)}\right)\vartheta^{(p)} + 2\lambda\,K_{\lambda+1}\!\left(\vartheta^{(p)}\right)K_{\lambda}\!\left(\vartheta^{(p)}\right) - \vartheta^{(p)}K_{\lambda+1}^2\!\left(\vartheta^{(p)}\right) = 0 \quad (E.5)$$

This last step results in a single first order condition that is numerically optimized for ϑ. Once the value of ϑ is known, χ follows from E.4 and ψ from ϑ²/χ.
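A minimal sketch of this numerical optimization, assuming the current-cycle averages dbar and ebar and the fixed index lam are given, uses Matlab's fzero root finder and besselk:

```matlab
% Hypothetical sketch: solving the first order condition (E.5) for theta with
% fzero, given the current-cycle averages dbar and ebar and the fixed index lam.
foc = @(th) dbar*ebar*besselk(lam,th).^2.*th ...
          + 2*lam*besselk(lam+1,th).*besselk(lam,th) ...
          - th.*besselk(lam+1,th).^2;
theta = fzero(foc, 1.0);                                      % root search from a guess
chi   = theta*ebar*besselk(lam,theta)/besselk(lam+1,theta);   % eq. (E.4)
psi   = theta^2 / chi;                                        % since theta = sqrt(chi*psi)
```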

Appendix F

GUI layout

Figure F.1: The developed Matlab GUI input program. The program gives the option to estimate the DCC-MGARCH model, to calibrate the MGHyp for the full sample size or to perform the backtesting analysis. If desired, one can supply parameter inputs for the Normal-Mean-Variance-Mixture. A new window opens for the input.

Figure F.2: The GUI output screen, displayed after backtesting. It shows the time series analysis with the violations and the coverage, independence and Christoffersen conditional test statistics. It also shows the MSE value and the nominal violation percentage β₀ as well as the empirically found β̂. If desired, the user can click the asset data button to view empirical data statistics such as the mean, variance and Jarque-Bera statistic.


Time Series Modeling of Financial Data. Prof. Daniel P. Palomar

Time Series Modeling of Financial Data. Prof. Daniel P. Palomar Time Series Modeling of Financial Data Prof. Daniel P. Palomar The Hong Kong University of Science and Technology (HKUST) MAFS6010R- Portfolio Optimization with R MSc in Financial Mathematics Fall 2018-19,

More information

DSGE Methods. Estimation of DSGE models: GMM and Indirect Inference. Willi Mutschler, M.Sc.

DSGE Methods. Estimation of DSGE models: GMM and Indirect Inference. Willi Mutschler, M.Sc. DSGE Methods Estimation of DSGE models: GMM and Indirect Inference Willi Mutschler, M.Sc. Institute of Econometrics and Economic Statistics University of Münster willi.mutschler@wiwi.uni-muenster.de Summer

More information

Multivariate ARMA Processes

Multivariate ARMA Processes LECTURE 8 Multivariate ARMA Processes A vector y(t) of n elements is said to follow an n-variate ARMA process of orders p and q if it satisfies the equation (1) A 0 y(t) + A 1 y(t 1) + + A p y(t p) = M

More information

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8]

Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] 1 Multivariate Time Series Analysis and Its Applications [Tsay (2005), chapter 8] Insights: Price movements in one market can spread easily and instantly to another market [economic globalization and internet

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

The Expectation Maximization Algorithm

The Expectation Maximization Algorithm The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining

More information

Essays in Financial Econometrics: GMM and Conditional Heteroscedasticity

Essays in Financial Econometrics: GMM and Conditional Heteroscedasticity Essays in Financial Econometrics: GMM and Conditional Heteroscedasticity Mike Aguilar A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

Economic modelling and forecasting

Economic modelling and forecasting Economic modelling and forecasting 2-6 February 2015 Bank of England he generalised method of moments Ole Rummel Adviser, CCBS at the Bank of England ole.rummel@bankofengland.co.uk Outline Classical estimation

More information

Inference in VARs with Conditional Heteroskedasticity of Unknown Form

Inference in VARs with Conditional Heteroskedasticity of Unknown Form Inference in VARs with Conditional Heteroskedasticity of Unknown Form Ralf Brüggemann a Carsten Jentsch b Carsten Trenkler c University of Konstanz University of Mannheim University of Mannheim IAB Nuremberg

More information

DSGE-Models. Limited Information Estimation General Method of Moments and Indirect Inference

DSGE-Models. Limited Information Estimation General Method of Moments and Indirect Inference DSGE-Models General Method of Moments and Indirect Inference Dr. Andrea Beccarini Willi Mutschler, M.Sc. Institute of Econometrics and Economic Statistics University of Münster willi.mutschler@uni-muenster.de

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

CARF Working Paper CARF-F-156. Do We Really Need Both BEKK and DCC? A Tale of Two Covariance Models

CARF Working Paper CARF-F-156. Do We Really Need Both BEKK and DCC? A Tale of Two Covariance Models CARF Working Paper CARF-F-156 Do We Really Need Both BEKK and DCC? A Tale of Two Covariance Models Massimiliano Caporin Università degli Studi di Padova Michael McAleer Erasmus University Rotterdam Tinbergen

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Copulas, a novel approach to model spatial and spatio-temporal dependence

Copulas, a novel approach to model spatial and spatio-temporal dependence Copulas, a novel approach to model spatial and spatio-temporal dependence Benedikt Gräler 1, Hannes Kazianka 2, Giovana Mira de Espindola 3 1 Institute for Geoinformatics, University of Münster, Germany

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

The Realized RSDC model

The Realized RSDC model The Realized RSDC model Denis Pelletier North Carolina State University and Aymard Kassi North Carolina State University Current version: March 25, 24 Incomplete and early draft. Abstract We introduce

More information

Cointegrated VAR s. Eduardo Rossi University of Pavia. November Rossi Cointegrated VAR s Financial Econometrics / 56

Cointegrated VAR s. Eduardo Rossi University of Pavia. November Rossi Cointegrated VAR s Financial Econometrics / 56 Cointegrated VAR s Eduardo Rossi University of Pavia November 2013 Rossi Cointegrated VAR s Financial Econometrics - 2013 1 / 56 VAR y t = (y 1t,..., y nt ) is (n 1) vector. y t VAR(p): Φ(L)y t = ɛ t The

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

PACKAGE LMest FOR LATENT MARKOV ANALYSIS

PACKAGE LMest FOR LATENT MARKOV ANALYSIS PACKAGE LMest FOR LATENT MARKOV ANALYSIS OF LONGITUDINAL CATEGORICAL DATA Francesco Bartolucci 1, Silvia Pandofi 1, and Fulvia Pennoni 2 1 Department of Economics, University of Perugia (e-mail: francesco.bartolucci@unipg.it,

More information

Backtesting Marginal Expected Shortfall and Related Systemic Risk Measures

Backtesting Marginal Expected Shortfall and Related Systemic Risk Measures Backtesting Marginal Expected Shortfall and Related Systemic Risk Measures Denisa Banulescu 1 Christophe Hurlin 1 Jérémy Leymarie 1 Olivier Scaillet 2 1 University of Orleans 2 University of Geneva & Swiss

More information

Bayesian Analysis of Latent Variable Models using Mplus

Bayesian Analysis of Latent Variable Models using Mplus Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are

More information

Revisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data

Revisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data Revisiting linear and non-linear methodologies for time series - application to ESTSP 08 competition data Madalina Olteanu Universite Paris 1 - SAMOS CES 90 Rue de Tolbiac, 75013 Paris - France Abstract.

More information

Multivariate Volatility, Dependence and Copulas

Multivariate Volatility, Dependence and Copulas Chapter 9 Multivariate Volatility, Dependence and Copulas Multivariate modeling is in many ways similar to modeling the volatility of a single asset. The primary challenges which arise in the multivariate

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

Accounting for Missing Values in Score- Driven Time-Varying Parameter Models

Accounting for Missing Values in Score- Driven Time-Varying Parameter Models TI 2016-067/IV Tinbergen Institute Discussion Paper Accounting for Missing Values in Score- Driven Time-Varying Parameter Models André Lucas Anne Opschoor Julia Schaumburg Faculty of Economics and Business

More information