Universitetet i Oslo ingrihaf@math.uio.no International FocuStat Workshop on Focused Information Criteria and Related Themes, May 9-11, 2016
Outline: Copulae; Regular vines; Model selection and reduction; Limitations and challenges
Copulae Remember that if the continuous variable $X$ has the cdf $F_X$, then $U = F_X(X)$ is uniformly distributed on $[0, 1]$. In many cases, it is more convenient or natural to study or model a transformation of the data, e.g. $\log(x)$. In the copula world, one transforms each variable $X_i$ with its own cdf $F_i$. For continuous variables, this is called the probability integral transformation (PIT), and the resulting variables $U_i = F_i(X_i)$ follow a uniform distribution on $[0, 1]$.
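The PIT can be checked numerically. The following is a minimal sketch, assuming an exponential margin with an arbitrarily chosen rate (both are illustrative choices, not taken from the slides): after transforming with the true cdf, the sample mean and variance should be close to those of a uniform variable, 1/2 and 1/12.

```python
import math
import random


def exponential_cdf(x, rate):
    """Cdf of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)


random.seed(1)
rate = 2.0  # assumed parameter, chosen only for illustration
x = [random.expovariate(rate) for _ in range(10_000)]

# Probability integral transformation: U = F_X(X)
u = [exponential_cdf(xi, rate) for xi in x]

mean_u = sum(u) / len(u)
var_u = sum((ui - mean_u) ** 2 for ui in u) / len(u)
print(mean_u, var_u)  # should be close to 1/2 and 1/12
```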
Copulae Copulae are tools for constructing multivariate distributions. The idea behind the PIT is to isolate the individual (marginal) behaviour of the variables, so as to focus on their joint behaviour. Hence, a multivariate distribution can be split into the univariate margins and a dependence structure. This dependence structure is called a copula. Definition: A copula $C$ is a multivariate distribution with uniform margins $U[0, 1]$.
Sklar's theorem [Sklar, 1959] Let $X_1, \ldots, X_d$ follow the joint distribution $F_{1 \ldots d}$ with margins $F_1, \ldots, F_d$. Then there exists a function $C_{1 \ldots d}$ such that
\[ F_{1 \ldots d}(x_1, \ldots, x_d) = C_{1 \ldots d}(F_1(x_1), \ldots, F_d(x_d)), \]
where $C_{1 \ldots d}$ is a copula. This is true for any multivariate distribution, whether continuous, discrete or a combination of the two. If $F_{1 \ldots d}$ is continuous, then the copula $C_{1 \ldots d}$ is unique.
Sklar's theorem When the margins $F_1, \ldots, F_d$ in addition are absolutely continuous and strictly increasing, one may express Sklar's theorem in terms of densities. Then
\[ f_{1 \ldots d}(x_1, \ldots, x_d) = c_{1 \ldots d}(F_1(x_1), \ldots, F_d(x_d)) \prod_{i=1}^{d} f_i(x_i), \qquad (1) \]
where $c_{1 \ldots d}$ is the density of $C_{1 \ldots d}$, that is
\[ c_{1 \ldots d} = \frac{\partial^d C_{1 \ldots d}}{\partial u_1 \cdots \partial u_d}, \]
and $f_{1 \ldots d}$ is the pdf corresponding to $F_{1 \ldots d}$.
Illustration If we take the bivariate standard normal density and divide it by the product of the univariate standard normal densities, we get the density of a bivariate Gaussian copula. [Figure: three panels. Bivariate standard normal density over (X, Y); the two univariate standard normal densities; the resulting Gaussian copula density over the unit square.]
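The division in this illustration can be written out directly. The sketch below is one possible way of doing it, using only the Python standard library; the function names and the correlation value are illustrative assumptions. It evaluates the bivariate standard normal density at the normal quantiles of (u, v) and divides by the product of the two univariate standard normal densities.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bivariate_normal_pdf(x, y, rho):
    """Density of the bivariate standard normal with correlation rho."""
    det = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def gaussian_copula_density(u, v, rho):
    """c(u, v) = f(x, y) / (phi(x) * phi(y)), with x, y the normal quantiles of u, v."""
    x = STD_NORMAL.inv_cdf(u)
    y = STD_NORMAL.inv_cdf(v)
    return bivariate_normal_pdf(x, y, rho) / (STD_NORMAL.pdf(x) * STD_NORMAL.pdf(y))


# At the centre of the unit square the density equals 1 / sqrt(1 - rho^2).
print(gaussian_copula_density(0.5, 0.5, 0.6))
```

With rho = 0 the function returns 1 everywhere, which is the density of the independence copula.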
Illustration If we take the Gaussian copula density and multiply it with the product of a beta density and a lognormal density, we get a bivariate density consisting of a Gaussian copula and beta and lognormal margins. [Figure: three panels. Gaussian copula density over the unit square; the two marginal densities, a beta density (panel "Credit", axes value and density) and a lognormal density (panel "Operational", axes value and density); the resulting bivariate density.]
Copulae flexible enough? For bivariate models ($d = 2$), there exists a long and varied list of copula families. As soon as $d \geq 3$, the catalogue of available copulae is significantly reduced [Genest et al., 2009]. Several of the well-known copulae generalise to higher dimensions. Unfortunately, their flexibility decreases with the dimension, which restricts the range of dependence they are able to reproduce.
Copulae flexible enough? Why not build a multivariate copula based merely on bivariate ones? That is precisely the idea behind pair-copula constructions, introduced by Joe [1997].
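Pair-copula constructions [Figure: a complete multivariate distribution, modelled with a single copula, next to a pair-copula construction assembled from bivariate copulae of different families (panel labels: Ga, Gu, F, M, C, t).]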
Building blocks The bivariate copulae constituting the construction need not belong to the same family. The resulting multivariate distribution will still be valid. One may for instance combine the following types of pair-copulae: Gumbel, Clayton, Gaussian and Student. [Figure: four panels, one per copula family, each with axes U and V on [0, 1].]
Pair-copula constructions (PCC) Let $X_1, X_2, X_3$ be stochastic variables with cdf $F_{123}$ and margins $F_1$, $F_2$ and $F_3$. Their pdf $f_{123}$ may be factorised as
\[ f_{123}(x_1, x_2, x_3) = f_3(x_3)\, f_{2|3}(x_2 \mid x_3)\, f_{1|23}(x_1 \mid x_2, x_3). \qquad (2) \]
Expressing (2) in terms of the marginal pdfs and pair-copula densities by the repeated use of (1), one obtains the corresponding PCC.
Pair-copula constructions (PCC) More specifically,
\[ f_{123}(x_1, x_2, x_3) = f_1(x_1) f_2(x_2) f_3(x_3)\, c_{13}(F_1(x_1), F_3(x_3))\, c_{23}(F_2(x_2), F_3(x_3))\, c_{12|3}(F_{1|3}(x_1 \mid x_3), F_{2|3}(x_2 \mid x_3)). \]
Since $f_{123} = f_1 f_2 f_3\, c_{123}$ by (1), then
\[ c_{123}(F_1(x_1), F_2(x_2), F_3(x_3)) = c_{13}(F_1(x_1), F_3(x_3))\, c_{23}(F_2(x_2), F_3(x_3))\, c_{12|3}(F_{1|3}(x_1 \mid x_3), F_{2|3}(x_2 \mid x_3)), \qquad (3) \]
where $c_{13}$, $c_{23}$ and $c_{12|3}$ are the copula densities corresponding to $F_{13}$, $F_{23}$ and $F_{12|3}$, respectively.
Pair-copula constructions (PCC) A five-dimensional copula may be decomposed as:
\[ \begin{aligned} c_{12345} = {} & c_{12}\, c_{23}\, c_{34}\, c_{45} && \text{(Level 1)} \\ & \cdot\, c_{13|2}\, c_{24|3}\, c_{35|4} && \text{(Level 2)} \\ & \cdot\, c_{14|23}\, c_{25|34} && \text{(Level 3)} \\ & \cdot\, c_{15|234} && \text{(Level 4).} \end{aligned} \]
The copulae are organised in levels according to the number of conditioning variables. Expression (3) is one of the three possible decompositions of $c_{123}$, while in the five-dimensional case, there are as many as 480 different constructions. To help categorise and build them, Bedford and Cooke [2001, 2002] and Kurowicka and Cooke [2006] introduced the graphical models called vines.
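The level structure of this particular decomposition follows a simple index pattern: on level l, variable i is paired with variable i + l, conditioned on the variables in between. The sketch below generates the indices for that pattern only (it corresponds to the D-vine shown later); the function names and the label format are illustrative assumptions, and the labels are only readable for d up to 9.

```python
def dvine_pair_copulae(d):
    """Return, per level, the pair-copulae of a D-vine on variables 1..d.

    Each pair-copula is a tuple (i, j, conditioning), read as c_{ij|conditioning}.
    """
    levels = []
    for level in range(1, d):
        pairs = []
        for i in range(1, d - level + 1):
            j = i + level
            conditioning = tuple(range(i + 1, j))  # variables strictly between i and j
            pairs.append((i, j, conditioning))
        levels.append(pairs)
    return levels


def label(pair):
    """Format a pair-copula as e.g. 'c_14|23'."""
    i, j, cond = pair
    cond_str = "|" + "".join(map(str, cond)) if cond else ""
    return f"c_{i}{j}{cond_str}"


for level, pairs in enumerate(dvine_pair_copulae(5), start=1):
    print(f"Level {level}:", " ".join(label(p) for p in pairs))
```

For d = 5 this reproduces the four levels listed above, with 4 + 3 + 2 + 1 = 10 pair-copulae in total.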
Vines in 5 dimensions [Figure: a D-vine and a C-vine on five variables, each drawn as a sequence of trees T1 to T4. D-vine: T1 has nodes 1, 2, 3, 4, 5 and edges 12, 23, 34, 45; T2 has edges 13|2, 24|3, 35|4; T3 has edges 14|23, 25|34; T4 has the edge 15|234. C-vine: T1 has edges 12, 13, 14, 15; T2 has edges 23|1, 24|1, 25|1; T3 has edges 34|12, 35|12; T4 has the edge 45|123.]
Regular vine [Figure: a regular vine on seven variables, drawn as trees T1 to T6. T1: edges 6,5; 5,1; 2,1; 7,5; 3,1; 4,3. T2: edges 6,1|5; 5,3|1; 3,2|1; 1,7|5; 4,1|3. T3: edges 6,3|15; 5,2|31; 7,3|15; 4,2|13. T4: edges 6,2|315; 5,4|231; 7,2|315. T5: edges 7,6|2315; 6,4|2315. T6: edge 7,4|62315.]
Regular vines Many of the pair-copula arguments are conditional distributions. These can be evaluated using a recursive formula [Joe, 1996]:
\[ F(x \mid v) = \frac{\partial\, C_{x v_j | v_{-j}}\big(F(x \mid v_{-j}),\, F(v_j \mid v_{-j})\big)}{\partial F(v_j \mid v_{-j})}, \]
where $v_j$ is one component of the conditioning vector $v$ and $v_{-j}$ denotes $v$ with $v_j$ removed. In regular vines (R-vines), the copulae in question are, by construction, always present in the preceding levels of the structure. Inference on PCCs is in general demanding, whereas the subclass of R-vines has many appealing computational properties.
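For a Gaussian pair-copula the partial derivative in the recursion has a closed form, often called an h-function. The sketch below assumes Gaussian pair-copulae throughout and a three-dimensional D-vine with pairs 12, 23 and 13|2; the function names and parameters are illustrative. It evaluates F(x1 | x2, x3) by applying the recursion once, using conditional distributions obtained from the first level.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def h_gaussian(u, v, rho):
    """Conditional cdf F(u | v) under a Gaussian pair-copula, i.e. dC(u, v)/dv."""
    x = STD_NORMAL.inv_cdf(u)
    y = STD_NORMAL.inv_cdf(v)
    return STD_NORMAL.cdf((x - rho * y) / math.sqrt(1.0 - rho * rho))


def conditional_cdf_1_given_23(u1, u2, u3, rho12, rho23, rho13_2):
    """F(x1 | x2, x3) in a three-dimensional D-vine with Gaussian pair-copulae."""
    u1_given_2 = h_gaussian(u1, u2, rho12)  # F(x1 | x2), from level 1
    u3_given_2 = h_gaussian(u3, u2, rho23)  # F(x3 | x2), from level 1
    # Recursion with v = (x2, x3), v_j = x3 and v_{-j} = x2, using c_{13|2} from level 2
    return h_gaussian(u1_given_2, u3_given_2, rho13_2)


print(conditional_cdf_1_given_23(0.3, 0.6, 0.8, rho12=0.5, rho23=0.4, rho13_2=0.2))
```

Both arguments of the level-2 copula are outputs of level-1 copulae, which is the property of R-vines mentioned above.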
Vine matrix Dißmann et al. [2013] have proposed an efficient way of storing the indices involved in the pair-copulae in a lower triangular matrix:
\[ c_{12345} = c_{12}\, c_{23}\, c_{34}\, c_{45}\, c_{13|2}\, c_{24|3}\, c_{35|4}\, c_{14|23}\, c_{25|34}\, c_{15|234} \qquad \begin{pmatrix} 1 & & & & \\ 5 & 2 & & & \\ 4 & 5 & 3 & & \\ 3 & 4 & 5 & 4 & \\ 2 & 3 & 4 & 5 & 5 \end{pmatrix} \]
The density of the R-vine may then be written in terms of the indices of this matrix.
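One way to read the pair-copulae off such a matrix is sketched below. It assumes the following reading convention, which reproduces the ten pair-copulae of the example: in each column, the diagonal entry is paired with an entry below it, conditioned on all entries further down in the same column. The function name and the list-of-rows representation are illustrative choices.

```python
# Lower triangular vine matrix of the five-dimensional example, stored row by row.
M = [
    [1],
    [5, 2],
    [4, 5, 3],
    [3, 4, 5, 4],
    [2, 3, 4, 5, 5],
]


def pair_copulae_from_matrix(matrix):
    """Return tuples (i, j, conditioning) for every pair-copula encoded in the matrix."""
    d = len(matrix)
    pairs = []
    for col in range(d - 1):
        diag = matrix[col][col]
        for row in range(col + 1, d):
            partner = matrix[row][col]
            # Conditioning set: the entries below `partner` in the same column
            conditioning = tuple(sorted(matrix[r][col] for r in range(row + 1, d)))
            pairs.append((diag, partner, conditioning))
    return pairs


for i, j, cond in pair_copulae_from_matrix(M):
    print(i, j, cond)
```

The first column, for example, yields (1, 5 | 2, 3, 4), (1, 4 | 2, 3), (1, 3 | 2) and (1, 2).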
Vine inference Inference on these constructions requires (i) the choice of structure, (ii) the choice of each pair-copula type, and (iii) the estimation of the copula parameters. In principle, these three steps should be performed simultaneously. Gruber and Czado [2016] have proposed a Bayesian method for doing this, but its computational complexity makes it infeasible in medium to high dimensions. In practice, the three inference steps are therefore performed sequentially.
Parameter estimation An R-vine is a special type of multivariate copula. When its structure and copula types are given, one may in principle use any estimator for multivariate copulae to estimate its parameters. The model consists of an R-vine with parameters $\theta$, combined with univariate margins with parameters $\alpha$. Even for rather low dimensions, the total number of parameters is high. In higher dimensions, one therefore performs the estimation in several steps.
Parameter estimation The log-likelihood function can be written as
\[ l(\alpha, \theta; x_1, \ldots, x_n) = l_M(\alpha; x_1, \ldots, x_n) + l_C(\theta; u_1(\alpha), \ldots, u_n(\alpha)), \]
where $u_k(\alpha) = (F_1(x_{1k}; \alpha), \ldots, F_d(x_{dk}; \alpha))$. The terms of $l_C$ can be grouped according to the level they belong to, and the terms for level $l$ depend on the copulae from the levels $1, \ldots, l$, but not the ones after. The copula parameters may therefore be estimated level by level, or even copula by copula if none of the copulae share parameters. The state-of-the-art is to
1. estimate $\alpha$ in a separate step,
2. estimate $\theta$ level by level, using $F_i(x_{ik}; \hat{\alpha})$ or $F_{in}(x_{ik}) = \frac{1}{n+1} \sum_{j=1}^{n} I(x_{ij} \leq x_{ik})$ as estimates of $u_{ki}(\alpha)$.
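The non-parametric option in step 2, the empirical cdf rescaled by n + 1, is easy to compute. A minimal sketch follows; the function name and the sample values are illustrative, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def pseudo_observations(x):
    """Rescaled empirical cdf: u_k = #{j : x_j <= x_k} / (n + 1)."""
    n = len(x)
    return [sum(1 for xj in x if xj <= xk) / (n + 1) for xk in x]


sample = [2.3, -0.7, 5.1, 0.4]  # made-up values for illustration
print(pseudo_observations(sample))  # [0.6, 0.2, 0.8, 0.4]
```

Dividing by n + 1 rather than n keeps all values strictly inside (0, 1), so that copula densities that are unbounded on the boundary can still be evaluated.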
Structure selection Two main types of structure selection strategies have been proposed: (i) building the vine top-down, with the aim of minimising the dependence in the top levels; (ii) building the vine bottom-up, with the aim of maximising the dependence in the first levels. A procedure of the first type, based on partial correlations, is suggested by Kurowicka [2011a]. Dißmann et al. [2013] propose a procedure of the second type based on Kendall's $\tau$ coefficients. The latter has become the state-of-the-art.
Structure selection A key to the algorithm of Dißmann et al. [2013] is that each level of an R-vine is a spanning tree. This is due to the proximity condition: two copulae from level $l$ can be combined into a copula on level $l + 1$ only if they share all variables but one. The algorithm is:
1. Estimate $\tau_{ij}$ for all pairs $\{i, j\} \subset \{1, \ldots, d\}$.
2. Select the spanning tree $T_1$ that maximises $\sum_{\{i,j\} \in T_1} |\hat{\tau}_{ij}|$.
3. For levels $l = 2, \ldots, d - 2$:
a. Estimate $\tau_{ij|v}$ for all pairs $\{i, j\}$ with conditioning set $v$ that fulfil the proximity condition.
b. Select the spanning tree $T_l$ that maximises $\sum_{\{i,j|v\} \in T_l} |\hat{\tau}_{ij|v}|$.
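Steps 1 and 2 of the algorithm can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the implementation of Dißmann et al.: Kendall's tau is computed by brute force and assumes no ties, the edge weights are the absolute values of the estimates, and the maximum spanning tree is found with Kruskal's algorithm. All function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Empirical Kendall's tau (tau-a; assumes no ties)."""
    n = len(x)
    s = 0
    for a, b in combinations(range(n), 2):
        prod = (x[a] - x[b]) * (y[a] - y[b])
        s += (prod > 0) - (prod < 0)  # +1 if concordant, -1 if discordant
    return 2.0 * s / (n * (n - 1))


def maximum_spanning_tree(weights):
    """Kruskal's algorithm on a dict {(i, j): weight}; returns the selected edges."""
    parent = {v: v for edge in weights for v in edge}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so no cycle is created
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


def first_tree(data):
    """data: dict {variable: list of observations}. Returns the edges of T_1."""
    weights = {
        (i, j): abs(kendall_tau(data[i], data[j]))
        for i, j in combinations(sorted(data), 2)
    }
    return maximum_spanning_tree(weights)
```

For the higher levels the same tree search applies, but the candidate edges are restricted to pairs that fulfil the proximity condition and the weights are conditional Kendall's tau estimates.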
Structure selection We wish to construct an R-vine on five variables.
Level 1: For all 10 pairs $\{i, j\}$, we estimate $\tau_{ij}$. There are 125 possible spanning trees. Assume that the winner tree has the edges 15, 25, 24, 35.
Level 2: There are now 3 possible spanning trees and 4 conditional Kendall's $\tau$s to estimate: $\tau_{12|5}$, $\tau_{13|5}$, $\tau_{23|5}$, $\tau_{45|2}$. Assume that the winner tree has the edges 12|5, 45|2, 23|5.
Level 3: There are now 3 possible spanning trees and 3 conditional Kendall's $\tau$s to estimate: $\tau_{13|25}$, $\tau_{14|25}$, $\tau_{34|25}$. Assume that the winner tree has the edges 14|25, 34|25.
Level 4: This level is always given by the previous ones.
Structure selection The unconditional Kendall's $\tau$s, needed to construct the first level of the vine, can be estimated empirically. From the second level, the conditional Kendall's $\tau$s, $\tau_{ij|v}$, are estimated semi-parametrically, as the empirical Kendall's $\tau$ of $\hat{u}_{i|v}$ and $\hat{u}_{j|v}$, which are estimated parametrically based on copulae from the previous level. This requires the simultaneous choice of copula types and parameter estimation. Common practice is to select the type of each copula separately by
1. computing the AIC for a list of candidate copulae,
2. choosing the one with the best AIC.
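The AIC-based choice of copula type can be illustrated with a deliberately small candidate list. The sketch below makes several assumptions that are not in the slides: only the independence copula and the Gaussian copula are compared, the Gaussian parameter is estimated by the correlation of the normal scores rather than by maximum likelihood, and the data are simulated from a Gaussian copula with correlation 0.7. All names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_copula_loglik(u, v, rho):
    """Log-likelihood of a Gaussian pair-copula with correlation rho."""
    total = 0.0
    for ui, vi in zip(u, v):
        x, y = STD_NORMAL.inv_cdf(ui), STD_NORMAL.inv_cdf(vi)
        total += (-0.5 * math.log(1.0 - rho * rho)
                  - (rho * rho * (x * x + y * y) - 2.0 * rho * x * y)
                  / (2.0 * (1.0 - rho * rho)))
    return total


def normal_scores_correlation(u, v):
    """Plug-in estimate of rho: Pearson correlation of the normal scores."""
    x = [STD_NORMAL.inv_cdf(ui) for ui in u]
    y = [STD_NORMAL.inv_cdf(vi) for vi in v]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def aic(loglik, n_params):
    return 2.0 * n_params - 2.0 * loglik


def select_by_aic(u, v):
    """Return the name of the candidate with the lowest AIC, and the AIC table."""
    rho = normal_scores_correlation(u, v)
    table = {
        "independence": aic(0.0, 0),  # density 1, no parameters
        "gaussian": aic(gaussian_copula_loglik(u, v, rho), 1),
    }
    return min(table, key=table.get), table


# Simulated data from a Gaussian copula (illustration only).
random.seed(7)
true_rho = 0.7
u, v = [], []
for _ in range(500):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    u.append(STD_NORMAL.cdf(z1))
    v.append(STD_NORMAL.cdf(true_rho * z1 + math.sqrt(1.0 - true_rho ** 2) * z2))

best, table = select_by_aic(u, v)
print(best, table)
```

In a real application the candidate list would contain several parametric families, each fitted to the pseudo-observations of the pair in question.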
Model reduction A 20-dimensional (full) R-vine has at least 190 parameters. For a 50-dimensional one, the number is at least 1225. In high-dimensional applications, it is therefore necessary to reduce the number of parameters. One strategy is to identify independence copulae among the pair-copulae. When $c_{14|23}$ is an independence copula, it means that $X_1 \perp X_4 \mid X_2, X_3$ and $c_{14|23}(u, v) = 1$. There are two main methods for doing this: pruning and truncation (Kurowicka [2011b], Brechmann et al. [2012], Brechmann and Joe [2015]).
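The parameter counts follow from the number of pair-copulae: level l of an R-vine on d variables contains d - l pair-copulae, so a full vine has d(d - 1)/2 of them, each with at least one parameter unless it is an independence copula. A small sketch, with a hypothetical function name, that also shows the effect of truncation:

```python
def n_pair_copulae(d, truncation_level=None):
    """Number of pair-copulae in an R-vine on d variables.

    Level l holds d - l pair-copulae; a vine truncated after level k keeps levels 1..k.
    """
    k = d - 1 if truncation_level is None else truncation_level
    return sum(d - level for level in range(1, k + 1))


print(n_pair_copulae(20), n_pair_copulae(50))  # 190 1225
print(n_pair_copulae(50, truncation_level=3))  # 49 + 48 + 47 = 144
```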
Model reduction Pruning consists in testing each of the copulae in the construction for independence. Typically, $C_{ij|v}$ is tested for independence by testing whether $\tau_{ij|v}$ is significantly different from 0. Truncation consists in finding a level after which all copulae can be set to independence. Starting with a one-level vine, truncation is performed as follows:
1. test whether one extra level of copulae makes the model significantly better,
2. if yes and the number of levels is $< d - 1$, return to 1,
3. else return the truncation level.
The log-likelihood ratio test of Vuong [1989] for non-nested hypotheses is used as a criterion in step 1. The structure of each new level is selected using the algorithm of Dißmann et al. [2013].
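The criterion in step 1 can be sketched as follows. This is the basic, uncorrected form of the Vuong statistic, computed from the pointwise log-likelihoods of the two competing models; corrections for the number of parameters are left out. The function names, the significance level and the decision rule based on the two-sided critical value are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def vuong_statistic(loglik_a, loglik_b):
    """Vuong statistic from the pointwise log-likelihoods of models A and B.

    Asymptotically standard normal when the two models are equally close to the truth.
    """
    n = len(loglik_a)
    m = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((mi - mean_m) ** 2 for mi in m) / n)
    return math.sqrt(n) * mean_m / sd_m


def prefers_larger_model(loglik_larger, loglik_smaller, alpha=0.05):
    """True if the vine with one extra level is significantly better at level alpha."""
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return vuong_statistic(loglik_larger, loglik_smaller) > critical
```

In the truncation loop, `loglik_larger` would hold the per-observation log-likelihoods of the vine with k + 1 levels and `loglik_smaller` those of the vine with k levels.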
Model reduction [Figure: the pair-copulae of the five-dimensional decomposition, arranged by level ($c_{12}\, c_{23}\, c_{34}\, c_{45}$; $c_{13|2}\, c_{24|3}\, c_{35|4}$; $c_{14|23}\, c_{25|34}$; $c_{15|234}$), shown in three panels, two of which are labelled Pruning and Truncation.]
Example 1: compound events In climatology, a compound event denotes an extreme event that is caused by a combination of climate and weather variables which are not necessarily in an extreme state themselves. In this setting, it is very important to model the dependence between the various variables, especially in the tails. Vines have been used to model the relationship between sea surge and water levels in rivers running to the coast in question. The vine was selected based on the standard vine selection algorithm. A closer inspection showed that the AIC values for the top five copulae were almost the same. The tail behaviour of these copulae was however widely different.
Example 2: abalone data The data originate from a study by the Tasmanian Aquaculture and Fisheries Institute. The harvest of abalones is subject to quotas. These quotas are based on the age distribution of the abalones. Determining an abalone's age is a highly time-consuming task. Hence, one would like to predict the age based on physical measurements, such as weight and height.
Example 2: abalone data The Abalone data set was originally used for this purpose, and consists of 4,177 samples of: 1. Sex, 2. Length, 3. Diameter, 4. Height, 5. Whole weight, 6. Shucked weight, 7. Viscera weight, 8. Shell weight, 9. Age. A vine model was used to estimate the conditional distribution of age given the other variables. The standard vine selection and truncation algorithms, combined with pruning, resulted in a vine with 5 levels and no independence copulae below this level. The estimated conditional distribution based on a one-level vine (with 7 parameters) is almost the same as the one based on the selected, best vine (with 25 parameters).
Limitations Most of the mentioned inference methods are heuristic. The selection and reduction methods: do not take into account the intended use of the model; are performed level by level; are conditioned on the choices of copula types in preceding levels. The truncation approach: relies heavily on the model selection algorithm; only considers whether all copulae after a certain level should be independence. The method for choosing copula types: does not take into account the intended use of the model; is based on AIC, usually combined with semi-parametric estimation, a combination which has been shown to be incorrect [Grønneberg and Hjort, 2014].
Challenges What should the benchmark model be? The number of possible R-vine structures for a given data set is huge even for medium dimensions ($2^{\binom{d-2}{2} - 1}\, d!$ for $d$ variables). When combined with all possible combinations of copula types from even a moderately long list of candidates, the number of possible vines becomes gargantuan. A smart (greedy) search algorithm for proposing candidate models is therefore necessary. Perhaps one could do the selection in two steps: (i) select the structure for an R-vine consisting of non-parametric copulae; (ii) select the parametric copula types when the structure is fixed.
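The growth of the number of structures is easy to tabulate. The sketch below evaluates the count of R-vine structures on d variables, d! times 2 to the power of (d - 2 choose 2) minus 1, and recovers the 3 decompositions for three variables and the 480 constructions for five variables mentioned earlier. The function name is an illustrative choice.

```python
from math import comb, factorial


def n_rvine_structures(d):
    """Number of R-vine structures on d >= 3 variables: 2^(C(d-2, 2) - 1) * d!."""
    # Integer arithmetic: d! is even for d >= 2, so the division by 2 is exact.
    return factorial(d) * 2 ** comb(d - 2, 2) // 2


for d in (3, 4, 5, 10):
    print(d, n_rvine_structures(d))
```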
Challenges Parameters/measures related to the vine rarely have closed form expressions. The computation of potential focus parameters will generally require Monte Carlo methods. To make a focussed selection criterion computationally efficient, one therefore needs to find good approximations to the focus parameter.
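As an illustration of why Monte Carlo methods enter, consider a hypothetical focus parameter: the probability that two variables simultaneously exceed their 95% quantiles. The sketch below estimates it by simulation under a single Gaussian pair-copula, which is an assumption made to keep the example short; for a full vine the sampling step would be replaced by a vine sampler. The function name, the threshold and the number of simulations are all illustrative.

```python
import math
import random
from statistics import NormalDist


def joint_exceedance_mc(rho, level=0.95, n_sim=100_000, seed=0):
    """Monte Carlo estimate of P(U1 > level, U2 > level) under a Gaussian copula."""
    rng = random.Random(seed)
    q = NormalDist().inv_cdf(level)  # exceeding `level` on the uniform scale = exceeding q on the normal scale
    scale = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(n_sim):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)
        hits += (z1 > q) and (z2 > q)
    return hits / n_sim


print(joint_exceedance_mc(0.0))  # independence: close to 0.05 ** 2 = 0.0025
print(joint_exceedance_mc(0.9))  # strong dependence: noticeably larger
```

Every candidate model in a focussed selection would need such a simulation, which is why cheap approximations to the focus parameter are attractive.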
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris, 8, 1959.
C. Genest, H. U. Gerber, M. J. Goovaerts, and R. J. A. Laeven. Editorial to the special issue on modeling and measurement of multivariate risk in insurance and finance. Insurance: Mathematics and Economics, 44(2), 2009.
H. Joe. Multivariate Models and Dependence Concepts. Chapman & Hall, London, 1997.
T. Bedford and R. M. Cooke. Probabilistic density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial Intelligence, 32:245-268, 2001.
T. Bedford and R. M. Cooke. Vines - a new graphical model for dependent random variables. Annals of Statistics, 30(4):1031-1068, 2002.
D. Kurowicka and R. M. Cooke. Uncertainty Analysis with High Dimensional Dependence Modelling. Wiley, New York, 2006.
H. Joe. Families of m-variate distributions with given margins and m(m-1)/2 dependence parameters. In Distributions with Fixed Marginals and Related Topics. IMS, Hayward, CA, 1996.
J. Dißmann, E. C. Brechmann, C. Czado, and D. Kurowicka. Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics and Data Analysis, 59:52-69, 2013.
L. F. Gruber and C. Czado. Bayesian model selection of regular vine copulas. Working paper, 2016.
D. Kurowicka. Optimal truncation of vines. In Dependence Modeling: Vine Copula Handbook, pages 233-248. World Scientific Publishing Co., 2011a.
D. Kurowicka. Optimal truncation of vines. In D. Kurowicka and H. Joe, editors, Dependence Modeling: Vine Copula Handbook. World Scientific Publishing Co., 2011b.
E. C. Brechmann, C. Czado, and K. Aas. Truncated regular vines in high dimensions with application to financial data. Canadian Journal of Statistics, 40:68-85, 2012.
E. C. Brechmann and H. Joe. Truncation of vine copulas using fit indices. Journal of Multivariate Analysis, 138:19-33, 2015.
Q. H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57:307-333, 1989.
S. Grønneberg and N. L. Hjort. The copula information criteria. Scandinavian Journal of Statistics, 41:436-459, 2014.