Penalized Exponential Series Estimation of Copula Densities


Ximing Wu

Abstract

The exponential series density estimator is well suited to copula density estimation: it is strictly positive, explicitly defined on a bounded support, and largely mitigates the boundary bias problem. However, the selection of basis functions is challenging and can cause numerical difficulties, especially for high dimensional density estimation. To avoid the issues associated with basis function selection, we adopt the strategy of regularization: we employ a relatively large basis and penalize the roughness of the resulting model, which leads to a penalized maximum likelihood estimator. To further reduce the computational cost, we propose an approximate likelihood cross validation method for the selection of the smoothing parameter. Our extensive Monte Carlo simulations demonstrate the effectiveness of the proposed estimator for copula density estimation.

Department of Agricultural Economics, Texas A&M University. xwu@tamu.edu. I gratefully acknowledge the Supercomputing Facility of Texas A&M University, where all computations in this study were performed.

1 Introduction

This paper proposes a penalized maximum likelihood estimator for copula densities via the exponential series estimator for multivariate densities introduced in Wu (2010). Consider a d-dimensional random variable x with joint distribution function F. In his seminal paper, Sklar (1959) shows that, via a change of variable argument, the joint distribution can be written as

$$F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d)), \qquad (1)$$

where $F_j(x_j)$, $j = 1, \ldots, d$, is the marginal distribution of the jth element of x. The function $C(\cdot)$, the so-called copula function, completely summarizes the dependence structure among the elements of x. When the margins are continuous, the copula function is unique. Thus a multivariate distribution can be completely described by its copula and its univariate marginal distributions.

Suppose F is differentiable with density function f. Taking derivatives of both sides of (1) yields

$$f(x_1, \ldots, x_d) = f_1(x_1) \cdots f_d(x_d)\, c(F_1(x_1), \ldots, F_d(x_d)), \qquad (2)$$

where $f_j(\cdot)$ is the marginal density of $x_j$, $j = 1, \ldots, d$, and $c(\cdot)$ is the copula density function, which is itself a density function defined on the unit cube $[0,1]^d$. For a detailed treatment of the mathematical properties of copulas, see Nelsen (2010).

The copula is a useful device for two reasons. First, it provides a way of studying scale-free dependence structure. By writing a joint density function as the product of the marginal densities and the copula density, one can separate the influence of the marginal densities from that of the dependence structure. The dependence structure captured by the copula is scale free and invariant to monotone transformations. In fact, many well known measures of dependence, including Kendall's τ and Spearman's ρ, can be calculated from the copula function alone. Second, the copula is a starting point for constructing families of multivariate distributions. It allows us to divide multivariate density estimation into two parts: univariate density estimation of the margins and estimation of the copula.
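As a concrete illustration of the factorization (2), the following sketch evaluates a bivariate Gaussian density as the product of its standard normal margins and the Gaussian copula density. It assumes SciPy is available; the correlation value and the evaluation point are arbitrary choices made for the example, not quantities from the paper.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rho = 0.5                                # example dependence parameter
u, v = 0.3, 0.7                          # a point in the unit square
x, y = norm.ppf(u), norm.ppf(v)          # quantile transforms x_j = F_j^{-1}(u_j)

# Gaussian copula density: c(u, v) = phi_rho(x, y) / (phi(x) * phi(y))
cov = [[1.0, rho], [rho, 1.0]]
c_uv = multivariate_normal.pdf([x, y], mean=[0.0, 0.0], cov=cov) \
       / (norm.pdf(x) * norm.pdf(y))

# Equation (2): marginal densities times the copula density recover the joint
f_joint = norm.pdf(x) * norm.pdf(y) * c_uv
assert np.isclose(f_joint,
                  multivariate_normal.pdf([x, y], mean=[0.0, 0.0], cov=cov))
```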

Like ordinary density functions, copula densities can be estimated by parametric or nonparametric methods. The commonly used parametric copulas usually contain one or two parameters and thus may not be adequate to describe complicated relations among random variables. In addition, simple copulas sometimes place restrictions on the dependence structure among variables. For example, the popular Gaussian copula assumes zero tail dependence among random variables and is therefore not suitable for the study of financial assets that tend to move together under extreme market conditions. A second limitation of the parametric approach is that many parametric copulas are only defined for bivariate variables, and extensions to higher dimensional cases are not available.

Alternatively, one can estimate copula densities using nonparametric methods. The kernel density estimator (KDE) is a popular smoother for density estimation. It is known that the KDE suffers from boundary bias, which is particularly severe when the derivatives of a density do not vanish at the boundaries. Unfortunately this poses a considerable difficulty for copula density estimation, because copula density functions often have nonzero derivatives at boundaries and corners. For example, the return distributions of the US and UK stock markets tend to move together, especially under extreme market conditions, resulting in spikes in their copula density function at the two ends of the diagonal of the unit square.¹

Like kernel estimation, series estimation is a commonly used nonparametric approach. For density estimation, orthogonal series estimation, or generalized Fourier estimation, is often employed. The series estimator has the advantage of automatic adaptiveness in the sense that the degree of the series, when selected in an optimal manner, can adapt to the unknown degree of smoothness of the underlying distribution to attain the optimal convergence rate. In contrast, for kernel estimation one may need a higher (than second) order kernel to attain the optimal convergence rate.² However, series density estimators share with higher order kernel estimators the problem that they may produce negative density estimates.

¹ Charpentier et al. (2007) discuss several remedies to mitigate the boundary bias of the KDE along the line of boundary kernel estimators.
² For instance, higher order kernels are required to obtain a faster-than-$n^{-2/5}$ convergence rate for univariate kernel density estimation.

Wu (2010) proposes an exponential series estimator (ESE) for multivariate density estimation. This method is particularly advantageous for copula density estimation as it is strictly positive, explicitly defined on a bounded support, and largely mitigates the boundary bias problem. Numerical evidence in Wu (2010) and Chui and Wu (2009) demonstrates the effectiveness of this method for copula densities. However, the selection of basis functions for the multivariate ESE is challenging and can cause severe numerical difficulties. In this study, we adopt a regularization approach: we employ a relatively large set of basis functions and penalize the roughness of the resulting model to balance the goodness-of-fit against the simplicity of the model. This approach leads to a penalized maximum likelihood estimator for copula densities. To further reduce the computational cost, we suggest an approximate likelihood cross validation method for smoothing parameter selection. Our Monte Carlo simulations show that the proposed estimator outperforms the conventional kernel density estimator, sometimes by substantial margins.

The rest of the paper is organized as follows. Section 2 provides a brief background on the exponential series estimator, discussing its information theoretic origin, large sample properties, extension to multivariate settings, and smoothing parameter selection. Section 3 proposes the penalized exponential series estimator and presents an approximate likelihood cross validation method for smoothing parameter selection. Section 4 reports our Monte Carlo simulations. Some concluding remarks are offered in the last section.

2 Exponential Series Estimator of Copula Density Functions

Wu (2010) proposes a multivariate exponential series estimator and shows that it is particularly useful for copula density estimation. In this section, we briefly discuss the exponential series density estimator. We first introduce the idea of the maximum entropy density, upon which the exponential series estimator is based. We then present the exponential series estimator and discuss its smoothing parameter selection and some practical difficulties in multivariate cases.

2.1 Maximum Entropy Density

One strategy to obtain strictly positive density estimates using the series method is to model the log density via a series estimator. This idea is not new; earlier studies on the approximation of log densities using polynomials include Neyman (1937) and Good (1963). Transforming the polynomial estimate of the log density back to its original scale results in a density estimator in the exponential family. Thus approximating log densities by series estimators amounts to estimating densities by sequences of canonical exponential families. Maximum likelihood estimation (MLE) provides efficient estimates of these exponential families. Crain (1974) establishes the existence and consistency of the MLE in this case.

This method of density estimation arises naturally from the principle of maximum entropy. The information entropy, the central concept of information theory, of a univariate continuous random variable with density f is defined as

$$W(f) = -\int f(x) \log f(x)\,dx.$$

Suppose that for a random variable x with an unknown density function $f_0$, one knows only some of its moments. There may exist an infinite number of distributions satisfying these moment conditions. Jaynes (1957) proposes constructing a unique density estimate based on the moment conditions as follows:

$$\max_f \; W(f)$$

subject to the integration-to-unity and side moment conditions:

$$\int f(x)\,dx = 1, \qquad \int \phi_k(x) f(x)\,dx = \mu_k, \quad k = 1, \ldots, K,$$

where the $\phi_k$'s are real-valued, linearly independent functions defined on the support of x.

The solution, obtained by an application of the calculus of variations, takes the form

$$f(x; c) = \exp\Big(\sum_{k=1}^{K} c_k \phi_k(x) - c_0\Big) = \exp(c'\phi(x) - c_0), \qquad (3)$$

where $\phi = (\phi_1, \ldots, \phi_K)'$ and $c = (c_1, \ldots, c_K)'$ are the Lagrange multipliers for the moment conditions. The normalization factor $c_0 = \log\{\int \exp(c'\phi(x))\,dx\} < \infty$ ensures the integration-to-unity condition. Among all distributions satisfying the given moment conditions, the maximum entropy density is the closest to the uniform distribution defined on the support of x. Many distributions can be characterized as maximum entropy densities. For example, the normal distribution is obtained by setting $\phi_1(x) = x$ and $\phi_2(x) = x^2$ for $x \in \mathbb{R}$, and the Beta distribution by $\phi_1(x) = \ln(x)$ and $\phi_2(x) = \ln(1-x)$ for $x \in (0, 1)$.

In practice, the population moments are often unknown and therefore replaced by their sample counterparts. Given an iid random sample $X_1, \ldots, X_n$, the maximum entropy density is estimated by the MLE based on the sample moments $\hat\phi = (\hat\phi_1, \ldots, \hat\phi_K)'$, where $\hat\phi_k = \frac{1}{n}\sum_{i=1}^{n} \phi_k(X_i)$, $k = 1, \ldots, K$. The log likelihood function is given by

$$L = \frac{1}{n}\sum_{i=1}^{n}\Big[c'\phi(X_i) - \log\Big\{\int \exp(c'\phi(x))\,dx\Big\}\Big] = c'\hat\phi - \log\Big\{\int \exp(c'\phi(x))\,dx\Big\}.$$

Denote the MLE solution by $f(\cdot\,; \hat c)$. Thanks to the canonical exponential form of the maximum entropy density, $\hat\phi$ are the sufficient statistics of $f(\cdot\,; \hat c)$. We therefore call $\hat\phi$ the characterizing moments of the maximum entropy density.

The coefficients of a maximum entropy density generally cannot be obtained analytically and thus must be solved for numerically. Zellner and Highfield (1988) and Wu (2003) discuss the numerical calculation of maximum entropy densities. Define

$$g(x) = c'\phi(x), \qquad \mu_g(h) = \frac{\int h(x)\exp(g(x))\,dx}{\int \exp(g(x))\,dx}. \qquad (4)$$

The score function and the (negative) Hessian matrix of the MLE are then given by

$$S = \hat\phi - \mu_{\hat g}(\phi), \qquad H = \mu_{\hat g}(\phi\phi') - \mu_{\hat g}(\phi)\mu_{\hat g}(\phi'),$$

where $\hat g(x) = \hat c'\phi(x)$. One can then use Newton's method, updating $\hat c \leftarrow \hat c + H^{-1}S$, to solve for $\hat c$ iteratively. The uniqueness of the solution is ensured by the positive-definiteness of H. Therefore, for a maximum entropy density there exists a unique correspondence between its characterizing moments $\hat\phi$ and its coefficients $\hat c$.
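To make the iteration concrete, here is a minimal sketch of the Newton solver implied by the score and Hessian above, assuming a univariate orthonormal cosine basis on [0, 1] and Gauss-Legendre quadrature for the integrals; the sample, the basis size, and the quadrature order are illustrative choices, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=500)     # hypothetical sample on [0, 1]
K = 4                                # number of characterizing moments

# Gauss-Legendre nodes mapped from [-1, 1] to [0, 1]
nodes, weights = np.polynomial.legendre.leggauss(50)
t, w = (nodes + 1) / 2, weights / 2

def basis(x):
    # sqrt(2) cos(pi k x), k = 1..K: orthonormal and mean zero on [0, 1],
    # so the identification condition "integral of g equals zero" holds
    k = np.arange(1, K + 1)
    return np.sqrt(2) * np.cos(np.pi * np.outer(x, k))   # shape (len(x), K)

phi_hat = basis(X).mean(axis=0)      # characterizing moments phi-hat
Phi_t = basis(t)                     # basis evaluated at quadrature nodes
c = np.zeros(K)

for _ in range(50):
    dens = np.exp(Phi_t @ c)                      # unnormalized exp(g)
    Z = w @ dens                                  # normalizing constant
    mu = (w * dens) @ Phi_t / Z                   # mu_g(phi)
    H = (Phi_t.T * (w * dens)) @ Phi_t / Z - np.outer(mu, mu)
    step = np.linalg.solve(H, phi_hat - mu)       # Newton step H^{-1} S
    c += step
    if np.max(np.abs(step)) < 1e-10:
        break
```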

Thus throughout the text we maintain that the usual zero order term $\phi_0(x) = 1$ is excluded from the basis functions for g.

2.2 Exponential Series Density Estimator

The maximum entropy density is a useful approach for constructing a density estimate from a set of moment conditions, and it enjoys an appealing information theoretic interpretation. On the other hand, like the usual parametric models, this density estimator generally is not consistent unless the underlying distribution happens to belong to the canonical exponential family with characterizing moments given by φ. To obtain consistent density estimates, in principle one can let the number of characterizing moments increase with the sample size at a proper rate, which effectively transforms the maximum entropy method into a nonparametric estimator. To stress the nonparametric nature of this estimator, we call a maximum entropy density whose number of characterizing moments increases with the sample size an exponential series estimator (ESE).

Moving into the realm of nonparametric estimation inevitably brings new problems. The paramount issue is the determination of the degree of smoothing, which is discussed at length below. Another issue that warrants caution is identification. To ensure a one-to-one correspondence between $f(x)$ and $\exp(g(x))/\int \exp(g(x))\,dx$, we need to impose certain restrictions. Two commonly used identification conditions are $g(x_0) = 0$ for some fixed $x_0$ and $\int g(x)\,dx = 0$. When we use orthogonal series as the basis functions, the second condition is satisfied automatically.

Let x be a random variable defined on [0, 1] with density $f_0$ and let $\hat f$ be an ESE approximation to $f_0$. Without loss of generality, let $\phi = (\phi_1, \ldots, \phi_K)'$ be a series of orthonormal basis functions with respect to the Lebesgue measure on [0, 1]. One can measure the discrepancy between $f_0$ and $\hat f$ by the Kullback-Leibler Information Criterion (KLIC, also known as the relative entropy or cross entropy), defined as $D(f_0 \| \hat f) = \int f_0(x) \ln(f_0(x)/\hat f(x))\,dx$.³ In an important development, Barron and Sheu (1991) establish that the sequence of $\hat f$'s converges to $f_0$ in terms of the KLIC. In particular, suppose $\int \{(\partial^r/\partial x^r) \log f_0(x)\}^2\,dx < \infty$; then

$$D(f_0 \| \hat f) = O_p\big(1/K^{2r} + K/n\big),$$

with $K \to \infty$ and $K^3/n \to 0$ for the power series, and $K^2/n \to 0$ for the trigonometric series and the splines.

³ The KLIC is a pseudo-metric in the sense that $D(f \| g) = 0$ if and only if $f = g$ almost everywhere; however, it is asymmetric and does not satisfy the triangle inequality.

Wu (2010) extends the ESE to multivariate densities, using the tensor product of univariate orthogonal basis functions to construct multivariate orthogonal basis functions. Let x be a d-dimensional random variable defined on $[0,1]^d$ with density $f_0$. A multivariate ESE for $f_0$ is then constructed as

$$f(x) = \frac{\exp\Big(\sum_{k_1=0}^{K_1} \cdots \sum_{k_d=0}^{K_d} c_{k_1 \cdots k_d}\, \phi_{k_1}(x_1) \cdots \phi_{k_d}(x_d)\Big)}{\int \exp\Big(\sum_{k_1=0}^{K_1} \cdots \sum_{k_d=0}^{K_d} c_{k_1 \cdots k_d}\, \phi_{k_1}(x_1) \cdots \phi_{k_d}(x_d)\Big)\,dx_1 \cdots dx_d},$$

with the convention $\phi_0 \equiv 1$ and the all-zero-index (constant) term excluded. Under the assumption that $\int \{\partial^{r}/(\partial x_1^{r_1} \cdots \partial x_d^{r_d}) \ln f_0(x)\}^2\,dx < \infty$, where $r = \sum_{j=1}^d r_j$, he shows that the ESE estimates converge to $f_0$ at rate $O_p\big(\sum_{j=1}^d K_j^{-2r_j} + \frac{1}{n}\prod_{j=1}^d K_j\big)$ in terms of the KLIC. Convergence rates in other metrics are also established, and extensive Monte Carlo simulations demonstrate the effectiveness of the ESE for multivariate density estimation.

Like the orthogonal series density estimator, the ESE enjoys automatic adaptiveness to the unknown smoothness of the underlying distribution. At the same time, it is strictly positive and therefore avoids the negative density estimates that might occur with orthogonal series estimators and higher order kernel estimators. In addition, Wu (2010) suggests that the ESE is an appealing estimator for the copula density because it is explicitly defined on a bounded support and is therefore less sensitive to the boundary bias problem. Chui and Wu (2009) provide further Monte Carlo evidence on this.
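A minimal sketch of the tensor-product construction above, reusing the cosine basis from the earlier sketch; the helper names are hypothetical, and the multi-indices are truncated by total size $|k| \le K$ to match the candidate set used in Section 2.3 below.

```python
import itertools
import numpy as np

def phi1(k, x):
    """k-th univariate orthonormal cosine basis function on [0, 1]."""
    return np.sqrt(2) * np.cos(np.pi * k * x)

def tensor_basis(x, K):
    """Evaluate all tensor products phi_{k_1}(x_1) ... phi_{k_d}(x_d) with
    1 <= |k| <= K at a point x of dimension d; zero orders are allowed
    (phi_0 = 1) so that pure marginal terms are included, while the
    constant (all indices zero) is excluded for identification."""
    d = len(x)
    cols = []
    for k in itertools.product(range(K + 1), repeat=d):
        if 1 <= sum(k) <= K:
            vals = [phi1(kj, xj) if kj > 0 else 1.0 for kj, xj in zip(k, x)]
            cols.append(np.prod(vals))
    return np.array(cols)

# For d = 2 and K = 4 this yields binom(K + d, d) - 1 = 14 basis functions,
# matching the count M_K derived in Section 2.3.
assert tensor_basis(np.array([0.3, 0.7]), 4).size == 14
```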

2.3 Selection of Basis Functions

In this subsection we discuss the selection of basis functions for the ESE, with a focus on the multivariate case. It is well known that the choice of smoothing parameter is often the most crucial ingredient of a nonparametric estimation. Kernel density estimates can vary substantially with the bandwidth. Similarly, the numerical performance of orthogonal series density estimation hinges on the degree of the basis functions; for example, a high order power series may oscillate wildly and produce negative density estimates. The ESE, which can be viewed as a series estimator raised to the exponent, is no exception. When a higher-than-desirable number of characterizing moments is used in the estimation, the density estimates may exhibit spurious bumps and spikes. In addition, a large number of characterizing moments increases not only the computational cost but also the probability that the Hessian matrix used in the Newton updating approaches (near) singularity. Therefore, a judicious choice of basis functions is called for.⁴

The natural connection between the maximum entropy density and the MLE facilitates adopting an information criterion for model specification. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two commonly used information criteria that strive for a balance between the goodness-of-fit and the simplicity of a statistical model. In the paradigm of nonparametric estimation, where an estimator approximates an unknown underlying process, the AIC is considered optimal in the minimax sense.⁵

⁴ The selection of basis functions is particularly important for the ESE compared with generalized Fourier series density estimation. For the latter, the coefficient of each basis function $\phi_k$ is given by $\int \phi_k(x) f_0(x)\,dx$, which can be conveniently estimated by its sample counterpart. Therefore, although the selection of basis functions affects its performance, no numerical difficulties are involved in generalized Fourier series estimation. In contrast, the coefficients of the ESE are obtained through an inverse problem that involves all basis functions through Newton's updating; an overlarge basis may render the Hessian matrix near singular and consequently cause numerical difficulties.

⁵ The BIC is consistent if the set of candidates contains the true model. However, in nonparametric estimation the true model is generally assumed unknown, and the goal is to arrive at an increasingly better approximation to the underlying model rather than to identify the true model.

On the other hand, from a penalized MLE point of view, the difference between the AIC and the BIC resides in their penalties on roughness, or the number of parameters. Let L be the log likelihood and K the number of parameters in a model, which reflects the complexity of the model and is to be penalized. Both criteria can be written in the form $L - \lambda K$, where the second term is the roughness penalty and λ determines its strength. For the AIC and the BIC, λ takes the value 1 and $\frac{1}{2}\ln n$ respectively.

Cross validation (CV) provides an alternative method for selecting smoothing parameters. Let $L_{-i}$ denote the log likelihood of the ith observation evaluated at a model estimated with the entire sample except the ith observation. The cross validated log likelihood is calculated as $\tilde L = \frac{1}{n}\sum_{i=1}^{n} L_{-i}$. The likelihood cross validation method minimizes the Kullback-Leibler loss and is therefore asymptotically equivalent to the AIC approach. (See Hall (1987) for an in-depth analysis of the likelihood cross validation method.)

Using the information criteria or the CV method to select the smoothing parameter for the univariate ESE is relatively straightforward. Recall that the selection of the smoothing parameter is equivalent to the selection of basis functions, or characterizing moments, for the ESE. Given a reasonably large candidate set of basis functions, one can evaluate all subsets of the candidate set and select the optimal set of basis functions according to a given selection criterion. However, the process can be time consuming: if the candidate set contains K basis functions, then the number of subsets is $2^K$. The process can be greatly simplified when the basis functions have a natural ordering or hierarchical structure. For example, the polynomial series and the trigonometric series have an intuitive frequency interpretation, in which the low/high order basis functions capture the low/high frequency features of the underlying process. When this type of series is used, it is a common, and sometimes preferred, practice to use a hierarchical selection approach in the sense that if the kth basis function is selected, all lower order basis functions are included automatically. Clearly, the hierarchical selection method is a truncation approach. The number of models that need to be estimated is K, considerably smaller than the $2^K$ required by complete subset selection.

In principle, either the subset selection or the truncation method can be used to select the smoothing parameter for estimating multivariate densities with the ESE. The practical difficulty, however, is that the number of required evaluations increases exponentially with the dimension of x. For the density estimation of a d-dimensional x, we consider tensor products of univariate basis functions given by $\phi_{\mathbf k}(x) = \prod_{j=1}^{d} \phi_{k_j}(x_j)$, where the multi-index $\mathbf k = (k_1, \ldots, k_d)$. Denote the size of a multi-index by $|\mathbf k| = \sum_{j=1}^{d} k_j$. Suppose the candidate set $\mathcal M_K$ consists of basis functions whose sizes are no greater than K, i.e., $\mathcal M_K = \{\phi_{\mathbf k} : 1 \le |\mathbf k| \le K\}$. With a slight abuse of notation, let $M_K$ denote the number of elements in $\mathcal M_K$. One can show that $M_K = \binom{K+d}{d} - 1$. Therefore, if the subset selection method is used, it requires estimating $2^{M_K}$ ESE densities, which can be prohibitively expensive. For instance, if d = 2 and K = 4, we need to estimate $2^{14}$ ESE densities; the number explodes to $2^{34}$ if d = 3. Thus, the subset selection approach is practically infeasible except in the simplest cases.

Now consider the truncation method, which is more economical in terms of basis functions. It seems we can proceed as in the univariate case, estimating the ESE densities with basis functions $\mathcal M_k$, $k = 1, \ldots, K$, and then selecting the optimal set according to some criterion. There is, however, a key difference between the univariate and the multivariate case. For the former, as we raise k from 1 to K, each step increases the number of basis functions by one. In contrast, for the general d-dimensional case, the number of basis functions added to the candidate set grows with k. To be precise, let $m_k = \{\phi_{\mathbf k} : |\mathbf k| = k\}$, so that $\mathcal M_K = (m_1, \ldots, m_K)$. The number of additional basis functions incorporated at each stage of the stepwise process is $|m_k| = M_k - M_{k-1} = \binom{k+d-1}{d-1}$ for $d \ge 2$ and $k \ge 1$. For instance, when d = 2 we have $|m_k| = k + 1$; when d = 3, the corresponding step sizes increase to 3, 6, 10, and 15 for $k = 1, \ldots, 4$. Therefore, in the multivariate case the number of basis functions added along the truncation path rises rapidly with the dimension d, leading to increasingly abrupt expansions in model complexity. Consequently, the suitability of the simple truncation method for the high dimensional case is questionable.
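These counts are easy to check numerically; the following snippet, assuming only Python's math.comb, reproduces the figures quoted above.

```python
from math import comb

def M(K, d):
    """Size of the candidate set M_K: binom(K + d, d) - 1."""
    return comb(K + d, d) - 1

def m(k, d):
    """Basis functions added at stage k: binom(k + d - 1, d - 1)."""
    return comb(k + d - 1, d - 1)

assert M(4, 2) == 14          # subset selection: 2**14 models when d = 2
assert M(4, 3) == 34          # ... and 2**34 models when d = 3
assert [m(k, 2) for k in range(1, 5)] == [2, 3, 4, 5]     # |m_k| = k + 1
assert [m(k, 3) for k in range(1, 5)] == [3, 6, 10, 15]
```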

Lastly, in addition to the practical difficulties associated with model specification in the high dimensional case, there exists another potential problem. Recall that the maximum entropy density can produce spurious spikes when it is based on a large number of moments. Not surprisingly, this problem can be aggravated when the ESE is used to estimate multivariate densities, whose number of characterizing moments increases rapidly with the dimension and the sample size. In addition, the larger the number of basis functions, the higher the probability that the Hessian matrix used in the Newton updating becomes (near) singular, introducing further complications. To mitigate these problems, below we propose an alternative penalized MLE approach for copula densities using the ESE.

3 Penalized Exponential Series Estimation

As discussed above, ESE estimation of multivariate densities can be challenging due to the difficulties associated with the selection of basis functions and the attendant numerical issues. In this section, we propose to use the method of penalized MLE to determine the degree of smoothing: instead of painstakingly selecting a (small) set of basis functions to model a multivariate density, we use a relatively large set of basis functions and shrink the coefficients of the resulting model toward zero to penalize its complexity.

3.1 The Model

Good and Gaskins (1971) introduce the idea of roughness penalized density estimation. Their idea is to use as an estimate the density that maximizes a penalized version of the likelihood. The penalized likelihood is defined as

$$Q = \frac{1}{n}\sum_{i=1}^{n} \ln f(X_i) - \lambda J(f),$$

where J(f) is a roughness penalty on the density f and λ is the smoothing/tuning parameter. The log likelihood pushes the estimate to adapt to the data, the roughness penalty counteracts by demanding less variation, and the smoothing parameter controls the tradeoff between the two conflicting goals. Various roughness penalties have been proposed in the literature. For instance, Good and Gaskins (1971) use $J(f) = \int (f')^2(x)/f(x)\,dx$, Silverman (1982) sets $J(f) = \int \{(d/dx)^3 \ln f(x)\}^2\,dx$, and Gu and Qiu (1993) propose general quadratic roughness penalties for smoothing spline density estimation.

Without loss of generality and for simplicity, we consider a bivariate random variable (x, y) with a strictly positive and bounded density $f_0$ defined on the unit square $[0,1] \times [0,1]$. Let $\phi_k(x, y)$, $k = 1, \ldots, M$, be orthonormal basis functions with respect to the Lebesgue measure on the unit square. To ease notation, we write $M = M_K$, where M is understood to be a function of K, which in turn is a function of the sample size; we also replace the multi-index $\mathbf k$ with a single index k, $k = 1, \ldots, M$. We consider approximating $f_0$ by

$$f(x, y) = \frac{\exp\big(\sum_{k=1}^{M} c_k \phi_k(x, y)\big)}{\int \exp\big(\sum_{k=1}^{M} c_k \phi_k(x, y)\big)\,dx\,dy} \equiv \frac{\exp(g(x, y))}{\int \exp(g(x, y))\,dx\,dy}, \qquad (5)$$

where $g(x, y) = c'\phi(x, y)$ with $c = (c_1, \ldots, c_M)'$ and $\phi(x, y) = (\phi_1(x, y), \ldots, \phi_M(x, y))'$. Throughout this section, integration is taken over the unit square.

For the roughness penalty, we adopt a quadratic penalty on the log density g. The penalized MLE objective function is then given by

$$Q = \frac{1}{n}\sum_{i=1}^{n} c'\phi(X_i, Y_i) - \ln \int \exp(g(x, y))\,dx\,dy - \frac{\lambda}{2}\, c'Wc, \qquad (6)$$

where W is a positive definite weight matrix for the roughness penalty.

Given the smoothing parameter and the roughness penalty, one can use Newton's method to solve for c iteratively. The gradient and the (negative) Hessian of (6) are respectively

$$S = \hat\phi - \mu_{\hat g}(\phi) - \lambda W \hat c, \qquad H = \mu_{\hat g}(\phi\phi') - \mu_{\hat g}(\phi)\mu_{\hat g}(\phi') + \lambda W,$$

where $\mu_g$ is given by (4) and $\hat g = \hat c'\phi$, and the Newton update is $\hat c \leftarrow \hat c + H^{-1}S$. One can establish the existence and uniqueness of the penalized MLE within the general exponential family under rather mild conditions (see, e.g., Lemma 2.1 of Gu and Qiu (1993)).
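Under this convention H is positive definite, so the update applies directly; a minimal sketch of one penalized Newton step follows, with the moment vector, the covariance matrix, W, and λ assumed given (the function name and arguments are illustrative).

```python
import numpy as np

def penalized_newton_step(c, phi_hat, mu, V_phi, W, lam):
    """One Newton update for the penalized objective (6).

    phi_hat : sample characterizing moments
    mu      : mu_ghat(phi), the basis means under the current density
    V_phi   : mu_ghat(phi phi') - mu_ghat(phi) mu_ghat(phi)', the covariance
    """
    S = phi_hat - mu - lam * (W @ c)        # penalized score
    H = V_phi + lam * W                     # curvature; positive definite
    return c + np.linalg.solve(H, S)
```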

To implement the penalized MLE, we must specify several ingredients: (i) the basis functions φ; (ii) the weight matrix W; and (iii) the smoothing parameter λ. The smoothing parameter plays the most crucial role and is discussed at length in the next subsection. As for the choice of basis functions, commonly used orthogonal series include the Legendre series, the trigonometric series, and the splines. Although there are subtle differences between these series (e.g., in terms of their boundary biases), they lead to the same convergence rates under suitable regularity conditions. We also need to determine the size of the basis. We stress that for the penalized MLE, the number of basis functions is generally not treated as a smoothing parameter. For instance, in smoothing spline density estimation, the size of the basis can be as large as the sample size. In practice, a sufficiently large (but smaller than sample size) basis often suffices. The basis used in penalized likelihood estimation is usually considerably larger than one selected according to an information criterion, and thus calls for a roughness penalty.⁶

⁶ See Gu and Wang (2003) for an asymptotic analysis of the size of the basis in smoothing spline estimation.

Next we need to choose the weight matrix W. We consider penalizing the roughness of the log density $g(x, y) = c'\phi(x, y)$ via

$$J(g) = \int \{g^{(m)}(x, y)\}^2\,dx\,dy = \int \{c'\phi^{(m)}(x, y)\}^2\,dx\,dy = c'\Big\{\int \phi^{(m)} \phi^{(m)\prime}\,dx\,dy\Big\} c,$$

where $g^{(m)}(x, y) = \sum_{m_1+m_2=m} \partial^m g(x, y)/\partial x^{m_1}\partial y^{m_2}$ with $m \ge 0$, and $\phi^{(m)}$ is defined analogously. Therefore, W is given by the middle factor of the last equality. Using orthonormal series simplifies the construction of the weight matrix, since it leads to a diagonal weight matrix. When m = 0, W equals the identity matrix and the coefficients of all basis functions are penalized equally; when $m \ge 1$, the coefficients of higher order/frequency moments are increasingly penalized, with the rate of increase rising geometrically in m.⁷ Popular choices include m = 0 and m = 2, corresponding to the natural splines and the cubic splines, respectively, in smoothing spline estimation.

⁷ For instance, the penalty weight given to a univariate cosine basis function $\phi_k(x) = \sqrt{2}\cos(\pi k x)$, $x \in [0, 1]$, is $(\pi k)^{2m}$.

Since there is a one-to-one correspondence between the characterizing moments and their coefficients in the ESE density, the penalized MLE can be viewed as a shrinkage estimator that shrinks the sample characterizing moments toward zero. In addition, the roughness penalty defines a null space $\mathcal J$, on which J(g) = 0. This null space is usually finite-dimensional to avoid interpolation of the data. As the smoothing parameter $\lambda \to \infty$, the penalized MLE converges to the MLE on $\mathcal J$, which is the smoothest density induced by the given roughness penalty. For instance, when m = 2, the smoothest function for g is linear in x; when m = 3, the smoothest function is quadratic in x, leading to the normal distribution (see Silverman (1982)).
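As a concrete illustration of footnote 7, a minimal sketch of the diagonal weight matrix for the univariate cosine basis; the basis size and the derivative order are arbitrary example values.

```python
import numpy as np

def cosine_penalty_weights(K, m):
    """Diagonal W for phi_k(x) = sqrt(2) cos(pi k x), k = 1..K:
    the m-th derivative has squared L2 norm (pi k)^{2m} on [0, 1]."""
    k = np.arange(1, K + 1)
    return np.diag((np.pi * k) ** (2 * m))

W = cosine_penalty_weights(K=6, m=2)   # cubic-spline-like penalty
```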

3.2 Selection of Smoothing Parameter

One advantage of using the penalized MLE for model selection is that it avoids the difficult subset selection or truncation process for determining the optimal degree of smoothing. Instead, a relatively large number of basis functions is used, and the degree of smoothing is determined by a single continuous parameter λ. In practice, one has to choose λ. The method of cross validation is commonly used for this purpose. This is a natural choice because the penalized MLE does not involve subset selection, so AIC or BIC type criteria that penalize the number of parameters cannot be easily applied.

Leave-one-out cross validation for linear regressions is quite straightforward. In fact, for a sample with n observations, one usually need not estimate n regressions, because an analytical formula computes the least squares cross validation result from the regression on the full sample. This kind of analytical solution, however, generally does not exist for nonlinear estimations. For the ESE estimation of multivariate densities, this poses a practical difficulty due to the high computational cost, because the coefficients of the ESE are calculated iteratively through Newton's updating. For a basis of size M, the Hessian matrix has M(M+1)/2 distinct elements to evaluate, each requiring multidimensional integration by numerical methods. The computational cost increases rapidly with the dimension because (i) the number of basis functions grows with the dimension in nonparametric estimation, and (ii) multidimensional integration also becomes increasingly expensive with the dimension. Thus it is rather expensive to implement leave-one-out cross validation for multivariate ESEs, especially for penalized MLEs that use a large number of basis functions. We therefore propose a first order approximation to the cross validated log likelihood, which requires only one estimation of the ESE based on the full sample.

Recall that (5) belongs to the general exponential family and that the sample averages $\hat\phi$ are the sufficient statistics for the penalized MLE. Denote the sample averages calculated leaving out the ith observation by $\hat\phi_{-i} = \frac{1}{n-1}\sum_{j \ne i} \phi(X_j, Y_j)$. It follows that $\hat\phi_{-i}$ are the sufficient statistics for the penalized MLE calculated with the ith observation deleted. For given basis functions and smoothing parameter, denote by $\hat f$ and $\hat f_{-i}$ the penalized MLE estimates associated with $\hat\phi$ and $\hat\phi_{-i}$ respectively. Let $\hat c$ and $\hat H$ be the estimated coefficients and the H matrix of $\hat f$, and let $\hat c_{-i}$ and $\hat H_{-i}$ be similarly defined. By Taylor's theorem, we have

$$\hat c_{-i} \approx \hat c - \hat H^{-1}(\hat\phi - \hat\phi_{-i}).$$

The normalization factor can be approximated similarly. Define $c_0 = \ln \int \exp(g(x, y))\,dx\,dy$, and let $\hat c_0$ and $\hat c_{0,-i}$ be the normalization factors of $\hat f$ and $\hat f_{-i}$ respectively. It follows that

$$\hat c_{0,-i} \approx \hat c_0 - \mu_{\hat g}(\phi)' \hat H^{-1}(\hat\phi - \hat\phi_{-i}).$$

Next let $L_{-i}$ be the log likelihood of the ith observation evaluated at $\hat f_{-i}$. The cross validated log likelihood can then be approximated as follows:

$$\begin{aligned}
\tilde L &= \frac{1}{n}\sum_{i=1}^{n} L_{-i}(X_i, Y_i) = \frac{1}{n}\sum_{i=1}^{n}\big\{\hat c_{-i}'\,\phi(X_i, Y_i) - \hat c_{0,-i}\big\} \\
&\approx \frac{1}{n}\sum_{i=1}^{n}\big\{\hat c - \hat H^{-1}(\hat\phi - \hat\phi_{-i})\big\}'\phi(X_i, Y_i) - \frac{1}{n}\sum_{i=1}^{n}\big\{\hat c_0 - \mu_{\hat g}(\phi)'\hat H^{-1}(\hat\phi - \hat\phi_{-i})\big\} \\
&= \frac{1}{n}\sum_{i=1}^{n}\big\{\hat c'\phi(X_i, Y_i) - \hat c_0\big\} - \frac{1}{n}\sum_{i=1}^{n}\phi'(X_i, Y_i)\,\hat H^{-1}(\hat\phi - \hat\phi_{-i}) \\
&= \hat L - \frac{1}{n}\sum_{i=1}^{n}\phi'(X_i, Y_i)\,\hat H^{-1}(\hat\phi - \hat\phi_{-i}), \qquad (7)
\end{aligned}$$

where $\hat L$ is the log likelihood evaluated at the full-sample penalized MLE, and the penultimate equality follows because $\sum_{i=1}^{n}(\hat\phi - \hat\phi_{-i}) = 0$. Next let Φ be the $n \times M$ matrix whose ith row is $(\phi_1(X_i, Y_i), \ldots, \phi_M(X_i, Y_i))$. Noting that $\hat\phi - \hat\phi_{-i} = (\phi(X_i, Y_i) - \hat\phi)/(n-1)$, the cross validated log likelihood (7) can, after straightforward but tedious algebra, be written in the matrix form

$$\tilde L \approx \hat L - \frac{1}{n(n-1)}\,\mathrm{trace}\big[\Phi \hat H^{-1} \Phi'\big] + \frac{1}{n^2(n-1)}\,(\mathbf 1'\Phi)\,\hat H^{-1}(\Phi'\mathbf 1), \qquad (8)$$

where $\mathbf 1$ is an $n \times 1$ vector of ones. We then select the smoothing parameter λ by maximizing (8). Since λ is a scalar, we use a simple grid search to locate the solution.

As discussed above, multidimensional numerical integrations are used repeatedly in our estimations. For the calculation of the $\mu_{\hat g}$'s, we use the Smolyak algorithm for cubatures, following Gu and Wang (2003). Smolyak cubatures are highly accurate for smooth functions. We note that the placement of nodes in Smolyak cubatures is dense near the boundaries; they are therefore particularly suitable for evaluating ESEs of copula densities, which often peak near the boundaries and corners.
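A minimal sketch of (8), assuming the full-sample fit has already produced the basis matrix Φ, the matrix Ĥ, and the log likelihood L̂; in practice one would evaluate this criterion over a grid of λ values (each with its own fit) and keep the maximizer.

```python
import numpy as np

def approx_cv_loglik(L_hat, Phi, H_hat):
    """Approximate cross validated log likelihood of equation (8).

    L_hat : full-sample log likelihood at the penalized MLE
    Phi   : (n, M) matrix with i-th row phi(X_i, Y_i)'
    H_hat : (M, M) matrix H evaluated at the full-sample fit
    """
    n = Phi.shape[0]
    Hinv_PhiT = np.linalg.solve(H_hat, Phi.T)            # H^{-1} Phi'
    trace_term = np.einsum('ij,ji->', Phi, Hinv_PhiT)    # trace(Phi H^{-1} Phi')
    s = Phi.sum(axis=0)                                  # Phi' 1
    cross_term = s @ np.linalg.solve(H_hat, s)           # (1'Phi) H^{-1} (Phi'1)
    return L_hat - trace_term / (n * (n - 1)) + cross_term / (n**2 * (n - 1))
```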

4 Monte Carlo Simulations

To investigate the finite sample performance of the proposed estimators, we conduct a series of Monte Carlo simulations. For the penalized MLE, we penalize the third order derivatives of the log densities. Denote a bivariate ESE density by $f(x, y) = \exp(g(x, y) - c_0)$, where $g(x, y) = \sum_{m=1}^{M} c_m \phi_m(x, y)$. The penalty then takes the form

$$J(g) = \int \Big\{\Big(\frac{\partial^3}{\partial x^3} + \frac{\partial^3}{\partial x^2 \partial y} + \frac{\partial^3}{\partial x \partial y^2} + \frac{\partial^3}{\partial y^3}\Big) g(x, y)\Big\}^2\,dx\,dy = c'Wc,$$

with W constructed from the basis functions as in Section 3.1. When the smoothing parameter goes to infinity, the penalized MLE converges to the smoothest distribution induced by this penalty:

$$f(x, y) = \exp\big(c_1 x + c_2 x^2 + c_3 y + c_4 y^2 + c_5 xy - c_0\big), \qquad x, y \in [0, 1],$$

which is a truncated bivariate normal density defined on the unit square.⁸

Alternatively, one can penalize lower or higher order derivatives of g. We choose the third order derivatives because under this penalty the smoothest distribution is the simplest one that contains useful information on the dependence between x and y, captured by the sample moment $\frac{1}{n}\sum_{i=1}^{n} X_i Y_i$. If a lower order derivative is used, the smoothest distribution contains only moments of the margins and is thus uninformative, since all margins of a copula density are uniform. On the other hand, if higher order derivatives are used, the smoothest distribution contains higher order information on the dependence between x and y, whose coefficients are not penalized.⁹

⁸ We note that this differs from the Gaussian copula, whose distribution function is given by $C(x, y; \rho) = \Phi_\rho(\Phi^{-1}(x), \Phi^{-1}(y))$, $x, y \in [0, 1]$, where Φ is the standard normal distribution function and $\Phi_\rho$ is the standard bivariate normal distribution function with correlation coefficient ρ.

⁹ In contrast to the large literature on the selection of smoothing parameters, theoretical guidance on the specification of penalty forms is scant. On the other hand, the existing literature suggests that estimations are usually not sensitive to the form of the penalty, which is consistent with our own numerical experiments.

We consider both the Legendre series and the cosine series, orthonormalized on the unit square. The results from the two bases are rather similar; hence, to save space, below we report only the results for the Legendre series. We consider three sample sizes: 50, 100 and 200. For all three sizes, we find that Legendre basis functions of degree no larger than 4 produce satisfactory results.¹⁰ The approximate cross validation method described in the previous section is used to select the smoothing parameter. For comparison, we also estimate the copula densities using the kernel density estimator. In particular, we use the product Gaussian kernel, with the bandwidth selected by likelihood cross validation.

We consider four different copulas: the Gaussian, T, Frank, and Galambos; the first two belong to the elliptical class, and the latter two to the Archimedean and the extreme value classes respectively. For each copula, we examine three cases with low, medium and high dependence. The coefficients of the copulas are selected such that the low, medium and high dependence cases correspond to a correlation of 0.2, 0.5 and 0.8 respectively. All experiments are repeated 500 times.

For each experiment, let $(X_i, Y_i)$, $i = 1, \ldots, n$, be an iid sample generated from a given copula. Define the pseudo-observations as

$$\tilde X_i = \frac{1}{n+1}\sum_{j=1}^{n} I(X_j \le X_i), \qquad \tilde Y_i = \frac{1}{n+1}\sum_{j=1}^{n} I(Y_j \le Y_i),$$

where the denominator is set to n + 1 to avoid numerical difficulties. Jäckel (2002) and Charpentier et al. (2007) suggest that using the pseudo-observations instead of the true observations reduces the variation. The intuition is that the above transformation effectively changes both marginal series (after being sorted in ascending order) to $(\frac{1}{n+1}, \ldots, \frac{n}{n+1})$, which is consistent with the fact that copula densities have uniform margins. We use the pseudo-observations in all our estimations.

¹⁰ Let $\phi_k$ be the kth degree Legendre polynomial on [0, 1]. We include in our bivariate density estimations basis functions of the form $\phi_j(x)\phi_k(y)$, $j + k \le 4$. The size of this basis is 14.
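A minimal sketch of the pseudo-observation transform, assuming SciPy's rankdata: since $\sum_j I(X_j \le X_i)$ is simply the rank of $X_i$, dividing the ranks by n + 1 reproduces the definition above.

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(X):
    """Map each column of an (n, d) sample to ranks / (n + 1)."""
    X = np.asarray(X)
    n = X.shape[0]
    return np.column_stack([rankdata(X[:, j]) / (n + 1)
                            for j in range(X.shape[1])])
```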

To gauge the performance of our estimates, we calculate the mean squared errors (MSE) and mean absolute deviations (MAD) between the estimated densities and the true copula densities, evaluated on a 30 by 30 equally spaced grid on the unit square.

Figure 1 reports the estimation results measured by the MSE. The top, middle and bottom rows correspond to the sample sizes 200, 100 and 50, and the left, middle and right columns correspond to the low, medium and high correlation cases. In each plot, the MSEs of the ESEs are represented by circles connected by solid lines, while those of the KDEs are represented by triangles connected by dashed lines. The Gaussian, T, Frank and Galambos copulas are labeled 1, 2, 3 and 4 respectively in each plot. Note that the scales of the plots differ.

In all our experiments, the ESE outperforms the KDE, often considerably. The MSE increases with the degree of dependence and decreases with the sample size. Averaging across the four copulas, the ratios of MSEs between the ESEs and the KDEs are 0.25, 0.49 and 0.77 for the low, medium and high correlation cases respectively. The corresponding ratios for sample sizes 50, 100 and 200 are respectively 0.62, 0.67 and ...

Figure 2 reports the estimation results in MADs. The overall picture is similar to that of the MSEs, but with a larger average performance gap. Averaging across the four copulas, the ratios of MADs between the ESEs and the KDEs are 0.32, 0.42 and 0.61 for the low, medium and high correlation cases respectively. The corresponding ratios for sample sizes 50, 100 and 200 are respectively 0.45, 0.45 and ... Thus our numerical experiments support our contention in the previous sections that the ESE provides a useful nonparametric estimator for copula densities.

Figure 1: Mean squared errors of the estimated copulas. The ESE and KDE results are represented by circles and triangles respectively. Rows 1-3 correspond to n = 200, 100 and 50; columns 1-3 correspond to correlations of 0.2, 0.5 and 0.8; in each plot, copulas 1-4 are the Gaussian, T, Frank and Galambos copulas. Note that the scales of the plots differ.

Figure 2: Mean absolute deviations of the estimated copulas. The ESE and KDE results are represented by circles and triangles respectively. Rows 1-3 correspond to n = 200, 100 and 50; columns 1-3 correspond to correlations of 0.2, 0.5 and 0.8; in each plot, copulas 1-4 are the Gaussian, T, Frank and Galambos copulas. Note that the scales of the plots differ.

5 Concluding Remarks

We have proposed a penalized maximum likelihood estimator based on the exponential series method for copula density estimation. The exponential series density estimator is strictly positive and overcomes the boundary bias issue associated with kernel density estimation. However, the selection of basis functions for the ESE is challenging and can cause severe numerical difficulties, especially for multivariate densities. To avoid the issue of basis function selection, we adopt the strategy of regularization: we employ a relatively large basis and penalize the roughness of the resulting model, which leads to a penalized maximum likelihood estimator. To further reduce the computational cost, we propose an approximate likelihood cross validation method for the selection of the smoothing parameter. Our extensive Monte Carlo simulations demonstrate the usefulness of the proposed estimator for copula density estimation. Generalizations of the estimator to nonparametric multivariate regression, and applications in high dimensional analysis, especially in financial econometrics, may be of interest for future study.

References

Barron, A. and C. Sheu (1991). Approximation of density functions by sequences of exponential families. Annals of Statistics 19.

Charpentier, A., J. Fermanian, and O. Scaillet (2007). The estimation of copulas: Theory and practice. In J. Rank (Ed.), Copulas: From Theory to Application in Finance. Risk Publications.

Chui, C. and X. Wu (2009). Exponential series estimation of empirical copulas with application to financial returns. In Q. Li and J. Racine (Eds.), Advances in Econometrics, Volume 25.

Crain, B. (1974). Estimation of distributions using orthogonal expansions. Annals of Statistics 2.

Good, I. (1963). Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Annals of Mathematical Statistics 34.

Good, I. and R. Gaskins (1971). Nonparametric roughness penalties for probability densities. Biometrika 58.

Gu, C. and C. Qiu (1993). Smoothing spline density estimation: Theory. Annals of Statistics 21.

Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica 13.

Hall, P. (1987). On Kullback-Leibler loss and density estimation. Annals of Statistics 15.

Jäckel, P. (2002). Monte Carlo Methods in Finance. New York: John Wiley and Sons.

Jaynes, E. (1957). Information theory and statistical mechanics. Physical Review 106.

Nelsen, R. B. (2010). An Introduction to Copulas. Springer.

Neyman, J. (1937). Smooth test for goodness of fit. Skandinavisk Aktuarietidskrift 20.

Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. Annals of Statistics 10.

Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8.

Wu, X. (2003). Calculation of maximum entropy densities with application to income distribution. Journal of Econometrics 115.

Wu, X. (2010). Exponential series estimator of multivariate densities. Journal of Econometrics 156.

Zellner, A. and R. Highfield (1988). Calculation of maximum entropy distribution and approximation of marginal posterior distributions. Journal of Econometrics 37.


Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Estimation of cumulative distribution function with spline functions

Estimation of cumulative distribution function with spline functions INTERNATIONAL JOURNAL OF ECONOMICS AND STATISTICS Volume 5, 017 Estimation of cumulative distribution function with functions Akhlitdin Nizamitdinov, Aladdin Shamilov Abstract The estimation of the cumulative

More information

An Alternative Method for Estimating and Simulating Maximum Entropy Densities

An Alternative Method for Estimating and Simulating Maximum Entropy Densities An Alternative Method for Estimating and Simulating Maximum Entropy Densities Jae-Young Kim and Joonhwan Lee Seoul National University May, 8 Abstract This paper proposes a method of estimating and simulating

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

A Brief Introduction to Copulas

A Brief Introduction to Copulas A Brief Introduction to Copulas Speaker: Hua, Lei February 24, 2009 Department of Statistics University of British Columbia Outline Introduction Definition Properties Archimedean Copulas Constructing Copulas

More information

Kernel B Splines and Interpolation

Kernel B Splines and Interpolation Kernel B Splines and Interpolation M. Bozzini, L. Lenarduzzi and R. Schaback February 6, 5 Abstract This paper applies divided differences to conditionally positive definite kernels in order to generate

More information

Semi-parametric predictive inference for bivariate data using copulas

Semi-parametric predictive inference for bivariate data using copulas Semi-parametric predictive inference for bivariate data using copulas Tahani Coolen-Maturi a, Frank P.A. Coolen b,, Noryanti Muhammad b a Durham University Business School, Durham University, Durham, DH1

More information

Stochastic Spectral Approaches to Bayesian Inference

Stochastic Spectral Approaches to Bayesian Inference Stochastic Spectral Approaches to Bayesian Inference Prof. Nathan L. Gibson Department of Mathematics Applied Mathematics and Computation Seminar March 4, 2011 Prof. Gibson (OSU) Spectral Approaches to

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Vectors in Function Spaces

Vectors in Function Spaces Jim Lambers MAT 66 Spring Semester 15-16 Lecture 18 Notes These notes correspond to Section 6.3 in the text. Vectors in Function Spaces We begin with some necessary terminology. A vector space V, also

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Bivariate Rainfall and Runoff Analysis Using Entropy and Copula Theories

Bivariate Rainfall and Runoff Analysis Using Entropy and Copula Theories Entropy 2012, 14, 1784-1812; doi:10.3390/e14091784 Article OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Bivariate Rainfall and Runoff Analysis Using Entropy and Copula Theories Lan Zhang

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Penalty and Barrier Methods. So we again build on our unconstrained algorithms, but in a different way.

Penalty and Barrier Methods. So we again build on our unconstrained algorithms, but in a different way. AMSC 607 / CMSC 878o Advanced Numerical Optimization Fall 2008 UNIT 3: Constrained Optimization PART 3: Penalty and Barrier Methods Dianne P. O Leary c 2008 Reference: N&S Chapter 16 Penalty and Barrier

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Introduction to Maximum Likelihood Estimation

Introduction to Maximum Likelihood Estimation Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:

More information

A Review of Basic Monte Carlo Methods

A Review of Basic Monte Carlo Methods A Review of Basic Monte Carlo Methods Julian Haft May 9, 2014 Introduction One of the most powerful techniques in statistical analysis developed in this past century is undoubtedly that of Monte Carlo

More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. A SIMULATION-BASED COMPARISON OF MAXIMUM ENTROPY AND COPULA

More information

Linear Regression Linear Regression with Shrinkage

Linear Regression Linear Regression with Shrinkage Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle

More information

Imputation Algorithm Using Copulas

Imputation Algorithm Using Copulas Metodološki zvezki, Vol. 3, No. 1, 2006, 109-120 Imputation Algorithm Using Copulas Ene Käärik 1 Abstract In this paper the author demonstrates how the copulas approach can be used to find algorithms for

More information

FRÉCHET HOEFFDING LOWER LIMIT COPULAS IN HIGHER DIMENSIONS

FRÉCHET HOEFFDING LOWER LIMIT COPULAS IN HIGHER DIMENSIONS DEPT. OF MATH./CMA UNIV. OF OSLO PURE MATHEMATICS NO. 16 ISSN 0806 2439 JUNE 2008 FRÉCHET HOEFFDING LOWER LIMIT COPULAS IN HIGHER DIMENSIONS PAUL C. KETTLER ABSTRACT. Investigators have incorporated copula

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Yingying Dong and Arthur Lewbel California State University Fullerton and Boston College July 2010 Abstract

More information

CV-NP BAYESIANISM BY MCMC. Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUEZ

CV-NP BAYESIANISM BY MCMC. Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUEZ CV-NP BAYESIANISM BY MCMC Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUE Department of Mathematics and Statistics University at Albany, SUNY Albany NY 1, USA

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

MULTIDIMENSIONAL POVERTY MEASUREMENT: DEPENDENCE BETWEEN WELL-BEING DIMENSIONS USING COPULA FUNCTION

MULTIDIMENSIONAL POVERTY MEASUREMENT: DEPENDENCE BETWEEN WELL-BEING DIMENSIONS USING COPULA FUNCTION Rivista Italiana di Economia Demografia e Statistica Volume LXXII n. 3 Luglio-Settembre 2018 MULTIDIMENSIONAL POVERTY MEASUREMENT: DEPENDENCE BETWEEN WELL-BEING DIMENSIONS USING COPULA FUNCTION Kateryna

More information

arxiv: v1 [physics.comp-ph] 22 Jul 2010

arxiv: v1 [physics.comp-ph] 22 Jul 2010 Gaussian integration with rescaling of abscissas and weights arxiv:007.38v [physics.comp-ph] 22 Jul 200 A. Odrzywolek M. Smoluchowski Institute of Physics, Jagiellonian University, Cracov, Poland Abstract

More information

Asymptotic distribution of the sample average value-at-risk

Asymptotic distribution of the sample average value-at-risk Asymptotic distribution of the sample average value-at-risk Stoyan V. Stoyanov Svetlozar T. Rachev September 3, 7 Abstract In this paper, we prove a result for the asymptotic distribution of the sample

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Penalty Methods for Bivariate Smoothing and Chicago Land Values

Penalty Methods for Bivariate Smoothing and Chicago Land Values Penalty Methods for Bivariate Smoothing and Chicago Land Values Roger Koenker University of Illinois, Urbana-Champaign Ivan Mizera University of Alberta, Edmonton Northwestern University: October 2001

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Dynamic System Identification using HDMR-Bayesian Technique

Dynamic System Identification using HDMR-Bayesian Technique Dynamic System Identification using HDMR-Bayesian Technique *Shereena O A 1) and Dr. B N Rao 2) 1), 2) Department of Civil Engineering, IIT Madras, Chennai 600036, Tamil Nadu, India 1) ce14d020@smail.iitm.ac.in

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

An Overly Simplified and Brief Review of Differential Equation Solution Methods. 1. Some Common Exact Solution Methods for Differential Equations

An Overly Simplified and Brief Review of Differential Equation Solution Methods. 1. Some Common Exact Solution Methods for Differential Equations An Overly Simplified and Brief Review of Differential Equation Solution Methods We will be dealing with initial or boundary value problems. A typical initial value problem has the form y y 0 y(0) 1 A typical

More information