Bayesian Inference for Conditional Copula models

Size: px

Start display at page:

Download "Bayesian Inference for Conditional Copula models"

Meryl Payne
5 years ago
Views:

1 Bayesian Inference for Conditional Copula models Radu Craiu Department of Statistical Sciences University of Toronto Joint with Evgeny Levi (Toronto), Avideh Sabeti (Toronto) and Mian Wei (Toronto) CRM, Montreal January 27, 2017

2 Outline Introduction Conditional Copulas Motivation for CC Understanding a Dependence Pattern Building Multivariate Distributions Prediction Copula Misspecification Calibration Estimation Frequentist methods Bayesian methods Calibration Models when dim(x ) >> 1 Bayesian Additive Model GP-SIM Numerical Illustrations Estimation Model Selection - Copula and Calibration Selection Criteria: CVML and WAIC

3 Copulas Copula functions are used to model dependence between continuous random variables. (Sklar, 59) If Y 1, Y 2 are continuous r.v. s with distribution functions (df) F, G, there exists an unique copula function C : [0, 1] [0, 1] [0, 1] such that H(t, s) = Pr(Y 1 t, Y 2 s) = C(F (t), G(s)). C is a distribution function on [0, 1] 2 with uniform margins. The copula bridges the marginal distributions with the joint distribution.

4 Inference for Copula Models Early on: abundance of theoretical developments: construction of new copula families and connections with dependence concepts (NA, PQD/NQD, PRD/NRD, etc). Joe ( 97), Nelsen ( 06). Statistical inference for constant copula models: Joint Maximum Likelihood: numerical methods Two-stage approach: Joe (JMVA, 05) Semiparametric approach: Genest, Khoudi & Rivest (Bmka, 95) Nonparametric approach: Pickands (BISI, 81); Capéraà, Fougères & Genest (Biomka, 97). Copula goodness-of-fit and selection (Genest, Remillard & Beaudoin, IME 09; Genest, Quessy & Remillard, SJS 06; Fermanian, JMVA 05; Berg, Eur. J. of Finance, 09).

5 Conditional Copula Consider a random sample {x i R d, y 1i R, y 2i R} 1 i n and suppose F X (y 1 ) and G X (y 2 ) are the unknown marginal conditional cdf s. The bivariate conditional copula (CC) of (Y 1, Y 2 ) X = x, is the conditional joint distribution function of U = F x (Y 1 ) and V = G x (Y 2 ) given X = x (Patton, Int l Econ. Rev. 06) H x (t, s) = C x (F x (t), G x (s)) The parametric bivariate CC model assumes there is a parametric family C = {C θ : θ Θ} s.t. C x (F x (t), G x (s)) = C θ(x) (F x (Y 1 ), G x (Y 2 )). The simplifying assumption: C x (F x (y 1 ), G x (y 2 )) = C(F x (y 1 ), G x (y 2 )).

6 Why CC? Understanding the Dependence Pattern We are interested in understanding the covariate effect on the dependence pattern between responses. The smoking cessation study of Liu, Daniels and Marcus (JASA 09) : Q = smoking cessation (0=No, 1=Yes) W = weight change X = time spent exercising Does exercise weaken the association between smoking status and weight gain?

7 Why CC? Building General Multivariate Distributions Joint models for multivariate data. If the joint distribution of U 1, U 2, U 3 (U i Uniform(0, 1), 1 i 3) is modelled using the pair copula model then c(u 1, u 2, u 3 ) = c 12 (u 1, u 2 )c 23 (u 2, u 3 )c 13 2 (u 1 2, u 3 2 ; u 2 ) where u k 2 = Pr(U k u k U 2 = u 2 ). As dimension increases, the bivariate conditional copulas depend on increasing number of variables.

8 Why CC? Regression-based prediction In a bivariate CC model the joint density is h x (y 1, y 2 ) = f x (y 1 )g x (y 2 )c θ(x) (F x (y 1 ), G x (y 2 )). The conditional density of Y 1 Y 2 = y 2, X = x is h x (y 1 y 2 ) = f x (y 1 )c θ(x) (F x (y 1 ), G x (y 2 )). This can be useful when for each item a subset of the response variables is much easier to measure than the rest.

9 Propagation of Errors Errors can appear from multiple sources: c θ(x)+δ3 (x)(f x (y 1 ) + δ 1 (x), G x (y 2 ) + δ 2 (x)) = error due to F x {}}{ + c (1,0,0) θ(x) (F x (y 1 ), G x (y 2 )) δ 1 (x) + error due to θ(x) {}}{ + c (0,0,1) θ(x) (F x (y 1 ), G x (y 2 ))δ 3 (x) +O( δ(x) 2 ) correct part {}}{ c θ(x) (F x (y 1 ), G x (y 2 )) error due to G x {}}{ c (0,1,0) θ(x) (F x (y 1 ), G x (y 2 ))δ 2 (x) +

10 Estimation of θ(x) - frequentist approaches Acar, Craiu & Yao (Biomcs, 2011) - semiparametric estimation. Parametric marginals, θ(x) is approximated nonparametrically via local polynomial estimation. Veraverbeke, Omelka & Gijbels (SJS, 2011) - nonparametric estimation of the copula and marginals. We observed that the copula estimator may be severely biased if any of the conditional marginal distributions change with the value of the covariate X = x (V., O. & G, 2011) Nonparametric estimates in large-ish dimensions d suffer from curse of dimensionality, unless the volume of data is huge.

11 CC: The SA Condition When d = dim(x ) >> 1 the curse of dimensionality can be alleviated with dimension reduction models. The SA leads to dimension crushing : C x (F x (y 1 ), G x (y 2 )) = C(F x (y 1 ), G x (y 2 )) but when is it justifiable? Acar, Genest & Ne slehová (JMVA, 2012) - discuss the bias incurred when SA is not justified. Acar, Craiu & Yao (EJS, 2013) - Generalized LRT to test a constant or linear null calibration against a general alternative. Gijbels, Omelka & Veraverbeke (Statistics, 2016) - nonparametric testing procedures. Derumigny & Fermanian (arxiv, 2016) - review of state-of-art and a work program around SA for the next years.

12 CC: d = dim(x ) > 1 Sabeti, Wei & Craiu (Stat, 2014) - Bayesian additive CC models. Chavez-Demoulin & Vatter (JMVA, 2015) - Generalized additive models. Lobato, Lloyd & Lobato (NIPS, 2013) - Gaussian Process models for CC in financial time series. Levi & Craiu - Gaussian Process with Single Index Models.

13 Why a Bayesian approach? Joint modelling can be a bit easier as selection of most tuning parameters is automatic and data driven. The posterior distribution accounts for all sources of variation (included in the model). The Monte Carlo samples from the posterior are used to compute finite sample variance estimates, pointwise credible regions and model selection criteria. Priors can be used to favour model sparsity.

14 Additive Model for Calibration X R d Y 1, Y 2 are continuous r.v. s, Y i N (µ i (X ), σi 2 ), for i = 1, 2. Jointly, 2 ( ) 1 Yi µ i (X ) h X (Y 1, Y 2 ) = φ σ i σ i i=1 [ ( ) ( ) Y1 µ 1 (X ) Y2 µ 2 (X ) c Φ, Φ θ(x )], where c(u, v θ) is the pdf of the conditional copula. σ 1 σ 2

15 Additive Model for Calibration Usually there is little/no information about the shape of θ(x ). Generally θ(x ) has a restricted range, so we estimate the calibration function η : R d R where g(θ(x i )) = η(x i ) (g is user-specified). We assume that η(x 1,..., x d ) = α 0 + d ηi (x i ). i=1 Additivity is not preserved when changing dependence measure.

16 Additive Model for Calibration Sabeti, Wang & Craiu (Stat, 2013) use an AM: Each η i has the form: η i (x i ) = 3 j=1 α (i) j x j K (i) i + k=1 ψ (i) k (x i γ (i) k )3 +. Number and location of knots {γ1,..., γ K (i)} is important. The knot-related choices are data driven. Partition the range of X i into K max intervals, I (i) k, and introduce auxiliary variables { 1 if there is a knot γ (i) k in I (i) k ζ (i) k = η i (x i) = 3 and ψ (i) k 0 0 if there is no knot in I (i) k and ψ k = 0 j=1 α(i) j x j i + K max k=1 ψ(i) k ζ(i) k (x i γ (i) k )3 +,

17 Gaussian Process Prior for CC GP is a flexible method when errors are reasonably approximated by Gaussians. GP for marginals when means could be nonlinear functions of X. GP for calibration function could be used in conjunction with other marginal models. Vanilla GP is not helping with the curse of dimensionality and can be expensive when n is large so modifications are needed.

18 Gaussian Process Prior GP prior for smooth f without specifying the form of f. For x [ 5, 5] n, consider f N n (0, K(x, x)) where K ij (x, x) = k(x i, x j ) and f i = f (x i ) L = 1 L = output, f(x) 0 output, f(x) input, x input, x Random functions f generated from a GP prior when n = 100 Cov(f (x i ), f (x j )) = k(x i, x j ) = exp{ 0.5 x i x j 2 L }.

19 Gaussian Process Estimation Observe {y i : 1 i n} noisy realizations of f (x i ) i=1,n, y i = f (x i ) + ɛ i, ɛ i N(0, σ 2 ). When interested in predicting f = (f (x j )) j=1,q use ( y f ) ( [ K(x, x) + σ N n+q 0, 2 I n K(x, x ) K(x, x ) K(x, x ) The conditional distribution of f is Gaussian with E(f y) = K(x, x) expensive for large n {}}{ [K(x, x) + σ 2 I q ] 1 y V (f y) = K(x, x ) K(x, x) [K(x, x) + σ 2 I q ] 1 K(x, x ) }{{} expensive for large n ])

20 Computational challenges for GP When n is large the computation effort is prohibitive so we adopt a sparse GP approach. The information about f in the data is funnelled using a smaller sample of size m << n of inducing (or latent) variables x g, 1 g m. Let f = (f ( x 1 ),..., f ( x m )) T we assume f N(0, K( X, X ). The (Gaussian) conditional distribution of f X, X involves K(X, X ) and the inverse of a m m matrix. Integrating out f yields the (Gaussian) conditional distribution of f X, X with similar computational burden.

21 Gaussian Process Estimation Suppose we know that f : R R and ( 4, 3, 1, 0, 2) f ( 2, 0, 1, 2, 1) y i f i N(f i, σ 2 ), f i = f (x i ), 1 i 5 L = 1 L = output, f(x) 0 output, f(x) input, x input, x Samples from the posterior distribution of f y, when σ 2 = 0.1.

22 Modelling η when d > 1 Assume that θ(x i ) = g 1 (f (X i )) and f = (f (X 1 ), f (X 2 ),..., f (X n )) T N (0, K(X, X ; w)), The (i, j)th element of matrix K(X, X ; w) is k(x i, x j ; w) = e w 0 exp [ ] d (x is x js ) 2. s=1 The covariance between outputs is defined as a function of inputs. The parameters w in the covariance function k determine distance between inputs with a significant difference in the outputs. e ws

23 Modelling η when d > 1 When X i R d a full GP model for the CC has 1 + d parameters. We consider instead the SIM model f (X ) = f (β T X ). GP-SIM model is invariant to nonlinear one-to-one transformations τ(θ). The parameter β is unidentifiable up to a constant so we assume β = 1. Marginals are fitted also using GP-SIM models.

24 Proof of concept Sc1 f 1 (x) = 0.6 sin(5x 1 ) 0.9 sin(2x 2 ), f 2 (x) = 0.6 sin(3x 1 + 5x 2 ), τ(x) = sin(15x T β) β = (1, 3) T / 10, σ 1 = σ 2 = 0.2 n = 400 Clayton Frank Gaussian Clayton SA Scenario IBias 2 IVar IMSE IBias 2 IVar IMSE IBias 2 IVar IMSE IBias 2 IVar IMSE Sc Integrated error for the estimator of τ(x).

25 Estimation of τ(x) X1=0.2 X1=0.8 Kendall's tau Kendall's tau X X2 X2=0.2 X2=0.8 Kendall's tau Kendall's tau X X1

26 Prediction performance If y i x N(µ i (x), σi 2 ), i = 1, 2 then 1 ( )) E x [Y 1 Y 2 = y 2 ] = µ 1 (x)+σ 1 Φ 1 y2 µ 2 (x) (z)c θ(x) (z, Φ dz. 0 σ 2 E(Y1), X1=0.2 E(Y1), X1=0.8 E(Y1) E(Y1) X X2 E(Y1), X2=0.2 E(Y1), X2=0.8 E(Y1) E(Y1) X X1

27 Model misspecification effects Marginals: f1 (x) = 0.6 sin(5x 1 ) 0.9 sin(2x 2 ) f2 (x) = 0.6 sin(3x 1 + 5x 2 ) σ 1 = σ 2 = 0.2 Copula: β = (1, 3)/ 10 τ(x) = sin(5x T β) Frank copula Y1 and X test points: Y 1 = 1.5, X = (0.3, 0.3) Y1 = 1.0, X = (0.3, 0.7) Y1 = 0.5, X = (0.7, 0.3) Y1 = 0.0, X = (0.7, 0.7) Y 1 = 0.5, X = (0.5, 0.5)

28 Model misspecification effects Y2 Y1,X Tau(X) Y Tau Y Y1 Black - true; Red - Correct model; Blue - Correct copula with SA Green - Wrong copula; Purple - Correct copula with missing covariate

29 Model misspecification effects Marginals f1 (x) = 0.6 sin(5x 1 ) 0.9 sin(2x 2 ) f 2 (x) = 0.6 sin(3x 1 + 5x 2 ) σ 1 = σ 2 = 0.2, X 1 X 2. Copula: τ(x) = 0.71 Model: A nonparametric model for marginals and CC based on only x1.

30 Model misspecification effects Marginals f1 (x) = 0.6 sin(5x 1 ) 0.9 sin(2x 2 ) f 2 (x) = 0.6 sin(3x 1 + 5x 2 ) σ 1 = σ 2 = 0.2, X 1 X 2. Copula: τ(x) = 0.71 Model: A nonparametric model for marginals and CC based on only x1. Kendalls Tau x_1

31 Model misspecification effects Suppose E[Y i X 1, X 2 ] = f i (α i X 1 + β i X 2 ), i=1,2. Then fit {}}{ f i (x 1, x 2 ) = f i (x 1, 0) +f (0,1) i (x 1, 0)β i x 2 + f (0,2) i (x 1, 0) β2 i x O( x2 3 ), i = 1, 2 The marginal residuals still contain high-order information about x 1 which varies with the distribution of x 2.

32 Model misspecification effects Suppose E[Y i X 1, X 2 ] = f i (α i X 1 + β i X 2 ), i=1,2. Then fit {}}{ f i (x 1, x 2 ) = f i (x 1, 0) +f (0,1) i (x 1, 0)β i x 2 + f (0,2) i (x 1, 0) β2 i x O( x2 3 ), i = 1, 2 The marginal residuals still contain high-order information about x 1 which varies with the distribution of x 2. Kendalls Tau Kendalls Tau Kendalls Tau x_ x_ x_1

33 Three Model Selection Problems Choice of copula family. Choice of calibration Simplifying Assumption or not? AM or GP-SIM? Covariate selection.

34 CV Marginal Likelihood (CVML) Calculates the average (over parameter values) prediction potential for model M via CVML(M) = n log (P(Y 1i, Y 2i D i, M)), i=1 D i is the data set from which the ith observation has been removed. Estimate CVML using E π [ P(Y1i, Y 2i ω, M) 1] = P(Y 1i, Y 2i D i, M) 1 where ω represents the vector of all the parameters and latent variables in the model.

35 Watanabe-Akaike Information Criterion An alternative approach to estimating the log pointwise predictive density (Watanabe, JMLR 10; Vehtari, Gelman & Gabry, Stat. Comput, 16). WAIC is defined as WAIC(M) = 2fit(M) + 2p(M) N fit(m) = i=1 log E π [P(y 1i, y 2i ω, M)] N p(m) = i=1 Var π[log P(y 1i, y 2i ω, M)]. fit(m) and the penalty p(m) are computed using the posterior samples.

36 Another Two Scenarios Sc3 f 1 (x) = 0.6 sin(5x 1 ) 0.9 sin(2x 2 ) f 2 (x) = 0.6 sin(3x 1 + 5x 2 ) τ(x) = 0.5 σ 1 = σ 2 = 0.2 SA is true Sc4 f 1 (x) = 0.6 sin(5x 1 ) 0.9 sin(2x 2 ) f 2 (x) = 0.6 sin(3x 1 + 5x 2 ) η(x) = sin(3x 3 1 ) 0.5 cos(6x 2 2 ) σ 1 = σ 2 = 0.2 AM model is true

37 Copula Selection - Results Frank Gaussian Clayton SA Scenario CVML WAIC CVML WAIC CVML WAIC Sc1 100% 100% 100% 100% 98% 98% Sc4 100% 100% 100% 100% 100% 100%

38 Calibration Selection - Results Clayton Frank Gaussian Scenario CVML WAIC CVML WAIC CVML WAIC Sc3 78% 78% 100% 100% 100% 100% Scenario CVML WAIC Sc 1 96% 96% Sc4 94% 94%

39 Red Wine 11 Physiochemical properties of 1599 varieties of red Vinho Verde Portuguese wine. We consider the dependence between fixed acidity and density. The former is strongly associated with quality of wine, while the latter is used as a measure of grape quality. Covariates: volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, ph, sulphates, alcohol.

40 Red Wine volatile.acidity citric.acid residual.sugar Kendalls Tau Kendalls Tau Kendalls Tau chlorides free.sulfur.dioxide total.sulfur.dioxide Kendalls Tau Kendalls Tau Kendalls Tau ph sulphates alcohol Kendalls Tau Kendalls Tau Kendalls Tau

41 Red Wine 2.5% quantile mean 97.5% quantile volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide ph sulphates alcohol Red wine data: posterior estimates and credible regions for β.

42 Conclusions Flexible modelling for CC is necessary. CC models can be useful in prediction when the dependence is strong. The SA is difficult to detect without a properly calibrated penalty for model complexity. Impact of model misspecification can be significant. Robust CC s? How to extend these methods to more than two responses (e.g. vines). CC-based models in clustering, adaptive MCMC, spatial models, statistical genetics.

arxiv: v3 [stat.me] 25 May 2017

arxiv: v3 [stat.me] 25 May 2017 Bayesian Inference for Conditional Copulas using Gaussian Process Single Index Models Evgeny Levi Radu V. Craiu arxiv:1603.0308v3 [stat.me] 5 May 017 Department of Statistical Sciences, University of Toronto