Copula modeling for discrete data

Similar documents
A measure of radial asymmetry for bivariate copulas based on Sobolev norm

Accounting for extreme-value dependence in multivariate data

Modelling Dependent Credit Risks

Modelling Dependence with Copulas and Applications to Risk Management. Filip Lindskog, RiskLab, ETH Zürich

Financial Econometrics and Volatility Models Copulas

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Optimal exact tests for complex alternative hypotheses on cross tabulated data

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

MULTIDIMENSIONAL POVERTY MEASUREMENT: DEPENDENCE BETWEEN WELL-BEING DIMENSIONS USING COPULA FUNCTION

Clearly, if F is strictly increasing it has a single quasi-inverse, which equals the (ordinary) inverse function F 1 (or, sometimes, F 1 ).

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Construction and estimation of high dimensional copulas

Bivariate Rainfall and Runoff Analysis Using Entropy and Copula Theories

Estimation of Copula Models with Discrete Margins (via Bayesian Data Augmentation) Michael S. Smith

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

Gaussian Process Vine Copulas for Multivariate Dependence

Approximation of multivariate distribution functions MARGUS PIHLAK. June Tartu University. Institute of Mathematical Statistics

Unit 14: Nonparametric Statistical Methods

Modelling and Estimation of Stochastic Dependence

Multivariate Distribution Models

Copulas and dependence measurement

Bivariate Paired Numerical Data

On the empirical multilinear copula process for count data

Copula-based tests of independence for bivariate discrete data. Orla Aislinn Murphy

Copulas. Mathematisches Seminar (Prof. Dr. D. Filipovic) Di Uhr in E

Semi-parametric predictive inference for bivariate data using copulas

Identification of marginal and joint CDFs using Bayesian method for RBDO

Application of Copulas as a New Geostatistical Tool

Forecasting time series with multivariate copulas

Non parametric estimation of Archimedean copulas and tail dependence. Paris, february 19, 2015.

GOODNESS-OF-FIT TESTS FOR ARCHIMEDEAN COPULA MODELS

A copula goodness-of-t approach. conditional probability integral transform. Daniel Berg 1 Henrik Bakken 2

On consistency of Kendall s tau under censoring

Trivariate copulas for characterisation of droughts

Contents 1. Coping with Copulas. Thorsten Schmidt 1. Department of Mathematics, University of Leipzig Dec 2006

Lecture 2: Repetition of probability theory and statistics

Properties of Hierarchical Archimedean Copulas

Copulas and Measures of Dependence

Math 494: Mathematical Statistics

A PRIMER ON COPULAS FOR COUNT DATA ABSTRACT KEYWORDS

Probability Distributions and Estimation of Ali-Mikhail-Haq Copula

CHAPTER 1 INTRODUCTION

Simulating Realistic Ecological Count Data

A note about the conjecture about Spearman s rho and Kendall s tau

First steps of multivariate data analysis

Semiparametric Gaussian Copula Models: Progress and Problems

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Semiparametric Gaussian Copula Models: Progress and Problems

On a simple construction of bivariate probability functions with fixed marginals 1

Application of copula-based BINAR models in loan modelling

Copula-based Logistic Regression Models for Bivariate Binary Responses

8 Copulas. 8.1 Introduction

Binary response data

Approximate Bayesian inference in semiparametric copula models

On Rank Correlation Measures for Non-Continuous Random Variables

Outline. Frailty modelling of Multivariate Survival Data. Clustered survival data. Clustered survival data

The Influence Function of the Correlation Indexes in a Two-by-Two Table *

Learning Objectives for Stat 225

Hybrid Copula Bayesian Networks

Correlation: Copulas and Conditioning

Robustness of a semiparametric estimator of a copula

EXPECTED VALUE of a RV. corresponds to the average value one would get for the RV when repeating the experiment, =0.

Package HMMcopula. December 1, 2018

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

1 Exercises for lecture 1

STATISTICS SYLLABUS UNIT I

Probability and Distributions

Understand the difference between symmetric and asymmetric measures

Chapter 5 continued. Chapter 5 sections

ECON 4160, Autumn term Lecture 1

Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business

Linear Models and Estimation by Least Squares

STAT 512 sp 2018 Summary Sheet

Chapter 3: Maximum Likelihood Theory

Frequentist-Bayesian Model Comparisons: A Simple Example

arxiv: v5 [stat.ml] 6 Oct 2018

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Contents 1. Contents

A Goodness-of-fit Test for Semi-parametric Copula Models of Right-Censored Bivariate Survival Times

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21

Chapter 13 Correlation

Vine Copulas. Spatial Copula Workshop 2014 September 22, Institute for Geoinformatics University of Münster.

MAS223 Statistical Inference and Modelling Exercises

Local sensitivity analyses of goodness-of-fit tests for copulas

Algorithms for Uncertainty Quantification

Machine learning - HT Maximum Likelihood

Vine copulas with asymmetric tail dependence and applications to financial return data 1. Abstract

Counts using Jitters joint work with Peng Shi, Northern Illinois University

Marginal Specifications and a Gaussian Copula Estimation

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Testing Statistical Hypotheses

Chapter 5. Chapter 5 sections

Risk Aggregation with Dependence Uncertainty

Songklanakarin Journal of Science and Technology SJST R1 Sukparungsee

STA 2201/442 Assignment 2

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

Karhunen-Loeve Expansion and Optimal Low-Rank Model for Spatial Processes

Lehrstuhl für Statistik und Ökonometrie. Diskussionspapier 87 / Some critical remarks on Zhang s gamma test for independence

Transcription:

Copula modeling for discrete data Christian Genest & Johanna G. Nešlehová in collaboration with Bruno Rémillard McGill University and HEC Montréal ROBUST, September 11, 2016

Main question Suppose (X 1, Y 1 ),..., (X n, Y n ) are from a discrete distribution H. More specifically, assume X, Y {0, 1,...}. Can we still do copula modeling?

Lack of uniqueness of the copula In the continuous case, there is a unique function C : [0, 1] 2 [0, 1] such that H(x, y) = C{F (x), G(y)}, x, y R. In the discrete case, there are several functions A : [0, 1] 2 [0, 1] such that H(x, y) = A{F (x), G(y)}, x, y R. This class of functions is denoted A, but note that not all its members are copulas!

Lack of uniqueness of the copula In the continuous case, C is the distribution function of the pair (U, V ) = (F (X ), G(Y )), i.e., C(u, v) = Pr(U u, V v), u, v (0, 1). In the discrete case, D(u, v) = Pr(U u, V v), u, v (0, 1). is a distribution function too, but it is not a copula! As a consolation, D A.

Many paradoxical results follow... As soon as F or G are discrete, the set C H of admissible copulas is infinitely large (though its bounds can be identified). Members of C H can embody completely different types of dependence. Measures of association, dependence concepts, and orderings become margin dependent. Genest & Nešlehová (2007), ASTIN Bulletin.

Saving the connection between copulas and dependence? 1. H defines a contingency table. 2. Spread the mass uniformly in each cell. 3. Call the resulting copula C C H the bilinear extension copula. Illustration for Bernoulli variates X and Y :

Our main discovery C is the best possible candidate if you want to think of the copula associated with a discrete H, because... C is an absolutely continuous copula. X Y C (u, v) = uv. For any concordance measure, κ(h) = κ(c ). If ( X, Ỹ ) is distributed as C, then DEP(X, Y ) DEP( X, Ỹ ). Here, DEP can refer to PQD, LTD, RTI, SI, LRD.

Are copula models for discrete data of interest? A copula model for H, viz. H(x, y) = C{F (x), G(y)} with F (F α ), G (G β ) and C (C θ ) is a perfectly valid construction, even if F and G are discrete.

Yes, they are! 0 2 4 6 8 Y Clayton copula with Kendall's tau 0.5 Gumbel copula with Kendall's tau 0.5 0 1 2 3 4 5 6 7 0 2 4 6 8 Y Y Gauss copula with Kendall's tau 0.5 0 2 4 6 8 10 12 14 X 0 2 4 6 8 10 12 X 0 2 4 6 8 10 12 X 0 2 4 6 8 Y t copula with 4 dof and Kendall's tau -0.5 Clayton copula with Kendall's tau -0.5 0 2 4 6 8 0 1 2 3 4 5 6 7 Y Y Frank copula and Kendall's tau -0.5 0 2 4 6 8 10 12 X 0 5 10 15 X 0 2 4 6 8 10 12 X X P(5) and Y G(0.6) with various copulas and positive (top) and negative (bottom) association.

Theoretical back-up H often inherits dependence properties from C because DEP(U, V ) DEP(X, Y ), where (U, V ) C and DEP means PQD, LTD, RTI, SI, LRD. θ can continue to govern association between X and Y, viz. C θ PQD C θ H θ PQD H θ.

Can we do inference? Assume (X 1, Y 1 ),..., (X n, Y n ) is an iid sample from with F and G discrete. H θ (x, y) = C θ {F (x), G(y)} How can one fit and validate such a model? The quest for the answer is the subject of our ongoing research.

Naïve suggestion Draw 10,000 samples (X 1, Y 1 ),..., (X n, Y n ) of size n = 100 from H θ (x, y) = C θ {F (x), G(y)}, where C θ is a Clayton copula and F, G are discrete distributions. Since τ = θ/(θ + 2), pick ˆτ {τ n, τ a,n, τ b,n } and let ˆθ = 2 ˆτ 1 ˆτ.

Illustration: Poisson margins Take θ = 2 and Poisson margins with E(X ) = 1 and E(Y ) = 2. ˆθ based on τ n ˆθ based on τa,n ˆθ based on τb,n

What is going on? It can be seen that τ n is an unbiased estimator of τ(h) = τ(c ). BUT: C C θ for most copula families except at independence. In general, τ a,n and τ b,n are biased estimators of τ(c θ ) because X i = F 1 (U i ) and Y i = G 1 (V i ) (F (X i ), G(Y i )) C θ. In short, the discretization of (U i, V i ) is irreversible.

Can t we just randomize? 1. Take a sample (X 1, Y 1 ),..., (X n, Y n ) from H whose margins are count distributions. 2. Add an independent noise to each component of the pair (X i, Y i ), viz. X i = X i + U i 1, Ỹ i = Y i + V i 1, where U 1,..., U n and V 1,..., V n are independent samples from the standard uniform distribution on (0, 1). 3. The randomized sample ( X 1, Ỹ 1 ),..., ( X n, Ỹ n ) then stems from a distribution whose margins are continuous.

Two additional estimators 4. Compute τ n ( X, Ỹ ), the sample version of Kendall s tau based on the randomized sample ( X 1, Ỹ 1 ),..., ( X n, Ỹ n ). This gives a moment-estimate of θ, viz. ˆθ = g 1 ( τ n ). 5. Alternatively, one can compute the pseudo-likelihood estimate of θ based on the randomized sample. 6. To eliminate the uncertainty induced by randomization, repeat the previous steps N times and compute the average of the values of the estimate of θ.

... that do not work either. Average and st. deviation of six estimates of θ in the Illustration: Estimate of θ based on τ n(u, V ) τ n(x, Y ) τ a,n(x, Y ) τ b,n (X, Y ) τ n( X, Ỹ ) MLE( X, Ỹ ) Av. 2.039 1.262 4.358 2.213 1.269 1.144 S.d. 0.446 0.243 1.495 0.537 0.285 0.779 Randomization is bound to fail, because the copula of ( X, Ỹ ) is C. However, remember that C C θ.

The empirical bilinear extension copula Compute the empirical cdf H n corresponding to the sample (X 1, Y 1 ),..., (X n, Y n ). Denote its bilinear extension copula by C n. C n is explicit; its density is given by n ij cn (u, v) = n n i n j for all u (F n (i 1), F n (i)], v (G n (j 1), G n (j)], i, j N. Observe that C n is rank-based.

Wait a minute... A sample (X 1, Y 1 ),..., (X n, Y n ) defines a contingency table. Pearson s chi-squared statistic for testing independence χ 2 = I J i=1 j=1 (n ij n i n j /n) 2 n i n j /n Spearman s mid-rank coefficient for testing monotone trend { n } ρ n = 12 (R n 3 i R)(S i S) i=1 Kendall s coefficient for testing monotone trend τ n = 2 n 2 {#(concordant pairs) #(discordant pairs)}

Surprise! It can be seen that χ 2 = n ρ n = 12 1 1 0 0 1 1 0 τ n = 1 + 4 {c n (u, v) 1} 2 du dv, {C n (u, v) uv}du dv, 0 1 1 0 0 C n (u, v) dc n (u, v).

Here s an idea In the continuous case, many inferential procedures derive from the limiting behavior of the empirical copula process C n = n (C n C). In the discrete case, one can investigate the asymptotic behavior of the empirical Maltese copula process C n = n (C n C ), hoping that it would be as useful as in the continuous case.

Known margins in the continuous case When F and G are known and continuous, C can be estimated by the empirical distribution function B n of the sample (F (X i ), G(Y i )), 1 i n. It is well-known that in this case, B n = n (B n C) converges weakly in C[0, 1] 2 to a C-Brownian sheet B C, i.e., to a centered Gaussian process with covariance function cov{b C (u, v), B C (w, z)} = C(u w, v z) C(u, v)c(w, z).

Known margins in the discrete case When F and G are known and supported on N, C can be estimated by bilinear interpolation Bn of the empirical distribution function of the sample (F (X i ), G(Y i )), 1 i n. Theorem As n, the process B n = n (Bn C ) converges weakly in C[0, 1] 2 to a centered Gaussian process B C. Here, B C is no longer a C -Brownian sheet, but a bilinear interpolation thereof.

Illustration The limiting process B C is illustrated below in the univariate case when F is binomial with p = 0.4 and N = 3, 10 and 100. Displayed are ten realizations of B n when n = 5000. process C_n -1.0-0.5 0.0 0.5 process C_n -0.5 0.0 0.5 process C_n -1.0-0.5 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.8 1.0 u 0.0 0.2 0.4 0.6 0.8 1.0 u 0.0 0.2 0.4 0.6 0.8 1.0 u

Unknown margins in the continuous case Under suitable regularity conditions, the process C n = n (C n C) converges weakly in C[0, 1] 2 to a centered Gaussian process C, C(u, v) = B C (u, v) u C(u, v) B C (u, 1) v C(u, v) B C (1, v).

Bad news in the discrete case Suppose that H is a bivariate Bernoulli distribution with F (0) = p, G(0) = q, H(0, 0) = r, where p, q (0, 1) and r = C(p, q) for some copula C. It can be established that the finite-dimensional margins of C n converge in law, although the limit may not be Gaussian. However, the sequence C n probability unless r = pq. is not asymptotically equicontinuous in In other words, C n does not converge in C[0, 1] 2 unless r = pq.

What went wrong? To illustrate, consider a similar process in the univariate case. Take F Bernoulli with F (0) (0, 1) and let F n be its empirical counterpart based on a sample of size n from F. Set F (0) = p and F n (0) = p n and consider { { u, u [0, pn ] u, u [0, p] E n (u) = (1 u) 1 p n p n, u [p n, 1], E(u) = (1 u). 1 p p, u [p, 1] Then E n = n (E n E) does not converge in law in C[0, 1] even though its finite-dimensional margins converge.

Illustration n=1000-1.5-1.0-0.5 0.0 0.5 1.0 1.5 Frequency 0 500 1000 1500 2000 2500 3000 Frequency 0 100 200 300 400 500 600 0.0 0.2 0.4 0.6 0.8 1.0 u -3.5-3.0-2.5-2.0-1.5-1.0-0.5 0.0 p=0.4-2 -1 0 1 2 3 p=0.5 Ten realizations of the process E n for the Bernoulli distribution with p = 0.4 and sample size 1000 (left). Histograms of 5, 000 realizations of E n (u) when u = 0.4 (middle) and u = 0.5 (right) based on samples of size n = 10, 000.

All s well that ends well! Consider the set O = ( ) ( ) F (k 1), F (k) G(l 1), G(l). (k,l) N 2 Theorem Let K be an arbitrary compact subset of O. Then C n converges weakly on C(K) as n to C given for every u, v O by B C (u, v) u C (u, v) B C (u, 1) v C (u, v) B C (1, v), where B C is the weak limit of B n.

Example: Spearman s rho Consider the non-normalized version of Spearman s rho, viz. ρ = ρ(h) = ρ(c ). Its consistent estimator is given by ρ n = 12 n 3 n (R i R)(S i S) = ρ(cn ), i=1 where R i and S i are the componentwise mid-ranks. Consequently, n {ρ n ρ(h)} = 12 1 0 1 0 C n (u, v) du dv.

It works! Because [0, 1] 2 \ O has Lebesgue measure zero, 12 1 1 0 0 C n (u, v) du dv = 12 C n (u, v) du dv. O Furthermore, O can be approximated arbitrarily closely by compact sets. This lies at the heart of the following result: Theorem As n, n {ρ n ρ(h)} 12 C (u, v) du dv. O

References C. Genest & J. Nešlehová (2007). A primer on copulas for count data. The ASTIN Bulletin, 37, 475 515. C. Genest & J.G. Nešlehová & B. Rémillard (2014). On the empirical multilinear copula process for count data. Bernoulli, 20, 1344 1371. C. Genest & J.G. Nešlehová & B. Rémillard (2014). Asymptotics of the empirical multilinear copula process. Submitted.

Research funded by