Characterization of multivariate Bernoulli distributions with given margins

arXiv: v1 [math.ST] 5 Jun 2017

Roberto Fontana (1) and Patrizia Semeraro (2)

(1) Department of Mathematical Sciences, Politecnico di Torino, roberto.fontana@polito.it
(2) Department of Mathematical Sciences, Politecnico di Torino, patrizia.semeraro@polito.it

Abstract

We express each Fréchet class of multivariate Bernoulli distributions with given margins as the convex hull of a set of densities, which belong to the same Fréchet class. This characterisation allows us to establish whether a given correlation matrix is compatible with the assigned margins and, if it is, to easily construct one of the corresponding joint densities. We reduce the problem of finding a density belonging to a Fréchet class and with given correlation matrix to the solution of a linear system of equations. Our methodology also provides the bounds that each correlation must satisfy to be compatible with the assigned margins. An algorithm and its use in some examples are shown.

Keywords: Algebraic statistics; Correlation; Fréchet class; Multivariate binary distribution; Simulation.

1 Introduction

Dependent binary variables play a key role in many important scientific fields such as clinical trials and health studies. The problem of the simulation of correlated binary data is extensively addressed in the statistical literature, e.g. [3], [6], [15] and [9]. Simulation studies are a useful tool for analysing extensions or alternatives to current estimating methodologies, such as generalised linear mixed models, or for the evaluation of statistical procedures for marginal regression models ([13]). The simulation problem consists of constructing multivariate distributions for given Bernoulli marginal distributions and a given correlation matrix ρ. Frequently, assumptions are made about the correlation structure. Probably the most common is equicorrelation, e.g. [3]. A popular approach also uses working correlation matrices ([10] and [16]), such as first order moving average correlations or first order autoregressive correlations ([12] and references therein).

An important issue for these simulation procedures is the compatibility of marginal binary variables and their correlations, since problems may arise when the margins and the correlation matrix are not compatible ([4], [14] and [3]). The range of admissible correlation matrices for binary variables is well known in the bivariate case. This problem has been widely identified in the literature, but, to the best of our knowledge, no effective solution exists for multivariate binary distributions with more than three variables ([3]).

We propose a new but simple methodology to characterise Bernoulli variables belonging to a given Fréchet class, i.e. with given marginal distributions. This characterisation allows us to establish whether a given correlation matrix is compatible with the assigned margins and, if it is, to easily construct one of the corresponding joint densities. It also provides the bounds that each correlation must satisfy to be compatible with the assigned margins. Furthermore, if the correlation structure and the margins are not compatible, we can find a new correlation matrix which is close to the desired one but compatible with the given margins. It is worth noting that this methodology puts no restriction either on the number of variables or on the correlation structure. It also provides a new computational procedure to simulate multivariate distributions of binary variables with assigned margins and given moments.

The proposed methodology is based on a polynomial representation of all the multivariate Bernoulli distributions of a given Fréchet class, i.e. of all the distributions with fixed Bernoulli margins. This representation is linked to the Farlie-Gumbel-Morgenstern copula ([11]). It allows us to write each Fréchet class as the convex hull of the ray densities, which are densities that belong to the Fréchet class under consideration. By so doing, the problem of finding one distribution with given moments in a Fréchet class is reduced to the solution of a linear system of equations.

2 Preliminaries

Let $\mathcal{F}_m$ be the set of $m$-dimensional distributions which have Bernoulli univariate marginal distributions. Let us consider the Fréchet class $\mathcal{F}(p_1,\dots,p_m) \subseteq \mathcal{F}_m$ of distribution functions in $\mathcal{F}_m$ which have the same Bernoulli marginal distributions $B(p_i)$, $0 < p_i < 1$, $i = 1,\dots,m$. If $X = (X_1,\dots,X_m)$ is a random vector with joint distribution in $\mathcal{F}(p_1,\dots,p_m)$, we denote

- its cumulative distribution function by $F_p$ and its density function by $f_p$, where $p = (p_1,\dots,p_m)$;
- the column vectors which contain the values of $F_p$ and $f_p$ over $S_m := \{0,1\}^m$, with a small abuse of notation, still by $F_p = (F_p(x) : x \in S_m)$ and $f_p = (f_p(x) : x \in S_m)$ respectively; we make the non-restrictive hypothesis that $S_m$ is ordered according to the reverse-lexicographical criterion;
- the marginal cumulative distribution function and the marginal density function of $X_i$ by $F_{p,i}$ and $f_{p,i}$ respectively, $i = 1,\dots,m$;
- the values $f_{p,i}(0) \equiv F_{p,i}(0)$ and $f_{p,i}(1)$ by $q_i$ and $p_i$ respectively, $i = 1,\dots,m$.

We observe that $q_i = 1 - p_i$ and that the expected value of $X_i$ is $p_i$, $E[X_i] = p_i$, $i = 1,\dots,m$.

Given two matrices $A \in M(n \times m)$ and $B \in M(d \times l)$, the matrix $A \otimes B \in M(nd \times ml)$ indicates their Kronecker product and $A^{\otimes n}$ is $\underbrace{A \otimes \dots \otimes A}_{n \text{ times}}$.

If we consider a Bernoulli variable $B(\tau)$, $0 < \tau < 1$, with $F_\tau$ and $f_\tau$ as cumulative and density function respectively, the following holds:

$$\begin{pmatrix} f_\tau(0) \\ f_\tau(1) \end{pmatrix} = D \begin{pmatrix} F_\tau(0) \\ F_\tau(1) \end{pmatrix}, \qquad D = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix},$$

where $D$ is the difference matrix. It follows that, given $F_p$ and $f_p$ in $\mathcal{F}(p_1,\dots,p_m)$, we have

$$f_p = D^{\otimes m} F_p. \tag{2.1}$$

Finally, we can write $f_p \in \mathcal{F}(p_1,\dots,p_m)$, $F_p \in \mathcal{F}(p_1,\dots,p_m)$ and $X \in \mathcal{F}(p_1,\dots,p_m)$.
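Equation (2.1) can be checked numerically on a small case. The following is a minimal sketch, not the authors' implementation: it assumes independent margins purely to have a known answer, and it lists the points of $\{0,1\}^m$ with the last coordinate varying fastest, which may differ from the ordering used in the paper. The helper names `kron`, `kron_power`, `matvec` and `independent_cdf` are hypothetical.

```python
from itertools import product

def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def kron_power(A, n):
    """n-fold Kronecker power of A."""
    out = [[1]]
    for _ in range(n):
        out = kron(out, A)
    return out

def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Difference matrix: f(0) = F(0), f(1) = F(1) - F(0).
D = [[1, 0], [-1, 1]]

def independent_cdf(p):
    """CDF of independent Bernoulli(p_i) variables at every x in {0,1}^m.
    Points are listed with the last coordinate varying fastest (an assumption)."""
    cdf = []
    for x in product((0, 1), repeat=len(p)):
        value = 1.0
        for x_i, p_i in zip(x, p):
            value *= (1.0 - p_i) if x_i == 0 else 1.0
        cdf.append(value)
    return cdf

p = [0.5, 0.25]
F = independent_cdf(p)
f = matvec(kron_power(D, len(p)), F)
print(f)  # density at (0,0), (0,1), (1,0), (1,1)
```

For independent margins the result should equal the product density $(q_1 q_2,\ q_1 p_2,\ p_1 q_2,\ p_1 p_2)$.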

3 Construction of multivariate Bernoulli distributions with given margins

We give a polynomial and matrix representation of all the $F_p \in \mathcal{F}(p_1,\dots,p_m)$. We make the non-restrictive hypothesis that $\{q_1,1\} \times \dots \times \{q_m,1\}$ is ordered according to the reverse-lexicographical criterion, and we denote this set by $Q_m$.

Theorem 3.1. Any distribution $F_p \in \mathcal{F}(p_1,\dots,p_m)$ admits the following representation over $Q_m$:

$$F_p = \Lambda_p U_p \theta$$

where $\Lambda_p = \mathrm{diag}\left(q_1^{(1-\alpha_1)} \cdots q_m^{(1-\alpha_m)},\ (\alpha_1,\dots,\alpha_m) \in S_m\right)$, $U_p = U_{p_1} \otimes \dots \otimes U_{p_m}$,

$$U_{p_i} = \begin{pmatrix} 1 & 1-q_i \\ 1 & 0 \end{pmatrix}, \quad i = 1,\dots,m,$$

and $\theta = (\theta_0, \theta_m, \theta_{m-1}, \theta_{m,m-1}, \dots, \theta_{12\dots m})$. Necessary conditions for $F_p$ to be a distribution are $\theta_0 = 1$ and $\theta_i = 0$, $i = 1,\dots,m$.

Proof. Given $u = (u_1,\dots,u_m) \in Q_m$, let us define

$$g(u) = \left(\prod_{i=1}^m u_i\right)\left(\theta_0 + \sum_{j=1}^m \theta_j (1-u_j) + \sum_{1 \le j < k \le m} \theta_{jk}(1-u_j)(1-u_k) + \dots + \theta_{12\dots m}\prod_{i=1}^m (1-u_i)\right)$$

and the row vectors $a_i = (1,\ 1-u_i)$, $i = 1,\dots,m$. We can write $g(u) \in \mathbb{R}$ as

$$g(u) = \left(\prod_{i=1}^m u_i\right)(a_1 \otimes \dots \otimes a_m)\,\theta.$$

Considering all the $u \in Q_m$ we get the $2^m$-vector $(g(u), u \in Q_m) = \Lambda_p U_p \theta$.

We observe that the determinant of $U_{p_i}$ is $\det(U_{p_i}) = -p_i$. It follows that the determinant of $U_p$, whose absolute value is $(p_1 \cdots p_m)^{2^{m-1}}$, is also different from zero. Since the determinant of $\Lambda_p$ is a product of powers of the $q_i$ and hence different from zero, the determinant of $\Lambda_p U_p$ is different from zero. It follows that the rank of $\Lambda_p U_p$ is $2^m$, and then any vector $y \in \mathbb{R}^{2^m}$, and in particular any distribution $F_p$, can be written as $F_p = \Lambda_p U_p \theta$.

If $F_p$ is a distribution in $\mathcal{F}(p_1,\dots,p_m)$, the vector parameter $\theta$ must satisfy the following necessary conditions:

1. $\theta_0 = 1$. The condition $F_p(1,\dots,1) = 1$ implies $\theta_0 = 1$, since $F_p(1,\dots,1) = \theta_0$;
2. $\theta_i = 0$, $i = 1,\dots,m$. The condition $F_p(1,\dots,1,0,1,\dots,1) = q_i$, with the $0$ in the $i$-th position, implies $\theta_i = 0$, since $F_p(1,\dots,1,0,1,\dots,1) = q_i(1 + \theta_i(1-q_i))$.
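As an illustration of Theorem 3.1, the sketch below evaluates $\Lambda_p U_p \theta$ for $m = 2$ and compares the value at $(q_1, q_2)$ with $q_1 q_2 (1 + \theta_{12} p_1 p_2)$, which is what the polynomial $g(u)$ gives when $\theta_0 = 1$ and $\theta_1 = \theta_2 = 0$. It is only a sketch under stated assumptions: the function name `cdf_from_theta` and the value of `theta12` are hypothetical, and the points of $Q_m$ are listed with the last coordinate varying fastest, so that $\theta$ is ordered as $(\theta_0, \theta_2, \theta_1, \theta_{12})$.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def cdf_from_theta(p, theta):
    """Evaluate F_p = Lambda_p U_p theta on Q_m = {q_1,1} x ... x {q_m,1}."""
    U = [[1.0]]
    diag = [1.0]  # diagonal of Lambda_p
    for p_i in p:
        q_i = 1.0 - p_i
        # Row for u_i = q_i is (1, 1 - q_i); row for u_i = 1 is (1, 0).
        U = kron(U, [[1.0, 1.0 - q_i], [1.0, 0.0]])
        diag = [d * u for d in diag for u in (q_i, 1.0)]
    return [d * sum(a * t for a, t in zip(row, theta))
            for d, row in zip(diag, U)]

p1, p2 = 0.5, 0.25
q1, q2 = 1 - p1, 1 - p2
theta12 = 1.0                      # hypothetical value, chosen for illustration
theta = [1.0, 0.0, 0.0, theta12]   # (theta_0, theta_2, theta_1, theta_12)
F = cdf_from_theta([p1, p2], theta)
print(F)  # F at (q1,q2), (q1,1), (1,q2), (1,1)
```

With $\theta_1 = \theta_2 = 0$ the margins are preserved: the second and third entries equal $q_1$ and $q_2$, and the last entry equals $1$.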

Remark 1. Under the necessary assumptions $\theta_0 = 1$ and $\theta_i = 0$, $i = 1,\dots,m$, the polynomial function $g(u)$ in Theorem 3.1 is the restriction of the well-known Farlie-Gumbel-Morgenstern copula $C(u)$ to $Q_m$:

$$C(u) := \left(\prod_{i=1}^m u_i\right)\left(1 + \sum_{1 \le j < k \le m} \theta_{jk}(1-u_j)(1-u_k) + \dots + \theta_{12\dots m}\prod_{i=1}^m (1-u_i)\right), \quad u \in [0,1]^m.$$

Notice that the condition $\theta_0 = 1$ derives from $C(1,\dots,1) = 1$ and the condition $\theta_i = 0$ is necessary since a requirement to be a copula is that $C(1,\dots,1,q_i,1,\dots,1) = q_i$, $i = 1,\dots,m$. Our representation shows that the restriction to $Q_m$ of the Farlie-Gumbel-Morgenstern copula allows us to represent all the binary distributions with given margins, and therefore to model all the possible dependence structures of multivariate Bernoulli distributions.

As a consequence of Theorem 3.1 and Equation (2.1), any density $f_p \in \mathcal{F}(p_1,\dots,p_m)$ admits the following representation over $S_m$:

$$f_p = D^{\otimes m} \Lambda_p U_p \theta. \tag{3.1}$$

We observe that, given $f_p \in \mathcal{F}(p_1,\dots,p_m)$, we can write it as in Eq. (3.1). Vice versa, Theorem 3.1 does not provide any condition on $\theta_{i_1,\dots,i_k}$ for $k \ge 2$ such that $D^{\otimes m} \Lambda_p U_p \theta$ represents a density function $f_p$ over $S_m$. In the remaining part of this section we will provide a representation of all the densities $f_p \in \mathcal{F}(p_1,\dots,p_m)$.

Theorem 3.2. Let $f_p \in \mathcal{F}(p_1,\dots,p_m)$. It holds that

$$f_p = \sum_{i=1}^{n_F} \lambda_i R_p^{(i)}, \tag{3.2}$$

where $R_p^{(i)} = (R_p^{(i)}(x), x \in S_m) \in \mathcal{F}(p_1,\dots,p_m)$, $\lambda_i \ge 0$, $i = 1,\dots,n_F$ and $\sum_{i=1}^{n_F} \lambda_i = 1$.

Proof. Let us define $Y_p = D^{\otimes m} \Lambda_p U_p$. From Eq. (3.1) it holds that $f_p = Y_p \theta$, with the conditions $\theta_0 = 1$ and $\theta_i = 0$, $i = 1,\dots,m$. We can write $\theta = Y_p^{-1} f_p$. The conditions $\theta_i = 0$, $i = 1,\dots,m$ can be written as

$$H f_p = 0, \tag{3.3}$$

where $H$ is the $m \times 2^m$ sub-matrix of $Y_p^{-1}$ obtained by selecting the rows corresponding to $\theta_i$, $i = 1,\dots,m$. The condition $\theta_0 = 1$, equivalent to $F_p(1,\dots,1) = 1$, is ensured by requiring that $f_p$ is a density, i.e.

1. $f_p(x) \ge 0$;
2. $\sum_x f_p(x) = 1$,

where $x \in S_m$. All the positive solutions $f_p$ of (3.3) have the following form:

$$f_p = \sum_{i=1}^{n_F} \tilde\lambda_i \tilde R_p^{(i)}, \quad \tilde\lambda_i \ge 0,$$

where $\tilde R_p^{(i)} = (\tilde R_{p,j}^{(i)}, j = 1,\dots,2^m) \in \mathbb{R}^{2^m}$, with $\tilde R_{p,j}^{(i)} \ge 0$ and $H \tilde R_p^{(i)} = 0$, $i = 1,\dots,n_F$, are the extremal rays of the cone defined by $H f_p = 0$ ([1] and [7]). By dividing $\tilde R_p^{(i)}$ by the sum of its elements $\tilde R_{p,+}^{(i)} = \sum_{j=1}^{2^m} \tilde R_{p,j}^{(i)}$ we can write

$$f_p = \sum_{i=1}^{n_F} \lambda_i R_p^{(i)},$$

where $\lambda_i = \tilde\lambda_i \tilde R_{p,+}^{(i)}$ and $R_p^{(i)} = \tilde R_p^{(i)} / \tilde R_{p,+}^{(i)}$, $i = 1,\dots,n_F$. It follows that $\sum_{j=1}^{2^m} R_{p,j}^{(i)} = 1$ and that the ray density defined as $R_p^{(i)}(x) := R_{p,j}^{(i)}$, $x$ being the $j$-th element of $S_m$, belongs to $\mathcal{F}(p_1,\dots,p_m)$, $i = 1,\dots,n_F$. Finally, the condition $\sum_x f_p(x) = 1$ implies $\sum_{i=1}^{n_F} \lambda_i \left(\sum_{j=1}^{2^m} R_{p,j}^{(i)}\right) = \sum_{i=1}^{n_F} \lambda_i = 1$. Then we have $\lambda_i \ge 0$, $i = 1,\dots,n_F$ and $\sum_{i=1}^{n_F} \lambda_i = 1$, and the assertion is proved.

Notice that Theorem 3.2 makes it extremely easy to generate any density $f_p$ of the Fréchet class $\mathcal{F}(p_1,\dots,p_m)$. It is enough to take a non-negative vector $\lambda = (\lambda_1,\dots,\lambda_{n_F})$ such that $\sum_{i=1}^{n_F} \lambda_i = 1$, and build $f_p = \sum_{i=1}^{n_F} \lambda_i R_p^{(i)}$.

The constraints $E[X_i] = p_i$, $i = 1,\dots,m$ allow us to obtain an interesting interpretation of the matrix $H$ of (3.3). We have

$$E[X_i] = \sum_{(x_1,\dots,x_m) \in S_m} x_i f_p(x_1,\dots,x_m).$$

It follows that

$$x_i^T f_p = p_i, \qquad (1 - x_i)^T f_p = q_i,$$

where $x_i$ is the vector which contains the $i$-th element of each $x \in S_m$, $i = 1,\dots,m$. If we consider the odds of the event $X_i = 1$, $\gamma_i = p_i/q_i$, we have $\gamma_i q_i - p_i = 0$. We can write

$$(\gamma_i (1 - x_i)^T - x_i^T) f_p = 0.$$
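The odds-based constraint $(\gamma_i (1 - x_i)^T - x_i^T) f_p = 0$ can be illustrated with a short sketch. It builds one row per margin and checks that a density with the right margins satisfies the constraints while a density with different margins does not. The function names are hypothetical, the independent density is used only because its margins are known, and the points of $S_m$ are listed with the last coordinate varying fastest (an assumption). Enumerating the extremal rays themselves, which the text delegates to 4ti2, is not attempted here.

```python
from itertools import product

def margin_constraints(p):
    """Rows gamma_i * (1 - x_i) - x_i over all x in {0,1}^m, one row per margin."""
    points = list(product((0, 1), repeat=len(p)))
    H = []
    for i, p_i in enumerate(p):
        gamma_i = p_i / (1.0 - p_i)  # odds of the event X_i = 1
        H.append([gamma_i * (1 - x[i]) - x[i] for x in points])
    return H

def independent_density(p):
    """Product density with Bernoulli(p_i) margins, used only as a test case."""
    dens = []
    for x in product((0, 1), repeat=len(p)):
        value = 1.0
        for x_i, p_i in zip(x, p):
            value *= p_i if x_i == 1 else 1.0 - p_i
        dens.append(value)
    return dens

def residuals(H, f):
    """Compute H f."""
    return [sum(h * v for h, v in zip(row, f)) for row in H]

p = [0.5, 0.25, 0.8]
H = margin_constraints(p)
res_ok = residuals(H, independent_density(p))   # margins match: all zero
res_bad = residuals(H, [1 / 8] * 8)             # uniform density: margins 1/2
print(res_ok, res_bad)
```

The uniform density has all margins equal to $1/2$, so only the first row, which corresponds to $p_1 = 0.5$, vanishes for it.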

Then $H$ is simply the $m \times 2^m$ matrix whose rows, up to a non-influential multiplicative constant, are $(\gamma_i (1 - x_i)^T - x_i^T)$, $i = 1,\dots,m$.

Using Theorem 3.2 we represent each Fréchet class $\mathcal{F}(p_1,\dots,p_m)$ as the convex hull of the ray densities. We observe that the ray densities depend only on the marginal distributions $F_1,\dots,F_m$. Building the ray matrix

$$R_p = \begin{pmatrix} R_{p,1}^{(1)} & \dots & R_{p,1}^{(n_F)} \\ \vdots & & \vdots \\ R_{p,2^m}^{(1)} & \dots & R_{p,2^m}^{(n_F)} \end{pmatrix},$$

whose columns are the ray densities $R_p^{(i)}$, $i = 1,\dots,n_F$, we write Eq. (3.2) simply as

$$f_p = R_p \lambda$$

with $\lambda = (\lambda_1,\dots,\lambda_{n_F})$, $\lambda_i \ge 0$ and $\sum_{i=1}^{n_F} \lambda_i = 1$.

In practical applications the rays $\tilde R_p^{(i)}$, and therefore the ray densities $R_p^{(i)}$, can be found using the software 4ti2 [1]. In Section 5 we will use SAS and 4ti2 to show some numerical examples. In the next sections we will see that the representation of $f_p$ as in Theorem 3.2 plays a key role in determining the densities with given moments.

3.1 Moments of multivariate Bernoulli variables

We observe that, given the Bernoulli variable $X \sim B(\tau)$, $0 < \tau < 1$, with density function $f_\tau$, we can compute the moments $E[X^\alpha]$, $\alpha \in \{0,1\}$, as

$$\begin{pmatrix} E[1] \\ E[X] \end{pmatrix} = M \begin{pmatrix} f_\tau(0) \\ f_\tau(1) \end{pmatrix}, \qquad M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$$

It follows that, given $X = (X_1,\dots,X_m) \in \mathcal{F}(p_1,\dots,p_m)$ with multivariate joint density $f_p$, we can compute the vector of its moments $E[X^\alpha] \equiv E[X_1^{\alpha_1} \cdots X_m^{\alpha_m}]$, $\alpha = (\alpha_1,\dots,\alpha_m) \in S_m$, as

$$E[X^\alpha] = M^{\otimes m} f_p.$$

We also observe that the correlation $\rho_{ij}$ between two Bernoulli variables $X_i \sim B(p_i)$ and $X_j \sim B(p_j)$ is related to the second-order moment $E[X_i X_j]$ as follows:

$$E[X_i X_j] = \rho_{ij} \sqrt{p_i q_i p_j q_j} + p_i p_j. \tag{3.4}$$
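The moment map and Eq. (3.4) can be tried out on a two-variable case. The sketch below assumes that $M$ maps $(f(0), f(1))$ to $(E[1], E[X])$ as in the one-dimensional relation above, and it lists points and exponents with the last coordinate varying fastest; both the helper names and this ordering are assumptions made for illustration. The test density puts mass $1/2$ on $(0,0)$ and on $(1,1)$, so the two variables are equal and their correlation should be $1$.

```python
import math

def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def kron_power(A, n):
    """n-fold Kronecker power of A."""
    out = [[1]]
    for _ in range(n):
        out = kron(out, A)
    return out

def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Moment matrix: E[1] = f(0) + f(1), E[X] = f(1).
M = [[1, 1], [0, 1]]

def correlation(mu_ij, p_i, p_j):
    """Invert Eq. (3.4): rho_ij from the second-order moment E[X_i X_j]."""
    return (mu_ij - p_i * p_j) / math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

# Density on (0,0), (0,1), (1,0), (1,1) with X_1 = X_2 and p_1 = p_2 = 1/2.
f = [0.5, 0.0, 0.0, 0.5]
moments = matvec(kron_power(M, 2), f)   # E[1], E[X_2], E[X_1], E[X_1 X_2]
rho = correlation(moments[3], moments[2], moments[1])
print(moments, rho)
```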

3.2 Second-order moments of multivariate Bernoulli variables with given margins

From Theorem 3.2 we get

$$E[X^\alpha] = M^{\otimes m} f_p = M^{\otimes m} R_p \lambda.$$

In particular, for the second-order moments $\mu_2 = (E[X^\alpha] : |\alpha| = 2)$, where $|\alpha| = \sum_{i=1}^m \alpha_i$, we get the following result, which is crucial for the solution of the problem of simulating multivariate binary distributions with a given correlation matrix.

Proposition 3.1. It holds that

$$\mu_2 = A_{2p} \lambda \tag{3.5}$$

where $A_{2p} = (M^{\otimes m})_2 R_p$, $(M^{\otimes m})_2$ is the sub-matrix of $M^{\otimes m}$ obtained by selecting the rows corresponding to the second-order moments, $R_p$ is the ray matrix and $\lambda = (\lambda_1,\dots,\lambda_{n_F})$, $\lambda_i \ge 0$, $i = 1,\dots,n_F$ and $\sum_{i=1}^{n_F} \lambda_i = 1$.

It follows that the target second-order moments are compatible with the means if they belong to the convex hull generated by the points which are the columns of the matrix $A_{2p} = (M^{\otimes m})_2 R_p$. As a direct consequence of Proposition 3.1 we also get the univariate bounds for the second-order moments and the correlations.

Proposition 3.2. For each $\alpha$, $|\alpha| = 2$, the second-order moment $\mu_2^{(\alpha)}$ must satisfy the following bounds

$$\min A_{2p}^{(\alpha)} \le \mu_2^{(\alpha)} \le \max A_{2p}^{(\alpha)} \tag{3.6}$$

and the correlations $\rho_{ij}$ must satisfy the following bounds

$$\frac{\min A_{2p}^{(\alpha)} - p_i p_j}{\sqrt{p_i q_i p_j q_j}} \le \rho_{ij} \le \frac{\max A_{2p}^{(\alpha)} - p_i p_j}{\sqrt{p_i q_i p_j q_j}} \tag{3.7}$$

where $A_{2p}^{(\alpha)}$ is the row of the matrix $A_{2p}$ such that $\mu_2^{(\alpha)} = A_{2p}^{(\alpha)} \lambda$ and $\{i,j\} = \{k : \alpha_k = 1\}$.

Proof. From Proposition 3.1, using the proper row of $A_{2p}$, we get

$$\mu_2^{(\alpha)} = A_{2p}^{(\alpha)} \lambda.$$

To prove (3.6) it is enough to observe that

1. since $\lambda_i \ge 0$ and $\sum_{i=1}^{n_F} \lambda_i = 1$, the minimum (maximum) value of $\mu_2^{(\alpha)}$ is obtained by choosing $\lambda$ equal to one of the $e_i$'s, where $e_i \in \{0,1\}^{n_F}$ is the binary vector with all the elements equal to zero apart from the $i$-th, which is equal to one, $i = 1,\dots,n_F$;

2. the product $A_{2p}^{(\alpha)} e_i$ gives the $i$-th element of $A_{2p}^{(\alpha)}$.

To prove (3.7) we simply observe that, using Equation (3.4), the bounds in (3.6) can be transformed to those suitable for correlations.

Now we solve the problem of constructing a multivariate Bernoulli density $f_p \in \mathcal{F}(p_1,\dots,p_m)$ with given correlation matrix $\rho = (\rho_{ij})_{i,j=1,\dots,m}$. Using Equation (3.4) we transform the desired correlations $\rho_{ij}$ into the corresponding desired second-order moments $E[X_i X_j]$, $i,j = 1,\dots,m$, $i < j$. In this way the density $f_p$ with means $p_1,\dots,p_m$ and correlation matrix $\rho$ can be built as $R_p \lambda$, where $\lambda = (\lambda_1,\dots,\lambda_{n_F})$, $\lambda_i \ge 0$, $\sum_{i=1}^{n_F} \lambda_i = 1$ is a solution, if it exists, of the system of equations (3.5).

The space of solutions $\lambda$ of the system (3.5) defines the set of distributions in the Fréchet class with correlation matrix $\rho$. The choice of a particular solution does not modify the distributions of the sample means and of the sample second-order moments, which depend only on $p_1,\dots,p_m$ and $\rho$ respectively. To explain this point, let us consider a random sample $\{(X_{k1},\dots,X_{km}),\ k = 1,\dots,N\}$ extracted from a randomly selected $m$-dimensional Bernoulli variable belonging to the Fréchet class $\mathcal{F}(p_1,\dots,p_m)$ and with given second-order moments $\mu_{ij} := E[X_i X_j]$, $i,j = 1,\dots,m$. The sample means $\bar X_i$, $i = 1,\dots,m$ are distributed as $\frac{1}{N}\mathrm{Binomial}(N, p_i)$ and the sample second-order moments $\overline{X_i X_j} := \frac{1}{N}\sum_{k=1}^N X_{ki} X_{kj}$, $i,j = 1,\dots,m$, $i < j$ are distributed as $\frac{1}{N}\mathrm{Binomial}(N, \mu_{ij})$.

In general, different distributions which belong to the same Fréchet class and which have the same correlation matrix $\rho$ (or equivalently the same vector of second-order moments $\mu_2$) will have different $k$-order moments, with $k \ge 3$. This methodology offers the opportunity to choose the best distribution according to a certain criterion. For example, as the moments of multivariate Bernoulli variables are always non-negative, it could be of interest to find one of the distributions with the smallest sum of all the moments with order greater than 2. This problem can be efficiently solved using linear programming techniques ([2]). It can be simply stated as

$$\min_{f \in \mathcal{F}_m} \mathbf{1}^T (M^{\otimes m})_{3\dots m} f \quad \text{subject to} \quad \begin{cases} H f = 0 \\ (M^{\otimes m})_2 f = \mu_2 \end{cases}$$

where $\mathbf{1}$ is the vector with all the elements equal to 1 and $(M^{\otimes m})_{3\dots m}$ is the sub-matrix of $M^{\otimes m}$ obtained by selecting the rows corresponding to the $k$-order moments, with $k \ge 3$.

As we already mentioned, from a geometrical point of view a solution of the system of equations (3.5) exists if and only if a point whose coordinates are the desired second-order moments belongs to the convex hull generated by the points which are the columns of the matrix $A_{2p} = (M^{\otimes m})_2 R_p$. If the margins and the correlation matrix are not compatible,

the system (3.5) does not have any solution. In this case it is possible to search for a feasible $\rho^*$ which is the correlation matrix closest to the desired $\rho$, according to a chosen distance. Finally, it is worth noting that the method can be applied to the moments of order greater than 2, or to any selection of moments, by simply replacing the $(M^{\otimes m})_2$ matrix with the proper one.

3.3 Margins of multivariate Bernoulli variables with given second-order moments

In Section 3.2 we studied second-order moments of multivariate Bernoulli variables with given margins. The methodology can be easily generalised to solve the problem of studying $h$-order moments of multivariate Bernoulli variables with given $k$-order moments, $h,k \in \{1,\dots,m\}$, $h \ne k$. We show this point by studying the $h = 1$, $k = 2$ case, i.e. studying margins of multivariate Bernoulli densities $f_{\mu_2}$ with given second-order moments $\mu_2 = (\mu_{ij} : i,j = 1,\dots,m,\ i < j)$. We observe that

$$E[X_i X_j] = \sum_{(x_1,\dots,x_m) \in S_m} x_i x_j f_{\mu_2}(x_1,\dots,x_m),$$

that is

$$x_{ij}^T f_{\mu_2} = \mu_{ij}, \qquad (1 - x_{ij})^T f_{\mu_2} = 1 - \mu_{ij},$$

where $x_{ij}$ is the vector which contains the product $x_i x_j$ of the $i$-th and the $j$-th element of each $x \in S_m$. If we consider the odds of the event $X_i X_j = 1$, $\gamma_{ij} = \mu_{ij}/(1 - \mu_{ij})$, we have $\gamma_{ij}(1 - \mu_{ij}) - \mu_{ij} = 0$, that is

$$(\gamma_{ij}(1 - x_{ij})^T - x_{ij}^T) f_{\mu_2} = 0.$$

Building the matrix $H_2$ whose rows are $(\gamma_{ij}(1 - x_{ij})^T - x_{ij}^T)$, all the densities $f_{\mu_2}$ must satisfy the system of equations $H_2 f_{\mu_2} = 0$. The following proposition is the equivalent of Theorem 3.2, Proposition 3.1 and Proposition 3.2 for the case under study.

Proposition 3.3. Let $f_{\mu_2}$ be a multivariate Bernoulli density with second-order moments $\mu_2 = (\mu_{ij} : i,j = 1,\dots,m,\ i < j)$:

1. all the densities $f_{\mu_2}$ can be written as

$$f_{\mu_2} = \sum_{i=1}^{n_F} \lambda_i R_{\mu_2}^{(i)}, \tag{3.8}$$

where $R_{\mu_2}^{(i)} = (R_{\mu_2}^{(i)}(x), x \in S_m)$, $i = 1,\dots,n_F$ are multivariate Bernoulli densities with second-order moments $\mu_2$, $\lambda_i \ge 0$, $i = 1,\dots,n_F$ and $\sum_{i=1}^{n_F} \lambda_i = 1$.

2. The vector $p = (p_1,\dots,p_m)$ is

$$p = A_{1\mu_2} \lambda \tag{3.9}$$

where $A_{1\mu_2} = (M^{\otimes m})_1 R_{\mu_2}$, $(M^{\otimes m})_1$ is the sub-matrix of $M^{\otimes m}$ obtained by selecting the rows corresponding to the first-order moments, $R_{\mu_2}$ is the ray matrix and $\lambda = (\lambda_1,\dots,\lambda_{n_F})$, $\lambda_i \ge 0$, $i = 1,\dots,n_F$ and $\sum_{i=1}^{n_F} \lambda_i = 1$.

3. For each $\alpha$, $|\alpha| = 1$, the first-order moment $\mu_1^{(\alpha)} \equiv p_i$ must satisfy the following bounds

$$\min A_{1\mu_2}^{(\alpha)} \le p_i \le \max A_{1\mu_2}^{(\alpha)} \tag{3.10}$$

where $A_{1\mu_2}^{(\alpha)}$ is the row of the matrix $A_{1\mu_2}$ such that $p_i = A_{1\mu_2}^{(\alpha)} \lambda$ and $\{i\} = \{k : \alpha_k = 1\}$.

4 Bivariate Bernoulli density with given margins

In this section we consider bivariate distributions, i.e. the class $\mathcal{F}(p_1,p_2)$ of 2-dimensional random variables $(X_1,X_2)$ which have Bernoulli marginal distributions $F_i \sim B(p_i)$, $i = 1,2$. In the bivariate case two key distributions are $F_L$ and $F_U$, the lower and upper Fréchet bound of $\mathcal{F}(p_1,p_2)$ respectively:

$$F_L(x) = \max\{F_1(x_1) + F_2(x_2) - 1,\ 0\} \tag{4.1}$$

$$F_U(x) = \min\{F_1(x_1),\ F_2(x_2)\} \tag{4.2}$$

where $x = (x_1,x_2) \in \{0,1\}^2$. For any $F_p \in \mathcal{F}(p_1,p_2)$ it holds that

$$F_L(x) \le F_p(x) \le F_U(x), \quad x \in \{0,1\}^2. \tag{4.3}$$

For an overview of Fréchet classes and their bounds see [5].

We now analyse Theorem 3.2 in the bivariate case. The number of rays is independent of the Fréchet class $\mathcal{F}(p_1,p_2)$. We have two ray densities, which are the lower and upper Fréchet bound of each class.

Proposition 4.1. Let $f_p \in \mathcal{F}(p_1,p_2)$; then

$$f_p = \lambda f_L + (1 - \lambda) f_U, \quad \lambda \in [0,1],$$

where $f_L$ and $f_U$ are the discrete densities corresponding to $F_L$ and $F_U$, respectively.

Proof. We observe that in $x = (0,0)$ the distribution function and the density function take the same value. Then using (4.3) we can write

$$f_L(0,0) \le f_p(0,0) \le f_U(0,0). \tag{4.4}$$

It follows that $f_p(0,0) = \lambda f_L(0,0) + (1 - \lambda) f_U(0,0)$ with

$$\lambda = \frac{f_p(0,0) - f_U(0,0)}{f_L(0,0) - f_U(0,0)}.$$

It holds that $0 \le \lambda \le 1$. Now we observe that for any density function $f \in \mathcal{F}(p_1,p_2)$ we have $f(0,1) = q_1 - f(0,0)$. Then using (4.4) we can write

$$q_1 - f_L(0,0) \ge q_1 - f_p(0,0) \ge q_1 - f_U(0,0),$$

that is

$$f_U(0,1) \le f_p(0,1) \le f_L(0,1).$$

We can write $f_p(0,1) = \lambda_1 f_L(0,1) + (1 - \lambda_1) f_U(0,1)$. It is easy to verify that $\lambda_1 = \lambda$. We proceed in an analogous way for $f_p(1,0) = q_2 - f_p(0,0)$ and $f_p(1,1) = 1 - q_1 - q_2 + f_p(0,0)$ and we get $f_p(x) = \lambda f_L(x) + (1 - \lambda) f_U(x)$, $x \in \{0,1\}^2$ and $0 \le \lambda \le 1$.

Proposition 4.1 states that $\mathcal{F}(p_1,p_2)$ is the convex hull of the upper and lower Fréchet bound. In the bivariate case we can also find the domain of $\theta_{12}$ expressed as a function of the margins $p_1, p_2$. From Eq. (3.1) we get

$$f_p(0,0) = q_1 q_2 (1 + \theta_{12} p_1 p_2) \tag{4.5}$$

and consequently

$$\theta_{12} = \frac{f_p(0,0) - q_1 q_2}{q_1 q_2 p_1 p_2}. \tag{4.6}$$

Using (4.4) it follows

$$\frac{f_L(0,0) - q_1 q_2}{q_1 q_2 p_1 p_2} \le \theta_{12} \le \frac{f_U(0,0) - q_1 q_2}{q_1 q_2 p_1 p_2}.$$

Now, without loss of generality, we assume $q_2 \ge q_1$. From Eq. (4.1) and (4.2) we get

1. if $q_1 + q_2 \le 1$ then $-\dfrac{1}{p_1 p_2} \le \theta_{12} \le \dfrac{1}{p_1 q_2}$;
2. if $q_1 + q_2 > 1$ then $\dfrac{q_1 + q_2 - 1 - q_1 q_2}{q_1 q_2 p_1 p_2} = -\dfrac{1}{q_1 q_2} \le \theta_{12} \le \dfrac{1}{p_1 q_2}$.

Finally (see also Theorem 1 in [8]) we obtain the bounds for the correlation coefficient

$$\rho_{12} = \frac{E[X_1 X_2] - p_1 p_2}{\sqrt{p_1 q_1 p_2 q_2}}.$$

Since $E[X_1 X_2] = f_p(1,1)$, $f_L(1,1) \le f_p(1,1) \le f_U(1,1)$ and $f(1,1) = 1 - q_1 - q_2 + f(0,0)$ for any density function $f \in \mathcal{F}(p_1,p_2)$, we obtain:

1. if $q_1 + q_2 \le 1$ then
$$\frac{1 - q_1 - q_2 - p_1 p_2}{\sqrt{p_1 q_1 p_2 q_2}} = -\sqrt{\frac{q_1 q_2}{p_1 p_2}} \le \rho_{12} \le \frac{1 - q_2 - p_1 p_2}{\sqrt{p_1 q_1 p_2 q_2}} = \sqrt{\frac{p_2 q_1}{p_1 q_2}};$$
2. if $q_1 + q_2 > 1$ then
$$\frac{-p_1 p_2}{\sqrt{p_1 q_1 p_2 q_2}} = -\sqrt{\frac{p_1 p_2}{q_1 q_2}} \le \rho_{12} \le \frac{1 - q_2 - p_1 p_2}{\sqrt{p_1 q_1 p_2 q_2}} = \sqrt{\frac{p_2 q_1}{p_1 q_2}}.$$
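The bivariate results lend themselves to a short numerical check. The sketch below builds the two Fréchet bound densities from $f(0,0)$, computes their correlations, and forms a convex combination as in Proposition 4.1. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, $x = (x_1, x_2)$ is read so that $f(0,0) + f(0,1) = q_1$, and the margins $p_1 = 3/4$, $p_2 = 1/2$ are chosen only as an example with $q_1 \le q_2$ and $q_1 + q_2 \le 1$.

```python
import math

def frechet_bound_densities(p1, p2):
    """Lower and upper Frechet bound densities on (0,0), (0,1), (1,0), (1,1)."""
    q1, q2 = 1.0 - p1, 1.0 - p2

    def density(f00):
        # All cells follow from f(0,0) and the two margins.
        return [f00, q1 - f00, q2 - f00, 1.0 - q1 - q2 + f00]

    f_lower = density(max(q1 + q2 - 1.0, 0.0))
    f_upper = density(min(q1, q2))
    return f_lower, f_upper

def correlation(f, p1, p2):
    """Correlation of (X1, X2) from the joint density, using E[X1 X2] = f(1,1)."""
    return (f[3] - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

p1, p2 = 0.75, 0.5
f_L, f_U = frechet_bound_densities(p1, p2)
rho_min = correlation(f_L, p1, p2)
rho_max = correlation(f_U, p1, p2)

# Any member of the class is a convex combination of the two bounds.
lam = 0.5
f_mix = [lam * a + (1 - lam) * b for a, b in zip(f_L, f_U)]
rho_mix = correlation(f_mix, p1, p2)
print(rho_min, rho_max, rho_mix)
```

For these margins the closed-form bounds give $-\sqrt{q_1 q_2/(p_1 p_2)}$ and $\sqrt{p_2 q_1/(p_1 q_2)}$, both of absolute value $1/\sqrt{3}$, and the equal-weight mixture is uncorrelated.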

5 Examples

In this section we show some results corresponding to different multivariate Bernoulli distributions. The algorithm is described in Section 5.4.

5.1 Trivariate Bernoulli distributions

Let us consider the case $m = 3$ and $p = (\frac{1}{2}, \frac{1}{2}, \frac{1}{2})$. From Theorem 3.2, solving the system of equations (3.3), we get 6 ray densities, which form the columns of the ray matrix $R_p$, and from them the matrix $A_{2p}$ defined in Proposition 3.1 (the entries of both matrices are not reproduced here). Using Eq. (3.7) we get

$$-1 \le \rho_{ij} \le 1, \quad i,j = 1,2,3,\ i < j.$$

Let us consider the case in which the $X_i$, $i = 1,\dots,3$ must be uncorrelated. We want to find a distribution $F_p \in \mathcal{F}(\frac{1}{2}, \frac{1}{2}, \frac{1}{2})$ such that $\rho_{12} = \rho_{13} = \rho_{23} = 0$. From Eq. (3.5) we obtain $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_5 = .25$ and $\lambda_4 = \lambda_6 = 0$. The corresponding density is uniform, $f_p(x) = \frac{1}{8}$, $x \in S_3$, as expected.

If we choose $\rho_{12} = .2$, $\rho_{13} = -.3$ and $\rho_{23} = .4$, we obtain $\lambda_1 = .275$, $\lambda_2 = .025$, $\lambda_3 = .375$, $\lambda_4 = 0$, $\lambda_5 = .325$ and $\lambda_6 = 0$ as one of the solutions of Eq. (3.5). The corresponding density is

$$f_p^T = (.1625,\ .1875,\ .0125,\ .1375,\ .1375,\ .0125,\ .1875,\ .1625).$$

If we choose $\rho_{12} = .9$, $\rho_{13} = .3$ and $\rho_{23} = .6$, we do not find any $f_p$ with such correlations, even if each $\rho_{ij}$ satisfies the constraints found for bivariate distributions, which, as we said before, in this case are $-1 \le \rho_{ij} \le 1$, $i,j = 1,2,3$, $i < j$.

If we search for a feasible $\rho^*$ which is the correlation matrix closest to the desired $\rho$ (the distance can be freely chosen; in this example we used the Euclidean distance), we obtain $\rho^*_{12} = .63$, $\rho^*_{13} = .33$ and $\rho^*_{23} = -.03$. The corresponding density is

$$(f^*_p)^T = (.2416,\ 0,\ .0916,\ .1666,\ .1666,\ .0916,\ 0,\ .2416).$$

Let us now consider the case p = ( 1, 3, ). The ray matrix $R_p$ contains 6 ray densities; its entries and those of the matrix $A_{2p}$ are not reproduced here. Using Eq. (3.7) we get bounds on the correlations, among them $-.577 \le \rho_{13} \le .577$. If we choose ρ 12 =.3, ρ 13 =.25 and ρ 23 =.1, we obtain λ 1 =.2835, λ 2 =.25, λ 3 =, λ 4 =, λ 5 =.2781 and λ 6 =. The corresponding density is f T p = (.1729,.185,.63,.144,.79,.3258,,.133 ).

As the last example of trivariate Bernoulli distribution we consider p = ( 1, 1, ). The ray matrix $R_p$ (rounded to the third decimal digit) has 11 ray densities; its entries are not reproduced here.

Using Eq. (3.7) we get.236 ρ 12.77,.48 ρ and.289 ρ. If we choose ρ 12 =.3, ρ 13 =.25 and ρ 23 =.2, we obtain f T p = (.146,,.1197,.199,.665,.617,.491,.4893).

5.2 Multivariate m = 5 Bernoulli distributions

Let us consider the case p = ( 1, 1, 1, 1, ). We obtain 2,712 ray densities. If we choose ρ 12 =.3, ρ 13 =.2, ρ 14 =.2, ρ 15 =.1, ρ 23 =.2, ρ 24 =.3, ρ 25 =.2, ρ 34 =.2, ρ 35 =.1 and ρ 45 =.2, we obtain a density $f_p$ whose entries are not reproduced here.

5.3 Multivariate m ≥ 6 Bernoulli distributions

For $m = 6$ and $p = (\frac{1}{2}, \frac{1}{2}, \frac{1}{2}, \frac{1}{2}, \frac{1}{2}, \frac{1}{2})$ we obtain 77,264 ray densities. In general we observe that if the number of rays is too large with respect to the available computer power, and if the objective can be reduced to the problem of finding just one density $f \in \mathcal{F}_m$ with given margins $p$ and second-order moments $\mu_2$, it is enough to solve the system

$$\begin{cases} (M^{\otimes m})_1 f = p \\ (M^{\otimes m})_2 f = \mu_2 \end{cases}$$

for a density $f$ (non-negative entries that sum to one), using standard linear programming tools (e.g. [2]).
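For $m = 3$ the system above can be solved by hand, which gives a compact way to see what the linear program does. Once the margins and the three second-order moments are fixed, the only remaining freedom is $t = f(1,1,1)$, every other cell is an affine function of $t$ by inclusion-exclusion, and non-negativity confines $t$ to an interval. The sketch below uses this closed form instead of a linear programming solver; the function names and the choice of the interval midpoint are assumptions made for illustration, and the second set of correlations is a hypothetical incompatible target.

```python
import math

def second_moment(rho, p_i, p_j):
    """E[X_i X_j] from the correlation, as in Eq. (3.4)."""
    return rho * math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j)) + p_i * p_j

def trivariate_density(p, rho):
    """One density with margins p and correlations rho = (rho12, rho13, rho23),
    or None when the first- and second-order moments are incompatible."""
    p1, p2, p3 = p
    m12 = second_moment(rho[0], p1, p2)
    m13 = second_moment(rho[1], p1, p3)
    m23 = second_moment(rho[2], p2, p3)
    total = 1.0 - p1 - p2 - p3 + m12 + m13 + m23
    # Non-negativity of every cell bounds t = f(1,1,1) from below and above.
    lo = max(0.0, m12 + m13 - p1, m12 + m23 - p2, m13 + m23 - p3)
    hi = min(m12, m13, m23, total)
    if lo > hi + 1e-12:
        return None
    t = (lo + hi) / 2.0  # arbitrary feasible choice (midpoint)
    return {
        (0, 0, 0): total - t,
        (1, 0, 0): p1 - m12 - m13 + t,
        (0, 1, 0): p2 - m12 - m23 + t,
        (0, 0, 1): p3 - m13 - m23 + t,
        (1, 1, 0): m12 - t,
        (1, 0, 1): m13 - t,
        (0, 1, 1): m23 - t,
        (1, 1, 1): t,
    }

p = (0.5, 0.5, 0.5)
f = trivariate_density(p, (0.2, -0.3, 0.4))
g = trivariate_density(p, (0.9, 0.9, -0.9))  # hypothetical incompatible target
print(f, g)
```

For $p = (\frac{1}{2}, \frac{1}{2}, \frac{1}{2})$ and correlations $(0.2, -0.3, 0.4)$ the midpoint choice gives $f(1,1,1) = 0.1625$, consistent with the entries .1625, .1875 and .1375 printed for the density in Section 5.1.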

5.4 The algorithm

In this section we briefly describe the algorithm that we used in Section 5. Given $m$, $p$ and $\rho$ as input, the algorithm returns the ray matrix $R_p$ and, if it exists, the density $f_p$ which has Bernoulli $B(p_i)$, $i = 1,\dots,m$ as marginal distributions and pairwise correlations $\rho = (\rho_{ij},\ i,j = 1,\dots,m,\ i < j)$. The algorithm has the following main steps:

1. the construction of the matrix $H$, see (3.3) of Theorem 3.2;
2. the generation of the ray matrix $R_p$;
3. the construction of the density $f_p$ as the solution of the system (3.5) of Proposition 3.1.

The construction of the matrix $H$ and of the density $f_p$ is implemented in SAS/IML. In particular, the system (3.5) is solved using the Proc Lpsolve that is part of SAS/QC. The rays are generated using 4ti2 ([1]). The software code is available on request. We performed the analysis using a standard laptop (Intel Core i7-2620M CPU, 2.70 GHz, 8 GB RAM).

6 Discussion

The proposed approach can be applied to any given set of moments, even of different orders. All the results given for moments and correlations can be easily adapted to other widely-used measures of dependence, such as Kendall's τ and Spearman's ρ. Furthermore, the polynomial representation of the distributions of any Fréchet class provides a link to copulas, which are a powerful instrument to model dependence.

7 Acknowledgements

Roberto Fontana wishes to thank professor Antonio Di Scala (Politecnico di Torino, Department of Mathematical Sciences) and professor Giovanni Pistone (Collegio Carlo Alberto, Moncalieri) for the helpful discussions he had with them.

References

[1] 4ti2 team. 4ti2: a software package for algebraic, geometric and combinatorial problems on linear spaces. Available at

[2] Michel Berkelaar, Kjell Eikland, Peter Notebaert, et al. lpsolve: Open source (mixed-integer) linear programming system. Eindhoven U. of Technology, 2004.

[3] N. Rao Chaganty and Harry Joe. Range of correlation matrices for dependent Bernoulli random variables. Biometrika, 93(1):197-206, 2006.

[4] Martin Crowder. On the use of a working correlation matrix in using generalised linear models for repeated measures. Biometrika, 82(2):407-410.

[5] Giorgio Dall'Aglio, Samuel Kotz, and Gabriella Salinetti. Advances in probability distributions with given marginals: beyond the copulas, volume 67. Springer Science & Business Media, 2012.

[6] Mary E. Haynes, Roy T. Sabo, and N. Rao Chaganty. Simulating dependent binary variables through multinomial sampling. Journal of Statistical Computation and Simulation, 86(3):510-523, 2016.

[7] Raymond Hemmecke. On the computation of Hilbert bases of cones. Mathematical Software, ICMS, 2002.

[8] Mark Huber, Nevena Marić, et al. Multivariate distributions with fixed marginals and correlations. Journal of Applied Probability, 52(2):602-608, 2015.

[9] Seung-Ho Kang and Sin-Ho Jung. Generating correlated binary variables with complete specification of the joint distribution. Biometrical Journal, 43(3), 2001.

[10] Kung-Yee Liang and Scott L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13-22.

[11] R. B. Nelsen. An introduction to copulas, ser. Lecture Notes in Statistics. New York: Springer, 2006.

[12] Samuel D. Oman. Easily simulated multivariate binary distributions with given positive and negative correlations. Computational Statistics & Data Analysis, 53(4):999-1005, 2009.

[13] Bahjat F. Qaqish. A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika, 90(2), 2003.

[14] N. Rao Chaganty and Harry Joe. Efficiency of generalized estimating equations for binary responses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4):851-860.

[15] Justine Shults. Simulating longer vectors of correlated binary random variables via multinomial sampling.

[16] Scott L. Zeger and Kung-Yee Liang. Longitudinal data analysis for discrete and continuous outcomes. Biometrics.
