Estimating the Number of Tables via Sequential Importance Sampling

Estimating the Number of Tables via Sequential Importance Sampling Jing Xi Department of Statistics University of Kentucky Jing Xi, Ruriko Yoshida, David Haws

Introduction combinatorics social networks regulatory networks We consider one of the most generally used model: no three way interaction model. Model

Conditional Poisson Distribution Let Z = (Z,..., Z l ) be independent Bernoulli trials with probability of successes p = (p,..., p l ). Define w k = p k /( p k ) for k. Then Z follows a Conditional Poisson Distribution means, P(Z = z,..., Z l = z l S Z = c) l k= w z k k. () Z Z2 Z3 Z4 Z5 Z c=3

Sequential Importance Sampling (SIS) Σ, the set of all tables satisfying marginal conditions. p(x): Σ [, ], target distribution, the uniform distribution over Σ, p(x) = / Σ. q(x) > for all X Σ, the proposal distribution for sampling. We have E [ ] q(x) = X Σ which can be estimated Σ by q(x) = Σ q(x) Σ = N N i= q(x i ), where X,..., X N are tables drawn iid from q(x).

Sequential Importance Sampling (SIS) How to get the probability of the whole table X? Denote the columns of the table X as x,, x t. By the multiplication rule we have q(x = (x,, x t )) = q(x )q(x 2 x ) q(x t x t,..., x ). We can easily compute q(x i x i,..., x ) for i = 2, 3,..., t using Conditional Poisson distribution. What if we have rejections? Having rejections means that q(x): Σ [, ] where Σ Σ. The SIS estimator is still unbiased and consistent: [ ] IX Σ E = q(x) = Σ, q(x) q(x) X Σ I X Σ where I X Σ is an indicator function for the set Σ.

SIS for 2-way Table Theorem [Chen et. al., 25] For the uniform distribution over all m n - tables with given row sums r,..., r m and first column sum c, the marginal distribution of the first column is the same as the conditional distribution of Z given S Z = c with p i = r i /n. 2 2 3

SIS for 2-way Tables with Structural Zero s A structural zero means a cell in the table that is fixed to be. Example Theorem [Yuguo Chen, 27] ["Hand Waving" version] Key: If (i, ) is not a structural : change p i = r i /n to p i = r i /(n g i ) where g i is the number of structural zeros in the ith row; otherwise p i =. Theorem [] n=6 [] [] p2=2/(6-2) p3= r2=2

SIS for 3-way Tables Theorem ["Hand Waving" version] For a cell (i, j, k ), 3 columns will go through it: (i, j, ), (i,, k ), (, j, k ). Key: When generating (i, j, ), let r = (i,, k ), c = (, j, k ), define r k = X i +k and c k = X +j k, then set: p k = r k c k r k c k + (n r k g r k )(m c k g c k ), where g r k, and g c k are the numbers of structural zeros in r and c, respectively. Theorem A similar strategy can be used in multi-way tables. Theorem

Algorithm Example: 3 3 3 Semimagic Cube

Simulations - Semimagic Cubes (Table 2) This tables lists the results from m m m tables with all marginals equals to. Dim m # tables Estimation cv 2 δ 4 576 57.472.27 % 5 628 6439.3.8 99.2% 6 82852 8977227.45 98.8% 7 6.4794e+3 6.46227e+3.64 97.7% 8.8776e+2.99627e+2. 96.5% 9 5.52475e+27 5.684428e+27.59 95.3% 9.98244e+36 9.73486e+36.73 93.3% δ: acceptance rate

Simulations - Semimagic Cubes (Table 3) We can also change marginal s. Dimension m s Estimation cv 2 δ 6 3.269398e+22 2.83 96.5% 7 3 2.365389e+38 25.33 96.7% 8 3 3.236556e+59 7.5 94.5% 4 2.448923e+64.98 94.3% 9 3 7.87387e+85 5.23 9.6% 4 2.422237e+97 4. 93.4% 3 6.8623e+7 26.62 9% 4 3.652694e+37 33.33 93.8% 5.3569e+44 46.2 9.3% δ: acceptance rate

Experiment - Sampson s Dataset It is a dataset about the social interactions among a group of monks recorded by Sampson. Data structure: Dimension: 8 8 Rows/Columns: the 8 monks. Levels: questions: liking (3 timepoints), disliking, esteem, disesteem, positive influenc, negative influence, praise and blame. Values: answers: 3 top choices were listed in original dataset, ranks were recorded. We set these ranks as an indicator ( if in top three choices, if not). N =, estimator is.74774e + 7, cv 2 = 62.4, acceptance rate is 3%.

Problems Still Open Our code performs good when marginals are close to each other. But for the opposite case, the acceptance rate can become very low. How can we reduce rejection rate? Possible idea: arrange the order of columns in different ways? How can Gale-Ryser Theorem be used for 3-way tables?

THANK YOU! Questions? Website for this paper: http://arxiv.org/abs/8.5939

Model Let X = (X ijk ) of size (m, n, l), where m, n, l N and N = {, 2,...}, be a table of counts whose entries are independent Poisson random variables with parameters, {θ ijk }. Here X ijk {, }. Consider the loglinear model, log(θ ijk ) = λ + λ M i + λ N j + λ L k + λmn ij + λ ML ik + λ NL jk (2) for i =,..., m, j =,..., n, and k =,..., l where M, N, and L denote the nominal-scale factors. This model is called no three-way interaction model. Notice that the sufficient statistics under the model in (2) are the two-way marginals. Back

Example for Structural Zero s How can structural zero s come? Different types of cancer separated by gender for Alaska in year 989: Type of cancer Female Male Total Lung 38 9 28 Melanoma 5 5 3 Ovarian 8 [] 8 Prostate [] Stomach 5 5 Total 7 22 292 The structural zeros s are denoted by "[]" Back

Theorem [2-way Tables with Structural Zero s] Define the set of structural zeros Ω as: Ω = {(i, j) : (i, j) is a structural zero, i =,..., m, j =,..., n} Theorem [Yuguo Chen, 27] For the uniform distribution over all m n - tables with given row sums r,..., r m, first column sum c, and the set of structural zeros Ω, the marginal distribution of the first column is the same as the conditional distribution of Z given S Z = c with p i = I [(i,)/ Ω] r i /(n g i ) where g i is the number of structural zeros in the ith row. Back

Theorem [SIS for 3-way Tables] Theorem For the uniform distribution over all m n l - tables with structural zeros with given marginals r k = X i +k, c k = X +j k for k =, 2,..., l, and a fixed marginal for the factor L, l, the marginal distribution of the fixed marginal l is the same as the conditional distribution of Z given S Z = l with p k := r k c k r k c k + (n r k g r k )(m c k g c k ), where g r k is the number of structural zeros in the (r, k)th column and g c is the number of structural zeros in the (c, k)th column. k Back

Theorem [SIS for Multi-way Tables] Theorem For the uniform distribution over all d-way - contingency tables X = (X i...i d ) of size (n n d ), where n i N for i =,... d with marginals l = X i,...id +, and r j k = X i...i j +i j+...i d k for k {,..., n d }, the marginal distribution of the fixed marginal l is the same as the conditional distribution of Z given S Z = l with p k := d j= r j k d j= r j k + d j= (n j r j k gj k ) where g j k is the number of structural zeros in the (i,..., i j, i j+,..., i d, k)th column of X. Back