Monte Carlo conditioning on a sufficient statistic

Seminar, UC Davis, 24 April 2008

Monte Carlo conditioning on a sufficient statistic

Bo Henry Lindqvist, Norwegian University of Science and Technology, Trondheim
Joint work with Gunnar Taraldsen, NTNU

Outline

- Definition of sufficiency
- Sufficiency in goodness-of-fit testing
- Conditional sampling given a sufficient statistic: basic algorithm
- Conditional sampling given a sufficient statistic: weighted sampling
- The Euclidean case
- Relation to Bayesian and fiducial statistics
- Other applications and concluding remarks

Sufficient statistics

(X, T) is a pair of random vectors with joint distribution indexed by θ. Typically, X = (X_1, ..., X_n) is a sample and T = T(X_1, ..., X_n) is a statistic.

T is assumed to be sufficient for θ compared to X, meaning that the conditional distribution of X given T = t does not depend on θ, or, equivalently, that conditional expectations E{φ(X) | T = t} do not depend on the value of θ.

Useful criterion, Neyman's factorization theorem: T = (T_1, ..., T_k) is sufficient for θ compared to X = (X_1, ..., X_n) if the joint density can be factorized as

    f(x | θ) = h(x) g(T(x) | θ),

i.e.

    f(x_1, ..., x_n | θ) = h(x_1, ..., x_n) g(T_1(x_1, ..., x_n), ..., T_k(x_1, ..., x_n) | θ).

Sufficiency

Applications: construction of optimal estimators and tests, nuisance parameter elimination, goodness-of-fit testing.

Motivation for the present Monte Carlo approach: it is usually difficult to derive the conditional distributions analytically, so simulation methods are sought rather than formulas for conditional densities.

Goal: to sample X conditionally given T = t.

Goodness-of-fit testing

H_0: the observation X comes from a particular distribution indexed by θ.

Suppose a test statistic W ≡ W(X) is given, with large values expected when H_0 is violated. Let T ≡ T(X) be a sufficient statistic under the null model.

Conditional test: reject H_0, conditionally given T = t, when W ≥ k(t), where the critical value k(t) is such that

    P_{H_0}(W ≥ k(t) | T = t) = α.

Here k(t) is found from the conditional distribution of W given T = t, which in principle is known.

Equivalently, we can calculate the conditional p-value

    p_obs = P_{H_0}(W ≥ w_obs | T = t),

where w_obs is the observed value of W(X), and reject H_0 if p_obs ≤ α.

Remark: this test is also unconditionally an α-level test.
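
Given any way of drawing X from its conditional distribution given T = t (the subject of the algorithms below), the conditional p-value can be estimated by plain Monte Carlo. A minimal sketch in Python; sample_x_given_t and test_stat are hypothetical callables standing in for the conditional sampler and the statistic W:

```python
import numpy as np

def mc_conditional_pvalue(w_obs, sample_x_given_t, test_stat,
                          n_rep=10_000, seed=None):
    """Estimate p_obs = P_H0(W >= w_obs | T = t) by Monte Carlo.

    sample_x_given_t(rng) -- one draw of X given T = t (placeholder;
                             any conditional sampler can be plugged in)
    test_stat(x)          -- the goodness-of-fit statistic W(x)
    """
    rng = np.random.default_rng(seed)
    hits = sum(test_stat(sample_x_given_t(rng)) >= w_obs
               for _ in range(n_rep))
    return hits / n_rep
```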

References

Point of departure:

ENGEN, S. and LILLEGÅRD, M. (1997). Stochastic simulations conditioned on sufficient statistics. Biometrika 84, 235-240.

LINDQVIST, B. H., TARALDSEN, G., LILLEGÅRD, M. and ENGEN, S. (2003). A counterexample to a claim about stochastic simulations. Biometrika 90, 489-490.

Our papers:

LINDQVIST, B. H. and TARALDSEN, G. (2005). Monte Carlo conditioning on a sufficient statistic. Biometrika 92, 451-464.

LINDQVIST, B. H. and TARALDSEN, G. (2007). Conditional Monte Carlo based on sufficient statistics with applications. In: Advances in Statistical Modeling and Inference. Essays in Honor of Kjell A. Doksum (ed. Vijay Nair), pp. 545-562. World Scientific, Singapore.

Recent literature:

CHEN, Y., DIACONIS, P., HOLMES, S. P. and LIU, J. S. (2005). Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association 100, 109-120.

LANGSRUD, Ø. (2005). Rotation tests. Statistics and Computing 15, 53-60.

LOCKHART, R. A., O'REILLY, F. J. and STEPHENS, M. A. (2007). Use of the Gibbs sampler to obtain conditional tests, with applications. Biometrika 94, 992-998.

General setup

Given (X, T), with T sufficient for θ compared to X.

Basic assumption: there are a random vector U with known distribution and known functions χ(·, ·), τ(·, ·) such that, for each θ,

    (χ(U, θ), τ(U, θ)) ~ (X, T).

Interpretation: these are ways of simulating (X, T) for given θ.

EXAMPLE: EXPONENTIAL SAMPLES. X = (X_1, ..., X_n) are i.i.d. from Exp(θ), i.e. with hazard rate θ. Then T = Σ_{i=1}^n X_i is sufficient for θ. Let U = (U_1, ..., U_n) be i.i.d. Exp(1) variables. Then

    χ(U, θ) = (U_1/θ, ..., U_n/θ),    τ(U, θ) = Σ_{i=1}^n U_i / θ.

Conditional sampling of X given T = t

EXAMPLE (continued). We want to sample X = (X_1, X_2, ..., X_n) conditional on T = Σ X_i = t for given t.

Idea: draw U = (U_1, ..., U_n) i.i.d. Exp(1). Recall that, for each θ,

    χ(U, θ) = (U_1/θ, ..., U_n/θ) ~ X,    τ(U, θ) = Σ_{i=1}^n U_i / θ ~ T.

Solve: τ(U, θ) = t gives θ = θ̂(U, t) = Σ U_i / t.

Conditional sample:

    X_t(U) = χ{U, θ̂(U, t)} = ( t U_1 / Σ_{i=1}^n U_i, ..., t U_n / Σ_{i=1}^n U_i ),

which is known to have the correct distribution!
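
As a quick sketch, this conditional sampler is a few lines of Python (here the conditional law is that of t times a flat Dirichlet vector, which the construction reproduces):

```python
import numpy as np

def exp_sample_given_sum(n, t, seed=None):
    """One draw of X = (X_1, ..., X_n), i.i.d. Exp(theta), conditional
    on sum(X) = t, via Algorithm 1: draw U_i ~ Exp(1), solve
    tau(U, theta) = sum(U)/theta = t, return chi(U, theta_hat) = t*U/sum(U)."""
    rng = np.random.default_rng(seed)
    u = rng.exponential(1.0, size=n)
    return t * u / u.sum()

x = exp_sample_given_sum(5, t=3.0, seed=1)
print(x, x.sum())   # positive components summing to exactly 3.0
```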

Algorithm 1: conditional sampling of X given T = t

The algorithm used in the example can be described more generally as follows. Recall that χ(U, θ) ~ X and τ(U, θ) ~ T for each θ.

ALGORITHM 1
- Generate U from the known density f(u).
- Solve τ(U, θ) = t for θ. The (unique) solution is θ̂(U, t).
- Return X_t(U) = χ{U, θ̂(U, t)}.
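
A generic sketch of Algorithm 1 for scalar θ, assuming the root of τ(u, θ) = t is unique and lies in a user-supplied bracket (the bracket and the callables are assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import brentq

def algorithm1(chi, tau, draw_u, t, theta_bracket, seed=None):
    """One draw of X given T = t via Algorithm 1 (valid under the
    pivotal condition discussed below).

    chi(u, th), tau(u, th) -- simulation representation of (X, T)
    draw_u(rng)            -- one draw of U from its known density f(u)
    theta_bracket          -- (lo, hi) bracketing the unique root in theta
    """
    rng = np.random.default_rng(seed)
    u = draw_u(rng)
    theta = brentq(lambda th: tau(u, th) - t, *theta_bracket)
    return chi(u, theta)

# Exponential example: tau(u, theta) = sum(u)/theta, decreasing in theta.
x = algorithm1(chi=lambda u, th: u / th,
               tau=lambda u, th: u.sum() / th,
               draw_u=lambda rng: rng.exponential(1.0, size=10),
               t=4.0, theta_bracket=(1e-8, 1e8), seed=0)
print(x.sum())   # 4.0 up to root-finding tolerance
```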

Problems with Algorithm 1

Algorithm 1 does not in general give samples from the correct distribution, even when θ̂ is unique.

Moreover, there may not be a unique solution θ̂ of τ(u, θ) = t. For discrete distributions the solutions for θ are typically intervals; for continuous distributions there may be a finite number of solutions, depending on u.

What may go wrong with Algorithm 1?

Assume that for each fixed u and t the equation τ(u, θ) = t has the unique solution θ = θ̂(u, t). Under Algorithm 1 we obtain conditional samples by X_t = χ{U, θ̂(U, t)}.

A tentative proof that this gives samples from the conditional distribution of X given T = t goes as follows. Let φ be any function. For all θ and t we can formally write

    E{φ(X) | T = t} = E[φ{χ(U, θ)} | τ(U, θ) = t]
                    = E[φ{χ(U, θ)} | θ̂(U, t) = θ]
                    = E(φ[χ{U, θ̂(U, t)}] | θ̂(U, t) = θ)
                    = E(φ[χ{U, θ̂(U, t)}])
                    ≡ E{φ(X_t)}.

If correct, this would imply that X_t has the correct distribution, i.e. X_t is distributed like the conditional distribution of X given T = t.

The key equality: a possible Borel paradox

The key equality is

    E[φ{χ(U, θ)} | τ(U, θ) = t] = E[φ{χ(U, θ)} | θ̂(U, t) = θ].

It apparently follows from the equivalence of the events {τ(U, θ) = t} and {θ̂(U, t) = θ}. This is unproblematic if these events have positive probability; otherwise the equality may be invalid due to a Borel paradox. The equality does hold if the two events can be described by the same function of U (the same σ-algebra).

SUFFICIENT CONDITION FOR ALGORITHM 1

The pivotal condition: assume that τ(u, θ) depends on u only through a function r(u), for which we have the unique representation r(u) = v(θ, t) obtained by solving τ(u, θ) = t. Thus v(θ, T) is a pivot in the classical sense.

IN THE EXAMPLE: τ(U, θ) = Σ_{i=1}^n U_i / θ ≡ r(U)/θ, so v(θ, t) = θt.

Algorithms 2 and 3: weighted conditional sampling of X given T = t

It turns out that a weighted sampling scheme is needed in the general case. Let Θ be a random variable with some conveniently chosen distribution, with Θ and U independent. The key result is that the conditional distribution of X given T = t is the same as that of χ(U, Θ) given τ(U, Θ) = t.

Notation: let t ↦ W_t(u) be the density of τ(u, Θ).

ALGORITHM 2 (assumes the equation τ(u, θ) = t has the unique solution θ = θ̂(u, t))
- Generate V from a density proportional to W_t(u) f(u).
- Return X_t(V) = χ{V, θ̂(V, t)}.

ALGORITHM 3 (general case)
- Generate V from a density proportional to W_t(u) f(u), and let the result be V = v.
- Generate Θ_t from the conditional distribution of Θ given τ(v, Θ) = t.
- Return X_t(V) = χ(v, Θ_t).
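
One simple way to realize the step "generate V from a density proportional to W_t(u) f(u)" is sampling-importance resampling; that substitution is an assumption of this sketch (any exact or MCMC sampler for that density could be used instead, and the resampled draw is only approximately from the target). W_t is whatever weight function the problem provides, e.g. the Euclidean-case formula on the next slide:

```python
import numpy as np

def algorithm2_sir(chi, theta_hat, W_t, draw_u, t, n_prop=5000, seed=None):
    """Approximate one draw of X given T = t via Algorithm 2.

    Proposals U^(1), ..., U^(B) are drawn from f(u); one is resampled
    with probability proportional to W_t(U^(b)), approximating a draw V
    from the density proportional to W_t(u) f(u).
    """
    rng = np.random.default_rng(seed)
    proposals = [draw_u(rng) for _ in range(n_prop)]
    w = np.array([W_t(u) for u in proposals], dtype=float)
    v = proposals[rng.choice(n_prop, p=w / w.sum())]
    return chi(v, theta_hat(v, t))
```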

The weight function W_t(u) in the Euclidean case

X = (X_1, ..., X_n) has distribution depending on a k-dimensional parameter θ, and T(X) is a k-dimensional sufficient statistic. Choose a density π(θ) for Θ, and let f(u) be the density of U. Recall that W_t(u) is the density of τ(u, Θ).

The standard transformation formula, using τ(u, θ) = t ⇔ θ = θ̂(u, t), gives

    W_t(u) = π{θ̂(u, t)} |det ∂_t θ̂(u, t)| = π{θ̂(u, t)} / |det ∂_θ τ(u, θ)|_{θ=θ̂(u,t)},

and further

    E{φ(X) | T = t} = ∫ φ[χ{u, θ̂(u, t)}] W_t(u) f(u) du / ∫ W_t(u) f(u) du
                    = E_U( φ[χ{U, θ̂(U, t)}] W_t(U) ) / E_U( W_t(U) ).

This can be computed by simulation, using a pseudo-sample from the distribution of U.
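
The final ratio is a self-normalized importance-sampling estimator. A sketch for scalar θ (so the determinant reduces to an absolute value), with all callables supplied by the model at hand:

```python
import numpy as np

def conditional_expectation(phi, chi, theta_hat, dtau_dtheta, pi, draw_u,
                            t, n_rep=10_000, seed=None):
    """Self-normalized estimate of E{phi(X) | T = t} for scalar theta.

    theta_hat(u, t)    -- unique root of tau(u, theta) = t
    dtau_dtheta(u, th) -- derivative of tau in theta (|.| replaces |det|)
    pi(th)             -- chosen density for Theta
    """
    rng = np.random.default_rng(seed)
    num = den = 0.0
    for _ in range(n_rep):
        u = draw_u(rng)
        th = theta_hat(u, t)
        w = pi(th) / abs(dtau_dtheta(u, th))   # the weight W_t(u)
        num += phi(chi(u, th)) * w
        den += w
    return num / den
```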

Example: truncated exponential

X = (X_1, ..., X_n) are i.i.d. on [0, 1] with density

    f(x, θ) = θ e^{θx} / (e^θ − 1)  if θ ≠ 0,
             = 1                    if θ = 0,      for 0 ≤ x ≤ 1.

T = Σ_{i=1}^n X_i is sufficient compared to X.

The conditional distribution of X given T = t is that of n independent uniform [0, 1] random variables given their sum (for which there seems to be no simple expression).

Simulation of data: let U = (U_1, U_2, ..., U_n) be i.i.d. uniform on [0, 1], and take

    χ(U, θ) = ( log{1 + (e^θ − 1)U_1} / θ, ..., log{1 + (e^θ − 1)U_n} / θ ),
    τ(U, θ) = Σ_{i=1}^n log{1 + (e^θ − 1)U_i} / θ.

The equation τ(u, θ) = t has a unique solution θ = θ̂(u, t). However, X_t does not have the correct distribution (e.g. let n = 2), so Algorithm 1 fails and the weighted scheme is needed.

Computation

For the computation of E{φ(X) | T = t} we thus need

    ∂_θ τ(u, θ)|_{θ=θ̂(u,t)} = ( e^{θ̂(u,t)} / θ̂(u,t) ) Σ_{i=1}^n u_i / {1 + (e^{θ̂(u,t)} − 1) u_i} − t / θ̂(u,t),

to be substituted in

    E{φ(X) | T = t} = E_U( φ[χ{U, θ̂(U, t)}] π{θ̂(U, t)} / |∂_θ τ(U, θ)|_{θ=θ̂(U,t)} )
                      / E_U( π{θ̂(U, t)} / |∂_θ τ(U, θ)|_{θ=θ̂(U,t)} ).

In principle we can use any choice of π for which the integrals exist. A good choice is the Jeffreys prior

    π(θ) = √( 1/θ² − 1/(e^{θ/2} − e^{−θ/2})² ).
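
For this example the ingredients of the weighted estimator above can be coded directly. A sketch; the cutoffs near θ = 0 are assumptions of the sketch, guarding the removable singularities of τ, the derivative, and the prior at θ = 0:

```python
import numpy as np
from scipy.optimize import brentq

def tau(u, th):
    """tau(u, theta) = sum_i log{1 + (e^theta - 1) u_i} / theta,
    extended by continuity to tau(u, 0) = sum(u)."""
    if abs(th) < 1e-9:
        return u.sum()
    return np.log1p(np.expm1(th) * u).sum() / th

def theta_hat(u, t):
    """Unique root of tau(u, theta) = t (tau is increasing in theta)."""
    return brentq(lambda th: tau(u, th) - t, -50.0, 50.0)

def dtau_dtheta_at_root(u, t):
    """The derivative formula above, at theta = theta_hat(u, t)."""
    th = theta_hat(u, t)
    th = th if abs(th) > 1e-9 else 1e-9   # crude guard near theta = 0
    return (np.exp(th) / th) * (u / (1.0 + np.expm1(th) * u)).sum() - t / th

def jeffreys(th):
    """pi(theta) = sqrt(1/theta^2 - 1/(e^{theta/2} - e^{-theta/2})^2),
    with the limiting value sqrt(1/12) near theta = 0."""
    if abs(th) < 1e-3:
        return np.sqrt(1.0 / 12.0)
    return np.sqrt(1.0 / th**2 - 1.0 / (np.exp(th / 2) - np.exp(-th / 2))**2)

rng = np.random.default_rng(0)
u = rng.uniform(size=10)
print(theta_hat(u, t=6.0))   # root is positive exactly when 6.0 > u.sum()
```

These functions plug into conditional_expectation above, with φ for instance the indicator that a test statistic exceeds its observed value.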

Distribution of weights

[Figure: simulated distribution of the weights W_t(u).] Recall:

    W_t(u) = π{θ̂(u, t)} / |∂_θ τ(u, θ)|_{θ=θ̂(u,t)}.

Discussion of method

Disadvantage: we need to solve the equation

    Σ_{i=1}^n log{1 + (e^θ − 1)U_i} = θt

at each step to find θ̂(U, t).

Quicker: use a Gibbs algorithm to simulate X = (X_1, ..., X_n) i.i.d. uniform on [0, 1] given Σ_{i=1}^n X_i = t (a code sketch follows the steps below):

- Start with X_i^0 = t/n for i = 1, ..., n.
- Given (X_1^m, ..., X_n^m) with Σ_{i=1}^n X_i^m = t, draw integers i < j randomly and compute a = X_i^m + X_j^m.
- Draw X_i^{m+1} ~ uniform[0, a] if a ≤ 1, and X_i^{m+1} ~ uniform[a − 1, 1] if a > 1.
- Let X_j^{m+1} = a − X_i^{m+1}.
- Continue with m ← m + 1.
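
A direct sketch of this Gibbs sampler (the number of steps is an arbitrary choice here; in practice one would monitor mixing):

```python
import numpy as np

def gibbs_uniform_given_sum(n, t, n_steps=5000, seed=None):
    """Gibbs sampler for X_1, ..., X_n i.i.d. uniform[0, 1] given sum = t.

    Each step redraws a randomly chosen pair (X_i, X_j) from its
    conditional distribution given X_i + X_j = a and the rest.
    """
    assert 0 < t < n
    rng = np.random.default_rng(seed)
    x = np.full(n, t / n)                  # start on the constraint surface
    for _ in range(n_steps):
        i, j = rng.choice(n, size=2, replace=False)
        a = x[i] + x[j]
        lo, hi = (0.0, a) if a <= 1.0 else (a - 1.0, 1.0)
        x[i] = rng.uniform(lo, hi)
        x[j] = a - x[i]
    return x

x = gibbs_uniform_given_sum(10, t=6.0, seed=0)
print(x.min() >= 0, x.max() <= 1, x.sum())   # stays in [0, 1], sum is 6.0
```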

Relationship to Bayesian and fiducial distributions

Simple case: suppose θ is one-dimensional and τ(u, θ) is strictly monotone in θ for fixed u. Then the distribution of θ̂(U, t) (i.e. of the θ solving τ(U, θ) = t) corresponds to Fisher's fiducial distribution.

Lindley (1958): the distribution of θ̂(U, t) is a posterior distribution for some (possibly improper) prior distribution for θ if and only if T, or a transformation of it, has a location distribution.

Fraser (1961), multiparameter case: the fiducial distribution is a posterior distribution if the sample and parameter sets are transformation groups, and the distributions are given by density functions with respect to right Haar measure.

The cases above essentially correspond to the cases in which Algorithm 1 can be used (the pivotal property).

Example: multivariate normal distribution

Generation of multivariate normal samples conditional on the sample mean and empirical covariance matrix. X = (X_1, ..., X_n) is a sample from N_p(µ, Σ), and T = (X̄, S) is sufficient compared to X, with

    X̄ = n^{−1} Σ_{i=1}^n X_i,    S = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄)(X_i − X̄)′.

Reparameterize from (µ, Σ) to θ ≡ (µ, A), where Σ = AA′ is the Cholesky decomposition. Simulate by letting U = (U_1, ..., U_n) be i.i.d. N_p(0, I):

    χ(U, θ) = (µ + AU_1, ..., µ + AU_n),    τ(U, θ) = (µ + AŪ, A S_U A′),

where Ū and S_U are defined in the same way as X̄ and S. The pivotal condition holds, and the desired conditional sample given t = (x̄, s) is

    χ{U, θ̂(U, t)} = ( x̄ + c L_U^{−1}(U_1 − Ū), ..., x̄ + c L_U^{−1}(U_n − Ū) ),

where s = cc′ and S_U = L_U L_U′ are Cholesky decompositions.
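
A sketch of the resulting sampler (it needs n > p so that S_U is nonsingular); the conditioning is exact, which the printed checks confirm:

```python
import numpy as np

def mvn_sample_given_mean_cov(n, xbar, S, seed=None):
    """Sample X_1, ..., X_n i.i.d. N_p(mu, Sigma) conditional on the
    sample mean being xbar and the sample covariance being S.

    Following the slide: draw U_i ~ N_p(0, I), standardize by their own
    mean Ubar and Cholesky factor L_U of S_U, then relocate and rescale
    with xbar and the Cholesky factor c of S.
    """
    rng = np.random.default_rng(seed)
    p = len(xbar)
    U = rng.standard_normal((n, p))
    Ubar = U.mean(axis=0)
    S_U = np.cov(U, rowvar=False)            # (n-1)^{-1} sum of outer products
    c = np.linalg.cholesky(S)
    L_U = np.linalg.cholesky(S_U)
    Z = np.linalg.solve(L_U, (U - Ubar).T)   # L_U^{-1} (U_i - Ubar), columnwise
    return xbar + (c @ Z).T

X = mvn_sample_given_mean_cov(20, xbar=np.zeros(3), S=np.eye(3), seed=0)
print(np.allclose(X.mean(axis=0), 0.0))              # sample mean equals xbar
print(np.allclose(np.cov(X, rowvar=False), np.eye(3)))  # sample covariance equals S
```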

Other applications

- Inverse Gaussian samples given the sufficient statistics: the standard algorithm for generating inverse Gaussian variates leads to multiple roots of τ(u, θ) = t. Algorithm 3 must be used.
- Type II censored exponential samples: the equation τ(u, θ) = t has one or no solutions for θ. Algorithm 3 can be used.
- Discrete distributions (e.g. the Poisson distribution, logistic regression): the solutions for θ of the equation τ(u, θ) = t are typically intervals. Algorithm 3 is essentially used.

Concluding remarks

The idea of weighted sampling (Algorithm 2) is similar to the classical conditional Monte Carlo approach of Trotter and Tukey (1956). Trotter and Tukey essentially consider the case of a unique solution of the equation τ(u, θ) = t, but their method works without assuming sufficiency of the conditioning variable.

The approach presented here can also be used to compute conditional expectations E{φ(X) | T = t} in non-statistical problems: construct an artificial statistical model for which the conditioning variable T is sufficient, e.g. an exponential model

    f(x, θ) = c(θ) h(x) e^{θ T(x)},

where h(x) is the density of X and T(X) is sufficient for θ.

Methods based on conditional distributions given sufficient statistics are not in widespread use; the literature is scarce even for the normal and multinormal distributions. However, there seems to be an increasing interest in the recent literature.