
G METHOD IN ACTION: FROM EXACT SAMPLING TO APPROXIMATE ONE

UDREA PĂUN

Communicated by Marius Iosifescu

The main contribution of this work is the unification, by G method using Markov chains, therefore, a Markovian unification, in the finite case, of five (even six) sampling methods from exact and approximate sampling theory. This unification is in conjunction with a main problem, our problem of interest: finding the fastest Markov chains in sampling theory based on Metropolis-Hastings chains and their derivatives. We show, in the finite case, that the cyclic Gibbs sampler (the Gibbs sampler for short) belongs to our collection of hybrid Metropolis-Hastings chains from [U. Păun, A hybrid Metropolis-Hastings chain, Rev. Roumaine Math. Pures Appl. 56 (2011), 207–228]. So, we obtain, for any type of Gibbs sampler (cyclic, random, etc.), in the finite case, the structure of the matrices corresponding to the coordinate updates. Concerning our hybrid Metropolis-Hastings chains, to do unifications, and, as a result of these, to do comparisons and improvements, we construct, by G method, a chain which we call the reference chain. This is a very fast chain because it attains its stationarity at time 1. Moreover, the reference chain is the best one we can construct (concerning our hybrid chains), see the Uniqueness Theorem. The reference chain is constructed such that it can do what an exact sampling method, which we call the reference method, does. This method contains, as special cases, the alias method and the swapping method. Coming back to the reference chain, it is sometimes identical with the Gibbs sampler or with a special Gibbs sampler with grouped coordinates; we illustrate this case for two classes of wavy probability distributions. As a result of these facts, we give a method for generating, exactly (not approximately), a random variable with geometric distribution. Finally, we state the fundamental idea on the speed of convergence: the nearer our hybrid chain is to its reference chain, the faster our hybrid chain is. The addendum shows that we are on the right track.

AMS 2010 Subject Classification: 60J10, 65C05, 65C10, 68U20.

Key words: G method, unification, exact sampling, alias method, swapping method, reference method, reference chain, approximate sampling, hybrid Metropolis-Hastings chain, optimal hybrid Metropolis chain, Gibbs sampler, wavy probability distribution, uniqueness theorem, fundamental idea on the speed of convergence, G comparison.

REV. ROUMAINE MATH. PURES APPL. 62 (2017), 3, 413–452

1. SOME BASIC THINGS

In this section, we present some basic things (notation, notions, and results) from [7, 8] with completions. [7] refers, especially, to very fast Markov chains (i.e., Markov chains which converge very fast), while in [8], based on [7] (an example there was the starting point), a collection of hybrid Metropolis-Hastings chains is constructed. This collection of Markov chains has something in common (this is interesting!) with some exact sampling methods (see the next sections), one of these methods, the swapping method, suggesting, in fact, the construction of this collection (in the example from [7] mentioned above, there is a chain which can do what the swapping method does). These common things could help us to find fast approximate sampling methods or, at best, fast exact sampling methods (see the fundamental idea on the speed of convergence, etc.). This way, based on exact sampling methods, of finding efficient exact or approximate sampling methods is our way, the best way. On the other hand, the exact sampling methods are important sources for obtaining very fast Markov chains.

Set Par(E) = {Δ | Δ is a partition of E}, where E is a nonempty set. We shall agree that the partitions do not contain the empty set.

Definition 1.1. Let Δ1, Δ2 ∈ Par(E). We say that Δ1 is finer than Δ2 if ∀V ∈ Δ1, ∃W ∈ Δ2 such that V ⊆ W. Write Δ1 ⪯ Δ2 when Δ1 is finer than Δ2.

In this article, a vector is a row vector and a stochastic matrix is a row stochastic matrix. The entry (i, j) of a matrix Z will be denoted Z_ij or, if confusion can arise, Z_{i→j}. Set

⟨m⟩ = {1, 2, ..., m} (m ≥ 1), ⟨⟨m⟩⟩ = {0, 1, ..., m} (m ≥ 0),
N_{m,n} = {P | P is a nonnegative m × n matrix},
S_{m,n} = {P | P is a stochastic m × n matrix},
N_n = N_{n,n}, S_n = S_{n,n}.

Let P = (P_ij) ∈ N_{m,n}. Let ∅ ≠ U ⊆ ⟨m⟩ and ∅ ≠ V ⊆ ⟨n⟩. Set the matrices
P_U = (P_ij)_{i∈U, j∈⟨n⟩}, P^V = (P_ij)_{i∈⟨m⟩, j∈V}, and P^V_U = (P_ij)_{i∈U, j∈V}

(E.g., if P = (P_ij) ∈ N_{3,4}, then P_{{2}} is the row (P_21, P_22, P_23, P_24), P^{{3}} is the column with entries P_13, P_23, P_33, and P^{{3}}_{{2}} = (P_23).) Set
({i})_{i∈{s1,s2,...,st}} = ({s1}, {s2}, ..., {st});
({i})_{i∈{s1,s2,...,st}} ∈ Par({s1, s2, ..., st}). E.g.,
({i})_{i∈⟨n⟩} = ({1}, {2}, ..., {n}).

Definition 1.2. Let P ∈ N_{m,n}. We say that P is a generalized stochastic matrix if ∃a ≥ 0, ∃Q ∈ S_{m,n} such that P = aQ.

Definition 1.3 ([7]). Let P ∈ N_{m,n}. Let Δ ∈ Par(⟨m⟩) and Σ ∈ Par(⟨n⟩). We say that P is a [Δ]-stable matrix on Σ if P^L_K is a generalized stochastic matrix, ∀K ∈ Δ, ∀L ∈ Σ. In particular, a [Δ]-stable matrix on ({i})_{i∈⟨n⟩} is called [Δ]-stable for short.

Definition 1.4 ([7]). Let P ∈ N_{m,n}. Let Δ ∈ Par(⟨m⟩) and Σ ∈ Par(⟨n⟩). We say that P is a Δ-stable matrix on Σ if Δ is the least fine partition for which P is a [Δ]-stable matrix on Σ. In particular, a Δ-stable matrix on ({i})_{i∈⟨n⟩} is called Δ-stable for short, while a (⟨m⟩)-stable matrix on Σ is called stable on Σ for short. A stable matrix on ({i})_{i∈⟨n⟩} is called stable for short.

For interesting examples of Δ-stable matrices on Σ for some Δ and Σ, see Sections 2 and 3.

Let Δ1 ∈ Par(⟨m⟩) and Δ2 ∈ Par(⟨n⟩). Set (see [7] for G_{Δ1,Δ2} and [8] for Ḡ_{Δ1,Δ2})
G_{Δ1,Δ2} = {P | P ∈ S_{m,n} and P is a [Δ1]-stable matrix on Δ2}
and
Ḡ_{Δ1,Δ2} = {P | P ∈ N_{m,n} and P is a [Δ1]-stable matrix on Δ2}.
When we study or even when we construct products of nonnegative matrices (in particular, products of stochastic matrices) using G_{Δ1,Δ2} or Ḡ_{Δ1,Δ2}, we shall refer to this as the G method.

Let P ∈ Ḡ_{Δ1,Δ2}. Let K ∈ Δ1 and L ∈ Δ2. Then ∃a_{K,L} ≥ 0, ∃Q_{K,L} ∈ S_{|K|,|L|} such that P^L_K = a_{K,L} Q_{K,L}. Set
P⁺ = (P⁺_{KL})_{K∈Δ1, L∈Δ2}, P⁺_{KL} = a_{K,L}, ∀K ∈ Δ1, ∀L ∈ Δ2

(P⁺_{KL}, K ∈ Δ1, L ∈ Δ2, are the entries of the matrix P⁺). If confusion can arise, we write P⁺(Δ1, Δ2) instead of P⁺. In this article, when we work with the operator ( )⁺ = ( )⁺(Δ1, Δ2), we suppose, for labeling the rows and columns of matrices, that Δ1 and Δ2 are ordered sets (i.e., these are sets where the order in which we write their elements counts), even if we omit to specify this. E.g., let P ∈ G_{Δ1,Δ2}, where Δ1 = ({1, 2}, {3}) and Δ2 = ({1, 2}, {3, 4}). Further, P⁺ = P⁺(Δ1, Δ2) is the 2 × 2 matrix with entries
P⁺_{{1,2}{1,2}} = P_11 + P_12 (= P_21 + P_22), P⁺_{{1,2}{3,4}} = P_13 + P_14,
P⁺_{{3}{1,2}} = P_31 + P_32, P⁺_{{3}{3,4}} = P_33 + P_34
({1, 2} and {3} are the first and the second element of Δ1, respectively; based on this order, the first and the second row of P⁺ are labeled {1, 2} and {3}, respectively. The columns of P⁺ are labeled similarly.)

Below we give a basic result.

Theorem 1.5 ([8]). Let P ∈ Ḡ_{Δ1,Δ2} ∩ N_{m,n} and Q ∈ Ḡ_{Δ2,Δ3} ∩ N_{n,p}. Then
(i) PQ ∈ Ḡ_{Δ1,Δ3} ∩ N_{m,p};
(ii) (PQ)⁺ = P⁺Q⁺.
Proof. See [8].

In this article, the transpose of a vector x is denoted x′. Set e = e^(n) = (1, 1, ..., 1) ∈ R^n, ∀n ≥ 1.

Below we give an important result.

Theorem 1.6 ([8]). Let P1 ∈ Ḡ_{(⟨m1⟩),Δ2} ∩ N_{m1,m2}, P2 ∈ Ḡ_{Δ2,Δ3} ∩ N_{m2,m3}, ..., P_{n−1} ∈ Ḡ_{Δ_{n−1},Δ_n} ∩ N_{m_{n−1},m_n}, P_n ∈ Ḡ_{Δ_n,({i})_{i∈⟨m_{n+1}⟩}} ∩ N_{m_n,m_{n+1}}. Then
(i) P1 P2 ... P_n is a stable matrix;
(ii) (P1 P2 ... P_n)_{{i}} = P1⁺ P2⁺ ... P_n⁺, ∀i ∈ ⟨m1⟩ ((P1 P2 ... P_n)_{{i}} is the row i of P1 P2 ... P_n); therefore, P1 P2 ... P_n = e′π, where π = P1⁺ P2⁺ ... P_n⁺.
Proof. See [8].

Remark 1.7. Under the assumptions of Theorem 1.6, but taking P1 ∈ S_{m1,m2}, P2 ∈ S_{m2,m3}, ..., P_{n−1} ∈ S_{m_{n−1},m_n}, P_n ∈ S_{m_n,m_{n+1}}, we have
p P1 P2 ... P_n = π

for any probability distribution p on ⟨m1⟩. Consequently, Theorem 1.6 could be used to prove that certain Markov chains have finite convergence time (see [7] and Sections 2 and 3 for some examples).

Let P ∈ N_{m,n}. Set
α(P) = min_{i,j∈⟨m⟩} Σ_{k=1}^{n} min(P_ik, P_jk)
and
ᾱ(P) = (1/2) max_{i,j∈⟨m⟩} Σ_{k=1}^{n} |P_ik − P_jk|.
If P ∈ S_{m,n}, then α(P) is called the Dobrushin ergodicity coefficient of P ([5]; see, e.g., also [3, p. 56]).

Theorem 1.8. (i) ᾱ(P) = 1 − α(P), ∀P ∈ S_{m,n}.
(ii) ‖µP − νP‖ ≤ ‖µ − ν‖ ᾱ(P), ∀µ, ν, µ and ν probability distributions on ⟨m⟩, ∀P ∈ S_{m,n}.
(iii) ᾱ(PQ) ≤ ᾱ(P) ᾱ(Q), ∀P ∈ S_{m,n}, ∀Q ∈ S_{n,p}.
Proof. (i) See, e.g., [3, p. 57] or [4, p. 44]. (ii) See, e.g., [5] or [4, p. 47]. (iii) See, e.g., [5], or [3, pp. 58–59], or [4, p. 45].

Theorem 1.6 (see also Remark 1.7) could be used, e.g., in exact sampling theory based on finite Markov chains (see Section 3; see also Section 2), while the next result could be used, e.g., in approximate sampling theory based on finite Markov chains (see Section 4).

Theorem 1.9 ([8]). Let P1 ∈ N_{m1,m2}, P2 ∈ N_{m2,m3}, ..., P_n ∈ N_{m_n,m_{n+1}}. Let Δ1 = (⟨m1⟩), Δ2 ∈ Par(⟨m2⟩), ..., Δ_n ∈ Par(⟨m_n⟩), Δ_{n+1} = ({i})_{i∈⟨m_{n+1}⟩}. Consider the matrices L_l = ((L_l)_{VW})_{V∈Δ_l, W∈Δ_{l+1}} ((L_l)_{VW} is the entry (V, W) of the matrix L_l), where
(L_l)_{VW} = min_{i∈V} Σ_{j∈W} (P_l)_ij, ∀l ∈ ⟨n⟩, ∀V ∈ Δ_l, ∀W ∈ Δ_{l+1}.
Then
α(P1 P2 ... P_n) ≥ Σ_{K∈Δ_{n+1}} (L1 L2 ... L_n)_{⟨m1⟩ K}.
(Since L1 L2 ... L_n is a 1 × m_{n+1} matrix, it can be thought of as a row vector, but above we used and below we shall use, if necessary, the matrix notation for its entries instead of the vector one. Above the matrix notation (L1 L2 ... L_n)_{⟨m1⟩ K} was used instead of the vector one (L1 L2 ... L_n)_K because, in this article, the notation A_U, where A ∈ N_{p,q} and U ⊆ ⟨p⟩, means something different.)
Proof. See [8]. (Theorem 1.9 is part of a more general theorem from [8].)
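As a quick numerical illustration of the quantities appearing in Theorems 1.8 and 1.9, the following sketch (ours, not from the paper; the function names and the toy matrices are made up) computes α(P), ᾱ(P) and the lower bound of Theorem 1.9 for a small product of stochastic matrices, so that Theorem 1.8(i) and the inequality of Theorem 1.9 can be checked directly.

```python
# Illustrative sketch (not from the paper): alpha(P), alpha_bar(P) of Theorem 1.8
# and the lower bound of Theorem 1.9, computed for a toy example.
import numpy as np

def alpha(P):
    """alpha(P) = min_{i,j} sum_k min(P_ik, P_jk)."""
    m = P.shape[0]
    return min(np.minimum(P[i], P[j]).sum() for i in range(m) for j in range(m))

def alpha_bar(P):
    """alpha_bar(P) = (1/2) max_{i,j} sum_k |P_ik - P_jk|."""
    m = P.shape[0]
    return max(0.5 * np.abs(P[i] - P[j]).sum() for i in range(m) for j in range(m))

def theorem_1_9_bound(Ps, partitions):
    """Lower bound of Theorem 1.9: sum over K in Delta_{n+1} of (L_1 ... L_n)_{<m_1>, K}.

    Ps[l] is a nonnegative m_l x m_{l+1} matrix; partitions[l] is Delta_{l+2}... more
    precisely, partitions is the list [Delta_2, ..., Delta_{n+1}], each partition given
    as a list of lists of (0-based) column indices of the corresponding factor.
    """
    row_partition = [list(range(Ps[0].shape[0]))]   # Delta_1 = (<m_1>)
    L = None
    for P, col_partition in zip(Ps, partitions):
        # (L_l)_{V,W} = min_{i in V} sum_{j in W} P_ij
        Ll = np.array([[P[np.ix_(V, W)].sum(axis=1).min() for W in col_partition]
                       for V in row_partition])
        L = Ll if L is None else L @ Ll
        row_partition = col_partition
    return L.sum()                                   # sum over K in Delta_{n+1}

# Toy check: two 4x4 stochastic matrices, Delta_2 = ({1,2},{3,4}), Delta_3 = singletons.
P1 = np.array([[0.4, 0.3, 0.2, 0.1],
               [0.3, 0.4, 0.1, 0.2],
               [0.1, 0.2, 0.4, 0.3],
               [0.2, 0.1, 0.3, 0.4]])
P2 = P1.copy()
partitions = [[[0, 1], [2, 3]], [[0], [1], [2], [3]]]
print(alpha(P1 @ P2), ">=", theorem_1_9_bound([P1, P2], partitions))
print(alpha_bar(P1), "==", 1 - alpha(P1))            # Theorem 1.8(i) for stochastic P
```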

Definition 1.10 (see, e.g., [0, p. 80]). Let P ∈ N_{m,n}. We say that P is a row-allowable matrix if it has at least one positive entry in each row.

Let P ∈ N_{m,n}. Set P̄ = (P̄_ij) ∈ N_{m,n},
P̄_ij = 1 if P_ij > 0, P̄_ij = 0 if P_ij = 0, ∀i ∈ ⟨m⟩, ∀j ∈ ⟨n⟩.
We call P̄ the incidence matrix of P (see, e.g., [3]).

In this article, some statements on matrices hold, obviously, possibly up to a permutation of rows and columns. For simplification, further, we omit to specify this fact.

Warning! In this article, if a Markov chain has the transition matrix P = P1 P2 ... P_s, where s ≥ 1 and P1, P2, ..., P_s are stochastic matrices, then any 1-step transition of this chain is performed via P1, P2, ..., P_s, i.e., doing s transitions: one using P1, one using P2, ..., one using P_s. (See also Section 2.)

Let S = ⟨r⟩. Let π = (π_i)_{i∈S} = (π_1, π_2, ..., π_r) be a positive probability distribution on S. One way to sample approximately or, at best, exactly from S when r ≥ 2 is by means of the hybrid Metropolis-Hastings chain from [8]. Below we define this chain.

Let E be a nonempty set. Set Δ1 ≺ Δ2 if Δ1 ⪯ Δ2 and Δ1 ≠ Δ2, where Δ1, Δ2 ∈ Par(E). Let Δ1, Δ2, ..., Δ_{t+1} ∈ Par(S) with Δ1 = (S), Δ_{t+1} = ({i})_{i∈S}, and Δ_{l+1} ≺ Δ_l, ∀l ∈ ⟨t⟩, where t ≥ 1. Let Q1, Q2, ..., Q_t ∈ S_r such that
(C1) Q̄1, Q̄2, ..., Q̄_t are symmetric matrices;
(C2) (Q_l)^L_K = 0, ∀l ∈ ⟨t⟩ \ {1}, ∀K, L ∈ Δ_l, K ≠ L (this assumption implies that Q_l is a block diagonal matrix and a [Δ_l]-stable matrix on Δ_l, ∀l ∈ ⟨t⟩ \ {1});
(C3) (Q_l)^U_K is a row-allowable matrix, ∀l ∈ ⟨t⟩, ∀K ∈ Δ_l, ∀U ∈ Δ_{l+1}, U ⊆ K.
Although Q_l, l ∈ ⟨t⟩, are not irreducible matrices if l ≠ 1, we define the matrices P_l, l ∈ ⟨t⟩, as in the Metropolis-Hastings case ([6] and []; see, e.g., also [7, pp. 3336], [9, Chapter 6], [, pp. 5], [5, pp. 6366], and [, Chapter 0]), namely, P_l = ((P_l)_ij) ∈ S_r,
(P_l)_ij =
 0 if j ≠ i and (Q_l)_ij = 0,
 (Q_l)_ij min(1, π_j (Q_l)_ji / (π_i (Q_l)_ij)) if j ≠ i and (Q_l)_ij > 0,
 1 − Σ_{k≠i} (P_l)_ik if j = i,

∀l ∈ ⟨t⟩. Set P = P1 P2 ... P_t.

Theorem 1.11 ([8]). Concerning P above we have πP = π and P > 0.
Proof. See [8].

By Theorem 1.11, P^n → e′π as n → ∞. We call the Markov chain with transition matrix P the hybrid Metropolis-Hastings chain. In particular, we call this chain the hybrid Metropolis chain when Q1, Q2, ..., Q_t are symmetric matrices. We call the conditions (C1)–(C3) the basic conditions of the hybrid Metropolis-Hastings chain. In particular, we call these conditions the basic conditions of the hybrid Metropolis chain when Q1, Q2, ..., Q_t are symmetric matrices.

The basic conditions (C1)–(C3) and other conditions, which we call the special conditions, determine special hybrid Metropolis-Hastings chains. E.g., in [8] the following special hybrid Metropolis chain was considered. Supposing that Δ_l = (K_1^{(l)}, K_2^{(l)}, ..., K_{u_l}^{(l)}), ∀l ∈ ⟨t + 1⟩, this chain satisfies the conditions (C1)–(C3) and, moreover, the conditions:
(c1) |K_1^{(l)}| = |K_2^{(l)}| = ... = |K_{u_l}^{(l)}|, ∀l ∈ ⟨t + 1⟩ with u_l ≥ 2;
(c2) r = r_1 r_2 ... r_t with r_1 r_2 ... r_l = |Δ_{l+1}|, ∀l ∈ ⟨t⟩, and r_t = |K_1^{(t)}| (this condition is compatible with Δ_{t+1} ≺ Δ_t ≺ ... ≺ Δ1);
(c3) (c3.1) Q_l is a symmetric matrix such that (c3.2) (Q_l)_ii > 0, ∀i ∈ S, and (Q_l)_{i1 j1} = (Q_l)_{i2 j2}, ∀i1, i2, j1, j2 ∈ S with i1 ≠ j1, i2 ≠ j2, and (Q_l)_{i1 j1}, (Q_l)_{i2 j2} > 0, ∀l ∈ ⟨t⟩ ((c3.2) says that all the positive entries of Q_l, excepting the entries (Q_l)_ii, i ∈ S, are equal, ∀l ∈ ⟨t⟩);
(c4) (Q_l)^U_K has in each row just one positive entry, ∀l ∈ ⟨t⟩, ∀K ∈ Δ_l, ∀U ∈ Δ_{l+1} with U ⊆ K (this condition is compatible with (c3.1) because (Q_l)^W_V is a square matrix, ∀l ∈ ⟨t⟩, ∀V, W ∈ Δ_{l+1}).
The condition (c1) is superfluous because it follows from (C1) and (c4). (c2) is also superfluous because it follows from (c1) and Δ_{t+1} ≺ Δ_t ≺ ... ≺ Δ1. It is interesting to note that the matrices P1, P2, ..., P_t satisfy conditions similar to (C1)–(C3) and, for this special chain, moreover, (c4); simply we replace Q_l with P_l, ∀l ∈ ⟨t⟩, in (C1)–(C3) and, if need be, in (c4). (c1)–(c2) are common conditions for Q1, Q2, ..., Q_t and P1, P2, ..., P_t.

In [8], for the chain satisfying the conditions (C1)–(C3) and (c1)–(c4), the positive entries of the matrices Q_l, l ∈ ⟨t⟩, were, taking Theorem 1.9 into account, optimally chosen, i.e., these were chosen such that the lower bound of α(P1 P2 ... P_t) from Theorem 1.9 be as large as possible (we need this condition to obtain a chain with a speed of convergence as large as possible). More

precisely, setting
f_l = min_{i,j∈S, (Q_l)_ij>0} π_j/π_i
(do not forget the condition (Q_l)_ij > 0!) and x_l = (Q_l)_ij, where i, j ∈ S are fixed such that i ≠ j and (Q_l)_ij > 0 (see (c3) again), it was found (taking Theorem 1.9 into account) that
x_l = 1/(f_l + r_l − 1).
We call this chain the optimal hybrid Metropolis chain with respect to the conditions (C1)–(C3) and (c1)–(c4) and the inequality from Theorem 1.9; we call it the optimal hybrid Metropolis chain for short.

In Section 3, we show that the Gibbs sampler on ⟨⟨h⟩⟩^n, h ≥ 1, n ≥ 1 (more generally, on ⟨⟨h1⟩⟩ × ⟨⟨h2⟩⟩ × ... × ⟨⟨h_n⟩⟩, h1, h2, ..., h_n ≥ 1, n ≥ 1), belongs to our collection of hybrid Metropolis-Hastings chains. Moreover, we shall show that the Gibbs sampler on ⟨⟨h⟩⟩^n satisfies all the conditions (c1)–(c4), excepting (c3).

As to the estimate of ‖p_n − π‖ (p_n and π are defined below), we have the next result.

Theorem 1.12 (see, e.g., [8]). Let P ∈ S_r be an aperiodic irreducible matrix. Consider a Markov chain with transition matrix P and limit probability distribution π. Let p_n be the probability distribution of the chain at time n, ∀n ≥ 0. Then
‖p_n − π‖ ≤ ᾱ(P^n), ∀n ≥ 0
(P^0 = I_r; by Theorem 1.8(iii), ᾱ(P^n) ≤ (ᾱ(P))^n, ∀n ≥ 0, ᾱ(P^n) ≤ (ᾱ(P^k))^{⌊n/k⌋}, ∀n ≥ 1, ∀k ∈ ⟨n⟩ (⌊x⌋ = max{b | b ∈ Z, b ≤ x}, ∀x ∈ R), etc.).
Proof. See, e.g., [8] (Theorem 1.8(ii) is used for the proof).

2. EXACT SAMPLING

In this section, we consider a similarity relation. This has some interesting properties. Then we consider two methods for generating random variables exactly, in the finite case only. The first one, the alias method, is a special case of the second one. For each of these methods, we associate a Markov chain

such that this chain can do what the method does. These associated chains are important for our unification. Finally, we associate a hybrid chain with a reference chain.

Definition 2.1. Let P, Q ∈ Ḡ_{Δ1,Δ2} ∩ N_{m,n}. We say that P is similar to Q if P⁺ = Q⁺. Set P ∼ Q when P is similar to Q.

Obviously, ∼ is an equivalence relation on Ḡ_{Δ1,Δ2}.

Theorem 2.2. Let P1, U1 ∈ Ḡ_{Δ1,Δ2} ∩ N_{m1,m2} and P2, U2 ∈ Ḡ_{Δ2,Δ3} ∩ N_{m2,m3}. Suppose that P1 ∼ U1 and P2 ∼ U2. Then P1 P2 ∼ U1 U2.
Proof. By Theorem 1.5 we have P1 P2, U1 U2 ∈ Ḡ_{Δ1,Δ3} ∩ N_{m1,m3}. By Theorem 1.5 and Definition 2.1 we have
(P1 P2)⁺ = P1⁺ P2⁺ = U1⁺ U2⁺ = (U1 U2)⁺.
Therefore, P1 P2 ∼ U1 U2.

Theorem 2.3. Let P1, U1 ∈ Ḡ_{Δ1,Δ2} ∩ N_{m1,m2}, P2, U2 ∈ Ḡ_{Δ2,Δ3} ∩ N_{m2,m3}, ..., P_n, U_n ∈ Ḡ_{Δ_n,Δ_{n+1}} ∩ N_{m_n,m_{n+1}}. Suppose that P1 ∼ U1, P2 ∼ U2, ..., P_n ∼ U_n. Then P1 P2 ... P_n ∼ U1 U2 ... U_n. If, moreover, Δ1 = (⟨m1⟩) and Δ_{n+1} = ({i})_{i∈⟨m_{n+1}⟩}, then P1 P2 ... P_n = U1 U2 ... U_n (therefore, when Δ1 = (⟨m1⟩) and Δ_{n+1} = ({i})_{i∈⟨m_{n+1}⟩}, a product of n representatives, the first of an equivalence class included in Ḡ_{Δ1,Δ2}, the second of an equivalence class included in Ḡ_{Δ2,Δ3}, ..., the nth of an equivalence class included in Ḡ_{Δ_n,Δ_{n+1}}, does not depend on the choice of representatives).
Proof. The first part follows by Theorem 2.2 and induction. As to the second part, by Theorem 1.6, P1 P2 ... P_n and U1 U2 ... U_n are stable matrices and, further,
(P1 P2 ... P_n)_{{i}} = P1⁺ P2⁺ ... P_n⁺ = U1⁺ U2⁺ ... U_n⁺ = (U1 U2 ... U_n)_{{i}}, ∀i ∈ ⟨m1⟩.
Therefore, P1 P2 ... P_n = U1 U2 ... U_n.
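The operator ( )⁺ and the similarity relation are easy to compute once the partitions Δ1 and Δ2 are given explicitly. The sketch below (ours, not from the paper; the function names and the two 4 × 4 matrices are made up for illustration) computes P⁺ block by block and checks Definition 2.1 for two stochastic matrices that share the same P⁺.

```python
# Illustrative sketch (not from the paper): the operator ( )+ of Section 1 and the
# similarity relation of Definition 2.1, for given partitions of rows and columns.
import numpy as np

def plus(P, row_partition, col_partition, tol=1e-12):
    """Return P+: the entry (K, L) of P+ is the common row sum of the block P^L_K."""
    Pp = np.empty((len(row_partition), len(col_partition)))
    for a, K in enumerate(row_partition):
        for b, L in enumerate(col_partition):
            row_sums = P[np.ix_(K, L)].sum(axis=1)
            assert np.ptp(row_sums) < tol, "P is not [Delta1]-stable on Delta2"
            Pp[a, b] = row_sums[0]
    return Pp

def similar(P, U, row_partition, col_partition):
    """P ~ U iff P+ = U+ (Definition 2.1)."""
    return np.allclose(plus(P, row_partition, col_partition),
                       plus(U, row_partition, col_partition))

# Toy example (0-based indices): Delta1 = ({1,2},{3,4}), Delta2 = ({1,2},{3,4}).
D1 = [[0, 1], [2, 3]]
D2 = [[0, 1], [2, 3]]
P = np.array([[0.3, 0.4, 0.2, 0.1],
              [0.7, 0.0, 0.1, 0.2],
              [0.1, 0.1, 0.5, 0.3],
              [0.0, 0.2, 0.4, 0.4]])
U = np.array([[0.5, 0.2, 0.3, 0.0],
              [0.2, 0.5, 0.0, 0.3],
              [0.2, 0.0, 0.8, 0.0],
              [0.1, 0.1, 0.2, 0.6]])
print(plus(P, D1, D2))          # [[0.7, 0.3], [0.2, 0.8]]
print(similar(P, U, D1, D2))    # True: P and U lie in the same equivalence class
```

By Theorem 2.2, products built from such similar factors are again similar, and, under the hypotheses of the second part of Theorem 2.3, even equal.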

4 Udrea Päun 0 The reader is assumed to be acquainted with the rst method below. Recall that for each of the two methods below we associate a Markov chain such that this chain can do what the method does.. The alias method (see, e.g., [4, pp. 073] and [5, pp. 57]). To illustrate our Markovian modeling here, we consider, for simplication, the example from [5, p. 5]. Following this example, we have a random variable X with the values,, 3, 4, 5 and probabilities π = 0.4, π = 0.7, π 3 = 0.07, π 4 = 0.4, π 5 = 0., where π i = P (X = i), i 5. The alias method leads, following the example from [5, p. 5] too, to the table (having rows and 5 columns) 0.07 3 0.3 0. 5 0.09 0.4 4 0.06 0.9 0.0 0.0 0 0.07, 0.3, etc. are probabilities while,, 3, 4, 5 in bold print are values of X. In each column of the table, the sum of probabilities is equal to 0.0. We associate the alias method for generating X (when this method is applied to the generation of X) with the Markov chain (X n ) n 0 with state space S = {(3, ), (, ), (5, ), (, ), (4, 3), (, 3), (, 4), (, 4), (, 5)} if (x, x ) S, then x denotes a value of X while x denotes the column of table in which the value x is; for x = 5 (column 5), we only consider the state (, 5) because in column 5 the second probability is 0 and transition matrix P = P P, where P = (3, ) (, ) (5, ) (, ) (4, 3) (, 3) (, 4) (, 4) (, 5) 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 (the columns are labeled similarly, i.e., (3, ), (, ), (5, ), (, ), (4, 3), (, 3), (, 4), (, 4), (, 5) from left to right) and

G method in action: from exact sampling to approximate one 43 P = (3, ) (, ) (5, ) (, ) (4, 3) (, 3) (, 4) (, 4) (, 5) 0.07 0.0 0.07 0.0 0.3 0.0 0.3 0.0 0. 0.0 0. 0.0 0.09 0.0 0.09 0.0 0.4 0.0 0.4 0.0 0.06 0.0 0.06 0.0 0.9 0.0 0.9 0.0 0.0 0.0 0.0 0.0. P G,, P G, 3, where = (S), = ({(3, ), (, )}, {(5, ), (, )}, {(4, 3), (, 3)}, {(, 4), (, 4)}, {(, 5)}), 3 = ({(x, y)}) (x,y) S. By Theorem.6 it follows that P is a stable matrix and, more precisely, where P = e ρ, ρ = (0.07, 0.3, 0., 0.09, 0.4, 0.06, 0.9, 0.0, 0.0) (see the table again). Recall even if, here, P = e ρ that any -step transition of this chain is performed via P, P, i.e., doing two transitions: one using P and the other using P. Passing this Markov chain from an initial state, say, (3, ) (the state at time 0) to a state at time is done using, one after the other, the probability distributions (P ) {(3,)} (this is the rst row of matrix P ) suppose that using this probability distribution the chain arrives at state (i, j) and (P ) {(i,j)}. The alias method for generating X uses these probability distributions too, in the same order, the 0 s do not count, they can be removed e.g., (P ) {(3,)} = (0.0, 0, 0.0, 0, 0.0, 0, 0.0, 0, 0.0) leads, removing the 0 s, to (0.0, 0.0, 0.0, 0.0, 0.0),

44 Udrea Päun which is the probability distribution used by the alias method in its rst step (when, obviously, this method is applied to X from here). Therefore, this chain can do what the alias method does (we need to run this chain just one step (or two steps due to P and P ) until time inclusive). By Theorem.3 we can replace P with any matrix, U, similar to P obviously, it is more advantageous that each of the matrices U {(3,),(,)}, U {(5,),(,)}, U {(4,3),(,3)}, U {(,4),(,4)}, U {(,5)} have in each row just one positive entry. E.g., we can take U = and have (3, ) (, ) (5, ) (, ) (4, 3) (, 3) (, 4) (, 4) (, 5) 0.0 0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0 0.0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0 0.0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0.0 0 0 0.0 0.0 0 0 0.0 0.0 P = P P = U P.. The reference method for our collection of hybrid chains (in particular, for the Gibbs sampler). We call it the reference method for short. Its name as well as the name of the chain determined of this, called the reference chain (see below for this chain), were inspired by the reference point from physics and other elds. The reference method is an iterative composition method we include the degenerate case when no composition is done (this case corresponds to the case t = of reference method). Below we present the reference method. Let X be a random variable with positive probability distribution π = (π, π,..., π r ) = (π i ) i S, where S = r. Let,,..., t+ Par(S) with = (S)... t+ = ({i}) i S, where t. Set Obviously, a (l) K,L = π i i L, l t, K l, L l+ with L K. π i i K a (l) K,L = P (X L X K ), l t, K l, L l+ with L K (a () S,L = i L π i = P (X L), L ),

3 G method in action: from exact sampling to approximate one 45 and ( ) a (l) K,L L K l+ is a probability distribution on K l+ = {K A A l+ } = = {B B l+, B K }, l t, K l. We generate random variable X as follows. (See also discrete mixtures in, e.g., [4, p. 6], the decomposition method in, e.g., [4, p. 66], and the composition method in, e.g., [4, p. 66].) ( ) Step. Generate L () a () S,L. Suppose that we obtained L () = L S L (L S = ). Set K = ( L. ) Step. Generate L () a () K,L. Suppose that we obtained L K 3 L () = L (L K 3 ). Set K = L.. Suppose that at Step t we obtained L (t ) = L t (L t K t t ). Set K t = L t. ( ) Step t. Generate L (t) a (t) K t,l. Suppose that we obtained L K t t+ L (t) = L t (L t K t t+ ). Since t+ = ({i}) i S, it follows that i S such that L t = {i}. Set X = i this value of X is generated according to its probability distribution π because by general multiplicative formula (see, e.g., [3, p. 6]) we have P (X = i) = P (X {i}) = P (X L t ) = P (X L L... L t ) = = a () S,L a () L,L...a (t) L t,l t = π i. The reference method is very fast if, practically speaking, we know the quantities a (l) K,L, l t, K l, L l+ with L K. Unfortunately, this does not happen in general if S is too large. But, fortunately, we can compute all or part of the quantities a (l) K,L when K and L are small this is an important thing (see, e.g., in Section 3, the hybrid chains with P ). To connect the reference method to our collection of hybrid chains, we associate the reference method (for generating a (nite) random variable (when this method is applied to the generation of a (nite) random variable)) with a (nite) Markov chain. To do this, rst, recall that the partitions for the reference method are,,..., t+ Par(S) with = (S)... t+ = ({i}) i S, where t. Let R, R,..., R t S r such that (A) R G,, R G, 3,..., R t G t, t+ ; (A) (R l ) L K = 0, l t {}, K, L l, K L (this assumption implies that R l is a block diagonal matrix and l -stable matrix on l, l t {});

46 Udrea Päun 4 (A3) (R l ) U K has in each row just one positive entry, l t, K l, U l+, U K (this assumption implies that (R l ) U K is a row-allowable matrix, l t, K l, U l+, U K). (A) and (A3) are similar to (C) and (c4) from Section, respectively. Therefore, the matrices P, P,..., P t of hybrid chain (obviously, we refer to our hybrid Metropolis-Hastings chain) and the matrices R, R,..., R t have some common things, respectively. This fact contributes to our unication (see Section 4). Suppose that each positive entry of (the matrix) (R l ) U K is equal to a(l) K,U are equal), l t, K (by (A) and (A3), all the positive entries of (R l ) U K l, U l+, U K. Set R = R R...R t. The following result is a main one both for the reference chain (this is dened below) and for (our) hybrid chains. Theorem.4. Under the above assumptions the following statements hold. (i) R l R l+...r t is a block diagonal matrix, l t {}, and l -stable matrix, l t. (ii) πr l R l+...r t = π, l t. (iii) R is a stable matrix. (iv) R = e π. Proof. (i) The rst part follows from (A). Now, we show the second part. By Theorem.5(i) and (A), R l R l+...r t G l, t+. Consequently, R l R l+...r t is a [ l ]-stable matrix, l t, because t+ = ({i}) i S (see Denition.3). Case. l =. Since = (S), it follows that R R...R t is a -stable (stable for short) matrix (see Denition.4). Case. l t {}. By (A) it follows that R l R l+...r t is a l -stable matrix. (ii) Let j S. Then!U (t) t such that {j} U (t) (! = there exists a unique). By (A)(A3) we have ( R + t )U (t) {j} > 0 and ( R + t ) V {j} = 0, V t, V U (t). Further, since t t,!u (t ) t such that U (t) U (t ). By (A) (A3) we have ( R + t > 0 and )U ( R + ) (t ) U (t) t = 0, V V U (t) t, V U (t ). Proceeding in this way, we nd a sequence U (), U (),..., U (t+) such that U (), U (),..., U (t+) t+, {j} = U (t+) U (t)... U () = S,

5 G method in action: from exact sampling to approximate one 47 and ( R + l > 0 and )U ( R + ) (l) U (l+) l = 0, l t, V V U (l+) l, V U (l). Let l t. Let i U (l) (j U (l) as well). By (i) (the fact that R l R l+...r t is a l -stable matrix), (A)(A), and Theorem.5 we have (R l R l+...r t ) ij = (R l R l+...r t ) + U (l) {j} = ( R + l R + ) l+...r + t U (l) {j} = = ( R + ( l )U (l) U R + (l+) l+... )U ( R + ) (l+) U (l+) t = a (l) U (l),u (l+) a (l+) U (l+),u (l+)...a (t) U (t),{j} = π j U (t) {j} = k U (l) π k (it was to be expected see (i) that this ratio will not depend on i U (l) ). Consequently, πr l R l+...r t = π. (iii) This follows by (i) or (iv). (iv) By proof of (ii) we have R ij = π j, i, j S. Therefore, R = e π. We associate the reference method for generating X with the Markov chain (this depends on X too) with state space S = r and transition matrix R = R R...R t. Recall even if, here, by Theorem.4(iv), R = e π that any -step transition of this chain is performed via R, R,..., R t, i.e., doing t transitions: one using R, one using R,..., one using R t. We call the above Markov chain the reference (Markov) chain. This is another example of chain with nite convergence time (see, e.g., also [7] for other examples of chains with nite convergence time). The best case is when we know the quantities a (l) K,L, l t, K l, L l+, L K; we can always know all or part of the quantities a (l) K,L when K and L are small a happy case! Passing the reference chain from an initial state, say, (the state at time 0) to a state at time is done using, one after the other, the probability distributions (R ) {} (this is the rst row of matrix R ), (R ) {i },..., (R t), where {it} i l = the state the chain arrives using (R l ) {il }, l t {}, setting i =. The reference method for generating X uses these probability distributions too, in the same order, the 0 s do not count, they can be removed e.g., if ( ) (R ) {} = a (), where S = 4, K () =, K () S,K (), 0, 0, a () S,K () = {3, 4}, then, removing the 0 s, we obtain ( ) a (), S,K (), a () S,K ()

48 Udrea Päun 6 which is the probability distribution ( used) by the reference method in its rst step, being, here, equal to K (), K(). Therefore, the reference chain can do what the reference method does (we need to run this chain just one step (or t steps due to R, R,..., R t ) until time inclusive). To illustrate the reference chain, we consider a random variable X with probability distribution π = (π, π,..., π 8 ). Taking the partitions = ( 8 ), = ({,, 3, 4}, {5, 6, 7, 8}), 3 = ({, }, {3, 4}, {5, 6}, {7, 8}), 4 = ({i}) i 8, a reference chain is the Markov chain with state space S = 8 and transition matrix R = R R R 3, where R = a () S,K () 0 a () S,K () 0 0 0 a () S,K () 0 0 a () S,K () 0 0 a () S,K () 0 0 0 a () S,K () a () S,K () 0 0 a () S,K () 0 0 0 0 0 0 0 0 a () S,K () 0 0 0 0 a () S,K () 0 0 a () S,K () 0 a () S,K () 0 a () S,K () 0 0 0 0 0 0 0 0 a () 0 S,K () 0 0 0 0 0 a () 0 0 0 0 a () S,K () K () = {,, 3, 4}, K () = {5, 6, 7, 8}, a () = π S,K () + π + π 3 + π 4, a () = π S,K () 5 + π 6 + π 7 + π 8 (R G, ), R () = R = R() 0 a () K (),K(3) a () K (),K(3) a () K (),K(3) 0 a () K (),K(3) R (), 0 a () K (),K(3) 0 a () K (),K(3) 0 0 a () K (),K(3) 0 0 a () K (),K(3), S,K () 0,

7 G method in action: from exact sampling to approximate one 49 R () = a () K (),K(3) 3 a () K (),K(3) 3 0 a () K (),K(3) 3 a () K (),K(3) 3 0 a () K (),K(3) 4 0 0 a () K (),K(3) 4 a () K (),K(3) 4 0 0 a () K (),K(3) 4 K (3) = {, }, K (3) = {3, 4}, K (3) 3 = {5, 6}, K (3) 4 = {7, 8}, a () π + π =, a () π 3 + π 4 =, K (),K(3) π + π + π 3 + π 4 K (),K(3) π + π + π 3 + π 4 a () π 5 + π 6 =, a () π 7 + π 8 = K (),K(3) 3 π 5 + π 6 + π 7 + π 8 K (),K(3) 4 π 5 + π 6 + π 7 + π 8 (R G, 3 (moreover, it is a -stable matrix on, -stable matrix on 3, and block diagonal matrix)), and R 3 = R w (3) = a (3) = K w (3),K (4) w R (3) a (3) K (3) w,k (4) w a (3) K (3) w,k (4) w K (4) R (3) R (3) 3 a (3) K (3) w,k (4) w a (3) K (3) w,k (4) w i = {i}, i 8, π w, a (3) = π w + π w K w (3),K (4) w R (3) 4, 0 0, w 4,, π w π w + π w, w 4 (R 3 G 3, 4 (moreover, it is a 3 -stable matrix on 3, 3 -stable matrix (because 4 = ({i}) i 8, see Denition.4), and block diagonal matrix)). By Theorem.4(iv) or direct computation we have R = e π. Warning! In the above example, R, R, R 3 are representatives of certain equivalence classes: R l, where l 3, is a representative of the equivalence class determined by quantities a (l) K,L, K l, L l+ (the number of elements of this class can easily be determined; e.g., R belongs to an equivalence class with the cardinal equal to 4 6 because R K() and R K() are 8 4 matrices,...). Each triple (R, R, R 3 ) of representatives determines a reference chain, all these chains having the product R R R 3 equal to e π (see Theorems.3 and.4(iv)).
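The reference method itself is straightforward to implement once the quantities a^{(l)}_{K,L} = (Σ_{i∈L} π_i)/(Σ_{i∈K} π_i) are computable. The sketch below (ours, not from the paper; the function name and the numerical values of π are made up) runs the reference method on the 8-state example above, with the partitions Δ1 = (⟨8⟩), Δ2 = ({1,2,3,4}, {5,6,7,8}), Δ3 = ({1,2}, {3,4}, {5,6}, {7,8}), Δ4 = ({i})_{i∈⟨8⟩} (written with 0-based indices in the code), and checks empirically that the output has distribution π, in accordance with Theorem 2.4(iv).

```python
# Illustrative sketch (not from the paper): the reference method of Section 2 on the
# 8-state example above, with a made-up positive probability distribution pi.
import random

def reference_method_sample(pi, partitions, rng=random):
    """Generate one value of X with P(X = i) = pi[i] by iterative composition:
    at each level choose L in Delta_{l+1}, L subset of the current K, with
    probability a^(l)_{K,L} = sum_{i in L} pi[i] / sum_{i in K} pi[i]."""
    K = partitions[0][0]                      # Delta_1 = (S)
    for next_level in partitions[1:]:
        children = [L for L in next_level if set(L) <= set(K)]
        weights = [sum(pi[i] for i in L) for L in children]
        K = rng.choices(children, weights=weights, k=1)[0]
    (i,) = K                                  # Delta_{t+1} = singletons
    return i

pi = [0.05, 0.10, 0.15, 0.10, 0.20, 0.05, 0.25, 0.10]
partitions = [[list(range(8))],
              [[0, 1, 2, 3], [4, 5, 6, 7]],
              [[0, 1], [2, 3], [4, 5], [6, 7]],
              [[i] for i in range(8)]]

# Empirical check that the output follows pi: the product R1 R2 R3 equals e' pi
# (Theorem 2.4(iv)), so one "step" of the reference chain already samples from pi.
counts = [0] * 8
for _ in range(100_000):
    counts[reference_method_sample(pi, partitions)] += 1
print([round(c / 100_000, 3) for c in counts])
```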

430 Udrea Päun 8 Now, it is easy to see that the chain associated with the alias method is a special case of the reference chain. Therefore (this was to be expected), the alias method is a special case of the reference method. Another interesting special case of the reference method (and of the reference chain) is the method of uniform generation of the random permutations of order n from, e.g., Example. in [7]. In Example. from [7], it is presented and analyzed a Markov chain which can do what the swapping method does (for the swapping method, see, e.g., [4, pp. 645646]). When the probability distribution of interest, π, is uniform, we have a (l) K,L = L K, l t, K l, L l+, L K another happy case! We here supposed that there are t + partitions; in Example. from [7], t = n. Finally, for the next sections, we need to associate a hybrid Metropolis- Hastings chain with a reference chain. Below we state the terms of this association. Remark.5. This association makes sense (warning!) if, obviously, both chains are dened by means of the same state space S, the same probability distribution (of interest) π on S, and the same sequence of partitions,,..., t+ on S with = (S)... t+ = ({i}) i S, where t. We shall use expressions as the hybrid Metropolis-Hastings chain and its reference chain, the Gibbs sampler and its reference chain, the reference chain of a hybrid Metropolis-Hastings chain, etc., meaning that both chains from each expression are associated in this manner and the reference chain (with transition matrix R = R R...R t ) from each expression has the matrices R, R,..., R t specied or not, in the latter case, R, R,..., R t are only from the equivalence classes R, R,..., R t, respectively ( R l = the equivalence class of R l, l t ), so, the reader, in this latter case, has a complete freedom to choose the matrices R, R,..., R t as he/she wishes. The association from Remark.5 is good for our unication and, as a result of this, is good for comparisons and improvements (see the next sections). 3. EXACT SAMPLING USING HYBRID CHAINS In this section, rst, we show that the Gibbs sampler on h n (i.e., on {0,,..., h} n ), h, n, belongs to our collection of hybrid Metropolis-Hastings chains from [8]. Second, we give some interesting classes of probability distributions they are interesting because: ) supposing that the generation time is not limited, we can generate any random variable with probability distribution belonging to the union of these classes exactly (not approximately) by Gibbs sampler (sometimes by optimal hybrid Metropolis chain) or by a special Gibbs

sampler with grouped coordinates in just one step; 2) sometimes, the Gibbs sampler or a special Gibbs sampler with grouped coordinates is identical with its reference chain. An application on the random variables with geometric distribution is given. Third, results on the hybrid chains or reference chains are given.

We begin with the definition of the Gibbs sampler; we refer to the cyclic Gibbs sampler. Below we consider the (cyclic) Gibbs sampler on ⟨⟨h⟩⟩^n, h ≥ 1, n ≥ 1 (more generally, we can consider the state space ⟨⟨h1⟩⟩ × ⟨⟨h2⟩⟩ × ... × ⟨⟨h_n⟩⟩, h1, h2, ..., h_n ≥ 1, n ≥ 1), see [0]; see, e.g., also [, 3, 6], [7, pp. 364], [8], [9, Chapter 5], [, pp. and 554], [5, pp. 698], and [, Chapters 5 and 7].

Recall that the entry (i, j) of a matrix Z is denoted Z_ij or, if confusion can arise, Z_{i→j}. We use the convention that an empty term vanishes.

Let x = (x1, x2, ..., x_n) ∈ S = ⟨⟨h⟩⟩^n, h ≥ 1, n ≥ 1. Set
x[k→l] = (x1, x2, ..., x_{l−1}, k, x_{l+1}, ..., x_n), ∀k ∈ ⟨⟨h⟩⟩, ∀l ∈ ⟨n⟩
(consequently, x[k→l] ∈ S, ∀k ∈ ⟨⟨h⟩⟩, ∀l ∈ ⟨n⟩). Let π be a positive probability distribution on S = ⟨⟨h⟩⟩^n (h, n ≥ 1). Set the matrices P_l, l ∈ ⟨n⟩, where
(P_l)_xy =
 0 if y ≠ x[k→l], ∀k ∈ ⟨⟨h⟩⟩,
 π_{x[k→l]} / Σ_{j∈⟨⟨h⟩⟩} π_{x[j→l]} if y = x[k→l] for some k ∈ ⟨⟨h⟩⟩,
∀l ∈ ⟨n⟩, ∀x, y ∈ S. Set P = P1 P2 ... P_n. Consider the Markov chain with state space S = ⟨⟨h⟩⟩^n (h, n ≥ 1) and transition matrix P above. This chain is called the cyclic Gibbs sampler (the Gibbs sampler for short). For labeling the rows and columns of P1, P2, ..., P_n and other things, we consider the states of S = ⟨⟨h⟩⟩^n in lexicographic order, i.e., in the order (0, 0, ..., 0), (0, 0, ..., 0, 1), ..., (0, 0, ..., 0, h), (0, 0, ..., 0, 1, 0), (0, 0, ..., 0, 1, 1), ..., (0, 0, ..., 0, 1, h), ..., (0, 0, ..., 0, h, 0), (0, 0, ..., 0, h, 1), ..., (0, 0, ..., 0, h, h), ..., (h, h, ..., h, 0), (h, h, ..., h, 1), ..., (h, h, ..., h).
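For concreteness, the following sketch (ours, not from the paper; the function name and the example distribution are made up) performs one 1-step transition of the cyclic Gibbs sampler just defined, i.e., one transition via each of P1, P2, ..., P_n: coordinate l is resampled with the probabilities π_{x[k→l]} / Σ_{j∈⟨⟨h⟩⟩} π_{x[j→l]}, k ∈ ⟨⟨h⟩⟩.

```python
# Illustrative sketch (not from the paper): one 1-step transition of the cyclic Gibbs
# sampler on S = <<h>>^n = {0,...,h}^n, i.e., one transition via each of P_1, ..., P_n.
# pi is any positive probability distribution on S, given as a function of the state.
import random

def gibbs_sweep(x, h, pi, rng=random):
    """One application of P = P_1 P_2 ... P_n starting from the state x (a tuple)."""
    x = list(x)
    n = len(x)
    for l in range(n):                                  # coordinates 1, ..., n
        # row x of P_l: weights pi(x[k -> l]) for k = 0, ..., h
        weights = [pi(tuple(x[:l] + [k] + x[l + 1:])) for k in range(h + 1)]
        x[l] = rng.choices(range(h + 1), weights=weights, k=1)[0]
    return tuple(x)

# Example: h = 1, n = 3 (S = {0,1}^3) with a made-up positive pi.
def pi(x):
    unnormalized = {(0, 0, 0): 8, (0, 0, 1): 4, (0, 1, 0): 2, (0, 1, 1): 1,
                    (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 4, (1, 1, 1): 8}
    return unnormalized[x] / 30

state = (0, 0, 0)
for _ in range(5):
    state = gibbs_sweep(state, 1, pi)
print(state)
```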

Further, we show that the Gibbs sampler on ⟨⟨h⟩⟩^n, h ≥ 1, n ≥ 1 (more generally, on ⟨⟨h1⟩⟩ × ⟨⟨h2⟩⟩ × ... × ⟨⟨h_n⟩⟩, h1, h2, ..., h_n ≥ 1, n ≥ 1), belongs to our collection of hybrid Metropolis-Hastings chains from [8] and satisfies, moreover, the conditions (c1)–(c4), excepting (c3). More precisely, we show that the Gibbs sampler on ⟨⟨h⟩⟩^n satisfies all the conditions (C1)–(C3) (basic conditions) and (c1)–(c4) (special conditions), excepting (c3), and the equations from the definition of the hybrid Metropolis-Hastings chain. To see this, following the second special case from Section 3 in [8] (there a more general framework was considered, namely, when the coordinates are grouped (blocked) into groups (blocks) of size v), set
K_{(x1,x2,...,xl)} = {(y1, y2, ..., y_n) | (y1, y2, ..., y_n) ∈ S and y_i = x_i, ∀i ∈ ⟨l⟩},
∀l ∈ ⟨n⟩, ∀x1, x2, ..., xl ∈ ⟨⟨h⟩⟩ (obviously, K_{(x1,x2,...,xn)} = {(x1, x2, ..., x_n)}), and
Δ1 = (S), Δ_{l+1} = (K_{(x1,x2,...,xl)})_{x1,x2,...,xl∈⟨⟨h⟩⟩}, ∀l ∈ ⟨n⟩.
Obviously, ({x})_{x∈S} = Δ_{n+1} ≺ Δ_n ≺ ... ≺ Δ1 = (S). Note also that the sets S, K_{(x1,x2,...,xl)}, x1, x2, ..., xl ∈ ⟨⟨h⟩⟩, l ∈ ⟨n⟩, determine, by the inclusion relation, a tree which we call the tree of inclusions. For simplification, below we give the tree of inclusions for S = ⟨⟨1⟩⟩^n (i.e., for S = {0, 1}^n).

S
K_(0)                                   K_(1)
K_(0,0)        K_(0,1)        K_(1,0)        K_(1,1)
.......................................................
K_(0,0,...,0)   K_(0,0,...,0,1)   ...   K_(1,1,...,1,0)   K_(1,1,...,1)

Following, e.g., [7, pp. 4], we define the matrices Q1, Q2, ..., Q_n as follows: Q_l = P_l, ∀l ∈ ⟨n⟩. It is easy to prove that the matrices Q_l, l ∈ ⟨n⟩, satisfy the basic conditions (C1)–(C3) from Section 1. Further, it is easy to prove that the matrices P_l and Q_l satisfy the equations
(P_l)_xy =
 0 if y ≠ x and (Q_l)_xy = 0,
 (Q_l)_xy min(1, π_y (Q_l)_yx / (π_x (Q_l)_xy)) if y ≠ x and (Q_l)_xy > 0,
 1 − Σ_{z∈S, z≠x} (P_l)_xz if y = x,
∀l ∈ ⟨n⟩, ∀x, y ∈ S. (Further, it follows that the conclusion of Theorem 1.11 holds, in particular, for P = P1 P2 ... P_n.) Therefore, the Gibbs sampler on ⟨⟨h⟩⟩^n belongs to our collection of hybrid Metropolis-Hastings chains from [8]. Now,

G method in action: from exact sampling to approximate one 433 it is easy to prove that the Gibbs sampler on h n satises, moreover, the special conditions (c)(c4), excepting (c3). Finally, we have the following result. Theorem 3.. The Gibbs sampler on h n belongs to our collection of hybrid Metropolis-Hastings chains. Moreover, this chain satises the conditions (c)(c4), excepting (c3). Proof. See above. Based on Theorem 3., it is easy now to show that the chain on h n, h, n (more generally, on h h... h n, h, h,..., h n, n ), dened below is, according to our denition, a hybrid Metropolis- Hastings chain which satises, moreover, all or part of the conditions (c)(c4). This chain on h n is a generalization of the (cyclic) Gibbs sampler on h n as follows: the matrices Q, Q,..., Q n of Gibbs sampler (see before Theorem 3.) are, more generally, replaced with the matrices Q, Q,..., Q n (we used the same notation for these) such that _ Q, _ Q,..., _ Q n of the former matrices are identical with _ Q, _ Q,..., _ Q n of the latter matrices, respectively; P = P P...P n is the transition matrix of this chain, where, using Metropolis-Hastings rule, P, P,..., P n are dened by means of the more general matrices Q, Q,..., Q n, respectively. Since we now know the structure of matrices P, P,..., P n corresponding to the update of coordinates,,..., n, respectively, we could study other types of Gibbs samplers on h n, h, n (the random Gibbs sampler, etc., see, e.g., [], [8], [5, pp. 773], and [9]), and, more generally, other types of chains on h n, h, n (a generalization of the random Gibbs sampler, etc.), derived from the generalization from the above paragraph of Gibbs sampler. Recall that R + = {x x R and x > 0}. Recall that the states of S = h n are considered in lexicographic order. Theorem 3.. Let S = h n, h, n. Let w = (h + ) t, 0 t n. Consider on S the probability distribution π = (c 0, c 0 a,..., c 0 a w, c, c a,..., c a w,..., c h, c h a,..., c h a w, c 0, c 0 a,..., c 0 a w, c, c a,..., c a w,..., c h, c h a,..., c h a w,..., c 0, c 0 a,..., c 0 a w, c, c a,..., c a w,..., c h, c h a,..., c h a w ) (the sequence c 0, c 0 a,..., c 0 a w, c, c a,..., c a w,..., c h, c h a,..., c h a w appears (h + ) n t times if 0 t < n and c 0, c 0 a,..., c 0 a w only appears if t = n), where c 0, c,..., c h, a R +. Then, for the Gibbs sampler and, when h =, for the _ optimal hybrid Metropolis chain with the matrices Q, Q,..., Q n such that Q, Q _,..., Q n are identical with Q, Q _,..., Q _ n of the Gibbs sampler on

434 Udrea Päun S = n, respectively, we have, using the same notation, P = P P...P n, for the transition matrices of these two chains, P = e π (therefore, the stationarity of these chains is attained at time ). Proof. Since (see the proof of Theorem 3.) = (S), = ( K (0), K (),..., K (h) ), 3 = ( K (0,0), K (0,),..., K (0,h), K (,0), K (,),..., K (,h),..., K (h,0), K (h,),..., K (h,h) ),. n+ = ({x}) x S, we have S = (h + ) n ( S is the cardinal of S), K (0) = K () =... = K (h) = (h + ) n, K (0,0) = K (0,) =... = K (0,h) = K (,0) = K (,) =... = K (,h) =...... = K (h,0) = K (h,) =... = K (h,h) = (h + ) n,. {x} = (h + ) 0 =, x S. Let l n. Let K l and L l+, L K. Then v, v,..., v l h such that and K = { S if l =, K (v,v,...,v l ) if l, L = K (v,v,...,v l ). Let x = (x, x,..., x n ) K. It follows that x = v, x = v,..., x l = v l (these equations vanish when l = ) and, obviously, x [v l l ] = (x, x,..., x l, v l, x l+,..., x n ) = (v, v,..., v l, v l, x l+,..., x n ) L. Note also that K = (h + ) n l+, L = (h + ) n l, and (P l ) xx[vl l ] > 0. (The reader, if he/she wishes, can use the notation (P l ) x x[vl l ] instead of (P l ) xx[vl l ].)

3 G method in action: from exact sampling to approximate one 435 First, we consider the Gibbs sampler. To compute the probabilities (P l ) xx[vl l ], we consider three cases: n l < t; n l = t; n l > t. The case n l < t is a bit more dicult. In this case, the probabilities π x corresponding to the elements x K are, keeping the order, c i a v, c i a v+,..., c i a v+(h+)n l+ for some i h and v w (h + ) n l+ + and those corresponding to the elements x L are, keeping the order, We have c i a v+v l(h+) n l, c i a v+v l(h+) n l +,..., c i a v+(v l+)(h+) n l. c i a v+v l(h+) n l +z h c i a v+s(h+)n l +z s=0 = av l(h+) n l, z h a s(h+)n l s=0 It follows that the rst ratio does not depend on z, z Moreover, it does not depend on c i and v. The others two cases are obvious. We now have (P l ) xx[vl l ] = a v l (h+)b h s=0 c vl h i=0 (h + ) n l. (h + ) n l. a s(h+)b if n l = b for some b t, c i if n l = t, h+ if n l > t. Consequently, P G,, P G, 3,..., P n G n, n+. By Theorems.6,., and 3., P = e π. Second, we consider the optimal hybrid Metropolis chain when h =. In this case, w = t, 0 t n, and π = (c 0, c 0 a,..., c 0 a w, c, c a,..., c a w, c 0, c 0 a,..., c 0 a w, c, c a,..., c a w,..., c 0, c 0 a,..., c 0 a w, c, c a,..., c a w ). As to the positions of positive entries of Q, Q,..., Q n, we have, by hypothesis, Q, Q,..., Q n such that _ Q, _ Q,..., _ Q n are identical with _ Q, _ Q,..., _ Q n of

436 Udrea Päun 4 the Gibbs sampler on S = n (see Section and Theorem 3.; see also the second special case from Section 3 in [8]), respectively. It follows that ( min a b, a b) if n l = b for some b t, ( ) f l = min c0 c, c c 0 if n l = t, if n l > t, = x l = f l + r l = f l + = f l + = ( min a b,a b) + min( c0 c, c c0 )+ if n l = b for some b t, if n l = t, if n l > t, and (cases for c 0 and c : c 0 c, c 0 > c (c 0, c R + ); cases for a : a, a > (a R + )) v l +v l a b if n l = b for some b t, +a b (P l ) xx[vl l ] = c vl c 0 +c if n l = t, if n l > t. Note that v l + v l a b = a v l b because v l. It follows that these transition probabilities are identical with those for the Gibbs sampler when h =. This is an interesting thing. See also Theorem 3.6 (the optimal hybrid Metropolis chain is not considered there because of this thing). Proceeding as in the Gibbs sampler case, it follows that P = e π. We call the probability distribution from Theorem 3. the wavy probability distribution (of rst type). Remark 3.3. As to the class of wavy distributions from Theorem 3., the Gibbs sampler is better than the optimal hybrid Metropolis chain when the latter chain has the matrices Q, Q,..., Q n such that _ Q, _ Q,..., _ Q n are identical with _ Q, _ Q,..., _ Q n of the Gibbs sampler on S = h n, respectively. For this see Theorem 3. and the following two examples. () Consider (the probability distribution) π = ( c 0, c 0 a, c 0 a, c, c a, c a, c, c a, c a )

5 G method in action: from exact sampling to approximate one 437 on S =, where c 0, c, c, a R +, c i c j, i, j. π is a wavy probability distribution. Suppose, for simplication, that c 0 < c < c. By Theorem 3., for the Gibbs sampler, we have P = e π (P = P P ). It is easy to prove, for the optimal hybrid Metropolis chain, that P e π (P = P P ; we used the same notation for matrices in both cases). () Consider π = (c, ca,..., ca w, c, ca,..., ca w,..., c, ca,..., ca w ) on S = h n, c, a R +, h, n. π is also a wavy probability distribution (the case when c 0 = c =... = c h := c). By Theorem 3., for the Gibbs sampler, we have P = e π (P = P P...P n ). It is easy to prove, for the optimal hybrid Metropolis chain when, e.g., π = ( c, ca,..., ca 8) on S =, c, a R +, a (for a =, π = the uniform probability distribution), that P e π (we also used the same notation for matrices in both cases with the only dierence that P = P P here). By Theorem 3. and Remark 3.3 it is possible that, on h n, the Gibbs sampler or a special generalization of it be the fastest chain in our collection of hybrid Metropolis-Hastings chains. The word fastest refers to Markov chains strictly, not to computers. The running time of our hybrid chains on a computer is another matter (the computational cost per step is the main problem; on a computer, a step of a Markov chain can be performed or not). Example 3.4. Consider the probability distribution π on S = 00 (i.e., on S = {0, } 00 ), where π (0,0,...,0) = d, π x = d 00, x S, x (0, 0,..., 0), where d (0, ) (e.g., d =, or d = 3 4, or d = 9 0 ). Since the sampling from S using the Gibbs sampler or optimal hybrid Metropolis chain can be intractable

438 Udrea Päun 6 (on any computer) for some d, one way is breaking of d into many pieces. For this we consider the probability distribution ρ on S = 0, where ρ (0,x) = d 00, ρ (,x) = ( d) 00 00 00, x S ((0, x) and (, x) are vectors from S ). Since ρ (0,x) = π x, x S, x (0, 0,..., 0), it follows that the sampling from S can be performed via the sampling from S. Indeed, letting X be a random variable with the probability distribution π, if, using ρ (on S ), we select a value equal to (0, u) for some u S, u (0, 0,..., 0) S, then we set X = u this value of X is selected according to its probability distribution π on S while, if, using ρ too, we select a value equal to (0, 0,..., 0) S or (, v) for some v S, then we set X = (0, 0,..., 0) (obviously, (0, 0,..., 0) S) this value of X is also selected according to its probability distribution. By Theorem 3. the Gibbs sampler and optimal hybrid Metropolis chain sample exactly (not approximately) from S (equipped with ρ); this implies that the sampling from S (equipped with π) is also exactly. The wavy probability distribution(s) from Theorem 3. has (have) something in common with the geometric distribution. This fact suggests the next application. Application 3.5. To generate, exactly, a random variable with geometric distribution ( p, pq, pq,... ), p, q (0, ), q = p, we can proceed as follows (see, e.g., also [4, p. 500]). We split the geometric distribution into two parts, a tail carrying small probability and a main body of size n, where n, n 0, is suitably chosen. The main body contains the rst n values of geometric distribution and determines the probability distribution where π = ( Zp, Zpq, Zpq,...Zpq n ), Z = q n. We choose the main body with the probability q n (= p + pq +... + pq n ) and the tail with the probability q n. (See also discrete mixtures in, e.g., [4, p. 6], the decomposition method in, e.g., [4, p. 66], and the composition method in, e.g., [4, p. 66].) If the output of choice is the main body, then we can sample exactly (not approximately) from {,,..., n } (equipped with the probability distribution π), using the Gibbs sampler or optimal hybrid Metropolis chain when n, see Theorem 3. (the stationarity is attained at time for each of these chains). Obviously, to use the former or latter chain, we

7 G method in action: from exact sampling to approximate one 439 need another distribution, µ we replace the probability distribution π = (π i ) on {,,..., n }, π = Zp, π = Zpq,..., π n = Zpq n, with µ = (µ i ) on n, µ (0,0,...,0) = π, µ (0,0,...,0,) = π, µ (0,0,...,0,,0) = π 3, µ (0,0,...,0,,) = π 4,..., µ (,,...,,0) = π n, µ (,,...,) = π n. Otherwise, i.e., if the output of choice is the tail, we can proceed as follows. Supposing that X is a random variable with the geometric distribution above, i.e., ( p, pq, pq,... ), then, due to the lack-of-memory property of X, X n (X > n here) is a random variable with the same geometric distribution as X, i.e., ( p, pq, pq,... ). Therefore, further, we can work with X n and its probability distribution ( p, pq, pq,... ) (we again split this distribution into two parts, a main body and a tail,...), etc. The case when all the main bodies are of size ( 0 = ) is well-known, see, e.g., [4, p. 498]; we here gave a generalization of this case by Gibbs sampler or optimal hybrid Metropolis chain. The next result says that, sometimes, the Gibbs sampler (in some cases even the optimal hybrid Metropolis chain, see Theorem 3. and its proof and the next result) is identical with its reference chain. Theorem 3.6. Consider on S = h n, h, n, the wavy probability distribution π from Theorem 3.. Consider on S (equipped with π) the Gibbs sampler with transition matrix P = P P...P n and its reference chain with transition matrix R = R R...R n (see Remark.5). Then (P l ) xy = a (l) K,L, l n, K l, L l+ with L K, x K, y L with (P l ) xy > 0 ( l, l n +, are the partitions determined by the Gibbs sampler, see the proof of Theorem 3.), and _ P = R. If, moreover, R l = P _ l, l n, then P l = R l, l n. (Therefore, under all the above conditions, the Gibbs sampler is identical with its reference chain, leaving the initial probability distribution aside.) Proof. First, we show that (P l ) xy = a (l) K,L, l n, K l, L l+ with L K, x K, y L with (P l ) xy > 0. For the Gibbs sampler on S = h n, in the proof of Theorem 3., it was shown that a v l (h+)b if n l = b for some b t, h a s(h+)b s=0 (P l ) xx[vl l ] = c vl if n l = t, h c i i=0 h+ if n l > t.

440 Udrea Päun 8 Recall that a (l) K,L = π x x L x K π x, l n, K l, L l+, L K, see the reference method in Section. Recall that K = (h + ) n l+ and L = (h + ) n l, see the proof of Theorem 3.. Case. n l = b for some b t. By proof of Theorem 3. we have Since and a (l) K,L = = we have π x x L x K (h+) b w =0 π x = (h+) b w =0 (h+) b+ w =0 c i a v+v l(h+) b +w c i a v+w = (h+) b w =0 (h+) b+ w =0 a v l(h+) b +w a w a v l(h+) b +w = a v l(h+) b ( + a +... + a (h+)b ) (h+) b+ w =0 a w = ( + a +... + a (h+)b ) + ( ) + a (h+)b + a (h+)b + +... + a (h+)b +... ( )... + a h(h+)b + a h(h+)b + +... + a (h+)b+ = ( ) ( ) + a +... + a (h+)b + a (h+)b + a +... + a (h+)b +... ( )... + a h(h+)b + a +... + a (h+)b = ( ) ( = + a +... + a (h+)b + a (h+)b +... + a h(h+)b), a (l) K,L = av b l(h+). h a s(h+)b s=0 Case. n l = t. By denition of π (see Theorem 3.) and proof of Theorem 3. we have π x = c vl + c vl a +... + c vl a w = c vl ( + a +... + a w ) x L.

9 G method in action: from exact sampling to approximate one 44 and π x = (c 0 + c 0 a +... + c 0 a w ) + (c + c a +... + c a w ) +... x K Consequently,... + (c h + c h a +... + c h a w ) = ( + a +... + a w ) a (l) K,L = c v l. h c i i=0 h c i. Case 3. n l > t. By denition of π and proof of Theorem 3., setting it is easy to see that Consequently, σ L = x L π x, π x = (h + ) σ L. x K a (l) K,L = h +. From Cases 3, we have (P l ) xx[vl l ] = a (l) K,L. Therefore, (P l) xy = a (l) K,L (y = x [v l l ] for some v l h ), l n, K l, L l+ with L K, x K, y L with (P l ) xy > 0. By Theorem.6 and above result we have P = e π. By Theorem.4(iv), R = e π. Therefore, P = R. The other part of conclusion is obvious. In [8], we modied our hybrid (Metropolis-Hastings) chains such that the modied hybrid chains have better upper bounds for p n π (see Theorem.; see also Theorems.8 and.9). Below we present this modication. If P = P P...P t (see Section ) is the transition matrix of a hybrid Metropolis-Hastings chain, we replace the product P s+ P s+ (...P t ( s < t) by the block diagonal s+ -stable matrix (recall that l = K (l),..., K(l), l t +, see Section ) P = P (s) = A (s+) A (s+)... i=0 A (s+) u s+, K(l), u l )