Stein s method, Palm theory and Poisson process approximation

Size: px

Start display at page:

Download "Stein s method, Palm theory and Poisson process approximation"

Cornelius Fleming
5 years ago
Views:

1 Stein s method, Palm theory and Poisson process approximation Louis H. Y. Chen and Aihua Xia National University of Singapore and University of Melbourne 3 November, 2003 (First version: 24 June, 2002 Abstract The framework of Stein s method for Poisson process approximation is presented from the point of view of Palm theory, which is used to construct Stein identities and define local dependence. A general result (Theorem 2.3 in Poisson process approximation is proved by taking the local approach. It is obtained without reference to any particular metric thereby allowing wider applicability. A Wasserstein pseudo-metric is introduced for measuring the accuracy of point process approximation. The pseudo-metric provides a generalization of many metrics used so far, including the total variation distance for random variables and the Wasserstein metric for processes as in Barbour and Brown (992. Also, through the pseudo-metric, approximation for certain point processes on a given carrier space is carried out by lifting it to one on a larger space, extending an idea of Arratia, Goldstein and Gordon (990. The error bound in the general result is similar in form to that for Poisson approximation. As it yields the Stein factor /λ as in Poisson approximation, it provides good approximation, particularly in cases where λ is large. The general result is applied to a number of problems including Poisson process modeling of rare words in a DNA sequence. Key words and phrases. Stein s method, point process, Poisson process approximation, Palm process, Wasserstein pseudo-metric, local approach, local dependence. Postal address: Institute for Mathematical Sciences, National University of Singapore, 3 Prince George s Park, Singapore 8402, Republic of Singapore. address: lhychen@ims.nus.edu.sg Postal address: Department of Mathematics and Statistics, the University of Melbourne, VIC 300, Australia. address: xia@ms.unimelb.edu.au

2 2 L. H. Y. Chen and A. Xia Introduction Poisson approximation was developed by Chen (975 as a discrete version of Stein s normal approximation (972. It involves the solution of a first order difference equation, which we call Stein equation. In extending Poisson approximation to higher dimensions and to Poisson process approximation, Barbour (988 converted the first order difference equation into a second order difference equation and solved it in terms of an immigration-death process. This work was further extended by Barbour and Brown (992, who introduced a Wasserstein metric on point processes and initiated a program to obtain error bounds of similar order to that on the total variation distance in Poisson approximation. This has been achieved for some special cases by Xia (997, 2000, and a general result with error bounds of the desired order has been obtained by Brown, Weinberg and Xia (2000. In this paper, another general result on Poisson process approximation is proved by taking the local approach. It is obtained without reference to any particular metric thereby allowing wider applicability. In proving this result, the framework of Stein s method is first presented from the point of view of Palm theory, which is used to construct Stein identities and define local dependence. Although the connection between Stein s method and Palm theory has been known to others [see for example Barbour and Månsson (2002], little of it has been exploited. In applying the general result, a Wasserstein pseudo-metric is introduced for measuring the accuracy of point process approximation. The pseudo-metric provides a generalization of many metrics used so far, including the total variation distance for random variables and the Wasserstein metric for processes as in Barbour and Brown (992. Also, through the pseudo-metric, approximation for certain point processes on a given carrier space is carried out by lifting it to one on a larger space, extending an idea of Arratia, Goldstein and Gordon (990, Section 3., which was refined by Chen (998, Section 5. The error bound in the general result is similar in form to that for Poisson approximation [see for example Arratia, Goldstein and Gordon (989, Theorem ]. It is simpler and easier to apply than that in Brown, Weinberg and Xia (2000. As it yields the Stein factor /λ as in Poisson approximation, it provides good approximation, particularly in cases where λ is large. The general result is applied to prove approximation theorems for Matérn hardcore processes and for marked dependent trials. The latter is in turn applied to the classical occupancy problem and rare words in biomolecular sequences. The last application, in fact this paper, is motivated by an interest in modeling the distribution of palindromes in a herpesvirus genome by a Poisson process. In Leung, Choi, Xia

3 Poisson process approximation 3 and Chen (2002, the Poisson process model is used to provide a mathematical basis for using r-scans in determining nonrandom clusters of palindromes in herpesvirus genomes [see also Leung and Yamashita (999]. 2 From Palm theory to Stein s method Let be a fixed locally compact second countable Hausdorff topological space. Such a space is also a Polish space, i.e. a space for which there exists a separable and complete metric in which generates the topology. Define H to be the space of non-negative integer-valued locally finite measures on, and let B be the smallest σ-algebra in H making the mappings ξ ξ(c measurable for all relatively compact Borel sets C. Recall that a point process on is a measurable mapping of some fixed probability space into (H, B [Kallenberg (983, page 5]. For a point process Ξ on with locally finite mean measure λ, the point process Ξ α is said to be a Palm process associated with Ξ at α if for any measurable function f : H IR + := [0,, ( ( f(α, ΞΞ(dα = f(α, Ξ α λ(dα, (2. [Kallenberg (983, Chapter 0]. Intuitively, IP(Ξ α B = [Ξ(dα; Ξ B], for all B B. Ξ(dα An important characterization of Poisson process in the language of Palm theory is that Ξ is a Poisson process if and only if L(Ξ α = L(Ξ + δ α λ-a.s., where δ α is the Dirac measure at α. This highlights an idea of Poisson process approximation: if we define Df(ξ := f(x, ξξ(dx f(x, ξ + δ x λ(dx, then L(Ξ is close to the Poisson process distribution over with mean measure λ, denoted as Po(λ, in terms of a certain metric if for the set of suitable corresponding test functions f : H IR := (,, Df(Ξ 0. (2.2 In other words, for a function g : H IR, if we can find a solution f g to the equation g(ξ Po(λ(g = Df(ξ, (2.3

4 4 L. H. Y. Chen and A. Xia then the distance between the distribution of Ξ and Po(λ is achieved by the supremum of Df g (Ξ over the class of g which defines the metric. Equation (2.3 is known as a Stein equation. If there exists a function h : H IR such that f(x, ξ = h(ξ δ x h(ξ, then Df(ξ = [h(ξ + δ x h(ξ]λ(dx + [h(ξ δ x h(ξ]ξ(dx := Ah(ξ. It is known that A is the generator of an H-valued immigration-death process Z ξ (t with immigration intensity λ and unit per capita death rate, where Z ξ (0 = ξ. This fact was noted by Barbour (988 who developed a probabilistic approach to Stein s method for multivariate Poisson and Poisson process approximations. The equilibrium distribution of Z ξ is a Poisson process with mean measure λ. The idea of introducing a Markov point process is to exploit the probabilistic properties of the Markov process for obtaining bounds on the metrics of interest [see Barbour and Brown (992 and Brown and Xia (200]. For ξ H and a Borel set B, we define ξ B as the restriction of ξ to B, i.e., ξ B (C = ξ(b C for Borel sets C. Let Ξ be a point process on with Palm processes {Ξ α }. Assume that, for each α, there is a Borel set A α such that α A α and the mapping H H : (α, ξ (α, ξ (α (2.4 is product measurable, where ξ (α := ξ A c α. Note that ξ (α does not refer to the Palm measure. As the measurability of (2.4 is often hard to check, we give a sufficient condition for (2.4 to hold: A = {(x, y : y A x, x } is a measurable set of the product space 2 :=. We give a brief proof for the sufficiency. By the monotone class theorem, it suffices to show that the mapping M A (α, ξ := (α, ξ (α is measurable for rectangular sets A = B B 2, where B and B 2 are measurable subsets of. Indeed, { (x, ξ B c M B B 2 (x, ξ = if x B 2, (x, ξ if x B is measurable. The requirement of A being measurable in 2 is almost necessary. To see this, let = [0, ], A = B B 2, where B is not Borel measurable [Nielsen (997, p.28, 9.6 (h] and B 2 is a Borel set. Define C = {ξ : ξ(b 2 0}, then M A ( C = Bc C is not a measurable set of H. Remark 2. In Barbour and Brown (992, page 5, it is proved that if A α is a ball of fixed radius, then the mapping in (2.4 is measurable.

5 Poisson process approximation 5 We define Ξ to be locally dependent with neighborhoods (A α ; α if L ( (Ξ α (α = L ( Ξ (α λ a.s. Lemma 2.2 The following statements are equivalent: (a f(α, Ξ(α + δ α Ξ(dα = f(α, Ξ(α + δ α λ(dα for all measurable f : H IR +. (b L ( (Ξ α (α = L ( Ξ (α λ-a.s. Proof. By the definition of Palm process, we have f(α, Ξ (α + δ α Ξ(dα = f(α, (Ξ α (α + δ α λ(dα. (2.5 Hence, (b implies (a. Now assume (a. With the vague topology, H is a Polish space [see Kallenberg (983, page 95], so there exists a sequence of bounded uniformly continuous functions (f j ; j on H which form a determining class [Billingsley (968, page 5]: for every two probability measures Q and Q 2 on H, if f j dq = f j dq 2 for all j, then Q = Q 2 [see Parthasarathy (967, Theorem 6.6]. By taking f(α, ξ + δ α = k(αf j (ξ, it follows from (2.5 that k(α[f j (Ξ (α ]λ(dα = k(α[f j ((Ξ α (α ]λ(dα for all bounded measurable functions k : IR + and f j. Fixing f j and allowing k to vary, we have f j (Ξ (α = f j ((Ξ α (α λ-a.s. Now vary f j and (b follows. In general, a point process is not necessarily locally dependent, but Lemma 2.2 suggests that, in a loose sense, if and only if f(α, Ξ (α + δ α Ξ(dα L ( (Ξ α (α L ( Ξ (α λ a.s. (2.6 f(α, Ξ (α + δ α λ(dα for suitable f. (2.7 This will be our guiding principle in proving Theorem 2.3 using the local approach, as follows (an extension of the approach of Chen (975 which was elaborated by Barbour and Brown (992: f(α, ΞΞ(dα = [f(α, Ξ f(α, Ξ (α + δ α ]Ξ(dα + f(α, Ξ (α + δ α [Ξ(dα λ(dα] + [f(α, Ξ (α + δ α f(α, Ξ + δ α ]λ(dα + f(α, Ξ + δ α λ(dα

6 6 L. H. Y. Chen and A. Xia which implies Df(Ξ = [f(α, Ξ f(α, Ξ (α + δ α ]Ξ(dα + f(α, Ξ (α + δ α [Ξ(dα λ(dα] + [f(α, Ξ (α + δ α f(α, Ξ + δ α ]λ(dα. (2.8 Hence, a bound on Df g (Ξ can be obtained by bounding the right hand side of (2.8. There are two ways to handle the second term in (2.8: one uses coupling and the other involves Janossy densities [Janossy (950, Daley and Vere-Jones (988]. For a finite point process Ξ, that is IP( Ξ < =, there exists measures (J n n such that for measurable functions f : H IR +, f(ξ = n 0 n n! f ( n δ xi J n (dx,..., dx n. i= The term (n! J n (dx,..., dx n can be intuitively explained as the probability of Ξ having n points and these points being located near (x,..., x n. The measures (J n n are called Janossy measures by Srinivasan (969. Suppose there is a reference measure ν on such that for each n, J n is absolutely continuous with respect to ν n. Then, by the Radon-Nikodym theorem, the derivatives j n of J n with respect to ν n exist, so that f(ξ = n 0 n n! f ( n δ xi j n (x,..., x n ν n (dx,..., dx n. i= The derivatives (j n n are called Janossy densities. The density of the mean measure λ of a finite point process Ξ with respect to ν can be expressed by its Janossy densities (j n n as φ(x = m 0 m m! j m+(x, x,..., x m ν m (dx,..., dx m, where the term with m = 0 is interpreted as j (x [Daley and Vere-Jones (988, page 33].

7 Poisson process approximation 7 When the point process is simple, the Janossy densities can also be used to describe the conditional probability density of a point being at α, given the configuration Ξ (α of Ξ outside A α. More precisely, let m N be fixed and β = (β,..., β m (A c α m, define G(α, β := r 0 A r α s 0 A s α j m+r+ (α, β, γ(r! ν r (dγ, (2.9 j m+s (β, η(s! ν s (dη where the term with r = 0 is interpreted as j m+ (α, β and the term with s = 0 as j m (β. Then G(α, β is the conditional density of a point being near α given that Ξ (α is m i= δ β i. Direct verification gives that, for any bounded measurable function f over H, ( f(α, Ξ (α Ξ(dα ( = f(α, Ξ (α G(α, Ξ (α ν(dα. (2.0 For each f : H IR +, ξ H, write ξ(a x = m and define (δf(x, ξ = sup max {z,...,z m} 0 j m ( f x, ξ (x + δ x + ( j+ j δ zi f x, ξ (x + δ x + δ zi, where the right hand side is interpreted as 0 if m = 0. Combining (2.3 and (2.8 gives Theorem 2.3 For each bounded measurable function g : H IR +, i= g(ξ Po(λ(g (δf g (α, Ξ(Ξ(A α Ξ(dα + min{ɛ (g, Ξ, ɛ 2 (g, Ξ} α + (δf g (α, Ξλ(dαΞ(A α, (2. α where ɛ (g, Ξ = f g (α, Ξ (α + δ α G(α, Ξ (α φ(α ν(dα (2.2 α which is valid if Ξ is a simple point process, and ɛ 2 (g, Ξ = f g (α, Ξ (α + δ α f g (α, (Ξ α (α + δ α λ(dα. (2.3 α Remark 2.4 How judicious (A α ; α are chosen is reflected in the upper bound in (2., and (2.3 suggests that (A α ; α should normally be chosen such that (2.6 holds. i=

8 8 L. H. Y. Chen and A. Xia 3 Poisson process approximation in Wasserstein pseudo-metric We now look at special test functions g which define metrics of our interest. We begin with a pseudo-metric ϱ 0 on bounded by [cf Barbour and Brown (992]. In order for Theorem 2.3 to be applicable, we assume that the topology generated by ϱ 0 is weaker than the given topology of. Let K stand for the set of ϱ 0 -Lipschitz functions k : [, ] such that k(α k(β ϱ 0 (α, β for all α, β. The first Wasserstein pseudo-metric ϱ is defined on H by ϱ (ξ, ξ 2 = {, if ξ ξ 2, ξ sup k K kdξ kdξ 2, if ξ = ξ 2 > 0, where ξ i is the total mass of ξ i. A pseudo-metric ϱ equivalent to ϱ can be defined as follows [cf Brown and Xia (995]: for two configurations ξ = n i= δ y i and ξ 2 = m i= δ z i with m n, n ϱ (ξ, ξ 2 = min ϱ 0 (y i, z π(i + (m n, π i= where π ranges all permutations of (,..., m. Let F denote the set of ϱ -Lipschitz functions on H such that f(ξ f(ξ 2 ϱ (ξ, ξ 2 for all ξ and ξ 2 H. The second Wasserstein pseudo-metric is defined on probability measures on H with respect to ϱ by ϱ 2 (Q, Q 2 = sup fdq fdq 2. f F The use of a pseudo-metric ϱ 0 provides not only generality but also wider applicability. For example, if we choose ϱ 0 (x, y 0, then ϱ 2 (Q, Q 2 = d T V (L( X, L( X 2, the total variation distance between L( X and L( X 2, where X i has distribution Q i, i =, 2. It is known that, for g = B with B Z + := {0,, 2,...}, (δf g (x, ξ e λ 2, f g λ eλ, where, and throughout this paper, λ is the total mass of λ and is assumed to be finite [see Barbour, Holst and Janson (992, Brown and Xia (200]. So Theorem 2.3 gives

9 Poisson process approximation 9 Theorem 3. We have d T V (L(Ξ(, Po(λ e λ λ where 2 ɛ = eλ which is valid for Ξ simple, and ɛ 2 = e λ λ + e λ λ α α α α (Ξ(A α Ξ(dα + min{ɛ, ɛ 2 } λ(a α λ(dα, G(α, Ξ (α φ(α ν(dα, Ξ (α (Ξ α (α λ(dα. Theorem 3. with ɛ is a generalisation of Chen (975 [see also Barbour and Brown (992] and with ɛ 2 allows the use of the coupling approach [see Barbour and Brown (992]. Another example is in Section 4, where it is possible to introduce an index space so that the results also include the approximation in distribution by a Poisson process to discrete sums of the form n i= X iδ Yi, where Y i is a random mark associated with X i, as in Arratia, Goldstein and Gordon (989. We now establish a general statement of this section. As the arguments in Barbour and Brown (992 and Brown and Xia (200 never rely on the property that ϱ 0 (x, y = 0 implies x = y, the results are still valid for ϱ 0 and the pseudometrics ϱ and ϱ 2 generated from ϱ 0. The following two lemmas are taken from Barbour and Brown (992 and Brown and Xia (200. Lemma 3.2 For each ϱ -Lipschitz function g F, x, y and ξ H with ξ = n, the solution f g of (2.3 satisfies f g (x, ξ + δ x + δ y f g (x, ξ + δ x 5 λ + 3 n +, (3. f g (y, ξ + δ y.65λ /2. (3.2 Lemma 3.3 For each g F, ξ, η H and x, f g (x, ξ + δ x f g (x, η + δ x 2 η ξ + [ϱ (ξ, η η ξ ] + ( 5 λ + 3 ϱ η ξ + (ξ, η. ( 5 λ + 3 η ξ η ξ +

10 0 L. H. Y. Chen and A. Xia With the above two lemmas, we write another version of Theorem 2.3. Theorem 3.4 We have ϱ 2 (LΞ, Po(λ + α ( 5 λ + 3 (Ξ(A Ξ (α α Ξ(dα + min{ɛ, ɛ 2 } + ( 5 λ + 3 λ(dαλ(dβ, (3.3 (Ξ β (α + α β A α where ɛ = ( (.65λ /2 ɛ 2 = α α ( 5 λ + 3 (Ξ α (α Ξ (α + G(α, Ξ (α φ(α ν(dα, (3.4 ϱ ((Ξ α (α, Ξ (α λ(dα. (3.5 In many applications, we can obtain the Stein factor /λ from the terms ( Ξ (α +, ( (Ξα (α Ξ (α + and ( (Ξβ (α + by applying Lemma 3.5 below. Lemma 3.5 [Brown, Weinberg and Xia (2000, Lemma 3.] For a random variable X, ( κ( + κ/4 + + κ/2, X (X where κ = Var(X/(X. Corollary 3.6 If Ξ is a locally dependent point process with neighborhoods (A α ; α, then ϱ 2 (LΞ, Po(λ + where ξ (αβ = ξ A c α A c β. α α ( 5 λ + 3 (Ξ(A Ξ (α α Ξ(dα + ( 5 λ + 3 λ(dαλ(dβ, (3.6 Ξ (αβ + β A α

11 Poisson process approximation Remark 3.7 Since Ξ(A α Ξ (α + Ξ(dα = α α α Ξ (α + Ξ(dβΞ(dα β A α β A α Ξ (αβ + Ξ(dβΞ(dα, to simplify the first term of (3.6 using the assumption of local dependence, it is tempting to ask whether Ξ (αβ + Ξ(dβΞ(dα = Ξ (αβ + Ξ(dβΞ(dα. The answer is generally negative, although it might be true in many applications, as shown in Section 5. To see this, let IP(B i = q = 0. for i =, 2, 3, IP(B i B j = q 2 for i j 3 and IP(B B 2 B 3 = 2q 3. Set = {, 2, 3}, Ξ({i} = Bi, i 3 and A = A 2 = {, 2} and A 3 = {, 3}, then Ξ is locally dependent with neighborhoods (A i ; i. However, direct calculation gives Ξ({3} + Ξ({}Ξ({2} = q2 q 3 and so Ξ({3} + Ξ({}Ξ({2} = ( 0.5qq2, Ξ({3} + Ξ({}Ξ({2} Ξ({3} + Ξ({}Ξ({2}. 4 Sums of marked dependent trials The case of Poisson process approximation for sums of marked dependent trials is of particular interest as it has applications in computational biology, occupancy and random graphs. We devote this section to this case. Let I i, i I be dependent indicators with I a finite or infinitely countable index space and IP(I i = = IP(I i = 0 = p i, i I. Let U i, i I be S-valued independent random elements, where S is a locally compact second countable Hausdorff space with metric d 0 bounded by. Assume that {U i, i I} is independent of {I i, i I}. Our interest is to approximate the distribution of M := i I I iδ Ui by that of a Poisson process.

12 2 L. H. Y. Chen and A. Xia Let H(S be the space of non-negative integer-valued locally finite measures on S. The metric d 0 will generate the first Wasserstein metric d on H(S and second Wasserstein metric d 2 on probability measures on H(S as in Section 3 [see also Barbour and Brown (992]. For each i I, let A i I such that i A i. Let µ i = L(U i, the law of U i, i I, and let λ = i I p iµ i. Define V i = j A i I j. Theorem 4. We have λ = i I p i and where d 2 (LM, Po(λ i I + i I j A i \{i} j A i ( 5 λ + 3 ( 5 λ + I i I j + min{ɛ, ɛ 2 } V i + [ ] 3 V i + I j = p i p j, (4. ɛ = (.65λ /2 (I i I j, j A i p i, i I ɛ 2 = ( 5 λ + 3 V i J ji I j p i, i I j A i J ji + j A i and (J ji ; j I and (I j ; j I are defined on the same probability space with L(J ji ; j I = L(I j ; j I I i =. Remark 4.2 The bound in (4. does not depend on the distribution of the marks (U i i I, since the mean measure of the approximating Poisson process has been chosen to reflect the contribution of the marks. Remark 4.3 Since M is in general not a simple point process, the Janossy density approach via (2.9 is not applicable. Also, due to the structure of M, the neighborhoods {A α, α S} cannot be determined. By introducing a pseudo-metric and by lifting the process M from S to a larger carrier space = S I, the lifted process becomes simple and the neighborhoods {A α, α } determinable. Proof of Theorem 4.. We consider the approximation on the lifted space = S I with pseudo-metric ϱ 0 ((s, i, (t, j = d 0 (s, t. For each ξ l H( (l means lifted, define ξ H(S by ξ(ds = i I ξ l(ds, {i}. Let M l (ds, {i} = I i δ Ui (ds and

13 Poisson process approximation 3 let λ l (ds, {i} = p i µ i (ds. Then M l is a simple point process on, M(ds = i I M l(ds, {i}, λ(ds = i I λ l(ds, {i}, and ϱ 2 (LM l, Po(λ l = d 2 (LM, Po(λ. For each (s, i, define A (s,i := S A i. Then M ((s,i l = V i. The first term in the upper bound of (3.3 becomes ( 5 (s,i λ + 3 (Ml (A (s,i I i δ Ui V i + (ds = ( 5 λ + 3 ( I j I i, V i + i I j A i which gives the first term of the bound (4.. Refering to (3.4, if we take the reference measure ν(ds, {i} = µ i (ds, then φ((s, i = p i and for i,..., i k I, where i,..., i k are all different, where j k ((s, i,..., (s k, i k = IP(C i,...,i k, C i,...,i k := {I l = for l = i,..., i k and I l = 0 for l i,..., i k }. For α = (s, i, β = ((s, i,..., (s k, i k (A c (s,i k, the numerator of (2.9 becomes r 0 r! {j,...,j r} A i \{i} IP(C i,i,...,i k,j,...,j r = IP(I j = for j = i, i,..., i k and I j = 0 for j A c i\{i,..., i k }; and the denominator of (2.9 is reduced to It follows that IP(C i,...,i r! k,j,...,j r r 0 {j,...,j r} A i = IP(I j = for j = i,..., i k and I j = 0 for j A c i\{i,..., i k }. G((s, i, ((s, i,..., (s k, i k = IP(I i = I j = for j = i,..., i k and I j = 0 for j A c i\{i,..., i k }. Therefore, G((s, i, M ((s,i l = (I i I j ; j A i,

14 4 L. H. Y. Chen and A. Xia and (s,i G((s, i, M ((s,i l φ((s, i ν(ds, {i} = (I i I j, j A i p i, i I which gives ɛ of Theorem 4.. On the other hand, in view of ɛ 2 in (3.5, we can write the Palm process associated with M l at (s, i as J ji δ Uj (dt, if j i, M (s,i (dt, {j} = 0, if j = i and t s, δ t (dt, if j = i and t = s. With this coupling, we have (M(s,i ((s,i = j A i J ji and M ((s,i l = V i. So, (M(s,i ((s,i M ((s,i l + = V i j A i J ji +, and ϱ ((M (s,i ((s,i, M ((s,i l j A i J ji I j, which yields ɛ 2 of Theorem 4.. Finally, since λ(ds, {i}λ(dt, {j} = p i p j µ i (dsµ j (dt, the last term of (4. follows from the last term of (3.3. [ ] [ ] Bounds on V i + I j = and V i + I j = I i = may be obtained by applying Lemma 3.5. Sharper bounds can be achieved if additional information about the relationship of I i s is available, e.g., if I i s are independent. Remark 4.4 If I i, i I are locally dependent with neighborhoods (A i ; i I, then [ ] [ ] V i + I j =, V ij + where V ij = k A i A j I k. Random indicators (I j ; i I are said to be negatively related (resp. positively related if, for each i, (J ji, j I can be constructed in such a way that J ji (resp. I j for j I, j i [see Barbour, Holst and Janson (992, page 24].

15 Poisson process approximation 5 Proposition 4.5 Suppose (I j ; j I are negatively related, let λ = i I I i, then e λ i I I. i + λ Proof. Indeed, since (I j ; j I are negatively related, for decreasing function Φ, Φ I i I j = Φ I i I j = 0, i I\{j} ( so for fixed 0 < z <, function in I j, giving z i I I i = [ ( z i I\{j} I i [ ( z i I\{j} I i i I\{j} I j is increasing in I j and z I j is a decreasing z ] i I\{j} I i I j z I j ] I j [ z ] ( I j = z i I\{j} I i z I j, [see Liggett (985, page 78]. Since I is a finite or infinitely countable index set, by mathematical induction, z i I I i Π i I z I i. Hence i I I i + = 0 0 z i I I i dz Π i I e pi( z dz = e λ. λ 0 Π i I ( p i + p i zdz Corollary 4.6 With the same setup as in Theorem 4., suppose (I j ; j I are negatively related, then d 2 (LM, Po(λ ( [ ] 5 λ + 3 i I j i J p 2 i + p i [I j J ji ]. (4.2 ji + Proof. By Theorem 4. with A i = {i} and ɛ 2, the first term of (4. vanishes and the last two terms of (4. can be rewritten as (4.2. As we need to bound [(V i + I i = I j = ], it is relevant to ask whether (J ki, k I are also negatively (resp. positively related if (I j ; j I are negatively (resp. positively related. The answer is generally negative as the following counterexample shows. j i

16 6 L. H. Y. Chen and A. Xia Counterexample 4.7 Choose 4 sets B i, i 4 so that IP(B i = q, IP(B i B j = bq 2, IP(B i B j B k = bq 3, for all different i, j, k 4; and IP(B B 2 B 3 B 4 = bq 4 with b 2 and q sufficiently small (e.g., 0.0 so that the sets are properly defined. Set I i = Bi. Then for any increasing function Φ on {0, } 3 [see page 27, Barbour, Holst and Janson (992], we have [Φ(I, I 2, I 3 I 4 = ] Φ(I, I 2, I 3 = q(b [Φ(, 0, 0 + Φ(0,, 0 + Φ(0, 0, 3Φ(0, 0, 0]. Hence, by Theorem 2.D of Barbour, Holst and Janson (992, if we choose b > (resp. <, then (I i ; i 4 are positively (resp. negatively related. But IP(J 3 = J 4 = J 2 = = IP(I 3 = I 4 = I = I 2 = = q 2 and so IP(J 3 = J 4 = = IP(I 3 = I 4 = I = = bq 2, IP(J 3 = J 4 = J 2 = < (resp. >IP(J 3 = J 4 =, which implies that (J k, k =,..., 4 are not positively (resp. negatively related. 5 Applications In this section, we apply the main results in sections 3 and 4 to the Matérn hardcore process, an occupancy problem and rare words in DNA sequences, all of which are different in nature. The results in section 4 can also be applied to random graphs, for example to the isolated vertices resulting from the deletion with small probability of each of the edges of a connected graph, where the resulting isolated vertices may remain in their original positions or may be distributed independently and randomly in a carrier space. Since this random graph problem is similar in nature to that of rare words in DNA sequences, it will not be discussed further in this section. A special case of this problem which involves counting the number of isolated vertices has been considered by Roos (994 and Eichelsbacher and Roos ( Matérn hard core process Consider a Poisson number, with mean µ, of points placed independently and uniformly at random in, where is a compact subset of R d with volume V ( 0. A Matérn hard core process Ξ is produced by deleting any point within distance r of

17 Poisson process approximation 7 another point, irrespective of whether the latter point has itself already been deleted [see Cox and Isham (980, page 70]. More precisely, let {α n} be a realization of points of the Poisson process. Then the points deleted are {α n} = {x {α n} : x y < r for some y x, y {α n}} and {α n } := {α n}\{α n} constitutes a realization of the Matérn hard core process Ξ [see Daley and Vere-Jones (988]. The Matérn hard core process is one of the hard core processes introduced in statistical mechanics to model the distribution of particles with repulsive interactions [see Ruelle (969, page 6]. It is a special case of the distance models [see Matérn (986, page 37] and is also a model for underdispersion [see Daley and Vere-Jones (988, page 366]. Let X, X 2,... be independent uniform random variables on, and let N be a Poisson random variable with mean µ and independent of {X i ; i }. Then the Poisson process for the arrival points in is Z = N i= δ X i. Let B(x, r = {y : 0 < d 0 (y, x < r}, the r-neighborhood of x, where d 0 (x, y = x y. Then the Matérn hard core process Ξ can be written as Ξ = N i= δ X i {Z(B(Xi,r=0}. Also, Ξ(dα = N δ Xi (dα {Z(B(Xi,r=0} = {Z(B(α,r=0} Z(dα. i= Let κ d be the volume of the unit ball in R d and d 2 be the second Wasserstein metric generated from d 0 as in Section 3. Theorem 5. The mean measure of Ξ is λ(dα = e µv (α,r/v ( µv ( dα, and d 2 (LΞ, Po(λ 0ϑ + 6ϑ[3 + ( e 2 dϑ ϑ]/( + ( 2ϑ/λ, where V (α, r is the volume of B(α, r and ϑ = µκ d (2r d /V (. Proof. The Poisson property of Z implies that the counts of points in disjoint sets are independent. So λ(dα = (Ξ(dα = {Z(B(α,r=0} Z(dα = e µv (α,r/v ( µv ( dα. Also, whether a point outside B(α, 2r {α} is deleted or not is independent of the behavior of Z in B(α, r {α}. Hence, we choose A α = B(α, 2r {α} so that Ξ is locally dependent with neighborhoods (A α ; α and Ξ (αβ + Ξ(dαΞ(dβ = Ξ (αβ + Ξ(dαΞ(dβ.

18 8 L. H. Y. Chen and A. Xia Applying Corollary 3.6 gives d 2 (LΞ, Po(λ α + α β A α\{α} β A α ( 5 λ + 3 Ξ (αβ + ( 5 λ + 3 Ξ (αβ + Ξ(dαΞ(dβ λ(dαλ(dβ. (5. Now, Ξ (αβ = e µ V (x,r µ dx, αβ where αβ = \(A α A β and µ = µ/v (. On the other hand, e µ (V (α,r+v (β,r µ 2 dαdβ, if α β 2r, e Ξ(dαΞ(dβ = µ (V (α,r+v (β,r V (α,β,r µ 2 dαdβ, if r α β < 2r, 0, if 0 < α β < r, e µ V (α,r µ dα, if α = β; where V (α, β, r is the volume of B(α, r B(β, r. Hence, [ Ξ (αβ 2 ] = Ξ(dxΞ(dy x,y αβ = e µ V (x,r µ dx + e µ (V (x,r+v (y,r µ 2 dxdy αβ x,y αβ, x y 2r + e µ (V (x,r+v (y,r V (x,y,r µ 2 dxdy. Writting we have Var( Ξ (αβ = x,y αβ,r x y <2r [ Ξ (αβ ] 2 = e µ (V (x,r+v (y,r µ 2 dxdy, x,y αβ e µ V (x,r µ dx αβ + x,y αβ,r x y <2r x,y αβ, x y <2r e µ V (x,r µ dx αβ + x,y αβ, x y <2r e µ (V (x,r+v (y,r V (x,y,r µ 2 dxdy e µ (V (x,r+v (y,r µ 2 dxdy e µ V (x,r [ e µ V (y,r ]µ 2 dxdy { + ( e µ κ d r d µ κ d (2r d } e µ V (x,r µ dx. αβ

19 Poisson process approximation 9 Thus, κ = Var( Ξ(αβ + ( Ξ (αβ + Var( Ξ(αβ ( Ξ (αβ + ( e µ κ d rd µ κ d (2r d, which, together with Lemma 3.5, yields Finally, and Ξ (αβ κ αβ e µ V (x,r µ dx + α β A α\{α} Ξ(dαΞ(dβ α 3 + ( e µκdrdµ κ d (2r d. λ + 2µ κ d (2r d β A α e µ V (α,r µ 2 dαdβ µ κ d (2r d λ, λ(dαλ(dβ µ κ d (2r d λ. α β A α Applying these inequalities to the relavent terms in (5. gives Theorem Occupancy problem Suppose s balls are dropped independently into n urns with probability p k of going into the kth urn. Two cases of the distribution of urns with given content have been studied in the literature. They are urns with at most m balls (right hand domain and urns with at least m balls (left hand domain, where m is a fixed non-negative integer [see Kolchin, Sevast yanov and Chistyakov (978 and also Barbour, Holst and Janson (992, Chapter 6]. In this section, we consider the right hand domain. So far, the focus in the literature has been on the total number of urns [see Arratia, Goldstein and Gordon (989, Barbour, Holst and Janson (992, and references therein] and little attention has been paid to the locations of the urns. We assume the urns are numbered from to n and let X i be the number of balls in the ith box, i n. Define a point process Ξ on = [0, ] as follows: n Ξ = {Xi m}δ i/n. The mean measure of Ξ is then µ = n i= π iδ i/n, where π i = m j=0 i= ( s j p j i ( p i s j, and µ = n i= π i. Set λ(dt = nπ i dt for (i /n < t i/n, i =, 2,..., n, and d 0 (t, t 2 = t t 2 for t, t 2. Let µ = min i j, i,j n k i,j, k n IP(X k m X i = X j = 0

20 20 L. H. Y. Chen and A. Xia and µ = min i n j i, j n IP(X j m X i = 0 µ. If s min i n p i is large, then we would expect good Poisson process approximation. Theorem 5.2 With the above setup, d 2 (LΞ, Po(λ ( 5 2n + µ + 3 µ { 2n + C π + s µ [( Ξ Var( Ξ ] (5.2 ( ln s + m ln ln s + 5m s ln s m ln ln s 4m µ + 4 } 2, (5.3 s where (5.3 is valid for s > ln s + m ln ln s + 4m, ( 3p + 2p 2 s C = ( 2π /µ 3p with π = max i n π i and p = max i n p i < /3. Proof. By the triangle inequality, d 2 (LΞ, Po(λ d 2 (LΞ, Po(µ+d 2 (Po(µ, Po(λ, so the term /(2n follows immediately from estimating d 2 (Po(µ, Po(λ [see Brown and Xia (995, (2.8]. For each i n, let I i = {Xi m}, then (I i ; i n are negatively related. Indeed, if X i m, take Y ji = X j for all j. If X i > m, take a random variable X i which is independent of {X,..., X n } and has distribution L(X i X i m and take X i X i balls from urn i and re-distribute them to the other urns with probabilities p j /( p i for j i. Let Y ji be the number of balls in urn j after the re-distribution and set J ji = {Yji m}. This coupling (J ji ; j n satisfies L(J ji ; j n = L(I j ; j n I i = ; J ji I j, for all j i, [see Barbour, Holst and Janson (992, page 22]. We have from Corollary 4.6 that ( ( n 5 d 2 (LΞ, Po(µ µ + 3 j i J (I k J ki π i + πi 2. (5.4 ji + i= Now, the above coupling can be modified to show that, for l, k i (I j ; j i,..., i l X i =... = X il = 0

21 Poisson process approximation 2 are also negatively related. In fact, denote i = (i,..., i l. If X i =... = X il = 0, take Z ji = X j for all j i,..., i l. If one of X i,..., X il is not 0, take all balls in urns i,..., i l and relocate them to the other urns with probabilities p j := p j /( p i... p il for j i,..., i l. After the relocation, let Z ji be the number of balls in urn j and J ji = {Z ji m}. Next, for k i,..., i l, if Z ki m, take J jki = J ji. If Z ki > m, take a random variable Z ki which is independent of {Z i,..., Z ni } and has distribution L(Z ki Z ki m. Remove Z ki Z ki balls from urn k and re-distribute them to the other urns with probabilities p j/( p k, j k, i,..., i l. After this, let J jki = if there are at most m balls in urn j, otherwise, let J jki = 0. These couplings satisfy L(J ji; j i,..., i l = L(I j ; j i,..., i l X i =... = X il = 0, L(J jki; j i,..., i l = L(J ji; j i,..., i l J ki =, J jki J ji, for all j k, i,..., i l, J ji I j, for all j i,..., i l. In particular, if i = i, then J ji J ji for j i. By these couplings and Proposition 4.5, we have j i J ji + j i J ji + On the other hand, for k i, denote k = (i, k, we have Hence, j i J ji µ. (5.5 I k J ki j i J ji + I k J ki j i,k J ji + [ ] m s = l =0 l 2 =m+ j i,k {Y ji m} + X k = l, Y ki = l 2 IP(X k = l, Y ki = l 2 m s j i,k IP(X k = l, Y ki = l 2 {Z jk m} + l =0 l 2 =m+ = j i,k J jk + IP(I k =, J ki = 0 [ J jk] IP(I k =, J ki = 0 j i,k IP(I k =, J ki = 0 µ = (I k J ki µ. ( k i I k k i J [ ] ki j i J (/µ (I k J ki, ji + k i

22 22 L. H. Y. Chen and A. Xia which, combined with (5.4 and (5.5, yields ( 5 d 2 (LΞ, Po(µ µ + 3 [( n I µ k ] J ki π i + πi 2. i= k i k i On the other hand, since for k i, (J ki π i = IP(I k = I i = = (I k I i, we have [( n I k ] J ki π i + πi 2 i= k i k i = (I k (I i (I i I k i,k n = ( Ξ Var( Ξ. i k, i,k n Therefore, (5.2 follows. To prove (5.3, we note from Theorem 6.D of Barbour, Holst and Janson (992 (page 22 that Var( Ξ ( Ξ π + s ( ln s + m ln ln s + 5m µ s ln s m ln ln s 4m µ s So, it remains to show that ( µ 3p + 2p 2 s ( 2π µ /µ. (5.6 3p To prove (5.6, notice that for i, j n with i j, IP(X k m X i = X j = 0 Hence k i,j which implies (5.6. = ( ( l ( s p k p k l p i p j p i p j k i,j 0 l m ( ( s p l l k( p k s l pi p j p k ( p k ( p i p j k i,j 0 l m ( s 3p ( s p l 3p + 2p 2 l k( p k s l k i,j 0 l m ( s 3p (µ 2π 3p + 2p 2. ( s µ 3p (µ 2π 3p + 2p 2, s l s

23 Poisson process approximation Rare words in biomolecular sequences One of the important problems in biomolecular sequence analysis is the study of the distribution of words in a DNA sequence. A DNA sequence may be regarded as a sequence of letters taken from the alphabet {A, C, G, T}. The letters A, C, G, T represent the four nucleotides, adenine, cytosine, guanine and thymine, respectively. They form two complimentary pairs, namely {A,T} and {C,G}. It is known that repetition of a given word or a group of words or occurrences of unusually large clusters of words are known to have biological functions. For example, unusually large clusters of palindromes are known to contain such significant sites as origins of replication and gene regulators. Here palindromes are symmetrical words of DNA in the sense that it reads exactly the same as its reverse complimentary sequence. In Leung and Yamashita (999, palindromes of certain lengths are assumed to be independent and uniformly distributed in herpesvirus genomes and the r-scan statistic is used to identify unusually large clusters of palindromes. It is commonly assumed that the bases of a DNA are independent random variables taking values in the set {A, C, G, T}. Under this assumption, if each word of a particular type is represented by a point, then the points representing these words form a locally dependent point process. Theorem 4. in this paper provides an error bound for approximating such a point process by a Poisson process. The error bound can be used to find conditions for which the approximation is good. In general, the approximation is good if the words are rare in the sense that the probabilities of their occurrences are small. However, the error bound can be made more explicit only when the words are specified. As an application, Theorem 4. provides a mathematical basis for Poisson process modeling of rare words in a biomolecular sequence, and in particular of palindromes in a DNA sequence. A consequence of this is that the observed rare words may be regarded as a realization of i.i.d. random variables, thus providing a mathematical basis for the assumption in Leung and Yamashita (999 that the points representing the palindromes are independent and uniformly distributed in the herpesvirus genomes. In Leung, Choi, Xia and Chen (2002, Poisson process approximation for palindromes in sixteen herpesvirus genomes is studied. The centers of palindromes in each herpesvirus genome are represented by the point process on {0, /n, 2/n,..., (n /n, }: n Ξ = I i δ i/n, (5.7 i= where the length of genome (number of base pairs is denoted by M, those palin-

24 24 L. H. Y. Chen and A. Xia dromes considered are of length at least 2L (the length must be even and called 2L-palindromes, the center of a palindrome of length 2K is the Kth base in the palindrome from the left, and the number of possible centers of 2L-palindromes is M 2L +, denoted by n. Also, I i is the indicator random variable for the occurrence of a 2L-palindrome centered at base i + L of the DNA sequence. The palindromes are represented by their centers because the latter are fixed irrespective of the lengths of the former. Whereas the first base pair of a palindrome of at least a certain length is random and will give rise to complications in the analysis if it is used to represent the palindrome. Since 2L-palindromes with centers sufficiently far apart (more specifically, further than 2L bases apart occur independently, the point process (5.7 is a sequence of marked locally dependent trials as described in Section 4 of this paper, to which Theorem 4. is applicable. Here (I i ; i n are locally dependent with neighborhoods A i = {j : i 2L + j i + 2L } {, 2,..., n}, i =, 2,..., n. Take = [0, ] and d 0 (x, y = x y. Let p i = IP(I i = and p ij = IP(I i = I j =. It can be shown that p i = θ L, where θ = 2(p A p T + p C p G and p A, p T, p C, p G are the probabilities of A, T, C, G respectively. Suppose p A = p T, p C = p G and 4 L n 500. (5.8 Then the next theorem follows from Theorem 4. with U i = i/n, Lemma 3.5 and a two-step approximation as in Section 5.2, namely, first approximate Ξ by a Poisson process with the same mean measure as that of Ξ and then approximate the latter by a Poisson process on [0,] with intensity measure λdx. Theorem 5.3 We have d 2 (LΞ, Po(λ 26 λ (b + b 2 + 2n 3LθL/2, (5.9 where λ = n i= p i = nθ L, b = n i= j A i p i p j n(4l θ 2L, b 2 = n p ij n(4l 2θ 3L/2 i= j A i,j i and λ(dx = λdx.

25 Poisson process approximation 25 Since a proof of Theorem 5.3 is given in Leung, Choi, Xia and Chen (2002, we will not give one here. It suffices to mention that the explicit bound on the overlap probabilities in (5.9 is due to the explicit nature of the palindrome. In order for Po(λ to be nondegenerate in the limit, λ = nθ L must converge to a positive number as n. This means that L = ln n/ ln(/θ + d where d is bounded. For such an L, the assumption (5.8 is satisfied for sufficiently large n, Theorem 5.3 holds and the upper bound in (5.9 tends to 0 as n. A significant feature of the bound in (5.9 is that it has the Stein factor /λ. This is crucial for accuracy as the value of λ ranges from about 00 to 300 for the sixteen herpesvirus genomes under study. In Leung, Choi, Xia and Chen (2002, a direct proof of a special case of Theorem 4. with U i = i/n is given (see Theorem and Appendix. Also given are the details of deducing Theorem 5.3 from the special case of Theorem 4. and the proof of the upper bound 3Lθ L/2 (see Lemmas and 2 and Propositions and 2. This upper bound is then used as a guide to choose optimal lengths of palindromes for the approximation. The scan statistics is then applied to identify unusually large clusters of palindromes for each of the sixteen herpesviruses. Acknowledgements This work was supported by grant R at the National University of Singapore and ARC Discovery grant DP from the University of Melbourne. Part of this work was done while both authors were visiting the Australian National University in June 2002 at the invitation of Peter Hall. This work is also partially done at the Institute for Mathematical Sciences, National University of Singapore under a grant (No. 0//2/9/27 from BMRC of Singapore. References [] Arratia, R., Goldstein, L. and Gordon, L., Two moments suffice for Poisson approximations: the Chen-Stein method, Ann. Probab. 7 (989, [2] Arratia, R., Goldstein, L. and Gordon, L., Poisson approximation and the Chen-Stein method, Statistical Science 5 (990, [3] Barbour, A. D., Stein s method and Poisson process convergence, J. Appl. Probab. 25 (A (988, [4] Barbour, A. D. and Brown, T. C., Stein s method and point process approximation, Stochastic Processes and their Applications 43 (992, 9-3.

26 26 L. H. Y. Chen and A. Xia [5] Barbour, A. D., Holst, L. and Janson, S., Poisson Approximation, Oxford Univ. Press, 992. [6] Barbour, A. D. and Månsson, M., Compound Poisson process approximation, Ann. Probab. 30 (2002, [7] Billingsley, P., Convergence of Probability Measures, John Wiley & Sons (968. [8] Brown, T. C., Weinberg, G. V. and Xia, A., Removing logarithms from Poisson process error bounds, Stochastic Processes and their Applications 87 (2000, [9] Brown, T. C. and Xia, A., On Metrics in Point Process Approximation, Stochastics and Stochastics Reports 52 (995, [0] Brown, T. C. and Xia, A., Stein s method and birth-death processes, Ann. Probab. 29 (200, [] Chen, L. H. Y., Poisson approximation for dependent trials, Ann. Probab. 3 (975, [2] Chen, L. H. Y., Stein s method: some perspectives with applications. Probability towards 2000, Lecture Notes in Statist., 28 (998, [3] Cox, D. R. and Isham, V., Point Processes, Chapman & Hall, 980. [4] Daley, D. J. and Vere-Jones, D., An Introduction to the Theory of Point Processes, Springer-Verlag, 988. [5] Eichelsbacher, P. and Roos, M., Compound Poisson approximation for dissociated random variables via Stein s method, Combin. Probab. Comput. 8 (999, [6] Janossy, L., On the absorption of a nucleon cascade, Proc. R. Irish Acad. Sci. Sec. A 53 (950, [7] Kallenberg, O., Random Measures, Academic Press, London, 983. [8] Kolchin, V. F., Sevast yanov, B. A. and Chistyakov, V. P., Random Allocations, Winston & Sons, Washington DC, 978. [9] Leung, M. Y., Choi, K. P., Xia, A. and Chen, L. H. Y., Nonrandom clusters of palindromes in herpesvirus genomes, Preprint (2002.

27 Poisson process approximation 27 [20] Leung, M. Y. and Yamashita, T. E., Applications of the scan statistic in DNA sequence analysis, Scan statistics and applications (999, Stat. Ind. Technol., Birkhäuser Boston, Boston, MA, [2] Liggett, T. M., Interacting particle systems, Springer, 985. [22] Matérn, B., Spatial variation, Second edition, Lecture Notes in Statistics 36, Springer-Verlag, 986. [23] Nielsen, O. A., An Introduction to Integration and Measure Theory, John Wiley & Sons, 997. [24] Parthasarathy, K. R., Probability measures on metric spaces, Academic Press, 967. [25] Roos, M., Stein s method for compound Poisson approximation: the local approach, Ann. Appl. Probab. 4 (994, [26] Ruelle, D., Statistical mechanics: Rigorous results, W. A. Benjamin, 969. [27] Srinivasan, S. K., Stochastic Theory and Cascade Processes, American Elsevier, New York, 969. [28] Stein, C., A bound for the error in the normal approximation to the distribution of a sum of dependent random variables, Proc. Sixth Berkeley Symp. Math. Statist. Prob. 3 (972, Univ. California Press, [29] Xia, A., On using the first difference in the Stein-Chen method, Ann. Appl. Probab. 7 (997, [30] Xia, A., Poisson approximation, compensators and coupling, Stochastic Anal. Appl. 8 (2000,

Poisson Process Approximation: From Palm Theory to Stein s Method

IMS Lecture Notes Monograph Series c Institute of Mathematical Statistics Poisson Process Approximation: From Palm Theory to Stein s Method Louis H. Y. Chen 1 and Aihua Xia 2 National University of Singapore