Poisson Approximation for Two Scan Statistics with Rates of Convergence

Xiao Fang (joint work with David Siegmund)
National University of Singapore
May 28, 2015
Outline

- The first scan statistic
- The second scan statistic
- Other scan statistics
A statistical testing problem

Let $X_1,\dots,X_n$ be an independent sequence of random variables. We want to test the hypothesis
\[ H_0:\ X_1,\dots,X_n \sim F_{\theta_0}(\cdot) \]
against the alternative
\[ H_1:\ \text{for some } i<j,\quad X_{i+1},\dots,X_j \sim F_{\theta_1}(\cdot), \qquad X_1,\dots,X_i,\,X_{j+1},\dots,X_n \sim F_{\theta_0}(\cdot). \]
Here $i$ and $j$ are called change-points; they are not specified in the alternative hypothesis. $\theta_0$ may be given, or may need to be estimated. $\theta_1$ may be given, or may be a nuisance parameter.
The first scan statistic

If $j-i=t$ is given and $F_{\theta_0}(\cdot)$ and $F_{\theta_1}(\cdot)$ have different mean values, a natural statistic is
\[ M_{n;t} = \max_{1\le i\le n-t+1} T_i, \qquad T_i = X_i + \cdots + X_{i+t-1}. \]
We are interested in its p-value: assuming $X_1,\dots,X_n \sim F_{\theta_0}(\cdot)$,
\[ P(M_{n;t} \ge b) = P\Big(\max_{1\le i\le n-t+1} T_i \ge b\Big) = \,? \]
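Since $M_{n;t}$ is just a maximum of moving sums, its p-value can always be estimated by brute-force simulation. Below is a minimal Monte Carlo sketch (an illustration, not part of the talk), taking $F_{\theta_0} = N(0,1)$ for concreteness:

```python
import numpy as np

def scan_stat(x, t):
    """M_{n;t}: the maximum moving sum over windows of length t."""
    s = np.concatenate(([0.0], np.cumsum(x)))
    return (s[t:] - s[:-t]).max()          # n - t + 1 window sums

def mc_pvalue(n, t, b, reps=5_000, seed=0):
    """Monte Carlo estimate of P(M_{n;t} >= b) under H0: X_i ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    hits = sum(scan_stat(rng.standard_normal(n), t) >= b for _ in range(reps))
    return hits / reps

# b = a * t with a = 0.5, t = 50, echoing the normal example later
print(mc_pvalue(n=1000, t=50, b=25.0))
```

This gives a baseline against which the analytic approximations below can be judged.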
Known results

Let $Y_i = I(T_i \ge b)$. Then $\{\max_{1\le i\le n-t+1} T_i \ge b\} = \{\sum_{i=1}^{n-t+1} Y_i \ge 1\}$.
Dembo and Karlin (1992) proved that if $t$ is fixed and $b, n \to \infty$, plus mild conditions on $F_{\theta_0}(\cdot)$, then
\[ P(M_{n;t} \ge b) = P\Big(\sum_{i=1}^{n-t+1} Y_i \ge 1\Big) \to 1 - e^{-\lambda}, \qquad \lambda = (n-t+1)E(Y_1). \]
The mild conditions on $F_{\theta_0}(\cdot)$ ensure that $P(Y_{i+1} = 1 \mid Y_i = 1) \to 0$.
$t \to \infty$: If $X_i \sim \mathrm{Bernoulli}(p)$ and $b$ is an integer, Arratia, Gordon and Waterman (1990) proved that
\[ \big|P(M_{n;t} \ge b) - (1 - e^{-\lambda})\big| \le C\Big(e^{-ct} + \frac{t}{n}\Big)(\lambda \wedge 1) \tag{1} \]
where $\lambda = (n-t+1)\,P(T_1 = b)\big(\tfrac{b}{t} - p\big)$.
Haiman (2007) derived more accurate approximations using the distribution function of $Z_k := \max\{T_1,\dots,T_{kt+1}\}$ for $k = 1$ and $2$; these distribution functions are only known for Bernoulli and Poisson random variables.
Our objective is to extend (1) to other random variables.
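To make (1) concrete, the following sketch (an illustration, with parameters echoing the Bernoulli example later) computes $\lambda$ from the Binomial$(t,p)$ pmf of $T_1$ and the resulting approximation $1 - e^{-\lambda}$:

```python
from math import comb, exp

def agw_lambda(n, t, p, b):
    """lambda = (n - t + 1) P(T_1 = b) (b/t - p), with T_1 ~ Binomial(t, p)."""
    pmf = comb(t, b) * p**b * (1 - p) ** (t - b)
    return (n - t + 1) * pmf * (b / t - p)

lam = agw_lambda(n=7680, t=30, p=0.1, b=11)   # roughly the Bernoulli example below
print(lam, 1 - exp(-lam))                     # Poisson approximation 1 - e^{-lambda}
```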
Preparation for the main result

Let $\mu_0 = E(X_1)$. We assume $b = at$ where $a > \mu_0$. Then
\[ P\Big(\max_{1\le i\le n-t+1} T_i \ge b\Big) = P\Big(\max_{1\le i\le n-t+1} \frac{X_i + \cdots + X_{i+t-1}}{t} \ge a\Big). \]
We assume the distribution of $X_1$ can be imbedded in an exponential family of distributions
\[ dF_\theta(x) = e^{\theta x - \Psi(\theta)}\, dF(x), \qquad \theta \in \Theta. \tag{2} \]
It is known that $F_\theta$ has mean $\Psi'(\theta)$ and variance $\Psi''(\theta)$. Assume $\theta_0 = 0$, i.e., $X_1 \sim F$, and that there exists $\theta_a \in \Theta^o$ such that $\Psi'(\theta_a) = a$.
Example: $X_1 \sim N(0,1)$, $\Psi(\theta) = \theta^2/2$, $\theta_a = a$, $F_{\theta_a} \sim N(a,1)$.
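When $\Psi'$ has no closed-form inverse, $\theta_a$ can be found numerically, since $\Psi'$ is strictly increasing. A sketch for Bernoulli$(p)$, where $\Psi(\theta) = \log(1 - p + pe^\theta)$ and the closed form $\theta_a = \log[a(1-p)/(p(1-a))]$ is available as a check:

```python
from math import log, exp

def psi_prime(theta, p):
    """Mean of the tilted Bernoulli(p): Psi(theta) = log(1 - p + p e^theta)."""
    et = exp(theta)
    return p * et / (1 - p + p * et)

def solve_theta_a(a, p, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve Psi'(theta_a) = a by bisection; Psi' is strictly increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if psi_prime(mid, p) < a else (lo, mid)
    return 0.5 * (lo + hi)

print(solve_theta_a(0.4, 0.1), log(0.4 * 0.9 / (0.1 * 0.6)))  # closed-form check
```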
Assumption (2) is used in two places:
1. To obtain an accurate approximation to the marginal probability $P(T_1 \ge at)$ by change of measure.
2. Local limit theorem [Diaconis and Freedman (1988)]:
\[ d_{TV}\big(\mathcal{L}(X_1,\dots,X_m \mid T_1 = at),\ \mathcal{L}(X^a_1,\dots,X^a_m)\big) \le \frac{Cm}{t} \]
where $X^a_1,\dots,X^a_m$ are i.i.d. and $X^a_1 \sim F_{\theta_a}$.
Let $D_k = \sum_{i=1}^k (X^a_i - X_i)$ and let $\sigma_a^2 = \Psi''(\theta_a)$.
Theorem

Under assumption (2), for some constant $C$ depending only on the exponential family (2), $\mu_0$, and $a$, we have
\[ \big|P(M_{n;t} \ge at) - (1 - e^{-\lambda})\big| \le C\Big(\frac{(\log t)^2}{t} + \frac{\log t \vee \log(n-t)}{n-t}\Big)(\lambda \wedge 1), \]
where, if $X_1$ is nonlattice plus mild conditions,
\[ \lambda = (n-t+1)\, \frac{e^{-[a\theta_a - \Psi(\theta_a)]t}}{\theta_a \sigma_a (2\pi t)^{1/2}}\, \exp\Big[-\sum_{k=1}^\infty \frac{1}{k} E\big(e^{-\theta_a D_k^+}\big)\Big], \]
and, if $X_1$ is integer-valued with span 1,
\[ \lambda = (n-t+1)\, \frac{e^{-(a\theta_a - \Psi(\theta_a))t}\, e^{-\theta_a(\lceil at\rceil - at)}}{(1 - e^{-\theta_a})\, \sigma_a (2\pi t)^{1/2}}\, \exp\Big[-\sum_{k=1}^\infty \frac{1}{k} E\big(e^{-\theta_a D_k^+}\big)\Big]. \]
Remarks:
- We don't have an explicit expression for the constant $C$.
- The relative error $\to 0$ if $t \to \infty$ and $n - t \to \infty$.
- Let $g(x) = E e^{ixD_1}$ and $\xi(x) = \log\{1/[1 - g(x)]\}$. Woodroofe (1979) proved that for the nonlattice case,
\[ \sum_{k=1}^\infty \frac{1}{k} E\big(e^{-\theta_a D_k^+}\big) = \log[(a-\mu_0)\theta_a] - \frac{1}{\pi}\int_0^\infty \frac{\theta_a^2\,[\mathcal{I}\xi(x) - \pi/2]}{x(\theta_a^2 + x^2)}\, dx + \frac{1}{\pi}\int_0^\infty \frac{\theta_a\{\mathcal{R}\xi(x) + \log[(a-\mu_0)x]\}}{\theta_a^2 + x^2}\, dx, \]
where $\mathcal{R}$ and $\mathcal{I}$ denote real and imaginary parts.
- Tu and Siegmund (1999) proved that for the arithmetic case,
\[ \sum_{k=1}^\infty \frac{1}{k} E\big(e^{-\theta_a D_k^+}\big) = \log(a-\mu_0) + \frac{1}{2\pi}\int_0^{2\pi} \Big\{\frac{\xi(x)\, e^{-\theta_a - ix}}{1 - e^{-\theta_a - ix}} + \frac{\xi(x) + \log[(a-\mu_0)(1 - e^{ix})]}{1 - e^{ix}}\Big\}\, dx. \]
Example 1: Normal distribution.

  n     t    a    p_1     p_2
  1000  50   0.2  0.9315  0.9594
  1000  50   0.4  0.2429  0.2624
  1000  50   0.5  0.0331  0.0334
  2000  50   0.5  0.0668  0.0672
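For the normal case the ingredients of $\lambda$ are explicit ($\theta_a = a$, $\sigma_a = 1$, $a\theta_a - \Psi(\theta_a) = a^2/2$, and $D_k \sim N(ka, 2k)$ from the definitions above), so $\lambda$ can be sketched numerically. The truncation level `kmax` and the Monte Carlo estimate of the exponential-sum factor are shortcuts of this illustration, so agreement with the table is only approximate:

```python
import numpy as np
from math import exp, sqrt, pi

def lam_normal(n, t, a, kmax=100, reps=20_000, seed=1):
    """Nonlattice lambda for X ~ N(0,1); the exp-sum factor is estimated by MC."""
    rng = np.random.default_rng(seed)
    inc = rng.normal(a, sqrt(2.0), size=(reps, kmax))      # increments of D_k
    d = np.cumsum(inc, axis=1)                             # D_1, ..., D_kmax
    terms = np.exp(-a * np.maximum(d, 0.0)).mean(axis=0)   # E e^{-theta_a D_k^+}
    s = sum(terms[k - 1] / k for k in range(1, kmax + 1))
    return (n - t + 1) * exp(-a * a * t / 2) / (a * sqrt(2 * pi * t)) * exp(-s)

lam = lam_normal(1000, 50, 0.5)
print(lam, 1 - exp(-lam))   # rough comparison with the third row of the table
```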
Example 2: Bernoulli distribution.

  n      t    mu_0  a      p_1       p_2
  7680   30   0.1   11/30  0.14097   0.14021
  7680   30   0.1   0.4    0.029614  0.029387
  15360  30   0.1   0.4    0.058458  0.058003
Sketch of proof: Let $m = C(\log t \vee \log(n-t))$. Let
\[ Y_i = I\big(T_i \ge at,\ T_{i+1} < T_i, \dots, T_{i+m} < T_i,\ T_{i-1} < T_i, \dots, T_{i-m} < T_i\big), \qquad W = \sum_{i=1}^{n-t+1} Y_i. \]
Then
\[ P(M_{n;t} \ge at) \approx P(W \ge 1), \qquad \lambda_1 = EW = (n-t+1)EY_1. \]
From the Poisson approximation theorem of Arratia, Goldstein and Gordon (1990), we have
\[ \big|P(W \ge 1) - (1 - e^{-\lambda_1})\big| \le C\Big(\frac{1}{t} + \frac{1}{n-t}\Big)(\lambda \wedge 1). \]
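The declumped count $W$ is easy to compute on simulated data; a small sketch (edge windows are simply truncated here, a detail the proof treats more carefully):

```python
import numpy as np

def declumped_count(x, t, b, m):
    """W = sum of Y_i, where Y_i flags T_i >= b and T_i strictly exceeds its
    m neighbors on each side (the proof's declumping device)."""
    s = np.concatenate(([0.0], np.cumsum(x)))
    T = s[t:] - s[:-t]
    W = 0
    for i in np.flatnonzero(T >= b):
        if (T[max(0, i - m):i] < T[i]).all() and (T[i + 1:i + 1 + m] < T[i]).all():
            W += 1
    return W

rng = np.random.default_rng(2)
print(declumped_count(rng.standard_normal(1000), t=50, b=25.0, m=20))
```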
Approximating $\lambda_1$ by $\lambda$:
\[ EY_1 = P\big(T_1 \ge at,\ T_2 < T_1, \dots, T_{1+m} < T_1;\ T_0 < T_1, \dots, T_{1-m} < T_1\big) \approx P(T_1 \ge at)\, P^2\big(T_1 - T_2 > 0, \dots, T_1 - T_{1+m} > 0 \mid T_1 \ge at\big). \]
Note that $T_1 - T_2 = X_1 - X_{t+1}$ and that, given $T_1 \ge at$, $X_1 \sim F_{\theta_a}$ approximately and $X_{t+1} \sim F$. Thus $\{T_1 - T_2 > 0\} \approx \{D_1 > 0\}$ where $D_1 = X^a_1 - X_1$. Similarly, $\{T_1 - T_{k+1} > 0\} \approx \{D_k > 0\}$, $D_k = \sum_{i=1}^k (X^a_i - X_i)$. Therefore
\[ EY_1 \approx P(T_1 \ge at)\, P^2(D_k > 0,\ k = 1, 2, \dots). \]
Recall
\[ \lambda = (n-t+1)\, \frac{e^{-[a\theta_a - \Psi(\theta_a)]t}}{\theta_a \sigma_a (2\pi t)^{1/2}}\, \exp\Big[-\sum_{k=1}^\infty \frac{1}{k} E\big(e^{-\theta_a D_k^+}\big)\Big]. \]
Corollary

Let $X_1,\dots,X_n$ be i.i.d. random variables with distribution function $F$ that can be imbedded in an exponential family, as in (2). Let $EX_1 = \mu_0$. Assume $X_1$ is integer-valued with span 1. Suppose $a = \sup\{x : p_x := P(X_1 = x) > 0\}$ is finite. Let $b = at$. Then we have, with constants $C$ and $c$ depending only on $p_a$,
\[ \big|P(M_{n;t} \ge b) - (1 - e^{-\lambda})\big| \le C(\lambda \wedge 1)e^{-ct}, \qquad \lambda = (n-t)\,p_a^t(1 - p_a) + p_a^t. \]
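A quick numerical illustration of the corollary: for Bernoulli$(p)$ the top of the support is $1$ with $p_1 = p$, and $M_{n;t} \ge t$ means a run of $t$ successes. With $t$ small, the $e^{-ct}$ error is visible, so the agreement below is only rough:

```python
import numpy as np
from math import exp

n, t, p = 60, 4, 0.5                      # Bernoulli: top of support 1, p_1 = p
lam = (n - t) * p**t * (1 - p) + p**t     # corollary's lambda with b = t
rng = np.random.default_rng(3)
x = rng.random((100_000, n)) < p
s = np.concatenate([np.zeros((100_000, 1)), np.cumsum(x, axis=1)], axis=1)
phat = ((s[:, t:] - s[:, :-t]) >= t).any(axis=1).mean()  # P(run of t successes)
print(1 - exp(-lam), phat)
```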
The second scan statistic

Recall that we want to test $H_0: X_1,\dots,X_n \sim F_{\theta_0}(\cdot)$ against the alternative
\[ H_1:\ \text{for some } i<j,\quad X_{i+1},\dots,X_j \sim F_{\theta_1}(\cdot), \qquad X_1,\dots,X_i,\,X_{j+1},\dots,X_n \sim F_{\theta_0}(\cdot). \]
Now assume $j - i$ is not given, and $F_{\theta_0}$ and $F_{\theta_1}$ are from the same exponential family of distributions $dF_\theta(x) = e^{\theta x - \Psi(\theta)}\, dF(x)$, $\theta \in \Theta$. Then the log likelihood ratio statistic is
\[ \max_{0\le i<j\le n}\ (\theta_1 - \theta_0) \sum_{k=i+1}^{j} \Big(X_k - \frac{\Psi(\theta_1) - \Psi(\theta_0)}{\theta_1 - \theta_0}\Big). \]
It reduces to the following problem: Let $X_1,\dots,X_n$ be independent, identically distributed random variables with $EX_1 = \mu_0 < 0$. Let $S_0 = 0$ and $S_i = \sum_{j=1}^i X_j$ for $1 \le i \le n$. We are interested in the distribution of
\[ M_n := \max_{0\le i<j\le n} (S_j - S_i). \]
Iglehart (1972) observed that it can be interpreted as the maximum waiting time of the first $n$ customers in a single-server queue. Karlin, Dembo and Kawabata (1990) discussed genomic applications.
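$M_n$ can be computed in a single pass by tracking the running minimum of $S_i$ (a standard maximum-subarray scan, not from the talk):

```python
import numpy as np

def max_increment(x):
    """M_n = max_{0 <= i < j <= n} (S_j - S_i) in one pass over the data."""
    best, s, smin = -np.inf, 0.0, 0.0
    for xi in x:
        s += xi                      # current S_j
        best = max(best, s - smin)   # best pair ending at this j
        smin = min(smin, s)          # running minimum over i <= j
    return best

rng = np.random.default_rng(4)
print(max_increment(rng.standard_normal(10_000) - 0.5))   # drift mu_0 = -0.5
```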
The limiting distribution was derived by Iglehart (1972): Assume the distribution of $X_1$ can be imbedded in an exponential family of distributions $dF_\theta(x) = e^{\theta x - \Psi(\theta)}\, dF(x)$, $\theta \in \Theta$. Assume $EX_1 = \Psi'(0) = \mu_0 < 0$ and that there exists a positive $\theta_1 \in \Theta$ such that $\Psi'(\theta_1) = \mu_1$, $\Psi(\theta_1) = 0$. When $X_1$ is nonlattice, we have
\[ \lim_{n\to\infty} P\Big(M_n \ge \frac{\log n}{\theta_1} + x\Big) = 1 - \exp(-K e^{-\theta_1 x}). \]
Theorem

Let $h(b) > 0$ be any function such that $h(b) \to \infty$ and $h(b) = O(b^{1/2})$ as $b \to \infty$. Suppose $n - b/\mu_1 > b^{1/2}h(b)$. We have
\[ \big|P(M_n \ge b) - (1 - e^{-\lambda})\big| \le C\lambda\Big\{\big(1 + b/h^2(b)\big)e^{-ch^2(b)} + \frac{b^{1/2}h(b)}{n - b/\mu_1}\Big\}, \]
where, if $X_1$ is nonlattice plus mild conditions,
\[ \lambda = \Big(n - \frac{b}{\mu_1}\Big)\, \frac{e^{-\theta_1 b}}{\theta_1 \mu_1}\, \exp\Big(-2\sum_{k=1}^\infty \frac{1}{k} E_{\theta_1} e^{-\theta_1 S_k^+}\Big), \]
and, if $X_1$ is integer-valued with span 1 and $b$ is an integer,
\[ \lambda = \Big(n - \frac{b}{\mu_1}\Big)\, \frac{e^{-\theta_1 b}}{(1 - e^{-\theta_1})\mu_1}\, \exp\Big(-2\sum_{k=1}^\infty \frac{1}{k} E_{\theta_1} e^{-\theta_1 S_k^+}\Big). \]
Remarks:
- By choosing $h(b) = b^{1/2}$, we get
\[ \big|P(M_n \ge b) - (1 - e^{-\lambda})\big| \le C\lambda\Big\{e^{-cb} + \frac{b}{n}\Big\}. \]
- By choosing $h(b) = C(\log b)^{1/2}$ with large enough $C$, we can see that the relative error in the Poisson approximation goes to zero under the conditions
\[ b \to \infty, \qquad (b\log b)^{1/2} \ll n - b/\mu_1 = O(e^{\theta_1 b}), \]
where $n - b/\mu_1 = O(e^{\theta_1 b})$ ensures that $\lambda$ is bounded.
- For the smaller range (in which case $\lambda \to 0$)
\[ b \to \infty, \qquad \delta b \le n - b/\mu_1 = o(e^{\frac{1}{2}\theta_1 b}) \ \text{ for some } \delta > 0, \]
Siegmund (1988) obtained more accurate estimates by a technique different from ours.
Let $G(z) = \sum_{k=0}^\infty p_k z^k + \sum_{k=1}^\infty q_k z^{-k}$, where $p_k = P(X_1 = k)$ and $q_k = P(X_1 = -k)$, and let $z_0$ denote the unique root $> 1$ of $G(z) = 1$.
For the case $p_k = 0$ for $k > 1$, using the notation $Q(z) = \sum_k q_k z^k$, one can show for large values of $n$ and $b$ that
\[ \lambda \approx n z_0^{-b}\big\{[Q(1) - Q(z_0^{-1})] - (1 - z_0^{-1})\,z_0^{-1} Q'(z_0^{-1})\big\}. \]
For the case $q_k = 0$ for $k > 1$,
\[ \lambda \approx n z_0^{-b}(1 - z_0^{-1})\, G'(1)^2 / [z_0 G'(z_0)]. \]
In particular, if $q_1 = q$ and $p_1 = p$, where $p + q = 1$, both of these results specialize to
\[ \lambda \approx n (p/q)^b (q-p)^2/q. \]
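A sketch verifying the specialization for the $\pm 1$ walk: with $Q(z) = qz$ and $z_0 = q/p$ (the root $>1$ of $pz + q/z = 1$), the skip-free-upward formula reduces algebraically to $n(p/q)^b(q-p)^2/q$, and the two functions below should print the same value:

```python
def lam_simple_walk(n, b, p):
    """Specialized lambda for the +/-1 walk: n (p/q)^b (q - p)^2 / q."""
    q = 1 - p
    return n * (p / q) ** b * (q - p) ** 2 / q

def lam_skip_free_up(n, b, p):
    """Skip-free-upward formula with Q(z) = q z and z0 = q/p."""
    q = 1 - p
    z0 = q / p
    Q1, Qz, Qp = q, q / z0, q                  # Q(1), Q(1/z0), Q'(1/z0)
    return n * z0 ** (-b) * ((Q1 - Qz) - (1 - 1 / z0) / z0 * Qp)

print(lam_simple_walk(5000, 10, 0.4), lam_skip_free_up(5000, 10, 0.4))  # equal
```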
Sketch of proof (for the case $h(b) = b^{1/2}$): Recall $S_i = \sum_{k=1}^i X_k$. Define $T_b := \inf\{n \ge 1 : S_n \notin [0, b)\}$. For a positive integer $m$, let $\omega^{+m}$ be the $m$-shifted sample path of $\omega := \{X_1,\dots,X_n\}$. Let $t = b/\mu_1 + b$ and $m = cb$ such that $m < t$. For $1 \le i \le n - t$, let
\[ Y_i = I\big(S_i < S_{i-j},\ 1 \le j \le m;\ T_b(\omega^{+i}) \le t,\ S_{T_b(\omega^{+i})} \ge b\big). \]
That is, $Y_i$ is the indicator of the event that the sequence $\{S_1,\dots,S_n\}$ reaches a local minimum at $i$, and the $i$-shifted sequence $\{S_j(\omega^{+i})\}$ exits the interval $[0, b)$ within time $t$ with first exit position $\ge b$. Let $W = \sum_{i=1}^{n-t} Y_i$.
Sketch of proof (cont.)
\[ P(M_n \ge b) \approx P(W \ge 1), \qquad \big|P(W \ge 1) - (1 - e^{-\lambda_1})\big| \le C\lambda e^{-cb}, \]
\[ \lambda_1 = (n-t)EY_1 \approx (n-t)\,P(\tau_0 = \infty)\,P(S_{T_b} \ge b), \]
where $\tau_0 := \inf\{n \ge 1 : S_n \ge 0\}$. Finally, $\lambda_1 \approx \lambda$.
Other statistics

Recall again that we want to test $H_0: X_1,\dots,X_n \sim F_{\theta_0}(\cdot)$ against the alternative
\[ H_1:\ \text{for some } i<j,\quad X_{i+1},\dots,X_j \sim F_{\theta_1}(\cdot), \qquad X_1,\dots,X_i,\,X_{j+1},\dots,X_n \sim F_{\theta_0}(\cdot). \]
1. If $\theta_0$ is not given, we need to consider $P(M_{n;t} \ge b \mid S_n)$ and $P(M_n \ge b \mid S_n)$.
2. If $\theta_0$ is given but $\theta_1$ is a nuisance parameter, then the log likelihood ratio statistic is
\[ \max_{0\le i<j\le n}\ \max_\theta\ \big[\theta(S_j - S_i) - (j-i)\Psi(\theta)\big]. \]
For the normal distribution, it reduces to
\[ \max_{0\le i<j\le n} \frac{(S_j - S_i)^2}{2(j-i)}. \]
The limiting distribution is only known for the normal distribution and in the regime $n \asymp b^2$ [Siegmund and Venkatraman (1995)].
3. Frick, Munk and Sieling (2014) proposed the following multiscale statistic:
\[ \max_{0\le i<j\le n} \Big\{\frac{|S_j - S_i|}{\sqrt{j-i}} - \sqrt{2\log\Big(\frac{n}{j-i}\Big)}\Big\}. \]
The penalty term $\sqrt{2\log(n/(j-i))}$ was first studied in Dümbgen and Spokoiny (2001) and is motivated by Lévy's modulus of continuity theorem.
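A brute-force evaluation of the multiscale statistic (an $O(n^2)$ sketch; the literature has faster algorithms):

```python
import numpy as np

def multiscale_stat(x):
    """max over i < j of |S_j - S_i| / sqrt(j - i) - sqrt(2 log(n / (j - i)))."""
    s = np.concatenate(([0.0], np.cumsum(x)))
    n = len(x)
    best = -np.inf
    for i in range(n):
        d = s[i + 1:] - s[i]                 # S_j - S_i for all j > i
        L = np.arange(1, n - i + 1)          # window lengths j - i
        best = max(best, (np.abs(d) / np.sqrt(L) - np.sqrt(2 * np.log(n / L))).max())
    return best

rng = np.random.default_rng(6)
print(multiscale_stat(rng.standard_normal(500)))
```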
4. Let $X_1,\dots,X_m$ be an independent sequence of Gaussian random variables with mean $EX_i = \mu_i$ and variance 1. We are interested in testing the null hypothesis $H_0: \mu_1 = \dots = \mu_m$ against the alternative hypothesis that there exist $1 \le \tau_1 < \dots < \tau_K \le m-1$ such that
\[ H_1:\ \mu_1 = \dots = \mu_{\tau_1} \ne \mu_{\tau_1+1} = \dots = \mu_{\tau_2} \ne \dots = \mu_{\tau_K} \ne \mu_{\tau_K+1} = \dots = \mu_m. \]
4. (cont.) If $K = 1$, the log likelihood ratio statistic is
\[ \max_{1\le t\le m-1} \frac{\Big|\dfrac{S_t}{t} - \dfrac{S_m - S_t}{m-t}\Big|}{\sqrt{\dfrac{1}{t} + \dfrac{1}{m-t}}}. \]
If $K > 1$, an appropriate statistic is
\[ \max_{0\le i<j<k\le m} \Bigg\{\frac{\Big|\dfrac{S_j - S_i}{j-i} - \dfrac{S_k - S_j}{k-j}\Big|}{\sqrt{\dfrac{1}{j-i} + \dfrac{1}{k-j}}}\Bigg\}. \]
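A brute-force sketch of the $K > 1$ statistic (two-sided, taking the absolute difference of the two local averages; cubic-time, adequate for moderate $m$):

```python
import numpy as np

def max_double_cusum(x):
    """max_{0<=i<j<k<=m} of the standardized difference of adjacent-segment means."""
    s = np.concatenate(([0.0], np.cumsum(x)))
    m = len(x)
    best = 0.0
    for j in range(1, m):
        k = np.arange(j + 1, m + 1)
        right = (s[k] - s[j]) / (k - j)          # means of (j, k]
        for i in range(j):
            left = (s[j] - s[i]) / (j - i)       # mean of (i, j]
            z = np.abs(left - right) / np.sqrt(1.0 / (j - i) + 1.0 / (k - j))
            best = max(best, z.max())
    return best

rng = np.random.default_rng(7)
print(max_double_cusum(rng.standard_normal(80)))
```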
Thank you!