Approximations for Multidimensional Discrete Scan Statistics


Numéro d'ordre : …    Année : 2014

DOCTORAL THESIS
presented by Alexandru Amărioarei
for the degree of Docteur en Sciences of the Université de Lille 1
Discipline: Applied Mathematics

Approximations for Multidimensional Discrete Scan Statistics

Defended publicly on 15 September 2014 before the jury composed of:

Advisor: Cristian PREDA, Université de Lille 1, France
Reviewers: Joseph GLAZ, University of Connecticut, USA; Claude LEFEVRE, Université Libre de Bruxelles, Belgium
Examiners: Azzouz DERMOUNE, Université de Lille 1, France; Stéphane ROBIN, AgroParisTech/INRA, France; George HAIMAN, Université de Lille 1, France; Manuela SIDOROFF, National Institute of R&D for Biological Sciences, Romania


To my parents


Acknowledgements

I am using this opportunity to express my gratitude to everyone who supported me throughout the course of this work. I would like to express my special appreciation and thanks to my advisor, Prof. Cristian PREDA, who has been and still is a tremendous mentor for me. I would like to thank him for introducing me to the wonderful subject of scan statistics, for encouraging my research and for allowing me to grow as a research scientist. His advice on research, career as well as on life in general is priceless.

It is a pleasure to recognize my dissertation committee members, Prof. Joseph GLAZ, Prof. Claude LEFEVRE, Prof. Azzouz DERMOUNE, Dr. Stéphane ROBIN, Prof. George HAIMAN and Dr. Manuela SIDOROFF, for taking interest in my research and accepting to examine my work. Special thanks to Prof. Joseph GLAZ and Prof. Claude LEFEVRE for their thorough reviews, fruitful discussions and insightful comments. I am very grateful to Prof. George HAIMAN, from whom I had the opportunity to learn many interesting aspects of the subject of the present work, and whose taste for detail and mathematical rigour improved my knowledge in the field of scan statistics. I want to emphasize the role of Dr. Manuela SIDOROFF, who guided me during my first years of research at the National Institute of R&D for Biological Sciences in Bucharest, and without whom I would never have had the chance to meet my advisor, Prof. Cristian PREDA, and complete this work.

I would also like to thank the members of the INRIA/MΘDAL team for welcoming me among them and for making the last years more enjoyable, for their support, discussions and encouragements. I have learned a lot from them. Many of my thanks and my gratitude go to Acad. Ioan CUCULESCU, the first person who showed me the depths of research and guided me on this path. I express my gratitude to all of my friends who supported me in writing and encouraged me to strive towards my goal.

A special thanks goes to my family. Words cannot express how grateful I am to my mother and my father for all of the sacrifices that they have made on my behalf. I dedicate this thesis to them. Last, but not least, I would like to express my appreciation to my beloved Raluca, who spent sleepless nights with me and was always my support in the moments when there was no one to answer my queries.


Résumé

In this thesis, we derive approximations and associated error bounds for the distribution of the multidimensional discrete scan statistic. The scan statistic is viewed as the maximum of a 1-dependent stationary sequence of random variables. In this framework, we present a new result for the approximation of the distribution of the extremum of a 1-dependent stationary sequence of random variables, with broader applicability conditions and smaller approximation errors than the existing results in the literature. This result is then used to approximate the distribution of the scan statistic. The interest of this approach over the existing techniques lies, on the one hand, in the sharpness of the approximation error and, on the other hand, in its applicability, which does not depend on the distribution of the random field underlying the data. The models considered in this work are the i.i.d. model and the block-factor dependence model. For the i.i.d. model, the results are detailed for the one, two and three dimensional scan statistics. An importance sampling algorithm is introduced for the effective computation of the approximations and of the associated errors. Simulation studies demonstrate the efficiency of the obtained results, and comparisons with other existing methods are carried out. The block-factor dependence is introduced as an alternative to Markov-type dependence. The methodology traditionally developed in the i.i.d. case is extended to this type of dependence, and the application of the approximation result for the distribution of the scan statistic under this dependence model is illustrated in the one and two dimensional cases. These techniques, as well as the existing ones in the literature, have been implemented for the first time in Matlab® programs, together with a graphical user interface.


Abstract

In this thesis, we derive accurate approximations and error bounds for the probability distribution of the multidimensional discrete scan statistics. We start by improving some existing results concerning the estimation of the distribution of extremes of 1-dependent stationary sequences of random variables, both in terms of range of applicability and sharpness of the error bound. These estimates play the key role in the approximation process for the multidimensional discrete scan statistics distribution. The presented methodology has two main advantages over the existing ones found in the literature: first, besides the approximation formula, an error bound is also established, and second, the approximation does not depend on the common distribution of the observations. For the underlying random field over which the scan process is evaluated, we consider two models: the classical model of independent and identically distributed observations, and a dependent framework in which the observations are generated by a block-factor. In the i.i.d. case, in order to illustrate the accuracy of our results, we consider the particular settings of one, two and three dimensions. A simulation study is conducted in which we compare our estimate with other approximations and inequalities derived in the literature. The numerical values are efficiently obtained via an importance sampling algorithm discussed in detail in the text. Finally, we consider a block-factor model for the underlying random field, which produces dependent data, and we show how to extend the approximation methodology to this case. Several examples in one and two dimensions are investigated, and the numerical applications accompanying these examples show the accuracy of our approximation. All the methods presented in this thesis led to a Graphical User Interface (GUI) software, implemented in Matlab®.


Contents

Introduction

1 Existing methods for discrete scan statistics
   1.1 One dimensional scan statistics
       1.1.1 Exact results for binary sequences
       1.1.2 Approximations
       1.1.3 Bounds
   1.2 Two dimensional scan statistics
       1.2.1 Approximations
       1.2.2 Bounds
   1.3 Three dimensional scan statistics
       1.3.1 Product-type approximation

2 Extremes of 1-dependent stationary sequences
   2.1 Introduction
       2.1.1 Definitions and notations
       2.1.2 Remarks about m-dependent sequences and block-factors
       2.1.3 Formulation of the problem and discussion
   2.2 Main results
       2.2.1 Haiman results
       2.2.2 New results
   2.3 Proofs
       2.3.1 Technical lemmas
       2.3.2 Proof of Theorem …
       2.3.3 Proof of Corollary …
       2.3.4 Proof of Theorem …
       2.3.5 Proof of Corollary …
       2.3.6 Proof of Theorem …
       2.3.7 Proof of Theorem …
       2.3.8 Proof of Proposition …

3 Scan statistics and 1-dependent sequences
   3.1 Definitions and notations
   3.2 Methodology
   3.3 Computation of the approximation and simulation errors
       3.3.1 Computation of the approximation error
       3.3.2 Computation of the simulation errors
   3.4 Simulation using importance sampling
       3.4.1 Generalities on importance sampling
       3.4.2 Importance sampling for scan statistics
       3.4.3 Computational aspects
       3.4.4 Related algorithms: comparison for normal data
   3.5 Examples and numerical results
       3.5.1 One dimensional scan statistics
       3.5.2 Two dimensional scan statistics
       3.5.3 Three dimensional scan statistics

4 Scan statistics over some block-factor type models
   4.1 Block-factor type model
   4.2 Approximation and error bounds
       4.2.1 The approximation process
       4.2.2 The associated error bounds
   4.3 Examples and numerical results
       4.3.1 Example 1: A one dimensional Bernoulli model
       4.3.2 Example 2: Length of the longest increasing run
       4.3.3 Example 3: Moving average of order q model
       4.3.4 Example 4: A game of minesweeper

Conclusions and perspectives

A Supplementary material for Chapter 1
   A.1 Supplement for Section …
   A.2 Supplement for Section …
       A.2.1 Supplement for Section …
       A.2.2 Supplement for Section …
   A.3 Supplement for Section …

B Supplementary material for Chapter …
   B.1 Proof of Lemma …
   B.2 Proof of Lemma …
   B.3 Validity of the algorithm in Example …
   B.4 Proof of Lemma …

C Matlab GUI for discrete scan statistics
   C.1 How to use the interface
   C.2 Future developments

Bibliography

List of Figures

2.1 The behaviour of the function K(p)
2.2 The relation between H_2 and …_2 when q_1 = … and n varies
2.3 Illustration of the coefficient function …_2 for different values of p_1 and n
3.1 Illustration of Z_{k_1} in the case d = 3, emphasizing the 1-dependence
3.2 Illustration of the approximation of Q_2 in three dimensions
3.3 The evolution of the simulation error in the MC and IS methods
3.4 Illustration of the run time using the cumulative counts technique
3.5 The evolution of the simulation error in IS Algorithm 1 and IS Algorithm 2
3.6 The empirical cumulative distribution function for the binomial and Poisson models in Table …
3.7 The empirical cumulative distribution function for the binomial and Poisson models in Table …
3.8 The empirical cumulative distribution functions (AppH = our approximation, AppPT = product type approximation) for the Gaussian model in Table …
4.1 Illustration of the block-factor type model in two dimensions (d = 2)
4.2 The dependence structure of X_{s_1,s_2} in two dimensions
4.3 Illustration of Z_{k_1}, emphasizing the 1-dependence
4.4 Illustration of the approximation process for d = …
4.5 Cumulative distribution functions for the approximation, the simulation and the limit law
4.6 Cumulative distribution functions for the approximation and the simulation, along with the corresponding error, under the MA model
4.7 A realization of the minesweeper related model
4.8 Cumulative distribution functions for the block-factor and i.i.d. models
4.9 Probability mass functions for the block-factor and i.i.d. models
C.1 The Scan Statistics Simulator GUI


List of Tables

2.1 Selected values for the error coefficients in Theorem … and Corollary …
2.2 Selected values for the error coefficients in Theorem … and Corollary …
3.1 A comparison of the p values as evaluated by simulation using the [Genz and Bretz, 2009] algorithm (Genz), importance sampling (Algo 1) and the relative efficiency between the methods (Rel E)
3.2 A comparison of the p values as evaluated by naive Monte Carlo (MC), importance sampling (Algo 1) and the relative efficiency between the methods (Rel E)
3.3 A comparison of the p values as evaluated by the two importance sampling algorithms (Algo 2 and Algo 1) and the relative efficiency between the methods (Rel E)
3.4 Approximations for P(S_{m_1}(T_1) ≤ n) in the Bernoulli B(0.05) case: m_1 = 15, T_1 = 1000, ITER = …
3.5 Approximations for P(S_{m_1}(T_1) ≤ n) for the binomial and Poisson cases: m_1 = 50, T_1 = 5000, ITER_app = 10^5, ITER_sim = …
3.6 Approximations for P(S_{m_1}(T_1) ≤ n) in the Gaussian N(0,1) case: m_1 = 40, T_1 = 800, ITER_app = 10^5, ITER_sim = …
3.7 Approximations for P(S_m(T) ≤ n) for the binomial and Poisson models: m_1 = 20, m_2 = 30, T_1 = 500, T_2 = 600, ITER_app = 10^4, ITER_sim = …
3.8 Approximations for P(S_m(T) ≤ n) in the Gaussian N(1, 0.5) model: m_1 = 10, m_2 = 20, T_1 = 400, T_2 = 400, ITER_app = 10^4, ITER_sim = …
3.9 Approximations for P(S_m(T) ≤ n) in the Bernoulli model: m_1 = m_2 = m_3 = 5, T_1 = T_2 = T_3 = 60, ITER_app = 10^5, ITER_sim = …
3.10 Approximation for P(S_m(T) ≤ n) over the region R_3 with windows of the same volume but different sizes: T_1 = T_2 = T_3 = 60, p = …, ITER_app = 10^5, ITER_sim = …
3.11 Approximation for P(S_m(T) ≤ n) based on Remark 3.2.1: m_1 = m_2 = m_3 = 10, T_1 = T_2 = T_3 = 185, L_1 = L_2 = L_3 = 20, ITER_app = 10^5, ITER_sim = …
3.12 Approximation for P(S_m(T) ≤ n) in the binomial and Poisson models: m_1 = m_2 = m_3 = 4, T_1 = T_2 = T_3 = 84, ITER_app = 10^5, ITER_sim = …
3.13 Approximations for P(S_m(T) ≤ n) in the Gaussian N(0,1) model: m_1 = m_2 = m_3 = 10, T_1 = T_2 = T_3 = 256, ITER_app = 10^5, ITER_sim = …
4.1 One dimensional Bernoulli block-factor model: m_1 = 8, T_1 = …
4.2 The distribution of the length of the longest increasing run: T̃_1 = 10001, ITER_sim = 10^4, ITER_app = …
4.3 MA(2) model: m_1 = 20, T_1 = 1000, X̃_i = 0.3 X_i + … X_{i+1} + … X_{i+2}, ITER_app = 10^6, ITER_sim = …
4.4 Block-factor: m_1 = m_2 = 3, T̃_1 = T̃_2 = 44, T_1 = T_2 = 42, p = 0.1, ITER = …
4.5 Independent: m_1 = m_2 = 3, T_1 = T_2 = 42, Br = 8, p = 0.1, ITER = …
4.6 Block-factor: m_1 = m_2 = 3, T̃_1 = T̃_2 = 44, T_1 = T_2 = 42, p = 0.3, ITER = …
4.7 Independent: m_1 = m_2 = 3, T_1 = T_2 = 42, Br = 8, p = 0.3, ITER = …
4.8 Block-factor: m_1 = m_2 = 3, T̃_1 = T̃_2 = 44, T_1 = T_2 = 42, p = 0.5, ITER = …
4.9 Independent: m_1 = m_2 = 3, T_1 = T_2 = 42, Br = 8, p = 0.5, ITER = …
4.10 Block-factor: m_1 = m_2 = 3, T̃_1 = T̃_2 = 44, T_1 = T_2 = 42, p = 0.7, ITER = …
4.11 Independent: m_1 = m_2 = 3, T_1 = T_2 = 42, Br = 8, p = 0.7, ITER = …
C.1 Scan dimension versus random field
C.2 Relations used for estimating the distribution of the scan statistics

Introduction

There are many fields of application where an observed cluster of events can have a great influence on the decision taken by an investigator. Knowing whether such an agglomeration of events is due to chance or not plays an important role in the decision-making process. For example, an epidemiologist observes, over a predefined period of time (a week, a month, etc.), an accumulation of cases of an infectious disease among the population of a certain region. Under some model for the distribution of events, if the probability of observing such an unexpected cluster is small with respect to a given threshold value, then the investigator can conclude that an atypical situation has occurred and can take the proper measures to avoid a pandemic crisis.

The problem of identifying accumulations of events that are unexpected or anomalous with respect to the distribution of events belongs to the class of cluster detection problems. Depending on the application domain, these anomalous agglomerations of events can correspond to a diversity of phenomena: for example, one may want to search for clusters of stars, deposits of precious metals, outbreaks of disease, batches of defective pieces, brain tumors, among many other possibilities.

A general class of testing procedures, used by practitioners to evaluate the likelihood of such clusters of events, are the tests based on scan statistics. These statistics, considered for the first time in the work of Naus in the 1960s, are random variables defined as the maximum number of observations in a scanning window of predefined size and shape that is moved in a continuous fashion over all possible locations of the region of study. The tests based on scan statistics are usually employed when one wants to detect a local change (a hot spot) in the distribution of the underlying random field, via testing the null hypothesis of uniformity against an alternative hypothesis which favors clusters of events. The importance of the tests based on scan statistics has been noted in many scientific and technological fields, including DNA sequence analysis, brain imaging, distributed target detection in sensor networks, astronomy, reliability theory and quality control, among many other domains.

To implement these testing procedures, one needs to find the distribution of the scan statistics. The main difficulty in obtaining the distribution of the scan random variable under the null hypothesis resides in the highly dependent structure of the observations over which the maximum is taken. As a consequence, several approximations have been proposed, especially for the case of one, two and three dimensional scan statistics.

In this thesis, we consider the multidimensional discrete scan statistics in a general framework. Viewing the scan statistic as the maximum of a 1-dependent sequence of random variables, we derive accurate approximations and error bounds for its distribution. Our methodology applies to a large class of distributions and extends the i.i.d. case to some new dependent models based on block-factor constructions. This manuscript is organized into four chapters as follows.

In Chapter 1, we review some of the existing approaches used to develop exact and approximate results for the distribution of the unconditional discrete scan statistics. We consider, separately, the cases of one, two and three dimensional scan statistics. In the one dimensional setting we include, along with various approximations and bounds, three general methods used for determining the exact distribution of the scan statistics over a sequence of i.i.d. binary trials: the combinatorial method, the finite Markov chain imbedding technique and the conditional probability generating function method. A new upper bound for the distribution of the two dimensional scan statistics is presented. We should mention that most of the results presented in this chapter are given in their general form, thus extending the corresponding formulas that appear in the literature.

Chapter 2 introduces a series of results concerning the approximation of the distribution of the extremes of 1-dependent stationary sequences of random variables. We improve some existing results in terms of error bounds and range of applicability. Our new approximations constitute the main tools in the estimation of the distribution of the multidimensional discrete scan statistics derived in the subsequent chapters.

The general case of the d dimensional discrete scan statistics, d ≥ 1, for independent and identically distributed observations is considered in Chapter 3. Employing the results derived in Chapter 2, we present the methodology used for obtaining the approximation of the probability distribution function of the multidimensional discrete scan statistics. The main advantage of the described approach is that, besides the approximation formula, we can also establish sharp error bounds. Since the quantities that appear in the approximation formula are usually evaluated by simulation, two types of errors are considered: the theoretical error bounds and the simulation error bounds. We give detailed expressions, based on recursive formulas, for the computation of these bounds. Due to the simulation nature of the problem, we also include a general importance sampling procedure to increase the efficiency of the proposed estimation. We discuss different computational aspects of the procedure and compare it with other existing algorithms. We conclude the chapter with a series of examples for the special cases of one, two and three dimensional scan statistics. In these frameworks, we make the general formulas obtained for the d dimensional setting explicit and investigate their accuracy via a simulation study.

In Chapter 4, we consider the multidimensional discrete scan statistics over a random field generated by a block-factor model. This dependent model generalizes the i.i.d. model presented in Chapter 3. We extend the approximation methodology developed for the i.i.d. case to this model and provide recursive formulas for the computation of the approximation, as well as for the associated error bounds. In the final section, we present several examples for the special cases of one and two dimensional scan statistics to illustrate the method. In particular, we give an estimate for the length of the longest increasing run in a sequence of i.i.d. random variables and we investigate the scan statistics for moving average of order q models. Numerical results are included in order to evaluate the efficiency of our results.

To illustrate the efficiency and the accuracy of the methods presented in this thesis, we developed a Graphical User Interface (GUI) software application, implemented in Matlab®. This application provides estimates for the distribution of the discrete scan statistics in different scenarios. In the GUI, the user can choose the dimension of the problem and the distribution of the random field over which the scan process is performed. We consider the cases of one, two and three dimensional scan statistics over a random field distributed according to a Bernoulli, binomial, Poisson or Gaussian model. In the particular situation of the one dimensional scan statistics, we have also included a moving average of order q model. A more detailed description of this GUI application is given in Appendix C.


Chapter 1

Existing methods for finding the distribution of discrete scan statistics

In this chapter, we review some of the existing methods used in the study of the unconditional discrete scan statistics. In Section 1.1, we consider the one dimensional case and describe some of the approaches used to determine the exact distribution of the scan statistics, along with various approximations and bounds. In Section 1.2 and Section 1.3, we focus on the two and three dimensional scan statistics, respectively.

Contents
1.1 One dimensional scan statistics
    1.1.1 Exact results for binary sequences
          The combinatorial approach
          Markov chain imbedding technique
          Probability generating function methodology
    1.1.2 Approximations
    1.1.3 Bounds
1.2 Two dimensional scan statistics
    1.2.1 Approximations
    1.2.2 Bounds
1.3 Three dimensional scan statistics
    1.3.1 Product-type approximation

1.1 One dimensional scan statistics

There are many situations where an investigator observes an accumulation of events of interest and wants to decide whether such a realisation is due to chance or not. These types of problems belong to the class of cluster detection problems, where the basic idea is to identify regions that are unexpected or anomalous with respect to the distribution of events. Depending on the application domain, these anomalous agglomerations of events can correspond to a diversity of phenomena: for example, one

may want to find clusters of stars, deposits of precious metals, outbreaks of disease, minefield detections, defective batches of pieces, and many other possibilities. If such an observed accumulation of events exceeds a preassigned threshold, usually determined from a specified significance level corresponding to a normal situation (the null hypothesis), then it is legitimate to say that we have an unexpected cluster, and proper measures have to be taken accordingly. Searching for unusual clusters of events is of great importance in many scientific and technological fields, including DNA sequence analysis ([Sheng and Naus, 1994], [Hoh and Ott, 2000]), brain imaging ([Naiman and Priebe, 2001]), target detection in sensor networks ([Guerriero et al., 2009], [Guerriero et al., 2010b]), astronomy ([Darling and Waterman, 1986], [Marcos and Marcos, 2008]), reliability theory and quality control ([Boutsikas and Koutras, 2000]), among many other domains. One of the tools used by practitioners to decide on the unusualness of such an agglomeration of events is the scan statistic. Basically, the tests based on scan statistics look for events that are clustered amongst a background of events that are sporadic.

Let 2 \le m_1 \le T_1 be two positive integers and X_1, \dots, X_{T_1} be a sequence of independent and identically distributed random variables with common distribution F_0. The one dimensional discrete scan statistics is defined as

    S_{m_1}(T_1) = \max_{1 \le i_1 \le T_1 - m_1 + 1} Y_{i_1},    (1.1)

where the random variables Y_{i_1} are the moving sums of length m_1 given by

    Y_{i_1} = \sum_{i=i_1}^{i_1 + m_1 - 1} X_i.    (1.2)

Usually, the statistical tests based on the one dimensional discrete scan statistics are employed when one wants to detect a local change in the signal within a sequence of T_1 observations, via testing the null hypothesis of uniformity, H_0, against a cluster alternative, H_1 (see [Glaz and Naus, 1991] and [Glaz et al., 2001]). Under H_0, the random observations X_1, \dots, X_{T_1} are i.i.d. distributed as F_0, while under the alternative hypothesis there exists a location 1 \le i_0 \le T_1 - m_1 + 1 such that the X_i, i \in \{i_0, \dots, i_0 + m_1 - 1\}, are distributed according to F_1 \neq F_0, and outside this region the X_i are distributed as F_0. We observe that whenever S_{m_1}(T_1) exceeds the threshold \tau, where the value of \tau is computed from the relation P_{H_0}(S_{m_1}(T_1) \ge \tau) = \alpha and \alpha is a preassigned significance level of the testing procedure, the generalized likelihood ratio test rejects the null hypothesis in favor of the clustering alternative (see [Glaz and Naus, 1991]). It is interesting to note that most of the research has been done for F_0 being a binomial, Poisson or normal distribution (see [Naus, 1982], [Glaz and Naus, 1991], [Glaz and Balakrishnan, 1999], [Glaz et al., 2001] or [Wang et al., 2012]). In Chapter 3, we present a new approximation method for the distribution of the discrete scan statistics, following the work of [Haiman, 2000], that can be evaluated no matter what distribution we have under the null hypothesis.
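Eqs. (1.1)-(1.2) translate directly into code. The sketch below is our own illustration (it is not part of the original text): it computes S_{m_1}(T_1) through cumulative sums and uses it in a naive Monte Carlo estimate of the exceedance probability; the function names and the Bernoulli example are ours.

```python
import numpy as np

def scan_statistic(x, m1):
    """One dimensional discrete scan statistic of Eq. (1.1): the maximum
    over the moving sums Y_{i1} of window length m1 defined in Eq. (1.2)."""
    csum = np.concatenate(([0.0], np.cumsum(np.asarray(x, dtype=float))))
    return (csum[m1:] - csum[:-m1]).max()   # Y_{i1}, i1 = 1, ..., T1 - m1 + 1

def naive_mc_pvalue(m1, T1, n, sampler, iters=10_000, seed=None):
    """Crude Monte Carlo estimate of P(S_{m1}(T1) >= n) under the null;
    'sampler' draws T1 i.i.d. observations from F0."""
    rng = np.random.default_rng(seed)
    hits = sum(scan_statistic(sampler(rng, T1), m1) >= n for _ in range(iters))
    return hits / iters

# Example: Bernoulli(0.1) observations, window length m1 = 10, threshold n = 5
print(naive_mc_pvalue(10, 500, 5, lambda rng, T: rng.binomial(1, 0.1, T)))
```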

In this section, we revisit some of the existing methods used to obtain the exact values or approximations of the distribution of the one dimensional discrete scan statistics.

1.1.1 Exact results for binary sequences

In this section, we consider that the random variables X_1, X_2, \dots, X_{T_1} are i.i.d. binary trials (Bernoulli model). To the best of our knowledge, there are three main approaches used for investigating the exact distribution of the one dimensional discrete scan statistics: the combinatorial method, the Markov chain imbedding technique (MCIT) and the conditional probability generating function method. We give a short description of each method in the subsequent sections.

The combinatorial approach

Let X_1, \dots, X_{T_1} be a sequence of i.i.d. 0-1 Bernoulli random variables of parameter p. The combinatorial method is based on a consequence of a Markov process result of [Karlin and McGregor, 1959], used for solving the ballot problem (see [Barton and Mallows, 1965, page 243]), and is briefly described in the following.

The probability distribution function of the one dimensional discrete scan statistics can be obtained, using the law of total probability, from the relation

    P(S_{m_1}(T_1) \le n) = \sum_{k=0}^{T_1} \binom{T_1}{k} p^k (1-p)^{T_1-k}\, P\left(S_{m_1}(T_1) \le n \,\Big|\, \sum_{i=1}^{T_1} X_i = k\right),    (1.3)

where we used the fact that X_1 \in \{0,1\} with P(X_1 = 1) = p, so that \sum_{i=1}^{T_1} X_i follows a binomial distribution with parameters T_1 and p.

In [Naus, 1974, Theorem 1] (see also [Glaz et al., 2001, Chapter 12]), the author presented a combinatorial formula for the conditional distribution of S_{m_1}(T_1) given the total number of successes in the T_1 trials, i.e. \sum_{i=1}^{T_1} X_i = k. Assuming that T_1 = m_1 L and partitioning the total number of trials into L disjoint groups of size m_1, we have

    P\left(S_{m_1}(T_1) \le n \,\Big|\, \sum_{i=1}^{T_1} X_i = k\right) = \frac{(m_1!)^L}{\binom{T_1}{k}} \sum_{\sigma \in \Gamma_n} \det(d_{i,j}),    (1.4)

where \sigma denotes a partition of the k successes into L numbers n_1, \dots, n_L such that n_i \ge 0 represents the number of successes in the i-th group. The set \Gamma_n denotes the collection of all partitions \sigma such that n_i \le n for each i \in \{1, \dots, L\}. The entries of the L \times L matrices (d_{i,j}) are determined from the formulas

    d_{i,j} = \begin{cases} 0, & \text{if } c_{i,j} < 0 \text{ or } c_{i,j} > m_1, \\ \dfrac{1}{c_{i,j}!\,(n+1-c_{i,j})!}, & \text{otherwise}, \end{cases}    (1.5)

with

    c_{i,j} = \begin{cases} (j-i)(n+1) - \sum\limits_{l=i}^{j-1} n_l + n_i, & \text{for } i < j, \\ (j-i)n + \sum\limits_{l=j}^{i-1} n_l, & \text{for } i \ge j. \end{cases}    (1.6)

It is important to emphasize that the evaluation of P(S_{m_1}(T_1) \le n) via the combinatorial method is complex and requires excessive computational time. The problem arises from the large number of terms in the set \Gamma_n and from the fact that, for each element of this set, one needs to evaluate the determinant of an L \times L matrix. As [Naus, 1982] noted, the expression in Eq. (1.3) can only be evaluated for small window sizes and moderate L. We include, for completeness, the formulas for the particular cases T_1 = 2m_1 and T_1 = 3m_1. Using the relations in Eq. (1.3) and Eq. (1.4), [Naus, 1982] gave the following closed form expressions for the distribution of the discrete scan statistics:

    P(S_{m_1}(2m_1) \le n) = F(n; m_1, p)^2 - n\, b(n+1; m_1, p)\, F(n-1; m_1, p) + m_1 p\, b(n+1; m_1, p)\, F(n-2; m_1-1, p),    (1.7)

    P(S_{m_1}(3m_1) \le n) = F(n; m_1, p)^3 - A_1 + A_2 + A_3 - A_4,    (1.8)

with

    A_1 = 2 b(n+1; m_1, p) F(n; m_1, p) \left[n F(n-1; m_1, p) - m_1 p F(n-2; m_1-1, p)\right],
    A_2 = 0.5\, b^2(n+1; m_1, p) \left[n(n-1) F(n-2; m_1, p) - 2(n-1) m_1 p F(n-3; m_1-1, p) + m_1(m_1-1) p^2 F(n-4; m_1-2, p)\right],
    A_3 = \sum_{r=1}^{n} b(2(n+1)-r; m_1, p)\, F^2(r-1; m_1, p),
    A_4 = \sum_{r=2}^{n} b(2(n+1)-r; m_1, p)\, b(r+1; m_1, p) \left[r F(r-1; m_1, p) - m_1 p F(r-2; m_1-1, p)\right],    (1.9)

and where b(s; t, p) and F(s; t, p) are the probability mass function and the cumulative distribution function of a binomial random variable B(t, p), that is

    b(s; t, p) = \binom{t}{s} p^s (1-p)^{t-s},    (1.10)

    F(s; t, p) = \sum_{i=0}^{s} b(i; t, p).    (1.11)

As we will see in Section 1.1.2, the foregoing relations are successfully used in the product type approximation of the scan statistics over a Bernoulli sequence.
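Eqs. (1.7)-(1.11) are straightforward to implement. The sketch below is our own transcription of these closed forms, with scipy.stats.binom supplying b and F; out-of-range arguments evaluate to zero, matching the conventions of the formulas.

```python
from scipy.stats import binom

def b(s, t, p):
    return binom.pmf(s, t, p)     # Eq. (1.10); zero outside 0 <= s <= t

def F(s, t, p):
    return binom.cdf(s, t, p)     # Eq. (1.11); zero for s < 0

def q_2m(n, m1, p):
    """Q_{m1}(2m1) = P(S_{m1}(2m1) <= n), Eq. (1.7)."""
    return (F(n, m1, p)**2 - n * b(n+1, m1, p) * F(n-1, m1, p)
            + m1 * p * b(n+1, m1, p) * F(n-2, m1-1, p))

def q_3m(n, m1, p):
    """Q_{m1}(3m1) = P(S_{m1}(3m1) <= n), Eqs. (1.8)-(1.9)."""
    A1 = 2 * b(n+1, m1, p) * F(n, m1, p) * (
        n * F(n-1, m1, p) - m1 * p * F(n-2, m1-1, p))
    A2 = 0.5 * b(n+1, m1, p)**2 * (
        n*(n-1) * F(n-2, m1, p) - 2*(n-1)*m1*p * F(n-3, m1-1, p)
        + m1*(m1-1) * p**2 * F(n-4, m1-2, p))
    A3 = sum(b(2*(n+1)-r, m1, p) * F(r-1, m1, p)**2 for r in range(1, n+1))
    A4 = sum(b(2*(n+1)-r, m1, p) * b(r+1, m1, p)
             * (r * F(r-1, m1, p) - m1*p * F(r-2, m1-1, p))
             for r in range(2, n+1))
    return F(n, m1, p)**3 - A1 + A2 + A3 - A4
```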

Markov chain imbedding technique

In this section, we briefly present the second approach used for finding the exact distribution of the one dimensional discrete scan statistics for binary trials, namely the Markov chain imbedding technique (MCIT for short). The method was developed by [Fu, 1986] and [Fu and Koutras, 1994] and has been successfully employed to derive the exact distribution associated with several runs statistics in either independent and identically distributed or Markov dependent trials. The two monographs of [Fu and Lou, 2003] and [Balakrishnan and Koutras, 2002] give a full account of the development and applications of the method and also provide many references in this research area.

The main idea behind the MCIT method, as the name suggests, is to imbed the random variable of interest into a Markov chain with an appropriate state space. Using the Chapman-Kolmogorov equation, the desired probability can then be computed via multiplication of the transition probability matrices of the imbedded Markov chain. Knowing whether the studied random variable is Markov chain embeddable is not an easy task. For some particular classes of runs in multistate trials (number of occurrences of a pattern, waiting time variables associated with a pattern, etc.), a general procedure called the forward-backward principle was introduced in [Fu, 1996] (see also [Fu and Lou, 2003, Chapter 4]). This procedure shows systematically how to imbed the random variable of interest into a Markov chain that carries all the necessary information.

In [Koutras and Alexandrou, 1995, Section 4c], the authors showed that the scan statistic S_{m_1}(T_1) is Markov chain embeddable; thus its distribution function P(S_{m_1}(T_1) < n) can be evaluated in terms of the transition probability matrix of the imbedded chain. The imbedded Markov chain is defined on a state space that, at each step t > m_1, keeps track of the occurrence of the event \{S_{m_1}(t) < n\} in the first t trials and of the last m_1 observations X_{t-m_1+1}, \dots, X_t. A similar methodology was adopted by [Wu, 2013], where the author defined the imbedded chain on the state space of all tuples containing the locations of the successes (1's), counting backward, in a window of size m_1 at each time t. The drawback of both approaches is that the state space of the imbedded chain is rather large (it is of size 2^{m_1}), so the transition matrix quickly becomes intractable even for moderate window sizes.

[Fu, 2001] (see also [Fu and Lou, 2003, Section 5.9]) suggested a different approach for finding the distribution of the scan statistics, which involves a smaller state space for the imbedded chain. The author expressed the distribution function P(S_{m_1}(T_1) < n) as the tail probability of a waiting time variable associated with a compound pattern. This type of random variable has been extensively studied and has been shown to be Markov chain embeddable (see [Fu and Chang, 2002] or [Fu and Lou, 2003, Chapter 5]). We describe succinctly both the approach proposed by [Fu, 2001] and the one in [Wu, 2013], in order of publication. For simplicity, we consider only the case of i.i.d. two state Bernoulli trials, the

case of homogeneous Markov trials being similar. Let X_1, \dots, X_{T_1} be i.i.d. 0-1 Bernoulli random variables with success probability P(X_1 = 1) = p. For a given window size m_1 and n, with 0 \le n \le m_1, we consider the following set of simple patterns:

    \mathcal{F}_{m_1,n} = \left\{\Lambda_i : \Lambda_1 = \underbrace{1 \cdots 1}_{n},\ \Lambda_2 = 1 0 \underbrace{1 \cdots 1}_{n-1},\ \dots,\ \Lambda_l\right\},    (1.12)

that is, the set of all simple patterns that start and end with a success (1), contain exactly n symbols 1 and have length at most m_1. As [Fu, 2001] showed, the number of elements of \mathcal{F}_{m_1,n} is given by the formula

    l = \sum_{j=0}^{m_1-n} \binom{n-2+j}{j}.    (1.13)

Considering the compound pattern \Lambda_{m_1,n} = \bigcup_{i=1}^{l} \Lambda_i, we observe that the scan statistic S_{m_1}(T_1) and the waiting time W(\Lambda_{m_1,n}) until we observe one of the simple patterns \Lambda_1, \dots, \Lambda_l are related by

    P(S_{m_1}(T_1) < n) = P(W(\Lambda_{m_1,n}) > T_1).    (1.14)

The state space of the imbedded chain can be written as

    \Omega = \{0, 1\} \cup \bigcup_{i=1}^{l} S(\Lambda_i),    (1.15)

where S(\Lambda_i) is the set of all sequential subpatterns of \Lambda_i. [Chang, 2002] (see also [Fu and Chang, 2002]) showed that the transition matrix of the Markov chain corresponding to the waiting time variable W(\Lambda_{m_1,n}) has the block form

    M_{m_1,n} = \begin{pmatrix} N_{m_1,n} & C \\ O & I \end{pmatrix},    (1.16)

where \Omega is the state space of the chain and has d elements, A = \{\alpha_1, \dots, \alpha_l\} denotes the set of absorbing states corresponding to the simple patterns \Lambda_1, \dots, \Lambda_l respectively, the first block row is indexed by \Omega \setminus A and the second by A, N_{m_1,n} is a (d-l) \times (d-l) matrix called the essential matrix, O is an l \times (d-l) zero matrix and I is the l \times l identity matrix. Given u, v \in \Omega, the transition probabilities of the imbedded chain \{Z_t\} are computed via

    p_{u,v} = P(Z_t = v \mid Z_{t-1} = u) = \begin{cases} q, & \text{if } X_t = 0,\ u \in \Omega \setminus A \text{ and } v = [u, 0] \in \Omega, \\ p, & \text{if } X_t = 1,\ u \in \Omega \setminus A \text{ and } v = [u, 1] \in \Omega, \\ 1, & \text{if } u \in A \text{ and } v = u, \\ 0, & \text{otherwise}, \end{cases}    (1.17)

where the notation v = [u, a] \in \Omega, a \in \{0, 1\}, means that v is the longest ending subsequence of observations that determines the status of forming the next pattern \Lambda_i from X_{t-m_1+1}, \dots, X_t (see [Fu and Lou, 2003, Theorem 5.2]). It follows from [Fu, 2001, Theorem 2] that the tail probability of the waiting time W(\Lambda_{m_1,n}) is given by

    P(W(\Lambda_{m_1,n}) > T_1) = \xi\, N_{m_1,n}^{T_1}\, \mathbf{1}',    (1.18)

where \xi = (q, p, 0, \dots, 0) is the initial distribution (q = 1 - p), N_{m_1,n} is the essential matrix and \mathbf{1}' is the transpose of the vector (1, \dots, 1).

The methodology proposed by [Wu, 2013] is different from the foregoing procedure and is worth mentioning since it leads to a recurrence relation for the distribution of the one dimensional scan statistics. As pointed out at the beginning of the section, the idea is to construct a Markov chain that keeps track of the locations of the successes, counting backward, in the window of size m_1 at each moment t. Hence, the state space of the imbedded chain \{Z_t\} is

    \Omega = \{0\} \cup \{w_1 \cdots w_j : w_s \in \{1, \dots, m_1\},\ w_1 < \cdots < w_j,\ j = 1, \dots, m_1\}.    (1.19)

Assume that at time t the Markov chain takes the value Z_t = w_1 \cdots w_j; then the (m_1 - w_s + 1)-th element, s \in \{1, \dots, j\}, of the realisation X_{t-m_1+1}, \dots, X_t is a success. Employing a base-2 relabeling of the state space \Omega,

    v = w_1 \cdots w_j \;\mapsto\; v_L = \sum_{i=1}^{j} 2^{w_i - 1} + 1,    (1.20)

the author showed ([Wu, 2013, Theorem 1]) that the distribution function of the scan statistics variable S_{m_1}(T_1) can be evaluated via

    P(S_{m_1}(T_1) < n) = \sum_{v:\, v_L \in A_n} P(Z_{T_1} = v) = \sum_{v_L \in A_n} a_{T_1}(v_L),    (1.21)

where A_n is the set of labels associated with the states that correspond to fewer than n successes within the m_1 trailing observations, and where the a_{T_1}(v_L) are computed by the recurrence relations

    a_t(v_L) = \begin{cases} q\, a_{t-1}\!\left(\frac{v_L+1}{2}\right) + q\, a_{t-1}\!\left(2^{m_1-1} + \frac{v_L+1}{2}\right), & v_L \text{ odd}, \\ p\, a_{t-1}\!\left(\frac{v_L}{2}\right) + p\, a_{t-1}\!\left(2^{m_1-1} + \frac{v_L}{2}\right), & v_L \text{ even}, \end{cases}    (1.22)

with t \in \{1, \dots, T_1\}, v_L \in \{1, \dots, 2^{m_1}\} and the initial conditions a_0(1) = 1, a_0(v_L) = 0 if v_L \neq 1, and a_t(v_L) = 0 if v_L \notin A_n. We note that in the above relations both the size of the scanning window and the length of the sequence play an important role, whereas in the approach presented by [Fu, 2001] only the size of the window influences the transition matrix. Nevertheless, due to its recursive aspect, the second method seems to compute the distribution of the scan statistics faster for small to moderate window sizes (m_1 \le 30) and moderate sequence lengths. For long sequences and small to moderate window sizes, the first method is preferable.
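To make the imbedding idea concrete, here is a small dynamic program of our own; it is neither the optimized construction of [Fu, 2001] nor the exact labeling of [Wu, 2013], but it propagates probability mass over the 2^{m_1-1} configurations of the last m_1 - 1 trials and kills every path in which a completed window holds at least n successes, so it returns P(S_{m_1}(T_1) < n) exactly for the Bernoulli model.

```python
from collections import defaultdict

def prob_scan_below(T1, m1, n, p):
    """Exact P(S_{m1}(T1) < n) for i.i.d. Bernoulli(p) trials via a DP over
    bitmasks of the trailing m1-1 observations (state space of size 2^(m1-1))."""
    q = 1.0 - p
    tail = (1 << (m1 - 1)) - 1        # keeps only the last m1-1 bits
    states = {0: 1.0}                 # empty history before the first trial
    for t in range(1, T1 + 1):
        nxt = defaultdict(float)
        for mask, prob in states.items():
            for bit, pr in ((0, q), (1, p)):
                window = (mask << 1) | bit      # the last min(t, m1) trials
                if t >= m1 and bin(window).count("1") >= n:
                    continue                    # window holds >= n successes: path excluded
                nxt[window & tail] += prob * pr
        states = nxt
    return sum(states.values())

# P(S_{m1}(T1) <= n) is then prob_scan_below(T1, m1, n + 1, p)
```

The O(T_1 2^{m_1}) cost of this naive version mirrors the remark above that the 2^{m_1} state space becomes intractable for large windows, which is precisely what the reduced constructions of [Fu, 2001] and [Wu, 2013] address.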

Remark. Recently, [Nuel, 2008a] and [Nuel, 2008b] proposed a new approach for constructing the optimal state space of the imbedded Markov chain. The method, called Pattern Markov Chain, is based on the theory of formal languages and automata (deterministic finite automata) and was successfully applied in the context of biological data.

Probability generating function methodology

Another method used for determining the exact distribution of the one dimensional discrete scan statistics is the conditional probability generating function (pgf) approach. This method provides a way of deriving the probability generating function of the variable of interest; differentiating it a number of times then yields the probability distribution function. The pgf method was introduced by [Feller, 1968, Chapter XIII] in the context of the theory of recurrent events and has been intensively used in research and education since then. It is also one of the main tools for investigating the exact distribution of runs and scans statistics and, as in the case of the Markov chain imbedding technique, can be applied both to independent and identically distributed and to Markov dependent trials. For example, [Ebneshahrashoob and Sobel, 1990] applied the method to determine the exact distribution of the sooner and later waiting times of some successions of events in Bernoulli trials, [Aki, 1992] (see also [Uchida, 1998]) used the same approach to generalize this problem to independent nonnegative integer-valued random variables, and [Uchida and Aki, 1995] extended the results to the Markov dependent binary case. [Balakrishnan et al., 1997] successfully applied the method to start-up demonstration tests under Markov dependence.

The basic idea behind the pgf approach is to construct a system of recurrence relations between conditional probability generating functions of the desired random variable, obtained by conditioning one step ahead from every condition. Then, by solving the resulting system of equations with respect to the conditional generating functions, one obtains the unconditional probability generating function (see for example [Han and Hirano, 2003] or [Chaderjian et al., 2012]). It is interesting to note that [Feller, 1968, Chapter XIV] exploited this idea for finding the generating function of the first passage time in the classical ruin problem. Even if in some cases the system of recurrence relations cannot be solved symbolically, due to the large dimension of the problem, in many cases the conditional probability generating functions can be written explicitly (for example in [Aki, 1992] or [Uchida and Aki, 1995]). Recently, [Ebneshahrashoob et al., 2004] and [Ebneshahrashoob et al., 2005], by utilizing sparse matrix computational tools, enlarged the range of applicability of the pgf method.

In [Ebneshahrashoob et al., 2005], the authors applied the method of conditional probability generating functions to derive the exact distribution of the one dimensional discrete scan statistics over a sequence of i.i.d. or Markov binary trials. In what follows, we give an outline of their results, considering, for simplicity, the case of i.i.d. observations. As we will see, both the pgf approach of

[Ebneshahrashoob et al., 2005] and the MCIT of [Fu, 2001] require roughly the same computational effort.

Let 2 \le m_1 \le T_1 be positive integers and X_1, \dots, X_{T_1} be a sequence of i.i.d. 0-1 Bernoulli random variables of parameter p. Let W_{m_1,n}, n \le m_1, denote the waiting time until we first observe at least n successes in a window of size m_1. Clearly (see [Glaz et al., 2001, Chapter 13]), the random variables S_{m_1}(T_1) and W_{m_1,n} satisfy the relation

    P(S_{m_1}(T_1) \ge n) = P(W_{m_1,n} \le T_1).    (1.23)

Thus, finding the exact distribution of the waiting time W_{m_1,n} automatically gives the exact distribution of the scan statistic S_{m_1}(T_1) via Eq. (1.23). To derive the distribution of W_{m_1,n}, we employ the pgf methodology. Let G(t) be the probability generating function of the waiting time variable,

    G(t) = E\left[t^{W_{m_1,n}}\right] = \sum_{k=0}^{\infty} t^k\, P(W_{m_1,n} = k),    (1.24)

and denote by G_1(t) and G_0(t),

    G_1(t) = \sum_{k=0}^{\infty} t^k\, P(W_{m_1,n} - 1 = k \mid X_1 = 1),    (1.25)

    G_0(t) = \sum_{k=0}^{\infty} t^k\, P(W_{m_1,n} - 1 = k \mid X_1 = 0),    (1.26)

the pgf's of the conditional distribution of the remaining waiting time given that the first trial was a success or a failure, respectively. We immediately observe (see [Feller, 1968, Chapter XIV, Section 4]) that the probability generating functions G(t), G_1(t) and G_0(t) satisfy the recurrence relation

    G(t) = p t G_1(t) + q t G_0(t),    (1.27)

where q = 1 - p is the failure probability. To take into account the realisations of W_{m_1,n}, we condition on the positions of the successes in the last m_1 trials. If k \le n and 1 \le x_1 < x_2 < \cdots < x_k < m_1, let G_{x_1,\dots,x_k}(t) denote the probability generating function of the waiting time given that there was a success x_1, x_2, \dots, x_k steps back, respectively, and no other success in the last m_1 trials. As in the case of Eq. (1.27), the pgf's G_{x_1,\dots,x_k}(t) verify the recurrences

    G_{x_1,\dots,x_k}(t) = p t\, G_{1,x_1+1,\dots,x_k+1}(t) + q t\, G_{x_1+1,\dots,x_k+1}(t).    (1.28)

We should note that the above approach resembles the procedure proposed by [Wu, 2013] in the Markov chain imbedding context. Note that Eqs. (1.27) and (1.28) lead to a system of linear equations in the conditional probability generating functions whose solution determines the pgf of

the waiting time W_{m_1,n}. If the pgf G(t) can be computed, then the distribution of the waiting time can be evaluated by differentiation via

    P(W_{m_1,n} = k) = \frac{G^{(k)}(0)}{k!}, \quad k = 1, 2, \dots    (1.29)

We remark that among the conditional pgf's that satisfy Eq. (1.28) there are several which are redundant and need to be eliminated. For example, if m_1 - x_1 < n - 1, then G_{x_1}(t) = G_0(t), since the success that occurred x_1 steps back can no longer influence the outcome of the waiting time variable (by the definition of W_{m_1,n}). Similarly, we have

    G_{x_1,\dots,x_k}(t) = G_{x_1,\dots,x_{k-1}}(t), \quad \text{if } m_1 - x_k < n - k,    (1.30)

    G_{x_1,\dots,x_k}(t) = 1, \quad \text{if } k = n.    (1.31)

Since for large values of the scanning window m_1 the resulting system of equations cannot be solved symbolically, [Ebneshahrashoob et al., 2005] proposed an alternative method to overcome this difficulty. The idea is to write the system in matrix form and to employ sparse matrix tools for the evaluation. Let \mathbf{G}(t) be the N \times 1 vector of pgf's

    \mathbf{G}(t) = \left(G(t), G_0(t), G_1(t), \dots, G_{m_1-n+1,\dots,m_1-1}(t)\right)',    (1.32)

where N is the number of non-constant pgf's G_{x_1,\dots,x_k}(t) remaining after applying the reduction rules in Eqs. (1.30) and (1.31). It is not hard to verify (see for example [Ebneshahrashoob et al., 2005] or [Gao and Wu, 2006]) that this number is equal to

    N = \binom{m_1}{n-1}.    (1.33)

Note that in the vector \mathbf{G}(t), the pgf G_{x_1,\dots,x_k}(t), k \ge 1, occupies the position given by the index

    N_{x_1,\dots,x_k} = 1 + \binom{m_1-n+k+1}{k} - \binom{m_1-n+k-x_1}{k} - \binom{m_1-n+k-x_2}{k-1} - \cdots - \binom{m_1-n+k-x_k}{1}.    (1.34)

We observe that the system of recurrence relations given by Eqs. (1.27) and (1.28) can be written in matrix form as

    \mathbf{G}(t) = t A\, \mathbf{G}(t) + t\, \mathbf{b},    (1.35)

where A is an N \times N matrix and \mathbf{b} is an N \times 1 vector. The entries of the matrix A and of the vector \mathbf{b} are determined from the recurrence relations in Eq. (1.28) and the elimination rules in Eqs. (1.30) and (1.31), used in order to verify whether a generated pgf is new, constant or equivalent to a previous one. For example, in Eq. (1.28), let N_{x_1,\dots,x_k} be the position of the pgf G_{x_1,\dots,x_k}(t) in the vector \mathbf{G}(t). If, after applying the reduction rules, G_{1,x_1+1,\dots,x_k+1}(t) = 1, then the N_{x_1,\dots,x_k}-th component of the vector \mathbf{b} is equal to p. If, on the contrary, none of the pgf's G_{1,x_1+1,\dots,x_k+1}(t) and G_{x_1+1,\dots,x_k+1}(t) is constant and, after the application of the reduction rules, they are transformed into G_{y_1,\dots,y_l}(t) and G_{z_1,\dots,z_r}(t) respectively, then

    A(N_{x_1,\dots,x_k}, N_{y_1,\dots,y_l}) = p,    (1.36)

    A(N_{x_1,\dots,x_k}, N_{z_1,\dots,z_r}) = q,    (1.37)

and A(i, j) = 0 otherwise. We note that the matrix A has at most two nonzero entries in each row, hence its sparse character. It is interesting to observe that, as the authors pointed out, the matrix A and the matrix N_{m_1,n} obtained from the MCIT of [Fu, 2001] are similar. A simple calculation shows that

    \frac{\mathbf{G}^{(k)}(0)}{k!} = A^{k-1} \mathbf{b},    (1.38)

hence P(W_{m_1,n} = k) is the first component of A^{k-1} \mathbf{b}.
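Once A and b have been assembled according to Eqs. (1.28), (1.30) and (1.31) (we do not reproduce the assembly here), Eq. (1.38) reduces the computation of the pmf of W_{m_1,n} to repeated matrix-vector products. A sketch, ours, assuming A and b are already available as numpy (or scipy.sparse) objects:

```python
import numpy as np

def waiting_time_pmf(A, b, k_max):
    """P(W_{m1,n} = k) for k = 1, ..., k_max via Eq. (1.38): the value is the
    first component of A^(k-1) b, obtained by iterating v <- A v."""
    v = np.asarray(b, dtype=float).copy()
    pmf = np.empty(k_max)
    for k in range(k_max):
        pmf[k] = v[0]     # first component of A^k b, i.e. P(W = k + 1)
        v = A @ v         # cheap: A has at most two nonzero entries per row
    return pmf

# By Eq. (1.23), P(S_{m1}(T1) >= n) = waiting_time_pmf(A, b, T1).sum()
```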

Remark. [Shinde and Kotwal, 2008] derived the exact distribution of the one dimensional scan statistics in a sequence of binary trials in a different fashion. They employed the conditional probability generating function method to determine the joint distribution of (M_{T_1,m_1,1}, \dots, M_{T_1,m_1,m_1}), where M_{T_1,m_1,k} is the number of overlapping m_1-tuples which contain at least k successes in a sequence of T_1 trials. The distribution of the scan statistic S_{m_1}(T_1) can then be obtained from the relation

    P(S_{m_1}(T_1) = n) = P(M_{T_1,m_1,n+1} = 0) - P(M_{T_1,m_1,n} = 0).    (1.39)

Since their formulas are rather long, we do not include them here.

1.1.2 Approximations

Due to the high complexity and the limited range of application of the exact formulas, a considerable number of approximations and bounds have been developed for the estimation of the distribution of the one dimensional discrete scan statistics, for example [Naus, 1982], [Glaz and Naus, 1991], [Chen and Glaz, 1997] or [Wang et al., 2012]. A full treatment of these results is presented in the reference books of [Glaz and Balakrishnan, 1999] and [Glaz et al., 2001]. In this section, we choose to describe the class of product type approximations since, in most cases, these are the most accurate ones.

Let X_1, \dots, X_{T_1} be a sequence of independent and identically distributed random variables with common distribution F_0. Assume that T_1 = L m_1 + l, with L \ge 3, m_1 \ge 2 and 0 \le l \le m_1 - 1, and denote the distribution function of the scan statistics by

    Q_{m_1}(T_1) = P(S_{m_1}(T_1) \le n).    (1.40)

[Naus, 1982] showed that the product type approximation

    P(S_{m_1}(T_1) \le n) \approx Q_{m_1}(2m_1 + l) \left[\frac{Q_{m_1}(3m_1)}{Q_{m_1}(2m_1)}\right]^{L-2}    (1.41)

is highly accurate over the entire range of the distribution. A slightly different product type approximation is given by

    P(S_{m_1}(T_1) \le n) \approx Q_{m_1}(2m_1 - 1) \left[\frac{Q_{m_1}(2m_1)}{Q_{m_1}(2m_1 - 1)}\right]^{T_1 - 2m_1 + 1}.    (1.42)

The foregoing relation is especially useful when one can evaluate Q_{m_1}(T) only for T < 3m_1. The intuition behind the approximation formula in Eq. (1.41) arises from writing the distribution of the scan statistics as the probability of the intersection of L 1-dependent events. Applying the product rule, we have

    P(S_{m_1}(T_1) \le n) = P\left(\bigcap_{i=1}^{L} E_i\right) = P(E_1)\, P(E_2 \mid E_1) \cdots P\left(E_L \,\Big|\, \bigcap_{i=1}^{L-1} E_i\right)
    = P(E_1 \cap E_2) \prod_{t=3}^{L-1} P\left(E_t \,\Big|\, \bigcap_{i=1}^{t-1} E_i\right)\, P\left(E_L \,\Big|\, \bigcap_{i=1}^{L-1} E_i\right),    (1.43)

where

    E_t = \begin{cases} \left\{\max\limits_{(t-1)m_1 + 1 \le i_1 \le t m_1 + 1} Y_{i_1} \le n\right\}, & \text{for } 1 \le t \le L-1, \\ \left\{\max\limits_{(L-1)m_1 + 1 \le i_1 \le (L-1)m_1 + l + 1} Y_{i_1} \le n\right\}, & \text{for } t = L. \end{cases}    (1.44)

As in [Naus, 1982], we use the Markov-like approximation for the conditional probabilities,

    P\left(E_t \,\Big|\, \bigcap_{i=1}^{t-1} E_i\right) \approx P(E_t \mid E_{t-1}), \quad 3 \le t \le L.    (1.45)

Since for 3 \le t \le L-1 the intersections E_{t-1} \cap E_t are stationary, due to exchangeability, we conclude that

    P(S_{m_1}(T_1) \le n) \approx P(E_1 \cap E_2) \left[\frac{P(E_1 \cap E_2)}{P(E_1)}\right]^{L-3} \frac{P(E_{L-1} \cap E_L)}{P(E_{L-1})}
    = Q_{m_1}(3m_1) \left[\frac{Q_{m_1}(3m_1)}{Q_{m_1}(2m_1)}\right]^{L-3} \frac{Q_{m_1}(2m_1 + l)}{Q_{m_1}(2m_1)}.    (1.46)

We observe that in order to use the product type approximations in Eqs. (1.41) and (1.42) we need to evaluate Q_{m_1}(2m_1 - 1), Q_{m_1}(2m_1) and Q_{m_1}(3m_1). For the particular case when F_0 is a Bernoulli distribution, [Naus, 1982] derived exact formulas for Q_{m_1}(2m_1) and Q_{m_1}(3m_1), as we saw in Section 1.1.1 (Eqs. (1.7)-(1.9)). For the binomial and Poisson distributions, [Karwe and Naus, 1997], based on a generating function approach, gave recursive formulas for the computation of Q_{m_1}(T) up to T \le 3m_1. For completeness, we include in Section A.1 of Appendix A the algorithm of [Karwe and Naus, 1997] for the evaluation of Q_{m_1}(2m_1 - 1) and Q_{m_1}(2m_1).
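The approximation of Eq. (1.42) needs only two local probabilities, which makes it easy to use with any null distribution F_0: when no exact recursion is available, Q_{m_1}(2m_1-1) and Q_{m_1}(2m_1) can simply be estimated by simulation. A sketch of ours along this route (it reuses the scan_statistic helper from the sketch in Section 1.1; for the Bernoulli case, the exact q_2m and q_3m above can be plugged into Eq. (1.41) instead):

```python
import numpy as np

def product_type_approx(m1, T1, n, sampler, iters=100_000, seed=None):
    """Approximation of Eq. (1.42):
    Q(T1) ~ Q(2m1-1) * [Q(2m1) / Q(2m1-1)]^(T1 - 2m1 + 1),
    with the two local probabilities estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    def q_hat(T):   # Q_{m1}(T) = P(S_{m1}(T) <= n) over 'iters' replications
        hits = sum(scan_statistic(sampler(rng, T), m1) <= n for _ in range(iters))
        return hits / iters
    q_a, q_b = q_hat(2 * m1 - 1), q_hat(2 * m1)
    return q_a * (q_b / q_a) ** (T1 - 2 * m1 + 1)
```

Note that the exponent T_1 - 2m_1 + 1 amplifies any simulation error in the estimated local probabilities, so the number of iterations must grow with T_1; this is one motivation for the importance sampling procedure of Chapter 3.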

Remark. Other approximations for the distribution of the one dimensional discrete scan statistics have been developed by [Haiman, 2007]. These results are presented in detail in Chapter 3.

1.1.3 Bounds

Several bounds have been proposed for the estimation of P(S_{m_1}(T_1) \le n). Basically, these inequalities can be divided into two classes, depending on the method employed in their derivation: product type bounds and Bonferroni type bounds. It is important to note that the inequalities in the first category are tighter than the ones provided by the Bonferroni approach, since the method of proof in this case takes into account the dependence structure of the moving sums Y_{i_1}. A detailed comparison between the two methodologies was given in [Glaz, 1990].

[Glaz and Naus, 1991], in the case of discrete variables, and later [Wang et al., 2012], for the continuous case, developed the following product type bounds for the distribution of the one dimensional scan statistics (see also [Glaz et al., 2001, Chapter 13]):

a) Lower bounds

    P(S_{m_1}(T_1) \le n) \ge Q_{m_1}(2m_1 - 1) \left[1 - \frac{Q_{m_1}(2m_1 - 1) - Q_{m_1}(2m_1)}{Q_{m_1}(2m_1 - 1)\, Q_{m_1}(2m_1)}\right]^{T_1 - 2m_1}, \quad T_1 \ge 2m_1,    (1.47)

    P(S_{m_1}(T_1) \le n) \ge Q_{m_1}(3m_1 - 1) \left[1 - \frac{Q_{m_1}(2m_1 - 1) - Q_{m_1}(2m_1)}{Q_{m_1}(3m_1 - 1)}\right]^{T_1 - 3m_1}, \quad T_1 \ge 3m_1.    (1.48)

b) Upper bounds

    P(S_{m_1}(T_1) \le n) \le Q_{m_1}(2m_1) \left[1 - Q_{m_1}(2m_1 - 1) + Q_{m_1}(2m_1)\right]^{T_1 - 2m_1}, \quad T_1 \ge 2m_1,    (1.49)

    P(S_{m_1}(T_1) \le n) \le Q_{m_1}(3m_1) \left[1 - Q_{m_1}(2m_1 - 1) + Q_{m_1}(2m_1)\right]^{T_1 - 3m_1}, \quad T_1 \ge 3m_1.    (1.50)

The quantities that appear in the above lower and upper bounds can be evaluated, in the case of discrete random variables, via the recurrence relations of [Karwe and Naus, 1997] (see Section A.1).
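Reading Eqs. (1.47)-(1.50) as stated, the bounds are cheap to evaluate once the four local probabilities are known; a small helper of ours:

```python
def product_type_bounds(q2a, q2b, q3a, q3b, m1, T1):
    """Bounds of Eqs. (1.47)-(1.50) on P(S_{m1}(T1) <= n), given
    q2a = Q(2m1-1), q2b = Q(2m1), q3a = Q(3m1-1), q3b = Q(3m1)."""
    lb1 = q2a * (1 - (q2a - q2b) / (q2a * q2b)) ** (T1 - 2 * m1)
    lb2 = q3a * (1 - (q2a - q2b) / q3a) ** (T1 - 3 * m1)
    ub1 = q2b * (1 - q2a + q2b) ** (T1 - 2 * m1)
    ub2 = q3b * (1 - q2a + q2b) ** (T1 - 3 * m1)
    return max(lb1, lb2), min(ub1, ub2)
```

Taking the best of each pair is legitimate since, for T_1 \ge 3m_1, all four inequalities hold simultaneously.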

We now describe briefly the Bonferroni type bounds for Q_{m_1}(T_1). Since, in general, Bonferroni inequalities deal with unions of events, the first step is to express the event of interest \{S_{m_1}(T_1) \le n\} through a union of events. Employing the notation used in Section 1.1.2, we can write

    P(S_{m_1}(T_1) \le n) = P\left(\bigcap_{i=1}^{L} E_i\right) = 1 - P\left(\bigcup_{i=1}^{L} E_i^c\right),    (1.51)

where T_1 = L m_1 + l and the E_i are given in Eq. (1.44). [Hunter, 1976] (see also [Worsley, 1982]), using graph theory arguments (a spanning tree structure), derived the following upper bound for the union of events:

    P\left(\bigcup_{i=1}^{L} E_i^c\right) \le \sum_{i=1}^{L} P(E_i^c) - \sum_{i=1}^{L-1} P(E_i^c \cap E_{i+1}^c)
    = (L-1)[1 - P(E_1)] - (L-2)[1 - 2P(E_1) + P(E_1 \cap E_2)] + P(E_L^c) - P(E_{L-1}^c \cap E_L^c),    (1.52)

which, substituted into Eq. (1.51), gives the following lower bound for the distribution of S_{m_1}(T_1):

    P(S_{m_1}(T_1) \le n) \ge Q_{m_1}(2m_1 + l) - (L-2)\left[Q_{m_1}(2m_1) - Q_{m_1}(3m_1)\right].    (1.53)

In Eqs. (1.52) and (1.53), based on the stationarity of the events E_i, we used the relations P(E_i) = P(E_1) = Q_{m_1}(2m_1) for 1 \le i \le L-1, P(E_i \cap E_{i+1}) = P(E_1 \cap E_2) = Q_{m_1}(3m_1) for 1 \le i \le L-2, P(E_L) = Q_{m_1}(m_1 + l) and P(E_{L-1} \cap E_L) = Q_{m_1}(2m_1 + l).

For the upper bound, we employ the inequality of [Dawson and Sankoff, 1967] to get

    P(S_{m_1}(T_1) \le n) \le 1 - \frac{2 S_1}{u + 1} + \frac{2 S_2}{u(u + 1)},    (1.54)

where

    S_1 = \sum_{i=1}^{L} P(E_i^c) = 1 + (L-1)\left[1 - Q_{m_1}(2m_1)\right] - Q_{m_1}(m_1 + l),    (1.55)

    S_2 = \sum_{1 \le i < j \le L} P(E_i^c \cap E_j^c)
    = \frac{1}{2}(L-1)(L-2)\left[1 - 2 Q_{m_1}(2m_1)\right] + (L-1)\left[1 - Q_{m_1}(2m_1) - Q_{m_1}(m_1 + l)\right]
    + \frac{1}{2}(L-2)(L-3)\left[Q_{m_1}(2m_1)\right]^2 + (L-2)\left[Q_{m_1}(2m_1)\, Q_{m_1}(m_1 + l) + Q_{m_1}(3m_1)\right] + Q_{m_1}(2m_1 + l),    (1.56)

and where u = 1 + \lfloor 2 S_2 / S_1 \rfloor. In the derivation of S_1 and S_2 we used the 1-dependence and the stationarity of the events E_i. In [Kwerel, 1975] (see also [Galambos and Simonelli, 1996, Inequality I7]) it is shown that the bound in Eq. (1.54) is the best possible in the class of linear inequalities of the form a_1 S_1 + a_2 S_2.
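The two Bonferroni type bounds combine the same few local probabilities; under our reading of Eqs. (1.53)-(1.56), a helper (ours):

```python
import math

def bonferroni_bounds(q2m, q3m, qml, q2ml, L):
    """Hunter lower bound (Eq. (1.53)) and Dawson-Sankoff upper bound
    (Eqs. (1.54)-(1.56)) for P(S_{m1}(T1) <= n), T1 = L*m1 + l, given
    q2m = Q(2m1), q3m = Q(3m1), qml = Q(m1+l), q2ml = Q(2m1+l)."""
    lower = q2ml - (L - 2) * (q2m - q3m)
    s1 = 1 + (L - 1) * (1 - q2m) - qml
    s2 = (0.5 * (L - 1) * (L - 2) * (1 - 2 * q2m)
          + (L - 1) * (1 - q2m - qml)
          + 0.5 * (L - 2) * (L - 3) * q2m ** 2
          + (L - 2) * (q2m * qml + q3m)
          + q2ml)
    u = 1 + math.floor(2 * s2 / s1)
    upper = 1 - 2 * s1 / (u + 1) + 2 * s2 / (u * (u + 1))
    return lower, upper
```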

A slightly better result can be attained if one employs the more recent inequality of [Kuai et al., 2000]:

    P(S_{m_1}(T_1) \le n) = 1 - P\left(\bigcup_{i=1}^{L} E_i^c\right)
    \le 1 - \sum_{i=1}^{L} \left[\frac{\theta_i\, P(E_i^c)^2}{\sum_{j=1}^{L} P(E_i^c \cap E_j^c) + (1 - \theta_i)\, P(E_i^c)} + \frac{(1 - \theta_i)\, P(E_i^c)^2}{\sum_{j=1}^{L} P(E_i^c \cap E_j^c) - \theta_i\, P(E_i^c)}\right],    (1.57)

where

    \theta_i = \frac{\sum_{j \ne i} P(E_i^c \cap E_j^c)}{P(E_i^c)} - \left\lfloor \frac{\sum_{j \ne i} P(E_i^c \cap E_j^c)}{P(E_i^c)} \right\rfloor    (1.58)

and \lfloor x \rfloor is the integer part of x. [Kuai et al., 2000] showed that the bound in Eq. (1.57) is tighter than the one in Eq. (1.54) of [Dawson and Sankoff, 1967] and involves basically the same computational complexity. As before, denoting \bar{Q}_{m_1}(T) = 1 - Q_{m_1}(T), we have

    P(E_i^c) = \begin{cases} \bar{Q}_{m_1}(2m_1), & \text{if } i \le L-1, \\ \bar{Q}_{m_1}(m_1 + l), & \text{if } i = L, \end{cases}    (1.59)

and

    \sum_{j=1}^{L} P(E_i^c \cap E_j^c) = \begin{cases} (L-3)\, \bar{Q}_{m_1}(2m_1)^2 + \bar{Q}_{m_1}(2m_1)\left[1 + \bar{Q}_{m_1}(m_1 + l)\right] + P(E_1^c \cap E_2^c), & \text{for } i = 1, \\ (L-4)\, \bar{Q}_{m_1}(2m_1)^2 + \bar{Q}_{m_1}(2m_1)\left[1 + \bar{Q}_{m_1}(m_1 + l)\right] + 2\, P(E_1^c \cap E_2^c), & \text{for } 2 \le i \le L-1, \\ (L-2)\, \bar{Q}_{m_1}(2m_1)\, \bar{Q}_{m_1}(m_1 + l) + \bar{Q}_{m_1}(m_1 + l) + P(E_{L-1}^c \cap E_L^c), & \text{for } i = L, \end{cases}    (1.60)

where P(E_1^c \cap E_2^c) = 1 - 2 Q_{m_1}(2m_1) + Q_{m_1}(3m_1) and P(E_{L-1}^c \cap E_L^c) = 1 - Q_{m_1}(2m_1) - Q_{m_1}(m_1 + l) + Q_{m_1}(2m_1 + l). All the unknown quantities that appear in Eqs. (1.55), (1.56), (1.59) and (1.60) can be evaluated via the [Karwe and Naus, 1997] algorithm, for discrete variables, or by simulation.

1.2 Two dimensional scan statistics

The two dimensional discrete scan statistic was introduced in [Chen and Glaz, 1996] and extends, in a natural way, the one dimensional case.

Let T_1, T_2 be positive integers, let R = [0, T_1] \times [0, T_2] be a rectangular region and let \{X_{i,j} : 1 \le i \le T_1,\ 1 \le j \le T_2\} be a family of independent and identically distributed random variables. In many applications, practitioners have at their disposal only the counts of the observed events of interest in small subregions of the studied region. In such situations, the random variables X_{i,j} take nonnegative integer values and can be interpreted as the number of events observed in the elementary square sub-region [i-1, i] \times [j-1, j].

Let m_1, m_2 be positive integers such that 2 \le m_1 \le T_1 and 2 \le m_2 \le T_2. For 1 \le i_1 \le T_1 - m_1 + 1 and 1 \le i_2 \le T_2 - m_2 + 1, define

    Y_{i_1,i_2} = Y_{i_1,i_2}(m_1, m_2) = \sum_{i=i_1}^{i_1+m_1-1} \sum_{j=i_2}^{i_2+m_2-1} X_{i,j}    (1.61)

to be the random variable which counts the number of observed events in the rectangular region R(i_1, i_2) = [i_1 - 1, i_1 + m_1 - 1] \times [i_2 - 1, i_2 + m_2 - 1], comprised of m_1 m_2 adjacent elementary sub-regions. The two dimensional discrete scan statistic is defined as the largest number of events in any rectangular scanning window R(i_1, i_2) within the region R, i.e.

    S_{m_1,m_2}(T_1, T_2) = \max_{\substack{1 \le i_1 \le T_1 - m_1 + 1 \\ 1 \le i_2 \le T_2 - m_2 + 1}} Y_{i_1,i_2}.    (1.62)

We denote the distribution of the two dimensional scan statistic over the region [0, T_1] \times [0, T_2] by Q_{m_1,m_2}(T_1, T_2) = P(S_{m_1,m_2}(T_1, T_2) \le n). When the parameters m_1, m_2 and n are clearly understood, we abbreviate this to Q(T_1, T_2). The aim of this section is to review some of the existing formulas used in the estimation of Q_{m_1,m_2}(T_1, T_2). We consider the particular cases of i.i.d. Bernoulli, binomial and Poisson observations. For an overview of the methods and the potential applications of the two dimensional scan statistics, one can refer to the monographs of [Glaz et al., 2001, Chapter 16] and, more recently, of [Glaz et al., 2009, Chapter 6].
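Eqs. (1.61)-(1.62) again translate directly into code; the sketch below (ours) evaluates all window sums at O(1) cost each through a two dimensional cumulative-sum (summed-area) table:

```python
import numpy as np

def scan_statistic_2d(x, m1, m2):
    """Two dimensional discrete scan statistic of Eq. (1.62): the maximum of
    the m1 x m2 rectangular window sums Y_{i1,i2} of Eq. (1.61)."""
    x = np.asarray(x, dtype=float)
    c = np.zeros((x.shape[0] + 1, x.shape[1] + 1))
    c[1:, 1:] = x.cumsum(axis=0).cumsum(axis=1)   # summed-area table, zero border
    y = c[m1:, m2:] - c[:-m1, m2:] - c[m1:, :-m2] + c[:-m1, :-m2]
    return y.max()

# Example: Poisson(0.25) counts on a 100 x 100 grid, 5 x 5 scanning window
rng = np.random.default_rng(0)
print(scan_statistic_2d(rng.poisson(0.25, size=(100, 100)), 5, 5))
```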

1.2.1 Approximations

In this section, we present a series of product-type approximations for the distribution of the two dimensional scan statistics. These approximations are accurate and can be employed for all parameters T_1, T_2, m_1, m_2 and n. Although many of the formulas that appear below were given in the particular situation T_1 = T_2 and m_1 = m_2, we include them here in their general form.

Consider the case when the X_{i,j} are i.i.d. 0-1 Bernoulli random variables of parameter p. In [Boutsikas and Koutras, 2000], the authors derived, using the Markov Chain Imbedding Technique, the following approximation for the distribution of the two dimensional scan statistics:

    P(S_{m_1,m_2}(T_1, T_2) \le n) \approx \frac{Q(m_1, m_2)^{(T_1-m_1+1)(T_2-m_2+1)}}{Q(m_1, m_2+1)^{(T_1-m_1+1)(T_2-m_2)}} \cdot \frac{Q(m_1+1, m_2+1)^{(T_1-m_1)(T_2-m_2)}}{Q(m_1+1, m_2)^{(T_1-m_1)(T_2-m_2+1)}}.    (1.63)

We employ the same notation as in Section 1.1.1 for the pmf and the cdf of the binomial distribution, namely b(s; t, p) and F(s; t, p) of Eqs. (1.10) and (1.11). The quantities that appear in the approximation given by Eq. (1.63) can be computed via

    Q(m_1, m_2) = F(n; m_1 m_2, p),    (1.64)

    Q(m_1+1, m_2) = \sum_{s=0}^{n} F^2(n-s; m_2, p)\, b(s; (m_1-1) m_2, p),    (1.65)

    Q(m_1+1, m_2+1) = \sum_{s_1,s_2=0}^{n} \sum_{t_1,t_2=0}^{n} \sum_{i_1,i_2,i_3,i_4=0}^{1} b(s_1; m_1-1, p)\, b(s_2; m_1-1, p)\, b(t_1; m_2-1, p)\, b(t_2; m_2-1, p)\, p^{i_1+i_2+i_3+i_4} (1-p)^{4-(i_1+i_2+i_3+i_4)}\, F(u; (m_1-1)(m_2-1), p),    (1.66)

where in the last relation u is given by

    u = \min\{n - s_1 - t_1 - i_1,\ n - s_2 - t_1 - i_2,\ n - s_1 - t_2 - i_3,\ n - s_2 - t_2 - i_4\}.

In the case of independent and identically distributed nonnegative integer valued observations, based on the [Boutsikas and Koutras, 2000] approach and a Markov type approximation (see [Glaz et al., 2001] for the Bernoulli case), [Chen and Glaz, 2009] proposed the following product-type approximation:

    P(S_{m_1,m_2}(T_1, T_2) \le n) \approx \frac{Q(m_1+1, m_2+1)^{(T_1-m_1)(T_2-m_2)}}{Q(m_1+1, m_2)^{(T_1-m_1)(T_2-m_2+1)}} \cdot \frac{Q(m_1, 2m_2-1)^{(T_1-m_1+1)(T_2-2m_2+2)}}{Q(m_1, 2m_2)^{(T_1-m_1+1)(T_2-2m_2+1)}}.    (1.67)

In the framework of the binomial and Poisson models for the underlying random field, the probabilities Q(m_1, 2m_2-1) and Q(m_1, 2m_2) can be evaluated by adapting the algorithm developed by [Karwe and Naus, 1997] (see Section A.1). The other two unknown quantities, Q(m_1+1, m_2) and Q(m_1+1, m_2+1), can be computed via a conditioning argument like the one presented in [Guerriero et al., 2009] for the Poisson distribution. For completeness, we have included these formulas in Section A.2.1 of Appendix A.
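When no closed form is at hand, the local probabilities in Eqs. (1.63) and (1.67) can be estimated by simulation and then combined. A sketch of ours along these lines for Eq. (1.63), reusing scan_statistic_2d from above (the text instead evaluates the Q(.,.) exactly, via Eqs. (1.64)-(1.66) or the [Karwe and Naus, 1997] recursions):

```python
import numpy as np

def bk_product_approx(m1, m2, T1, T2, n, sampler, iters=10_000, seed=None):
    """Product-type approximation of Eq. (1.63), with the four local
    probabilities Q(.,.) estimated by Monte Carlo over 'iters' replications."""
    rng = np.random.default_rng(seed)
    def q_hat(t1, t2):
        hits = sum(scan_statistic_2d(sampler(rng, (t1, t2)), m1, m2) <= n
                   for _ in range(iters))
        return hits / iters
    return ((q_hat(m1, m2) ** ((T1-m1+1) * (T2-m2+1))
             * q_hat(m1+1, m2+1) ** ((T1-m1) * (T2-m2)))
            / (q_hat(m1, m2+1) ** ((T1-m1+1) * (T2-m2))
               * q_hat(m1+1, m2) ** ((T1-m1) * (T2-m2+1))))
```

As in one dimension, the large exponents amplify the simulation error of each estimated local probability, which is why exact or importance-sampling evaluations are preferable when available.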

38 22 Chapter 1. Existing methods for discrete scan statistics unknown quantities, Qm 1 + 1, m 2 and Qm 1 + 1, m 2 + 1, can be computed via a conditioning argument as the one presented in [Guerriero et al., 2009] for the Poisson distribution. For completeness we have included in Section A.2.1 of Appendix A these formulas. Remark In [Chen and Glaz, 1996] and [Glaz et al., 2001, Section ], the authors included two Poisson type approximations and a compound Poisson approximation. Their simulation study showed that the most accurate estimate, between the four compared, was the product type approximation. We should mention that despite the fact that these approximations are accurate, none of them give an order of this accuracy, that is there are no error bounds corresponding to these formulas. Expressing the scan statistics S m1,m 2 T 1, T 2 as the maximum of a 1-dependent sequence of random variables, [Haiman and Preda, 2006] derived an accurate approximation formula as well as associated error bounds. We present these results in a larger context in Chapter Bounds Bounds for the distribution of the two dimensional discrete scan statistics can be found only for particular situations. In [Boutsikas and Koutras, 2003], based on specic techniques used in reliability theory, the authors developed a series of bounds for the special case of Bernoulli observations. Their results are summarized in the following. If X i,j are i.i.d. Bernoulli random variables of parameter p, then a Lower bound QT 1, T 2 1 Q 1 T 1 m 1 T 2 m2 1 Q 2 T 1 m 1 1 Q 3 T 2 m 2 1 Q b Upper bound ] QT 1, T 2 1 A 1 [1 q m 1 13m 2 2+2m 1 1m 2 1 T1 m 1 1T 2 m 2 1 A 1 ] [1 q m 1m 2 1 T2 m 2 1 ] A 1 [1 q m 1 12m 2 1+m 1 1m 2 1 T1 m 1 1 A 1 ] [1 q m 1 12m 2 1+m 1 m 2 1 T1 m 1 ] A 2 [1 q m 1 12m 2 1+m 1 m 2 1 A 4 [ ] 1 q m 1 13m 2 2+m 1 m 2 1+m 1 1m 2 1 T2 m 2 A 3, 1.69 where q = 1 p and A 1 = F c k+1,m 1 m 2 q m 2 F c k+1,m 1 1m 2 q m 1 F c k+1,m 1 m q m 1+m 2 1 F c k+1,m 1 1m 2 1, 1.71 A 2 = F c k+1,m 1 m 2 q m 2 F c k+1,m 1 1m 2, 1.72 A 3 = F c k+1,m 1 m 2 q m 1 F c k+1,m 1 m 2 1, 1.73 A 4 = F c k+1,m 1 m 2, 1.74 F c i,m = 1 F i 1; m, p.

39 1.2. Two dimensional scan statistics 23 We should note that similar bounds were obtained by [Akiba and Yamamoto, 2005]. In the general case, [Chen and Glaz, 1996] proposed a Bonferroni type inequality for the lower bound of the distribution of the scan statistics. As their simulations showed, these bounds are not as sharp as one expect. Using [Hoover, 1990] Bonferroni type inequality of order r 3 P T1 m 1 +1 i 1 =1 T 2 m 2 +1 i 2 =1 T 1 m 1 +1 i 1 =1 T 1 m 1 +1 i 1 =1 A i1,i 2 T 2 m 2 i 2 =1 r 1 l=2 T 1 m 1 +1 i 1 =1 T 2 m 2 +1 i 2 =1 P A i1,i 2 T 1 m 1 P A i1,i 2 A i1,i 2 +1 T 2 m 2 +1 l i 2 =1 with A i1,i 2 = {Y i1,i 2 > n} and r = 4, we have i 1 =1 P A i1,1 A i1 +1,1 P A i1,i 2 A c i 1,i A c i 1,i 2 +l 1 A i 1,i 2 +l P S m1,m 2 T 1, T 2 n T 1 m 1 [Qm 1 + 1, m 2 2Qm 1, m 2 ] T 1 m 1 + 1T 2 m 2 3Qm 1, m T 1 m 1 + 1T 2 m 2 2Qm 1, m The unknown probabilities in Eq.1.76: Qm 1, m 2, Qm 1 + 1, m 2, Qm 1, m 2 + 2, Qm 1, m are evaluated via a conditional argument see Section A.2.1 or an adaptation of the algorithm of [Karwe and Naus, 1997]. For the upper bound we propose to adapt the inequality of [Kuai et al., 2000] to the two dimensional framework. Using the events A i1,i 2, dened above, we can write P S m1,m 2 T 1, T 2 n = 1 P 1 T1 m 1 +1 i 1 =1 T 1 m 1 +1 i 1 =1 T 2 m 2 +1 i 2 =1 T 2 m 2 +1 i 2 =1 A i1,i θ i 1,i 2 PA i1,i 2 2 Σi 1, i 2 θ i1,i 2 PA i1,i 2 [ θ i1,i 2 PA i1,i 2 2 Σi 1, i θ i1,i 2 PA i1,i 2 ], 1.77 where and θ i1,i 2 Σi 1, i 2 = can be computed from T 1 m 1 +1 j 1 =1 T 2 m 2 +1 j 2 =1 θ i1,i 2 = Σi 1, i 2 PA i1,i 2 P A i1,i 2 A j1,j Σi1, i PA i1,i 2

40 24 Chapter 1. Existing methods for discrete scan statistics We observe that if i 1 j 1 m 1 or i 2 j 2 m 2, then the events A i1,i 2 and A j1,j 2 are independent, thus P A i1,i 2 A j1,j 2 = [1 Qm 1, m 2 ] On the other hand, if i 1 j 1 < m 1 and i 2 j 2 < m 2, then P A i1,i 2 A j1,j 2 = 1 2Qm 1, m 2 + P Y i1,i 2 n, Y j1,j 2 n The last term in Eq.1.81 can be evaluated via a conditioning argument. For example, in the case of a binomial model with parameters r and p, denoting with Z the random variable Z = i 1 +m 1 1 j 1 +m 1 1 s=i 1 j 1 i 2 +m 2 1 j 2 +m 2 1 t=i 2 j 2 X s,t, 1.82 we have, due to the independence of Y i1,i 2 Z and Y j1,j 2 Z, that P Y i1,i 2 n, Y j1,j 2 n = n PZ = kpy i1,i 2 Z n k k=0 In Eq.1.83, the random variables Z, Y i1,i 2 Z are binomially distributed with parameters rm 1 i 1 j 1 m 2 i 2 j 2, p and r i 1 j 1 i 2 j 2 and p, respectively. Hence, to calculate the upper bound in Eq.1.77, it is enough to pre compute the m 1 m 2 matrix with the entries given by the probabilities P A i1,i 2 A j1,j 2, found above, for i 1 j 1 < m 1 and i 2 j 2 < m 2. Remark A dierent upper bound can be obtained from Eq.1.77 if one consider the events E i1,i 2 given by E i1,i 2 = max Y s1,s 2 n 1.84 i 1 1m 1 +1 s 1 i 1 m 1 +1 i 2 1m 2 +1 s 2 i 2 m 2 +1 instead of the events A i1,i 2. This approach generalizes the upper bound in the one dimensional case presented in Section Details about this alternative margin are given in Section A.2.2 of Appendix A. 1.3 Three dimensional scan statistics In this section, we extend the notion of the two dimensional discrete scan statistics dened in Section 1.2, to the three dimensional case. The three dimensional discrete scan statistics was introduced in [Guerriero et al., 2010a], where the authors derived a product-type approximation and three Poisson approximations for the 0 1 Bernoulli model. As the authors mention in the cited article, the most accurate estimate within the four is the product-type approximation. Based on

41 1.3. Three dimensional scan statistics 25 this observation, we include bellow the formula for the product type estimate in a somewhat general framework. Detailed expressions of the unknown quantities that appear in this formula are presented in Section A.3 of Appendix A. Let T 1, T 2, T 3 be positive integers, R = [0, T 1 ] [0, T 2 ] [0, T 3 ] be a rectangular region and {X i,j,k 1 i T 1, 1 j T 2, 1 k T 3 } be a family of independent and identically distributed integer valued random variables from a specied distribution. For each l {1, 2, 3}, consider the positive integers m l such that 1 m l T l, and dene the random variables Y i1,i 2,i 3 = i 1 +m 1 1 i=i 1 i 2 +m 2 1 j=i 2 i 3 +m 3 1 k=i 3 X i,j,k, 1 i l T l m l If the random variables X i,j,k are interpreted as the number of events observed in the elementary square subregion [i 1, i] [j 1, j] [k 1, k], then Y i1,i 2,i 3 counts the events observed in the rectangular region Ri 1, i 2, i 3 = [i 1 1, i 1 + m 1 1] [i 2 1, i 2 + m 2 1] [i 3 1, i 3 + m 3 1], comprised of m 1 m 2 m 3 adjacent elementary square subregions and with the southwest corner at the point i 1 1, i 2 1, i 3 1. The three dimensional discrete scan statistic is dened as the maximum number of events in any rectangle Ri 1, i 2, i 3 within the region R, S m1,m 2,m 3 T 1, T 2, T 3 = The distribution of the scan statistic, max 1 i j T j m j +1 j {1,2,3} Q m1,m 2,m 3 T 1, T 2, T 3 = P S m1,m 2,m 3 T 1, T 2, T 3 n, Y i1,i 2,i was successfully used in astronomy [Darling and Waterman, 1986], image analysis [Naiman and Priebe, 2001], reliability theory [Boutsikas and Koutras, 2000] and many other domains. An overview of the potential application of space-time scan statistics can be found in the monograph [Glaz et al., 2009, Chapter 6]. From a statistical point of view, the scan statistic S m1,m 2,m 3 T 1, T 2, T 3 is used for testing the null hypothesis of randomness that X ijk 's are independent and identically distributed according to some specied distribution. Under the alternative hypothesis there exists one cluster location where the X ijk 's have a larger mean than outside the cluster. As an example, in the Poisson model, the null hypothesis, H 0, assumes that X ijk 's are i.i.d. with X ijk P oisλ whereas the alternative hypothesis of clustering, H 1, assumes the existence of a rectangular subregion Ri 0, j 0, k 0 such that for any i 0 i i 0 +m 1 1, j 0 j j 0 +m 2 1 and k 0 k k 0 +m 3 1, X ijk are i.i.d. Poisson random variables with parameter λ > λ. Outside the region Ri 0, j 0, k 0, X ijk are i.i.d. distributed according to the distribution specied by the null hypothesis. The generalized likelihood ratio test rejects H 0 in favor of the local change alternative H 1, whenever S m1,m 2,m 3 T 1, T 2, T 3 exceeds the threshold τ determined from P S m1,m 2,m 3 T 1, T 2, T 3 τ H 0 = α and where α represents the signicance level of the testing procedure [Glaz et al., 2001, Chapter 13].

42 26 Chapter 1. Existing methods for discrete scan statistics Product-type approximation Consider that X i,j,k are independent and identically distributed nonnegative integer valued random variables. When the parameters involved in the distribution of the three dimensional scan statistics are clearly understood we abbreviate the notation to QT 1, T 2, T 3. Following the approach proposed in [Guerriero et al., 2010a] for the i.i.d. 0 1 Bernoulli model, we obtain the following product-type estimate: P S m1,m 2,m 3 T 1, T 2, T 3 n Qm 1 + 1, m 2 + 1, m T 1 m 1 T 2 m 2 T 3 m 3 Qm 1, m 2, m 3 T 1 m 1 1T 2 m 2 1T 3 m 3 1 Qm 1 + 1, m 2, m 3 T 1 m 1 1T 2 m 2 T 3 m 3 Qm 1, m 2 + 1, m T 1 m 1 T 2 m 2 1T 3 m 3 1 Qm 1, m 2 + 1, m 3 T 1 m 1 T 2 m 2 1T 3 m 3 Qm 1 + 1, m 2, m T 1 m 1 1T 2 m 2 T 3 m 3 1 Qm 1, m 2, m T 1 m 1 T 2 m 2 T 3 m 3 1 Qm 1 + 1, m 2 + 1, m 3 T 1 m 1 1T 2 m 2 1T 3 m Particularizing, for T 1 = T 2 = T 3 = N and m 1 = m 2 = m 3 = m we get the same approximation formula as in [Guerriero et al., 2010a, Eq. 6]. Explicit expressions for the unknown quantities in Eq.1.87 are given in Section A.3 of Appendix A for the binomial and Poisson model. Remark As far as we know, the only result on bounds for the distribution of the three dimensional scan statistics is the one of [Akiba and Yamamoto, 2004]. The authors, using techniques from reliability theory, obtained closed form lower and upper bounds for the Bernoulli model. Their formulas are complex and, even for moderate scanning window sizes, requires excessive computational time. We should mention that alternative bounds can be obtained extending the approach presented in Section to the three dimensional setting.

43 Chapter 2 Extremes of 1-dependent stationary sequences of random variables In this chapter, we present some results concerning the approximation of the distribution of the maximum and minimum of the rst n terms of a 1-dependent stationary sequence of random variables. These approximations extend the original ones developed in [Haiman, 1999], both in terms of range of applicability and in sharpness of the error bounds. We begin in Section 2.1 with some denitions and remarks concerning the m-dependent sequences of random variables. Next, we give the formulation of the problem and the intuition behind the proposed method. The description of the main results and some numerical aspects are presented in Section 2.2. We conclude the chapter with the proofs of the results in Section 2.3. Parts of the work considered in this chapter appeared in [Am rioarei, 2012] and make the object of an article submitted for publication. Contents 2.1 Introduction Denitions and notations Remarks about m-dependent sequences and block-factors Formulation of the problem and discussion Main results Haiman results New results Proofs Technical lemmas Proof of Theorem Proof of Corollary Proof of Theorem Proof of Corollary Proof of Theorem Proof of Theorem Proof of Proposition

44 28 Chapter 2. Extremes of 1-dependent stationary sequences 2.1 Introduction We consider the problem of approximating the distribution of the extremes maxima and minima of 1-dependent stationary sequences of random variables Denitions and notations We say that a sequence of random variables is m-dependent if observations separated by m units are stochastically independent and is stationary in the strong sense whenever its nite dimensional distributions are invariant under time shifts. To be more precise we have the following denitions: Denition The sequence W k k 1 of random variables is m-dependent, m 1, if for any h 1 the σ-elds generated by {W 1,..., W h } and {W h+m+1,... } are independent. Denition The sequence W k k 1 of random variables is stationary in the strong sense if for all n 1, for all h 0 and for all t 1,..., t n the families {W t1, W t2,..., W tn } and {W t1 +h, W t2 +h,..., W tn+h} have the same joint distribution. A large class of m-dependent sequences is constructed based on the block-factor terminology. The following denition describes the notion of l block-factor see also [Burton et al., 1993]: Denition The sequence W k k 1 of random variables with state space S W is said to be l block-factor of the sequence Y k k 1 with state space S Y if there is a measurable function f : SY l S W such that for all k. W k = f Y k, Y k+1,..., Y k+l Remarks about m-dependent sequences and block-factors The study of m-dependent sequences can be regarded as a dierent model for dependence besides the well known Markov model. Even if the later has been investigated thoroughly for a long time, not so much can be said about the former model. Maybe the most common examples of m-dependent sequences are obtained from m + 1 block-factors of independent and identically distributed sequences of random variables. In [Ibragimov and Linnik, 1971], the authors conjectured that the converse is not true; they armed without giving an example that there are m- dependent sequences which are not m + 1 block-factor of any i.i.d. sequence. Progress in this direction was made by [Aaronson et al., 1989], who showed that there exist two valued 1-dependent processes which cannot be expressed as a 2 block-factor of an i.i.d. sequence. In [Aaronson and Gilat D., 1992] it is shown that

45 2.1. Introduction 29 a stationary 1-dependent Markov chain with no more than four states can be represented by a 2 block-factor. Their result is sharp in the sense that there is an example of such a sequence with ve states which is not a 2 block-factor. In particular, [de Valk, 1988] proved that a two valued Markov chain becomes an i.i.d. sequence if in addition it is stationary and 1-dependent. This result was extended by [Matus, 1998, Lemma2] who gave a characterization of binary sequences which are Markov of order n and m-dependent. The author also presented a necessary and sucient condition for the existence of a m-dependent Markov chain with a nite state space see [Matus, 1998, Proposition 1]. In [Burton et al., 1993], the authors gave an example of a four state 1-dependent process which is not a l block-factor for any l 2, thus conrming the conjecture. More details about 1-dependent processes can be found in [de Valk, 1988] and [Goulet, 1992]. To ask whether a sequence of m-dependent random variables is a m + 1 blockfactor or not, seems to be a natural question. From the best of our knowledge there is no general result concerning this problem. Nevertheless, some partial answers can be found in the literature. In [de Valk and Ruschendorf, 1993] was obtained a regression representation for a particular class of m-dependent sequences. In the same paper, the authors presented a constructive method to check if such a sequence can be viewed as a monotone m+1 block-factor. In [Broman, 2005] it is shown that 1-dependent trigonometric determinantal processes are 2 block-factors. Another class of processes that can be expressed as a m + 1 block-factor is represented by stationary m-dependent Gaussian sequences. The following proposition justies this armation: Proposition Let W k k Z be a stationary m-dependent Gaussian sequence of random variables such that EW 0 = µ. Then there exists a sequence a k m k=0 such that m d W k k Z = µ + a i η k i, 2.1 i=0 k Z where η j j Z are i.i.d. standard normal random variables. Generalizations of the foregoing result, concerning m-dependent stationary innite divisible sequences, can be found in [Harrelson and Houdre, 2003] Formulation of the problem and discussion The problem that we treat in this chapter can be formulated as follows. Let X n n 1 be a strictly stationary sequence of 1-dependent random variables with marginal distribution function F x = PX 1 x and take x such that For each n 1, we dene the sequences inf{u F u > 0} < x < sup{u F u < 1}. p n = p n x = Pmin{X 1, X 2,..., X n } > x, 2.2 q n = q n x = Pmax{X 1, X 2,..., X n } x. 2.3

46 30 Chapter 2. Extremes of 1-dependent stationary sequences Our aim in this work is to nd good estimates for p n and q n together with the corresponding error margins. Asymptotic results for extremes of m-dependent sequences of random variables were obtained by many authors among which we can mention [Watson, 1954], [Newell, 1964], [Galambos, 1972] and [Flak and Schmid, 1995]. The approach we are using to attain these estimates diers from the ones presented in the cited papers and it has its origin in the paper of [Haiman, 1999]. To get the intuition behind the method presented in this work, consider q 1 = q 0 = 1 and take Dz = D x z = q k 1 z k 2.4 the generating function of the sequence q k k 0. We observe that this generating function depends on x, since the sequence q k k 0 does. Suppose that, for example, the function Dz has the following expression: Then it is easy to see that k=0 Dz = 1 + Az 1 z, A λ q k z k+1 = k=0 k=0 A λ k zk and therefore q n = A λ n. The following example shows that there are situations where the assumption made in Eq.2.5 is valid. Example Let X n n 1 be a sequence of i.i.d. random variables. In particular, the sequence is 1-dependent. Then one can easily show that z Dz = p 1 z which gives λ = 1 1 p 1 and q n = 1 p 1 n = 1 λ n. Of course that the situation presented in Example is an extreme one and such an assumption is not feasible in the general context. Perhaps, the next best thing that we can hope for is to have an asymptotic relation in the neighborhood of the singularity λ, namely Dz 1 + Az 1 z. 2.7 λ In which case, one can wish to obtain a relation of the form q n A λ n. 2.8 The above observations lead to the problem of showing that the generating function Dz has a nice singularity. It is obvious that the problem of studying the singularities of D is related to the analysis of the zeros of the function 1 D. Now, if we consider the generating function associated to the sequence { 1 k p k 1 } k 0, Cz = C x z = k p k 1 z k 2.9 k=1

47 2.1. Introduction 31 where p 1 = p 0 = 1, one can prove that see Lemma Dz = 1 Cz This last relation gives us the means to analyze the singularity of D by focusing our attention on the zeros of the generating function Cz. As it turns out, to analyze the zeros of the function C is much easier because of two aspects: an elementary upper bound for the probabilities p n given in Eq.2.25 and the fact that C is an alternating series. We will nish this section with an example that illustrates the above ideas: Example Let U n n 1 be a sequence of i.i.d. 0 1 Bernoulli random variables with PU n = 1 = p. We dene for each n 1 the random variables X n = U n U n+1. It is clear that the sequence X n n 1 is stationary and 1-dependent, since is dened as a 2 block-factor of an i.i.d. sequence. Clearly p n = PX 1 = 1, X 2 = 1,..., X n = 1 = PU 1 = 1, U 2 = 1,..., U n+1 = 1 = p n+1, so the generating function in Eq.2.9 becomes Cz = z + pz pz. Solving the equation Cz = 0 we obtain the value of the zero λ = p. 2p 1 p Notice that λ belongs to an interval of the form 1, 1+u, which will be in accordance with the result presented in Theorem To get the value of q n, we rst notice that according to Lemma we have n q n = 1 n k p n k q k 1, n 0, p 0 = q 1 = q 0 = 1. k=0 By writing b n = q n p n and using the fact that p n = p n+1, we obtain the following second order recurrence: b n+1 = 1 p p which, after solving for q n, leads to the solution where b n + b n 1, b 0 = 1, b 1 = 1 p2 p q n = c c λn η n, η = 1 2p p 1 p and c = η+1 λ+η. For p small enough, it can be seen that the value of η 1 decrease much faster than of λ 1, so the main contribution to the value of q n is made by the rst term.

48 32 Chapter 2. Extremes of 1-dependent stationary sequences In the next section we will see that there exists a zero λ x of C x in an interval of the form 1, 1+u and the result in Theorem will give an estimate of this zero. We will conclude the section with an approximation and the corresponding error bound for the value of q n. 2.2 Main results Starting from the results obtained by [Haiman, 1999] and presented in Section bellow, we extend these results by enlarging their range of applicability and providing sharper error bounds. These new expressions will constitute the main tools in nding the approximation of the distribution of the scan statistics as it will be seen in the next chapters Haiman results In [Haiman, 1999], the author presented a series of results concerning the approximation of the distribution of the maxima of a stationary 1-dependent sequence of random variables. He started by showing that if the tail probability p 1 is small enough, then the series given in Eq.2.9 has an unique zero in the interval 1, 1+2p 1 and gave explicit an estimate of its value. The next theorem illustrates this result: Theorem For x such that 0 < p 1 x 0.025, C x z has a unique zero λx, of order of multiplicity 1, inside the interval 1, 1 + 2p 1, such that λ 1 + p 1 p 2 + p 3 p 4 + 2p p 2 2 5p 1 p 2 87p The relation between the zero λ found in the foregoing result and the probability of the maximum of the rst n elements of the sequence X n n 1 is given by the following theorem: Theorem We have q 1 = 1 p 1, q 2 = 1 2p 1 + p 2, q 3 = 1 3p 1 + 2p 2 + p 2 1 p 3 and for n > 3 if p , q n λ n 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 561p These theorems were successfully employed in a series of applications: the distribution of the maximum of the increments of the Wiener process [Haiman, 1999], extremes of Markov sequences [Haiman et al., 1995], the distribution of scan statistics, both in one dimensional see [Haiman, 2000, Haiman, 2007, Haiman and Preda, 2013] and two dimensional case see [Haiman and Preda, 2002, Haiman and Preda, 2006].

49 2.2. Main results New results In this section, we extend the theorems presented in Section but we keep their initial form. The reason behind this approach is that we want to have a reference for comparison. The results presented in subsequent will have parameterized error coecients depending on the tail probability p 1, which will prove to be much smaller than the initial ones. The advantage of the parameterized error coecients over the xed ones presented above become clear when the value of p 1 decreases toward zero and lead to sharper bounds. The following statement gives a parametric form of the Theorem 2.2.1, improving both the range of p 1 x, from to 0.1, and the error coecient: Theorem For x such that 0 < p 1 x 0.1, C x z has an unique zero λ = λx, of order of multiplicity 1, inside an interval of the form 1, 1 + lp 1, such that λ 1 + p 1 p 2 + p 3 p 4 + 2p p 2 2 5p 1 p 2 Kp 1 p 3 1, 2.13 where l = lp 1 > t 3 2 p 1, t 2 p 1 is the second root in magnitude of the equation p 1 t 3 t + 1 = 0 and Kp 1 is given by Kp 1 = 11 3p 1 1 p 1 + 2l1 + 3p lp 1 p 1 2 lp 1 1+lp 1 2 [1 p 1 1+lp 1 2 ] 3 1 2p lp 1 [1 p 1 1+lp 1 2 ] 2 Using the stationarity and the one dependence of the sequence X n n 1, we have the following: Corollary Let λ be dened as in Theorem 2.2.3, then λ 1 + p 1 p 2 + 2p 1 p p 1 Kp 1 p The choice between the estimate of λ given in Theorem and in Corollary is made according to the information available about the random sequence. If one does not have enough information to compute or simulate the values of p 3 and p 4, then the obvious estimate is the one given in Corollary To get a better grasp of the bounds in Theorem and Corollary 2.2.4, we present in Table 2.1, for selected values of p 1, the values taken by the coecients in Eq.2.13 and Eq.2.15: Notice that for p 1 = we are in the hypothesis of Theorem and Theorem The corresponding value for K in Table 2.1 shows that the error bound in Eq.2.13 is almost ve times smaller then the one in Eq Figure 2.1 presents the behaviour of the error coecient function Kp 1 as p 1 varies between 0 and 0.1. The xed value of the coecient in Theorem = 87 is also plotted, but only between 0 and Before presenting the parametric extension of Theorem 2.2.2, we should stop for a little to illustrate the estimate of the zero λ in the context of Example

50 34 Chapter 2. Extremes of 1-dependent stationary sequences p 1 l Kp p 1 Kp Table 2.1: Selected values for the error coecients in Theorem and Corollary Haiman error coefficient = 87 Our error coefficient Value of the error coefficient Tail probability p 1 Figure 2.1: The behaviour of the function Kp 1 Example Continuation of Example As we have shown previously, the value of the zero λ is equal with λ = p 1 + 4p 1 p and it belongs to an interval of the form 1, 1 + u. It is not hard to see that if one takes u = tp with t > p, then λ 1, 1 + tp. 1 p 2 We will verify only the estimate in Corollary 2.2.4, as it involves less computations. First, observe that the approximation of λ in Eq.2.15 can be rewritten as since p n = p n+1 for n 1. ν = 1 + p 1 p 2 + 2p 1 p 2 2 = 1 + p 2 1 p + 2p 4 1 p 2,

51 2.2. Main results 35 Using Taylor's formula with Lagrange form of the remainder for the function 1 + x, we have 1 + x = x 1 8 x x x ξ 9/2 x 5. Substituting the above expansion in the expression of λ for x = 4p 1 p, we obtain λ = 1 1 p p 1 p 2 + 2p2 1 p 3 5p3 1 p p4 1 p ξ 9/2, 4p where ξ 0, 1 p. Elementary calculations give us that [ λ ν = p p ξ 9/2 21 p p 5p2 + p 3 ] 1 p 4. We can see that λ ν is a multiple of p 2 1 = p4, which is in accordance with our result. Notice that the dierence between the estimate in Theorem and the one in Corollary is of the order p 4 1 p 2, but the rst result gives an accuracy of order p 6, while the second only of p 4. An analogue result to the one presented in Theorem is given bellow: Theorem Assume that x is such that 0 < p 1 x 0.1 and dene η = 1 + lp 1 with l = lp 1 > t 3 2 p 1 and t 2 p 1 the second root in magnitude of the equation p 1 t 3 t + 1 = 0. If λ = λx is the zero obtained in Theorem 2.2.3, then the following relation holds: q n λ n 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 Γp 1 p 3 1, 2.16 where Γp 1 = p 1 2 P p 1 + Ep 1, Kp 1 is given by Eq.2.14 and P p 1 = 3Kp p 1 + 3p 2 1[1 + p 1 + 3p Kp 1 p 3 1] + p 6 1K 3 p 1 + 9p p 1 + 3p , 2.17 Ep 1 = η5 [ p 1 η] 4 [1 + p 1 η 2] [ 1 + η + 1 3p 1 η 2] 21 p 1 η 2 4 [1 p 1 η 2 2 p 1 η η 2p 1 η ] As an immediate consequence of Theorem we have the following corollary which involves only the values of p 1 and p 2 : Corollary In the conditions of Theorem we have q n λ n 1 p p 1 Γp 1 p From Table 2.2 we can see that for p 1 = , the coecient in the error bound given by Eq.2.16 is about three times smaller then the one from Eq We continue the exposition with two theorems that answer the problem proposed in Section 2.1.3, namely nding an estimate for the maxima q n. The rst result

52 36 Chapter 2. Extremes of 1-dependent stationary sequences p 1 Γp p 1 Γp Table 2.2: Selected values for the error coecients in Theorem and Corollary presents an estimate for q n in terms of q 1, q 2, q 3 and q 4, while for the second only the expressions of q 1 and q 2 are involved. These two statements, especially the second one, constitutes the cornerstone for the approximation developed in the following chapter. As we will see, the estimates are very sharp for those values of x for which the tail probability is small. Combining the results obtained in Theorem and Theorem 2.2.6, we get the following approximation: Theorem Let x such that q 1 x 1 p 1 x 0.9. If Γ and K are the same as in Theorem 2.2.6, then q 6q 1 q q 3 3q 4 n 1 + q 1 q 2 + q 3 q 4 + 2q q2 2 5q 1q 2 n n 11 q 1 3, 2.20 with 1 = 1 q 1, n = K1 q 1 + Γ1 q n In the same fashion, combining the results from Corollary and Corollary 2.2.7, we get Theorem If x is such that q 1 x 1 p 1 x 0.9, then q 2q 1 q 2 n [1 + q 1 q 2 + 2q 1 q 2 2 ] n n 21 q 1 2, 2.22 with 2 = 2 q 1, n = [ n + K1 q 1 + Γ1 q ] 1 1 q n and Γ and K are as in Theorem We should mention that in [Haiman, 1999, Theorem 4], the author obtained a similar formula for the error bound in Eq.2.22, the only dierence being that 2 is replaced by H 2, where H 2 = 9 n n 1 q [ n1 q 1 2]. 2.24

53 2.2. Main results H 2 Value of the coefficients Length of the sequence n Figure 2.2: The relation between H 2 and 2 when 1 q 1 = and n varies As can be seen from Figure 2.2, when 1 q 1 is xed and takes the value the upper limit in Haiman's results and the length of the sequence n increases, the value of H 2 increases with n, while the value of 2 decreases. Also, one can note that the new error coecient is much smaller than the old one. In Figure 2.3 we present the behaviour of the coecient function 2 ; in Figure 2.3a we illustrate the three dimensional plot and in Figure 2.3b we include the level curves for dierent values of 1 q q 1 = q 1 = q = q = q = Value of the coefficient Value of the coefficient Probability 1 q Length of the sequence n Length of the sequence n a b Figure 2.3: Illustration of the coecient function 2 for dierent values of p 1 and n

54 38 Chapter 2. Extremes of 1-dependent stationary sequences 2.3 Proofs In this section, we present the proofs for the results described in this chapter. The proofs for the main results described in Section rely on a series of technical lemmas which will be stated in Section We conclude the section by giving the proof of Proposition presented at the end of Section Technical lemmas We consider the framework described in Section We begin by giving the following key upper bound for the probabilities of the minimum of the rst n terms of the sequence X n n 1. Using the stationarity and 1-dependence properties of the sequence we have p n = PX 1 > x, X 2 > x,..., X n > x PX 1 > x, X 3 > x,..., X n > x which gives the basic inequality = PX 1 > xpx 3 > x,..., X n > x = p 1 p n 2, p n p [ n+1 2 ] This bound will be used over and over again in our results. Recall that in Eq.2.9 we have dened the generating function Cz = C x z = k p k 1 z k The following lemmas will provide various estimates on the function Cz. Lemma The function C x z has a zero λ = λx in the interval 1, 1 + lp 1, where l = lp 1 = t 3 2 p 1 + ε, ε > 0 arbitrarily small and t 2 p 1 is the second root in magnitude of the equation p 1 t 3 t + 1 = 0. Proof. To show that Cz has a zero in the interval 1, 1+lp 1 it is enough to verify that C1 > 0 and C1 + lp 1 < 0. It easy to see that C1 > 0, since C1 = k p k 1 = p 1 p 2 + p }{{} 3 p }{{} k=1 0 0 For C1 + lp 1 we have: C1 + lp 1 = 1 + k=1 1 k p k lp 1 k k=1 = lp 1 + lp lp 1 2k [p 2k 1 p 2k 1 + lp 1 ] }{{} p 2k 1 p k 1 [1 + lp 1 2 p 1 ] k k=1 k=1

55 2.3. Proofs 39 From the denition of l and the relation 1 < t 2 p 1 < t 2 p 1 + ε < 1 3p1, we obtain t 2 = t 2 p 1 t 2 3 t 2 + t 2 t ε ε 2 t t 2 t 2 + ε + t 2 + ε 2 < 1 3p p p 1 = 1 p 1, which implies t 2 3 t ε + p 1ε < 0. Combining this last relation with the fact that t 2 is a root of the equation p 1 t 3 t + 1 = 0, we conclude that p 1 t ε 3 t ε + 1 < 0. The foregoing relation can be rewritten as 1 + lp 1 3 < l and since p lp 1 2 < lp 1 1+lp 1 < 1, the series in Eq.2.28 is convergent and we have C1 + lp 1 < 0. Notice that the choice of l in the statement of the lemma was made such that the inequality 1 + lp 1 3 < l holds. Lemma Consider the series R = The following inequality holds: 2kp 2k 1 2k + 1p 2k k=2 R 2p 2 1 k=2 [ 1 1 p ] p 1 Proof. To approximate R, we observe that since p 2k 1 p 2k 0, the series can be bounded by p 2k R 2k p 2k 1 p 2k. k=2 With the help of Eq.2.25, we obtain p2 1 1 p 1 = p k 1 R 2p 1 k=2 k=2 k=2 which implies the desired estimate for R. [ kp k 1 1 = 2p p p 1 ], 2.31 Lemma Take C z, the second derivative of the function Cz. Then for all z 1, 1 + lp 1, we have 2lp p lp [1 p lp 1 2 ] 3 4p lp 1 [1 p lp 1 2 ] p lp 1 2 C z 2p 1 [1 p lp 1 2 ] 3, where l = lp 1 is dened in Lemma

56 40 Chapter 2. Extremes of 1-dependent stationary sequences Proof. After derivation we have C z = 2p 1 2 3p 2 z + 3 4p 3 z 2 4 5p 4 z p 5 z 4... = 2k + 12k + 2p 2k+1 z 2k 2k + 22k + 3p 2k+2 z 2k k=0 Based on the estimate in Eq.2.25 and using the fact that z 1, 1 + lp 1, we nd the upper bound k=0 C z p 1 2k + 12k + 2z p 1 2k k=0 k=0 2p p lp 1 2 [1 p lp 1 2 ] Similarly, we obtain the following lower bound: C z lp 1 2k + 12k + 2p 2k+2 z 2k 4 k + 1p 2k+2 z 2k+1 k=0 k=0 lp 2 1 2k + 12k + 2z p 1 2k 4p 1 z k + 1p 1 z 2 k 2lp p lp [1 p lp 1 2 ] 3 4p lp 1 [1 p lp 1 2 ] From the upper and lower bounds in Eq.2.34 and Eq.2.35, respectively, we have the estimate C z 2lp p lp [1 p lp 1 2 ] 3 + 4p lp 1 [1 p lp 1 2 ] Lemma Let C z be the rst derivative of the function Cz. For all z 1, 1 + lp 1, we have the bounds C z 1 + 2p 1 1 p p lp 1 2 2lp2 1 [1 p lp 1 2 ] 3, 2.37 [ ] C z 1 2p lp [1 p lp 1 2 ] 2, 2.38 where l = lp 1 is dened in Lemma Proof. We will derive rst the upper bound for C z given by Using Lagrange theorem on the interval [1, 1 + a] with a lp 1, we get for θ 0, 1: We observe that k=0 C 1 + a = C 1 + ac 1 + θa C 1 = 1 + 2p 1 3p 2 + 4p 3 5p 4 + 6p 5 7p = 1 + 2p 1 3p 2 + R, 2.40

57 2.3. Proofs 41 where R is dened in Lemma Applying the estimate in Lemma for R and the upper bound for C z, derived in Lemma 2.3.3, we deduce C z 1 + 2p 1 1 p p lp 1 2 2lp2 1 [1 p lp 1 2 ] Notice that for the derivation of the foregoing expression we have also used the fact that a lp 1. To derive the second estimate, observe that C z 1 = 1 2p 1 z 3p 2 z 2 + 4p 3 z 3 5p 4 z p 1 z 3p 2 z 2 + 4p 3 z 3 5p 4 z , 2.42 where we used the inequality 1 x 1 x. Taking into account that z 1, 1 + lp 1 and denoting the expression inside the second absolute value in the denominator of Eq.2.42 with T, we have the upper margin In the same way, T = 2kp 2k 1 p 2k zz 2k 1 p 2k z 2k k=1 2p 1 p 2 z T kp 1 z 2 k 1 k=1 k=1 p 2k z 2k 2lp 1 kp 2k z 2k 1 k=1 k=1 k=1 p k 1z 2k 2lp 2 1z kp 1 z 2 k 1 k=1 2p 1 z 1 p 1 z p 1 z 2lp 1 + z1 p 1 z 2 1 p 1 z 2 2 > 2p 1z 1 p 1 z Combining the bounds in Eq.2.43 and Eq.2.44 along with z 1 + lp 1, leads to T 2p lp 1 [1 p lp 1 2 ] Substitute the bound from Eq.2.45 in Eq.2.42 to obtain the estimate C z p lp 1 [1 p 1 1+lp 1 2 ] 2 The following results will be used in the proof of Theorem Recall, from Section 2.1.3, that Dz = D x z = q k 1 z k k=0

58 42 Chapter 2. Extremes of 1-dependent stationary sequences The next lemma presents the relation between the generating functions Dz and Cz. As we saw in Section 2.1.3, this formula will constitute the key idea behind our approach. Lemma Let Cz and Dz be the generating functions dened by Eq.2.26 and Eq.2.47, respectively. The following relation holds: Proof. We dene the following series: Dz = 1 Cz Dz = 1 Cz = d k z k, 2.49 which exists since the free term in the series C is equal to 1. Based on the relation Cz Dz = 1, we deduce that d 0 = 1 and k=0 n 1 n j p n j 1 d j = 0, n j=0 Dene for each k = {0, 1,..., n} the events A k = {X 1 x,..., X k x, X k+1 > x,..., X n > x}. We observe that PA 0 = p n, PA n = q n and PA k + PA k 1 = p n k q k 1, k Multiplying Eq.2.51 with 1 k 1 and summing over k gives q n = n 1 n k p n k q k 1, n 0, q 1 = q 0 = k=0 Notice that from the above relation and Eq.2.50, after using mathematical induction, we obtain d n+1 = q n. From the denition of the series D we conclude that Dz = q k 1 z k, q 1 = q 0 = 1, 2.53 k=0 which is exactly the generating function Dz and the proof is complete. In the following lemma, we assume that the result in Theorem is known. Lemma We have q 3 λ 3 = 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 + OLp 1 p 3 1, 2.54 where Ox is a function such that Ox x, Lp 1 = p 1 2 P p 1 and P p 1 has the form P p 1 = 3Kp p 1 + 3p 2 1[1 + p 1 + 3p Kp 1 p 3 1] + p 6 1K 3 p 1 + 9p p 1 + 3p

59 2.3. Proofs 43 Proof. From Eq.2.13 of Theorem we can write λ = 1 + p 1 p 2 + p 3 p 4 + 2p p 2 2 5p 1 p 2 + OKp 1 p and raising to the third power we get λ 3 = 1 + ζ 1 + ζ ζ 1 + ζ 2 2 OKp 1 p OK 3 p 1 p 6 1p ζ 1 + ζ 2 OK 2 p 1 p 3 1p In the above formulas we have used the notations ζ 1 = p 1 p 2, 2.58 ζ 2 = p 3 p 4 + 2p p 2 2 5p 1 p From the denitions of ζ 1 and ζ 2, it is not hard to see that ζ 1 = Op 1 and ζ 2 p 1 p 1 p 2 + 2p 1 p 2 2 p 2 p 1 p 2 = O3p 2 1. From these observations we deduce that λ ζ 1 + ζ 2 3 = OSp 1 p 3 1, 2.60 with Sp 1 given bellow by Sp 1 = 31 + p 1 + 3p Kp 1 + 3p p 1 + 3p 2 1K 2 p 1 + p 6 1K 3 p Now observe that by expanding 1 + ζ 1 + ζ 2 3, we can write 1 + ζ 1 + ζ 2 3 = 1 + 3ζ 1 + ζ 2 + ζ ζ 1 ζ 2 + 3ζ ζ 1 ζ ζ 2 1ζ 2 + ζ ζ 3 2 = 1 + 3ζ 1 + ζ 2 + ζ O18p O27p 1 p O27p 2 1p O9p 1 p Op O27p 3 1p 3 1, 2.62 which along with Eq.2.60 and Eq.2.61 gives an expression for λ 3, λ 3 = 1 + 3ζ 1 + ζ 2 + ζ OP p 1 p The function P p 1, in Eq.2.63 above, is computed by the formula P p 1 = 3Kp p 1 + 3p 2 1[1 + p 1 + 3p Kp 1 p 3 1] + p 6 1K 3 p 1 + 9p p 1 + 3p Recall, from the previous lemma, that we have the recurrence q n = n 1 n k p n k q k 1, n 0, q 1 = q 0 = 1, k=0

60 44 Chapter 2. Extremes of 1-dependent stationary sequences which for n = 3 implies that q 3 = 1 p 1 2p 1 p 2 + p 2 1 p 3 = O1 p Now it is clear that from Eq.2.63 and Eq.2.65 we obtain q 3 λ 3 = q 3 [1 + 3ζ 1 + ζ 2 + ζ 2 1] + OP p 1 1 p 1 2 p The last step in our proof is to nd an approximation for the rst term on the right in Eq If we write q 3 [1 + 3ζ 1 + ζ 2 + ζ 2 1] = 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 + H, 2.67 then H veries H = 3p 1 p 2 { p 2 1 p 3 [3p 1 p p 2 ] + p 2 2 9p 1 p 2 2} 3p 3 p 4 [p 1 + 2p 1 p 2 p 2 1 p 3 ] = O36p Finally, combining Eq.2.66, Eq.2.67 and Eq.2.68 leads to q 3 λ 3 = 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 + OLp 1 p 3 1, 2.69 where we used the notation Proof of Theorem Lp 1 = p 1 2 P p We saw from Lemma that the series Cz has a zero λ = λx in the interval 1, 1 + lp 1, where l = lp 1 = t 3 2 p 1 + ε, ε > 0 arbitrarily small and t 2 p 1 is the second root in magnitude of the equation p 1 t 3 t + 1 = 0. To show that this zero is unique, we will prove that Cz is strictly decreasing on the interval 1, 1 + lp 1, i.e. C z < 0. In Lemma 2.3.4, we found the following upper bound for C z: C z 1 + 2p 1 1 p p lp 1 2 2lp2 1 [1 p lp 1 2 ] 3. We observe that the expression of the bound in the above relation is increasing in p 1 and since for p 1 = 0.1, l , we get C z < 0.7, which proves that C is strictly decreasing. Now we try to approximate the zero λ. From Lagrange theorem applied on the interval [1, λ], we have Cλ C1 = λ 1C u, with u 1, λ 1, 1 + lp 1. Since Cλ = 0, we get λ 1 = C1 C 2.71 u

61 2.3. Proofs 45 and taking µ = p 1 p 2 + p 3 p 4 + 2p p2 2 5p 1p 2 as in [Haiman, 1999], we obtain λ 1 + µ = C1 + µc u C u Applying Lagrange theorem one more time for C, as in the proof of Lemma 2.3.4, on the interval [1, 1 + a] with a lp 1, we get for θ 0, 1 Recall that with R dened by Lemma Eq.2.72 becomes C 1 + a = C 1 + ac 1 + θa. C 1 = 1 + 2p 1 3p 2 + R, After the proper substitutions, the relation in λ 1 + µ = C1 + µ 1 + 2p 1 3p 2 + µr + ac 1 + θa C, 2.73 u where a = u 1 and θ 0, 1. If we denote A = C1 + µ 1 + 2p 1 3p 2, then A = p 1 p 2 + p 3 p 4 + p 5 p µ + 2p 1 3p 2 µ = p 5 p p 1 p 2 2p 1 3p p 1 p 2 p 3 p 4 p 2 p 3 p Applying the inequality from Eq.2.25 in Eq.2.74, we have which gives us the estimate Notice that µ 0 and p 3 1 p 2 p 3 p 4 A k=2 C1 + µ 1 + 2p 1 3p 2 p 3 1 p k p p 3 1, µ = 1 p 2 p 1 p 2 + p 3 p 4 + 2p 1 p p 2 p 1 p 2 + p 1 p 1 p 2 + 2p 1 p 2 2 [ ] p 1 = 3p 1 p p 1 p 2 p p Recall that, from Lemma 2.3.4, we have the following bound for C z 1 : [ ] C z 1 2p lp [1 p lp 1 2 ] 2. Also, the bounds in Lemma lead to the estimate C z 2lp p lp [1 p lp 1 2 ] 3 + 4p lp 1 [1 p lp 1 2 ]

62 46 Chapter 2. Extremes of 1-dependent stationary sequences Substituting the estimate for R derived in Lemma 2.3.2, along with the bounds in Eqs.2.75, 2.76, 2.38 and 2.77 in Eq.2.73 and using that a lp 1, gives the approximation λ 1 + µ C1 + µ 1 + 2p 1 3p 2 + µ R + a C 1 + θa C u Kp 1 p where Kp 1 = 11 3p 1 1 p 1 + 2l1 + 3p lp 1 p 1 2 lp 1 1+lp 1 2 [1 p 1 1+lp 1 2 ] 3 1 2p lp 1 [1 p 1 1+lp 1 2 ] Proof of Corollary From Theorem we have λ 1 + p 1 p 2 + p 3 p 4 + 2p p 2 2 5p 1 p 2 Kp 1 p 3 1. To prove the statement in Corollary 2.2.4, we rst observe that 1+p 1 p 2 +p 3 p 4 +2p 2 1+3p 2 2 5p 1 p 2 = 1+p 1 p 2 +2p 1 p 2 2 +[p 3 p 4 p 2 p 1 p 2 ]. Using the 1-dependence of the sequence X n, we notice that which leads to the bound p 3 p 4 = PX 1 > x, X 2 > x, X 3 > x, X 4 x p 1 PX 1 > x, X 2 x = p1p 1 p 2, 2.80 p 3 p 4 p 2 p 1 p 2 p 3 p 4 + p 2 p 1 p 2 p 2 1. Combining the above relations, we conclude that λ 1 + p 1 p 2 + 2p 1 p 2 2 p Kp 1 p p 1 Kp 1 p Proof of Theorem From Lemma we know that Dz = 1 Cz. Taking λ to be zero dened in Theorem 2.2.3, we can write Cz = Uz 1 z λ n and observe that if we put Uz = u k z k, then k=0 k=0 n u k z k 1 z = k p k 1 z k, 2.81 λ k=1

63 2.3. Proofs 47 which shows that u n u n 1 λ = 1n p n 1, u 0 = 1, n Multiplying Eq.2.82 with λ n and summing over n gives u n = 1 + n 1 k p k 1 λ k k=1 If we denote with T z = 1 Uz = t k z k, then k=0 Dz λ n, n z λ = T z 2.84 and using the same arguments as in Lemma we get t 0 = 1 and so t n = q n 1 q n 2, n 1, 2.85 λ q n λ n+1 = t 0 + t 1 λ + + t n λ n + t n+1 λ n To obtain the desired result, we begin by giving an approximation of u n : n k p k 1 λ k 1 k p k 1 λ k k=1 k=n+1 u n = λn+1 λ n Since λ 1, 1 + lp 1, we have λ n Cλ=0 = λ n p n p n+1 λ + p n+2 λ 2 p n+3 λ λ p n p n+1 λ + p n+2 p n+3 λ λ p n p n lp 1 p n p n+1 λ p n p n+1, 2.88 which shows that p n p n+1 λ p n p n+1 1 lp Let h = 1 lp 1. Using in Eq.2.87 the fact that the sequence of probabilities p n n is decreasing and the bound from Eq.2.89, we obtain u n λ p n p n+1 h + p n+2 p n+3 hλ = p n + p n+2 λ 2 + p n+4 λ hp n+1 + p n+3 λ 2 + p n+5 λ W hp n+2 + p n+4 λ 2 + p n+6 λ = W h λ 2 W p n = W 1 hλ 2 + h λ 2 p n, 2.90

64 48 Chapter 2. Extremes of 1-dependent stationary sequences where, based on the estimate from Eq.2.25, W = p n + p n+2 λ 2 + p n+4 λ p [ n+1 2 ] 1 + p [ n+1 2 ] 1 p 1 λ 2 + p [ n+1 2 ] 1 p 2 1λ = p [ n+1 2 ] p 1 λ 2 + p 2 1λ = From Eq.2.90 and Eq.2.91, we conclude that [ n+1 u n 2 ] 1 p[ 1 λ 1 p 1 λ 2 + h λ 2 1 p[ n+1 2 ] 1 1 p 1 λ ] 1 1 p 1 λ 2 = 1 p n+1 1h 2 ] p[ 1 p 1 λ Until now we have an approximation for u n, but we still need one for t n and to solve this aspect let us write T z = 1 Uz = Uz = 1 n U 1 n, 2.93 n 0 which is true since the convergence of Cz implies z < 1 p1 and in turn gives 1 U λ1 p 1h 1 p 1 λ 2 p n 2 1 z n z λ p 1 [1 p 1 1 lp 1 ] 1 p 1 λ 2 1 z p 1 n 1 p1 1 + lp lp 1 2 [1 p lp 1 2 ] [ 1 ] < p lp 1 Since u 0 = 1, we have U 1 = n 1 u n z n and U 1 k = l 1 i 1 + +i k =l i j 1,j=1,k Combining Eq.2.93 with Eq.2.95, we obtain where t n z n = n 0 b l,k = 1 k k=0 i 1 + +i k =l i j 1,j=1,k Identifying the coecients in Eq.2.96 shows that t 0 = 1 and t n = n 1 k k=1 i 1 + +i k =n i j 1,j=1,k u i1... u ik z l l=1 b l,k z l, 2.96 u i1... u ik u i1... u ik, k

65 2.3. Proofs 49 Notice that the coecients b n,k can be bounded by b n,k δ k i 1 + +i k =n i j 1,j=1,k [ ] i1 +1 [ ] ik +1 2 where δ = λ1 p 1h 1 p 1. λ 2 By induction, one can easily check the validity of the identity [ ] [ ] [ ] i1 + 1 ik + 1 i1 + + i k = p1... p1, 2.99 [ n ] We observe that the number of terms in the sum of Eq.2.99 is equal with the number of dierent positive integers solutions of the equation i i k = n and is given by n 1 k 1 see for example [Yaglom and Yaglom, 1987, Problem 31]. This shows that b n,k p [ n+1 2 ] n 1 1 δ k k 1 Now, from Eq.2.98 and Eq.2.101, we have p [ n+1 2 ] 1 δ [ n 1 2 ] k=0 which lead to the bound n 1 δ 2k t n p [ n+1 2k t n δ 2 n+1 2 ] p[ 1 2 ] 1 δ [ n 2 ] 1 k=0 n 1 δ 2k+1, k + 1 [ 1 + δ n δ n 1] We observe that from Eq.2.86 and Eq.2.103, the dierence q n λ n q 3 λ 3 can be bounded by n+1 4 t s λ s t s λ s n+1 q n λ n q 3 λ 3 s=0 s=0 = = t λ s λ s 1 s=5 δ p [ s+1 2 ] [ δ s δ s 1] λ s s=5 If we denote by σ 1 = 1+δλ, σ 2 = 1 δλ and by V the upper bound in Eq.2.104, it is not hard to see that [ V = δp3 1 σ σ p 1 σ1 2 + σ σ ] 2 1 p 1 σ Recalling that h = 1 lp 1, λ 1, 1 + lp 1 and p 1 0.1, we observe that σ 2 is bounded by lp 1[1 + 2p lp 1 ] 1 p lp 1 2 σ 2 lp lp 1, 1 p 1

66 50 Chapter 2. Extremes of 1-dependent stationary sequences which gives σ 2 < 0.5 and δσ4 2 1+σ 2 21 p 1 σ 2 2 < 0.1. Substituting the last relation in Eq.2.105, we can rewrite the bound in Eq as q n λ n q 3 λ 3 Ep 1 p where, if we denote by η = 1 + lp 1, Ep 1 = η5 [ p 1 η] 4 [1 + p 1 η 2] [ 1 + η + 1 3p 1 η 2] 21 p 1 η 2 4 [1 p 1 η 2 2 p 1 η η 2p 1 η ] We saw in Lemma that q 3 λ 3 = 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 + OLp 1 p 3 1, where Lp 1 = 36+1 p 1 2 P p 1 and P p 1 is given by Eq Using Eq and the above estimate, we conclude that q n λ n 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 Γp 1 p 3 1, with Γp 1 = p 1 2 P p 1 + Ep 1 and this ends the proof of the theorem. Remark that in the statement of Theorem we gave Ep 1 without the 0.1 term, but we have added it to the constant term Proof of Corollary We know from Theorem that the following relation holds: q n λ n 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2 Γp 1 p 3 1. Based on this bound, we can write Observe that q n λ n 1 p 2 Γp 1 p p 3 3p 4 + p p 2 2 6p 1 p and that 0 p p 3 3p 4 = p 2 1 p }{{} 4 +2p 3 p 4 3p 2 }{{} 1 p 2 p p 2 1 6p 2 2 6p 1 p 2 0. By adding these two set of inequalities, we have the bound which substituted in Eq implies p p 3 3p 4 + 6p 2 2 6p 1 p 2 3p 2 1, q n λ n 1 p p 1 Γp 1 p

67 2.3. Proofs Proof of Theorem Denoting by we observe that and that µ 1 = 1 p 2 + 2p 3 3p 4 + p p 2 2 6p 1 p 2, µ 2 = 1 + p 1 p 2 + p 3 p 4 + 2p p 2 2 5p 1 p 2 0 µ 1 µ 2 < 1 µ 2 = 1 + p 1 p 2 1 p 2 + p 3 p 4 + 2p 1 p }{{} 0 With the help of Eq.2.13 and Eq.2.16, we get q n µ 1 µ n qn µ 1 + µ 1 2 λ n λ n µ 1 µ n 2 Γαp µ 1 1 λ 1 1 µ 2 λ n λ n j µ j }{{} 1 µ n 1 [Γp 1 + nkp 1 ] p To express µ 1 and µ 2 in terms of q's, we have to observe rst that p 1 = 1 q 1 p 2 = 1 2q 1 + q 2 p 3 = 1 3q 1 + 2q 2 + q 2 1 q 3 p 4 = 1 4q 1 + 3q 2 2q 1 q 2 + 3q 2 1 2q 3 + q 4 and after the proper substitutions, we get µ 1 = 1 + q 1 q 2 + q 3 q 4 + 2q q 2 2 5q 1 q µ 2 = 6q 1 q q 3 3q To conclude the proof of Theorem 2.2.8, it is enough to replace the above relations in Eq We obtain q 6q 1 q q 3 3q 4 n 1 + q 1 q 2 + q 3 q 4 + 2q q2 2 5q 1q 2 n n 11 q 1 3, where 1 is given by 1 = 1 q 1, n = K1 p 1 + Γ1 q 1. n

68 52 Chapter 2. Extremes of 1-dependent stationary sequences Proof of Theorem For the proof of Theorem 2.2.9, we use the same approach as in Theorem From Eq.2.15 and Eq.2.19, we get q n ν 1 qn ν 1 + ν 1 λ n λ n ν 1 ν n 2 ν n p 1 Γp 1 p nν 1 λ ν 2 λν 2 [3 + Γp 1 p 1 + n1 + p 1 Kp 1 ] p 2 1, where ν 1 = 1 p 2 and ν 2 = 1 + p 1 p 2 + 2p 1 p 2 2. We express ν 1 and ν 2 in terms of q's using the relations for p 1 to p 4 from the proof of Theorem 2.2.8, so ν 1 = 2q 1 q ν 2 = 1 + q 1 q 2 + 2q 1 q Substituting these formulas in Eq.2.116, we obtain q 2q 1 q 2 n [1 + q 1 q 2 + 2q 1 q 2 2 ] n n 21 q 1 2, where 2 = 2 q 1, n = [ n + K1 q 1 + Γ1 q ] 1 1 q 1. n Proof of Proposition From the stationarity of the sequence W k k Z, we know that the autocovariance function cj = CovW 0, W j is nonnegative denite. Using that E[W 0 ] = 0 and the m-dependence property, we obtain that cj = 0 for all j > m. We associate to the sequence of autocovariances {cj} j=m j= m the trigonometric polynomial m tθ = cje ijθ j= m From the above observations we have that tθ takes only positive real values for all θ. This remark will constitute the key behind the proof. The following lemma, due to Fejér and Riesz see [Riesz and Nagy, 1990, pag 117] for a proof, will elucidate this statement: Lemma Fejér-Riesz. Let the trigonometric polynomial px = n c k e ikx. k= n

69 2.3. Proofs 53 If px 0 for all values of x, then there exists a sequence {b j } j=n j=0 n px = b j e ijx j=0 2. such that Since the trigonometric polynomial t dened in Eq veries the hypothesis of Fejér-Riesz lemma, there is a sequence {a j } j=m j=0 such that we can write m j= m cje ijθ m 2 = a k e ikθ By expanding Eq and identifying the coecients of e ijθ, we obtain for each j {0, 1,..., m} cj = m j k=0 k=0 Now consider the sequence of moving averages of order m Y j = a k a k+j m a k η j k, k=0 where η j N 0, 1 are i.i.d. standard normal random variables. Clearly Y j j Z is a stationary, m-dependent Gaussian sequence. Since W k k Z is in addition a Gaussian sequence, it remains to verify that CovY 0, Y j = m j k=0 a k a k+j But it is an exercise to observe that CovY 0, Y j = E k,l a k a l η k η j l = k,l a k a l E[η k η j l ] = m j k=0 a k a k+j and the proof is complete.

70

71 Chapter 3 Distribution of scan statistics viewed as maximum of 1-dependent stationary sequences In this chapter we consider the general case of the discrete scan statistics in d dimensions, d 1. After introducing the principal notions in Section 3.1, we present in Section 3.2 the methodology used for nding the approximation for the distribution of the d dimensional discrete scan statistics. The advantage of the described method is that we can also establish sharp error bounds for the estimation. Section 3.3 shows how to evaluate these errors. Since the simulation process plays an important role in our approach, in Section 3.4 we present a general importance sampling algorithm that increase the eciency of the proposed approximation. To investigate the accuracy of our estimation, in Section 3.5 we explicit the formulas for the approximation and the error bounds for the special cases of one, two and three dimensional scan statistics and perform a series of numerical applications. We also compare our approximation with some of the existing ones presented in Chapter 1. The work presented in this chapter, for the particular case of three dimensional scan statistics, makes the object of the article [Am rioarei and Preda, 2013a]. Contents 3.1 Denitions and notations Methodology Computation of the approximation and simulation errors Computation of the approximation error Computation of the simulation errors Simulation using Importance Sampling Generalities on importance sampling Importance sampling for scan statistics Computational aspects Related algorithms: comparison for normal data Examples and numerical results One dimensional scan statistics Two dimensional scan statistics Three dimensional scan statistics

72 56 Chapter 3. Scan statistics and 1-dependent sequences 3.1 Denitions and notations Let T 1, T 2,..., T d be positive integers with d 1 and consider the d-dimensional rectangular region, R d, dened by R d = [0, T 1 ] [0, T 2 ] [0, T d ]. 3.1 For 1 s j T j, j {1, 2,..., d}, we can associate to each elementary rectangular region r d s 1, s 2,..., s d = [s 1 1, s 1 ] [s 2 1, s 2 ] [s d 1, s d ] 3.2 a real valued random variable X s1,s 2,...,s d. Notice that one can imagine the region R d as a d-dimensional lattice characterized by the centers of the elementary subregions, such that to each point of the lattice it corresponds a random variable. Let 2 m j T j, 1 j d, be positive integers and dene the random variables Y i1,i 2,...,i d = i 1 +m 1 1 s 1 =i 1 i 2 +m 2 1 s 2 =i 2 i d +m d 1 s d =i d X s1,s 2,...,s d, 3.3 where 1 i l T l m l + 1, 1 l d. The random variables Y i1,i 2,...,i d associate to each rectangular region comprised of m 1 m 2... m d adjacent elementary subregions, Ri 1, i 2,..., i d = [i 1 1, i 1 + m 1 1] [i d 1, i d + m d 1], 3.4 a numerical value equal with the sum of all the attributes in that region. For example, consider the case when the random variables X s1,s 2,...,s d are 0 1 Bernoulli of parameter p. The value 1 can correspond to the situation when in the elementary rectangular region r d s 1, s 2,..., s d, an event of interest has been observed and the value 0, otherwise. In this framework, the random variables Y i1,i 2,...,i d give the number of events observed in the rectangular region Ri 1, i 2,..., i d. We dene the d-dimensional discrete scan statistics as the maximum number of events in any rectangular region Ri 1, i 2,..., i d within the region R d, S m T = max 1 i j T j m j +1 j {1,2,...,d} where m = m 1, m 2,..., m d and T = T 1, T 2,..., T d. Y i1,i 2,...,i d, 3.5 Remark The random variable S m T dened in Eq.3.5 can be viewed as an extension of the one dimensional scan statistics presented in [Glaz et al., 2001] d = 1, the two dimensional scan statistics introduced in [Chen and Glaz, 1996] d = 2 and of the three dimensional scan statistics studied in [Guerriero et al., 2010a] and [Am rioarei and Preda, 2013a] d = 3. The distribution of the scan statistics, Q m T = P S m T n, 3.6

73 3.2. Methodology 57 has been used in [Chen and Glaz, 1996] and [Guerriero et al., 2009] for the two dimensional case and in [Guerriero et al., 2010a] and [Am rioarei and Preda, 2013a] for the three dimensional one, to test the null hypothesis of randomness against an alternative of clustering. Under the null hypotheses, H 0, it is assumed that the random variables X s1,s 2,...,s d are independent and identically distributed according to some specied distribution. For the alternative hypothesis of clustering, one can specify a region Ri 1, i 2,..., i d where the random variables X s1,s 2,...,s d have a larger mean than outside this region. As an example, consider d = 2 and the binomial model. The null hypothesis, H 0, assumes that X s1,s 2 's are i.i.d. with X s1,s 2 Binr, p whereas the alternative hypothesis of clustering, H 1, assumes the existence of a rectangular subregion Ri 0, j 0 such that for any i 0 s 1 i 0 +m 1 1 and j 0 s 2 j 0 + m 2 1, X s1,s 2 are i.i.d. binomial random variables with parameters r and p > p. Outside the region Ri 0, j 0, X s1,s 2 are i.i.d. distributed according to the distribution specied by the null hypothesis. The generalized likelihood ratio test rejects H 0 in favor of the local change alternative H 1, whenever S m T exceeds the threshold τ, determined from P S m T τ H 0 = α and where α represents the signicance level of the testing procedure [Glaz et al., 2001, Chapter 13]. 3.2 Methodology As noted in Chapter 1 see also [Cressie, 1993, pag 313], nding the exact distribution of the d-dimensional scan statistics for d 2 has proved elusive. In this section, we present an alternative method, based on the results derived in Chapter 2, for nding accurate approximations for the distribution Q m T of the d-dimensional scan statistics generated by an i.i.d. sequence. One of the main features of this approach is that, besides the approximation formula, it also provides sharp error bounds. It is also important to mention that the method can be applied for any distribution, discrete or continuous, of the random variables X s1,s 2,...,s d. In Section 3.5, we include numerical results to emphasize this remark. This approach is not new and was successfully used to approximate the distribution of scan statistics, both in discrete and continuous cases, in a series of articles: for one-dimensional case in [Haiman, 2000] and [Haiman, 2007], for two-dimensional case in [Haiman and Preda, 2002] and [Haiman and Preda, 2006] and for the threedimensional case in [Am rioarei and Preda, 2013a]. Consider the framework described in Section 3.1 and let the sequence of random variables X s1,s 2,...,s d be independent and identically distributed according to a specied distribution. The key idea behind the approximation method is to express the scan statistic random variable S m T as maximum of a 1-dependent stationary sequence of random variables and to employ the estimate from Theorem 2.2.9, Chapter 2. The approximation will be carried out in a series of steps. Assume that L j = T j m j 1, j {1, 2,..., d}, are positive integers and dene for each

74 58 Chapter 3. Scan statistics and 1-dependent sequences k 1 {1, 2,..., L 1 1} the random variables Z k1 = max k 1 1m i 1 k 1 m i j L j 1m j 1 j {2,...,d} Y i1,i 2,...,i d. 3.7 We observe that the random variable Z k1 dened in Eq.3.7 corresponds, in fact, to the d-dimensional discrete scan statistics over the multidimensional rectangular strip [k 1 1m 1 1, k 1 + 1m 1 1] [0, T 2 ] [0, T d ]. We claim that the set of random variables {Z 1,..., Z L1 1} forms a 1-dependent stationary sequence according to Denition and Denition Indeed, from Eq.3.7 we notice that for k 1 1 σz 1,, Z k1 σ {X s1,s 2,...,s d 1 s 1 k 1 + 1m 1 1, 1 s j T j, j 2} and σz k1 +2, σ {X s1,s 2,...,s d k 1 + 1m s 1, 1 s j T j, j 2}. The independence of the sequence X s1,s 2,...,s d implies that the σ-elds σ, Z k1 and σz k1 +2, are independent, so the 1-dependence of the set of random variables {Z 1,..., Z L1 1} is veried. The stationarity is immediate since the random variables X s1,s 2,...,s d are also identically distributed. For a better understanding, we consider the three dimensional setting d = 3 to exemplify our approach. In Figure 3.1 we illustrate the sequence Z k1 L 1 1 k 1 =1, emphasizing its 1-dependent structure. Notice that from Eq.3.7 and the denition of the d-dimensional scan statistics in Eq.3.5 we have the following identity S m T = max Y i1,i 2,...,i d 1 i j T j m j +1 j {1,2,...,d} = max 1 k 1 L 1 1 max Y i1,i 2,...,i d k 1 1m i 1 k 1 m i j L j 1m j 1 j {2,...,d} = max 1 k 1 L 1 1 Z k The above relation shows that the scan statistic random variable can be expressed as the maximum of a 1-dependent stationary sequence and gives us the proper setting for applying the estimates developed in the previous chapter. Recall from Chapter 2 that given a 1-dependent stationary sequence of random variables W k k 1, if the tail distribution PW 1 > x is small enough, then we have the following estimate for the distribution of the maximum of the rst m terms: q 2q 1 q 2 m [1 + q 1 q 2 + 2q 1 q 2 2 ] m mf q 1, m1 q 1 2, 3.9

[Figure 3.1: Illustration of $Z_{k_1}$ in the case of $d = 3$, emphasizing the 1-dependence.]

Define for $t_1 \in \{2, 3\}$,
$$Q_{t_1} = Q_{t_1}(n) = P\left(\bigcap_{k_1=1}^{t_1-1} \{Z_{k_1} \le n\}\right) = P\left(\max_{\substack{1 \,\le\, i_1 \,\le\, (t_1-1)(m_1-1) \\ 1 \,\le\, i_j \,\le\, (L_j-1)(m_j-1),\ j \in \{2,\dots,d\}}} Y_{i_1,i_2,\dots,i_d} \le n\right). \qquad (3.11)$$
It is clear that $Q_{t_1}$ coincides with $Q_{\mathbf{m}}\left(t_1(m_1-1), T_2, \dots, T_d\right)$, the distribution of the $d$-dimensional scan statistics over the rectangular region $[0, t_1(m_1-1)] \times [0, T_2] \times \dots \times [0, T_d]$ (see also Figure 3.1 when $d = 3$). For $n$ such that $Q_2(n) \ge 0.9$, we can apply the result of Theorem 2.2.9 stated above to obtain the first step approximation
$$\left| Q_{\mathbf{m}}(\mathbf{T}) - \frac{2Q_2 - Q_3}{\left[1 + Q_2 - Q_3 + 2(Q_2 - Q_3)^2\right]^{L_1-1}} \right| \le (L_1-1)\, F(Q_2, L_1-1)\,(1 - Q_2)^2. \qquad (3.12)$$

In order to evaluate the approximation in Eq. (3.12), one has to find suitable estimates for the quantities $Q_2$ and $Q_3$. To simplify the presentation, in what follows we abbreviate the approximation formula by
$$H(x, y, m) = \frac{2x - y}{\left[1 + x - y + 2(x - y)^2\right]^{m-1}}. \qquad (3.13)$$
The second step in our approximation consists in defining two new sequences of 1-dependent stationary random variables, such that each $Q_{t_1}$ can be expressed as the distribution of the maximum of the variables in the corresponding sequence. We define, as in Eq. (3.7), for $t_1 \in \{2, 3\}$ and $k_2 \in \{1, 2, \dots, L_2-1\}$ the random variables
$$Z^{(t_1)}_{k_2} = \max_{\substack{1 \,\le\, i_1 \,\le\, (t_1-1)(m_1-1) \\ (k_2-1)(m_2-1)+1 \,\le\, i_2 \,\le\, k_2(m_2-1) \\ 1 \,\le\, i_j \,\le\, (L_j-1)(m_j-1),\ j \in \{3,\dots,d\}}} Y_{i_1,i_2,\dots,i_d}. \qquad (3.14)$$
Observe that the random variables $Z^{(t_1)}_{k_2}$ coincide with the $d$-dimensional scan statistics over the overlapping strips of size $t_1(m_1-1) \times 2(m_2-1) \times T_3 \times \dots \times T_d$,
$$[0, t_1(m_1-1)] \times [(k_2-1)(m_2-1), (k_2+1)(m_2-1)] \times [0, T_3] \times \dots \times [0, T_d].$$
Based on similar arguments as in the case of $(Z_{k_1})_{k_1=1}^{L_1-1}$, the random variables in the sets $\left\{Z^{(t_1)}_1, Z^{(t_1)}_2, \dots, Z^{(t_1)}_{L_2-1}\right\}$ are 1-dependent and stationary. Moreover, for each $t_1 \in \{2, 3\}$ we have
$$Q_{t_1} = P\left(\max_{1 \le k_2 \le L_2-1} Z^{(t_1)}_{k_2} \le n\right). \qquad (3.15)$$
Set, for $t_1, t_2 \in \{2, 3\}$,
$$Q_{t_1,t_2} = Q_{t_1,t_2}(n) = P\left(\bigcap_{k_2=1}^{t_2-1} \left\{Z^{(t_1)}_{k_2} \le n\right\}\right) = P\left(\max_{\substack{1 \,\le\, i_1 \,\le\, (t_1-1)(m_1-1) \\ 1 \,\le\, i_2 \,\le\, (t_2-1)(m_2-1) \\ 1 \,\le\, i_j \,\le\, (L_j-1)(m_j-1),\ j \in \{3,\dots,d\}}} Y_{i_1,i_2,\dots,i_d} \le n\right) \qquad (3.16)$$
and observe that $Q_{t_1,t_2} = Q_{\mathbf{m}}\left(t_1(m_1-1), t_2(m_2-1), T_3, \dots, T_d\right)$. Whenever $n$ is such that $Q_{t_1,2}(n) \ge 0.9$, we are in the hypothesis of Theorem 2.2.9 and, for each $t_1 \in \{2, 3\}$, we approximate $Q_{t_1}$ with
$$\left| Q_{t_1} - H\left(Q_{t_1,2}, Q_{t_1,3}, L_2\right) \right| \le (L_2-1)\, F(Q_{t_1,2}, L_2-1)\,(1 - Q_{t_1,2})^2. \qquad (3.17)$$
Combining the estimate in Eq. (3.12) with the ones in Eq. (3.17), we obtain an expression for the approximation of $Q_{\mathbf{m}}(\mathbf{T})$ depending on the four distribution functions $Q_{2,2}$, $Q_{3,2}$, $Q_{2,3}$ and $Q_{3,3}$. Depending on the dimension of the problem, the above procedure can be repeated for at most $d$ steps, to get simpler terms in the approximation formula.
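To make the nested use of Eq. (3.13) concrete, here is a minimal sketch (in Python, not the thesis' Matlab code) of the kernel $H$ and of its two-step composition for $d = 2$; the input values $\hat{Q}_{t_1,t_2}$ below are hypothetical, standing in for simulated quantities.

```python
# The approximation kernel H of Eq. (3.13) and its nested use for d = 2.
def H(x: float, y: float, m: int) -> float:
    """(2x - y) / [1 + x - y + 2(x - y)^2]^(m - 1)."""
    return (2.0 * x - y) / (1.0 + x - y + 2.0 * (x - y) ** 2) ** (m - 1)

def approx_2d(Qhat: dict, L1: int, L2: int) -> float:
    """Two-step approximation of Q_m(T) from the four quantities Q_{t1,t2}."""
    Q2 = H(Qhat[(2, 2)], Qhat[(2, 3)], L2)   # Q_2 ~ H(Q_{2,2}, Q_{2,3}, L_2)
    Q3 = H(Qhat[(3, 2)], Qhat[(3, 3)], L2)   # Q_3 ~ H(Q_{3,2}, Q_{3,3}, L_2)
    return H(Q2, Q3, L1)

# Hypothetical values, only to exercise the formula:
print(approx_2d({(2, 2): 0.995, (2, 3): 0.992, (3, 2): 0.992, (3, 3): 0.988}, 25, 20))
```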

In general, at step $s$, $1 \le s \le d$, the problem is to approximate, for each $t_j \in \{2, 3\}$, $j \in \{1, \dots, s-1\}$, the distribution function of the $d$-dimensional scan statistics over the rectangular region $[0, t_1(m_1-1)] \times \dots \times [0, t_{s-1}(m_{s-1}-1)] \times [0, T_s] \times \dots \times [0, T_d]$. We adopt the following notation for these distribution functions:
$$Q_{t_1,t_2,\dots,t_{s-1}} = Q_{t_1,t_2,\dots,t_{s-1}}(n) = P\left(\max_{\substack{1 \,\le\, i_l \,\le\, (t_l-1)(m_l-1),\ l \in \{1,\dots,s-1\} \\ 1 \,\le\, i_j \,\le\, (L_j-1)(m_j-1),\ j \in \{s,\dots,d\}}} Y_{i_1,i_2,\dots,i_d} \le n\right). \qquad (3.18)$$
As described in the first two steps, the idea is to define, for each point $(t_1, \dots, t_{s-1}) \in \{2, 3\}^{s-1}$, a set of stationary and 1-dependent random variables such that $Q_{t_1,t_2,\dots,t_{s-1}}$ corresponds to the distribution of the maximum of these variables. We focus on reducing the size of the $s$-th dimension. Define, for $t_l \in \{2, 3\}$, $l \in \{1, \dots, s-1\}$, and $k_s \in \{1, 2, \dots, L_s-1\}$, the random variables
$$Z^{(t_1,t_2,\dots,t_{s-1})}_{k_s} = \max_{\substack{1 \,\le\, i_l \,\le\, (t_l-1)(m_l-1),\ l \in \{1,\dots,s-1\} \\ (k_s-1)(m_s-1)+1 \,\le\, i_s \,\le\, k_s(m_s-1) \\ 1 \,\le\, i_j \,\le\, (L_j-1)(m_j-1),\ j \in \{s+1,\dots,d\}}} Y_{i_1,i_2,\dots,i_d}. \qquad (3.19)$$
Based on the same arguments as for the first and the second step, the random variables $\left\{Z^{(t_1,t_2,\dots,t_{s-1})}_1, \dots, Z^{(t_1,t_2,\dots,t_{s-1})}_{L_s-1}\right\}$ form a 1-dependent stationary sequence and verify the following relation:
$$Q_{t_1,t_2,\dots,t_{s-1}} = P\left(\max_{1 \le k_s \le L_s-1} Z^{(t_1,t_2,\dots,t_{s-1})}_{k_s} \le n\right). \qquad (3.20)$$
Clearly, from Eq. (3.18) and Eq. (3.19) we have
$$Q_{t_1,t_2,\dots,t_s} = Q_{t_1,t_2,\dots,t_s}(n) = P\left(\bigcap_{k_s=1}^{t_s-1} \left\{Z^{(t_1,t_2,\dots,t_{s-1})}_{k_s} \le n\right\}\right). \qquad (3.21)$$
If we take $n$ such that $Q_{t_1,t_2,\dots,t_{s-1},2}(n) \ge 0.9$, so that we can apply Theorem 2.2.9, then the step $s$ approximation is given by
$$\left| Q_{t_1,\dots,t_{s-1}} - H\left(Q_{t_1,\dots,t_{s-1},2}, Q_{t_1,\dots,t_{s-1},3}, L_s\right) \right| \le (L_s-1)\, F(Q_{t_1,\dots,t_{s-1},2}, L_s-1)\left(1 - Q_{t_1,\dots,t_{s-1},2}\right)^2. \qquad (3.22)$$
Substituting, for each $s \in \{2, \dots, d\}$, the Eqs. (3.22) in Eq. (3.12), we get an approximation formula for the distribution of the $d$-dimensional scan statistics depending on the $2^d$ quantities $Q_{t_1,\dots,t_d}$, which we propose to evaluate by simulation. To give a better picture of the approximation process described above, we include in Fig. 3.2 a diagram that illustrates the steps involved in the approximation of $Q_2$ for the three dimensional scan statistics.

[Figure 3.2: Illustration of the approximation of $Q_2$ in three dimensions.]

Remark 3.2.1. If there are indices $j \in \{1, 2, \dots, d\}$ such that $T_j$ is not a multiple of $m_j - 1$, then we take $L_j = \left\lfloor \frac{T_j}{m_j - 1} \right\rfloor$. Based on the inequalities
$$P\left(S_{\mathbf{m}}(\mathbf{M}_1) \le n\right) \le Q_{\mathbf{m}}(\mathbf{T}) \le P\left(S_{\mathbf{m}}(\mathbf{M}_2) \le n\right), \qquad (3.23)$$
with
$$\mathbf{M}_1 = \left((L_1+1)(m_1-1), \dots, (L_d+1)(m_d-1)\right), \qquad \mathbf{M}_2 = \left(L_1(m_1-1), \dots, L_d(m_d-1)\right),$$
we can approximate the distribution of the scan statistics $Q_{\mathbf{m}}(\mathbf{T})$ by the following multi-linear interpolation procedure. Suppose that we have a function $y = f(x_1, \dots, x_d)$ and we are given $2^d$ points $x_{1,s} \le x_{2,s}$, $s \in \{1, \dots, d\}$, that are the vertices of a right rectangular polytope. We also assume that the values of the function at the vertices, $v_{i_1,\dots,i_d} = f(x_{i_1,1}, \dots, x_{i_d,d})$, $i_j \in \{1, 2\}$, $j \in \{1, \dots, d\}$, are known. Given a point $(\bar{x}_1, \dots, \bar{x}_d)$ in the interior of the polytope, we can approximate the value of $\bar{v} = f(\bar{x}_1, \dots, \bar{x}_d)$ by
$$\bar{v} = \sum_{i_1,\dots,i_d \in \{1,2\}} v_{i_1,\dots,i_d}\, V_{i_1,\dots,i_d},$$
where the normalized volume $V_{i_1,\dots,i_d}$ is given by
$$V_{i_1,\dots,i_d} = \frac{\alpha(i_1, 1) \cdots \alpha(i_d, d)}{(x_{2,1} - x_{1,1}) \cdots (x_{2,d} - x_{1,d})}, \qquad \alpha(i_s, s) = \begin{cases} x_{2,s} - \bar{x}_s, & \text{if } i_s = 1, \\ \bar{x}_s - x_{1,s}, & \text{if } i_s = 2. \end{cases}$$
Observe that in the particular case of one dimension ($d = 1$), if we are given $x_1 \le \bar{x} \le x_2$, $y_1 = f(x_1)$ and $y_2 = f(x_2)$, then, for $\bar{y} = f(\bar{x})$, the foregoing relations reduce to
$$\bar{y} = y_1 \frac{x_2 - \bar{x}}{x_2 - x_1} + y_2 \frac{\bar{x} - x_1}{x_2 - x_1},$$
which is exactly the linear interpolation formula.
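The interpolation step of Remark 3.2.1 can be written compactly; the sketch below (illustrative, with a hypothetical query point) implements the normalized-volume weights $V_{i_1,\dots,i_d}$ and checks the $d = 1$ reduction.

```python
# Multi-linear interpolation as in Remark 3.2.1.
import itertools
import numpy as np

def multilinear(x_low, x_high, vertex_vals, xbar):
    """x_low[s], x_high[s]: box bounds; vertex_vals[(i_1,..,i_d)] with i_s in
    {1, 2}: value of f at the corresponding vertex; xbar: query point."""
    d = len(x_low)
    vol = np.prod([x_high[s] - x_low[s] for s in range(d)])
    total = 0.0
    for idx in itertools.product((1, 2), repeat=d):
        # alpha(i_s, s) = x_high - xbar if i_s = 1, xbar - x_low if i_s = 2
        w = np.prod([(x_high[s] - xbar[s]) if idx[s] == 1 else (xbar[s] - x_low[s])
                     for s in range(d)])
        total += vertex_vals[idx] * w / vol
    return total

# d = 1 sanity check: reduces to linear interpolation between (0,1) and (2,3)
assert multilinear([0.0], [2.0], {(1,): 1.0, (2,): 3.0}, [0.5]) == 1.5
```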

3.3 Computation of the approximation and simulation errors

In this section, we present the derivation of the errors resulting from the approximation process described in Section 3.2. To see where the errors appear, it is convenient to introduce some notation. Let $\gamma_{t_1,\dots,t_d} = Q_{t_1,\dots,t_d}$, with $t_j \in \{2, 3\}$, $j \in \{1, \dots, d\}$, and define for $2 \le s \le d$
$$\gamma_{t_1,\dots,t_{s-1}} = H\left(\gamma_{t_1,\dots,t_{s-1},2},\ \gamma_{t_1,\dots,t_{s-1},3},\ L_s\right). \qquad (3.24)$$
Since there are no exact formulas available for the computation of $Q_{t_1,\dots,t_d}$ for $d \ge 2$, these quantities will be estimated by Monte Carlo simulation. Denote by $\hat{Q}_{t_1,\dots,t_d}$ the estimated value of $Q_{t_1,\dots,t_d}$ and define for $2 \le s \le d$
$$\hat{Q}_{t_1,\dots,t_{s-1}} = H\left(\hat{Q}_{t_1,\dots,t_{s-1},2},\ \hat{Q}_{t_1,\dots,t_{s-1},3},\ L_s\right), \qquad (3.25)$$
the estimated value of $Q_{t_1,\dots,t_{s-1}}$. Our goal is to approximate $Q_{\mathbf{m}}(\mathbf{T})$ with $H(\hat{Q}_2, \hat{Q}_3, L_1)$ and to find the corresponding error bounds. We observe that
$$\left| Q_{\mathbf{m}}(\mathbf{T}) - H(\hat{Q}_2, \hat{Q}_3, L_1) \right| \le \left| Q_{\mathbf{m}}(\mathbf{T}) - H(\gamma_2, \gamma_3, L_1) \right| + \left| H(\gamma_2, \gamma_3, L_1) - H(\hat{Q}_2, \hat{Q}_3, L_1) \right|, \qquad (3.26)$$
which shows that the error bound has two components: an approximation error, coming from the first term on the right hand side of Eq. (3.26), and a simulation error associated with the second term (the simulation error corresponding to the approximation formula). Notice also that the approximation error is of theoretical interest only, since it involves the true values of the quantities $Q_{t_1,\dots,t_d}$. Due to the simulation nature of the problem, this error is in turn bounded by what we call the simulation error corresponding to the approximation error.

Computation of the approximation error

To simplify the presentation and the derivation of the approximation error formula, we define for $2 \le s \le d$,
$$F_{t_1,\dots,t_{s-1}} = F\left(Q_{t_1,\dots,t_{s-1},2}, L_s - 1\right) \quad \text{and} \quad F = F(Q_2, L_1 - 1). \qquad (3.27)$$
In practice, since the function $F(q, m)$ defined by Eq. (3.10) is slowly decreasing in $q$, the values $F_{t_1,\dots,t_{s-1}}$ will be computed as $F\left(\hat{Q}_{t_1,\dots,t_{s-1},2}, L_s - 1\right)$, and we treat them as known values in the derivation process. Rigorously, these quantities can be bounded above by $F\left(\hat{Q}_{t_1,\dots,t_{s-1},2} - \varepsilon_{sim}, L_s - 1\right)$, where $\varepsilon_{sim}$ is the simulation error that corresponds to $\hat{Q}_{t_1,\dots,t_{s-1},2}$. The goal of this section is to find an error bound for the first term on the right hand side of Eq. (3.26), namely for the difference $\left| Q_{\mathbf{m}}(\mathbf{T}) - H(\gamma_2, \gamma_3, L_1) \right|$. We have the following lemma (a proof is given in Appendix B):

Lemma. Let $H(x, y, m) = \frac{2x - y}{[1 + x - y + 2(x - y)^2]^{m-1}}$. If $y_i \le x_i$, $i \in \{1, 2\}$, then
$$\left| H(x_1, y_1, m) - H(x_2, y_2, m) \right| \le \begin{cases} (m-1)\left[\,|x_1 - x_2| + |y_1 - y_2|\,\right], & 3 \le m \le 5, \\ (m-2)\left[\,|x_1 - x_2| + |y_1 - y_2|\,\right], & m \ge 6. \end{cases} \qquad (3.28)$$

Hereinafter, we employ the first branch of this lemma, without restrictions, whenever necessary; this is in agreement with the numerical values considered in Section 3.5. We begin by observing that
$$\left| Q_{\mathbf{m}}(\mathbf{T}) - H(\gamma_2, \gamma_3, L_1) \right| \le \left| Q_{\mathbf{m}}(\mathbf{T}) - H(Q_2, Q_3, L_1) \right| + \left| H(Q_2, Q_3, L_1) - H(\gamma_2, \gamma_3, L_1) \right| \le (L_1-1)F(1 - Q_2)^2 + (L_1-1)\sum_{t_1 \in \{2,3\}} \left| Q_{t_1} - \gamma_{t_1} \right|. \qquad (3.29)$$
Similarly, $\left| Q_{t_1} - \gamma_{t_1} \right|$, for $t_1 \in \{2, 3\}$, is bounded by
$$\left| Q_{t_1} - \gamma_{t_1} \right| \le \left| Q_{t_1} - H(Q_{t_1,2}, Q_{t_1,3}, L_2) \right| + \left| H(Q_{t_1,2}, Q_{t_1,3}, L_2) - H(\gamma_{t_1,2}, \gamma_{t_1,3}, L_2) \right| \le (L_2-1)F_{t_1}(1 - Q_{t_1,2})^2 + (L_2-1)\sum_{t_2 \in \{2,3\}} \left| Q_{t_1,t_2} - \gamma_{t_1,t_2} \right|. \qquad (3.30)$$
Continuing this process, at step $s$, with $2 \le s \le d$, we have
$$\left| Q_{t_1,\dots,t_{s-1}} - \gamma_{t_1,\dots,t_{s-1}} \right| \le (L_s-1)F_{t_1,\dots,t_{s-1}}\left(1 - Q_{t_1,\dots,t_{s-1},2}\right)^2 + (L_s-1)\sum_{t_s \in \{2,3\}} \left| Q_{t_1,\dots,t_s} - \gamma_{t_1,\dots,t_s} \right|. \qquad (3.31)$$

Substituting, at each step $s \in \{2, \dots, d\}$, the corresponding Eq. (3.31) into Eq. (3.29), we obtain
$$\left| Q_{\mathbf{m}}(\mathbf{T}) - H(\gamma_2, \gamma_3, L_1) \right| \le \sum_{s=1}^{d} (L_1-1) \cdots (L_s-1) \sum_{t_1,\dots,t_{s-1} \in \{2,3\}} F_{t_1,\dots,t_{s-1}}\left(1 - Q_{t_1,\dots,t_{s-1},2}\right)^2, \qquad (3.32)$$
where we adopt the conventions $\sum_{t_1,t_0 \in \{2,3\}} x = x$, $F_{t_1,t_0} = F$ and $Q_{t_1,t_0,2} = Q_2$, which imply that the first term in the sum is $(L_1-1)F(1 - Q_2)^2$. To express the bound in Eq. (3.32) only in terms of the $\gamma_{t_1,\dots,t_s}$, with $1 \le s \le d$, we introduce the following quantities. Let $B_{t_1,\dots,t_d} = 0$,
$$B_{t_1,\dots,t_{d-1}} = (L_d-1)F_{t_1,\dots,t_{d-1}}\left(1 - Q_{t_1,\dots,t_{d-1},2}\right)^2 = (L_d-1)F_{t_1,\dots,t_{d-1}}\left(1 - \gamma_{t_1,\dots,t_{d-1},2} + B_{t_1,\dots,t_{d-1},2}\right)^2, \qquad (3.33)$$
and for $2 \le s \le d-1$ define
$$B_{t_1,\dots,t_{s-1}} = (L_s-1)\left[F_{t_1,\dots,t_{s-1}}\left(1 - \gamma_{t_1,\dots,t_{s-1},2} + B_{t_1,\dots,t_{s-1},2}\right)^2 + \sum_{t_s \in \{2,3\}} B_{t_1,\dots,t_s}\right]. \qquad (3.34)$$
It can be easily verified that
$$Q_{t_1,\dots,t_d} - \gamma_{t_1,\dots,t_d} = 0 = B_{t_1,\dots,t_d} \qquad (3.35)$$
and
$$\left| Q_{t_1,\dots,t_{d-1}} - \gamma_{t_1,\dots,t_{d-1}} \right| \le B_{t_1,\dots,t_{d-1}}, \qquad (3.36)$$
from the definitions of $B_{t_1,\dots,t_d}$ and $B_{t_1,\dots,t_{d-1}}$. Since
$$1 - Q_{t_1,\dots,t_{d-1}} \le 1 - \gamma_{t_1,\dots,t_{d-1}} + \left| Q_{t_1,\dots,t_{d-1}} - \gamma_{t_1,\dots,t_{d-1}} \right|, \qquad (3.37)$$
we deduce, by substituting Eq. (3.36) and Eq. (3.37) into Eq. (3.31) and from the recurrence given in Eq. (3.34), that
$$\left| Q_{t_1,\dots,t_{d-2}} - \gamma_{t_1,\dots,t_{d-2}} \right| \le (L_{d-1}-1)\left[F_{t_1,\dots,t_{d-2}}\left(1 - \gamma_{t_1,\dots,t_{d-2},2} + B_{t_1,\dots,t_{d-2},2}\right)^2 + \sum_{t_{d-1} \in \{2,3\}} B_{t_1,\dots,t_{d-1}}\right] = B_{t_1,\dots,t_{d-2}}. \qquad (3.38)$$
Using mathematical induction and noticing that the relation in Eq. (3.37) remains valid for $2 \le s \le d$, we can verify that
$$\left| Q_{t_1,\dots,t_s} - \gamma_{t_1,\dots,t_s} \right| \le B_{t_1,\dots,t_s}, \quad s \in \{1, \dots, d\}. \qquad (3.39)$$

A simple computation, similar to the one used to obtain Eq. (3.32), leads to
$$B_{t_1,\dots,t_{s-1}} = \sum_{h=s}^{d} (L_s-1) \cdots (L_h-1) \sum_{t_s,\dots,t_{h-1} \in \{2,3\}} F_{t_1,\dots,t_{h-1}}\left(1 - \gamma_{t_1,\dots,t_{h-1},2} + B_{t_1,\dots,t_{h-1},2}\right)^2, \qquad (3.40)$$
for $2 \le s \le d$ and with the convention $\sum_{t_s,t_{s-1} \in \{2,3\}} x = x$. Substituting Eqs. (3.37) and (3.39) in Eq. (3.32), we derive the formula for the theoretical approximation error
$$E_{app}(d) = \sum_{s=1}^{d} (L_1-1) \cdots (L_s-1) \sum_{t_1,\dots,t_{s-1} \in \{2,3\}} F_{t_1,\dots,t_{s-1}}\left(1 - \gamma_{t_1,\dots,t_{s-1},2} + B_{t_1,\dots,t_{s-1},2}\right)^2, \qquad (3.41)$$
where for $s = 1$: $\sum_{t_1,t_0 \in \{2,3\}} x = x$, $F_{t_1,t_0} = F$, $\gamma_{t_1,t_0,2} = \gamma_2$ and $B_{t_1,t_0,2} = B_2$.

Computation of the simulation errors

In this section, we deal with the computation of the simulation errors that appear in our approximation. We employ the notations introduced in Eq. (3.25) for the simulated values corresponding to $Q_{t_1,\dots,t_s}$, for $s \in \{1, \dots, d\}$ and $t_j \in \{2, 3\}$, $j \in \{1, \dots, d\}$. As we remarked before, there are two simulation errors resulting from the approximation process: the simulation error associated with the approximation formula, bounding the second term of the right hand side of Eq. (3.26), and the simulation error due to the theoretical approximation error $E_{app}(d)$. We treat these errors separately. Suppose that we have a simulation method to estimate the values of $Q_{t_1,\dots,t_d}$. Then, between the true and the estimated values, one can always find a relation of the form
$$\left| Q_{t_1,\dots,t_d} - \hat{Q}_{t_1,\dots,t_d} \right| \le \beta_{t_1,\dots,t_d}, \quad t_j \in \{2, 3\},\ j \in \{1, \dots, d\}. \qquad (3.42)$$
If, for example, $ITER$ is the number of iterations used in the Monte Carlo simulation algorithm for the estimation of $Q_{t_1,\dots,t_d}$, then one can consider the naive bound provided by the Central Limit Theorem with a 95% confidence level (see [Fishman, 1996]),
$$\beta_{t_1,\dots,t_d} = 1.96\sqrt{\frac{\hat{Q}_{t_1,\dots,t_d}\left(1 - \hat{Q}_{t_1,\dots,t_d}\right)}{ITER}}. \qquad (3.43)$$
We assume in what follows that these values are known. To find the simulation error that corresponds to the approximation formula, we observe that, by applying

successively the lemma above to the second term of Eq. (3.26), we get
$$\left| H(\gamma_2, \gamma_3, L_1) - H(\hat{Q}_2, \hat{Q}_3, L_1) \right| \le (L_1-1)\sum_{t_1 \in \{2,3\}} \left| \gamma_{t_1} - \hat{Q}_{t_1} \right| = (L_1-1)\sum_{t_1 \in \{2,3\}} \left| H(\gamma_{t_1,2}, \gamma_{t_1,3}, L_2) - H(\hat{Q}_{t_1,2}, \hat{Q}_{t_1,3}, L_2) \right| \le (L_1-1)(L_2-1)\sum_{t_1,t_2 \in \{2,3\}} \left| \gamma_{t_1,t_2} - \hat{Q}_{t_1,t_2} \right| \le \dots \le (L_1-1) \cdots (L_{d-1}-1)\sum_{t_1,\dots,t_{d-1} \in \{2,3\}} \left| \gamma_{t_1,\dots,t_{d-1}} - \hat{Q}_{t_1,\dots,t_{d-1}} \right| \le (L_1-1) \cdots (L_d-1)\sum_{t_1,\dots,t_d \in \{2,3\}} \left| Q_{t_1,\dots,t_d} - \hat{Q}_{t_1,\dots,t_d} \right|. \qquad (3.44)$$
Combining Eq. (3.42) and Eq. (3.44), we obtain the simulation error associated with the approximation formula,
$$E_{sf}(d) = (L_1-1) \cdots (L_d-1) \sum_{t_1,\dots,t_d \in \{2,3\}} \beta_{t_1,\dots,t_d}. \qquad (3.45)$$
As we will see in the numerical section, this simulation error has the largest contribution to the total error. In order to find the simulation error corresponding to the approximation error bound given by Eq. (3.41), we need to introduce some notation. Set $A_{t_1,\dots,t_d} = \beta_{t_1,\dots,t_d}$ and take for $2 \le s \le d$
$$A_{t_1,\dots,t_{s-1}} = (L_s-1) \cdots (L_d-1) \sum_{t_s,\dots,t_d \in \{2,3\}} \beta_{t_1,\dots,t_d}. \qquad (3.46)$$
Based on similar arguments as for obtaining Eq. (3.44), we deduce that
$$\left| \hat{Q}_{t_1,\dots,t_s} - \gamma_{t_1,\dots,t_s} \right| \le A_{t_1,\dots,t_s}, \quad s \in \{1, \dots, d\}. \qquad (3.47)$$
Let $C_{t_1,\dots,t_d} = 0$ and, for $2 \le s \le d$, define
$$C_{t_1,\dots,t_{s-1}} = (L_s-1)\left[F_{t_1,\dots,t_{s-1}}\left(1 - \hat{Q}_{t_1,\dots,t_{s-1},2} + A_{t_1,\dots,t_{s-1},2} + C_{t_1,\dots,t_{s-1},2}\right)^2 + \sum_{t_s \in \{2,3\}} C_{t_1,\dots,t_s}\right]. \qquad (3.48)$$
We observe that, replacing $s = d$ in Eq. (3.48),
$$C_{t_1,\dots,t_{d-1}} = (L_d-1)F_{t_1,\dots,t_{d-1}}\left(1 - \hat{Q}_{t_1,\dots,t_{d-1},2} + \beta_{t_1,\dots,t_{d-1},2}\right)^2, \qquad (3.49)$$

since $C_{t_1,\dots,t_d} = 0$ and $A_{t_1,\dots,t_{d-1},2} = \beta_{t_1,\dots,t_{d-1},2}$. As in Eq. (3.37),
$$1 - \gamma_{t_1,\dots,t_{d-1}} \le 1 - \hat{Q}_{t_1,\dots,t_{d-1}} + \left| \hat{Q}_{t_1,\dots,t_{d-1}} - \gamma_{t_1,\dots,t_{d-1}} \right|, \qquad (3.50)$$
so one can deduce, from Eqs. (3.33) and (3.49), that
$$B_{t_1,\dots,t_{d-1}} \le C_{t_1,\dots,t_{d-1}}. \qquad (3.51)$$
From the definitions of $B_{t_1,\dots,t_{s-1}}$ and $C_{t_1,\dots,t_{s-1}}$ in Eq. (3.34) and Eq. (3.48), respectively, we conclude, using mathematical induction, that
$$B_{t_1,\dots,t_{s-1}} \le C_{t_1,\dots,t_{s-1}}, \quad s \in \{2, \dots, d\}. \qquad (3.52)$$
Clearly, from Eqs. (3.50), (3.47) and (3.52), we have
$$1 - \gamma_{t_1,\dots,t_{s-1}} + B_{t_1,\dots,t_{s-1}} \le 1 - \hat{Q}_{t_1,\dots,t_{s-1}} + A_{t_1,\dots,t_{s-1}} + C_{t_1,\dots,t_{s-1}}, \qquad (3.53)$$
and the simulation error corresponding to the approximation error follows from Eq. (3.41) and the foregoing equation:
$$E_{sapp}(d) = \sum_{s=1}^{d} (L_1-1) \cdots (L_s-1) \sum_{t_1,\dots,t_{s-1} \in \{2,3\}} F_{t_1,\dots,t_{s-1}}\left(1 - \hat{Q}_{t_1,\dots,t_{s-1},2} + A_{t_1,\dots,t_{s-1},2} + C_{t_1,\dots,t_{s-1},2}\right)^2, \qquad (3.54)$$
where for $s = 1$: $\sum_{t_1,t_0 \in \{2,3\}} x = x$, $F_{t_1,t_0} = F$, $\hat{Q}_{t_1,t_0,2} = \hat{Q}_2$, $A_{t_1,t_0,2} = A_2$ and $C_{t_1,t_0,2} = C_2$. The total error is obtained by adding the two simulation error terms from Eq. (3.45) and Eq. (3.54),
$$E_{total}(d) = E_{sf}(d) + E_{sapp}(d). \qquad (3.55)$$
To efficiently evaluate Eq. (3.55), one needs to find suitable values for the error bounds $\beta_{t_1,\dots,t_d}$. The bounds from Eq. (3.43), provided by the Central Limit Theorem, have been used in [Haiman and Preda, 2006] for the two dimensional case. As the authors pointed out, the main contribution to the total error is due to the simulation error $E_{sf}(d)$, especially for small sizes of the scanning window with respect to the scanned region. Our numerical study shows that these error bounds are not feasible for the scan problem in more than three dimensions, the simulation error associated with the approximation formula, $E_{sf}(d)$, being too large with respect to the other error, $E_{sapp}(d)$. Thus, for the simulation of $\hat{Q}_{t_1,\dots,t_d}$, we use an importance sampling technique introduced in [Naiman and Wynn, 1997]. The next section illustrates how to adapt this simulation method to our problem.
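The two quantities that drive the total error are easy to assemble in code; the following sketch (illustrative only) computes the CLT bound of Eq. (3.43) and the dominant term $E_{sf}(d)$ of Eq. (3.45) from a dictionary of bounds indexed by the tuples $(t_1, \dots, t_d) \in \{2,3\}^d$.

```python
# CLT bounds beta, Eq. (3.43), and the simulation error E_sf(d), Eq. (3.45).
import itertools
import numpy as np

def beta_clt(qhat: float, iters: int) -> float:
    """95% CLT bound for a simulated probability, Eq. (3.43)."""
    return 1.96 * np.sqrt(qhat * (1.0 - qhat) / iters)

def E_sf(betas: dict, L: list) -> float:
    """E_sf(d) = (L_1 - 1)...(L_d - 1) * sum of beta_{t_1,...,t_d}."""
    d = len(L)
    prod_L = np.prod([Lj - 1 for Lj in L])
    return prod_L * sum(betas[t] for t in itertools.product((2, 3), repeat=d))
```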

3.4 Simulation using Importance Sampling

Scan statistic-based tests are usually employed when dealing with the detection of an unusually large cluster of events (for example, detection of a bioterrorist attack, of a brain tumor, minefield reconnaissance, etc.). Normally, in such problems, the practitioner wants to find $p$-values that involve small tail probabilities (smaller than 0.1). Therefore, the problem is to estimate the high order quantiles of the $d$-dimensional scan statistics. To solve this problem, we propose the approximation developed in Section 3.2. To check the accuracy of our proposed approximation, we need a suitable method to estimate the quantities involved in our formula, namely the $Q_{t_1,\dots,t_d}$, $t_j \in \{2, 3\}$, $j \in \{1, \dots, d\}$. For comparison reasons, we also want an alternative method to evaluate the scan statistics over the whole region. A direct approach to this problem is to use the naive (hit-or-miss) Monte Carlo, as described below. Let $X^{(i)} = \left\{X^{(i)}_{s_1,s_2,\dots,s_d},\ s_j \in \{1, \dots, T_j\},\ j \in \{1, \dots, d\}\right\}$, with $1 \le i \le ITER$, be $ITER$ independent realizations of the underlying random field under the null hypothesis. For each realization, we compute the $d$-dimensional scan statistics $S^{(i)}_{\mathbf{m}}(\mathbf{T})$ and we define
$$\hat{p}_{MC} = \frac{1}{ITER}\sum_{i=1}^{ITER} 1_{\left\{S^{(i)}_{\mathbf{m}}(\mathbf{T}) \ge \tau\right\}} \qquad (3.56)$$
and
$$\widehat{s.e.}_{MC} = \sqrt{\frac{\hat{p}_{MC}\left(1 - \hat{p}_{MC}\right)}{ITER}}, \qquad (3.57)$$
the unbiased direct Monte Carlo estimate of $p = P\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right)$ and its consistent standard error estimate. As we will see from the numerical results, this approach is computationally intensive, since just a fraction of the generated observations will cause a rejection, and thus many replications are necessary in order to reduce the standard error estimate to an acceptable level (especially for $d \ge 2$). In general, there are many variance reduction techniques (see for example [Ross, 2012, Chapters 9 and 10]) that can be used to improve the efficiency of the naive Monte Carlo approach. In the following, we present a variance reduction method based on importance sampling that provides more accurate estimates.

Generalities on importance sampling

In this subsection, we introduce some basic theoretical aspects of the importance sampling technique. This simulation method is usually employed when dealing with rare event probabilities and often leads to substantial variance reduction (see [Rubino and Tuffin, 2009]). Expositions on importance sampling can be found in [Ross, 2012], [Rubinstein and Kroese, 2008] or [Fishman, 1996]. Suppose that we want to estimate, by Monte Carlo methods, the expectation of a

function $G(W)$ of a random vector $W$ having a joint density function $f$. Let
$$\theta = E_f[G(W)] = \int G(x)f(x)\,dx \qquad (3.58)$$
be the expectation to be determined; in Eq. (3.58) the subscript $f$ in $E_f$ emphasizes that the expectation is taken with respect to the density $f$. The naive Monte Carlo approach suggests drawing $N$ independent samples $W^{(1)}, W^{(2)}, \dots, W^{(N)}$ from the density $f(x)$ and taking
$$\hat{\theta}_{MC} = \frac{1}{N}\sum_{i=1}^{N} G\left(W^{(i)}\right) \qquad (3.59)$$
as an estimate for $\theta$. This approach may prove ineffective in many situations: one possible cause is that we cannot simulate random vectors from the distribution of $W$, another is that the variance of $G(W)$ is too large (see [Ross, 2012] for further discussion). To overcome such problems, it may be useful to introduce another probability density function $g$ such that $Gf$ is dominated by $g$ (that is, if $g(x) = 0$ then $G(x)f(x) = 0$, to keep the estimator unbiased) and to give the following alternative expression for $\theta$:
$$\theta = \int \frac{G(x)f(x)}{g(x)}\,g(x)\,dx = E_g\left[\frac{G(W)f(W)}{g(W)}\right]. \qquad (3.60)$$
Consequently, if we now draw $N$ i.i.d. samples $W^{(1)}, W^{(2)}, \dots, W^{(N)}$ from the density $g(x)$, then
$$\hat{\theta}_{IS} = \frac{1}{N}\sum_{i=1}^{N} \frac{G\left(W^{(i)}\right) f\left(W^{(i)}\right)}{g\left(W^{(i)}\right)} \qquad (3.61)$$
is an unbiased estimator for $\theta$. If the density function $g$ can be chosen such that the random variable $\frac{G(W)f(W)}{g(W)}$ has a small variance (the likelihood ratio $f/g \approx 1$), then $\hat{\theta}_{IS}$ is an efficient estimator. In the particular case when one wants to estimate the probability $\theta = P(W \in A)$, that is $G(W) = 1_{\{W \in A\}}$, a good choice of $g$ is such that $g \ge f$ on the set $A$. To see this, compare the exact variances of the estimators in the direct ($\hat{\theta}_{MC}$) and importance sampling ($\hat{\theta}_{IS}$) approaches:
$$Var\left[\hat{\theta}_{IS}\right] = \frac{1}{N}Var_g\left[\frac{f(W)}{g(W)}1_{\{W \in A\}}\right] = \frac{1}{N}\left[E_g\left[\frac{f^2(W)}{g^2(W)}1_{\{W \in A\}}\right] - \theta^2\right] \le \frac{1}{N}\left[E_g\left[\frac{f(W)}{g(W)}1_{\{W \in A\}}\right] - \theta^2\right] = \frac{1}{N}\left[\int_A f(x)\,dx - \theta^2\right] = \frac{1}{N}Var_f\left[1_{\{W \in A\}}\right] = Var\left[\hat{\theta}_{MC}\right].$$
In general, finding a good change of measure $g$ that leads to an efficient sampling process can be difficult (see [Rubino and Tuffin, 2009]). Fortunately, for the problem at hand there are efficient changes of measure, and we discuss one of them in the next subsection.
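The estimator of Eq. (3.61) is easily illustrated on a toy rare-event problem; the sketch below (not from the thesis; the shifted-mean proposal is a standard choice for a Gaussian tail) estimates $\theta = P(W > a)$ for $W \sim \mathcal{N}(0,1)$ with proposal $g = \mathcal{N}(a, 1)$.

```python
# Generic importance sampling, Eq. (3.61), for theta = P(W > a), W ~ N(0, 1),
# with proposal g = N(a, 1); then f(w)/g(w) = exp(-a*w + a^2/2).
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
a, N = 4.0, 100_000

W = rng.normal(loc=a, scale=1.0, size=N)          # samples from g
log_ratio = -a * W + a * a / 2.0                  # log f(W) - log g(W)
theta_is = np.mean((W > a) * np.exp(log_ratio))   # the IS estimator

print(theta_is, 0.5 * erfc(a / sqrt(2)))          # both ~ 3.17e-5
```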

Importance sampling for scan statistics

As we saw in the foregoing section, the general idea behind the importance sampling technique is to change the distribution to be sampled from, in such a way that the new estimator remains unbiased. In this subsection, we present an importance sampling approach for the estimation of the significance level of hypothesis tests based on $d$-dimensional scan statistics. This method was introduced by [Naiman and Priebe, 2001] and was successfully used to solve the problem of exceedance probabilities, that is, the probability that one or more test statistics exceed some given threshold. Their methodology builds upon a procedure introduced by [Frigessi and Vercellis, 1984] to solve the union count problem (see also [Fishman, 1996, pag. 261]). Assume that we are in the framework of the $d$-dimensional scan statistics described in Section 3.1. As we previously saw, the scan statistic random variable $S_{\mathbf{m}}(\mathbf{T})$ is usually employed for testing the null hypothesis of randomness $H_0$ against a clustering alternative $H_1$. By the generalized likelihood ratio test, the null hypothesis $H_0$ is rejected in favor of the local change alternative $H_1$ whenever $S_{\mathbf{m}}(\mathbf{T})$ is sufficiently large (see [Glaz and Naus, 1991, Section 3] for an outline of the proof for $d = 1$, or [Glaz et al., 2001, Chapter 13]). If $\tau$ denotes the observed value of the test statistic from the actual data, then we want to find the $p$-value
$$p = P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right). \qquad (3.62)$$
The main idea behind the importance sampling approach in the scan statistics setting is to sample only data for which the rejection of $H_0$ occurs, and then to determine the collection of all the locality statistics that generate a rejection. We describe this method in what follows. Let $E_{i_1,\dots,i_d}$, for $1 \le i_j \le T_j - m_j + 1$, $j \in \{1, \dots, d\}$, denote the event that $Y_{i_1,\dots,i_d}$ exceeds the threshold $\tau$. We are interested in evaluating the probability
$$P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right) = P\left(\bigcup_{i_1=1}^{T_1-m_1+1} \cdots \bigcup_{i_d=1}^{T_d-m_d+1} E_{i_1,\dots,i_d}\right). \qquad (3.63)$$
With the notations of the preceding section, the above equation can be rewritten as
$$\theta = \int G(x)f(x)\,dx, \qquad (3.64)$$
where $\theta = P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right)$, $G(x) = 1_E(x)$, $E = \bigcup_{i_1=1}^{T_1-m_1+1} \cdots \bigcup_{i_d=1}^{T_d-m_d+1} E_{i_1,\dots,i_d}$ and $f$ is the joint density of the $Y_{i_1,\dots,i_d}$'s under the null hypothesis. Let
$$B(d) = \sum_{i_1=1}^{T_1-m_1+1} \cdots \sum_{i_d=1}^{T_d-m_d+1} P\left(E_{i_1,\dots,i_d}\right) \qquad (3.65)$$

denote the usual Bonferroni upper bound for $P(E)$. Under the null hypothesis, due to stationarity, this bound becomes
$$B(d) = (T_1 - m_1 + 1) \cdots (T_d - m_d + 1)\, P\left(E_{1,\dots,1}\right). \qquad (3.66)$$
The probability in Eq. (3.63) can be expressed as
$$P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right) = \int 1_E\,dP_{H_0} = \sum_{j_1=1}^{T_1-m_1+1} \cdots \sum_{j_d=1}^{T_d-m_d+1} \int \frac{1_{E_{j_1,\dots,j_d}}}{C(Y)}\,1_E\,dP_{H_0} = \sum_{j_1,\dots,j_d} P\left(E_{j_1,\dots,j_d}\right) \int \frac{1}{C(Y)}\,dP_{H_0}\left(\cdot \mid E_{j_1,\dots,j_d}\right) = B(d)\sum_{j_1=1}^{T_1-m_1+1} \cdots \sum_{j_d=1}^{T_d-m_d+1} p_{j_1,\dots,j_d} \int \frac{1}{C(Y)}\,dP_{H_0}\left(\cdot \mid E_{j_1,\dots,j_d}\right), \qquad (3.67)$$
where we used the fact that, on the event $E$, $\sum_{j_1,\dots,j_d} 1_{E_{j_1,\dots,j_d}} = C(Y)$, and where, by stationarity, $p_{j_1,\dots,j_d}$ defines a uniform probability distribution over $\{1, \dots, T_1 - m_1 + 1\} \times \dots \times \{1, \dots, T_d - m_d + 1\}$,
$$p_{j_1,\dots,j_d} = \frac{P\left(E_{j_1,\dots,j_d}\right)}{\sum_{i_1=1}^{T_1-m_1+1} \cdots \sum_{i_d=1}^{T_d-m_d+1} P\left(E_{i_1,\dots,i_d}\right)} = \frac{1}{(T_1 - m_1 + 1) \cdots (T_d - m_d + 1)}, \qquad (3.68)$$
and where $C(Y)$ represents the number of $d$-tuples $(i_1, \dots, i_d)$ for which the exceedance $Y_{i_1,\dots,i_d} \ge \tau$ occurs, that is
$$C(Y) = \sum_{i_1=1}^{T_1-m_1+1} \cdots \sum_{i_d=1}^{T_d-m_d+1} 1_{E_{i_1,\dots,i_d}}. \qquad (3.69)$$

Remark that, from the above identity in Eq. (3.67), the importance sampling change of measure $g$ is a mixture of conditional distributions, given by
$$g(x) = \sum_{j_1=1}^{T_1-m_1+1} \cdots \sum_{j_d=1}^{T_d-m_d+1} \frac{P\left(E_{j_1,\dots,j_d}\right)}{B(d)} \cdot \frac{1_{E_{j_1,\dots,j_d}}(x)\, f(x)}{P\left(E_{j_1,\dots,j_d}\right)}. \qquad (3.70)$$
From Eq. (3.62) and Eq. (3.67), we observe that our $p$-value can be expressed as the conservative Bonferroni bound $B(d)$ times a correction factor $\rho(d)$ between 0 and 1, with
$$\rho(d) = \sum_{j_1=1}^{T_1-m_1+1} \cdots \sum_{j_d=1}^{T_d-m_d+1} p_{j_1,\dots,j_d} \int \frac{1}{C(Y)}\,dP_{H_0}\left(\cdot \mid E_{j_1,\dots,j_d}\right). \qquad (3.71)$$
Notice that one can define a larger class of importance sampling algorithms based on higher order inclusion-exclusion identities, generalizing the foregoing procedure, as provided in [Naiman and Wynn, 1997, Section 4]. The correction factor that appears in Eq. (3.71) can be estimated with the following algorithm:

Algorithm 1 Importance Sampling Algorithm for Scan Statistics
Begin
Repeat for each $k$ from 1 to $ITER$ (iterations number)
1: Generate uniformly the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$ from the set $\{1, \dots, T_1 - m_1 + 1\} \times \dots \times \{1, \dots, T_d - m_d + 1\}$.
2: Given the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$, generate a sample of the random field $\tilde{X}^{(k)} = \{\tilde{X}^{(k)}_{s_1,s_2,\dots,s_d}\}$, with $s_j \in \{1, \dots, T_j\}$ and $j \in \{1, \dots, d\}$, from the conditional distribution of $X$ given $\{Y_{i^{(k)}_1,\dots,i^{(k)}_d} \ge \tau\}$.
3: Take $c_k = C(\tilde{X}^{(k)})$, the number of all $d$-tuples $(i_1, \dots, i_d)$ for which $\tilde{Y}_{i_1,\dots,i_d} \ge \tau$, and put $\rho_k(d) = \frac{1}{c_k}$.
End Repeat
Return $\hat{\rho}(d) = \frac{1}{ITER}\sum_{k=1}^{ITER} \rho_k(d)$.
End

Clearly, $\hat{\rho}(d)$ is an unbiased estimator for $\rho(d)$, with estimated variance
$$\widehat{Var}\left[\hat{\rho}(d)\right] = \frac{1}{ITER - 1}\sum_{i=1}^{ITER}\left(\rho_i(d) - \frac{1}{ITER}\sum_{k=1}^{ITER}\rho_k(d)\right)^2. \qquad (3.72)$$
For $ITER$ sufficiently large, as a consequence of the Central Limit Theorem, the error between the true and the estimated value of the tail probability $P\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right)$, corresponding to a 95% confidence level, is given by
$$\beta = 1.96\, B(d)\sqrt{\frac{\widehat{Var}\left[\hat{\rho}(d)\right]}{ITER}}. \qquad (3.73)$$
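For concreteness, here is a runnable sketch of Algorithm 1 in the two dimensional Bernoulli case (hypothetical parameter values; the thesis implementation is in Matlab). It anticipates the binomial example and the cumulative counts technique discussed below: for Bernoulli cells, conditioning the window on $\{Y = t\}$ amounts to placing $t$ ones uniformly among its $m_1 m_2$ cells.

```python
# Sketch of Algorithm 1 for a 2-d Bernoulli(p) field: p_hat = B(d) * rho_hat.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
T1, T2, m1, m2, p, tau, ITER = 60, 60, 5, 5, 0.01, 4, 10_000

w = m1 * m2
tail = binom.sf(tau - 1, w, p)                  # P(Y_{1,1} >= tau)
N = (T1 - m1 + 1) * (T2 - m2 + 1)
B = N * tail                                    # Bonferroni bound, Eq. (3.66)
ts = np.arange(tau, w + 1)                      # levels t >= tau
r = binom.pmf(ts, w, p) / tail                  # r(t), Eq. (3.75)
r = r / r.sum()

def count_exceed(X):
    """Number of windows with count >= tau, via cumulative counts."""
    M = np.pad(X.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    Y = M[m1:, m2:] - M[m1:, :-m2] - M[:-m1, m2:] + M[:-m1, :-m2]
    return np.count_nonzero(Y >= tau)

rho = np.empty(ITER)
for k in range(ITER):
    i1 = rng.integers(0, T1 - m1 + 1)           # Step 1: uniform window
    i2 = rng.integers(0, T2 - m2 + 1)
    X = (rng.random((T1, T2)) < p).astype(int)  # null field outside the window
    t = rng.choice(ts, p=r)                     # Step 2a: conditional level t
    cells = np.zeros(w, dtype=int)              # Step 2b: t ones in the window
    cells[rng.choice(w, size=t, replace=False)] = 1
    X[i1:i1 + m1, i2:i2 + m2] = cells.reshape(m1, m2)
    rho[k] = 1.0 / count_exceed(X)              # Step 3: rho_k = 1 / c_k
p_hat = B * rho.mean()
err = 1.96 * B * rho.std(ddof=1) / np.sqrt(ITER)  # as in Eq. (3.73)
print(p_hat, err)
```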

Notice that, for the simulation of $Q_{t_1,\dots,t_d}$, we substitute $T_1, \dots, T_d$ in the above algorithm by $t_1(m_1-1), \dots, t_d(m_d-1)$, respectively; we thus obtain the corresponding values of $\beta_{t_1,\dots,t_d}$ as described by Eq. (3.73). In Figure 3.3, we present the difference between the naive Monte Carlo and the importance sampling method in the particular case of a three dimensional scan statistic: we evaluate the simulation error corresponding to $P\left(S_{5,5,5}(60, 60, 60) \ge 2\right)$ in the Bernoulli model with parameter $p$. For the Monte Carlo approach, we used numbers of replications in the range $\{10^6, \dots, 10^7\}$, while for the importance sampling algorithm the range was $\{10^5, \dots, 10^6\}$. We observe that, with the hit-or-miss Monte Carlo approach, the simulation error is rather large, even for $10^7$ iterations.

[Figure 3.3: The evolution of the simulation error in the MC and IS methods, for $P(S_{5,5,5}(60,60,60) \ge 2)$; MC with $ITER \times 10^6$ and IS with $ITER \times 10^5$ iterations.]

Computational aspects

The algorithm discussed in the previous section presents two implementation difficulties: first, it assumes that one is able to efficiently generate the underlying random field $X$ from the required conditional distribution (see Step 2), and second, the number of locality statistics that exceed the predetermined threshold is supposed to be found in a reasonable time. We address these problems separately. The problem of sampling from the conditional distribution in Step 2 of the algorithm depends on the initial distribution of the i.i.d. random variables $X_{s_1,s_2,\dots,s_d}$, $s_j \in \{1, \dots, T_j\}$, $j \in \{1, \dots, d\}$. To illustrate the procedure, we consider two

examples for the random field distribution: binomial with parameters $\nu$ and $p$, and Gaussian with known mean $\mu$ and variance $\sigma^2$.

Example 3.4.1 (Binomial model). In this example we consider that the $X_{s_1,s_2,\dots,s_d}$'s are i.i.d. binomial $\mathcal{B}(\nu, p)$ random variables. Clearly, the random variables $Y_{i_1,\dots,i_d}$ are also binomially distributed, with parameters $w = \nu m_1 \cdots m_d$ and $p$. The main idea is to generate a sample not from the conditional distribution given $\left\{Y_{i^{(k)}_1,\dots,i^{(k)}_d} \ge \tau\right\}$, as described in Step 2 of the algorithm, but from the conditional distribution given $\left\{Y_{i^{(k)}_1,\dots,i^{(k)}_d} = t\right\}$, for some value $t \ge \tau$. This can be achieved since, if we define the events $G_{j_1,\dots,j_d}(t) = \{Y_{j_1,\dots,j_d} = t\}$, for $t \in \{\tau, \dots, w\}$ and $1 \le j_i \le T_i - m_i + 1$, $i \in \{1, \dots, d\}$, then the events $E_{j_1,\dots,j_d}$ can be expressed as
$$E_{j_1,\dots,j_d} = \bigcup_{t=\tau}^{w} G_{j_1,\dots,j_d}(t)$$
and Eq. (3.67) can be rewritten as
$$P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right) = B(d)\sum_{j_1=1}^{T_1-m_1+1} \cdots \sum_{j_d=1}^{T_d-m_d+1} p_{j_1,\dots,j_d} \sum_{t=\tau}^{w} r_{j_1,\dots,j_d}(t) \int \frac{1}{C(Y)}\,dP_{H_0}\left(\cdot \mid G_{j_1,\dots,j_d}(t)\right). \qquad (3.74)$$
The weights $r_{j_1,\dots,j_d}(t)$ do not depend on the indices $j_1, \dots, j_d$, due to the stationarity of the random field, and are computed from the relation
$$r_{j_1,\dots,j_d}(t) = \frac{P\left(Y_{j_1,\dots,j_d} = t\right)}{P\left(Y_{j_1,\dots,j_d} \ge \tau\right)} = \frac{P\left(Y_{1,\dots,1} = t\right)}{P\left(Y_{1,\dots,1} \ge \tau\right)}. \qquad (3.75)$$
To generate a sample from the conditional distribution that appears in Eq. (3.74) reduces to showing how one can sample a vector $u = (u_1, \dots, u_l)$ of size $l = m_1 \cdots m_d$, satisfying $u_1 + \dots + u_l = t$ and $0 \le u_i \le \nu$, from the conditional law of the window entries given that their sum equals $t$; the support of this law is the set of all such vectors, denoted $\Gamma(l, t, \nu)$. This can be achieved by the use of urn models. Assume that we have $l$ urns with $\nu$ balls each, a null vector $u$ of length $l$, and an integer number $t$ of draws without replacement. At the $i$-th step ($i \le t$), we choose uniformly one of the balls still present in the urns (that is, in the urns that are not yet empty at this step), draw it without replacement, and add one to the component of the vector $u$ that corresponds to the index of the chosen urn. We repeat the procedure for $t$ steps and obtain a vector $u$ whose components take values between 0 and $\nu$ and sum to $t$. According to the above observations, the second step in Algorithm 1 can be rewritten in the following way:

Step 2a: Generate a threshold value $t \ge \tau$ from the distribution $r_{1,\dots,1}(t) = \frac{P(Y_{1,\dots,1} = t)}{P(Y_{1,\dots,1} \ge \tau)}$.
Step 2b: Conditionally, given $t$ and the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$, generate $\tilde{X}^{(k)}_{s_1,s_2,\dots,s_d}$, for $i^{(k)}_j \le s_j \le i^{(k)}_j + m_j - 1$, $j \in \{1, \dots, d\}$, from the set $\Gamma(m_1 \cdots m_d, t, \nu)$ as above, and take the remaining $\tilde{X}_{s_1,s_2,\dots,s_d}$ distributed according to the null distribution $P_{H_0}$.

It is interesting to remark that the foregoing procedure can be applied, with small modifications, to a larger class of discrete integer valued random variables: Poisson, geometric, negative binomial, etc.
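A short sketch of Steps 2a-2b for the binomial model follows (illustrative, not the thesis code). Drawing the $t$ balls uniformly at random among the $l\nu$ balls reproduces the law of i.i.d. $\mathcal{B}(\nu, p)$ cells conditioned on their sum (a multivariate hypergeometric count vector; for $\nu = 1$ this is also the uniform law on $\Gamma(l, t, 1)$).

```python
# Steps 2a-2b for the binomial model: level t from r(t), then the window
# conditionally on {sum = t} via uniform ball draws without replacement.
import numpy as np
from scipy.stats import binom

def sample_window(l, nu, p, tau, rng):
    w = l * nu
    ts = np.arange(tau, w + 1)
    r = binom.pmf(ts, w, p) / binom.sf(tau - 1, w, p)   # Step 2a, Eq. (3.75)
    t = rng.choice(ts, p=r / r.sum())
    balls = rng.choice(w, size=t, replace=False)        # Step 2b: t balls
    u = np.bincount(balls // nu, minlength=l)           # u_i = balls from urn i
    return u                                            # 0 <= u_i <= nu, sum = t

rng = np.random.default_rng(3)
print(sample_window(l=9, nu=3, p=0.05, tau=5, rng=rng))
```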

Example 3.4.2 (Gaussian model). In this example we consider that the underlying random field is generated by independent and identically distributed normal random variables with known mean $\mu$ and variance $\sigma^2$, $X_{s_1,\dots,s_d} \sim \mathcal{N}(\mu, \sigma^2)$. Since the random variables $Y_{i_1,\dots,i_d}$ are sums of i.i.d. normals, they clearly follow a multivariate normal distribution, with mean and covariance matrix given by the next lemma, whose proof can be found in Appendix B:

Lemma. Let $X_{s_1,\dots,s_d} \sim \mathcal{N}(\mu, \sigma^2)$ be i.i.d. random variables for all $1 \le s_j \le T_j$, $j \in \{1, \dots, d\}$. The random variables $Y_{i_1,\dots,i_d}$ defined by Eq. (3.3) follow a multivariate normal distribution with mean $\bar{\mu} = m_1 \cdots m_d\,\mu$ and covariance matrix $\Sigma = \left(Cov\left[Y_{i_1,\dots,i_d}, Y_{j_1,\dots,j_d}\right]\right)$, given by
$$Cov\left[Y_{i_1,\dots,i_d}, Y_{j_1,\dots,j_d}\right] = \begin{cases} \left(m_1 - |i_1 - j_1|\right) \cdots \left(m_d - |i_d - j_d|\right)\sigma^2, & |i_s - j_s| < m_s,\ s \in \{1, \dots, d\}, \\ 0, & \text{otherwise.} \end{cases} \qquad (3.76)$$

Notice that Step 2 of Algorithm 1 requires one to sample $Y_{i^{(k)}_1,\dots,i^{(k)}_d}$ from the tail distribution $P\left(Y_{i^{(k)}_1,\dots,i^{(k)}_d} \ge \tau\right)$ and, for the other indices, from the conditional distribution given $\left\{Y_{i^{(k)}_1,\dots,i^{(k)}_d} \ge \tau\right\}$. For generating a sample from the tail of a normal variable, we use the acceptance-rejection algorithm proposed in the classical paper of [Marsaglia, 1963] (see also [Devroye, 1986, pag. 380] for a faster alternative based on exponential random variables). For the second part, we propose two alternative methods (only the second one can be successfully applied for $d \ge 2$). The first variant, which can be viewed as a direct approach, is to use the standard method of sampling from the posterior normal distribution of the unobserved components of a multivariate Gaussian vector, partitioned into observed and unobserved elements, given the observed ones. To apply this method in our context, we first need to arrange all the random variables $Y_{i_1,\dots,i_d}$ in an ordered sequence, that is, to give a bijection between the set of all $d$-tuples $(i_1, \dots, i_d)$ and the set $\{1, \dots, (T_1 - m_1 + 1) \cdots (T_d - m_d + 1)\}$.¹ For such a bijection, we denote the resulting sequence by $Z = (Z_1, \dots, Z_N)$, where $N = (T_1 - m_1 + 1) \cdots (T_d - m_d + 1)$. If we consider that $(i^{(k)}_1, \dots, i^{(k)}_d)$ is mapped to $l$, then, to generate $Z$ given $Z_l \ge \tau$, we partition it into $Z = (W_1, Z_l, W_2)$, where the observed component is $Z_l$ and the unobserved ones are $W_1 = (Z_1, \dots, Z_{l-1})$ and $W_2 = (Z_{l+1}, \dots, Z_N)$. It is easy to establish (see for example [Tong, 1990, Chapter 3]) that, given $Z_l = t$, for some $t \ge \tau$ obtained by sampling from the tail distribution of $Z_l$, we have
$$\bar{W}_1 = \left(W_1 \mid Z_l = t\right) \sim \mathcal{N}\left(\mu_{w_1}(t), \Sigma_{w_1}(t)\right) \quad \text{and} \quad \bar{W}_2 = \left(W_2 \mid Z_l = t\right) \sim \mathcal{N}\left(\mu_{w_2}(t), \Sigma_{w_2}(t)\right), \qquad (3.77)$$
where for $i \in \{1, 2\}$,
$$\mu_{w_i}(t) = E[W_i] + \frac{1}{Var[Z_l]}\,Cov[W_i, Z_l]\left(t - E[Z_l]\right), \qquad (3.78)$$
$$\Sigma_{w_i}(t) = Cov(W_i) - \frac{1}{Var[Z_l]}\,Cov[W_i, Z_l]\,Cov^T[W_i, Z_l]. \qquad (3.79)$$
The covariance matrices that appear in the above equations can be computed using the result of the foregoing lemma. Notice that, in order to sample from $\mathcal{N}\left(\mu_{w_i}(t), \Sigma_{w_i}(t)\right)$, one has to consider the Cholesky decomposition of $\Sigma_{w_i}(t)$ and take
$$\bar{W}_i = \mu_{w_i}(t) + Chol\left(\Sigma_{w_i}(t)\right) U_i, \qquad (3.80)$$
with $U_i \sim \mathcal{N}(0, I)$ vectors of independent standard normal random variables. Since computing the Cholesky decomposition at each iteration step can be time consuming, we propose an alternative method for sampling from the posterior of the Gaussian vector $Z$. This method was introduced by [Hoffman and Ribak, 1991] (see also the note of [Doucet, 2010]) and requires only the ability to simulate a random vector from the prior distribution. The algorithm can be summarized as follows:

Generate $\tilde{Z} \sim \mathcal{N}(\bar{\mu}, \Sigma)$;
Take $\bar{W}_i = \tilde{W}_i + \frac{1}{Var[Z_l]}\,Cov[W_i, Z_l]\left(t - \tilde{Z}_l\right)$.

The validity of the foregoing algorithm is presented in Appendix B. We remark that in the above procedure we need to compute the Cholesky decomposition of $\Sigma$ only once, thus reducing the execution time of Algorithm 1. This approach can be successfully applied in the case of the one dimensional discrete scan statistics. Nevertheless, for higher dimensions ($d \ge 2$), the covariance matrix is very large and storing it requires a large amount of memory. Take, for example, the two dimensional setting with $T_1 = T_2 = 200$ and $m_1 = m_2 = 10$: here $N = (T_1 - m_1 + 1)(T_2 - m_2 + 1) = 191^2 = 36481$, so the covariance matrix is of size $36481 \times 36481$ and storing it, in Matlab for example, would require almost 10 Gb of RAM. Increasing the dimensions of the problem and/or the sizes of the region $\mathcal{R}_d$ to be scanned shows that this approach is unfeasible for $d \ge 2$.

¹Such a bijection is given, for instance, by $f(i_1, \dots, i_d) = \sum_{s=1}^{d-1}(i_s - 1)N_{s+1} \cdots N_d + i_d$, where $N_s = T_s - m_s + 1$.
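The [Hoffman and Ribak, 1991] step above is a one-line correction of a prior draw; here is a minimal sketch on a small, hypothetical covariance matrix (the Cholesky factor is computed once, outside the sampling loop).

```python
# Hoffman-Ribak conditional sampling: draw Z | Z_l = t from one prior draw.
import numpy as np

def conditional_draw(mu, Sigma, chol, l, t, rng):
    """One draw of Z | Z_l = t for Z ~ N(mu, Sigma); chol = Cholesky(Sigma)."""
    Z = mu + chol @ rng.standard_normal(len(mu))       # prior draw
    return Z + Sigma[:, l] / Sigma[l, l] * (t - Z[l])  # Hoffman-Ribak correction

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.6, 0.1], [0.6, 1.0, 0.3], [0.1, 0.3, 1.5]])
chol = np.linalg.cholesky(Sigma)
Z = conditional_draw(np.zeros(3), Sigma, chol, l=1, t=2.5, rng=rng)
print(Z)   # Z[1] == 2.5 exactly; the other coordinates follow the posterior law
```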

To overcome the above difficulties, we present a second method for generating the random field $\tilde{X}^{(k)} = \{\tilde{X}^{(k)}_{s_1,s_2,\dots,s_d}\}$, with $s_j \in \{1, \dots, T_j\}$ and $j \in \{1, \dots, d\}$, from the conditional distribution given the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$ and $\left\{Y_{i^{(k)}_1,\dots,i^{(k)}_d} \ge \tau\right\}$. The idea behind this method is to generate directly the underlying random field, and not the random variables $Y_{i_1,\dots,i_d}$ as in the previous approach; it is based on the following result:

Lemma. In the usual $d$ dimensional setting, let $X_{s_1,s_2,\dots,s_d}$, for all $s_j \in \{1, \dots, T_j\}$ and $j \in \{1, \dots, d\}$, be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ random variables. If $w = m_1 \cdots m_d$, then, conditionally given the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$ and $\left\{Y_{i^{(k)}_1,\dots,i^{(k)}_d} = t\right\}$, the random variables $X_{s_1,s_2,\dots,s_d}$ are jointly distributed as the random variables $\tilde{X}_{s_1,s_2,\dots,s_d}$, where
$$\tilde{X}_{s_1,s_2,\dots,s_d} = \frac{t}{w} + X_{s_1,s_2,\dots,s_d} - \frac{1}{w}\,Y_{i^{(k)}_1,\dots,i^{(k)}_d}, \quad (s_1, \dots, s_d) \in \Gamma_{i^{(k)}_1,\dots,i^{(k)}_d}, \qquad (3.81)$$
$$\tilde{X}_{s_1,s_2,\dots,s_d} = X_{s_1,s_2,\dots,s_d}, \quad (s_1, \dots, s_d) \notin \Gamma_{i^{(k)}_1,\dots,i^{(k)}_d}, \qquad (3.82)$$
and where
$$Y_{i^{(k)}_1,\dots,i^{(k)}_d} = \sum_{(s_1,\dots,s_d) \in \Gamma_{i^{(k)}_1,\dots,i^{(k)}_d}} X_{s_1,s_2,\dots,s_d}, \qquad (3.83)$$
with $\Gamma_{i^{(k)}_1,\dots,i^{(k)}_d} = \left\{(r_1, \dots, r_d) \,\middle|\, i^{(k)}_j \le r_j \le i^{(k)}_j + m_j - 1,\ 1 \le j \le d\right\}$.

The result of this lemma leads to the following modification of the second step in Algorithm 1:

Step 2a: Given the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$, generate a value $Y_{i^{(k)}_1,\dots,i^{(k)}_d} = t$ from the tail distribution $P\left(Y_{i^{(k)}_1,\dots,i^{(k)}_d} \ge \tau\right)$.
Step 2b: Conditionally, given the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$ and $Y_{i^{(k)}_1,\dots,i^{(k)}_d} = t$, generate the random field $\tilde{X} = \{\tilde{X}_{s_1,s_2,\dots,s_d}\}$, with $s_j \in \{1, \dots, T_j\}$ and $j \in \{1, \dots, d\}$, based on Eqs. (3.81)-(3.83).

The advantage of this method is that we do not need to compute any covariance matrix, which solves the memory problem. It is also important to notice the scalability of these formulas to higher dimensions.
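The field-level conditioning of Eqs. (3.81)-(3.83) amounts to shifting only the window cells by $(t - Y)/w$; the sketch below (with hypothetical sizes and window position) makes this explicit.

```python
# Steps 2a-2b of the Gaussian model at field level, Eqs. (3.81)-(3.83):
# shift the window cells by (t - Y)/w; no covariance matrix is needed.
import numpy as np

rng = np.random.default_rng(5)
T1, T2, m1, m2, mu, sigma = 50, 50, 5, 5, 0.0, 1.0
i1, i2, t = 10, 20, 30.0            # window position and conditioned level t

X = rng.normal(mu, sigma, size=(T1, T2))       # unconditioned null field
w = m1 * m2
win = np.s_[i1:i1 + m1, i2:i2 + m2]
Y = X[win].sum()                               # Eq. (3.83)
X[win] += (t - Y) / w                          # Eqs. (3.81)-(3.82)
assert np.isclose(X[win].sum(), t)             # the window now sums to t
```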

We conclude by mentioning that there are other methods to sample from the discussed conditional distribution: for example, in [Malley et al., 2002] the authors use a discrete fast Fourier transform approach to solve this problem. However, their methodology makes use of the covariance matrix, so it cannot be scaled to higher dimensions, due to the memory problem described above.

Concerning the problem of efficiently searching for the locality statistics that exceed a given threshold $\tau$, we adopt a technique based on cumulative counts (see [Neill, 2006]); each locality statistic over the $d$-dimensional region $\mathcal{R}_d$ is then evaluated in constant time (see Figure 3.4a). We illustrate this method in the two dimensional setting, taking a square region and scanning it with a square window ($d = 2$, $T_1 = T_2 = T$ and $m_1 = m_2 = m$). The idea is to precompute a matrix of cumulative counts using dynamic programming, a step that takes about $O(T^2)$ operations (such a function is implemented in almost all computational softwares: Matlab, R, Maple, Mathematica, etc.). If we denote by $M$ this matrix, then the element on the $i$-th line and $j$-th column is given by
$$M(i, j) = \sum_{k=1}^{i}\sum_{l=1}^{j} X_{k,l},$$
so the locality statistic $Y_{i_1,i_2}$ can be found from the relation
$$Y_{i_1,i_2} = M(i_1 + m - 1, i_2 + m - 1) - M(i_1 + m - 1, i_2 - 1) - M(i_1 - 1, i_2 + m - 1) + M(i_1 - 1, i_2 - 1).$$
The last computation is done in $O(1)$, which shows that, if $ITER$ is the number of replicas made by our algorithm, then the time required for finding the number of locality statistics exceeding a given $\tau$ is about $O(ITER \cdot T^2)$. In Figure 3.4, we considered two scenarios: on the left (see Figure 3.4a), we fixed the region and the scanning window sizes at $T_1 = T_2 = 2500$ and $m_1 = m_2 = 50$, respectively, and for $10^3$ replicas we plotted the run time necessary to find the number of locality statistics exceeding $\tau = 23$; on the right (see Figure 3.4b), for the same values of $m_1 = m_2$ and $\tau$, we illustrate the run time of the algorithm as the size of the region increases, starting from $T_1 = T_2 = 300$. One advantage of this method is that it can be easily scaled to $d$ dimensions.² In this situation, the computational time is about $O(ITER \cdot T^d)$ for $d$-dimensional hypercubes. We should also mention that there are other, sometimes faster, methods (see for example [Neill et al., 2005] and [Neill, 2012]) that can be used to solve our problem; nevertheless, their implementations are more difficult and involve advanced programming skills.

²In general, if $M(y_1, \dots, y_d) = \sum_{s_1=1}^{y_1} \cdots \sum_{s_d=1}^{y_d} X_{s_1,\dots,s_d}$ and $W^{y_1,\dots,y_d}_{x_1,\dots,x_d} = \sum_{s_1=x_1}^{y_1} \cdots \sum_{s_d=x_d}^{y_d} X_{s_1,\dots,s_d}$, then
$$W^{y_1,\dots,y_d}_{x_1,\dots,x_d} = M(y_1, \dots, y_d) + \sum_{k=1}^{d}\sum_{1 \le j_1 < \dots < j_k \le d} (-1)^k\, M(y_1, \dots, x_{j_1} - 1, \dots, x_{j_k} - 1, \dots, y_d).$$
To compute $Y_{i_1,\dots,i_d}$ it is enough to take $x_r = i_r$ and $y_r = i_r + m_r - 1$, $r \in \{1, \dots, d\}$, in the above formula.
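The two dimensional version of the cumulative counts technique fits in a few lines; the sketch below (hypothetical sizes) computes all locality statistics at once by vectorizing the $O(1)$ relation above.

```python
# Cumulative counts: all Y_{i1,i2} in O(T^2) from the matrix M, with a
# zero-padded first row/column so that M(0, .) = M(., 0) = 0.
import numpy as np

rng = np.random.default_rng(6)
T, m = 200, 10
X = rng.binomial(1, 0.05, size=(T, T))

M = np.pad(X.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
Y = M[m:, m:] - M[m:, :-m] - M[:-m, m:] + M[:-m, :-m]   # all Y_{i1,i2}

assert Y.shape == (T - m + 1, T - m + 1)
assert Y[0, 0] == X[:m, :m].sum()
```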

[Figure 3.4: Illustration of the run time using the cumulative counts technique: (a) CPU time vs. number of iterations, for $T_1 = T_2 = 2500$; (b) CPU time vs. region size $T_1 = T_2 = T$.]

Related algorithms: comparison for normal data

In this subsection, we present another importance sampling algorithm for the estimation of the significance level of hypothesis tests based on $d$-dimensional scan statistics. The algorithm was introduced by [Shi et al., 2007] and applied in the context of genetic linkage analysis. The idea behind the algorithm is to embed the probability measure under the null hypothesis into an exponential family. We describe the algorithm in the context of an underlying random field generated by i.i.d. $\mathcal{N}(\mu, \sigma^2)$ random variables, but we should mention that it can be extended to any exponential family of distributions. Consider the usual $d$-dimensional setting, in which $X_{s_1,\dots,s_d} \sim \mathcal{N}(\mu, \sigma^2)$, $1 \le s_j \le T_j$, $j \in \{1, \dots, d\}$. As we saw in Example 3.4.2, the random variables $Y_{i_1,\dots,i_d}$, $1 \le i_j \le T_j - m_j + 1$, $j \in \{1, \dots, d\}$, follow a multivariate normal distribution with mean and covariance matrix given by the covariance lemma above. We define the new probability measure
$$dP_{\xi,r_1,\dots,r_d} = \frac{e^{\xi Y_{r_1,\dots,r_d}}}{E_{H_0}\left[e^{\xi Y_{r_1,\dots,r_d}}\right]}\,dP_{H_0}, \qquad (3.84)$$
for a given $\xi$ and $d$-tuple $(r_1, \dots, r_d)$. We observe that, under the defined measure, the random field remains Gaussian, with mean and covariance matrix given by
$$E_{\xi,r_1,\dots,r_d}\left[Y_{i_1,\dots,i_d}\right] = \xi\,Cov_{H_0}\left[Y_{i_1,\dots,i_d}, Y_{r_1,\dots,r_d}\right] + m_1 \cdots m_d\,\mu, \qquad (3.85)$$
$$Cov_{\xi,r_1,\dots,r_d}\left[Y_{i_1,\dots,i_d}, Y_{j_1,\dots,j_d}\right] = Cov_{H_0}\left[Y_{i_1,\dots,i_d}, Y_{j_1,\dots,j_d}\right]. \qquad (3.86)$$

The importance sampling algorithm is built upon the following identity:
$$P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right) = \frac{1}{N}\sum_{k_1,\dots,k_d} E_{\xi,k_1,\dots,k_d}\left[\frac{N\,1_{\{S_{\mathbf{m}}(\mathbf{T}) \ge \tau\}}}{\sum_{j_1,\dots,j_d} e^{\xi Y_{j_1,\dots,j_d} - \log E_{H_0}\left[e^{\xi Y_{j_1,\dots,j_d}}\right]}}\right], \qquad (3.87)$$
with $N = (T_1 - m_1 + 1) \cdots (T_d - m_d + 1)$. This relation can be deduced from the following argument:
$$P_{H_0}\left(S_{\mathbf{m}}(\mathbf{T}) \ge \tau\right) = \int 1_{\{S_{\mathbf{m}}(\mathbf{T}) \ge \tau\}}\,dP_{H_0} = \sum_{k_1,\dots,k_d} \int \frac{e^{\xi Y_{k_1,\dots,k_d}}}{\sum_{j_1,\dots,j_d} e^{\xi Y_{j_1,\dots,j_d}}}\,1_{\{S_{\mathbf{m}}(\mathbf{T}) \ge \tau\}}\,dP_{H_0} = \frac{1}{N}\sum_{k_1,\dots,k_d} \int \frac{N\,E_{H_0}\left[e^{\xi Y_{k_1,\dots,k_d}}\right]}{\sum_{j_1,\dots,j_d} e^{\xi Y_{j_1,\dots,j_d}}}\,1_{\{S_{\mathbf{m}}(\mathbf{T}) \ge \tau\}}\,dP_{\xi,k_1,\dots,k_d}, \qquad (3.88)$$
where, by stationarity, the log moment generating function is equal, for all indices, to
$$\log E_{H_0}\left[e^{\xi Y_{j_1,\dots,j_d}}\right] = m_1 \cdots m_d\left(\mu\xi + \frac{\sigma^2\xi^2}{2}\right), \qquad (3.89)$$
and where we have adopted the shorthand notation $\sum_{k_1,\dots,k_d} = \sum_{k_1=1}^{T_1-m_1+1} \cdots \sum_{k_d=1}^{T_d-m_d+1}$.
The choice of the parameter $\xi$ in the above relation should satisfy the following

equation (see [Shi et al., 2007, Appendix B] for a heuristic argument):
$$E_{\xi,r_1,\dots,r_d}\left[Y_{r_1,\dots,r_d}\right] \approx \tau, \qquad (3.90)$$
which leads to the solution $\xi = \frac{\tau}{m_1 \cdots m_d\,\sigma^2} - \frac{\mu}{\sigma^2}$. Based on the identity in Eq. (3.87), we have the following importance sampling algorithm:

Algorithm 2 Second Importance Sampling Algorithm for Scan Statistics
Begin
Repeat for each $k$ from 1 to $ITER$ (iterations number)
1: Generate uniformly the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$ from the set $\{1, \dots, T_1 - m_1 + 1\} \times \dots \times \{1, \dots, T_d - m_d + 1\}$.
2: Given the $d$-tuple $(i^{(k)}_1, \dots, i^{(k)}_d)$, generate a sample of the Gaussian process $Y_{i_1,\dots,i_d}$ according to the new measure $dP_{\xi, i^{(k)}_1, \dots, i^{(k)}_d}$.
3: Compute $\rho_k(d) = \dfrac{N\,1_{\{S_{\mathbf{m}}(\mathbf{T}) \ge \tau\}}}{\sum_{j_1,\dots,j_d} e^{\xi Y_{j_1,\dots,j_d} - \log E_{H_0}\left[e^{\xi Y_{j_1,\dots,j_d}}\right]}}$.
End Repeat
Return $\hat{\rho}(d) = \frac{1}{ITER}\sum_{k=1}^{ITER} \rho_k(d)$.
End

In order to illustrate the efficiency of the foregoing algorithm and of Algorithm 1 described earlier, we consider the problem of estimating the distribution of the one dimensional scan statistics over a standard Gaussian random field. In our simulation study we include, for comparison, two other approaches for estimating the desired $p$-value: the direct Monte Carlo method given at the beginning of Section 3.4, and a method based on the quasi-Monte Carlo algorithm of [Genz and Bretz, 2009] for numerically approximating the distribution of a multivariate normal. To evaluate the efficiency of the algorithms, we define the measure of relative efficiency between two algorithms as the ratio of the computation times required by each algorithm to achieve a given error variance. This measure of efficiency has been used, in a slightly more general form, in [Wu and Naiman, 2005] and [Priebe et al., 2001], where the authors considered the computation time multiplied by the variance of the estimator. In Tables 3.1-3.3, we present a comparison study between the four methods. We took as the reference point the importance sampling Algorithm 1. Actually, in the second step of that algorithm, we take the first approach presented in Example 3.4.2 for a fair comparison, since both methods use the covariance matrix of the process.
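A runnable sketch of Algorithm 2 for $d = 1$ follows (hypothetical parameter values; not the thesis' Matlab code). It uses the fact that, for i.i.d. normal cells, tilting by $e^{\xi Y_r}$ keeps the field Gaussian and simply shifts the mean of the $m$ cells of window $r$ by $\xi\sigma^2$.

```python
# Algorithm 2 for d = 1 and i.i.d. N(mu, sigma^2) cells.
import numpy as np

rng = np.random.default_rng(7)
T, m, mu, sigma, tau, ITER = 200, 10, 0.0, 1.0, 12.0, 20_000

N = T - m + 1
xi = tau / (m * sigma**2) - mu / sigma**2            # solution of Eq. (3.90)
log_mgf = m * (mu * xi + sigma**2 * xi**2 / 2.0)     # Eq. (3.89)

rho = np.empty(ITER)
for k in range(ITER):
    r = rng.integers(0, N)                           # Step 1
    X = rng.normal(mu, sigma, size=T)                # Step 2: tilted field
    X[r:r + m] += xi * sigma**2                      # mean shift inside window r
    Y = np.convolve(X, np.ones(m), "valid")          # all Y_j
    ind = 1.0 if Y.max() >= tau else 0.0
    rho[k] = N * ind / np.exp(xi * Y - log_mgf).sum()  # Step 3
p_hat = rho.mean()
print(p_hat, 1.96 * rho.std(ddof=1) / np.sqrt(ITER))
```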

We have evaluated the distribution of the one dimensional discrete scan statistics for selected values of the length of the sequence, $T_1 \in \{200, 500, 750, 800\}$, and of the scanning window, $m_1 \in \{15, 25, 30, 40\}$. The last column gives the value of the relative efficiency measure for each compared method (Genz, direct Monte Carlo and the new importance sampling algorithm) with respect to Algorithm 1. It can be observed that Algorithm 1 is the most efficient of the four, in some cases by a huge margin, especially in comparison with the direct Monte Carlo method and the algorithm proposed by [Genz and Bretz, 2009] (in some cases the importance sampling method is 600 times faster).

[Table 3.1 (columns: $T_1$, $m_1$, $n$, Genz, Err Genz, IS Algo 1, Err Algo 1, Rel E): a comparison of the $p$-values as evaluated by simulation using the [Genz and Bretz, 2009] algorithm (Genz) and importance sampling (Algo 1), and the relative efficiency between the methods (Rel E).]

[Table 3.2 (columns: $T_1$, $m_1$, $n$, MC, Err MC, IS Algo 1, Err Algo 1, Rel E): a comparison of the $p$-values as evaluated by naive Monte Carlo (MC) and importance sampling (Algo 1), and the relative efficiency between the methods (Rel E).]

[Table 3.3 (columns: $T_1$, $m_1$, $n$, IS Algo 2, Err Algo 2, IS Algo 1, Err Algo 1, Rel E): a comparison of the $p$-values as evaluated by the two importance sampling algorithms (Algo 2 and Algo 1), and the relative efficiency between the methods (Rel E).]

3.5 Examples and numerical results

To illustrate the accuracy of our approximations, we consider in this section the particular cases of the one, two and three dimensional discrete scan statistics. We compare our results with the existing ones presented in Chapter 1.

[Figure 3.5: The evolution of the simulation error in IS Algorithm 1 and IS Algorithm 2, as a function of the number of iterations (panels (a) and (b)).]

One dimensional scan statistics

Based on the methodology presented in Section 3.2 and Section 3.3, we have the following approximation for the one dimensional scan statistics:
$$P\left(S_{m_1}(T_1) \le n\right) \approx \frac{2\hat{Q}_2 - \hat{Q}_3}{\left[1 + \hat{Q}_2 - \hat{Q}_3 + 2(\hat{Q}_2 - \hat{Q}_3)^2\right]^{L_1-1}}. \qquad (3.91)$$
From Eq. (3.41), the theoretical error bound is
$$E_{app}(1) = (L_1-1)F(Q_2, L_1-1)(1 - Q_2)^2, \qquad (3.92)$$
while the simulation errors, according to Eq. (3.54) and Eq. (3.45), are given by
$$E_{sapp}(1) = (L_1-1)F(\hat{Q}_2, L_1-1)\left(1 - \hat{Q}_2 + \beta_2\right)^2, \qquad (3.93)$$
$$E_{sf}(1) = (L_1-1)\left(\beta_2 + \beta_3\right). \qquad (3.94)$$
For selected values of $m_1$, $T_1$ and $n$, and selected values of the parameters of the binomial, Poisson and normal distributions, we evaluate the accuracy of the approximation in Eq. (3.91), as well as the sharpness of the error bounds given above. In Table 3.4, we compare our approximation with the exact value (column Exact Value), the product type approximation and the lower and upper margins of the distribution of the scan statistics presented in Chapter 1. The exact values were obtained using the Markov chain imbedding methodology presented in Chapter 1. Numerical values for the distribution of $S_{m_1}(T_1)$ in the case of the binomial and Poisson models are illustrated in Table 3.5. To evaluate the two models, we considered a sequence of length $T_1 = 5000$, distributed according to a binomial distribution with parameters $r = 5$ and $p = 0.01$ and a Poisson distribution of mean $\lambda = rp$, respectively, scanned with a window of size $m_1 = 50$. We have also included the simulated value, the product type approximation (Eq. (1.46)) and the lower and upper bounds
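The one dimensional formulas above are simple to assemble; the sketch below (hypothetical input values; the coefficient $F$ of Eqs. (3.92)-(3.93) is not reproduced here, since its exact form comes from Theorem 2.2.9) computes the approximation (3.91) and the dominant error (3.94).

```python
# One dimensional approximation (3.91) and simulation error (3.94).
def approx_1d(Q2hat: float, Q3hat: float, L1: int) -> float:
    x = Q2hat - Q3hat
    return (2.0 * Q2hat - Q3hat) / (1.0 + x + 2.0 * x * x) ** (L1 - 1)

def E_sf_1d(beta2: float, beta3: float, L1: int) -> float:
    return (L1 - 1) * (beta2 + beta3)

# Hypothetical simulated values, for illustration only:
Q2hat, Q3hat, L1, beta = 0.9990, 0.9981, 100, 2e-4
print(approx_1d(Q2hat, Q3hat, L1), E_sf_1d(beta, beta, L1))
```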

(Eqs. (1.47) and (1.49)) for comparison reasons. For the simulated value (Sim) and our approximation (AppH), we used the importance sampling Algorithm 1, as explained in Example 3.4.1, with $ITER_{sim} = 10^4$ and $ITER_{app} = 10^5$, respectively. The Gaussian model with mean $\mu = 0$ and variance $\sigma^2 = 1$ is presented in Table 3.6. For the evaluation of our approximation and of the simulation, we applied the method described in Example 3.4.2, where we used $10^5$ replicas of the algorithm for the approximation and $10^4$ replicas for the simulation. We observe from the numerical results that our proposed approximation is quite accurate. Moreover, we see that the main contribution to the overall error is given by the simulation error associated with the approximation formula, $E_{sf}(1)$; increasing the number of iterations of our algorithm can mitigate this.

[Table 3.4 (columns: $n$, Exact Value, AppH Eq. (3.91), $E_{app}(1)$ Eq. (3.92), AppPT Eq. (1.46), LowB Eq. (1.47), UppB Eq. (1.49)): approximations for $P(S_{m_1}(T_1) \le n)$ in the Bernoulli $\mathcal{B}(0.05)$ case: $m_1 = 15$, $T_1 = 1000$, $ITER = 10^5$.]

[Table 3.5 (columns: $n$, AppH Eq. (3.91), $E_{sapp}(1)$ Eq. (3.93), $E_{sf}(1)$ Eq. (3.94), Total Error, Sim, AppPT Eq. (1.46), LowB Eq. (1.47), UppB Eq. (1.49); rows for $\mathcal{B}in(5, 0.01)$ and the Poisson model): approximations for $P(S_{m_1}(T_1) \le n)$ for the binomial and Poisson cases: $m_1 = 50$, $T_1 = 5000$, $ITER_{app} = 10^5$, $ITER_{sim} = 10^4$.]

[Table 3.6 (columns: $n$, AppH Eq. (3.91), $E_{sapp}(1)$ Eq. (3.93), $E_{sf}(1)$ Eq. (3.94), Total Error, Sim, AppPT Eq. (1.46), LowB Eq. (1.47), UppB Eq. (1.49)): approximations for $P(S_{m_1}(T_1) \le n)$ in the Gaussian $\mathcal{N}(0, 1)$ case: $m_1 = 40$, $T_1 = 800$, $ITER_{app} = 10^5$, $ITER_{sim} = 10^4$.]

Two dimensional scan statistics

The distribution of the two dimensional discrete scan statistics can be approximated, via Eq. (3.26), by
$$P\left(S_{\mathbf{m}}(\mathbf{T}) \le n\right) \approx \frac{2\hat{Q}_2 - \hat{Q}_3}{\left[1 + \hat{Q}_2 - \hat{Q}_3 + 2(\hat{Q}_2 - \hat{Q}_3)^2\right]^{L_1-1}}, \qquad (3.95)$$
where
$$\hat{Q}_2 = H\left(\hat{Q}_{2,2}, \hat{Q}_{2,3}, L_2\right), \qquad \hat{Q}_3 = H\left(\hat{Q}_{3,2}, \hat{Q}_{3,3}, L_2\right).$$
The simulation errors follow from Eq. (3.45) and Eq. (3.54) and can be expressed as
$$E_{sf}(2) = (L_1-1)(L_2-1)\left(\beta_{2,2} + \beta_{2,3} + \beta_{3,2} + \beta_{3,3}\right), \qquad (3.96)$$
$$E_{sapp}(2) = (L_1-1)F\left(1 - \hat{Q}_2 + A_2 + C_2\right)^2 + (L_1-1)(L_2-1)\left[F_2\left(1 - \hat{Q}_{2,2} + \beta_{2,2}\right)^2 + F_3\left(1 - \hat{Q}_{3,2} + \beta_{3,2}\right)^2\right], \qquad (3.97)$$
where $F = F(\hat{Q}_2, L_1-1)$, $F_2 = F(\hat{Q}_{2,2}, L_2-1)$ and $F_3 = F(\hat{Q}_{3,2}, L_2-1)$, based on Eq. (3.27). The values of $A_2$ and $C_2$ can be computed, according to Eqs. (3.46) and (3.48), by the following formulas:
$$A_2 = (L_2-1)\left(\beta_{2,2} + \beta_{2,3}\right), \qquad C_2 = (L_2-1)F_2\left(1 - \hat{Q}_{2,2} + \beta_{2,2}\right)^2.$$

Notice that, from Eq. (3.41), the theoretical error becomes
$$E_{app}(2) = (L_1-1)F\left(1 - \gamma_2 + B_2\right)^2 + (L_1-1)(L_2-1)\left[F_2\left(1 - Q_{2,2}\right)^2 + F_3\left(1 - Q_{3,2}\right)^2\right], \qquad (3.98)$$
where $\gamma_2 = H\left(Q_{2,2}, Q_{2,3}, L_2\right)$ and $B_2 = (L_2-1)F_2(1 - Q_{2,2})^2$. To investigate the accuracy of the approximation and of the error bounds described by Eqs. (3.95), (3.96) and (3.97) for the two dimensional discrete scan statistics, we present a series of numerical results. We evaluate the distribution of the scan statistics for selected values of the parameters of the binomial, Poisson and Gaussian models. In Table 3.7, we include numerical values for two models: the binomial model with parameters $r = 10$ and $p$, and the Poisson model with mean $\lambda = rp$. To compare our results, we have included the product type approximation (AppPT) developed by [Chen and Glaz, 1996] (Eq. (1.67)) and the lower (LowB) and upper (UppB) bounds discussed in Chapter 1. The second column (AppH), which corresponds to our proposed approximation given by Eq. (3.95), and the sixth column (Sim), which corresponds to the simulated value, are evaluated via the importance sampling Algorithm 1 developed in Section 3.4, with $ITER_{app} = 10^4$ and $ITER_{sim} = 10^3$ replications, respectively. We observe that our approximation is quite accurate (see also Figure 3.6).

[Table 3.7 (columns: $n$, AppH Eq. (3.95), $E_{sapp}(2)$ Eq. (3.97), $E_{sf}(2)$ Eq. (3.96), Total Error, Sim, AppPT Eq. (1.67), LowB Eq. (1.76), UppB Eq. (1.77); rows for $\mathcal{B}in(10, p)$ and the Poisson model): approximations for $P(S_{\mathbf{m}}(\mathbf{T}) \le n)$ for the binomial and Poisson models: $m_1 = 20$, $m_2 = 30$, $T_1 = 500$, $T_2 = 600$, $ITER_{app} = 10^4$, $ITER_{sim} = 10^3$.]

Numerical values for the Gaussian model of mean $\mu = 1$ and variance $\sigma^2 = 0.5$ are illustrated in Table 3.8. An extension to the two dimensional case of the importance sampling procedure presented in Example 3.4.2 was used to obtain the simulated values (Sim) and our approximation (AppH); we used $ITER_{app} = 10^4$ replications of the algorithm for the approximation and $ITER_{sim} = 10^3$ for the simulation.

[Figure 3.6: The empirical cumulative distribution functions (AppH, Sim, AppPT, LowB, UppB) for (a) the binomial model and (b) the Poisson model in Table 3.7.]

[Table 3.8 (columns: $n$, AppH Eq. (3.95), $E_{sapp}(2)$ Eq. (3.97), $E_{sf}(2)$ Eq. (3.96), Total Error, Sim, AppPT Eq. (1.67), LowB Eq. (1.76), UppB Eq. (1.77)): approximations for $P(S_{\mathbf{m}}(\mathbf{T}) \le n)$ in the Gaussian $\mathcal{N}(1, 0.5)$ model: $m_1 = 10$, $m_2 = 20$, $T_1 = 400$, $T_2 = 400$, $ITER_{app} = 10^4$, $ITER_{sim} = 10^3$.]

Three dimensional scan statistics

Following the methodology presented in Section 3.2 and the notations introduced in Section 3.3, we observe that, in the three dimensional setting, the approximation formula for the distribution of the three dimensional discrete scan statistics is given by
$$P\left(S_{\mathbf{m}}(\mathbf{T}) \le n\right) \approx \frac{2\hat{Q}_2 - \hat{Q}_3}{\left[1 + \hat{Q}_2 - \hat{Q}_3 + 2(\hat{Q}_2 - \hat{Q}_3)^2\right]^{L_1-1}}, \qquad (3.99)$$
where for $t_1, t_2 \in \{2, 3\}$ we have
$$\hat{Q}_{t_1} = H\left(\hat{Q}_{t_1,2}, \hat{Q}_{t_1,3}, L_2\right), \qquad \hat{Q}_{t_1,t_2} = H\left(\hat{Q}_{t_1,t_2,2}, \hat{Q}_{t_1,t_2,3}, L_3\right), \qquad (3.100)$$
with $H$ given in Eq. (3.13). It follows from Eq. (3.45) that the simulation error corresponding to the approximation formula is given by
$$E_{sf}(3) = (L_1-1)(L_2-1)(L_3-1)\sum_{t_1,t_2,t_3 \in \{2,3\}} \beta_{t_1,t_2,t_3}.$$
The second simulation error, which is associated with the approximation error, results from Eq. (3.54) and is expressed as
$$E_{sapp}(3) = (L_1-1)F\left(1 - \hat{Q}_2 + A_2 + C_2\right)^2 + (L_1-1)(L_2-1)\left[F_2\left(1 - \hat{Q}_{2,2} + A_{2,2} + C_{2,2}\right)^2 + F_3\left(1 - \hat{Q}_{3,2} + A_{3,2} + C_{3,2}\right)^2\right] + (L_1-1)(L_2-1)(L_3-1)\left[F_{2,2}\left(1 - \hat{Q}_{2,2,2} + \beta_{2,2,2}\right)^2 + F_{2,3}\left(1 - \hat{Q}_{2,3,2} + \beta_{2,3,2}\right)^2 + F_{3,2}\left(1 - \hat{Q}_{3,2,2} + \beta_{3,2,2}\right)^2 + F_{3,3}\left(1 - \hat{Q}_{3,3,2} + \beta_{3,3,2}\right)^2\right].$$
Notice that, based on Eq. (3.27), for all $t_1, t_2 \in \{2, 3\}$, the coefficients $F$, $F_{t_1}$ and $F_{t_1,t_2}$ are computed as $F(\hat{Q}_2, L_1-1)$, $F(\hat{Q}_{t_1,2}, L_2-1)$ and $F(\hat{Q}_{t_1,t_2,2}, L_3-1)$, respectively. From Eq. (3.46), we have
$$A_{2,2} = (L_3-1)\left(\beta_{2,2,2} + \beta_{2,2,3}\right), \quad A_{3,2} = (L_3-1)\left(\beta_{3,2,2} + \beta_{3,2,3}\right), \quad A_2 = (L_2-1)(L_3-1)\left(\beta_{2,2,2} + \beta_{2,2,3} + \beta_{2,3,2} + \beta_{2,3,3}\right),$$
while from Eq. (3.48),
$$C_{2,2} = (L_3-1)F_{2,2}\left(1 - \hat{Q}_{2,2,2} + \beta_{2,2,2}\right)^2, \quad C_{2,3} = (L_3-1)F_{2,3}\left(1 - \hat{Q}_{2,3,2} + \beta_{2,3,2}\right)^2, \quad C_{3,2} = (L_3-1)F_{3,2}\left(1 - \hat{Q}_{3,2,2} + \beta_{3,2,2}\right)^2,$$
$$C_2 = (L_2-1)\left[F_2\left(1 - \hat{Q}_{2,2} + A_{2,2} + C_{2,2}\right)^2 + C_{2,2} + C_{2,3}\right].$$
For selected values of the parameters of the binomial, Poisson and normal distributions, we evaluate the approximation introduced in Section 3.2 and provide the corresponding error bounds, showing the contributions of the simulation errors $E_{sapp}(3)$ and $E_{sf}(3)$ to the overall error. For all our simulations we used the importance sampling algorithm with $ITER = 10^5$ replications. We compare our results with those existing in the literature (see [Guerriero et al., 2010a] for the Bernoulli model) and with the simulated value of the scan statistics obtained by scanning the whole region $\mathcal{R}_3$, denoted by $\hat{P}(S \le n)$.

The scanning of $\mathcal{R}_3$ being more time consuming than the scanning of the subregions corresponding to the $Q_{t_1,t_2,t_3}$'s, we used $10^3$ repetitions of the algorithm. In Table 3.9, we compare the results obtained by our approximation with the product type approximation presented by [Guerriero et al., 2010a] (see also Chapter 1, Eq. (1.87)). We observe that our approximation is very sharp.

[Table 3.9 (columns: $n$, $\hat{P}(S \le n)$, AppPT Eq. (1.87), AppH Eq. (3.99), $E_{sapp}(3)$, $E_{sf}(3)$, Total Error; two blocks corresponding to two values of the Bernoulli parameter $p$): approximations for $P(S_{\mathbf{m}}(\mathbf{T}) \le n)$ in the Bernoulli model: $m_1 = m_2 = m_3 = 5$, $T_1 = T_2 = T_3 = 60$, $ITER_{app} = 10^5$, $ITER_{sim} = 10^3$.]

Table 3.10 presents the numerical results obtained by scanning the region $\mathcal{R}_3$, of size $60 \times 60 \times 60$, with two windows of the same volume but different shapes: first a cubic window ($m_1 = m_2 = m_3$) and second a rectangular one ($m_1 = 8$, $m_2 = 4$). We observe that the results are closely related, but significantly different.

[Table 3.10 (columns: $n$, $\hat{P}(S \le n)$, AppH Eq. (3.99), $E_{sapp}(3)$, $E_{sf}(3)$, Total Error; rows for the cubic window $m_1 = m_2 = m_3$ and the rectangular window with $m_1 = 8$, $m_2 = 4$): approximation for $P(S_{\mathbf{m}}(\mathbf{T}) \le n)$ over the region $\mathcal{R}_3$ with windows of the same volume but different sizes: $T_1 = T_2 = T_3 = 60$, $ITER_{app} = 10^5$, $ITER_{sim} = 10^3$.]

In Table 3.11, we have included numerical values emphasizing the situation described by Remark 3.2.1. We consider the Bernoulli model of parameter $p$ over the region $\mathcal{R}_3$ of size $185 \times 185 \times 185$ and scan it with a cubic window of side length 10. The second and fourth columns give the values corresponding to the bounds described in Eq. (3.23), while in the third column we present the simulated values of $P\left(S_{10,10,10}(185, 185, 185) \le n\right)$.

107 3.5. Examples and numerical results 91 n P SL 1 + 1, L 2 + 1, L n ˆPS n P SL1, L 2, L 3 n ± ± ± ± ± ± ± ± ± Table 3.11: Approximation for PS m T n based on Remark 3.2.1: m 1 = m 2 = m 3 = 10, T 1 = T 2 = T 3 = 185, L 1 = L 2 = L 3 = 20, IT ER app = 10 5, IT ER sim = 10 3 In order to compare the binomial and Poisson models, in Table 3.12 we have evaluated the distribution of the scan statistics over a region of size , scanned with a cubic window, in the two situations. In the rst case, we have a binomial random eld with parameters m and p, that is X s1,s 2,s 3 Bm, p, while in the second we considered that X s1,s 2,s 3 P λ, with λ = mp. We observe that the n ˆPS n AppH Esapp 3 E sf 3 Total Eq.3.99 Eq Eq Error Bin10, P oiss Table 3.12: Approximation for PS m T n in the binomial and Poisson models: m 1 = m 2 = m 3 = 4, T 1 = T 2 = T 3 = 84, IT ER app = 10 5, IT ER sim = 10 3 cumulative function, in the two models, are close to each other see also Figure 3.7. Lastly, we have consider a numerical example for the Gaussian model. In Table 3.13, we present numerical values for the distribution of the three dimensional discrete scan statistics over a region of size , scanned with a cubic window of size The underlying random eld is taken here to be formed of i.i.d. random variables distributed according to a N 0, 1 law. We compare our results with the simulated value IT ER S im = 10 2 and with the product type

108 92 Chapter 3. Scan statistics and 1-dependent sequences AppH Sim a Binomial model AppH Sim b Poisson model Figure 3.7: The empirical cumulative distribution function for the binomial and Poisson models in Table 3.12 approximation introduced in [Guerriero et al., 2010a] for the Bernoulli case see also Chapter 1, Eq n ˆPS n AppPT. AppH Esapp 3 E sf 3 Total Eq.1.87 Eq.3.99 Eq Eq Error Table 3.13: Approximations for PS m T n in the Gaussian N 0, 1 model: m 1 = m 2 = m 3 = 10, T 1 = T 2 = T 3 = 256, IT ER app = 10 5, IT ER sim = 10 3

109 3.5. Examples and numerical results AppH AppPT Figure 3.8: The empirical cumulative distribution functions AppH = Our Approximation, AppPT = Product Type Approximation for the Gaussian model in Table 3.13 Notice that the contribution of the approximation error E sapp to the total error is almost negligible in most of the cases with respect to the simulation error E sf. Thus, the precision of the method will depend mostly on the number of iterations IT ER used to estimate Q t1,t 2,t 3.

110

111 Chapter 4 Scan statistics over some block-factor type dependent models In this chapter, we present an estimate for the distribution of the multidimensional discrete scan statistics over a random eld generated by a block-factor type model. This dependent model generalizes the i.i.d. model studied in Chapter 3 and is introduced in Section 4.1. The approximation process, as well as the associated error bounds, are described in Section 4.2. Section 4.3 includes examples and numerical applications for particular block-factor models in one and two dimensions. Some of the results presented in this chapter appeared in [Am rioarei and Preda, 2014], for the special case of two dimensions see also [Am rioarei and Preda, 2013b]. Contents 4.1 Block-factor type model Approximation and error bounds The approximation process The associated error bounds Examples and numerical results Example 1: A one dimensional Bernoulli model Example 2: Length of the longest increasing run Example 3: Moving average of order q model Example 4: A game of minesweeper Block-factor type model Most of the research devoted to the one, two or three dimensional discrete scan statistic considers the independent and identically distributed i.i.d. model for the random variables that generate the random eld which is to be scanned. In this section, we dene a dependence structure for the underlying random eld based on a block-factor type model. Throughout this chapter, we adopt the denitions and the notations introduced in Section 3.1 for the d dimensional setting.

112 96 Chapter 4. Scan statistics over some block-factor type models Recall from Denition see also [Burton et al., 1993], that a sequence W l l 1 of random variables with state space S W is said to be a k block-factor of the sequence W l l 1 with state space S W, if there is a measurable function f : S k W S W such that W l = f Wl, W l+1,..., W l+k 1 for all l. Our block-factor type model is { dened in the following way. Let T 1, T2,..., Td be positive integers, d 1 and Xs1,...,s d 1 s j T j, 1 j d} be a family of independent and identically distributed real valued random variables over the d dimensional rectangular region R d = [0, T 1 ] [0, T d ] in such a way that to each rectangular elementary subregion r d s 1, s 2,..., s d = [s 1 1, s 1 ] [s d 1, s d ] it corresponds a variable X s1,...,s d. As we saw in Chapter 3, it is customary to interpret these random variables as the number of some events of interest that occurred in the subregion r d s 1, s 2,..., s d. For j {1, 2,..., d}, let x j 1, xj 2 be nonnegative integers such that x j 1 + xj T j. Dene { c j = x j 1 +xj 2 +1 and take T j = T j c j +1. To each d-tuple s 1,..., s d, with s j x j 1 + 1,..., T } j x j 2, j {1,..., d}, we associate a d-way or of order d tensor see [Kolda and Bader, 2009] X s1,...,s d R c 1 c d whose elements are given by X s1,...,s d j 1,..., j d = X s1 x j 1,...,s d x d 1 1+j, 4.1 d where j 1,..., j d {1,..., c 1 } {1,..., c d }. We observe that for the particular cases of one and two dimensions, the d-way tensor X s1,...,s d reduces to a vector a tensor of order one and a matrix a tensor of order two, respectively. If Π : R c 1 c d R is a measurable real valued function dened on the set of the tensors X s1,...,s d measurable with respect to the usual Borel σ-elds of R c 1 c d and R, then we dene the block-factor type model by X s1,...,s d = Π X s1 +x 1 1,...,s d+x d, for all 1 s j T j, 1 j d. To illustrate the intuition behind the above denition, we consider the special case of two dimensional block-factor model d = 2. In this setting, the underlying random eld is dened as X s1,s 2 = Π X s1 +x 1 1,s 2+x 2, where the tensor X s1,s 2, 1 which encapsulates the dependence structure, takes the matrix form X s1,s 2 = X s1 x 1 1,s 2 x 2 1. X s1 x 1 1,s 2+x 2 2 Xs1 +x 1 2,s 2 x Xs1 +x 1 2,s 2+x Figure 4.1 illustrates the construction of the block-factor model: on the left see Figure 4.1a is presented the conguration matrix dened by Eq.4.3 and the resulted random variable after applying the transformation Π; on the right see

113 4.1. Block-factor type model 97 Figure 4.1b is exemplied how the i.i.d. model Xs1,s 2 block-factor model X s1,s 2. is transformed into the T 2 X s 1 x 1 1,s 2 +x X s 1 +x 1 2,s 2 +x2 2. T 2 x 2 2 X s1,s 2. X s1,s 2. X s1 x 1 1,s 2 x X s1 +x 1 2,s 2 x2 1 x Π 1... x T1 x T1 Π T 2 X s1 x 1 1,s 2 x2 1 X s1 x 1 1,s 2 x T 1 a b Figure 4.1: Illustration of the block-factor type model in two dimensions d = 2 It is clear, due to the overlapping structure, that {X s1,s 2 1 s 1 T 1, 1 s 2 T 2 } forms a dependent family of random variables see Figure 4.2 and the same is true for the general case. Recall from Denition that a sequence W l l 1 is m-dependent with m 1 see [Burton et al., 1993], if for any h 1 the σ-elds generated by {W 1,..., W h } and {W h+m+1,... } are independent. Thus, from the denition of the block-factor model in Eq.4.2, we observe that the sequence X s1,...,s h 1,s h,s h+1,...,s d 1 s h T h is c h 1-dependent, for each 1 h d. In Figure 4.2 we illustrate, for the two dimensional case, the dependence structure of the eld X s1,s 2 emphasizing its c 1 1 and c 2 1-dependent character. Remark Notice that if c 1 = = c d = 1 that is x j 1 = x j 2 = 0 for j {1,..., d} and Π is the identity function, then X s1,...,s d = X s1,...,s d and we are in the i.i.d. situation. In this case, an approximation method for the distribution of the d-dimensional discrete scan statistics was studied in Chapter 3. Remark If we take d = 1, we nd ourselves in the particular situation of one dimensional discrete scan statistics over a c 1 1-dependent sequence. The distribution of one dimensional scan statistics over this type of dependence was studied by

114 98 Chapter 4. Scan statistics over some block-factor type models X s1 c 1 +1,s 2 +c 2 1 X s1 +c 1 1,s 2 +c 2 1 c 2 1 c 1 1 X s1,s 2 c 1 1 c 2 1 X s1 c 1 +1,s 2 c 2 +1 X s1 +c 1 1,s 2 c 2 +1 Figure 4.2: The dependence structure of X s1,s 2 in two dimensions [Haiman and Preda, 2013] in the particular case of Gaussian stationary 1-dependent sequences W i N 0, 1 of random variables generated by a two block-factor of the form W i = au i + bu i+1, i 1, where a 2 + b 2 = 1 and U i i 1 is an i.i.d. sequence of N 0, 1 random variables. Clearly, their model can be obtained from the one described in this section by putting x 1 1 = 0, x 1 2 = 1, which implies c 1 = 2, and choosing Πα 1, α 2 = aα 1 + bα 2. Applications of the one dimensional discrete scan statistics over a sequence of moving average of order q c 1 = q + 1 was recently given by [Wang and Glaz, 2013] see also [Wang, 2013, Chapter 4]. We include in Section 4.3, a numerical example in which we compare their results with the ones obtained by our method. Based on the dependent model introduced in this section, in Section 4.2 we give an approximation for the distribution of the d-dimensional discrete scan statistic over the random eld generated by the random variables X s1,...,s d and the corresponding error bounds. 4.2 Approximation and error bounds In this section, we give an estimate for the distribution function, Q m T, of the d- dimensional discrete scan statistics evaluated over a random eld generated by the block-factor model introduced in the foregoing section. The methodology used for the derivation of the approximation formula follows closely the approach adopted in Chapter 3 for the i.i.d. model.

115 4.2. Approximation and error bounds The approximation process Let us consider that we are in the framework of the block-factor model introduced in Section 4.1 and we scan the generated random eld with a window of size m 1 m d, 2 m j T j, j {1,..., d}. As in the case of the i.i.d. model see Section 3.2, the main idea behind the estimation process is to express the scan statistics random variable, S m T, m = m 1,..., m d, T = T 1,..., T d, as the maximum of a 1- dependent stationary sequence of properly selected random variables. Assume that for each j {1,..., d}, Tj = L j m j + c j 2, where L 1,..., L d are positive integers and observe that the generated region R d, after applying the transformation Π, has the sides of length T j = L j 1m j + c j 2 + m j We dene for each k 1 {1, 2,..., L 1 1} the random variables Z k1 = max k 1 1m 1 +c i 1 k 1 m 1 +c i j L j 1m j +c j 2 j {2,...,d} Y i1,i 2,...,i d, 4.5 and we claim that {Z 1,..., Z L1 1} forms a 1-dependent and stationary sequence of random variables in the view of Denition and Denition To see this, we rst remark that Z k1 's represent the d-dimensional scan statistics on the overlapping rectangular strips [k 1 1m 1 + c 1 2, k 1 m 1 + c m 1 1] [0, T 2 ] [0, T d ], 4.6 of size 2m 1 + c 1 3 T 2 T d. In Figure 4.3 we illustrate, in the particular situation of two dimensions d = 2, the overlapping structure of Z 1, Z 2 and Z 3 emphasizing the 1-dependent character of the sequence. The one dependence structure of {Z 1,..., Z L1 1} follows immediately from Z k1 1 σ {X s1,...,s d k 1 2m 1 + c s 1 k 1 1m 1 + c m 1 1} { } k1 σ Xs1,...,s d 2m 1 + c s 1 k 1 m 1 + c 1 2 and similarly { } k1 Z k1 σ Xs1,...,s d 1m 1 + c s 1 k 1 + 1m 1 + c 1 2, { } k1 Z k1+1 σ Xs1,...,s d m 1 + c s 1 k 1 + 2m 1 + c 1 2 { and the independence of the family Xs1,...,s d 1 s j T } j, 1 j d of random variables. Moreover, since X s1,...,s d are identically distributed, the stationarity of the random variables Z k1 is veried. We observe that from the denition of the sequence Z k1 1 k1 L 1 1, the scan statistics random variable can be written as S m T = max 1 i j T j m j +1 j {1,2,...,d} Y i1,i 2,...,i d = max Z k k 1 L 1 1

116 100 Chapter 4. Scan statistics over some block-factor type models T 1 4m 1 + 3c 1 7 3m 1 + 2c 1 5 Z 3 2m 1 + c 1 2 2m 1 + c 1 3 c 1 Z 2 m 1 + c 1 2 m 1 Z 1 0 m 2 T 2 Figure 4.3: Illustration of Z k1 emphasizing the 1-dependence The foregoing relation is the key identity in the approximation process since it shows that S m T is the maximum of a 1-dependent stationary sequence. Tools for the estimation of the distribution of this type of random variables were developed in Chapter 2. Similar with the methodology used in the previous chapter for the i.i.d. model, for the approximation process we employ the estimate from Theorem 2.2.9, repeatedly. If, for t 1 {2, 3}, we take Q t1 = Q t1 n = P t 1 1 k 1 =1 {Z k1 n} = P max 1 i 1 t 1 1m 1 +c i j L j 1m j +c j 1 j {2,...,d} Y i1,i 2,...,i d n 4.8 and n is such that Q 2 n 0.9, then, based on Theorem 2.2.9, we get the rst step estimate Q m T H Q 2, Q 3, L 1 L 1 1F Q 2, L 1 11 Q 2 2, 4.9 where Hx, y, m and F x, m are evaluated via Eqs.3.13 and 3.10, respectively. Observe that, even if we employed the same notations as in Section 3.2, Q 2, Q 3 and L 1, L 2 in the above equations diers from the corresponding ones in the i.i.d. case. For example, here Q t1 's represents the distribution of the d-dimensional discrete scan statistics over the strip [0, t 1 1m 1 + c m 1 1] [0, T 2 ] [0, T d ],

117 4.2. Approximation and error bounds 101 as opposed to the i.i.d. model where we scanned the smaller region [0, t 1 m 1 1] [0, T 2 ] [0, T d ]. In order to derive an approximation expression for Q m T, that involves simpler quantities, we need to iterate the above procedure at most d steps. In the subsequent, we present the general s-step, 1 s d, routine. In this phase, the goal is to nd an estimate for the distribution of the d-dimensional discrete scan statistics evaluated over the multidimensional rectangular regions [0, t 1 1m 1 +c 1 2+m 1 1]... [0, t s 1 1m s 1 +c s 1 2+m s 1 1] [0, T s ] [0, T d ]. These distribution functions are denoted, for t 1,..., t s 1 {2, 3} s 1, by Q t1,t 2,...,t s 1 = Q t1,t 2,...,t s 1 n = P max 1 i l t l 1m l +c l 2 l {1,...,s 1} 1 i j L j 1m j +c j 2 j {s,...,d} Y i1,i 2,...,i d n To reduce the s dimension, we take for t l {2, 3}, l {1,..., s 1} and k s {1, 2,..., L s 1} the random variables Z t 1,t 2,...,t s 1 k s = max 1 i l t l 1m l +c l 2 l {1,2,...,s 1} k s 1m s+c s 2+1 i s k sm s+c s 2 1 i j L j 1m j +c j 2 j {s+1,...,d} Y i1,i 2,...,i d, 4.11 which form, based on the same arguments as in step one, a 1-dependent stationary sequence. Moreover, these random variables satisfy the relation Q t1,t 2,...,t s 1 = P max 1 k Zt 1,t 2,...,t s 1 k s L s 1 s n Notice that, from Eq.4.10 and Eq.4.11, we get the distribution function Q t1,t 2,...,t s = Q t1,t 2,...,t s n = P t s 1 k s=1 {Z t 1,t 2,...,t s 1 k s n} If we take n such that Q t1,t 2,...,t s 1,2n 0.9 then, applying the estimate from Theorem 2.2.9, we deduce the s step approximation Q t1,...,t s 1 H Q t1,...,t s 1,2, Q t1,...,t s 1,3, L s L s 1F Q t1,...,t s 1,2, L s 1 1 Q t1,...,t s 1, Depending on the problem dimension, by substituting repeatedly, for each s {2,..., d}, the estimate from Eq.4.14 in Eq.4.9, we obtain an approximation

118 102 Chapter 4. Scan statistics over some block-factor type models T 1 T 2 R 2 m 2 T 1 m 1 T 1 T 2 T 2 Q 2 2m 2 + c 2 3 Q 3 3m 2 + 2c 2 5 m 1 m 1 T 1 T 1 T 1 T 1 T 2 T 2 T 2 T 2 2m 2 + c 2 3 2m Q 2 + c Q 32 3m 2 + 2c 2 5 3m Q 2 + 2c Q 33 2m 1 + c 1 3 3m 1 + 2c 1 5 2m 1 + c 1 3 3m 1 + 2c 1 5 Figure 4.4: Illustration of the approximation process for d = 2 formula for the distribution of the d-dimensional scan statistics depending on the 2 d quantities Q t1,...,t d, which will be evaluated by Monte Carlo simulation. To get a better understanding of the approximation process described above, we consider the particular case of two dimensional scan statistics. Figure 4.4 illustrates the diagram that summarizes the steps involved in this process. Remark We should point out that if X s1,...,s 2 = X s1,...,s d, that is c 1 = = c d = 1 and Πx = x, then all of the above formulas reduces to the ones given in Section 3.2. Remark Notice that if there are indices j {1, 2,..., d} such that T j are not multiples of m j +c j 2 then, by taking L j =, we can approximate the Tj m j +c j 2 distribution of the scan statistics Q m T by the multi-linear interpolation procedure described in Remark The associated error bounds Since in the estimation process described in Section 4.2.1, by adopting the notations introduced in Section 3.1, we have obtained exactly the same type of relations as in the i.i.d. case, the resulted error bounds will also take the same expressions as those from Section 3.3. Such being the case, in this section we include only their formulas and not their derivation. As mentioned in Section 3.3, there are three

119 4.2. Approximation and error bounds 103 errors involved: the theoretical approximation error E app d, the simulation error that corresponds to the approximation error E sapp d and the simulation error associated to the approximation formula E sf d. Their expressions are given in the subsequent. a The theoretical approximation error Following the methodology described in Section for the i.i.d. model, we have E app d = where d L 1 1 L s 1 s=1 t 1,...,t s 1 {2,3} F t1,...,t s 1 1 γt1,...,t s 1,2 + B t1,...,t s 1,2 2, { H γt1,...,t γ t1,...,t s = s,2, γ t1,...,t s,3, L s+1, for 1 s d 1 Q t1,...,t d, for s = d, { H ˆQt1,...,t ˆQ t1,...,t s = s,2, ˆQ t1,...,t s,3, L s+1, for 1 s d 1 ˆQ t1,...,t d, for s = d and where for 2 s d F t1,...,t s 1 = F ˆQt1,...,t s 1,2, L s 1, B t1...,t s 1 = L s 1 F t1,...,t s 1 1 γt1,...,t s 1,2 + B t1...,t s 1,2 + with F = F ˆQ2, L 1 1 and B t1...,t d = 0. t s {2,3} B t1...,t s 4.19 Observe that in Eq.4.15 we adopted the following convention for s = 1: x = x, F t1,t 0 = F, γ t1,t 0,2 = γ 2 and B t1,t 0,2 = B 2. t 1,t 0 {2,3} b The simulation error that corresponds to the approximation error If the error between the true value of Q t1,...,t d and the estimated value ˆQ t1,...,t d, resulted from a simulation methodology, is denoted by β t1,...,t d then, the simulation error that appears from the approximation error given by Eq.4.15 has the expression E sapp d = d L 1 1 L s 1 s=1 t 1,...,t s 1 {2,3} F t1,...,t s 1 1 ˆQ t1,...,t s 1,2 +A t1,...,t s 1,2 + C t1,...,t s 1,2 2, 4.20

120 104 Chapter 4. Scan statistics over some block-factor type models where for s = 1: t 1,t 0 {2,3} x = x, F t1,t 0 = F, ˆQt1,t 0,2 = ˆQ 2, A t1,t 0,2 = A 2 and C t1,t 0,2 = C 2. The quantities A t1,...,t s and C t1,...,t s, that appear in the above formula, are computed via A t1,...,t s 1 = L s 1... L d 1 β t1,...,t d 4.21 and C t1...,t s 1 = L s 1 t s,...,t d {2,3} [F t1,...,t s 1 1 ˆQ 2 t1,...,t s 1,2 + A t1...,t s 1,2 +C t1...,t s 1,2 + t s {2,3} for 2 s d while for s = d we have A t1,...,t d = β t1,...,t d and C t1...,t d = 0. c The simulation error associated to the approximation formula This simulation error arise from the dierence H γ 2, γ 3, L 1 H ˆQ2, ˆQ 3, L 1 and is given by the relation E sf d = L L d 1 t 1,...,t d {2,3} C t1...,t s, 4.22 β t1,...,t d The total error is obtained by adding the two simulation error terms from Eq.4.20 and Eq.4.23 E total d = E sf d + E sapp d Examples and numerical results In order to illustrate the eciency of the approximation and the error bounds obtained in Section 4.2, in this section we present several examples for the particular cases of one and two dimensional discrete scan statistics Example 1: A one dimensional Bernoulli model We start this section with an example of a one dimensional block-factor model, for which we can compute exactly the distribution functions that appear in our proposed approximation: Q 2 and Q 3. This model is based on the parametrization introduced by [Haiman, 2012] see also [Haiman and Preda, 2013]. In the block-factor framework introduced in Section 4.1, we take d = 1, x 1 1 = 0, x 1 2 = 1 and we have c 1 = 2 and T 1 = T 1 1.

121 4.3. Examples and numerical results 105 Let, for 1 t T 1, W t, W t Bp, with 0 p 1 2 and W t B 1 2 be three sequences of 0 1 Bernoulli random variables such that W t, W t 1 t are independent random variables satisfying P W t = i, W t = j = qi, j and W t 's are T1 independent and independent of W t, W t 1 t. The joint probabilities qi, j are T1 evaluated via the relation 1 2p, if i, j = 0, 0 qi, j = p, if i, j = 0, 1 or i, j = 1, , if i, j = 1, 1. Let X s1 = W s1, W s 1, W s 1 for 1 s1 T 1, which forms an i.i.d. sequence of random variables and observe that the 1-way tensor that encapsulates the blockfactor model is given by X s1 = Xs1, X s We dene, for 1 s 1 T 1, our dependent model by X s1 = Π Xs1, X { Ws1 s1 +1 = +1, if W s 1 +1 = 0 W s 1, if W s 1 +1 = and we notice that X s1 's, being expressed as a 2 block-factor, form a 1-dependent, stationary sequence of 0 1 Bernoulli of parameter p random variables. This type of 1-dependent sequence was studied in [Haiman, 2012]. The author showed see [Haiman, 2012, Lemma 3] that the joint distribution of X 1, X 2 satises the equation P X 1 = i, X 2 = j = pi, j, 4.28 where 1 2p p2, if i, j = 0, 0 pi, j = p 3 4 p2, if i, j = 0, 1 or i, j = 1, p2, if i, j = 1, Moreover, from [Haiman, 2012, Theorem 1], we have that the joint distribution of the sequence X s1 1 s1 T 1 veries the recurrence formula P X 1 = a 1,..., X k+1 = a k+1 = P X 1 = a 1,..., X k = a k P X k+1 = a k+1 + P X 1 = a 1,..., X k 1 = a k 1 [P X k = a k, X k+1 = a k+1 P X k = a k P X k = a k ], 4.30 for 2 k T 1 1 and a 1,..., a k+1 {0, 1} k+1, which gives the proper tools for nding the exact values of Q 2 and Q 3. Following the approach from [Haiman and Preda, 2013, Section 2], let K 2 be a positive integer and denote with { } t+m 1 1 V K b = a = a 1,..., a K {0, 1} K max a i = b, 4.31 t=1,...,k m 1 +1 i=t

122 106 Chapter 4. Scan statistics over some block-factor type models the set of all binary sequences of length K for which the scan statistics with window of size m 1 takes the value b. Therefore, the value of the distribution function Q m1 K is computed by Q m1 K = P S m1 K n = n b=0 a V K b P X 1 = a 1,..., X K = a K, 4.32 where the last quantity is evaluated via the recurrence formula from Eq Particularizing for K = 2m 1 1 and K = 3m 1 1 in Eq.4.32, we obtain the exact values of Q 2 and Q 3, respectively. To show the accuracy of the proposed approximation for the one dimensional scan statistics distribution PS m1 T 1 n and the sharpness of the error bounds, we present in Table 4.1, a numerical study for m 1 = 8, T 1 = 1000 and selected values of the parameter p. The second and third column gives the exact values of Q 2 and Q 3, while the next two columns show the approximation and the corresponding error bound 1. n Q 2 Q 3 AppH E app 1 LowB UppB Eq.4.32 Eq.4.32 Eq.4.15 Eq.4.35 Eq.4.36 p = p = p = Table 4.1: One dimensional Bernoulli block-factor model: m 1 = 8, T 1 = 1000 Notice that, for comparison reasons, we have also included a lower LowB and an upper bound U ppb. These margins are derived from the following result, which extends the inequalities developed by [Glaz and Naus, 1991] see also [Wang et al., 2012], presented in Section Here we only have one error bound, namely the theoretical one, since the values of Q2 and Q 3 are computed exactly

123 4.3. Examples and numerical results 107 Proposition Let Ȳ 1, Ȳ2,... be a sequence of associated 2 see [Esary et al., 1967] and l-dependent random variables generated by a l + 1 block-factor. If we denote the distribution of the maximum of the sequence with QM = P max Ȳ t n, then for M l + 2 we have 1 t M QM [ Ql Ql+1 Ql+2 Ql+1Ql+2 ] M l 2, 4.33 QM Ql + 2 [1 Ql Ql + 2] M l In the particular case of one dimensional discrete scan statistics over the blockfactor model from Section 4.1, we have, in accordance with Remark 4.1.2, that X s1 is c 1 1 dependent so the sequence of moving sums Y 1,..., Y T1 m 1 +1 is l = m 1 +c 1 2 dependent. Thus, the distribution function Q m1 T 1, for T 1 2m 1 + c 1 1, is bounded by Q m1 T 1 Q m1 2m 1 + c 1 1 [ ] 1 + Qm 1 2m 1+c 1 2 Q m1 2m 1 +c 1 1 T1 2m 1 +c 1 1, 4.35 Q m1 2m 1 +c 1 2Q m1 2m 1 +c 1 1 Q m1 T 1 Q m1 2m 1 + c 1 1 [1 Q m1 2m 1 + c 1 2 +Q m1 2m 1 + c 1 1] T 1 2m 1 +c We observe that, if in the foregoing relations we take c 1 = 1, that is we consider the i.i.d. model, then we get exactly the bounds presented in Section From the numerical values in Table 4.1, we see that these bounds are very tight and also that our estimate is very accurate. Remark Notice that in the text of Proposition we assumed that the random variables Ȳ1, Ȳ2,... are associated and generated by a l + 1 block-factor. To construct such sequences, it is enough to take the function that denes the block-factor to be increasing or decreasing in each argument see for example [Oliveira, 2012, Chapter 1] or [Esary et al., 1967, Theorem 2.1 and property P 4 ]. Remark The proof of Proposition 4.3.1, due to the additional hypothesis of associated random variables, follows closely the proof steps of the corresponding result in [Glaz and Naus, 1991] and will be omitted here. 2 We say that the random variables Ȳ 1, Ȳ2,..., Ȳn are associated if, given two coordinatewise nondecreasing functions f, g : R n R then Cov [ f Ȳ 1, Ȳ2,..., Ȳn, g Ȳ1, Ȳ2,..., Ȳn ] 0, whenever the covariance exists.

124 108 Chapter 4. Scan statistics over some block-factor type models Example 2: Length of the longest increasing run We continue our list of examples with one that belongs to the class of runs statistics. In the subsequent we employ the notations introduced in Section 4.1. Let X1, X2,..., X T1 be a sequence of independent and identically distributed random variables with the common distribution F. We say that the subsequence Xk,..., X k+l 1 forms an increasing run or ascending run of length l 1, starting at position k 1, if it veries the following relation X k 1 > X k < X k+1 < < X k+l 1 > X k+l We denote the length of the longest increasing run among the rst T 1 random variables by M T1. This run statistics plays an important role in many applications in elds such computer science, reliability theory or quality control. The asymptotic behaviour of M T1 has been investigated by several authors depending on the common distribution, F. In the case of a continuous distribution, [Pittel, 1981] see also [Frolov and Martikainen, 1999] has shown that this behaviour does not depend on the common law. For the particular setting of uniform U[0, 1] random variables, this problem was addressed by [Révész, 1983], [Grill, 1987] and [Novak, 1992]. Under the assumption that the distribution F is discrete, the limit behaviour of M T1 depends strongly on the common law F, as [Csaki and Foldes, 1996] see also [Grabner et al., 2003] and [Eryilmaz, 2006] proved for the case of geometric and Poisson distribution. In [Louchard, 2005], the case of discrete uniform distribution is investigated, while in [Mitton et al., 2010], the authors study the asymptotic distribution of M T1 when the variables are uniformly distributed but not independent. In this example, we evaluate the distribution of the length of the longest increasing run using the methodology developed in this chapter. The idea is to express the distribution of the random variable M T1 in terms of the distribution of the scan statistics random variable. In the block-factor setting described in Section 4.1, we take d = 1, x 1 1 = 0, x 1 2 = 1 and T 1 = T 1 1. It follows that, for 1 s 1 T 1, the 1-way tensor X s1 is Xs1, X s1 +1 and if we dene the block-factor transformation Π : R 2 R by Πx, y = then, our block-factor model becomes { 1, if x < y 0, otherwise 4.38 X s1 = 1 Xs1 < X s Clearly, the foregoing equation shows that X 1,..., X T1 form a 1-dependent and stationary sequence of random variables. Notice that the distribution of M T1 and the distribution of the length of the longest run of ones 3, L T1, among the rst T 1 binary random variables X s1, are related and 3 This statistic is also known as the length of the longest success run or head run and was extensively studied in the literature. One can consult the monographs of [Balakrishnan and Koutras, 2002] and [Fu and Lou, 2003] for applications and further results concerning this statistic.

125 4.3. Examples and numerical results 109 satisfy the following identity P M T1 m 1 = P L T1 < m 1, for m Moreover, the random variable L T1 can be interpreted as a particular case of the scan statistics random variable and between the two we have the relation P L T1 m 1 = P S m1 T 1 m 1 = P S m1 T 1 = m Hence, combining Eq.4.40 and Eq.4.41, we can express the distribution of the length of the longest increasing run as P M T1 m 1 = P S m1 T 1 < m Thus, we can estimate the distribution of M T1 using the foregoing identity and the approximations developed in this chapter for the discrete scan statistics random variable. We should also note that [Novak, 1992] studied the asymptotic behaviour of L T1 over a sequence of m-dependent binary random variables. The author showed that, given a stationary m-dependent sequence of random variables with values 0 and 1, W k k 1, if there exist positive constants t, C such that P W k+1 = 1 W 1 = = W k = 1 1, for all k C, 4.43 Ckt then, as N max P L N < k e Nrk ln N h = O, k N N where rk = P W 1 = = W k = 1 P W 1 = = W k+1 = 1 and h = mt 1. In order to illustrate the accuracy of the approximation of M T1 based on scan statistics, using the methodology developed in Section 4.2.1, we consider that the random variables X s1 's have a common uniform U [0, 1] distribution. Simple calculations show that P X 1 = = X k = 1 = 1 k+1! and P X k+1 = 1 X 1 = = X k = 1 = 1 k k, 4.45 thus C = 2, t = 1 and h = 1. In the context of our particular situation, the result of [Novak, 1992] in Eq.4.44 becomes: P max LT1 < m 1 e T 1rm 1 ln T1 = O, m 1 T 1 where rm 1 = P X 1 = = X m1 = 1 P X 1 = = X m1 +1 = 1 = m 1+1 m 1 +2!. In Table 4.2, we consider a numerical compared study between the simulated value column Sim, the approximation based on scan statistics column AppH and the limit distribution column LimApp of the distribution of the length of the longest increasing run, P M T1 m 1, in a sequence of T 1 = random variables distributed uniformly over [0, 1]. The results show that both our method and the asymptotic approximation in Eq.4.43 are very accurate. T 1

126 110 Chapter 4. Scan statistics over some block-factor type models m 1 Sim AppH E total 1 LimApp Eq4.24 Eq Table 4.2: The distribution of the length of the longest increasing run: T1 = 10001, IT ER sim = 10 4, IT ER app = 10 5 PMT1 m Sim 0.1 Scan Approx Limit Distribution m 1 Figure 4.5: Cumulative distribution function for approximation, simulation and limit law Remark For the simulated values and the estimation of the distribution functions Q 2 and Q 3 that appear in the scan approximation formula see Eq.3.91, we employed the importance sampling algorithm Algorithm 1 presented in the context of i.i.d. random variables in Section 3.4.2, with IT ER sim = 10 4 and IT ER app = 10 5 iterations of the algorithm, respectively. We see that in our setting, the Bonferroni bound can be easily computed via B1 = T 1 m 1 +1 m 1 +1!. The second step in the algorithm is similar to the one described in Example and is implemented using a simple sorting procedure of m independent U [0, 1] random variables Example 3: Moving average of order q model In this third example, we consider the particular situation of the one dimensional discrete scan statistics over a M Aq model. In the block-factor model introduced in Section 4.1 we take d = 1, x 1 1 = 0, x 1 2 = q for q 1 a positive integer and X 1, X2,..., X T1 a sequence of independent and identically distributed Gaussian random variables with known mean µ and variance σ 2. We observe that, based on the notations used in Section 4.1, c 1 = q + 1, T 1 = T 1 q and for s 1 {1,..., T 1 }, the 1-way tensor X s1 becomes X s1 = Xs1, X s1 +1,..., X s1 +q Let a = a 1,..., a q+1 R q+1 be a xed non null vector and take Π : R q+1 R, the measurable transformation that denes the block-factor model, to be equal with Πx 1,..., x q+1 = a 1 x 1 + a 2 x a q+1 x q Following Eq.4.2, our dependent model is dened by the relation X s1 = Π X s1 = a 1 Xs1 + a 2 Xs a q+1 Xs1 +q, 1 s 1 T

127 4.3. Examples and numerical results 111 Clearly, from Eq.4.49, the random variables X 1,..., X T1 form a moving average of order q. Notice that the moving sums Y i1, 1 i 1 T 1 m can be expressed as Y i1 = i 1 +m 1 1 s 1 =i 1 X s1 = b 1 Xi1 + b 2 Xi b m1 +q X i1 +m 1 1+q, 4.50 where the coecients b 1,..., b m1 +q are evaluated by a For m 1 q, b k = k a j, k {1,..., q} j=1 q+1 a j, k {q + 1,..., m 1 } j=1 q+1 j=k m 1 +1 a j, k {m 1 + 1,..., m 1 + q} 4.51 b For m 1 < q, b k = k a j, k {1,..., m 1 } j=1 k a j, k {m 1 + 1,..., q} j=k m 1 +1 q+1 a j j=k m 1 +1, k {q + 1,..., m 1 + q} Therefore, for each i 1 {1,..., T 1 m 1 + 1}, the random variable Y i1 follows a normal distribution with mean E [Y i1 ] = b b m+q µ and variance V ar [Y i1 ] = b b 2 m+q σ 2. Moreover, it is not hard to see one can use the same type of arguments as in Lemma that the covariance matrix Σ = {Cov [Y t, Y s ]} has the entries m 1 +q t s b j b Cov [Y t, Y s ] = t s +j σ 2, t s m 1 + q j=1 0, otherwise. Given the mean and the covariance matrix of the vector Y 1,..., Y T1 m 1 +1, one can use the importance sampling algorithm of [Naiman and Priebe, 2001] see also [Malley et al., 2002] detailed in Section Example or the one of [Shi et al., 2007] presented in Section to estimate the distribution of the one dimensional discrete scan statistics S m1 T 1. Another way is to use the quasi-monte

128 112 Chapter 4. Scan statistics over some block-factor type models Carlo algorithm developed by [Genz and Bretz, 2009] to approximate the multivariate normal distribution. Following the comparison study presented in Section 3.4.4, in this example we adopt the rst importance sampling procedure. In order to evaluate the accuracy of the approximation developed in Section 4.2.1, we consider q = 2, T 1 = 1000, m 1 = 20, Xi N 0, 1 and the coecients of the moving average model to be a 1, a 2, a 3 = 0.3, 0.1, 0.5. We compare our approximation with the one column AppPT given in [Wang and Glaz, 2013] see also [Wang, 2013, Chapter 4]. In Table 4.3, we present numerical results for the setting described above. In our algorithms we used IT ER app = 10 6 iterations for the approximation and IT ER sim = 10 5 replicas for the simulation. n Sim AppPT AppH E sapp 1 E sf 1 Total Eq.1.46 Eq.4.20 Eq.4.23 Error Table 4.3: MA2 model: m 1 = 20, T 1 = 1000, X i = 0.3 X i X i X i+2, IT ER app = 10 6, IT ER sim = 10 5 In Figure 4.6, we illustrate the cumulative distribution functions obtained by approximation and simulation. For the approximation we present also the corresponding lower and upper bounds computed from the total error of the approximation process, the last column in Table Example 4: A game of minesweeper This example draws its motivation and name from the well known computer game of Minesweeper, whose objective is to detect all the mines from a grided mineeld in such a way that no bomb is detonated. Our two dimensional model can be described as follows. Let d = 2, T 1, T { 2 be positive integers and Xs1,s 2 1 s 1 T 1, 1 s 2 T } 2 be a family of i.i.d. Bernoulli random variables of parameter p. We interpret the random variable X s1,s 2 as representing the presence X s1,s 2 = 1 or the absence X s1,s 2 = 0 of a mine in the elementary square subregion r 2 s 1, s 2 = [s 1 1, s 1 ] [s 2 1, s 2 ], within the region R 2 = [0, T 1 ] [0, T 2 ]. In this example, we consider x 1 1 = x 1 2 = 1 and x 2 1 = x 2 2 = 1. Based on the notations introduced in Section 4.1, we observe that c 1 = c 2 = 3, T 1 = T 1 2 and

129 4.3. Examples and numerical results 113 PSm1 T 1 n n Error Bound App Sim Figure 4.6: Cumulative distribution function for approximation and simulation along with the corresponding error under M A model T 2 = T 2 2. For each pair s 1, s 2 {2,..., T 1 1} {2,..., T 2 1}, the 2-way tensor X s1,s 2 is given by X s1,s 2 j 1, j 2 = X s1 +j 1 2,s 2 +j 2 2, where 1 j i c i, i {1, 2} Since the elements of the tensor X s1,s 2 are arranged in a 3 3 matrix, we dene the block-factor real valued transformation Π on the set of matrices M 3,3 R, such that it veries the relation Π a 11 a 12 a 13 a 21 a 22 a 23 a 31 a 32 a 33 = 1 s,t 3 a st a From Eq.4.2 and Eq.4.55, our dependent model is dened as X s1,s 2 = Π X s1 +1,s 2 +1 = X s1 +i,s 2 +j i,j {0,1,2} 2 i,j 1,1 From the foregoing relation, we can interpret the random variable X s1,s 2 as the number of neighboring mines associated with the location s 1, s 2. In Figure 4.7, we present a realization of the introduced model. On the left, we have the realization of the initial set of random variables where a gray square represents the presence of a mine, while the white square signies the absence of a mine. On the right side, we have the realization of the X s1,s 2 according to the transformation Π from Eq.4.55, that is the corresponding number of neighboring mines associated to each site. We present numerical results Table 4.4-Table 4.11 for the described block-factor model with T 1 = T 2 = 44 that is T 1 = T 2 = 42, m 1 = m 2 = 3 and the underlying

130 114 Chapter 4. Scan statistics over some block-factor type models X s1,s2 : X s1,s2 : T2 T2 1 T Π T T1 1 T1 Figure 4.7: A realization of the minesweeper related model random eld generated by i.i.d. Bernoulli random variables of parameter p X s1,s 2 Bp in the range {0.1, 0.3, 0.5, 0.7}. We also include numerical values for the corresponding i.i.d. model: T 1 = T 2 = 42, m 1 = m 2 = 3 and X s1,s 2 B8, p. n Sim AppH E sapp2 E sf 2 Total Eq.4.20 Eq.4.23 Error Table 4.4: Block-factor: m 1 = m 2 = 3, T 1 = T 2 = 44, T 1 = T 2 = 42, p = 0.1, IT ER = 10 8 For all of our results presented in the tables we used Monte Carlo simulations with 10 8 iterations for the block-factor model and with 10 5 replicas for the i.i.d. model. Notice that the contribution of the simulation error that corresponds to the approximation error E sapp to the total error is almost negligible in most of the cases with respect to the other simulation error E sf. Thus, the precision of the method will depend mostly on the number of iterations IT ER used to estimate Q t1,t 2. The cumulative distribution function and the probability mass function for the block-factor and i.i.d. models are presented in Figure 4.8 and Figure 4.9.

131 4.3. Examples and numerical results 115 n Sim AppH E sapp2 E sf 2 Total Eq.3.97 Eq.3.96 Error Table 4.5: Independent: m 1 = m 2 = 3, T 1 = T 2 = 42, Br = 8, p = 0.1, IT ER = 10 5 n Sim AppH E sapp2 E sf 2 Total Eq.4.20 Eq.4.23 Error Table 4.6: Block-factor: m 1 = m 2 = 3, T 1 = T 2 = 44, T 1 = T 2 = 42, p = 0.3, IT ER = 10 8 n Sim AppH E sapp2 E sf 2 Total Eq.3.97 Eq.3.96 Error Table 4.7: Independent: m 1 = m 2 = 3, T 1 = T 2 = 42, Br = 8, p = 0.3, IT ER = 10 5

132 116 Chapter 4. Scan statistics over some block-factor type models n Sim AppH E sapp2 E sf 2 Total Eq.4.20 Eq.4.23 Error Table 4.8: Block-factor: m 1 = m 2 = 3, T 1 = T 2 = 44, T 1 = T 2 = 42, p = 0.5, IT ER = 10 8 n Sim AppH E sapp2 E sf 2 Total Eq.3.97 Eq.3.96 Error Table 4.9: Independent: m 1 = m 2 = 3, T 1 = T 2 = 42, Br = 8, p = 0.5, IT ER = 10 5 n Sim AppH E sapp2 E sf 2 Total Eq.4.20 Eq.4.23 Error Table 4.10: Block-factor: m 1 = m 2 = 3, T1 = T 2 = 44, T 1 = T 2 = 42, p = 0.7, IT ER = 10 8 n Sim AppH E sapp2 E sf 2 Total Eq.3.97 Eq.3.96 Error Table 4.11: Independent: m 1 = m 2 = 3, T 1 = T 2 = 42, Br = 8, p = 0.7, IT ER = 10 5

133 4.3. Examples and numerical results 117 ECDF for block factor and i.i.d. model p = 0.1 p = T1,T2 n PSm1,m T1,T2 n PSm1,m independent block factor 0.75 independent block factor n n p = 0.5 p = T1,T2 n T1,T2 n PSm1,m independent block factor n PSm1,m independent block factor n Figure 4.8: Cumulative distribution function for blockfactor and i.i.d. models p = 0.1 independent block factor p = 0.3 independent block factor T1,T2 = n T1,T2 = n PSm1,m n PSm1,m n p = 0.5 independent block factor p = 0.7 independent block factor T1,T2 = n T1,T2 = n PSm1,m n PSm1,m n Figure 4.9: Probability mass function for block∠-factor and i.i.d. models

134

135 Conclusions and perspectives In this thesis, we provide a unied method for estimating the distribution of the multidimensional discrete scan statistics based on a result concerning the extremes of 1-dependent sequences of random variables. This approach has two main advantages over the existing ones presented in the literature: rst, it includes, beside an approximation formula, expressions for the corresponding error bounds and second, the approximation can be applied no matter what the distribution of the observations under the null hypothesis is. We consider two models for the underlying distribution of the random eld over which the scan process is performed: the widely studied i.i.d. model and a new dependent model based on a block-factor construction. For each of these models we give detailed expressions for the approximation, as well as for the associated error bounds formulas. Since the simulation process plays an important part in the estimation of the multidimensional discrete scan statistics distribution, we present, for the particular case of i.i.d. observations, a general importance sampling algorithm that increases the eciency of the proposed approximation. From the numerical applications conducted in both, the i.i.d. and the block-factor models, we conclude that our results are accurate especially for the high order quantiles. Currently, we are working on adapting the methodology developed for the multidimensional discrete scan statistics to the continuous scan statistics framework. One of the main challenge in this problem is to develop fast and ecient algorithms for the simulation of the continuous scan statistics, in more than three dimensions. Deriving accurate approximations for the distribution of the scan statistics over a random eld generated by independent but not identically distributed random variables, constitutes a problem of future interest. Another future direction of research, is to consider, in the multidimensional discrete scan statistics problem, that the scanning window has a general convex shape and to study the inuence of this shape in the scanning process. We should mention that for the multidimensional continuous scan statistics for Poisson processes, this problem was already addressed by [Alm, 1998].

136

137 Appendix A Supplementary material for Chapter 1 A.1 Supplement for Section 1.1 In this section, we outline the algorithm of [Karwe and Naus, 1997], used for nding the distribution function of the one dimensional scan statistics over a sequence of i.i.d. discrete random variables of length at most 3m 1. Based on the notations introduced in Section 1.1, we dene b n 2m 1 y = P S m 1 2m 1 n, Y m1 +1 = y, fy = PX 1 = y, Q n 2m 1 = P S m1 2m 1 n. We have the following recurrence relations for Q m1 2m 1 1 and Q m1 2m 1 : n b n 21 y = fj fy, b n 2m 1 y = Q n 2m 1 = Q n 2m 1 1 = j=0 y η=0 n y+η ν=0 n b n 2m 1 y, y=0 n x=0 b n ν 2m 1 1 y ηfνfη, fxq n x 2m 1 1. For the computation of Q m1 3m 1 1 and Q m1 3m 1, we take b k 1,k 2 3m 1 x, y = 0 if x > k 1 k 2 or y > k 2 and for the other cases as b k 1,k 2 3m 1 m 1 x, y = P {Y i k 1 } {Y m1 +1 = x} i=1 2m 1 j=m 1 +2 {Y j k 2 } {Y 2m1 +1 = y}. Considering that the random variables X i take values in the set {0,..., c}, we have the following recurrences see [Karwe and Naus, 1997, Appendix] for closed form

138 122 Appendix A. Supplementary material for Chapter 1 formulas: b k 1,k 2 3m 1 x, y = c Q n 3m 1 = Q n 3m 1 1 = c c α=0 β=0 γ=0 n n x=0 y=0 n n x 1 b k 1 α,k 2 β 3m 1 1 x β, y γfαfβfγ, b n,n 3m 1 x, y, n x 1 x 1 =0 x 2 =0 x 3 =0 [ n x3 x 4 =0 A.2 Supplement for Section 1.2 b n x 2,n x 3 3m 1 1 x 1, x 4 ] fx 2 fx 3. Some supplementary materials for the results presented in Section 1.2 are given in this section. A.2.1 Supplement for Section In this section, we include the formulas for the unknown quantities that appear in the computation of the product-type approximation given in Eq.1.64 in the case of binomial and Poisson observations. These formulas were presented in [Glaz et al., 2009, Chapter 5], for the case of T 1 = T 2 and m 1 = m 2. The following notation, W j 1,j 2 i 1,i 2 = j 1 i=i 1 j 2 j=i 2 X i,j, will be used throughout this section. a X i,j are i.i.d. binomial random variables with parameters r and p We have and Qm 1 + 1, m 2 = Qm 1 + 1, m = n m 1 1m 2 r y=0 n m 1 1m 2 1r y 1 =0 [n y 1 y 2 y 3 ] m 1 1r P P P y 4 =0 W 1,m 2+1 1,m 2 +1 a 2 P 2 W 1,m 2 1,1 n y P W m 1,m 2 2,1 = y n y 1 m 2 1r y 2 =0 [n y 1 y 2 y 3 ] m 1 1r W m 1,m 2 2,2 = y 1 P W m 1,1 2,1 = y 4 P P y 5 =0 W m 1+1,1 m 1 +1,1 a 3 W 1,m 2 1,2 = y 2 P n y 1 m 2 1r W m 1,m ,m 2 +1 = y 5, y 3 =0 P W 1,1 1,1 a 1 P W m 1+1,m 2 +1 m 1 +1,m 2 +1 a 4 W m 1+1,m 2 m 1 +1,2 = y 3

139 A.2. Supplement for Section where a 1 = n y 1 y 2 y 4, a 2 = n y 1 y 2 y 5, a 3 = n y 1 y 3 y 4, a 4 = n y 1 y 3 y 5. The random variables W 1,1 1,1, W 1,m 2+1 1,m 2 +1, W m 1+1,m 2 +1 m 1 +1,m 2 +1 are binomial distributed with parameters r and p, W m 1,m 2 2,2 is a binomial random variable with parameters m 1 1m 2 1r and p, W 1,m 2 1,2 and W m 1+1,m 2 m 1 +1,2 are binomial random variable with parameters m 2 1r and p and W m 1,1 2,1 and W m 1,m ,m 2 +1 are binomial random variable with parameters m 1 1r and p. b X i,j are i.i.d. Poisson random variables of parameter λ We have and where Qm 1 + 1, m 2 = Qm 1 + 1, m = P P P n n y=0 n y 1 y 1 =0 y 2 =0 y 3 =0 W 1,m 2+1 P 2 W 1,m 2 1,1 n y P W m 1,m 2 2,1 = y n y 1 1,m 2 +1 a 2 P W m 1,m 2 2,2 = y 1 P W m 1,1 2,1 = y 4 P n y 1 y 2 y 3 y 4 =0 W m 1+1,1 n y 1 y 2 y 3 y 5 =0 m 1 +1,1 a 3 W 1,m 2 1,2 = y 2 P W m 1,m ,m 2 +1 = y 5, P a 1 = n y 1 y 2 y 4, a 2 = n y 1 y 2 y 5, a 3 = n y 1 y 3 y 4, a 4 = n y 1 y 3 y 5. The random variables W 1,1 1,1, W 1,m 2+1 1,m 2 +1, W m 1+1,m 2 +1 m 1 +1,m 2 +1 P W 1,1 1,1 a 1 W m 1+1,m 2 +1 m 1 +1,m 2 +1 a 4 W m 1+1,m 2 m 1 +1,2 = y 3 are Poisson distributed of parameter λ, W m 1,m 2 2,2 is a Poisson random variable of mean m 1 1m 2 1λ, W 1,m 2 1,2 and W m 1+1,m 2 m 1 +1,2 are Poisson of mean m 2 1λ and W m 1,1 2,1 and W m 1,m ,m 2 +1 are Poisson random variable of mean m 1 1λ. A.2.2 Supplement for Section Following the methodology presented in Section for nding the upper bound of the one dimensional scan statistics, in this section we extend the results to the two dimensional case. Assume, for simplicity, that T 1 = L 1 m 1 and T 2 = L 2 m 2.

140 124 Appendix A. Supplementary material for Chapter 1 Applying [Kuai et al., 2000] inequality, we can write P S m1,m 2 T 1, T 2 n = 1 P 1 + L 1 1 i 1 =1 L 1 1 L 2 1 i 2 =1 L 2 1 j 1 =1 j 2 =1 L 1 1 L1 1 i 1 =1 L 2 1 j 1 =1 j 2 =1 L 2 1 i 2 =1 E c i 1,i 2 θ i1,i 1 PE c i 1,i 2 2 P E c i 1,i 2 E c j 1,j θi1,i 2 PE c i 1,i 2 1 θ i1,i 2 PEi c 1,i 2 2 P =: UB, Ei c 1,i 2 Ej c 1,j 2 θi1,i 2 PEi c 1,i 2 where for 1 i 1 L 1 1 and 1 i 2 L 2 1, the events E i1,i 2 E i1,i 2 = max Y s1,s 2 n i 1 1m 1 +1 s 1 i 1 m 1 +1 i 2 1m 2 +1 s 2 i 2 m 2 +1 are dened by and where θ i1,i 2 = L 1 1 L 2 1 j 1 =1 j 2 =1 P Ei c 1,i 2 Ej c 1,j 2 PEi c 1,i 2 L 1 1 L 2 1 j 1 =1 j 2 =1 P Ei c 1,i 2 Ej c 1,j 2 PEi c. 1,i 2 To simplify the results, we adopt the notations M = L 1 1L 2 1 and Σ i1,i 2 = L 1 1 L 2 1 j 1 =1 j 2 =1 P E c i 1,i 2 E c j 1,j 2. We observe, from the denition of E i1,i 2, that if i 1 j 1 2 or i 2 j 2 2, then E i1,i 2 and E j1,j 2 are independent and, since the events are also stationary, we have and PE c i 1,i 2 = 1 PE i1,i 2 = 1 Q2m 1, 2m 2, PE c i 1,i 2 E c j 1,j 2 = 1 PE i1,i 2 PE i1,i 2 + PE i1,i 2 E j1,j 2 = 1 2Q2m 1, 2m 2 + PE i1,i 2 E j1,j 2.

141 A.3. Supplement for Section Simple calculations lead to Σ i1,i 2 = 4 7Q2m 1, 2m 2 + Q2m 1, 3m 2 + Q3m 1, 2m 2 + Q3m 1, 3m 2 +M 4 [1 Q2m 1, 2m 2 ] 2, if i 1 {1, L 1 1}, i 2 {1, L 2 1} 6 11Q2m 1, 2m 2 + Q2m 1, 3m 2 + 2Q3m 1, 2m 2 + 2Q3m 1, 3m 2 +M 6 [1 Q2m 1, 2m 2 ] 2, if 2 i 1 L 1 2, i 2 {1, L 2 1} 6 11Q2m 1, 2m 2 + 2Q2m 1, 3m 2 + Q3m 1, 2m 2 + 2Q3m 1, 3m 2 +M 6 [1 Q2m 1, 2m 2 ] 2, if i 1 {1, L 1 1}, 2 i 2 L Q2m 1, 2m [Q2m 1, 3m 2 + Q3m 1, 2m 2 + Q3m 1, 3m 2 ] +M 9 [1 Q2m 1, 2m 2 ] 2, if 2 i 1 L 1 2, 2 i 2 L 2 2. Hence, the upper bound becomes UB = 1 4B 1 2L 1 3B 2 2L 2 3B 3 L 1 3L 2 3B 4, where for i {1, 2, 3, 4}, θ i [1 Q2m 1, 2m 2 ] 2 B i = Σ i + 1 θ i [1 Q2m 1, 2m 2 ] + 1 θ i [1 Q2m 1, 2m 2 ] 2 Σ i θ i [1 Q2m 1, 2m 2 ], Σ i θ i = 1 Q2m 1, 2m 2 Σ i, 1 Q2m 1, 2m 2 and Σ 1, Σ 2, Σ 3 and Σ 4 correspond to the rst, second, third and fourth branch, respectively of Σ i1,i 2 given above. The unknown quantities Q2m 1, 2m 2, Q2m 1, 3m 2, Q3m 1, 2m 2 and Q3m 1, 3m 2 are evaluated by simulation. A.3 Supplement for Section 1.3 Using the same approach as in [Guerriero et al., 2010a] for the case of T 1 = T 2 = T 3, m 1 = m 2 = m 3 and where X i,j,k were Bernoulli random variables, we present computational expressions for the unknown quantities involved in the approximation formula given in Eq We consider detailed formulas only for the situation of binomial observations, the Poisson case being similar see Remark A.3.1. Throughout this section, we will use the following shorthand notation: W j 1,j 2,j 3 i 1,i 2,i 3 = j 1 j 2 i=i 1 j=i 2 j 3 k=i 3 X i,j,k. Assume that X i,j,k are i.i.d. binomial random variables with parameters r and p. Qm 1, m 2, m 3

142 126 Appendix A. Supplementary material for Chapter 1 It is clear that Qm 1, m 2, m 3 = P W m 1,m 2,m 3 1,1,1 n n m 1 m 2 m 3 r m1 m 2 m 3 r = p i 1 p m 1m 2 m 3 r i. i i=0 Qm 1 + 1, m 2, m 3, Qm 1, m 2 + 1, m 3, Qm 1, m 2, m We have Qm 1 + 1, m 2, m 3 = Qm 1, m 2 + 1, m 3 = Qm 1, m 2, m = Qm 1, m 2 + 1, m We can write where n m 1 1m 2 m 3 r x 1 =0 n m 1 m 2 1m 3 r x 1 =0 n m 1 m 2 m 3 1r x 1 =0 Qm 1, m 2 + 1, m y 1 P 2 W 1,m 2,m 3 1,1,1 n x 1 P W m 1,m 2,m 3 2,1,1 = x 1, P 2 W m 1,1,m 3 1,1,1 n x 1 P W m 1,m 2,m 3 1,2,1 = x 1, P 2 W 1,m 2,m 3 1,1,1 n x 1 P W m 1,m 2,m 3 1,1,2 = x 1. y 2 y 3 y 4 y 5 x 1 =0 x 2 =0 x 3 =0 x 4 =0 x 5 =0 y 1 = n m 1 m 2 1m 3 1r, y 2 = n x 1 m 1 m 2 1r, y 3 = n x 1 m 1 m 2 1r, 4 i=1 y 4 = [n x 1 x 2 x 3 ] m 1 m 3 1r, y 5 = [n x 1 x 2 x 3 ] m 1 m 3 1r a i 5 d j x j, j=1 and a 1 = P W m 1,1,1 1,1,1 n x 1 x 2 x 4, a 2 = P W m 1,m 2 +1,1 1,m 2 +1,1 n x 1 x 3 x 4, a 3 = P W m 1,1,m ,1,m 3 +1 n x 1 x 2 x 5, a 4 = P W m 1,m 2 +1,m ,m 2 +1,m 3 +1 n x 1 x 3 x 5

143 A.3. Supplement for Section and d 1 x 1 = P W m 1,m 2,m 3 1,2,2 = x 1, d 2 x 2 = P W m 1,m 2,1 1,2,1 = x 2, d 3 x 3 = P W m 1,m 2,m ,2,m 3 +1 = x 3, d 4 x 4 = P W m 1,1,m 3 1,1,2 = x 4, d 5 x 5 = P W m 1,m 2 +1,m 3 1,m 2 +1,2 = x 5. Qm 1 + 1, m 2, m Similarly, we can write where Qm 1 + 1, m 2, m y 1 y 2 y 3 y 4 y 5 x 1 =0 x 2 =0 x 3 =0 x 4 =0 x 5 =0 y 1 = n m 1 1m 2 m 3 1r, y 2 = n x 1 m 1 1m 2 r, y 3 = n x 1 m 1 m 2 1r, 4 i=1 y 4 = [n x 1 x 2 x 3 ] m 2 m 3 1r, y 5 = [n x 1 x 2 x 3 ] m 2 m 3 1r a i 5 d j x j, j=1 and and a 1 = P W 1,m 2,1 1,1,1 n x 1 x 2 x 4, a 2 = P W 1,m 2,m ,1,m 3 +1 n x 1 x 3 x 4, a 3 = P W m 1+1,m 2,1 m 1 +1,1,1 n x 1 x 2 x 5, a 4 = P W m 1+1,m 2,m 3 +1 m 1 +1,1,m 3 +1 n x 1 x 3 x 5 d 1 x 1 = P W m 1,m 2,m 3 2,1,2 = x 1, d 2 x 2 = P W m 1,m 2,1 2,1,1 = x 2, d 3 x 3 = P W m 1,m 2,m ,1,m 3 +1 = x 3, d 4 x 4 = P W 1,m 2,m 3 1,1,2 = x 4, d 5 x 5 = P W m 1+1,m 2,m 3 m 1 +1,1,2 = x 5.

144 128 Appendix A. Supplementary material for Chapter 1 Qm 1 + 1, m 2 + 1, m 3 As before, where Qm 1 + 1, m 2 + 1, m 3 y 1 y 2 y 3 y 4 y 5 x 1 =0 x 2 =0 x 3 =0 x 4 =0 x 5 =0 y 1 = n m 1 1m 2 1m 3 r, y 2 = n x 1 m 1 1m 3 r, y 3 = n x 1 m 1 1m 3 r, 4 i=1 y 4 = [n x 1 x 2 x 3 ] m 2 1m 3 r, y 5 = [n x 1 x 2 x 3 ] m 2 1m 3 r a i 5 d j x j, j=1 and and a 1 = P W 1,1,m 3 1,1,1 n x 1 x 2 x 4, a 2 = P W 1,m 2+1,m 3 1,m 2 +1,1 n x 1 x 3 x 4, a 3 = P W m 1+1,1,m 3 m 1 +1,1,1 n x 1 x 2 x 5, a 4 = P W m 1+1,m 2 +1,m 3 m 1 +1,m 2 +1,1 n x 1 x 3 x 5 d 1 x 1 = P W m 1,m 2,m 3 2,2,1 = x 1, d 2 x 2 = P W m 1,1,m 3 2,1,1 = x 2, d 3 x 3 = P W m 1,m 2 +1,m 3 2,m 2 +1,1 = x 3, d 4 x 4 = P W 1,m 2,m 3 1,2,1 = x 4, d 5 x 5 = P W m 1+1,m 2,m 3 m 1 +1,2,1 = x 5. Qm 1 + 1, m 2 + 1, m We have Qm 1 + 1, m 2 + 1, m y 1 y 2 y 3 y 4 y 5 y 6 y 7 y 8 y 9 x 1 =0 x 2 =0 x 3 =0 x 4 =0 x 5 =0 x 6 =0 x 7 =0 x 8 =0 x 9 =0 9 d j x j, j=1 A 1 A 2

145 A.3. Supplement for Section with y 1 = n m 1 1m 2 1m 3 1r, y 2 = n x 1 m 1 1m 2 1r, y 3 = n x 1 m 1 1m 2 1r, y 4 = [n x 1 x 2 x 3 ] m 1 1m 3 1r, y 5 = [n x 1 x 2 x 3 ] m 1 1m 3 1r, y 6 = n x 1 x 2 x 4 m 1 1r, y 7 = n x 1 x 2 x 5 m 1 1r, y 8 = n x 1 x 3 x 4 m 1 1r, y 9 = n x 1 x 3 x 5 m 1 1r, and d 1 x 1 = P W m 1,m 2,m 3 2,2,2 = x 1, d 2 x 2 = P W m 1,m 2,1 2,2,1 = x 2, d 3 x 3 = P W m 1,m 2,m ,2,m 3 +1 = x 3, d 4 x 4 = P W m 1,1,m 3 2,1,2 = x 4, d 5 x 5 = P W m 1,m 2 +1,m 3 2,m 2 +1,2 = x 5, d 6 x 6 = P W m 1,1,1 2,1,1 = x 6, d 7 x 7 = P W m 1,m 2 +1,1 2,m 2 +1,1 = x 7, d 8 x 8 = P W m 1,1,m ,1,m 3 +1 = x 8, d 9 x 9 = P W m 1,m 2 +1,m ,m 2 +1,m 3 +1 = x 9. The unknown A 1 is computed by with A 1 = y 10 y 11 y 12 y 13 y 14 x 10 =0 x 11 =0 x 12 =0 x 13 =0 x 14 =0 4 i=1 a i 14 j=10 y 10 = n x 1 x 2 x 4 x 6 m 2 1m 3 1r, d j x j, y 11 = [n x 1 x 4 x 10 x 2 + x 6 x 3 + x 8 ] m 3 1r, y 12 = [n x 1 x 5 x 10 x 2 + x 7 x 3 + x 9 ] m 3 1r, y 13 = [n x 1 x 2 x 10 x 4 + x 6 x 5 + x 7 ] m 2 1r, y 14 = [n x 1 x 3 x 10 x 4 + x 8 x 5 + x 9 ] m 2 1r,

146 130 Appendix A. Supplementary material for Chapter 1 and where a 1 = P W 1,1,1 1,1,1 n x 1 x 2 x 4 x 6 x 10 x 11 x 13, a 2 = P W 1,1,m 3+1 1,1,m 3 +1 n x 1 x 3 x 4 x 8 x 10 x 11 x 14, a 3 = P W 1,m 2+1,1 1,m 2 +1,1 n x 1 x 2 x 5 x 7 x 10 x 12 x 13, a 4 = P W 1,m 2+1,m ,m 2 +1,m 3 +1 n x 1 x 3 x 5 x 9 x 10 x 12 x 14 and d 10 x 10 = P W 1,m 2,m 3 1,2,2 = x 10, d 11 x 11 = P W 1,1,m 3 1,1,2 = x 11, d 12 x 12 = P W 1,m 2+1,m 3 1,m 2 +1,2 = x 12, d 13 x 13 = P W 1,m 2,1 1,2,1 = x 13, d 14 x 14 = P W 1,m 2,m ,2,m 3 +1 = x 14. Similarly, to compute A 2, we use with A 2 = y 15 y 16 y 17 y 18 y 19 x 15 =0 x 16 =0 x 17 =0 x 18 =0 x 19 =0 8 i=5 a i 19 j=15 d j x j, and where y 15 = n x 1 x 2 x 4 x 6 m 2 1m 3 1r, y 16 = [n x 1 x 4 x 15 x 2 + x 6 x 3 + x 8 ] m 3 1r, y 17 = [n x 1 x 5 x 15 x 2 + x 7 x 3 + x 9 ] m 3 1r, y 18 = [n x 1 x 2 x 15 x 4 + x 6 x 5 + x 7 ] m 2 1r, y 19 = [n x 1 x 3 x 15 x 4 + x 8 x 5 + x 9 ] m 2 1r a 5 = P W m 1+1,1,1 m 1 +1,1,1 n x 1 x 2 x 4 x 6 x 15 x 16 x 18, a 6 = P W m 1+1,1,m 3 +1 m 1 +1,1,m 3 +1 n x 1 x 3 x 4 x 8 x 15 x 16 x 19, a 7 = P W m 1+1,m 2 +1,1 m 1 +1,m 2 +1,1 n x 1 x 2 x 5 x 7 x 15 x 17 x 18, a 8 = P W m 1+1,m 2 +1,m 3 +1 m 1 +1,m 2 +1,m 3 +1 n x 1 x 3 x 5 x 9 x 15 x 17 x 19,

147 A.3. Supplement for Section and d 15 x 15 = P W m 1+1,m 2,m 3 m 1 +1,2,2 = x 15, d 16 x 16 = P W m 1+1,1,m 3 m 1 +1,1,2 = x 16, d 17 x 17 = P W m 1+1,m 2 +1,m 3 m 1 +1,m 2 +1,2 = x 17, d 18 x 18 = P W m 1+1,m 2,1 m 1 +1,2,1 = x 18, d 19 x 19 = P W m 1+1,m 2,m 3 +1 m 1 +1,2,m 3 +1 = x 19. Remark A.3.1. The above expressions remain valid in the i.i.d. Poisson model, the only dierence is that the upper bounds that appear in the sums the y j 's do not contain the minimum part. For example, consider the case of Qm 1, m 2 +1, m The upper bounds y 1 to y 5 that appear in the formula, now become y 1 = n, y 2 = n x 1, y 3 = n x 1, y 4 = n x 1 x 2 x 3, y 5 = n x 1 x 2 x 3.

148

149 Appendix B Supplementary material for Chapter 3 B.1 Proof of Lemma From the mean value theorem in two dimensions we have Hx 1, y 1 Hx 2, y 2 = Hx, y x x 1 x 2 + Hx, y y 1 y 2, B.1 y where x, y is a point on the segment x 1, y 1 x 2, y 2. Notice that Hx, y x Hx, y y = 2 [ 1 + x y + 2x y 2] m 12x y [1 + 4x y] [1 + x y + 2x y 2 ] m, B.2 = [ 1 + x y + 2x y 2] + m 12x y [1 + 4x y] [1 + x y + 2x y 2 ] m B.3 and that when y i x i for i {1, 2}, we have y x since both points, x 1, y 1 and x 2, y 2, are lying between 0x and the rst bisector the same is true for any point on the segment determined by them. Applying Bernoulli inequality to the denominator in Eqs.B.2 and B.3 [ 1 + x y + 2x y 2 ] m 1 + m 1 [ x y + 2x y 2 ], B.4 we see, by elementary calculations study of the sign of the equation of degree 2, that Hx, y x 4x y2 m 2 + x y [4xm 1 + m 3] + mx x m [x y + 2x y 2 ] { m 1, for 3 m 5 B.5 m 2, for m 6. In the same way, Hx, y y 2x y2 2m 3 + x y [4xm 1 + m 2] + mx x m [x y + 2x y 2 ] { m 1, for 3 m 5 B.6 m 2, for m 6. and in combination with Eq.B.5, we obtain the requested result.

150 134 Appendix B. Supplementary material for Chapter 3 B.2 Proof of Lemma Obviously, the random variables Y i1,...,i d follow a multivariate normal distribution since are equal, by denition, with a sum of i.i.d normals X s1,...,s d N µ, σ 2, 1 s j T j, j {1,..., d}. Thus, the mean is, clearly, equal with µ = m 1 m d µ. We show that the covariance matrix is given by the formula m 1 i 1 j 1 m d i d j d σ 2, i s j s < m s Cov [Y i1,...,i d, Y j1,...,j d ] = s {1,..., d}, 0, otherwise. B.7 Take u q = i q j q, for each q {1,..., d}, and observe that if there is an index r {1,..., d} such that u r m r, then the random variables Y i1,...,i d and Y j1,...,j d are independent they do not share any random variable X s1,...,s d, so the covariance is zero. Assume now that u q m q 1, for all q {1,..., d}. We observe that the two random variables, Y i1,...,i d and Y j1,...,j d, can be rewritten as Y i1,...,i d = Z j 1,...,j d i 1,...,i d + X, B.8 Y j1,...,j d = Z j 1,...,j d i 1,...,i d + Y, B.9 where i 1 j 1 +m 1 1 i d j d +m d 1 Z j 1,...,j d i 1,...,i d = X s1,...,s d s 1 =i 1 j 1 +u 1 s d =i d j d +u d B.10 and X = Y i1,...,i d Z j 1,...,j d i 1,...,i d and Y = Y j1,...,j d Z j 1,...,j d i 1,...,i d, respectively. Since X s1,...,s d N µ, σ 2 are independent, 1 s j T j, j {1,..., d}, the random variables Z j 1,...,j d i 1,...,i d, X and Y are pairwise independent. The result now follows from the following simple property: Fact B.2.1. Assume that W 1 and W 2 are two random variables such that W 1 = X +Y, W 2 = X +Z, where X, Y and Z are pairwise independent random variables. Then, the covariance between W 1 and W 2 satises Cov [W 1, W 2 ] = V ar [X]. B.11 From the foregoing relation and since the variance of Z j 1,...,j d i 1,...,i d is given by [ ] V ar Z j 1,...,j d i 1,...,i d = m 1 u 1 m d u d σ 2, B.12 we conclude that if u q m q 1 for all q {1,..., d}, then Cov [Y i1,...,i d, Y j1,...,j d ] = m 1 i 1 j 1 m d i d j d σ 2. B.13

B.3 Validity of the algorithm in Example

To verify the validity of the algorithm proposed by [Homan and Ribak, 1991], it suffices to show that \tilde{W}_i ~ N(\mu_{w_i}(t), \Sigma_{w_i}(t)), where

\mu_{w_i}(t) = E[W_i] + \frac{1}{Var[Z_l]}\,Cov[W_i, Z_l]\,(t - E[Z_l]),   (B.14)

\Sigma_{w_i}(t) = Cov(W_i) - \frac{1}{Var[Z_l]}\,Cov[W_i, Z_l]\,Cov^T[W_i, Z_l].   (B.15)

Clearly, from the definition, the random vector \tilde{W}_i follows a multivariate normal distribution and satisfies the relation

\tilde{W}_i = W_i + \mu_{w_i}(t) - E[W_i \mid Z_l].   (B.16)

By conditioning on Z_l in Eq. (B.16), we have

E[\tilde{W}_i \mid Z_l] = E[W_i \mid Z_l] + \mu_{w_i}(t) - E[W_i \mid Z_l] = \mu_{w_i}(t),   (B.17)

hence, by taking the mean,

E[\tilde{W}_i] = E\left[E[\tilde{W}_i \mid Z_l]\right] = \mu_{w_i}(t).   (B.18)

For the covariance matrix, we notice that

Cov[\tilde{W}_i \mid Z_l] = Cov\left[W_i + \mu_{w_i}(t) - E[W_i \mid Z_l] \mid Z_l\right] = Cov\left[W_i - E[W_i \mid Z_l] \mid Z_l\right],   (B.19)

and considering the (r, q) element, we have

Cov[\tilde{W}_i \mid Z_l](r, q) = E\left[(Z_r - E[Z_r \mid Z_l])(Z_q - E[Z_q \mid Z_l]) \mid Z_l\right] - E\left[Z_r - E[Z_r \mid Z_l] \mid Z_l\right] E\left[Z_q - E[Z_q \mid Z_l] \mid Z_l\right]
= E[Z_r Z_q \mid Z_l] - E[Z_r \mid Z_l]E[Z_q \mid Z_l] - E[Z_q \mid Z_l]E[Z_r \mid Z_l] + E[Z_r \mid Z_l]E[Z_q \mid Z_l]
= E[Z_r Z_q \mid Z_l] - E[Z_r \mid Z_l]E[Z_q \mid Z_l] = \Sigma_{w_i}(t)(r, q),   (B.20)

since the posterior covariance is independent of the specific observations, from Eq. (B.15). By taking the average in Eq. (B.19) and using the last relation, we get

Cov[\tilde{W}_i] = \Sigma_{w_i}(t).   (B.21)
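For concreteness, here is a minimal Matlab sketch of the [Homan and Ribak, 1991] correction in the scalar-constraint setting used above; all numerical values are hypothetical. An unconditional draw of the pair (W_i, Z_l) is shifted into a draw from the conditional law, and the empirical moments are compared with Eqs. (B.14) and (B.15).

% Hoffman-Ribak correction (sketch), one scalar constraint Z_l = t.
muW = [0; 0]; Sw = [1 0.5; 0.5 1];             % hypothetical E[W_i], Cov(W_i)
c = [0.7; 0.3]; vZ = 1; muZ = 0; t = 2;        % Cov[W_i,Z_l], Var[Z_l], E[Z_l], observed value
S = [Sw c; c' vZ];                             % joint covariance of (W_i, Z_l)
L = chol(S, 'lower');
nrep = 1e5; Wt = zeros(2, nrep);
for r = 1:nrep
    u = [muW; muZ] + L*randn(3, 1);            % unconditional draw of (W_i, Z_l)
    Wt(:, r) = u(1:2) + c/vZ*(t - u(3));       % corrected sample, cf. Eq. (B.16)
end
mu_t  = muW + c/vZ*(t - muZ);                  % Eq. (B.14)
Sig_t = Sw - c*c'/vZ;                          % Eq. (B.15)
fprintf('mean error %.3f, cov error %.3f\n', ...
        norm(mean(Wt, 2) - mu_t), norm(cov(Wt') - Sig_t));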

B.4 Proof of Lemma

The result in Lemma is a direct consequence of the following proposition, by an extension to the d-dimensional setting:

Proposition B.4.1. Let 2 \le m \le N - 1 be positive integers and X_1,...,X_N be independent and identically distributed normal random variables with mean \mu and variance \sigma^2. If i \in \{1,...,N-m+1\}, then, conditionally given \sum_{j=i}^{i+m-1} X_j = t, the random variables X_s, s \ne i, are jointly distributed as the random variables

\tilde{X}_s = \left[\frac{t - m\mu}{m} - \frac{1}{m + \sqrt{m}} \sum_{j=i+1}^{i+m-1} (X_j - \mu)\right] 1_{\{i+1,...,i+m-1\}}(s) + X_s,   (B.22)

where 1_A(x) is the indicator of the set A, i.e. 1_A(x) = 1 if x \in A and zero otherwise.

For simplicity, we will prove the result for standard normal X_j ~ N(0, 1) random variables, since the general case will be straightforward after the usual change of variable X_j \mapsto \mu + \sigma X_j. Thus, we consider X_1,...,X_N to be i.i.d. standard normal random variables and we want to show that

\tilde{X}_s = \left[\frac{t}{m} - \frac{1}{m + \sqrt{m}} \sum_{j=i+1}^{i+m-1} X_j\right] 1_{\{i+1,...,i+m-1\}}(s) + X_s

are jointly distributed as

(X_1,...,X_{i-1}, X_{i+1},...,X_N) \;\Big|\; \sum_{j=i}^{i+m-1} X_j = t.   (B.23)

To simplify the notations, let f_{\tilde{X}} and g_X be the joint density functions corresponding to the random vectors \tilde{X} = (\tilde{X}_1,...,\tilde{X}_{i-1},\tilde{X}_{i+1},...,\tilde{X}_N) and X = (X_1,...,X_{i-1},X_{i+1},...,X_N), respectively. Also, denote by \Phi(x) the density function of a standard normal random variable. By the change of variable formula in N - 1 dimensions, we have

f_{\tilde{X}}(u_1,...,u_{i-1},u_{i+1},...,u_N) = g_X\!\left(h^{-1}(u_1,...,u_{i-1},u_{i+1},...,u_N)\right) \left|\det(J_{h^{-1}})\right|,   (B.24)

where the function h is given by

h(v_1,...,v_{i-1},v_{i+1},...,v_N) = \left(v_1,...,v_{i-1},\ \alpha - \beta \sum_{j=i+1}^{i+m-1} v_j + v_{i+1},\ ...,\ \alpha - \beta \sum_{j=i+1}^{i+m-1} v_j + v_{i+m-1},\ v_{i+m},...,v_N\right),   (B.25)

with \alpha = t/m, \beta = 1/(m + \sqrt{m}), and where the Jacobian matrix of h^{-1} is equal to

J_{h^{-1}}(x) = \frac{\partial h^{-1}}{\partial u}(x).   (B.26)

From Eq. (B.25), we obtain that h^{-1} has the form

h^{-1}(u_1,...,u_{i-1},u_{i+1},...,u_N) = \left(u_1,...,u_{i-1},\ -\frac{t}{\sqrt{m}} + \frac{1}{\sqrt{m}+1} \sum_{j=i+1}^{i+m-1} u_j + u_{i+1},\ ...,\ -\frac{t}{\sqrt{m}} + \frac{1}{\sqrt{m}+1} \sum_{j=i+1}^{i+m-1} u_j + u_{i+m-1},\ u_{i+m},...,u_N\right),   (B.27)

hence, by simple calculations, the Jacobian is equal to

\det(J_{h^{-1}}) = \sqrt{m}.   (B.28)

Since the random variables X_1,...,X_N are independent, the joint density function g_X is the product of the standard normal densities, that is,

g_X(v_1,...,v_{i-1},v_{i+1},...,v_N) = \Phi(v_1) \cdots \Phi(v_{i-1}) \Phi(v_{i+1}) \cdots \Phi(v_N),   (B.29)

thus, from Eqs. (B.24), (B.27) and (B.28), we obtain

f_{\tilde{X}}(u_1,...,u_{i-1},u_{i+1},...,u_N) = \sqrt{m}\, \Phi(u_1) \cdots \Phi(u_{i-1}) \Phi(u_{i+m}) \cdots \Phi(u_N) \prod_{s=i+1}^{i+m-1} \Phi\!\left(u_s - \frac{t}{\sqrt{m}} + \frac{1}{\sqrt{m}+1} \sum_{j=i+1}^{i+m-1} u_j\right).   (B.30)

Taking the exponent of the joint density function f_{\tilde{X}}, denoted Exp(f_{\tilde{X}}), we can write

Exp(f_{\tilde{X}}) = -\frac{1}{2}\left[\sum_{\substack{s=1 \\ s \notin \{i,...,i+m-1\}}}^{N} u_s^2 + \sum_{s=i+1}^{i+m-1} \left(u_s - \frac{t}{\sqrt{m}} + \frac{1}{\sqrt{m}+1} \sum_{j=i+1}^{i+m-1} u_j\right)^2\right].   (B.31)

Denoting A = \sum_{j=i+1}^{i+m-1} u_j, c = \frac{A}{\sqrt{m}+1} - \frac{t}{\sqrt{m}} and S = \sum_{s=i+1}^{i+m-1} (u_s + c)^2, we have

S = \sum_{s=i+1}^{i+m-1} u_s^2 + 2cA + (m-1)c^2.   (B.32)

Since

(m-1)c = (\sqrt{m}-1)A - \frac{(m-1)t}{\sqrt{m}}, \quad \text{hence} \quad 2A + (m-1)c = (\sqrt{m}+1)A - \frac{(m-1)t}{\sqrt{m}},   (B.33)

2cA + (m-1)c^2 = c\left[2A + (m-1)c\right] = \left(\frac{A}{\sqrt{m}+1} - \frac{t}{\sqrt{m}}\right)\left((\sqrt{m}+1)A - \frac{(m-1)t}{\sqrt{m}}\right),   (B.34)

and, after expanding the product,

2cA + (m-1)c^2 = A^2 - 2At + \frac{(m-1)t^2}{m},   (B.35)

we deduce, by substituting Eqs. (B.33), (B.34) and (B.35) in Eq. (B.32), that

S = \sum_{s=i+1}^{i+m-1} u_s^2 + \left(\sum_{j=i+1}^{i+m-1} u_j - t\right)^2 - \frac{t^2}{m},   (B.36)

so the exponent Exp(f_{\tilde{X}}) becomes

Exp(f_{\tilde{X}}) = -\frac{1}{2}\left[\sum_{\substack{s=1 \\ s \ne i}}^{N} u_s^2 + \left(\sum_{j=i+1}^{i+m-1} u_j - t\right)^2 - \frac{t^2}{m}\right].   (B.37)

Combining Eqs. (B.30) and (B.37), we get that the density function f_{\tilde{X}} is given by

f_{\tilde{X}}(u_1,...,u_{i-1},u_{i+1},...,u_N) = \left(\frac{1}{\sqrt{2\pi}}\right)^{N-1} \sqrt{m}\, e^{Exp(f_{\tilde{X}})}.   (B.38)

We note that the joint density function f of the random variables X_s, s \ne i, conditionally given \sum_{j=i}^{i+m-1} X_j = t, is by definition

f(y_1,...,y_{i-1},y_{i+1},...,y_N) = \frac{f_1(y_1,...,y_{i-1},t,y_{i+1},...,y_N)}{f_2(t)},   (B.39)

where f_1 is the density function of (X_1,...,X_{i-1}, \sum_{j=i}^{i+m-1} X_j, X_{i+1},...,X_N) and f_2 is the density of \sum_{j=i}^{i+m-1} X_j. Since \sum_{j=i}^{i+m-1} X_j ~ N(0, m), we have

f_2(t) = \frac{1}{\sqrt{2\pi m}}\, e^{-t^2/(2m)}.   (B.40)

The density f_1 is given, after applying the change of variable, by

f_1(y_1,...,y_{i-1},t,y_{i+1},...,y_N) = \bar{g}\!\left(\bar{h}^{-1}(y_1,...,y_N)\right) \left|\det(J_{\bar{h}^{-1}})\right|,   (B.41)

where \bar{g} is the density of the vector (X_1,...,X_N) and

\bar{h}^{-1}(y_1,...,y_N) = \left(y_1,...,y_{i-1},\ y_i - \sum_{j=i+1}^{i+m-1} y_j,\ y_{i+1},...,y_N\right).   (B.42)

We can easily see that \det(J_{\bar{h}^{-1}}) = 1 and, from the independence of the X_s,

\bar{g}(v_1,...,v_N) = \Phi(v_1) \cdots \Phi(v_N).   (B.43)

Substituting Eqs. (B.40)-(B.43) in (B.39), we obtain

f = \Phi(y_1) \cdots \Phi(y_{i-1})\, \Phi\!\left(t - \sum_{j=i+1}^{i+m-1} y_j\right) \Phi(y_{i+1}) \cdots \Phi(y_N) \Big/ \left[\frac{1}{\sqrt{2\pi m}}\, e^{-t^2/(2m)}\right]
= \left(\frac{1}{\sqrt{2\pi}}\right)^{N-1} \sqrt{m}\, \exp\left\{-\frac{1}{2}\left[\sum_{\substack{s=1 \\ s \ne i}}^{N} y_s^2 + \left(\sum_{j=i+1}^{i+m-1} y_j - t\right)^2 - \frac{t^2}{m}\right]\right\},   (B.44)

which is exactly the formula in Eq. (B.38), which is what we needed to show.

Remark B.4.2. It is interesting to notice that the result in Proposition B.4.1 can be generalized by conditioning on a linear combination instead of just the sum. The proof follows the lines of the above proof and we omit it.
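The representation of Proposition B.4.1 can also be checked empirically. The following Matlab sketch (an illustration with hypothetical parameters) builds the variables of Eq. (B.23) for the standard normal case and compares their first two moments with the exact conditional moments given the window sum, namely mean t/m, variance 1 - 1/m and pairwise covariance -1/m.

% Monte Carlo check (sketch) of the representation (B.22)-(B.23), standard normal case.
N = 10; m = 4; i = 3; t = 2;                   % hypothetical parameters
beta = 1/(m + sqrt(m));
nrep = 1e5; Xt = zeros(m-1, nrep);             % tilde-X_s for s = i+1,...,i+m-1
for r = 1:nrep
    X = randn(N, 1);
    A = sum(X(i+1:i+m-1));                     % sum over the window, X_i excluded
    Xt(:, r) = t/m - beta*A + X(i+1:i+m-1);    % Eq. (B.23)
end
C = cov(Xt');                                  % empirical covariance of the in-window variables
fprintf('mean %.3f (exact %.3f), var %.3f (exact %.3f), cov %.3f (exact %.3f)\n', ...
        mean(Xt(1,:)), t/m, C(1,1), 1 - 1/m, C(1,2), -1/m);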


Appendix C

A Matlab graphical user interface for discrete scan statistics

We present here a graphical user interface (GUI), developed in Matlab R, that allows the user to estimate the distribution of the discrete scan statistics under different scenarios. The purpose of this GUI application is to illustrate the accuracy of the existing methods used to approximate the distribution of the scan statistic. We consider the cases of one-, two- and three-dimensional scan statistics over a random field distributed according to a Bernoulli, binomial, Poisson or Gaussian law. In the particular situation of the one-dimensional scan statistic, we have also included a moving average of order q model. We should emphasize that almost all of the numerical results included in this thesis were obtained with the help of this GUI program.

C.1 How to use the interface

[Figure C.1: The Scan Statistics Simulator GUI — (a) The Input Window; (b) The Output Window]
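To give a flavour of the quantity the simulator estimates, the stand-alone Matlab sketch below (an illustration, not an excerpt from the GUI code) approximates the distribution function of the one-dimensional discrete scan statistic over a sequence of i.i.d. Bernoulli trials by crude Monte Carlo; the GUI couples this kind of estimation with the approximation and importance sampling methodology of the previous chapters. All parameter values are hypothetical.

% Crude Monte Carlo estimate (sketch) of P(S_m(T) <= k) for the one-dimensional
% discrete scan statistic over i.i.d. Bernoulli(p) trials.
T = 100; m = 10; p = 0.1; k = 4; nrep = 1e4;   % hypothetical parameters
S = zeros(nrep, 1);
for r = 1:nrep
    X = double(rand(T, 1) < p);                % Bernoulli(p) sequence
    w = conv(X, ones(m, 1), 'valid');          % all moving sums of length m
    S(r) = max(w);                             % the scan statistic
end
fprintf('P(S_%d(%d) <= %d) is approximately %.4f\n', m, T, k, mean(S <= k));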
