EXTENSIONS TO THE EXPONENTIALLY WEIGHTED MOVING AVERAGE PROCEDURES


CHAO AN-KUO
(MS, National Tsing Hua University)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF INDUSTRIAL AND SYSTEMS ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2016

Supervisor: Professor Tang Loon Ching
Examiners: Dr Chen Nan; Associate Professor Ng Szu Hui; Professor Fugee Tsung, Hong Kong University of Science and Technology

Declaration

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Chao An-Kuo
December

Acknowledgements

I would like to express my sincere gratitude to my supervisor, Professor Tang Loon Ching, whose immense knowledge, academic support, and patience have been a source of invaluable guidance, constant encouragement, and great inspiration throughout my Ph.D. study at the National University of Singapore. I wish to thank my thesis committee members, Professor Chen Nan and Professor Ng Szu Hui, for their insights and helpful suggestions. I would also like to thank present and past members of the computing lab for the fun and support during the last five years. Many thanks go to my parents and my sister for the support they have provided me throughout my life, and to my girlfriend, Ding Lan, for her understanding, care, and encouragement.

Contents

1 Introduction
  1.1 Change Detection
  1.2 Change Detection Procedures
  1.3 Scope of Research
2 Literature Review
  2.1 Statistical Formulation
  2.2 Change Detection Procedures
    2.2.1 The CUSUM Procedure
    2.2.2 The SR Procedure
    2.2.3 The EWMA Procedure
  2.3 Conclusion
3 EWMA Procedures for Specified Pre-change Parameters
  3.1 Maximum Weighted Log-likelihood
  3.2 EWMA Procedures
  3.3 An Example
4 EWMA Procedures for Partially Specified Pre-change Parameters
  4.1 Constrained Maximum Likelihood
  4.2 Partial EWMA Procedures
  4.3 An Example
5 Conclusion
A Regularity Conditions
B Chapter 3 Proofs
  B.1 Proof of Theorem
  B.2 Proof of Proposition
  B.3 Proof of Proposition
C Details of the Simulation
  C.1 Detection Threshold
  C.2 SADD
D Chapter 4 Proofs
  D.1 Proof of Theorem
  D.2 Proof of Proposition

Summary

Change detection procedures have been successfully implemented in many industries. Among them, the cumulative sum (CUSUM) procedure, the Shiryaev-Roberts (SR) procedure, and the exponentially weighted moving average (EWMA) procedure are the most popular. A common feature of these procedures is that their detection statistics can be expressed in recursive formulae. This ensures both that they can be performed from the arrival of the first observation and that the computational cost and the memory requirement for evaluating their detection statistics do not increase over time.

However, in many applications, the post-change distribution is not specified. Though the three procedures have been modified under this assumption, their practical use may be limited because their detection statistics cannot be expressed in recursive formulae. Consequently, they may not be able to detect early changes, and their computational cost grows without bound as time goes by. Recently, Zhang et al. (2010a) proposed an EWMA control chart for simultaneously monitoring the mean and covariance among a sequence of multivariate normal vectors. As their detection statistic can be evaluated recursively, it seems that the use of the EWMA procedure may not be limited under certain conditions.

This thesis presents a simple approach for developing EWMA procedures under the assumption that the pre-change distribution is specified and the post-change distribution is not specified. By applying the maximum likelihood framework to the exponentially weighted log-likelihood, we establish four EWMA procedures based on the generalized likelihood-ratio test, the Lagrange multiplier test, the Wald test, and the gradient test. All these EWMA procedures are asymptotically equivalent to the MEWMA control chart (Lowry et al., 1992) when no change occurs. In other words, they reach the same conclusion in a neighbourhood of the pre-change parameters from an asymptotic viewpoint. In addition, we show that the proposed EWMA statistics can be recursively evaluated if the distribution belongs to the exponential family.

In practice, one may be interested in detecting changes in only a subset of the parameters. To fulfil this need, we further propose the partial EWMA (PEWMA) procedures. Four PEWMA procedures are developed in a similar manner under the assumption that the pre-change distribution is partially specified. These PEWMA procedures are also asymptotically equivalent to the MEWMA control chart when no change occurs, and their detection statistics can be evaluated recursively if the distribution belongs to the exponential family. The strength of the PEWMA procedures is that they are relatively robust against changes in the nuisance parameters and might be a remedy when the pre-change distribution is inferred based on limited historical data.

List of Tables

1.1 Possible scenarios for the pre- and post-change distributions
The detection thresholds of the EWMA procedures
The ARL2FA of the EWMA procedures using the MEWMA control limits
The SADD of the EWMA procedures for various shifts in mean ( 2 =0)
The SADD of the EWMA procedures for various shifts in variance ( 1 =0)
The detection thresholds of the EWMA and the PEWMA procedures
The ARL2FA of the EWMA and the PEWMA procedures for various shifts in variance (µ =0)
The SADD of the EWMA and the PEWMA procedures for various shifts in mean ( 2 =1)

List of Figures

2.1 Single-run and multi-cyclic change detection
A geometric illustration of the four tests
The SADDs of the EWMA procedures for small shifts
The SADDs of the EWMA procedures for small shifts (Continued)

Chapter 1 Introduction

Change detection is an important task in the area of industrial and systems engineering. Any system or process may undergo an unexpected change in its normal operating state, causing undesirable consequences. As such a change is difficult to predict and prevent, it is necessary to identify it as soon as possible after its occurrence so that appropriate actions can be taken in a timely manner.

During the past decades, a large number of change detection procedures have been proposed in the literature. Many of them have been successfully implemented in a wide range of applications, such as statistical quality control, signal processing, navigation, epidemiology, computer network security, and financial security (Montgomery, 2012; Basseville and Nikiforov, 1993; Tartakovsky et al., 2014). However, the use of change detection procedures could be limited when the pre- or post-change distribution is not specified. This is because

the parameters of the distribution are unknown and need to be estimated, and the change detection procedures cannot be performed until a sufficient amount of data has been collected for estimation. This problem could be more severe when the distribution is complex and involves many parameters. For example, when monitoring the mean with unknown covariance among a sequence of normal vectors, one may find that the data are insufficient for estimating the covariance.

In this research, we propose methods to address this problem. To define the problem, we begin with an introduction to change detection in Section 1.1. After that, we briefly review the existing methods in Section 1.2. Finally, we present the scope of this research in Section 1.3.

1.1 Change Detection

Change detection concerns determining whether or not a change in the operating state of a process has occurred. In general, the operating state is inferred from quantitative observations or measurements collected sequentially from the process. As long as the behaviour of the observations is consistent with the normal operating state, one is content to let the process continue. Once the state changes and becomes abnormal, one should stop the process and make the necessary adjustments. In other words, whenever a new observation arrives, one is faced with a choice between continuing and

stopping the process. The decision must be made in real time, and the challenge is that the time instant at which the change occurs is not known in advance.

Table 1.1: Possible scenarios for the pre- and post-change distributions.

Scenario  Pre-change distribution  Post-change distribution
1         Specified                Specified
2         Specified                Unspecified
3         Unspecified              Unspecified
4         Partially specified      Unspecified

The behaviour of the observations can usually be described by a probability distribution. In the above context, the observations collected before and after the change follow two distinct distributions. Table 1.1 shows four possible scenarios for the pre- and post-change distributions. In the first scenario, one has a complete understanding of the normal operating state and aims to detect a particular type of change. The second scenario assumes that there are multiple possible changes and it is not known what type of change will occur. In the third scenario, one does not know what the normal operating state is. These three scenarios have been considered in the literature. In this research, we further consider the fourth scenario, in which one wishes to detect changes in some parameters and hopes the result would

not be affected by changes in other parameters. This means that the parameters are not of the same importance and some of them are nuisance parameters. We illustrate this with an example. Suppose we want to detect changes in mean with unknown variance among a sequence of normal observations. In practice, the variance is estimated from historical data. If the historical data are not sufficient, the estimated variance may be inaccurate. When such an inaccurate variance is used, it may appear that a change in variance has occurred at the beginning of detection. Consequently, the result would be affected even if no change in mean occurs. To avoid this situation, we should treat the variance as a nuisance parameter. From this example, we see that the fourth scenario should be considered if the parameters cannot be accurately estimated. In the subsequent section, we review change detection procedures in the first three scenarios.

1.2 Change Detection Procedures

The change detection problem was first addressed by Shewhart (1931) for statistical quality control. Noticing that changes in the operating state of a manufacturing process may increase process variation, he invented the control chart to raise an alarm when a change occurs. Based on the alarm, engineers can reduce the process variation by investigating and eliminating

the assignable causes of the change. The Shewhart control chart is popular in many industries due to its effectiveness and simplicity.

However, the Shewhart control chart is not efficient for detecting persistent changes because the decision depends solely on the current sample. To improve efficiency, statisticians have established three sequential change detection procedures, namely the cumulative sum (CUSUM) procedure (Page, 1954), the Shiryaev-Roberts (SR) procedure (Shiryaev, 1963; Roberts, 1966), and the exponentially weighted moving average (EWMA) procedure (Roberts, 1959). These procedures make full use of the available data, and thus they perform better than the Shewhart control chart in terms of minimizing detection delay. All these procedures have two attractive properties when the pre- and post-change distributions are specified. First, their detection statistics can be evaluated with as few as one observation. This allows us to perform these procedures from the arrival of the first observation. Second, their detection statistics have the Markov (memoryless) property. This means that these statistics can be evaluated recursively. As a result, the memory requirement and the computational cost for evaluating them do not increase over time.

The CUSUM and SR procedures have been extended to the case where the post-change distribution is not specified. Specifically, the CUSUM procedure is modified by adding a supremum over the post-change parameters

(Barnard, 1959; Lorden, 1971), and the SR procedure is modified by incorporating a prior on the post-change parameters into the Bayesian framework (Pollak, 1987). These modifications are quite straightforward, but their practical use could be limited. The modified CUSUM procedure needs to estimate the unknown post-change parameters, and hence it cannot be performed until a sufficient amount of data has been collected. The modified SR procedure needs a prior on the post-change parameters, which can be difficult to determine if the distribution has many parameters. Moreover, their detection statistics do not have the Markov property. In consequence, the memory requirement and the computational cost rise as time goes by. It should be noted that the CUSUM and SR procedures can be further extended to unspecified and partially specified pre-change distributions, but these extensions still suffer from the same problems.

Recently, Zhang et al. (2010a) proposed an EWMA control chart for simultaneously monitoring the mean and covariance among a sequence of multivariate normal vectors. Their approach retains the two attractive properties when the post-change distribution is unspecified. It seems that the EWMA procedure may not be limited by the requirement for estimation. This motivates us to explore the possibilities of the EWMA procedure.

In the literature, many EWMA control charts have been published, but a systematic approach for developing the EWMA procedure has not been well established. Traditionally the EWMA is regarded as a smoothing technique

for observations. Tartakovsky et al. (2014) defined the EWMA with respect to the logarithm of the likelihood ratios of the observations. To be more general, Zhou et al. (2012) defined the exponentially weighted log-likelihood and developed the EWMA procedure by the generalized likelihood-ratio test. However, in order to perform the likelihood-ratio test, one must estimate both the pre- and post-change distributions. In statistics, there are other hypothesis tests that approximate the likelihood-ratio test but require that only one distribution be estimated. It is worth considering those tests.

1.3 Scope of Research

This dissertation presents a simple approach for developing EWMA procedures under the assumption that the pre-change distribution is completely specified and the post-change distribution is not specified. We first apply the maximum likelihood framework to the exponentially weighted log-likelihood and study the asymptotic properties of the maximum weighted log-likelihood estimator (MWLE). Based on these properties, we construct four EWMA procedures based on the generalized likelihood-ratio test, the Lagrange multiplier test (also known as the score test), the Wald test, and the gradient test. We show that these EWMA procedures are asymptotically equivalent to the multivariate EWMA (MEWMA) control chart (Lowry et al., 1992) when no change occurs. This allows us to apply the MEWMA

control limits as the detection thresholds of the proposed procedures. Furthermore, we show that if the distribution belongs to the exponential family, these EWMA procedures can be performed from the arrival of the first observation and their detection statistics can be recursively evaluated. Even if the distribution does not belong to the exponential family, the EWMA procedure based on the Lagrange multiplier test still has these two properties.

The four EWMA procedures are designed to detect changes in all parameters. In some applications, one may be interested in only a subset of the parameters. To achieve this, we extend the previous approach to the partially specified pre-change distribution. We develop four partial EWMA (PEWMA) procedures based on the four hypothesis tests. These PEWMA procedures are also asymptotically equivalent to the MEWMA control chart. In addition, they have the two attractive properties when the distribution belongs to the exponential family.

The contributions of this thesis are summarized below. First, this thesis provides a systematic approach for developing the EWMA and PEWMA procedures. All these procedures have the two attractive properties when the distribution belongs to the exponential family. Second, we show that the EWMA and PEWMA procedures are closely related to the MEWMA control chart from an asymptotic viewpoint. This provides a sketch of the behaviour of the proposed procedures. Third, the PEWMA procedures could be a remedy for a lack of knowledge about the pre-change distribution. As the PEWMA procedures are not sensitive to changes in the nuisance parameters, they may be applied when the nuisance parameters cannot be well estimated. The derivations in this thesis are based on the standard regularity conditions listed in Appendix A.

The rest of the thesis is organized as follows. In Chapter 2, we give a formal description of the change detection problem and review the change detection procedures. Chapter 3 presents a systematic framework for developing EWMA procedures under the assumption that the pre-change distribution is completely specified and the post-change distribution is unspecified. After that, in Chapter 4, we construct the PEWMA procedures under the assumption that the pre-change distribution is partially specified. Chapter 5 gives a conclusion.

Chapter 2 Literature Review

From the previous chapter, we see that the use of change detection procedures might be limited in some situations. To clearly define the problem, we present the statistical formulation of the change detection problem. After that, we review three main streams of change detection procedures. A conclusion is given at the end of this chapter.

2.1 Statistical Formulation

Change detection is usually achieved by analysing data collected one at a time from the process. From a statistical viewpoint, such data can be regarded as a realization of a stochastic process. The objective of change detection is to check whether the probability distribution of the stochastic process has changed.
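As an informal illustration of data arriving one at a time from a process whose distribution changes, the following sketch generates such a realization. It is only a toy under stated assumptions: the observations are taken to be independent normal variables with unit variance whose mean shifts after a fixed change point, and the name `observation_stream` and its parameters are hypothetical, not part of the thesis.

```python
import random
from itertools import islice
from typing import Iterator


def observation_stream(change_point: int, pre_mean: float = 0.0,
                       post_mean: float = 1.0, seed: int = 0) -> Iterator[float]:
    """Yield X_1, X_2, ... one at a time; the mean shifts after `change_point`.

    Assumption for illustration only: unit-variance normal observations.
    """
    rng = random.Random(seed)
    t = 0
    while True:
        t += 1
        # Observations up to and including the change point are pre-change.
        mean = pre_mean if t <= change_point else post_mean
        yield rng.gauss(mean, 1.0)


if __name__ == "__main__":
    xs = list(islice(observation_stream(change_point=50), 100))
    print("pre-change sample mean: ", sum(xs[:50]) / 50)
    print("post-change sample mean:", sum(xs[50:]) / 50)
```

A generator is used so that a detection procedure can consume the observations sequentially, without knowing in advance how many will be needed.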

The stochastic feature can be characterized by the change point model. Let $X_1, X_2, \ldots$ be a sequence of independent random variables and let $\nu$ denote the change point such that $X_1, X_2, \ldots, X_\nu$ have one distribution and $X_{\nu+1}, X_{\nu+2}, \ldots$ have another distribution. In particular, $\nu = 0$ means that all random variables follow the post-change distribution, and $\nu = \infty$ means that all random variables follow the pre-change distribution. Assume that the pre- and post-change distributions belong to an identifiable family of distributions with a probability density function $f(x; \theta)$, where $\theta$ is a $p$-dimensional vector of parameters. Identifiability implies that a distribution is specified if and only if its parameters are specified. Let $\theta_0$ and $\theta_1$ denote the pre- and post-change parameters, respectively. The joint probability density function of $X_1, X_2, \ldots, X_t$ is

\[
p_\nu(x_1, x_2, \ldots, x_t) =
\begin{cases}
\prod_{i=1}^{t} f(x_i; \theta_1) & \text{if } \nu = 0, \\
\prod_{i=1}^{\nu} f(x_i; \theta_0) \prod_{i=\nu+1}^{t} f(x_i; \theta_1) & \text{if } \nu = 1, 2, \ldots, t-1, \\
\prod_{i=1}^{t} f(x_i; \theta_0) & \text{if } \nu = t, t+1, \ldots
\end{cases}
\tag{2.1}
\]

The probability and expectation with respect to $p_\nu$ are denoted by $P_\nu$ and $E_\nu$, respectively. Readers are referred to Tartakovsky et al. (2014) for a more general change point model where the random variables are not independent and identically distributed.

In general, a change detection procedure is a sequential decision-making

process consisting of a detection statistic and a decision rule. At every time instant $t$ the detection statistic $\Lambda_t$ is evaluated based on the available observations $x_1, x_2, \ldots, x_t$. If, without loss of generality, $\Lambda_t$ is greater than a predetermined detection threshold $h$, then the procedure raises an alarm that a change has occurred. The time instant at which the alarm is triggered is a random stopping time, denoted by $T = \min\{t \ge 1 : \Lambda_t > h\}$. We say that an alarm is a correct detection if $T > \nu$; otherwise, it is a false alarm. It should be noted that every change detection procedure corresponds to a stopping time.

In practice, change detection procedures are applied under two scenarios: single-run change detection and multi-cyclic change detection. Single-run change detection assumes that the change detection procedure is performed only once. The result is either a correct detection or a false alarm. What takes place beyond the stopping time is of no concern. Figure 2.1 (a) shows two possible trajectories of the detection statistic. It can be seen that the solid line exceeds the detection threshold after the change point, leading to a correct detection. In this case one is often interested in the detection delay $T - \nu$. On the other hand, the dashed line goes beyond the detection threshold before the change point, resulting in a false alarm. The corresponding stopping time $T$ is the run length to false alarm.

[Figure 2.1: Single-run and multi-cyclic change detection. Panel (a): single-run change detection. Panel (b): multi-cyclic change detection. Horizontal axis: time $t$; vertical axis: detection statistic $\Lambda_t$, with the detection threshold, start of surveillance, change point, false alarms, and correct detection marked.]

Multi-cyclic change detection assumes that the change detection procedure is applied repeatedly. Specifically, the change detection procedure is renewed from scratch after each false alarm until a change is correctly detected. Figure 2.1 (b) gives an example. It can be seen that the change detection procedure gives multiple false alarms prior to the correct detection. In some applications, it is acceptable to have many false alarms before the correct detection because the cost of a false alarm is much less than the cost of the detection delay. In this case, the change point is considerably greater than the average run length between consecutive false alarms. As

the change is preceded by a stationary flow of false alarms, this case is called change detection under the stationary regime. In this thesis, change detection procedures will be compared under this regime.

When applying a change detection procedure, one must consider two types of risk: the risk associated with a false alarm and the risk associated with detection delay. Page (1954) suggested measuring the risk associated with a false alarm by the average run length to false alarm (ARL2FA)

\[
\mathrm{ARL2FA}(T) = E_\nu[T \mid T \le \nu] = E_\infty T, \tag{2.2}
\]

where the second equality holds since $p_\nu(x_1, x_2, \ldots, x_t) = p_\infty(x_1, x_2, \ldots, x_t)$ for any $t \le \nu$. From Figure 2.1, the ARL2FA is reasonable for both single-run change detection and multi-cyclic change detection. To measure the risk associated with detection delay, Lorden (1971) considered the worst-case scenario and defined the worst-case average delay to detection (WADD)

\[
\mathrm{WADD}(T) = \sup_{\nu \ge 0} \operatorname*{ess\,sup}_{X_1, X_2, \ldots, X_\nu} E_\nu[T - \nu \mid T > \nu, X_1, X_2, \ldots, X_\nu], \tag{2.3}
\]

where $\operatorname{ess\,sup}$ denotes the essential supremum. It can be seen that the conditional average delay to detection $E_\nu[T - \nu \mid T > \nu]$ is first maximized over all possible trajectories of the observations up to the change point and then over the change points. Thus the WADD is rather conservative. Another measure of the risk associated with detection delay is developed for change detection under the stationary regime. Consider a sequence of independent repetitions of the stopping time $T^{(1)}, T^{(2)}, \ldots$ Denote the first stopping time

after the change point by $T^{(N_\nu)}$, where $N_\nu = \min\{t \ge 1 : \sum_{i=1}^{t} T^{(i)} > \nu\}$. As the change point is considerably large, the risk associated with detection delay can be measured by the stationary average delay to detection (SADD)

\[
\mathrm{SADD}(T) = \lim_{\nu \to \infty} E_\nu\left[\sum_{i=1}^{N_\nu} T^{(i)} - \nu\right]. \tag{2.4}
\]

A good change detection procedure should guarantee small values of the WADD or the SADD while keeping the ARL2FA above a certain level. In other words, we seek a change detection procedure $T$ that minimizes $\mathrm{WADD}(T)$ or $\mathrm{SADD}(T)$ subject to the constraint $E_\infty T \ge \gamma$, where $\gamma \ge 1$ is the target value of the ARL2FA. Recall that the stopping time of a change detection procedure is $T = \min\{t \ge 1 : \Lambda_t > h\}$. Hence the detection threshold $h$ is usually obtained by solving $E_\infty T = \gamma$.

In Section 1.1, we saw that there are four possible scenarios for the pre- and post-change distributions. In the next section, we review change detection procedures in these scenarios.

2.2 Change Detection Procedures

In general, there are three change detection procedures for detecting persistent changes. They are the cumulative sum (CUSUM) procedure, the Shiryaev-Roberts (SR) procedure, and the exponentially weighted moving average (EWMA) procedure. These procedures are developed based on different concepts, which are summarized in the following subsections.
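The threshold calibration described at the end of Section 2.1 (choosing $h$ so that the ARL2FA equals its target) rarely has a closed form, so one common option is Monte Carlo simulation. The sketch below is a minimal illustration under stated assumptions, not the simulation method of this thesis: the pre-change distribution is taken to be N(0, 1), the detection statistic is a CUSUM-type recursion for a shift to N(1, 1) used purely as a placeholder, and the function names, the bisection bracket, and the number of runs are all hypothetical choices.

```python
import random


def run_length(h, rng, shift=1.0, max_t=100_000):
    """Return T = min{t >= 1 : W_t > h} on pre-change N(0, 1) data.

    W_t is a CUSUM-type statistic built from the log-likelihood ratio of
    N(shift, 1) against N(0, 1); it stands in for a generic detection statistic.
    """
    w = 0.0
    for t in range(1, max_t + 1):
        x = rng.gauss(0.0, 1.0)                  # no change ever occurs
        llr = shift * x - 0.5 * shift * shift    # log f(x; theta_1) - log f(x; theta_0)
        w = max(0.0, w) + llr
        if w > h:
            return t
    return max_t  # run censored at max_t (biases the estimate downward)


def estimate_arl2fa(h, n_runs=400, seed=0):
    """Monte Carlo estimate of the average run length to false alarm for threshold h."""
    total = 0
    for r in range(n_runs):
        # Re-seeding per run reuses the same noise for every h (common random
        # numbers), which makes the estimate non-decreasing in h.
        total += run_length(h, random.Random(seed + r))
    return total / n_runs


def calibrate_threshold(target, lo=0.0, hi=6.0, tol=0.05, n_runs=400, seed=0):
    """Bisect on h until the bracket around the target ARL2FA is narrower than tol.

    Assumes the target lies between the estimates at `lo` and `hi`.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if estimate_arl2fa(mid, n_runs, seed) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    h = calibrate_threshold(target=100.0)
    print(f"h = {h:.2f}, estimated ARL2FA = {estimate_arl2fa(h):.1f}")
```

Because every run is driven by the same random numbers regardless of $h$, the estimated ARL2FA is a monotone function of the threshold, which is what makes plain bisection safe here. With independent noise per evaluation, a stochastic root-finding scheme or a larger number of runs would be the more cautious choice.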

2.2.1 The CUSUM Procedure

The CUSUM procedure is developed based on the change point model (2.1). To determine whether a change has occurred before time instant $t$, it suffices to test $H_0: \nu \ge t$ against $H_1: \nu < t$. As the change point is an unknown parameter, this can be achieved by the generalized likelihood-ratio test. The generalized likelihood-ratio statistic is

\[
\max_{0 \le k < t} \frac{p_k(x_1, x_2, \ldots, x_t)}{p_\infty(x_1, x_2, \ldots, x_t)}. \tag{2.5}
\]

It can be seen that the worst possible outcome of the change point is considered. Page (1954) defined the CUSUM statistic by the logarithm of the generalized likelihood-ratio statistic

\[
\mathrm{CUSUM}_t = \max_{1 \le k \le t} \sum_{i=k}^{t} \log \frac{f(x_i; \theta_1)}{f(x_i; \theta_0)}. \tag{2.6}
\]

The CUSUM statistic can easily be evaluated by the recursive formula

\[
\mathrm{CUSUM}_t = \max\{0, \mathrm{CUSUM}_{t-1}\} + \log \frac{f(x_t; \theta_1)}{f(x_t; \theta_0)}, \tag{2.7}
\]
\[
\mathrm{CUSUM}_0 = 0. \tag{2.8}
\]

From the generalized likelihood-ratio test, a large value of the CUSUM statistic is in favour of the alternative hypothesis. Thus the stopping time of the CUSUM procedure is defined by $T_{\mathrm{CUSUM}} = \min\{t \ge 1 : \mathrm{CUSUM}_t > h\}$.

The CUSUM procedure has three nice properties. First, it is optimal in terms of minimizing the WADD (Moustakides, 1986). Second, it can

be performed from the arrival of the first observation. Third, its detection statistic can be evaluated recursively.

The CUSUM procedure is developed under the assumption that both the pre- and post-change distributions are specified. When the post-change distribution is not specified, the generalized likelihood-ratio statistic for testing $H_0: \nu \ge t$ against $H_1: \nu < t$ becomes

\[
\max_{0 \le k \le t-u} \sup_{\theta_1 \in \Theta} \frac{p_k(x_1, x_2, \ldots, x_t)}{p_\infty(x_1, x_2, \ldots, x_t)}, \tag{2.9}
\]

where $\Theta$ denotes the parameter space and $u$ is a positive integer ensuring that all the suprema are finite. By taking the logarithm of this statistic, Barnard (1959) and Lorden (1971) defined the generalized CUSUM (GCUSUM) statistic as

\[
\mathrm{GCUSUM}_t = \max_{u \le k \le t} \sup_{\theta \in \Theta} \sum_{i=t-k+1}^{t} \log \frac{f(x_i; \theta)}{f(x_i; \theta_0)}. \tag{2.10}
\]

In the literature, the GCUSUM procedure is often called the GLR control chart, but in this thesis we do not use this name, to avoid confusion.

The GCUSUM procedure, however, does not have the three properties of the CUSUM procedure. In particular, it cannot be performed for $t < u$. This limits the use of the GCUSUM procedure as it cannot detect early changes. This problem becomes more severe if the distribution is complex and has many parameters. In addition, the GCUSUM statistic cannot be evaluated recursively. In consequence, the memory requirement and the computational cost grow without bound as time goes on. To alleviate this problem, Willsky and Jones (1976) introduced the window-limited approach, where the maximization is performed not over all available observations but in a sliding window of size $w$. The window-limited GCUSUM (WLGCUSUM) statistic is defined by

\[
\mathrm{WLGCUSUM}_t = \max_{u \le k < u+w} \sup_{\theta \in \Theta} \sum_{i=t-k+1}^{t} \log \frac{f(x_i; \theta)}{f(x_i; \theta_0)}. \tag{2.11}
\]

The WLGCUSUM procedure has found widespread applications in navigation and signal processing (Basseville and Nikiforov, 1993). However, how to choose the window size $w$ appropriately has remained a difficult problem. Lai and Shan (1999) showed that an adequate choice is $w = O(\log \gamma)$ for a finite target ARL2FA $\gamma \ge 1$. But this result is too vague to be used in practice.

When the pre-change distribution is not specified, the GCUSUM procedure can be further modified by adding another supremum over the pre-change parameters. Nevertheless, this modification still suffers from the two problems.

2.2.2 The SR Procedure

The SR procedure was constructed based on the generalized Bayesian principle with a uniform improper prior. Suppose that the change point has a geometric prior

\[
P(\nu = k) = \rho (1 - \rho)^k, \quad k = 0, 1, \ldots, \tag{2.12}
\]

where $\rho \in (0, 1)$ is the parameter. At time instant $t$, the posterior probability of the event that a change has occurred is

\[
P(\nu < t \mid x_1, x_2, \ldots, x_t) = \frac{\sum_{k=0}^{t-1} \rho (1 - \rho)^k \, p_k(x_1, x_2, \ldots, x_t)}{\sum_{k=0}^{\infty} \rho (1 - \rho)^k \, p_k(x_1, x_2, \ldots, x_t)}. \tag{2.13}
\]

From the change point model (2.1), we know that $p_k(x_1, x_2, \ldots, x_t) = p_\infty(x_1, x_2, \ldots, x_t)$ for all $k \ge t$. This implies that

\[
P(\nu < t \mid x_1, x_2, \ldots, x_t) = \frac{R^{\rho}_t}{R^{\rho}_t + \frac{1}{\rho}}, \tag{2.14}
\]

where

\[
R^{\rho}_t = \sum_{k=1}^{t} \prod_{i=k}^{t} \frac{1}{1 - \rho} \frac{f(x_i; \theta_1)}{f(x_i; \theta_0)}. \tag{2.15}
\]

Shiryaev (1963) suggested raising an alarm if the posterior probability is high or, equivalently, if $R^{\rho}_t$ is large. By taking $\rho \to 0$, Roberts (1966) defined the SR statistic as

\[
\mathrm{SR}_t = \sum_{k=1}^{t} \prod_{i=k}^{t} \frac{f(x_i; \theta_1)}{f(x_i; \theta_0)}. \tag{2.16}
\]

The SR statistic can be written in a recursive fashion

\[
\mathrm{SR}_t = (1 + \mathrm{SR}_{t-1}) \frac{f(x_t; \theta_1)}{f(x_t; \theta_0)}, \tag{2.17}
\]
\[
\mathrm{SR}_0 = 0. \tag{2.18}
\]

The stopping time of the SR procedure is $T_{\mathrm{SR}} = \min\{t \ge 1 : \mathrm{SR}_t > h\}$.

It should be noted that the SR procedure can be viewed as an approximation of the CUSUM procedure. Taking the logarithm of the SR statistic

yields

\[
\log \mathrm{SR}_t = \log \sum_{k=1}^{t} \exp\left( \sum_{i=k}^{t} \log \frac{f(x_i; \theta_1)}{f(x_i; \theta_0)} \right), \tag{2.19}
\]

where the LogSumExp function is a smooth approximation to the maximum function, mainly used by machine learning algorithms. Replacing the LogSumExp in (2.19) by the maximum over $k$ gives exactly the CUSUM statistic (2.6).

Similar to the CUSUM procedure, the SR procedure also has three nice properties. First, it is optimal in terms of minimizing the SADD (Pollak and Tartakovsky, 2009). Second, it can be performed from the arrival of the first observation. Third, its detection statistic can be evaluated recursively.

The SR procedure is developed under the assumption that both the pre- and post-change distributions are specified. If the post-change distribution is not specified, Pollak (1987) suggested incorporating a prior on the post-change parameters, $\pi(\theta)$, into the Bayesian framework. The posterior probability of the event that a change occurs before time instant $t$ becomes

\[
P^{\rho,\pi}(\nu < t \mid x_1, x_2, \ldots, x_t) = \frac{R^{\rho,\pi}_t}{R^{\rho,\pi}_t + \frac{1}{\rho}}, \tag{2.20}
\]

where

\[
R^{\rho,\pi}_t = \int_{\theta \in \Theta} \sum_{k=1}^{t} \prod_{i=k}^{t} \frac{1}{1 - \rho} \frac{f(x_i; \theta)}{f(x_i; \theta_0)} \, \pi(\theta) \, d\theta. \tag{2.21}
\]

Taking $\rho \to 0$ yields the weighted SR (WSR) statistic

\[
\mathrm{WSR}_t = \int_{\theta \in \Theta} \sum_{k=1}^{t} \prod_{i=k}^{t} \frac{f(x_i; \theta)}{f(x_i; \theta_0)} \, \pi(\theta) \, d\theta. \tag{2.22}
\]

Although this modification is quite straightforward, the use of the WSR procedure is often limited for two reasons. On one hand, it could be difficult

to choose the prior appropriately, especially when the distribution has many parameters. On the other hand, the WSR statistic cannot be evaluated recursively. Consequently, the memory requirement and computational cost increase over time. Similar to the GCUSUM procedure, this problem can be alleviated by the window-limited approach (Willsky and Jones, 1976). Considering the latest $w$ observations, we can define the window-limited WSR (WLWSR) statistic as

\[
\mathrm{WLWSR}_t = \int_{\theta \in \Theta} \sum_{k=1}^{w} \prod_{i=t-k+1}^{t} \frac{f(x_i; \theta)}{f(x_i; \theta_0)} \, \pi(\theta) \, d\theta, \tag{2.23}
\]

where $w$ is the size of the sliding window. Again, how to choose $w$ appropriately is a difficult problem.

When the pre-change distribution is not specified, the WSR procedure can be further modified by introducing another prior on the pre-change parameters. Nonetheless, the use of this modification is still limited for the same reasons.

2.2.3 The EWMA Procedure

The EWMA procedure is built on the idea of allocating different weights to observations according to their chronological order (Roberts, 1959). Specifically, the highest weight is given to the most recent observation, and the weight of an observation decreases gradually as it gets older. From information theory, the information carried by an observation is associated with its log-likelihood. To combine information, it is reasonable to consider the exponentially weighted log-likelihood

\[
Q_t(\theta) = \sum_{i=1}^{t} \lambda (1 - \lambda)^{t-i} \log f(x_i; \theta) + (1 - \lambda)^t E_0[\log f(x; \theta)], \tag{2.24}
\]

where $\lambda \in (0, 1)$ is the smoothing parameter and $E_0$ is the expectation with respect to $f(x; \theta_0)$. It should be noted that $E_0[\log f(x; \theta)]$ can be regarded as the log-likelihood of a pseudo observation sampled from the pre-change distribution. This term does not play an important role in detecting changes as its coefficient $(1 - \lambda)^t$ decreases exponentially as $t$ increases. However, it ensures the stability of the exponentially weighted log-likelihood in the sense that $E_\nu[Q_t(\theta)] = E_0[\log f(x; \theta)]$ for all $t \le \nu$.

The exponentially weighted log-likelihood was first defined by Zhou et al. (2012). Here we make a slight change to the weight of the pseudo observation, from $\lambda (1 - \lambda)^t$ to $(1 - \lambda)^t$, so that the sum of the weights is always one. It can be seen that the exponentially weighted log-likelihood makes full use of all available data as the weight of each observation never reaches zero. Before the change point, the exponentially weighted log-likelihood consists entirely of pre-change observations. Once a change occurs, the proportion of the pre-change observations in the exponentially weighted log-likelihood, $(1 - \lambda)^{t-\nu}$, decreases exponentially as $t$ increases. Eventually, the exponentially weighted log-likelihood would consist almost entirely of the

post-change observations. Therefore, to determine whether a change has occurred, it suffices to test $H_0: \theta = \theta_0$ at every time instant $t$.

When the pre- and post-change distributions are specified, we can apply the likelihood-ratio test to test $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$. The EWMA statistic is defined by the logarithm of the likelihood-ratio statistic
\[
\mathrm{EWMA}_t = Q_t(\theta_1) - Q_t(\theta_0) = \sum_{i=1}^{t} \lambda (1-\lambda)^{t-i} \log \frac{f(x_i; \theta_1)}{f(x_i; \theta_0)} + (1-\lambda)^{t} E_0\left[ \log \frac{f(X; \theta_1)}{f(X; \theta_0)} \right]. \tag{2.25}
\]
The EWMA statistic can be easily evaluated through the recursive formula
\[
\mathrm{EWMA}_t = \lambda \log \frac{f(x_t; \theta_1)}{f(x_t; \theta_0)} + (1-\lambda)\, \mathrm{EWMA}_{t-1}, \tag{2.26}
\]
\[
\mathrm{EWMA}_0 = E_0\left[ \log \frac{f(X; \theta_1)}{f(X; \theta_0)} \right]. \tag{2.27}
\]
From the likelihood-ratio test, a large value of the EWMA statistic is in favour of the alternative hypothesis. Therefore the stopping time of the EWMA procedure is defined by
\[
T_{\mathrm{EWMA}} = \min\left\{ t \geq 1 : \mathrm{EWMA}_t > h \right\}.
\]

To apply the EWMA procedure, one has to choose an appropriate value for the smoothing parameter $\lambda$. Such a value may be found by minimizing a risk measure associated with detection delay. For example, for detecting mean shifts with known variance among a sequence of normal observations, Srivastava and Wu (1993) showed that the smoothing parameter $\lambda$ has a unique value that minimizes the SADD. This value can be obtained
through the Markov chain approach (Brook and Evans, 1972) or by solving integral equations (Crowder, 1987).

It should be noted that the EWMA procedure does not consider the stochastic feature of the change point model (2.1). Hence it is not surprising that the EWMA procedure is not optimal in terms of minimizing the WADD or the SADD. Nevertheless, Lucas and Saccucci (1990) claimed that the EWMA procedure is quite competitive in most practical situations. Moreover, the EWMA procedure has the other two nice properties. Specifically, it can be performed from the arrival of the first observation, and its detection statistic can be evaluated recursively.

When the post-change distribution is unspecified, it suffices to test $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$. This can be achieved by the generalized likelihood-ratio test. By taking the logarithm of the generalized likelihood-ratio statistic, Zhou et al. (2012) defined the generalized EWMA (GEWMA) statistic as
\[
\mathrm{GEWMA}_t = \sup_{\theta \in \Theta} Q_t(\theta) - Q_t(\theta_0) = \sup_{\theta \in \Theta} \left\{ \sum_{i=1}^{t} \lambda (1-\lambda)^{t-i} \log \frac{f(x_i; \theta)}{f(x_i; \theta_0)} + (1-\lambda)^{t} E_0\left[ \log \frac{f(X; \theta)}{f(X; \theta_0)} \right] \right\}. \tag{2.28}
\]
Similar to the GCUSUM procedure, the GEWMA statistic may require at least $u$ observations for the supremum to be finite. This means that the
GEWMA procedure may not be performed for $t < u$, and thus cannot detect early changes. Moreover, the GEWMA statistic does not have a recursive expression, resulting in a computational burden. Though this problem can be alleviated by the window-limited approach (Willsky and Jones, 1976), choosing an appropriate window size is still a difficult problem.

However, in the literature, many EWMA control charts have been proposed under various distribution assumptions, such as the normal distribution (Roberts, 1959; Lowry et al., 1992; MacGregor and Harris, 1993; Hawkins and Maboudou-Tchao, 2008; Zhang et al., 2010b,a), the exponential distribution (Gan, 1998), the Bernoulli distribution (Yeh et al., 2008), and the Poisson distribution (Borror et al., 1998; Zhou et al., 2012). These control charts can be performed from $t = 1$ and their detection statistics can be evaluated recursively. It seems that the GEWMA procedure has these two properties under certain conditions. This motivates us to further explore the possibility of the EWMA procedure.

Moreover, when the pre-change distribution is not specified, it seems that the GEWMA procedure cannot be further modified. This is because the exponentially weighted log-likelihood (2.24) contains the pre-change parameters. If the pre-change parameters are completely unknown, then the exponentially weighted log-likelihood is unidentifiable. However, if the pre-change parameters are partially specified, it seems that we have a chance to develop a change detection procedure. To the best of my knowledge, this
topic has not been studied in the literature.

2.3 Conclusion

From the above discussion, when the pre- and post-change distributions are specified, it is recommended to apply the CUSUM and SR procedures since they are optimal in terms of minimizing the WADD and SADD, respectively. When the post-change distribution is unspecified, these two procedures have been modified. However, the use of the modified procedures may be limited if the distribution is complex and has many parameters. Specifically, they may not be able to detect early changes, and the memory requirement and computational cost for evaluating their detection statistics grow without bound as time goes by. In contrast, we see that the GEWMA procedure does not suffer from these two problems for certain distributions. This motivates us to further explore the possibility of the EWMA procedure.

Since the GEWMA procedure is developed based on the generalized likelihood-ratio test, it requires that both the pre- and post-change distributions be estimated. In statistics, there are other hypothesis tests that approximate the generalized likelihood-ratio test but require that only one distribution be estimated. For example, the Lagrange multiplier test requires only the information under the null hypothesis. This could be useful as it need not estimate the post-change parameters. In Chapter 3, we develop
EWMA procedures based on those hypothesis tests.

Moreover, in some applications, the parameters are not of the same importance. This means that the distribution has nuisance parameters. We seek a change detection procedure that detects changes in the important parameters and whose performance is not affected by changes in the nuisance parameters. In Chapter 4, we make an attempt to address this problem. We extend the approach introduced in Chapter 3 to the scenario where the pre-change distribution is partially specified.
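
As a recap of the fully specified case in this chapter, the recursion (2.26)-(2.27) and the stopping rule can be sketched in a few lines of Python. This is only an illustrative sketch: the normal model with known standard deviation, the function names, and the numerical values of the data, the smoothing parameter and the threshold are assumptions made for the example and do not come from the thesis.

```python
def log_lr_normal(x, mu0, mu1, sigma):
    """Log-likelihood ratio log f(x; mu1) / f(x; mu0) for N(mu, sigma^2)."""
    return ((mu1 - mu0) * x - 0.5 * (mu1**2 - mu0**2)) / sigma**2


def ewma_path(xs, lam, mu0, mu1, sigma):
    """EWMA_1, ..., EWMA_t via the recursion (2.26), started from (2.27)."""
    # EWMA_0 = E_0[log LR]; for two normals this equals -(mu1 - mu0)^2 / (2 sigma^2).
    stat = -((mu1 - mu0) ** 2) / (2.0 * sigma**2)
    path = []
    for x in xs:
        stat = lam * log_lr_normal(x, mu0, mu1, sigma) + (1.0 - lam) * stat
        path.append(stat)
    return path


def ewma_direct(xs, lam, mu0, mu1, sigma):
    """EWMA_t computed from the weighted sum (2.25), to compare with the recursion."""
    t = len(xs)
    total = (1.0 - lam) ** t * (-((mu1 - mu0) ** 2) / (2.0 * sigma**2))
    for i, x in enumerate(xs, start=1):
        total += lam * (1.0 - lam) ** (t - i) * log_lr_normal(x, mu0, mu1, sigma)
    return total


def stopping_time(path, h):
    """First t >= 1 with EWMA_t > h, or None if the threshold is never crossed."""
    for t, stat in enumerate(path, start=1):
        if stat > h:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical data: the mean appears to move from 0 towards 1 midway.
    data = [0.1, -0.3, 0.2, 1.4, 1.1, 0.9, 1.6]
    path = ewma_path(data, lam=0.2, mu0=0.0, mu1=1.0, sigma=1.0)
    print(path[-1], ewma_direct(data, 0.2, 0.0, 1.0, 1.0))
    print(stopping_time(path, h=0.0))
```

The recursion stores a single number per time step, whereas the direct sum (2.25) needs the whole history; the two agree up to floating-point rounding.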

Chapter 3 EWMA Procedures for Specified Pre-change Parameters

This chapter presents a simple approach for developing EWMA procedures when the pre-change distribution is specified and the post-change distribution is not specified. We first define the exponentially weighted log-likelihood. As the post-change parameters are unknown, we apply the maximum likelihood framework to the exponentially weighted log-likelihood and derive mathematical properties of the maximum weighted log-likelihood estimator. Next, we construct four EWMA procedures based on the generalized likelihood-ratio test, the Lagrange multiplier test, the Wald test, and the gradient test. When no change occurs, these EWMA procedures
are asymptotically equivalent to the multivariate EWMA (MEWMA) control chart (Lowry et al., 1992). This suggests that the MEWMA control limit may be directly applied as the detection threshold under certain circumstances. We further show that the detection statistics of these EWMA procedures can be evaluated recursively when the distribution belongs to the exponential family. To illustrate the use of the proposed EWMA procedures, we apply them to simultaneously monitoring the mean and variance of a sequence of normally distributed observations.

3.1 Maximum Weighted Log-likelihood

From Subsection 2.2.3, we know that when the post-change distribution is unspecified, we shall test $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$. This can be achieved not only by the generalized likelihood-ratio test but also by other hypothesis tests that approximate it. These tests may require an estimator for the unknown post-change parameters. In the following, we define the maximum weighted log-likelihood estimator (MWLE) and study its asymptotic properties.

We begin by introducing some notation. The exponentially weighted score is defined by the gradient of the exponentially weighted log-likelihood
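(2.24) with respect to $\theta$:
\[
\psi_t(\theta) = \frac{\partial}{\partial \theta} Q_t(\theta) = \sum_{i=1}^{t} \lambda (1-\lambda)^{t-i} \psi(\theta; x_i) + (1-\lambda)^{t} E_0 \psi(\theta; X), \tag{3.1}
\]
where $\psi(\theta; x) = \frac{\partial}{\partial \theta} \log f(x; \theta)$ is the score of an observation $x$. The information $I(\theta)$ is defined by the covariance of $\psi(\theta; X)$. By the regularity condition (A8), it can be easily shown that
\[
E_{\theta} \psi(\theta; X) = 0 \tag{3.2}
\]
for any $\theta \in \Theta$ and
\[
I(\theta) = E_{\theta}\left[ \psi(\theta; X)\, \psi(\theta; X)' \right] = -E_{\theta}\left[ \frac{\partial^2}{\partial \theta\, \partial \theta'} \log f(X; \theta) \right], \tag{3.3}
\]
where $E_{\theta}$ denotes the expectation with respect to $f(x; \theta)$. Furthermore, the MWLE of $\theta$ at time instant $t$ is defined by
\[
\hat{\theta}_t = \arg\sup_{\theta \in \Theta} Q_t(\theta), \tag{3.4}
\]
which can often be obtained by solving the score equations $\psi_t(\theta) = 0$.

From (2.24), the exponentially weighted log-likelihood degenerates to the standard log-likelihood, where each observation receives equal weight, as $\lambda \to 0$. Thus it is not surprising that the MWLE has the same asymptotic properties as the maximum likelihood estimator (MLE). These properties are summarized in the theorem below. Its proof is given in Appendix B.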
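
To illustrate the definitions (3.1) and (3.4), the following Python sketch evaluates the exponentially weighted score for a Poisson model and solves the score equation in closed form. The Poisson choice, the function names and the sample values are assumptions made for this illustration only; the sketch relies on the Poisson score $x/\theta - 1$ and on $E_0 X = \theta_0$.

```python
def weights(lam, t):
    """Weights of x_1, ..., x_t and of the pseudo observation in (2.24)."""
    obs = [lam * (1.0 - lam) ** (t - i) for i in range(1, t + 1)]
    pseudo = (1.0 - lam) ** t
    return obs, pseudo


def weighted_score_poisson(theta, xs, lam, theta0):
    """Exponentially weighted score (3.1) for a Poisson(theta) model."""
    obs_w, pseudo_w = weights(lam, len(xs))
    # Score of one observation: d/dtheta log f(x; theta) = x / theta - 1.
    data_part = sum(w * (x / theta - 1.0) for w, x in zip(obs_w, xs))
    # Pseudo observation: E_0[X / theta - 1] = theta0 / theta - 1.
    return data_part + pseudo_w * (theta0 / theta - 1.0)


def mwle_poisson(xs, lam, theta0):
    """Root of the score equation: a weighted mean of the data and theta0."""
    obs_w, pseudo_w = weights(lam, len(xs))
    return sum(w * x for w, x in zip(obs_w, xs)) + pseudo_w * theta0


if __name__ == "__main__":
    counts = [3, 5, 4, 8, 9]  # hypothetical counts
    theta_hat = mwle_poisson(counts, lam=0.2, theta0=4.0)
    print(theta_hat, weighted_score_poisson(theta_hat, counts, 0.2, 4.0))
```

Because the weights sum to one, the score reduces to a weighted mean divided by $\theta$ minus one, so the estimate is well defined from the first observation onwards in this particular model.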

Theorem 3.1. Suppose that $x_1, x_2, \dots$ are independently sampled from $f(x; \theta)$, where $\theta$ denotes the true values of the parameters. Let $\xrightarrow{d}$ and $\xrightarrow{p}$ denote convergence in distribution and convergence in probability, respectively. As $\lambda \to 0$ and $\lambda t \to \infty$, we have the following results.
(a) $\hat{\theta}_t$ is consistent: $\hat{\theta}_t \xrightarrow{p} \theta$.
(b) $\psi_t(\theta)$ is asymptotically normal: $\sqrt{\frac{2-\lambda}{\lambda}}\, \psi_t(\theta) \xrightarrow{d} N\left(0, I(\theta)\right)$.
(c) $\hat{\theta}_t$ is asymptotically normal: $\sqrt{\frac{2-\lambda}{\lambda}}\, (\hat{\theta}_t - \theta) \xrightarrow{d} N\left(0, I^{-1}(\theta)\right)$.

It should be noted that the conditions $\lambda \to 0$ and $\lambda t \to \infty$ are stronger than $\lambda \to 0$ and $t \to \infty$. In fact, $\lambda t \to \infty$ is necessary because if $\lambda t = c$ for some $c > 0$, the coefficient of the pseudo sample in the exponentially weighted log-likelihood (2.24), $(1-\lambda)^{t}$, tends to $e^{-c}$ as $\lambda \to 0$, which means that the effect of the pseudo sample never vanishes.

3.2 EWMA Procedures

To test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ at every time instant $t$, we consider four well-known hypothesis tests, namely the generalized likelihood-ratio test (GLR), the Lagrange multiplier test (LM), the Wald test (W), and the gradient test (G). These tests are developed based on different measures of the distance between the null and the alternative hypotheses, which are briefly summarized below.
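
The role of the condition that the product of the smoothing parameter and $t$ diverges can be checked numerically. The Python sketch below is an illustration under the independence assumption of Theorem 3.1; the function names and the two growth rates chosen for $t$ are assumptions of this example. It uses the fact that summing the squared weights in (3.1) gives a variance of $\psi_t(\theta)$ equal to $I(\theta)$ times $\lambda\{1-(1-\lambda)^{2t}\}/(2-\lambda)$, so the rescaled score in part (b) has variance $\{1-(1-\lambda)^{2t}\}\, I(\theta)$.

```python
import math


def pseudo_weight(lam, t):
    """Weight (1 - lam)^t of the pseudo observation in (2.24)."""
    return (1.0 - lam) ** t


def scaled_variance_factor(lam, t):
    """Factor c with Var[sqrt((2 - lam) / lam) * psi_t(theta)] = c * I(theta).

    Obtained by summing the squared weights lam^2 (1 - lam)^(2 (t - i))
    over i = 1, ..., t and multiplying by (2 - lam) / lam.
    """
    return 1.0 - (1.0 - lam) ** (2 * t)


if __name__ == "__main__":
    print("lam, pseudo weight at lam*t = 2, exp(-2), variance factor when lam*t grows")
    for lam in (0.1, 0.01, 0.001):
        t_bounded = round(2.0 / lam)     # lam * t stays equal to 2
        t_growing = round(lam ** -1.5)   # lam * t = lam^(-1/2) grows as lam -> 0
        print(lam,
              pseudo_weight(lam, t_bounded),
              math.exp(-2.0),
              scaled_variance_factor(lam, t_growing))
```

When the product stays bounded, the pseudo observation keeps a weight close to $e^{-c}$ and the variance factor stays below one; when the product grows, the factor approaches one, which matches the limiting covariance $I(\theta)$.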

The generalized likelihood-ratio test (Neyman and Pearson, 1928a,b; Wald, 1943) is based on the difference between the maximum of the exponentially weighted log-likelihood under the alternative hypothesis and that under the null hypothesis. The GLR statistic is defined by
\[
\mathrm{GLR}_t = \frac{2(2-\lambda)}{\lambda} \left\{ Q_t(\hat{\theta}_t) - Q_t(\theta_0) \right\}. \tag{3.5}
\]

The Lagrange multiplier test (Silvey, 1959), also known as the score test (Rao and Bartlett, 1948), is derived from a constrained maximization principle. Maximizing the exponentially weighted log-likelihood subject to the constraint $\theta = \theta_0$ yields a set of Lagrange multipliers which measure the shadow price of the constraint. If the shadow price is high, the constraint should be rejected as inconsistent with the data. Specifically, the Lagrange function is $Q_t(\theta) - (\theta - \theta_0)' \kappa$, where $\kappa$ is a $p$-dimensional vector of Lagrange multipliers. Its first-order conditions are $\psi_t(\theta) = \kappa$ and $\theta = \theta_0$. The shadow price of the constraint $\theta = \theta_0$ is $\psi_t(\theta_0)$. As $\psi_t(\theta_0)$ is asymptotically normal from Theorem 3.1 (b), the LM statistic is defined by
\[
\mathrm{LM}_t = \frac{2-\lambda}{\lambda}\, \psi_t(\theta_0)'\, I^{-1}(\theta_0)\, \psi_t(\theta_0). \tag{3.6}
\]

The Wald test (Wald, 1943) is based on the deviation between the MWLE $\hat{\theta}_t$ and the pre-change parameters $\theta_0$. As $\hat{\theta}_t$ is asymptotically normal from Theorem 3.1 (c), the W statistic is defined by
\[
W_t = \frac{2-\lambda}{\lambda}\, (\hat{\theta}_t - \theta_0)'\, I(\hat{\theta}_t)\, (\hat{\theta}_t - \theta_0). \tag{3.7}
\]
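
The three statistics (3.5)-(3.7) can be evaluated side by side in a simple case. The Python sketch below assumes a scalar normal mean with known standard deviation; this model, the function names and the sample values are assumptions made for the illustration. In this case the MWLE is the exponentially weighted average of the data and the pre-change mean, and the Fisher information is $1/\sigma^2$.

```python
def _weights(lam, t):
    """Weights of x_1, ..., x_t and of the pseudo observation in (2.24)."""
    return [lam * (1.0 - lam) ** (t - i) for i in range(1, t + 1)], (1.0 - lam) ** t


def weighted_loglik_normal(theta, xs, lam, theta0, sigma):
    """Q_t(theta) for N(theta, sigma^2), omitting the constant -log(sigma sqrt(2 pi))."""
    obs_w, pseudo_w = _weights(lam, len(xs))
    q = sum(w * (-((x - theta) ** 2) / (2.0 * sigma**2)) for w, x in zip(obs_w, xs))
    # Pseudo observation: E_0[(X - theta)^2] = sigma^2 + (theta0 - theta)^2.
    q += pseudo_w * (-(sigma**2 + (theta0 - theta) ** 2) / (2.0 * sigma**2))
    return q


def three_statistics(xs, lam, theta0, sigma):
    """Return (GLR_t, LM_t, W_t) from (3.5)-(3.7) for a scalar normal mean."""
    obs_w, pseudo_w = _weights(lam, len(xs))
    # MWLE: root of the weighted score, a weighted mean of the data and theta0.
    theta_hat = sum(w * x for w, x in zip(obs_w, xs)) + pseudo_w * theta0
    scale = (2.0 - lam) / lam
    info = 1.0 / sigma**2                    # I(theta), free of theta here
    score0 = (theta_hat - theta0) / sigma**2  # psi_t(theta0)
    glr = 2.0 * scale * (weighted_loglik_normal(theta_hat, xs, lam, theta0, sigma)
                         - weighted_loglik_normal(theta0, xs, lam, theta0, sigma))
    lm = scale * score0 * (1.0 / info) * score0
    wald = scale * (theta_hat - theta0) * info * (theta_hat - theta0)
    return glr, lm, wald


if __name__ == "__main__":
    data = [0.3, -0.1, 1.2, 0.8, 1.5]  # hypothetical observations
    print(three_statistics(data, lam=0.2, theta0=0.0, sigma=1.0))
```

For this model the exponentially weighted log-likelihood is quadratic in $\theta$, so the three values agree up to floating-point rounding; for other distributions they generally differ in finite samples.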

Readers are referred to Buse (1982) and Engle (1984) for more details about these three tests.

Recently, Terrell (2002) proposed the gradient statistic, which shares the same asymptotic property as the previous three test statistics. The G statistic is defined by
\[
G_t = \frac{2-\lambda}{\lambda}\, \psi_t(\theta_0)'\, (\hat{\theta}_t - \theta_0). \tag{3.8}
\]
Since $\psi_t(\theta_0)$ is the shadow price of the constraint $\theta = \theta_0$, the gradient statistic can be viewed as a measure of the difference between $Q_t(\hat{\theta}_t)$ and $Q_t(\theta_0)$.

Figure 3.1: A geometric illustration of the four tests.

Figure 3.1 shows a geometric illustration of these tests. It can be seen that the generalized likelihood-ratio test and the gradient test are based on
the vertical difference between $Q_t(\hat{\theta}_t)$ and $Q_t(\theta_0)$, the Wald test is based on the horizontal difference between $\hat{\theta}_t$ and $\theta_0$, and the Lagrange multiplier test is based on the slope of $Q_t(\theta)$ at $\theta_0$. Their test statistics measure the distance between $H_0$ and $H_1$, and hence $H_0$ should be rejected if the observed test statistics are large.

The following proposition states the relationship between these statistics from an asymptotic viewpoint. Its proof is given in Appendix B.

Proposition 3.2. Under $H_0: \theta = \theta_0$, the difference between any two of $\mathrm{GLR}_t$, $\mathrm{LM}_t$, $W_t$, and $G_t$ converges in probability to zero as $\lambda \to 0$ and $\lambda t \to \infty$.

Proposition 3.2 tells us that the four tests are asymptotically equivalent under the null hypothesis. Furthermore, when no change occurs, we know from Theorem 3.1 (b) that $\psi_t(\theta_0)$ is asymptotically normal, implying that $\mathrm{LM}_t$ is asymptotically chi-squared distributed with $p$ degrees of freedom. This leads to the subsequent corollary.

Corollary 3.3. Under $H_0: \theta = \theta_0$, $\mathrm{GLR}_t$, $\mathrm{LM}_t$, $W_t$, and $G_t$ are asymptotically chi-squared distributed with $p$ degrees of freedom as $\lambda \to 0$ and $\lambda t \to \infty$.

Another useful property is functional invariance. Similar to the MLE, the MWLE is also invariant under any transformation of the parameters. Besides, the GLR and the LM statistics are invariant under certain transformations of the parameters. These results are summarized in the next proposition. Its proof is given in Appendix B.

Proposition 3.4. Let $h: \mathbb{R}^p \to \mathbb{R}^p$ be a function of $\theta$ such that $\varphi = h(\theta)$.
(a) $\hat{\varphi}_t = h(\hat{\theta}_t)$ is the MWLE for $\varphi = h(\theta)$.
(b) If $h$ is invertible, $\mathrm{GLR}_t$ is invariant.
(c) If $h$ is continuously differentiable and its Jacobian $J(\theta) = \frac{\partial}{\partial \theta'} h(\theta)$ is finite and non-singular at $\theta_0$, then $\mathrm{LM}_t$ is invariant.

From Proposition 3.4 (b) and (c), we need not consider transformations of the parameters when deriving the GLR and the LM statistics. In contrast, additional W and G statistics can be derived by transforming the parameters. These additional statistics also satisfy the asymptotic relationship stated in Proposition 3.2 and Corollary 3.3. The use of parameter transformations will be demonstrated later by an example.

Among the four statistics, the LM statistic can be obtained recursively. This is because, from (3.1), $\psi_t(\theta_0)$ can be evaluated through the recursive formula
\[
\psi_t(\theta_0) = \lambda\, \psi(\theta_0; x_t) + (1-\lambda)\, \psi_{t-1}(\theta_0), \tag{3.9}
\]
\[
\psi_0(\theta_0) = E_0 \psi(\theta_0; X) = 0, \tag{3.10}
\]
and $I(\theta_0)$ is a constant matrix with respect to $t$. Even if $I(\theta_0)$ does not have a closed-form expression, it can be well approximated by the sample
covariance matrix of a sufficient number of simulated $\psi(\theta_0; x)$. Hence, to evaluate the LM statistic, we only need to compute $I(\theta_0)$ at the beginning of surveillance and to update $\psi_t(\theta_0)$ at every time instant $t$. The other three statistics, however, may not be obtainable for $t < p$ due to the insufficiency of observations for computing the MWLE.

Below we show that if the distribution belongs to the exponential family, all four statistics can be evaluated recursively. The probability density function of a distribution in the exponential family can be expressed as
\[
f(x; \theta) = \exp\left\{ \eta(\theta)'\, T(x) - A(\theta) + B(x) \right\}, \tag{3.11}
\]
where $T(x)$ denotes the sufficient statistics. Note that $\eta(\theta)$ and $T(x)$ are $p$-dimensional vectors and $A(\theta)$ and $B(x)$ are scalars. The exponentially weighted log-likelihood can be written as
\[
Q_t(\theta) \propto \eta(\theta)'\, z_t - A(\theta), \tag{3.12}
\]
where $z_t$ is the exponentially weighted average of the sufficient statistics
\[
z_t = \sum_{i=1}^{t} \lambda (1-\lambda)^{t-i}\, T(x_i) + (1-\lambda)^{t} E_0 T(X). \tag{3.13}
\]
The term associated with $B(x)$ is omitted because it does not depend on $\theta$. The exponentially weighted score can be expressed as
\[
\psi_t(\theta) = \frac{\partial \eta(\theta)'}{\partial \theta}\, z_t - \frac{\partial A(\theta)}{\partial \theta}. \tag{3.14}
\]
The MWLE can be obtained by solving $\psi_t(\theta) = 0$, or equivalently, by
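solving
\[
g(\theta) \equiv \left[ \frac{\partial \eta(\theta)'}{\partial \theta} \right]^{-1} \frac{\partial A(\theta)}{\partial \theta} = z_t. \tag{3.15}
\]
Since $E_{\theta} \psi(\theta; X) = 0$, we have $E_{\theta} T(X) = g(\theta)$. This implies that the Fisher information has a closed-form expression
\[
I(\theta) = -E_{\theta}\left[ \frac{\partial^2}{\partial \theta\, \partial \theta'} \log f(X; \theta) \right] = -\sum_{j=1}^{p} g_j(\theta)\, \frac{\partial^2 \eta_j(\theta)}{\partial \theta\, \partial \theta'} + \frac{\partial^2 A(\theta)}{\partial \theta\, \partial \theta'}, \tag{3.16}
\]
where $\eta_j(\theta)$ and $g_j(\theta)$ are the $j$-th elements of $\eta(\theta)$ and $g(\theta)$, respectively.

Since $\theta_0$ is known, $Q_t(\theta_0)$ and $\psi_t(\theta_0)$ are functions of $z_t$, and $I(\theta_0)$ is a constant matrix with respect to $t$. Moreover, as $\hat{\theta}_t = g^{-1}(z_t)$ is a function of $z_t$, $Q_t(\hat{\theta}_t)$ and $I(\hat{\theta}_t)$ are also functions of $z_t$. These imply that the four statistics are all functions of $z_t$. In other words, the four statistics depend on the observations only through $z_t$. From (3.13), $z_t$ has a recursive expression
\[
z_t = \lambda\, T(x_t) + (1-\lambda)\, z_{t-1}, \tag{3.17}
\]
\[
z_0 = E_0 T(X) = g(\theta_0). \tag{3.18}
\]
Hence there is no need to store all historical data; it suffices to update the exponentially weighted average of the sufficient statistics at each time instant. This is reasonable since the sufficient statistics retain all the relevant information about the parameters. As a result, the memory requirement and the computational cost for evaluating the four statistics would not increase over time.

By treating the four statistics as detection statistics, we construct four EWMA procedures. For simplicity, we omit the EWMA and call them the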
