Adaptive estimation with change detection for streaming data


Adaptive estimation with change detection for streaming data

A thesis presented for the degree of Doctor of Philosophy of the University of London and the Diploma of Imperial College

by

Dean Adam Bodenham

Department of Mathematics
Imperial College
180 Queen's Gate, London SW7 2AZ

NOVEMBER 2014

I certify that this thesis, and the research to which it refers, are the product of my own work, and that any ideas or quotations from the work of other people, published or otherwise, are fully acknowledged in accordance with the standard referencing practices of the discipline.

Signed:

Copyright

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

To my parents

Abstract

Data streams have become ubiquitous over the last two decades; potentially unending streams of continuously-arriving data occur in fields as diverse as medicine, finance, astronomy and computer networks. As the world changes, so the behaviour of these streams is expected to change. This thesis describes sequential methods for the timely detection of changes in data streams based on an adaptive forgetting factor framework. These change detection methods are first formulated in terms of detecting a change in the mean of a univariate stream, but this is later extended to the multivariate setting, and to detecting a change in the variance.

The key issues driving the research in this thesis are that streaming data change detectors must operate sequentially, using a fixed amount of memory and, after encountering a change, must continue to monitor for successive changes. We call this challenging scenario continuous monitoring to distinguish it from the traditional setting which generally monitors for only a single changepoint. Additionally, continuous monitoring demands that there be limited dependence on the setting of parameters controlling the performance of the algorithms.

One of the main contributions of this thesis is the development of an efficient, fully sequential change detector for the mean of a univariate stream in the continuous monitoring context. It is competitive with algorithms that are the benchmark in the single changepoint setting, yet our change detector only requires a single control parameter, which is easy to set. The multivariate extension provides similarly competitive performance results. These methods are applied to monitoring foreign exchange streams and computer network traffic.

Acknowledgements

First and foremost, I would like to thank my supervisors Niall Adams and Nicholas Heard for their incredible patience and support over the last four years. I am very grateful to Imperial College for funding my studies and providing me with a Roth Doctoral Fellowship. Without the financial support, this work would not have been possible. Thank you to Christoforos Anagnostopoulos for many helpful discussions and patient explanations, and for the chance to participate in some interesting projects. Thanks to my office mates, past and present, for all the advice, explanations, quick tips, shared lunches, and for keeping me thoroughly entertained. I consider myself fortunate to count many of you as friends. Finally, thanks to my parents and family, and Qing, for their love, support and patience over these past few years.

This thesis was typeset using LaTeX, and (almost all) tables and plots were produced in R using the xtable and ggplot2 [58] packages, respectively.

Dean Bodenham

Table of contents

Abstract
Glossary
1 Introduction
2 Background
 2.1 Change detection and streaming data
  2.1.1 Performance measures: the average run length
  2.1.2 Control charts
  2.1.3 Shewhart charts
  2.1.4 CUSUM
  2.1.5 EWMA
  2.1.6 Other sequential change detection methods
  2.1.7 Estimating the parameters of the known distribution
  2.1.8 Restarting an algorithm after a changepoint
3 Adaptive filtering for the mean
 3.1 Estimating the mean
 3.2 Fixed forgetting factor mean
  3.2.1 The effective sample size
  3.2.2 Expectation and variance of forgetting factor mean
  3.2.3 The relationship between $\bar{x}_{N,\lambda}$ and $\lambda$
  3.2.4 Sequential updating
  3.2.5 Relation to EWMA
 3.3 Adaptive forgetting factor
  3.3.1 Adaptive forgetting factor mean
  3.3.2 Updating $\lambda_N \to \lambda_{N+1}$
  3.3.3 Behaviour of $\vec{\lambda}$ when encountering a change
  3.3.4 Plotting the average behaviour of $\vec{\lambda}$
  3.3.5 Truncating the range of $\vec{\lambda}$
  3.3.6 Choice of step size
  3.3.7 Comparison of the AFF mean and the FFF mean
  3.3.8 Relation to Kalman Filter
 3.4 Optimal forgetting factors $\lambda$ and $\vec{\lambda}$
  3.4.1 Definition of optimal
  3.4.2 Solving for optimal fixed $\lambda$
  3.4.3 Results for optimal fixed $\lambda$: Example
  3.4.4 Optimal adaptive $\vec{\lambda}$
  3.4.5 Results for optimal $\vec{\lambda}$: Example
  3.4.6 Comparison of the optimal fixed and adaptive forgetting factors
  3.4.7 Discussion
4 Change detection using adaptive estimation
 4.1 Change detection using forgetting factors
  4.1.1 Normal streams
  4.1.2 Comparison between EWMA and FFF
  4.1.3 Performance of the AFF scheme for different choices of step size
 4.2 Distribution-free forgetting factor methods
  4.2.1 AFF-Chebyshev method
  4.2.2 Fixed-adaptive forgetting factor method
  4.2.3 Monitoring the AFF
  4.2.4 Summary of distribution-free AFF methods
 4.3 Experiments and results
 4.4 Discussion
5 Continuous Monitoring
 5.1 Detecting multiple changepoints in streaming data
  5.1.1 Recent approaches and continuous monitoring
 5.2 Performance measures
 5.3 A simulation study
  5.3.1 Classifying the detected changes
  5.3.2 Average run length for a data stream
  5.3.3 Choice of step size in the continuous monitoring context
 5.4 Experiments and results
 5.5 Foreign exchange data
 5.6 Discussion
6 Multivariate adaptive filtering and change detection
 6.1 Multivariate change detection in the literature
 6.2 Multivariate adaptive forgetting factor mean
  6.2.1 Adaptive forgetting factors for each stream
 6.3 Decision rules for multivariate change detection
  6.3.1 Assuming the streams are independent
  6.3.2 Estimating the covariance
  6.3.3 Taking the covariance into account: Brown's method
 6.4 A simulation study
  6.4.1 Experiments and results: independent streams
  6.4.2 Experiments and results: dependent streams
 6.5 Monitoring a computer network
  6.5.1 Change detection on NetFlow data
 6.6 Discussion
7 Approximating the cdf of a weighted sum of chi-squared random variables
 7.1 The cdf of a weighted sum of chi-squared random variables
  7.1.1 Approximations in a streaming data context
 7.2 Efficient approximate moment-matching methods
  7.2.1 Computing cumulants and moments
  7.2.2 Satterthwaite-Welch approximation
  7.2.3 Hall-Buckley-Eagleson approximation
  7.2.4 Wood F approximation
  7.2.5 Lindsay-Pilla-Basak approximation
 7.3 The evaluation of approximate methods for computing $F_{Q_N}$ in the literature
 7.4 Evaluating the performance of an approximate method for a cdf of a weighted sum of arbitrary random variables
  7.4.1 Performance of an approximate method for a particular distribution
  7.4.2 Estimating the accuracy of an approximate method for a specific N
 7.5 Results
  7.5.1 Accuracy
  7.5.2 Comparison to the normal approximation
  7.5.3 Speed of computation
  7.5.4 Weighted sums with a small number of terms
  7.5.5 Results for another choice of coefficients
 7.6 The preferred approximate method in a streaming data context
 7.7 Discussion
8 Variance Change Detection
 8.1 Adaptive estimation of the variance
  8.1.1 The fixed forgetting factor variance
  8.1.2 Sequential update equations for $s^2_{N,\lambda}$
  8.1.3 The relationship between $s^2_{N,\lambda}$ and $\sigma^2$
 8.2 Change detection for $s^2_{N,\lambda}$ assuming normality
  8.2.1 Simulation study for change detection with FFF variance
  8.2.2 Performance of the approximate cdf with the FFF variance
  8.2.3 Change detection using the FFF variance
 8.3 The adaptive forgetting factor variance
  8.3.1 Cost functions for the AFF variance
  8.3.2 Distribution-free methods utilising the AFF variance
 8.4 Discussion
9 Conclusion
A Derivations
 A.1 Methods for combining p-values
  A.1.1 Fisher's method
  A.1.2 Stouffer's method
 A.2 Distribution theory
  A.2.1 Normal, Gamma and Chi-squared distributions
  A.2.2 Cumulants and moments
  A.2.3 The square of a standard normal variable is chi-squared
  A.2.4 Sum of Gamma variables with same scale is again Gamma
  A.2.5 The cumulants of a chi-squared random variable
 A.3 Derivations for the adaptive forgetting factor mean
  A.3.1 Non-sequential definitions for $\bar{x}_{N,\vec{\lambda}}$
  A.3.2 Alternate non-sequential equation for $\Delta_{N,\vec{\lambda}}$
  A.3.3 Proof of Lemma 1
  A.3.4 Sequential update equation for $\Delta_{N,\vec{\lambda}}$
  A.3.5 Non-sequential and sequential definitions for $\Omega_{N,\vec{\lambda}}$
  A.3.6 Non-sequential definition for $u_{N,\vec{\lambda}}$
  A.3.7 Sequential update equation for $u_{N,\vec{\lambda}}$
  A.3.8 Expectation of cost function $L_{N,\vec{\lambda}}$
  A.3.9 Expectation of the derivative of the cost function $L_{N,\vec{\lambda}}$
 A.4 Derivations for the optimal fixed forgetting factor
 A.5 Derivation of equations for Wood F method
 A.6 Derivations for the forgetting factor variance
  A.6.1 Derivation of $v_{N,\lambda}$
  A.6.2 Sequential update equation for $s^2_{N,\lambda}$
  A.6.3 The covariance of the FFF variance terms
  A.6.4 The variance of $y_i$
  A.6.5 Computing the cumulants for the FFF variance
  A.6.6 Adaptive forgetting factor variance
  A.6.7 Derivative of the adaptive forgetting factor variance
  A.6.8 Summary of AFF variance with a single forgetting factor
References

Glossary

SPC  Statistical process control.
N  The number of observations observed so far, or the number of terms in the weighted sum $Q_N$ (which only occurs in Chapter 7).
$\tau$  The time the true changepoint occurs.
$\hat{\tau}$  The time the changepoint is detected.
FFF  Fixed forgetting factor.
$\lambda$  A fixed forgetting factor valued in [0, 1].
AFF  Adaptive forgetting factor.
$\vec{\lambda}$  An adaptive forgetting factor that takes values in [0, 1].
$\bar{x}_N$  The mean of observations $x_1, x_2, \ldots, x_N$.
$\bar{x}_{N,\lambda}$  Fixed forgetting factor mean for observations $x_1, x_2, \ldots, x_N$ with forgetting factor $\lambda$.
$\bar{x}_{N,\vec{\lambda}}$  Adaptive forgetting factor mean for observations $x_1, x_2, \ldots, x_N$ with adaptive forgetting factor $\vec{\lambda}$.
$\eta$  The step size, or learning rate, used in the gradient descent step to update the adaptive forgetting factor.
d  The dimension of a multivariate observation.
$\bar{\mathbf{x}}_{N,\vec{\lambda}}$  Multivariate adaptive forgetting factor mean for observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$.

ARL0  The in-control Average Run Length for a change detection algorithm, i.e. the average number of observations before an algorithm falsely signals that a change has occurred.
ARL1  The out-of-control Average Run Length for a change detection algorithm, i.e. the average number of observations between a changepoint occurring and the algorithm detecting that a change has occurred.
CCD  Proportion of true changepoints that are correctly detected, a performance metric in the continuous monitoring context.
DNF  Proportion of detected changepoints that are not false detections, a performance metric in the continuous monitoring context.
cdf  Cumulative distribution function.
pdf  Probability distribution function.
EWMA  Exponentially Weighted Moving Average; a control chart originally described in [8].
CUSUM  Cumulative Sum; a control chart originally described in [6].

Chapter 1

Introduction

Data streams can be defined as potentially unending sequences of ordered observations, where each observation may be read exactly once [7]. Such a model is appropriate when the amount of continuously arriving data may be too large to practically store in memory and analyse later [5]. For example, Imperial College's computer network generates 9 GB of NetFlow [] data every hour. Furthermore, streaming data may be arriving at a high frequency relative to the processing resources available, and so streaming data algorithms ("analytics") should be as efficient as possible [8, 07]. Another consideration is that timely decision-making may be necessary, and hence timely output from an analytic is required. The combination of these three characteristics results in a preference for sequential or online statistical techniques. Moreover, it is inherent to many applications that data streams cannot be stopped while adjustments to an algorithm are made; the data will continue to arrive, whatever action is taken. This unique combination of characteristics dictates that streaming data algorithms should be as autonomous as possible. Therefore, algorithms deployed on data streams should have as few control parameters as possible.

Modern data collection technology has made streaming data ubiquitous; examples of data streams abound in fields as diverse as fraud detection [6], astronomy [], finance [6], computer networks [44] and medicine [5, 3]. Data streams are time-varying in nature [73]; as the world changes, so are its measurements expected to change. However, we may expect there to be periods of stability, where the observations are generated by some underlying stochastic mechanism. A simple

formulation would be that the stream is generated by underlying probability distributions, which irregularly change from one regime to the next. This is the context pursued in this thesis. However, the data stream could equally be the residual sequence output from a model. With this formulation in mind, a problem of particular interest is the detection of changepoints in a data stream, the points at which the underlying distribution of the stream changes from one regime to another. Note that all the work in this thesis focuses on discrete-time sequences.

Traditionally, the area of statistics concerned with timely change detection is known as statistical process control (SPC) [9]. For the streaming context outlined above, SPC has a number of limitations, primarily that sequences are considered to only contain a single changepoint. It is assumed that the process can be stopped, once a changepoint is located, while changes are made to the process generating the sequence [66]. In contrast, streaming data will have multiple changepoints, and usually cannot be stopped. For example, in the case of foreign exchange streams, detecting a change may trigger a trading action, but the flow of data will not stop. Many SPC methods further assume that the underlying distributions (and parameters) of the sequence are known, and may require computationally-intensive calculations [65, 66]. Although traditional SPC methods may not be well-suited to streaming data analysis, they find wide application in the area of quality control. In order to make a clear distinction with SPC, we shall refer to the problem of detecting multiple changes in streaming data as continuous monitoring.

Besides the applications mentioned above [6, , 6, 44, 5, 3], continuous monitoring has particular application in computer network monitoring [9, 3, 6], intrusion detection systems (IDS) [48, 49], and detecting denial of service (DoS) attacks []. Measuring the performance of a change detector in the continuous monitoring context is subtle, and requires the development of additional performance metrics. Moreover, in this setting there is a specific need for any control parameters to be automatically set [4]. The framework below addresses this need while developing specific estimation and change detection methodology.

The bulk of this thesis is dedicated to the development of a forgetting factor framework. This framework is used to sequentially and adaptively estimate the current mean and variance of a stream, by placing greater weight on more recent observations and thereby forgetting older observations. This is a very natural formulation which has been studied

before [3, 69], but here we undertake a more rigorous approach in defining an adaptive forgetting factor scheme and its extension to a change-detection framework. This will allow for an essentially parameter-free method for tracking the mean and variance for any distribution on a data stream. Methods are developed to use these adaptive estimates of the mean and variance to provide a change detector in a streaming data context.

There are at least three benefits to this approach. First, the forgetting factor framework provides good estimates of the time-varying mean and variance, no matter how many changes occur in the stream. As remarked above, data streams are expected to have multiple, unexpected changes. Second, when these estimates are used for change detection, only a single control parameter needs to be specified. Most practical change detection schemes require at least two control parameters to detect even a single change, so this is an important reduction of the burden on the analyst. Recalling the continuous monitoring context above, a change detector on a data stream needs to run autonomously, since intervention to reset control parameters is impractical. This is one motivation for using the adaptive forgetting factor framework in this context. Another advantage of the forgetting factor formulation is that it may recover from missed detections better. A further benefit is that these methods are, computationally, very efficient. Finally, this approach is very flexible, and can be applied to detecting a change in the mean or variance of a stream. While there are Bayesian approaches to parameter estimation (e.g. [4]), these are not considered in this thesis due to their computational complexity.

Description of the thesis

Chapter 2 provides background for change detection that will be required for later chapters. Descriptions of the Shewhart, CUSUM and EWMA charts are given, and performance metrics such as the Average Run Length (ARL) are discussed. In addition, a review of chi-squared variables is given, and a brief discussion of moments and cumulants is included, since these form the foundation of the methods compared in Chapter 7.

Chapter 3 describes the forgetting factor framework for adaptively estimating the mean. The fixed forgetting factor formulation is explored before introducing the adaptive forgetting factor methodology. This adaptive procedure optimises the control parameter by taking

a single step in the direction of a suitably defined derivative. A modest contribution here is the formal definition of the derivative with respect to the adaptive forgetting factor, which makes rigorous previously heuristically-defined update equations. The chapter concludes with a discussion on theoretically optimal values for a forgetting factor, which suggests that the adaptive forgetting factor has almost ideal behaviour.

Chapter 4 describes how the forgetting factor framework can be used to construct a change detector. This approach is developed following the approach in the SPC literature, by first using this method to detect a single change in the mean of a univariate stream. Several approaches are discussed for the construction of a decision rule for a change detector, from assuming that the underlying distribution is normal (but with unknown parameters), to a distribution-free method, using probabilistic arguments. A simulation study explores the behaviour of the different forgetting factor change detection methods, as well as the standard CUSUM and EWMA schemes. This is a precursor to the next chapter, where the schemes will be more carefully compared in the continuous monitoring setting.

Chapter 5 extends the methodology of Chapter 4 to the novel and challenging case of detecting multiple changes in a stream, which we refer to as continuous monitoring. The standard performance metrics are no longer adequate, and so additional metrics are introduced. Again, a simulation study is performed to compare our forgetting factor methodology to the standard algorithms, which shows similar performance among the algorithms. Notably, our method has the added advantage of only requiring a single control parameter. Finally, the adaptive forgetting factor change detector is compared to an optimal offline algorithm on real data obtained from a foreign exchange stream, and shows remarkable agreement in the changepoints detected by the offline algorithm. This contribution is summarised in [5]. This work has also been adapted and applied to a computer monitoring application in [6].

Building on the preceding chapters, Chapter 6 provides an extension of continuous monitoring to the case of detecting a change in the mean of a multivariate stream. This is a significant contribution, since it appears to be the first method dedicated to sequentially detecting multiple changes in a multivariate stream. Since there are no obvious candidates for comparison, a recently-published self-starting multivariate method [65] from the SPC literature is adapted as a benchmark for comparison in a simulation study. Although our

forgetting factor method is computationally far more efficient and only requires a single control parameter, it performs at least as well as, if not better than, this state-of-the-art method. This approach has been published with an application to monitoring computer networks [3].

Chapter 7 describes a general framework for analysing approximate methods for computing the cumulative distribution function (cdf) of weighted sums of arbitrary random variables. The work in this chapter is used to enable the construction of a change detector for the variance, as described in Chapter 8. This general framework is applied to analysing four approximate methods for computing the cdf of a weighted sum of chi-squared random variables. These methods were chosen because they utilise moment-matching techniques and are suitable for use in a streaming data context. The main contribution here is that this analysis is far more comprehensive than any previous study. An interesting result is that these four moment-matching methods increase in accuracy as the number of terms in the weighted sum increases; this has not been shown before. The result of the analysis is that a three-moment approximation is recommended for use by most practitioners. This contribution is summarised in [4].

Chapter 8 presents first steps toward change detection for the variance of a stream, presenting equations for the forgetting factor variance. Assuming the stream is normally distributed, an extension of Cochran's theorem [8] allows the forgetting factor variance to be written as a weighted sum of chi-squared random variables. Equations for the first three cumulants are derived in order to allow the methods analysed in Chapter 7 to be used to obtain a decision rule for signalling a change. A formulation for the adaptive forgetting factor variance is presented along with a discussion about the choice of cost function. Potential application to a distribution-free change detector for the mean is discussed.

Chapter 9 concludes the thesis with a summary of the main results and a discussion on potential future work. Derivations for equations used in the text are provided in the appendix.

Chapter 2

Background

This chapter reviews standard definitions and methodology used throughout this thesis. Section 2.1 reviews some background and terminology for change detection. Concepts such as control charts and the burn-in period are introduced, standard algorithms and performance measures are described, and challenges of parameter estimation are discussed. This background is required for Chapters 4, 5, 6 and 8.

2.1 Change detection and streaming data

The goal of a change detection algorithm is to detect a change in the probability distribution of a sequence of random observations. One example is the following situation: suppose we have observations $x_1, x_2, \ldots, x_N$ generated from the random variables $X_1, X_2, \ldots, X_N$ which are independent and distributed according to distribution $D_0$ or $D_1$, such that

$$X_1, X_2, \ldots, X_\tau \sim D_0, \qquad (2.1)$$
$$X_{\tau+1}, \ldots, X_N \sim D_1, \qquad (2.2)$$

where the changepoint $\tau$ is unknown. In this thesis, since we are dealing with streaming data, $N$ will indicate the number of observations observed so far. Two statistics that are commonly monitored for change are the mean and variance. We say $E[X_\tau] - E[X_{\tau+1}]$ is the size of the change in the mean, and $\mathrm{Var}[X_\tau] - \mathrm{Var}[X_{\tau+1}]$ is the size of the change in

the variance. Figure 2.1 shows two graphs illustrating a change in the mean and a change in the variance of a sequence of observations.

Figure 2.1: (a) a change in the mean, and (b) a change in the variance. The vertical lines indicate the true changepoints at $\tau = 50$.

This particular example assumes we have a fixed data set of size $N$, with only one changepoint. Traditional sequential procedures often require that the pre-change and post-change distributions (and perhaps the parameters of those distributions) are known. Here $N$ is not very large ($N = 100$), and all the observations can be stored and then analysed in order to detect $\tau$ with an offline algorithm. Standard references for online (sequential) change detection and change detection with adaptive filtering are [9] and [58]. Two good reviews can be found in [90] and [45].

In recent years, the problem of detecting changes in data streams has arisen in areas such as astronomy [], computer network traffic analysis [06] and finance [38]. A data stream is a (potentially unending) sequence of ordered data points $x_1, x_2, \ldots$ generated from random variables $X_1, X_2, \ldots$ [7, 5]. Detecting changes in a data stream presents a number of challenges [8]:

1. the number of observations may be very large, making it impractical to store and then analyse the data in an offline manner,
2. the data may be arriving at a very high rate, requiring the change-detection algorithms to be computationally efficient,

3. the pre-change and/or post-change distributions are usually unknown, or at least their parameters may be unknown,
4. there may be multiple changepoints, requiring us to consider restarting the algorithm once a change has been detected. This is further discussed in Section 2.1.8,
5. timely detection may be critical, possibly even real-time.

This combination of features forces us to abandon attempts at offline analysis, and instead we only consider online change detection algorithms. Another consequence of the above points is that the changes in the stream are likely to be of different sizes. The following sections describe issues related to change detection performance, basic change detection and charting procedures. In the streaming context, issues relating to unknown parameters and restarting occur, and basic approaches to these issues are discussed.

2.1.1 Performance measures: the average run length

The perfect change detection algorithm would not detect a change until one has occurred, and when a change does occur, it would detect that change immediately. However, this ideal will never be achieved due to stochastic variation. In practice there will be times when an algorithm detects a change when none has occurred (a false alarm), and when a change actually does occur, there will always be some delay in that change being detected. This gives rise to two standard performance measures, ARL0 and ARL1, where ARL is an acronym for average run length. We define the ARL0 of an algorithm to be the average time between false alarms raised by that algorithm, while we define the ARL1 to be the average delay between a change occurring and that change being detected. We can define these more precisely as follows: let the data stream be defined as in Equations (2.1) and (2.2), so that the changepoint is at time $\tau$. Let $\hat{\tau}$ denote the time when the change is detected. Then we can define these measures as

$$\mathrm{ARL0} = E[\hat{\tau} \mid X_1, X_2, \ldots \sim D_0],$$
$$\mathrm{ARL1} = E[\hat{\tau} - \tau \mid X_1, \ldots, X_\tau \sim D_0;\ X_{\tau+1}, \ldots, X_{\hat{\tau}} \sim D_1].$$
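To make the two measures concrete, the following sketch estimates ARL0 and ARL1 by Monte Carlo simulation for normal streams. It is illustrative only and not taken from the thesis: the `detector_factory` argument is a hypothetical stand-in for any of the detectors described in the following sections (calling it returns a fresh detector, which is a callable that takes one observation at a time and returns True when it raises an alarm).

```python
import numpy as np

rng = np.random.default_rng(1)

def first_alarm(stream, detector):
    """Return the 1-indexed time of the first alarm, or None."""
    for j, x in enumerate(stream, start=1):
        if detector(x):
            return j
    return None

def estimate_arl0(detector_factory, n_reps=1000, run_length=10_000):
    """Average time to a false alarm on an in-control N(0, 1) stream."""
    times = []
    for _ in range(n_reps):
        stream = rng.normal(0.0, 1.0, size=run_length)
        t = first_alarm(stream, detector_factory())
        if t is not None:
            times.append(t)
    return np.mean(times)

def estimate_arl1(detector_factory, tau=100, delta=3.0,
                  n_reps=1000, run_length=10_000):
    """Average detection delay for a change of size delta at time tau,
    discarding runs that raise a false alarm before tau."""
    delays = []
    for _ in range(n_reps):
        pre = rng.normal(0.0, 1.0, size=tau)
        post = rng.normal(delta, 1.0, size=run_length - tau)
        t = first_alarm(np.concatenate([pre, post]), detector_factory())
        if t is not None and t > tau:
            delays.append(t - tau)
    return np.mean(delays)
```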

Ideally we would like our algorithms to have a high ARL0 and a low ARL1. However, tuning an algorithm's parameters to achieve a desirable value for one of these measures will have a negative effect on the other measure. We shall revisit this topic in Chapter 4 and extend the concept to multiple changes in Chapter 5.

Note that in the SPC literature it is usual to report tables of Monte Carlo simulation results reporting ARL0 and ARL1. However, it is uncommon to report associated estimates of uncertainty (e.g. [66, 00, 6]). In this thesis we prefer to report the estimated standard deviation of the run lengths, denoted SDRL0 and SDRL1. The magnitude of these uncertainty estimates is consistent with that which has been reported in the SPC literature (e.g. [8]).

2.1.2 Control charts

The idea of a control chart was first described in [38], with the original motivation being the detection of change in manufacturing processes for the purposes of quality control. A control chart consists of points $z_1, z_2, \ldots$ representing a statistic and control limits $a, b$, with $a < b$. When $z_j \in (a, b)$ we say the process is in control, and when $z_j \notin (a, b)$ we say that the process is out of control. We call

$$z_j \in (a, b) \Rightarrow \text{in control}, \qquad (2.3)$$
$$z_j \notin (a, b) \Rightarrow \text{out of control}, \qquad (2.4)$$

a decision rule. Note that here we are using $z_j$ to represent a sequence of statistics in order to distinguish them from the observations $x_j$. We call $a$ the lower control limit (LCL) and $b$ the upper control limit (UCL). We call $\hat{\tau}$ the detected changepoint of the data stream if $z_{\hat{\tau}} \notin (a, b)$, but $z_j \in (a, b)$ for all $j < \hat{\tau}$, while we reserve the letter $\tau$ for the true changepoint. Figure 2.2 is an example of a control chart, and the data stream has a detected changepoint at $\hat{\tau} = 7$. Two well-known control chart schemes are CUSUM and EWMA, first described in [6] and [8], respectively, and discussed below. The genesis of control chart methodology is due to Walter Shewhart [38], and his control chart is described next.
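As a concrete illustration of the decision rule (2.3)-(2.4), the short sketch below runs a sequence of chart statistics against fixed control limits and reports the first out-of-control index as the detected changepoint. The function name and interface are illustrative, not from the thesis.

```python
def monitor(chart_statistics, lcl, ucl):
    """Apply the decision rule (2.3)-(2.4): return the detected
    changepoint (1-indexed) at the first statistic outside (lcl, ucl),
    or None if the process stays in control throughout."""
    for j, z in enumerate(chart_statistics, start=1):
        if not (lcl < z < ucl):
            return j          # out of control: detected changepoint
    return None               # in control throughout

# Example: a Shewhart-style chart applies this rule directly to the
# observations, z_j = x_j, with limits placed a few standard deviations
# either side of the in-control mean.
```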

Figure 2.2: A control chart for detecting a change in the mean. The control limits are indicated by the black dashed lines, while the in-control mean is indicated by the grey dashed line.

2.1.3 Shewhart charts

Suppose the stream of observations $x_1, x_2, \ldots$ is generated from random variables $X_1, X_2, \ldots$, and it is known that for $k \le \tau$,

$$E[X_k] = \mu, \qquad \mathrm{Var}[X_k] = \sigma^2.$$

Then the control limits for a Shewhart chart are defined as

$$a = \mu - c\sigma, \qquad b = \mu + c\sigma,$$

where $c$ is a parameter controlling the sensitivity of the chart. Therefore, if

$$x_1, x_2, \ldots, x_{t-1} \in (a, b), \qquad x_t \notin (a, b),$$

then the detected changepoint is $\hat{\tau} = t$. So, for the Shewhart chart, the chart statistics are the observations themselves, i.e. $z_k = x_k$. This will not be the case for later methods. While $c$ is usually set to 3, Figure 2.2 is an example of a Shewhart chart with $c = 5$. If $\mu$ and $\sigma$ are unknown, then the sample mean $\bar{x}$ and sample variance $s^2$ can provide estimates, and the control limits become

$$a = \bar{x} - cs, \qquad b = \bar{x} + cs.$$

The Shewhart chart is known to be effective at detecting large changes; however, it is insensitive to small changes in the mean [0].

2.1.4 CUSUM

The Cumulative Sum (CUSUM) algorithm was first proposed in [6] and shown to be optimal under certain assumptions [], in the sense of [97]. However, if the distributions' parameters are unknown, as is often the case, then this optimality is not guaranteed. If the stream is initially $N(\mu, \sigma^2)$-distributed, the CUSUM statistic $S_j$ is defined as

$$S_0 = 0, \qquad S_j = \max\left(0,\ S_{j-1} + x_j - k\sigma - \mu\right), \quad j \in \{1, 2, \ldots\},$$

in order to detect an increase in the mean. A change is detected when $S_j > h$. A statistic to detect a decrease in the mean can be similarly defined. Here the control parameters $k$ and $h$ need to be chosen. These values are often chosen according to the needs of the application, and specifically the magnitude of the changes one is trying to detect. For example, in the context of setting parameters related to $h$ and $k$ in the self-starting CUSUM procedure of [6], it is recommended to use standard tables such as those in [99]. This selection is based on the anticipated change size, $E[X_\tau] - E[X_{\tau+1}]$ (see Section 2.1). Although [] showed that CUSUM is optimal, this is only the case when both the pre- and post-change distributions (and the parameter values) are known. If this is not the case, we do not have such strong theoretical guarantees. Moreover, the sensitivity of the

change detector when deployed in practice will depend on the choice of the parameters $k$ and $h$. According to [34], three common (pairs of) choices for the control parameters are

$$k = 0.25,\ h = 8.00, \qquad k = 0.50,\ h = 4.77, \qquad k = 1.00,\ h = 2.49.$$

Although these pairs of control parameters may each perform well in a given situation, it is not obvious, given a data stream, which pair should be used. Indeed, as a stream evolves and changes occur, it is unlikely that a fixed pair of parameters will continue to be optimal after each change. This is part of the motivation for introducing our adaptive forgetting factor scheme in Chapters 3 and 4.

2.1.5 EWMA

The EWMA (Exponentially Weighted Moving Average) control chart is another online change detection method, first described in [8]. Suppose we have observations $x_1, x_2, \ldots$ sampled from a distribution with known mean $\mu$ and variance $\sigma^2$. We then define the new variables $Z_1, Z_2, \ldots$ by

$$Z_0 = \mu, \qquad Z_j = (1 - r) Z_{j-1} + r x_j,$$

where $r \in [0, 1]$ acts as an exponential forgetting factor. It can be shown [8] that the standard deviation of $Z_j$ is

$$\sigma_{Z_j} = \sigma \sqrt{\frac{r}{2 - r}\left(1 - (1 - r)^{2j}\right)}.$$

If we wanted to detect an increase in the mean, a change would be detected when

$$Z_j > \mu + L\, \sigma_{Z_j},$$

where $L$ is a control parameter chosen to give the algorithm a desired performance in terms of ARL0 or ARL1. It is also possible to modify EWMA to perform two-sided detection. According to [00], the parameter $r$ is usually chosen to be $0.05 < r < 0.25$ for detecting small shifts, while $L$ is usually chosen to be close to 3. The original paper [8] showed that, in practice, EWMA is good at detecting small shifts in the process mean.

Interestingly, [00] showed that the properties of EWMA are similar to those of CUSUM schemes, a point further discussed in [03]. However, this was the case when an optimal choice of parameters was used for EWMA in comparison to a CUSUM scheme using a seemingly arbitrary (not necessarily optimal) choice of parameters. This point was recently revisited in [67], where it was shown that EWMA can outperform CUSUM if the size of the change is smaller than that for which the CUSUM parameters were selected. For our present purpose, a central difficulty with both CUSUM and EWMA is the selection of control parameters. We shall return to this in Chapter 5 in the context of continuous monitoring.

2.1.6 Other sequential change detection methods

Besides CUSUM and EWMA, three other well-known sequential change detection methods are the Shiryaev-Roberts method, the generalised likelihood ratio (GLR) method, and the changepoint model. The Shiryaev-Roberts procedure was independently created by A. N. Shiryaev [39] and S. W. Roberts (the creator of EWMA) [9], and is defined as the sum of a sequence of likelihood ratios. It requires knowledge of the pre- and post-change distributions, but has some optimality properties []. The GLR method is similar to CUSUM, and is reviewed in [9, 90]. It also requires knowledge of the pre- and post-change distributions and has recently been applied to detecting changes in the mean and variance [6]. Moreover, it can perform well in relation to CUSUM and EWMA [60].

The changepoint model was first proposed in [66] for detecting a change in a univariate mean, and has since been extended to detecting a change in the variance [68], multivariate

change detection [64] and the non-parametric setting [33, 3]. However, all implementations of this method require the storage of an ever-growing sequence of statistics, so it can be considered unsuitable for use on streaming data. All of these methods have been enhanced or extended in some way, for instance to detecting a change in the variance, or to the multivariate setting (e.g. [54, 68]). While each of these methods has its supporters, there is no clear evidence that any of these should be preferred over CUSUM and EWMA.

2.1.7 Estimating the parameters of the known distribution

In Section 2.1.4, we mentioned that CUSUM is optimal at detecting a change, but only when the pre- and post-change distributions' parameters are known. With a data stream, however, we may not know what the distributions are, let alone the values of the parameters. However, there are cases where we may confidently model a process by a given family of distributions, but we will not know the values of the parameters. For example, we may know that a given set of observations are sampled from a normal distribution $N(\mu, \sigma^2)$, but we do not know the values of $\mu$ and $\sigma^2$. In these cases, we could try and estimate the values of the parameters during an initial monitoring period, assuming that no change occurs. This is called a burn-in period. However, if the burn-in period is too short, our parameter estimates will be inaccurate, which will then lead to poor performance of the change detection algorithm. An extensive literature review of this approach can be found in [77], where the key issues that are discussed include sample size (for the monitoring period), the impact of parameter estimation on algorithm performance, and other possible approaches.

In the continuous monitoring scenario considered in Chapter 5, the burn-in used is relatively short and the changepoints occur after short intervals. This is in contrast to traditional SPC approaches, which usually only consider a single changepoint occurring after a long period of stationarity. It is notable that much literature involving a burn-in period is concerned with both a long burn-in and detecting a single changepoint. The work developed in Chapter 5 is concerned with much shorter regime shifts, which imposes constraints on the length of the burn-in period.

2.1.8 Restarting an algorithm after a changepoint

Suppose we are monitoring a data stream $x_1, x_2, \ldots$, which we assume is sampled from a distribution $D$, and detect a change at $\hat{\tau}$. Once a change has been detected at time $\hat{\tau}$, this signals that (at least) the parameter values of the distribution have changed. In a streaming data context, it is likely that we would wish to continue monitoring the new data stream $x_{\hat{\tau}}, x_{\hat{\tau}+1}, \ldots$ for future changes. This leads to a problem: our change detection algorithm requires the distribution's parameter values, but it is rare that we will know the values of the post-change parameters, and we cannot use the pre-change parameter values (since a change has occurred). In this situation, one approach would be to estimate the post-change parameters during a new burn-in period, and then use these estimates to detect a change in the new stream $x_{\hat{\tau}}, x_{\hat{\tau}+1}, \ldots$ This is the approach adopted later in this thesis.
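The restart-with-burn-in idea, combined with the CUSUM recursion of Section 2.1.4, can be sketched as follows. This is an illustrative outline only; the burn-in length and the CUSUM parameters below are placeholder values, not those used in the thesis, and the CUSUM is run on standardised observations so that $k$ and $h$ are in standard-deviation units. After every detection, the parameters are re-estimated on a fresh burn-in before monitoring resumes.

```python
import numpy as np

def monitor_with_restarts(stream, burn_in=50, k=0.5, h=4.77):
    """Continuously monitor a stream for increases in the mean:
    estimate (mu, sigma) on a burn-in, run a one-sided CUSUM on the
    standardised observations until it signals, record the detection,
    then restart with a new burn-in."""
    detections = []
    i, n = 0, len(stream)
    while i + burn_in < n:
        # Burn-in: estimate the current regime's parameters.
        burn = stream[i:i + burn_in]
        mu, sigma = np.mean(burn), np.std(burn, ddof=1)
        i += burn_in
        # Monitoring: CUSUM on standardised observations.
        s = 0.0
        while i < n:
            s = max(0.0, s + (stream[i] - mu) / sigma - k)
            i += 1
            if s > h:
                detections.append(i)   # detected changepoint (1-indexed)
                break
    return detections
```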

Chapter 3

Adaptive filtering for the mean

The method of change detection in this thesis relies on being able to accurately estimate parameters of a stream such as the mean or variance. Supposing that such stream parameters are well estimated, changepoints are signalled when these statistics deviate from their current estimates beyond a certain threshold, following the conventional approach discussed in Section 2.1.2. Obtaining suitable thresholds for the mean is discussed in Chapter 4, and monitoring the variance is discussed in Chapter 8, but both chapters rely on the methodology developed in this chapter. In this chapter, a framework utilising adaptive forgetting factors is developed that allows stream parameters to be adaptively estimated and will form the basis of change detection methods discussed in later chapters. Previous work on adaptive forgetting includes [7, 6, 50, 69, 3]. The definition of the derivative with respect to the adaptive forgetting factor $\vec{\lambda}$, described in Section 3.3.2, is a key contribution of this thesis as it allows for recursive update equations that do not require the underlying distribution of the stream to be known. This method is well-suited to a streaming data context (see Chapter 1), as everything can be computed sequentially.

This chapter proceeds as follows: Section 3.1 introduces the idea of estimating the mean, Section 3.2 describes the fixed forgetting factor framework, Section 3.3 describes the adaptive forgetting factor framework, and Section 3.4 investigates the optimal values of fixed and adaptive forgetting factors when the pre-change and post-change stream parameters are known. In Section 3.3.8 a result from [3] shows that in a special case the optimal

linear filter is closely related to this adaptive forgetting factor scheme.

3.1 Estimating the mean

Suppose that the stream $x_1, x_2, \ldots$ is generated by the random variables $X_1, X_2, \ldots$ and that $N$ observations have been observed so far. The goal is to estimate the current mean of the stream, namely $E[X_N]$. This estimate will be used for detecting changepoints in the stream, which is the subject of Chapter 4. One way to estimate $E[X_N]$ would be to compute the sample mean of the stream,

$$\bar{x}_N = \frac{1}{N} \sum_{i=1}^{N} x_i.$$

If the stream had been stationary up until this point, i.e. $E[X_i] = \mu$, $i = 1, 2, \ldots, N$, it follows that $\hat{\mu} = \bar{x}_N$ would estimate $\mu$ well, since

$$E\left[\bar{X}_N\right] = E\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right] = \frac{1}{N}\sum_{i=1}^{N} E[X_i] = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu.$$

If, however, there had been a change in the mean at some point $\tau < N$,

$$E[X_i] = \begin{cases} \mu_0, & i = 1, 2, \ldots, \tau, \\ \mu_1, & i = \tau + 1, \tau + 2, \ldots, N, \end{cases} \qquad \mu_0 \ne \mu_1, \qquad (3.1)$$

then $\hat{\mu} = \bar{x}_N$ may not estimate $\mu_1$ very well, if the difference between $\mu_1$ and $\mu_0$ is large. In fact, a better estimate would be obtained by only taking the mean of those observations that occur after the changepoint,

$$\hat{\mu} = \frac{1}{N - \tau}\left[x_{\tau+1} + x_{\tau+2} + \cdots + x_N\right]. \qquad (3.2)$$

(At present, only the weaker condition of stationarity is needed for tracking the mean.)
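As a small numerical illustration (with arbitrary illustrative values) of why the overall sample mean is a poor estimate after a change, compare it with the oracle estimate of Equation (3.2):

```python
import numpy as np

rng = np.random.default_rng(0)
tau, N = 100, 300
# Stream with a change in the mean at tau: mu0 = 0 before, mu1 = 3 after.
x = np.concatenate([rng.normal(0.0, 1.0, tau), rng.normal(3.0, 1.0, N - tau)])

overall_mean = x.mean()            # uses all N observations; lags behind mu1
post_change_mean = x[tau:].mean()  # oracle estimate (3.2); close to mu1 = 3
print(overall_mean, post_change_mean)
```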

Of course, since the location of the changepoint is generally unknown, such an approach is infeasible. An approach to approximate Equation (3.2) by computing a weighted sum of the observations $x_1, x_2, \ldots, x_N$ is explored in the next section.

3.2 Fixed forgetting factor mean

The method at the core of this thesis attempts to estimate the current mean $\mu$ of a non-stationary stream as in Equation (3.2) by computing a weighted sum of the observations $x_1, x_2, \ldots, x_N$ using an (exponential) fixed forgetting factor $\lambda \in [0, 1]$. The fixed forgetting factor mean $\bar{x}_{N,\lambda}$ after $N$ observations is defined as

$$\bar{x}_{N,\lambda} = \frac{1}{w_{N,\lambda}} \sum_{i=1}^{N} \lambda^{N-i} x_i, \qquad (3.3)$$

where the effective sample size $w_{N,\lambda}$ is defined as

$$w_{N,\lambda} = \sum_{i=1}^{N} \lambda^{N-i}. \qquad (3.4)$$

Writing these two equations out in full,

$$\bar{x}_{N,\lambda} = \frac{1}{w_{N,\lambda}}\left[\lambda^{N-1} x_1 + \lambda^{N-2} x_2 + \cdots + \lambda x_{N-1} + x_N\right],$$
$$w_{N,\lambda} = \lambda^{N-1} + \lambda^{N-2} + \cdots + \lambda + 1,$$

it can be immediately seen that

for $\lambda = 1$, $\bar{x}_{N,1} = \bar{x}_N$, the sample mean,
for $\lambda = 0$, $\bar{x}_{N,0} = x_N$, the most recent observation.

This is where the forgetting factor gets its name: the closer $\lambda$ is to zero, the more that $\bar{x}_{N,\lambda}$ forgets early observations. Excluding these two limit cases, for $\lambda \in (0, 1)$ the fixed forgetting factor (FFF) mean is a weighted sum of all the observations $x_1, x_2, \ldots, x_N$, with more weight placed on recent observations and less weight on older observations.
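The direct (non-sequential) computation of the FFF mean in Equations (3.3) and (3.4) takes only a few lines. The sketch below is illustrative; the function name is not from the thesis.

```python
import numpy as np

def fff_mean(x, lam):
    """Fixed forgetting factor mean (3.3): a weighted average of the
    observations x_1..x_N with weights lam^(N-i) / w_{N,lam}."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    weights = lam ** np.arange(N - 1, -1, -1)   # lam^(N-1), ..., lam, 1
    w = weights.sum()                            # effective sample size (3.4)
    return weights @ x / w

# lam = 1 recovers the sample mean; lam = 0 returns the latest observation.
```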

Section 3.2.3 illustrates how different values of $\lambda$ can affect the estimation of the current mean of the stream. Before this, we first discuss the effective sample size $w_{N,\lambda}$ in Section 3.2.1 and compute the expectation and variance of $\bar{x}_{N,\lambda}$ in Section 3.2.2. These statistics will be used to reason about setting a value for $\lambda$ and will be crucial when constructing change detection rules in Chapter 4.

3.2.1 The effective sample size

The quantity $w_{N,\lambda}$ defined in (3.4) is often referred to as the effective sample size since it approximates the number of observations over which we are averaging; for the two trivial cases,

if $\lambda = 1$ then $w_{N,\lambda} = N$,
if $\lambda = 0$ then $w_{N,\lambda} = 1$.

However, if $\lambda \in (0, 1)$ then $w_{N,\lambda} \in (1, N)$. In other words, the sum of the weights measures the effective size of the sample used to compute $\bar{x}_{N,\lambda}$. It is worth investigating the value of $w_{N,\lambda}$ in the limit. If $\lambda \in (0, 1)$ then

$$w_{N,\lambda} = \sum_{i=1}^{N} \lambda^{N-i} = \frac{1 - \lambda^N}{1 - \lambda},$$

and as $N \to \infty$,

$$w_{\infty,\lambda} \equiv \lim_{N \to \infty} w_{N,\lambda} = \frac{1}{1 - \lambda}. \qquad (3.5)$$

Equation (3.5) indicates that, for example, if $\lambda = 0.95$ and $N$ is large, then the effective sample size is $w_{\infty,\lambda} = 20$. Equation (3.5) will be significant in Section 3.2.5, when the FFF and EWMA schemes are compared. Note that the effective sample size is referred to as the memory in [69].

3.2.2 Expectation and variance of forgetting factor mean

Following the definition of the forgetting factor mean $\bar{x}_{N,\lambda}$ in Equations (3.3) and (3.4), suppose that for $i = 1, 2, \ldots, N$, the random variables $X_i$ generate the observations $x_i$. We

can then define the forgetting factor mean of the random variables $X_i$ by

$$\bar{X}_{N,\lambda} = \frac{1}{w_{N,\lambda}} \sum_{i=1}^{N} \lambda^{N-i} X_i, \qquad (3.6)$$

where $w_{N,\lambda}$ is again defined in Equation (3.4). Suppose further that the random variables $X_i$ are independent and identically-distributed (i.i.d.) with expectation and variance $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2$. Then the expectation of $\bar{X}_{N,\lambda}$ is

$$E\left[\bar{X}_{N,\lambda}\right] = E\left[\frac{1}{w_{N,\lambda}} \sum_{i=1}^{N} \lambda^{N-i} X_i\right] = \frac{1}{w_{N,\lambda}} \sum_{i=1}^{N} \lambda^{N-i} E[X_i] = \frac{1}{w_{N,\lambda}} \sum_{i=1}^{N} \lambda^{N-i} \mu = \mu,$$

and (using the independence assumption) the variance of $\bar{X}_{N,\lambda}$ is

$$\mathrm{Var}\left[\bar{X}_{N,\lambda}\right] = \mathrm{Var}\left[\frac{1}{w_{N,\lambda}} \sum_{i=1}^{N} \lambda^{N-i} X_i\right] = \frac{1}{(w_{N,\lambda})^2} \sum_{i=1}^{N} \lambda^{2(N-i)} \mathrm{Var}[X_i] = \frac{\sigma^2}{(w_{N,\lambda})^2} \sum_{i=1}^{N} \lambda^{2(N-i)} = (u_{N,\lambda})\,\sigma^2,$$

where we define

$$u_{N,\lambda} = \frac{w_{N,\lambda^2}}{(w_{N,\lambda})^2} = \frac{1}{(w_{N,\lambda})^2} \sum_{i=1}^{N} \lambda^{2(N-i)}. \qquad (3.7)$$

In summary, $\bar{X}_{N,\lambda}$ is a random variable with expectation and variance

$$E\left[\bar{X}_{N,\lambda}\right] = \mu, \qquad \mathrm{Var}\left[\bar{X}_{N,\lambda}\right] = (u_{N,\lambda})\,\sigma^2.$$

It is interesting to note that

$$\lim_{\lambda \to 1} E\left[\bar{X}_{N,\lambda}\right] = \mu = E\left[\bar{X}_N\right], \qquad \lim_{\lambda \to 1} \mathrm{Var}\left[\bar{X}_{N,\lambda}\right] = \frac{\sigma^2}{N} = \mathrm{Var}\left[\bar{X}_N\right],$$

which shows that the FFF mean behaves as the sample mean in the limit $\lambda \to 1$, as implied earlier.

3.2.3 The relationship between $\bar{x}_{N,\lambda}$ and $\lambda$

The closer $\lambda$ is to zero, the more that $\bar{x}_{N,\lambda}$ forgets earlier data, since greater weight is placed on recent data. This is illustrated in Figure 3.1, where the stream $x_1, x_2, \ldots, x_{300}$ has been sampled from

$$X_1, X_2, \ldots, X_{100} \sim N(0, 1), \qquad (3.8)$$
$$X_{101}, \ldots, X_{300} \sim N(3, 1), \qquad (3.9)$$

and the values of the forgetting factor mean $\bar{x}_{1,\lambda}, \bar{x}_{2,\lambda}, \ldots, \bar{x}_{N,\lambda}$ are shown for a selection of values of $\lambda \in [0.9, 1]$.

If $M$ streams are sampled as in Equations (3.8) and (3.9), then $M$ sequences $\bar{x}_{1,\lambda}, \bar{x}_{2,\lambda}, \ldots, \bar{x}_{N,\lambda}$ can be produced, and their average $\bar{x}^{\mathrm{av}}_{1,\lambda}, \bar{x}^{\mathrm{av}}_{2,\lambda}, \ldots, \bar{x}^{\mathrm{av}}_{N,\lambda}$ can be plotted, as in Figure 3.2 (which shows the average value over $M = 100$ streams). This figure shows that when $\lambda = 0.9$, the forgetting factor mean $\bar{x}_{N,0.9}$ quickly reacts to the change at $t = 100$, and is close to $\mu_1 = 3$ soon after the changepoint. However, if $\lambda = 1$, then $\bar{x}_{N,1}$ does not estimate the mean $\mu_1 = 3$ very well after the changepoint (in fact

it is still less than 2 at observation 300). Error bars, with a width of one standard deviation, are provided at certain points (including those where the standard deviation is largest).

Figure 3.1: (a) A stream $x_1, x_2, \ldots, x_{300}$ sampled from $X_1, \ldots, X_{100} \sim N(0, 1)$, $X_{101}, \ldots, X_{300} \sim N(3, 1)$, and (b) the value of the fixed forgetting factor mean $\bar{x}_{N,\lambda}$ (on this stream) for different values of $\lambda$.

If, therefore, smaller values of $\lambda$ allow $\bar{x}_{N,\lambda}$ to react to changes faster, it would seem as if a value of $\lambda = 0.1$ would be better than $\lambda = 0.9$. However, recall from Section 3.2.2 that $\mathrm{Var}[\bar{X}_{N,\lambda}] = (u_{N,\lambda})\,\sigma^2$, where $u_{N,\lambda}$ is defined in Equation (3.7). As Figure 3.3(b) shows, lower values of $\lambda$ lead to larger values of $u_{N,\lambda}$. This implies that for lower values of $\lambda$, the variance of $\bar{X}_{N,\lambda}$ is higher. This is because the effective sample size, $w_{N,\lambda}$, shown in Figure 3.3(a), is much smaller for lower values of $\lambda$. In other words, if $\lambda$ is too close to 1, then $\bar{X}_{N,\lambda}$ will be slow to react to changes, but if $\lambda$ is too small, then the behaviour of $\bar{X}_{N,\lambda}$ may be subject to large variations. This is a manifestation of the familiar tradeoff between bias and variance. Figure 3.4 has been included to show the behaviour of $w_{N,\lambda}$ and $u_{N,\lambda}$ for $\lambda \in [0.6, 0.99]$. This will be relevant in Section 3.3.5, where the issue of truncating the adaptive forgetting factor is discussed.

One might prefer to have $\lambda$ closer to 1 when the stream is not experiencing a change (for stability), but then have $\lambda$ closer to 0 after a change occurs, at least temporarily, to allow $\bar{X}_{N,\lambda}$ to react to the change quickly. This leads to the idea of an adaptive forgetting factor, which is discussed in Section 3.3: one might hope that a time-varying forgetting factor could allow for both stability and quick reaction to changes. In fact, using an adaptive forgetting factor has the added benefit that the value of a fixed forgetting factor no longer needs to be specified, because the algorithm developed in Section 3.3 automatically selects a value based on the observations so far observed.

Figure 3.2: The average behaviour of $\bar{x}_{N,\lambda}$ for different values of $\lambda$, for $X_1, \ldots, X_{100} \sim N(0, 1)$, $X_{101}, \ldots, X_{300} \sim N(3, 1)$, averaged over 100 simulations. Error bars indicate a width of one standard deviation on either side of $\bar{x}_{N,\lambda}$.

Figure 3.3: Values of (a) $w_{N,\lambda}$ and (b) $u_{N,\lambda}$ for various values of $N$, for $\lambda \in [0.1, 0.99]$.

Figure 3.4: Values of (a) $w_{N,\lambda}$ and (b) $u_{N,\lambda}$ for various values of $N$, for $\lambda \in [0.6, 0.99]$.

In Chapter 4 it is shown that this leads to a change detection algorithm which has fewer control parameters to be specified, in comparison to other change detection methods. First, however, in Section 3.2.4 it is shown that $\bar{X}_{N,\lambda}$ can be updated sequentially, and these sequential update equations will form the basis of the adaptive forgetting factor framework in Section 3.3.

3.2.4 Sequential updating

It is important for streaming data applications to have sequential update equations for $\bar{x}_{N,\lambda}$. Such equations show that the computation of $\bar{x}_{N,\lambda}$ only requires a finite number of statistics to be stored, and show that the computation per datum is of constant complexity. Moreover, the sequential update equations for the $\bar{x}_{N,\lambda}$ will form the basis for defining the adaptive forgetting factor framework. The sample mean $\bar{x}_N$ could be computed sequentially by

$$m_N = m_{N-1} + x_N, \qquad w_N = w_{N-1} + 1, \qquad \bar{x}_N = m_N / w_N,$$

for $N = 1, 2, \ldots$ and $m_0 = w_0 = 0$. Similarly, the FFF mean $\bar{x}_{N,\lambda}$ can also be defined

sequentially by

$$m_{N,\lambda} = \lambda\, m_{N-1,\lambda} + x_N,$$
$$w_{N,\lambda} = \lambda\, w_{N-1,\lambda} + 1,$$
$$\bar{x}_{N,\lambda} = m_{N,\lambda} / w_{N,\lambda}, \qquad (3.10)$$

for $N = 1, 2, \ldots$ and $m_{0,\lambda} = w_{0,\lambda} = 0$. Alternatively, we can write

$$\bar{x}_{N,\lambda} = \left(1 - \frac{1}{w_{N,\lambda}}\right) \bar{x}_{N-1,\lambda} + \frac{1}{w_{N,\lambda}}\, x_N, \qquad (3.11)$$

and so Equations (3.10) and (3.11) are all that is needed to recursively update $\bar{x}_{N,\lambda}$. It is shown in Appendix A.3.7 that $u_{N,\lambda}$ can also be updated sequentially by $u_{0,\lambda} = 0$ and

$$u_{N,\lambda} = \left(\frac{\lambda\, w_{N-1,\lambda}}{w_{N,\lambda}}\right)^2 u_{N-1,\lambda} + \frac{1}{(w_{N,\lambda})^2}. \qquad (3.12)$$

The update equations in this section now provide the basis for the adaptive forgetting factor framework, discussed in Section 3.3.
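The constant-memory updates (3.10)-(3.12) translate directly into code. The following sketch is illustrative, with a hypothetical class name; it maintains $m_{N,\lambda}$, $w_{N,\lambda}$ and $u_{N,\lambda}$ and returns the FFF mean after each observation.

```python
class FFFMean:
    """Sequential fixed forgetting factor mean, Equations (3.10)-(3.12)."""

    def __init__(self, lam):
        self.lam = lam
        self.m = 0.0   # m_{N,lambda}
        self.w = 0.0   # w_{N,lambda}, the effective sample size
        self.u = 0.0   # u_{N,lambda}, which controls Var[Xbar_{N,lambda}]

    def update(self, x):
        w_prev = self.w
        self.m = self.lam * self.m + x
        self.w = self.lam * self.w + 1.0
        self.u = (self.lam * w_prev / self.w) ** 2 * self.u + 1.0 / self.w ** 2
        return self.m / self.w   # xbar_{N,lambda}

# Usage: ff = FFFMean(lam=0.95); estimates = [ff.update(x) for x in stream]
```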

3.2.5 Relation to EWMA

One might notice that the FFF scheme resembles the EWMA scheme described in Section 2.1.5. Indeed, the two are closely related; starting with Equation (3.11) and substituting $w_{N,\lambda} = (1 - \lambda^N)/(1 - \lambda)$,

$$\bar{x}_{N,\lambda} = \frac{\lambda - \lambda^N}{1 - \lambda^N}\, \bar{x}_{N-1,\lambda} + \frac{1 - \lambda}{1 - \lambda^N}\, x_N,$$

and then if $\lambda \in (0, 1)$, as $N \to \infty$, this becomes

$$\bar{x}_{N,\lambda} = \lambda\, \bar{x}_{N-1,\lambda} + (1 - \lambda)\, x_N,$$

which is equivalent to the EWMA scheme if we set $r = 1 - \lambda$. Therefore, it would seem as if the EWMA scheme is the limit case of the FFF scheme, since as $N$ gets very large, the update equations of the FFF scheme tend to the EWMA update equation. This derivation was sketched in [36], but not commented on. It might appear that there is little difference between the two schemes, but in fact the FFF formulation has two key advantages over EWMA. Firstly, since EWMA is essentially the limit case of FFF, the FFF formulation might react quicker to changes in the short term; this is explored in Section 4.1.2. Secondly, the FFF formulation leads to much simpler batch definitions, which will then be used to define an adaptive forgetting factor scheme in Section 3.3.

3.3 Adaptive forgetting factor

The FFF scheme, like other filtering schemes such as EWMA, suffers from one major drawback: it is not clear how to set the value of $\lambda$. In this section the concept of an adaptive forgetting factor (AFF) $\vec{\lambda}$ is introduced, where $\vec{\lambda} = (\lambda_1, \lambda_2, \ldots)$ implies that the value $\lambda_i$ is used to downweight observations up to and including $x_i$. Other adaptive forgetting factor procedures have been discussed in [7, 6, 50, 69, 3]. Note the difference in notation: the AFF is denoted $\vec{\lambda}$, while the FFF is denoted by $\lambda$. The AFF scheme, explained in detail below, allows the value of the forgetting factor to be set automatically after each observation. Besides alleviating the significant burden of setting a value for the forgetting factor, this scheme has the nice feature that the components of $\vec{\lambda}$ will be close to 1 while the stream is in-control, but will drop in value after a change occurs in order to forget the past regime and react to the change quickly.

3.3.1 Adaptive forgetting factor mean

This AFF scheme closely mirrors the FFF scheme described in Section 3.2 and will lead to a change detection method in Chapter 4. Suppose the stream $x_1, x_2, \ldots$ is generated by the

random variables $X_1, X_2, \ldots$; the AFF mean $\bar{x}_{N,\vec{\lambda}}$ is defined for $N = 1, 2, \ldots$ by

$$m_{N,\vec{\lambda}} = \lambda_{N-1}\, m_{N-1,\vec{\lambda}} + x_N, \qquad (3.13)$$
$$w_{N,\vec{\lambda}} = \lambda_{N-1}\, w_{N-1,\vec{\lambda}} + 1, \qquad (3.14)$$
$$\bar{x}_{N,\vec{\lambda}} = m_{N,\vec{\lambda}} / w_{N,\vec{\lambda}},$$

where $\vec{\lambda} = (\lambda_1, \lambda_2, \ldots)$ and $m_{0,\vec{\lambda}} = w_{0,\vec{\lambda}} = 0$. Alternatively, $\bar{x}_{N,\vec{\lambda}}$ can be updated by

$$\bar{x}_{N,\vec{\lambda}} = \left(1 - \frac{1}{w_{N,\vec{\lambda}}}\right) \bar{x}_{N-1,\vec{\lambda}} + \frac{1}{w_{N,\vec{\lambda}}}\, x_N. \qquad (3.15)$$

Note the similarity between Equations (3.11) and (3.15); the FFF $\lambda$ in Equation (3.11) has simply been replaced by the AFF $\vec{\lambda}$ in Equation (3.15).

Following Section 3.2.2, suppose again that the random variables $X_i$ are i.i.d. with $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2$. Defining $\bar{X}_{0,\vec{\lambda}} = 0$ and $\bar{X}_{N,\vec{\lambda}}$ for $N = 1, 2, \ldots$ to be

$$\bar{X}_{N,\vec{\lambda}} = \left(1 - \frac{1}{w_{N,\vec{\lambda}}}\right) \bar{X}_{N-1,\vec{\lambda}} + \frac{1}{w_{N,\vec{\lambda}}}\, X_N,$$

the expectation and variance of $\bar{X}_{N,\vec{\lambda}}$ can be computed as in Section 3.2.2 to be

$$E\left[\bar{X}_{N,\vec{\lambda}}\right] = \mu, \qquad (3.16)$$
$$\mathrm{Var}\left[\bar{X}_{N,\vec{\lambda}}\right] = (u_{N,\vec{\lambda}})\,\sigma^2, \qquad (3.17)$$

where

$$u_{N,\vec{\lambda}} = \left(\frac{\lambda_{N-1}\, w_{N-1,\vec{\lambda}}}{w_{N,\vec{\lambda}}}\right)^2 u_{N-1,\vec{\lambda}} + \frac{1}{(w_{N,\vec{\lambda}})^2}, \qquad u_{1,\vec{\lambda}} = \frac{1}{(w_{1,\vec{\lambda}})^2}.$$

The method for updating $\lambda_i \to \lambda_{i+1}$, the key part of the algorithm and crucial for streaming data, is described in the next section.

3.3.2 Updating $\lambda_N \to \lambda_{N+1}$

Suppose that $\lambda_1, \lambda_2, \ldots, \lambda_N$ have already been defined and that $\lambda_{N+1}$ should be chosen to minimise a particular cost function $L_{N+1,\vec{\lambda}}$ involving $\bar{x}_{N,\vec{\lambda}}$. Since this will be applied in a streaming data context, online optimisation is required. Supposing further that there is a definition for the derivative $\frac{\partial}{\partial \vec{\lambda}}$, one-step gradient descent [9, 7, 58] can be used to define

$$\lambda_{N+1} = \lambda_N - \eta\, \frac{\partial}{\partial \vec{\lambda}}\, L_{N+1,\vec{\lambda}}, \qquad (3.18)$$

where $\eta$ is the step size. The remainder of this section discusses a definition for the derivative

$$\frac{\partial}{\partial \vec{\lambda}}\, \bar{x}_{N,\vec{\lambda}}, \qquad (3.19)$$

since once this derivative is defined, the derivative of any continuous function of $\bar{x}_{N,\vec{\lambda}}$ can be computed using the chain rule.

Recalling that $\vec{\lambda} = (\lambda_1, \lambda_2, \ldots)$, for any $\epsilon \in \mathbb{R}$ define $\vec{\lambda} + \epsilon = (\lambda_1 + \epsilon, \lambda_2 + \epsilon, \ldots)$. Using the definition of $m_{N,\vec{\lambda}}$ in Equation (3.13), $m_{N,\vec{\lambda}+\epsilon}$ is defined as

$$m_{N,\vec{\lambda}+\epsilon} = (\lambda_{N-1} + \epsilon)\, m_{N-1,\vec{\lambda}+\epsilon} + x_N,$$

and then the derivative of $m_{N,\vec{\lambda}}$ is defined in a first-principles manner:

$$\frac{\partial}{\partial \vec{\lambda}}\, m_{N,\vec{\lambda}} = \lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[m_{N,\vec{\lambda}+\epsilon} - m_{N,\vec{\lambda}}\right].$$

In order to proceed further, the non-sequential form of $m_{N,\vec{\lambda}}$ is required. In Appendix A.3.1 it is shown that, using Equation (3.13),

$$m_{N,\vec{\lambda}} = \sum_{i=1}^{N}\left[\prod_{j=i}^{N-1} \lambda_j\right] x_i.$$

The following result (Lemma 1, proved in Appendix A.3.3) is also needed:

$$\prod_{i=1}^{N}(\lambda_i + \epsilon) = \prod_{i=1}^{N} \lambda_i + \epsilon \sum_{j=1}^{N} \prod_{\substack{i=1 \\ i \ne j}}^{N} \lambda_i + O(\epsilon^2).$$

The non-sequential equation for the derivative of $m_{N,\vec{\lambda}}$ can now be computed (with Lemma 1 used between lines (3.20) and (3.21)):

$$\frac{\partial}{\partial \vec{\lambda}}\, m_{N,\vec{\lambda}} = \lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[m_{N,\vec{\lambda}+\epsilon} - m_{N,\vec{\lambda}}\right] = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \sum_{i=1}^{N} \left[\prod_{j=i}^{N-1}(\lambda_j + \epsilon) - \prod_{j=i}^{N-1} \lambda_j\right] x_i \qquad (3.20)$$

$$= \lim_{\epsilon \to 0} \frac{1}{\epsilon} \sum_{i=1}^{N} \left[\epsilon \sum_{j=i}^{N-1} \prod_{\substack{k=i \\ k \ne j}}^{N-1} \lambda_k + O(\epsilon^2)\right] x_i \qquad (3.21)$$

$$= \sum_{i=1}^{N} \left[\sum_{j=i}^{N-1} \prod_{\substack{k=i \\ k \ne j}}^{N-1} \lambda_k\right] x_i. \qquad (3.22)$$

Next, a sequential update for $\Delta_{N,\vec{\lambda}} \equiv \frac{\partial}{\partial \vec{\lambda}}\, m_{N,\vec{\lambda}}$ can be computed (see Appendix A.3.4) from Equation (3.22) as

$$\Delta_{N,\vec{\lambda}} = \lambda_{N-1}\, \Delta_{N-1,\vec{\lambda}} + m_{N-1,\vec{\lambda}}. \qquad (3.23)$$

Similarly, defining $\Omega_{N,\vec{\lambda}} \equiv \frac{\partial}{\partial \vec{\lambda}}\, w_{N,\vec{\lambda}} = \lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[w_{N,\vec{\lambda}+\epsilon} - w_{N,\vec{\lambda}}\right]$, it can be shown (see Appendix A.3.5) that

$$\Omega_{N,\vec{\lambda}} = \lambda_{N-1}\, \Omega_{N-1,\vec{\lambda}} + w_{N-1,\vec{\lambda}}. \qquad (3.24)$$
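Putting the recursions (3.13), (3.14), (3.23) and (3.24) together with the gradient step (3.18) gives a compact streaming algorithm. The sketch below is illustrative only, with a hypothetical class name; it uses the derivative of the AFF mean and the squared-error cost written out in Equations (3.25)-(3.27) just below, and the truncation of $\vec{\lambda}$ to $[\lambda_{\min}, 1]$ described in Section 3.3.5.

```python
class AFFMean:
    """Adaptive forgetting factor mean with one-step gradient descent."""

    def __init__(self, eta=0.01, lam_min=0.6):
        self.eta, self.lam_min = eta, lam_min
        self.lam = 1.0                 # current forgetting factor lambda_N
        self.m = self.w = 0.0          # m_{N,lambda}, w_{N,lambda}
        self.dm = self.dw = 0.0        # Delta_{N,lambda}, Omega_{N,lambda}
        self.xbar = 0.0                # AFF mean xbar_{N,lambda}

    def update(self, x):
        # Gradient step (3.18) on the squared-error cost (3.26): the
        # derivative uses the previous AFF mean (before x is incorporated)
        # and the quotient-rule form of Equation (3.25).
        new_lam = self.lam
        if self.w > 0.0:
            dxbar = self.dm / self.w - self.m * self.dw / self.w ** 2
            new_lam = self.lam - self.eta * (self.xbar - x) * dxbar
            new_lam = min(1.0, max(self.lam_min, new_lam))  # truncation
        # Derivative recursions (3.23)-(3.24), then the AFF mean
        # recursions (3.13)-(3.14), all using the current lambda_N.
        self.dm = self.lam * self.dm + self.m
        self.dw = self.lam * self.dw + self.w
        self.m = self.lam * self.m + x
        self.w = self.lam * self.w + 1.0
        self.xbar = self.m / self.w
        self.lam = new_lam             # becomes lambda_{N+1} for the next step
        return self.xbar
```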

The derivative of $\bar{x}_{N,\vec{\lambda}}$ can now be written as

$$\frac{\partial}{\partial \vec{\lambda}}\, \bar{x}_{N,\vec{\lambda}} = \frac{1}{w_{N,\vec{\lambda}}}\, \Delta_{N,\vec{\lambda}} - \frac{m_{N,\vec{\lambda}}}{(w_{N,\vec{\lambda}})^2}\, \Omega_{N,\vec{\lambda}}, \qquad (3.25)$$

and can be sequentially updated using the definitions of $m_{N,\vec{\lambda}}$ and $w_{N,\vec{\lambda}}$ in Equations (3.13)-(3.14), the update equations for $\Delta_{N,\vec{\lambda}}$ and $\Omega_{N,\vec{\lambda}}$ in Equations (3.23) and (3.24), and the definition in Equation (3.25). It only remains to choose a cost function $L_{N+1,\vec{\lambda}}$ that would be desirable to minimise. Since the mean is being estimated, one natural choice is

$$L_{N+1,\vec{\lambda}} = \frac{1}{2}\left[\bar{x}_{N,\vec{\lambda}} - x_{N+1}\right]^2, \qquad (3.26)$$

which has

$$\frac{\partial}{\partial \vec{\lambda}}\, L_{N+1,\vec{\lambda}} = \left[\bar{x}_{N,\vec{\lambda}} - x_{N+1}\right] \frac{\partial}{\partial \vec{\lambda}}\, \bar{x}_{N,\vec{\lambda}}. \qquad (3.27)$$

Minimising this cost function attempts to ensure that the current AFF mean is as close as possible to the next observation. Of course, other cost functions could be used for the AFF mean. In Section 8.3, cost functions for the AFF variance are discussed. Indeed, the choice of the cost function will depend on the statistic that is being monitored (e.g. the mean or the variance).

It should be noted that a similar procedure was used to update an adaptive forgetting factor $\vec{\lambda}$ in [4], where the cost function was taken to be the negative log-likelihood. Interestingly, the procedure used there results in the same update equations for $\Delta_{N,\vec{\lambda}}$ and $\Omega_{N,\vec{\lambda}}$,

and in the case where it can be assumed that the stream is generated from normal random variables, it results in the same cost function as in Equation (3.26) being used. Note that the formulation here does not require any assumptions about the distribution of the stream, which is one of the key contributions of this thesis.

Figure 3.5: (a) A single stream generated by $X_1, \ldots, X_{150} \sim N(0, 1)$, with $X_{151}, X_{152}, \ldots$ following a normal distribution with a shifted mean. (b) The behaviour of the AFF $\vec{\lambda}$ for the stream in (a). (c) Median value of $\vec{\lambda}$ (over 1000 streams) generated as in (a). The step size used is $\eta = 0.01$. Error bars indicating the empirical 70% confidence interval are provided.

3.3.3 Behaviour of $\vec{\lambda}$ when encountering a change

With the derivations now completed, the behaviour of the AFF $\vec{\lambda}$ is explored. Figure 3.5(b) shows the behaviour of the AFF for a stream sampled from $X_1, X_2, \ldots, X_{150} \sim N(0, 1)$, with $X_{151}, X_{152}, \ldots$ sampled from a normal distribution with a shifted mean; one realisation of the stream is shown in Figure 3.5(a). Figure 3.5(c) shows the average behaviour of the AFF for such a stream, where the average is taken over 1000 such streams. It can be observed from these figures that soon after the change occurs, there is a large drop in $\vec{\lambda}$. This is desirable, as it allows the previous regime to be more quickly forgotten. The optimal behaviour of a forgetting factor will be explored in Section 3.4. Note that the step size used in Figures

The behaviour of λ⃗ for different values of η will be explored in Section 3.3.6.

3.3.4 Plotting the average behaviour of λ⃗

Note that while Figure 3.2 shows the average (mean) value of x̄_{N,λ} with error bars that are one standard deviation of x̄_{N,λ}, the average behaviour of λ⃗ is plotted differently. The quantity λ⃗ is likely to be asymmetric, and (as is described in Section 3.3.5) it is truncated to be in the interval [0.6, 1]. Therefore, to illustrate the average behaviour, Figure 3.5(c) shows the median value of λ⃗ with an empirical 70% confidence interval. All subsequent plots displaying the average behaviour of λ⃗ will follow this procedure. A width of 70% was chosen in order to try and correspond to the case of x̄ ± s for a normal random variable, since approximately 68% of the values of a normal random variable are within one standard deviation of its mean.

3.3.5 Truncating the range of λ⃗

The value of the derivative in Equation (3.8) is not clearly bounded, and so updating λ⃗ with Equation (3.8) could allow λ⃗ to fall outside the interval [0, 1]. However, Figures 3.5(b) and 3.5(c) show that the value of λ⃗ does not appear to drop below 0.6. This is by design; after updating λ_N → λ_{N+1}, the algorithm implements the following rule:

λ_{N+1} ← min(λ_{N+1}, λ_max),
λ_{N+1} ← max(λ_{N+1}, λ_min),

where λ_min and λ_max are the minimum- and maximum-allowed values for λ⃗. This rule ensures that λ_{N+1} ∈ [λ_min, λ_max]. It is clear that λ_max should be set to λ_max = 1. One might think that λ_min = 0.01 is a good choice ("forget almost everything"). However, this leads to two problems. First, as Figure 3.6 shows, allowing λ⃗ to decrease so that it is close to 0 can have the effect of making recovery to pre-change values very slow. Figure 3.6(b) shows that with truncation at 0.01, after 50 observations λ⃗ has still not recovered to its pre-change level. Second, and this is a less obvious

point: if λ⃗ is too close to 0, the estimation of the AFF mean x̄_{N,λ⃗} is subject to a greater amount of variance. Recall from Equation (3.7) that the variance of x̄_{N,λ⃗} is controlled by the quantity u_{N,λ⃗}. Figure 3.7 shows the median values of u_{N,λ⃗} corresponding to the median values of λ⃗ in Figure 3.6. It can be seen that when truncation at λ_min = 0.01 is used, the value of u_{N,λ⃗} increases dramatically after the changepoint. On the other hand, when truncation at λ_min = 0.6 is used, the increase in u_{N,λ⃗} is minor and short-lived. Furthermore, the error bars are much smaller for truncation at 0.6 than at 0.01. Other values besides λ_min = 0.6 could be used, but this value seems to offer a good balance between allowing as much of a drop in λ⃗ as possible, while avoiding the problems of slow recovery and increasing u_{N,λ⃗}. Figures 3.3 and 3.4 offer some justification for the choice of 0.6: for λ < 0.6, values of w_{N,λ} and u_{N,λ} (for different values of N) appear to be indistinguishable. One might hope that increasing the step size η from 0.01 to 0.1 may fix these problems for λ_min = 0.01. While it is true that recovery is improved for η = 0.1, there is still the problem of an increase in u_{N,λ⃗}. The next section discusses the choice of η further.

3.3.6 Choice of step size

It may appear that the choice of forgetting factor has been replaced with a choice of step size η, where any value in the range [0.001, 0.1] would seem reasonable. It would be expected that the value of η would affect the behaviour of λ⃗, as shown in Figure 3.8, which,

Figure 3.7: Median value of u_{N,λ⃗} (over 1000 simulations), with (a) truncation of λ⃗ at 0.6, (b) truncation of λ⃗ at 0.01. The step size is η = 0.01. Error bars indicating the empirical 70% confidence interval are provided.

Figure 3.8: Median value of the AFF λ⃗ (over 1000 simulations) for η = 0.1 (a), η = 0.01 (b) and η = 0.001 (c), for a stream generated by X_1, ..., X_50 ~ N(0, 1) and X_51, X_52, ... ~ N(µ_1, 1). Error bars indicating the empirical 70% confidence interval are provided.

as before, shows the average behaviour. However, Figure 3.9 appears to suggest that the choice of η does not seem to affect the behaviour of the AFF mean x̄_{N,λ⃗} as much as the choice of λ affects the FFF mean. Furthermore, it is shown in Appendix A.3.9 that for L_{N+1}(λ⃗) defined in Equation (3.16),

(∂/∂λ⃗) L_{N+1}(λ⃗) = O(σ²),    (3.30)

Figure 3.9: Average behaviour of the AFF mean x̄_{N,λ⃗} (over 1000 simulations) for η = 0.1, 0.01, 0.001, for a stream generated by X_1, ..., X_50 ~ N(0, 1) and X_51, X_52, ... ~ N(µ_1, 1). Error bars with a width of one standard deviation are provided.

and if σ² is known, or an estimate is obtained, then the derivative should be scaled by 1/σ², i.e. the update equation, Equation (3.8), should become

λ_{N+1} = λ_N − (η/σ²) (∂/∂λ⃗) L_{N+1}(λ⃗).    (3.31)

From now on, we shall assume that the derivative is always scaled by a (possibly estimated) value of 1/σ², i.e. that our update equation is Equation (3.31). With this scaling, there is little dependence on the choice of η (at least for the range [0.001, 0.1]); in Section 5.3 it is shown that the AFF mean change detection scheme performs very similarly for different choices of η, which provides some evidence that the choice of η is not crucial in continuous monitoring.

3.3.7 Comparison of the AFF mean and the FFF mean

Figure 3.10 compares the average behaviours of the AFF mean x̄_{N,λ⃗} and the FFF mean x̄_{N,λ}, as in Figure 3.2. Figure 3.10 shows that, at least for this example where there is a change from µ_0 = 0 to µ_1 = 3, the average x̄_{N,λ⃗} reacts to the change faster than any of the fixed FFF schemes, yet also exhibits stability when the stream is in control. This

Figure 3.10: A comparison between x̄_{N,λ} for different values of λ (λ = 0.9, 0.95, 0.99, 1) and the adaptive x̄_{N,λ⃗}, for X_1, ..., X_200 ~ N(0, 1), X_201, ..., X_300 ~ N(3, 1), averaged over 100 simulations. The step size used in this figure is η = 0.1, but other values yield very similar results. Error bars with a width of one standard deviation are provided.

combination of the AFF mean x̄_{N,λ⃗} being very responsive to a change, yet also being stable during periods of stationarity (Figure 3.8 shows λ⃗ increasing back to pre-change levels after a change), are ideal characteristics for the estimator of a time-varying quantity.

3.3.8 Relation to Kalman Filter

Suppose a random walk is characterised for i = 1, 2, ... by

X_i ~ N(µ_i, σ_X²),  µ_i = µ_{i−1} + ε_i,  ε_i ~ N(0, σ_ε²),

for some parameters σ_X² and σ_ε². In [3, Sec. 3.2.6] it is remarked that for such a random walk, if the parameters σ_X² and σ_ε² are known and the optimal filter estimate after observa-

tion N, given all observations x_1, x_2, ..., x_N, is denoted by

µ̂^{KF}_N = E[X_N | x_1, ..., x_N; σ_X², σ_ε²],

then µ̂^{KF}_N is recursively computable by the Kalman Filter equations [83]. Furthermore, it is shown in [3, Sec. 3.2.6] that (for this special case)

µ̂^{KF}_{N+1} = (λ_N w^{KF}_N µ̂^{KF}_N + x_{N+1}) / w^{KF}_{N+1},  µ̂^{KF}_0 = 0,    (3.32)
w^{KF}_{N+1} = λ_N w^{KF}_N + 1,  w^{KF}_0 = 0,    (3.33)
λ_N = σ_X² / (σ_X² + w^{KF}_N σ_ε²).    (3.34)

It is interesting to compare Equations (3.32) and (3.33) with Equations (3.4) and (3.5) for the AFF mean. It shows that the optimal filter equations are of the same form as the AFF mean equations, although the method for setting the forgetting factor λ_N in Equation (3.34) is different. In this simple context, this argument provides, to some extent, a theoretical interpretation of the forgetting factor.
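The recursions (3.32)-(3.34) are easily computed once σ_X² and σ_ε² are assumed known, as the following sketch illustrates. It is a small illustration of the special case above, written to mirror the forgetting-factor form of the recursions; the function and variable names are chosen here and are not from the thesis.

```python
# A sketch of the Kalman-filter special case in Equations (3.32)-(3.34),
# assuming sigma_x2 (observation variance) and sigma_eps2 (random-walk variance)
# are known. Written to mirror the FFF-style recursions; illustrative only.

def kalman_forgetting(xs, sigma_x2, sigma_eps2):
    mu_hat, w = 0.0, 0.0          # muhat_0^KF = 0, w_0^KF = 0
    lambdas, means = [], []
    for x in xs:
        lam = sigma_x2 / (sigma_x2 + w * sigma_eps2)   # Eq. (3.34), uses w_N
        w_new = lam * w + 1.0                          # Eq. (3.33)
        mu_hat = (lam * w * mu_hat + x) / w_new        # Eq. (3.32)
        w = w_new
        lambdas.append(lam)
        means.append(mu_hat)
    return lambdas, means
```

As the weight w^{KF}_N grows, the factor produced by (3.34) settles towards a constant value, which gives one sense in which a fixed forgetting factor mimics the steady-state filter.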

3.4 Optimal forgetting factors λ and λ⃗

It is natural to wonder if there is an optimal value for a fixed forgetting factor λ or for an adaptive forgetting factor λ⃗. In order to proceed with this question, we first need to define what we might mean by optimal, and since we are ultimately concerned with change detection, it is natural to consider the problem of finding an optimal λ with respect to a single changepoint in the data. Assume, as before, that we have a sequence of at least N observations x_1, x_2, ... generated from independent random variables X_1, X_2, ..., and suppose that

X_1, X_2, ..., X_τ ~ N(µ_0, σ_0²),    (3.35)
X_{τ+1}, X_{τ+2}, ... ~ N(µ_1, σ_1²),    (3.36)

where τ is the changepoint and µ_0 ≠ µ_1. For N > τ, define D = N − τ.

3.4.1 Definition of optimal λ

Given the data stream in Equations (3.35) and (3.36), the optimal fixed λ is defined to be

λ̂_{τ,N} = arg inf_{λ ∈ [0,1]} E[ (X̄_{N,λ} − µ_0)² | A_0 ]  for N ≤ τ,
λ̂_{τ,N} = arg inf_{λ ∈ [0,1]} E[ (X̄_{N,λ} − µ_1)² | A_1 ]  for N > τ,    (3.37)

where

A_0: X_1, ..., X_N ~ N(µ_0, σ_0²),
A_1: X_1, ..., X_τ ~ N(µ_0, σ_0²), X_{τ+1}, ..., X_N ~ N(µ_1, σ_1²).

If more than one λ gives the same infimum, the larger value is used. For N = 1, 2, ... the optimal fixed forgetting factor vector

λ̂_τ = ( λ̂_{τ,1}, λ̂_{τ,2}, ..., λ̂_{τ,N}, ... )    (3.38)

is obtained. It will be convenient to use the shorthand E[ (X̄_{N,λ} − µ_{0,τ,N})² | A_{0,1} ], where we define µ_{0,τ,N} as

µ_{0,τ,N} = µ_0 for N ≤ τ,  µ_{0,τ,N} = µ_1 for N > τ,    (3.39)

and A_{0,1} as

A_{0,1} = A_0 for N ≤ τ,  A_{0,1} = A_1 for N > τ,    (3.40)

where A_0 and A_1 are defined in Equation (3.37). It is shown in Appendix A.4 that for N > τ,

E[ (X̄_{N,λ} − µ_1)² | A_{0,1} ] = [ (1/w_{N,λ}) ( λ^D w_{τ,λ} µ_0 + w_{D,λ} µ_1 ) − µ_1 ]²
  + (1/w_{N,λ}²) [ λ^{2D} w_{τ,λ²} σ_0² + w_{D,λ²} σ_1² ],    (3.41)

where the subscript λ² indicates that the weight is computed with the squared forgetting factor,

while for N ≤ τ,

E[ (X̄_{N,λ} − µ_0)² | A_{0,1} ] = (u_{N,λ})² σ_0².

This provides us with sufficient information in order to solve for each λ̂_{τ,N}.

3.4.2 Solving for optimal fixed λ

In order to find

λ̂_{τ,N} = arg inf_{λ ∈ [0,1]} E[ (X̄_{N,λ} − µ_{0,τ,N})² | A_{0,1} ],    (3.42)

Equation (3.41) can be numerically evaluated for λ ∈ {0, δ, 2δ, ..., (L−1)δ, Lδ = 1}, where δ = 1/L and L is a large number (e.g. L = 1000). Note that while one could try to find an analytical solution for Equation (3.42) by solving

(∂/∂λ) E[ (X̄_{N,λ} − µ_{0,τ,N})² | A_{0,1} ] = 0,    (3.43)

this approach seems unlikely to succeed, since the derivative of the expression in Equation (3.41) is unlikely to have an analytical solution in terms of λ. However, we are justified in using an iterative (numerical) method to solve Equation (3.42) here, since we are exploring theoretical optimal values. In practice, an iterative method would be unsuitable in a streaming data context (see the earlier discussion of the streaming requirements).

3.4.3 Results for optimal fixed λ: Example

The development above provides the general framework for reasoning about optimal fixed λ. To illustrate this using a specific example, consider the stream generated by

X_1, X_2, ..., X_50 ~ N(0, 1),    (3.44)
X_51, ..., X_100 ~ N(µ_1, 1).    (3.45)
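The grid evaluation described in Section 3.4.2 is straightforward to carry out; the sketch below evaluates the mean-squared error in Equation (3.41) on a grid and returns the minimiser, taking the larger λ in the event of ties, as stated in Section 3.4.1. The default parameter values are illustrative and the code is a sketch, not the thesis implementation.

```python
# Numerical search for the optimal fixed lambda (Equations 3.41-3.42).
import numpy as np

def w(n, lam):
    # w_{n,lambda} = sum_{i=1}^{n} lam^{n-i}; equals n when lam == 1
    return float(n) if lam == 1.0 else (1.0 - lam**n) / (1.0 - lam)

def mse(lam, N, tau, mu0, mu1, s0sq, s1sq):
    if N <= tau:                                 # pre-change: pure variance term
        return (w(N, lam**2) / w(N, lam)**2) * s0sq
    D = N - tau
    mean = (lam**D * w(tau, lam) * mu0 + w(D, lam) * mu1) / w(N, lam)
    var = (lam**(2 * D) * w(tau, lam**2) * s0sq + w(D, lam**2) * s1sq) / w(N, lam)**2
    return (mean - mu1)**2 + var                 # Equation (3.41)

def optimal_lambda(N, tau, mu0=0.0, mu1=1.0, s0sq=1.0, s1sq=1.0, L=1000):
    grid = np.linspace(0.0, 1.0, L + 1)
    errs = np.array([mse(l, N, tau, mu0, mu1, s0sq, s1sq) for l in grid])
    best = errs.min()
    # if several lambdas attain the infimum, take the largest (Section 3.4.1)
    return grid[np.isclose(errs, best)].max()
```

Evaluating optimal_lambda(N, tau=50, ...) for N = 1, 2, ... traces out a curve of the kind shown in Figure 3.11.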

Figure 3.11: A plot of the optimal fixed forgetting factor λ̂_50 for the random variables given in Equations (3.44) and (3.45). A vertical line at an observation shortly after the changepoint intersects with a horizontal line at the optimal fixed value 0.8. This means that, for a changepoint at τ = 50 from N(0, 1) to N(µ_1, 1), the FFF mean that minimises Equation (3.37) at that observation is x̄_{N, 0.8}.

If λ̂_{50,N} is computed for each N = 1, 2, ..., 100, ..., the vector

λ̂_50 = ( λ̂_{50,1}, λ̂_{50,2}, ..., λ̂_{50,100}, ... )

is obtained. Figure 3.11 is a plot of λ̂_50. Notice that λ̂_{50,1} = ... = λ̂_{50,50} = 1, indicating that the optimal forgetting factor to use in this range is λ = 1, which will give equal weight to all the random variables up until the changepoint at τ = 50. Figure 3.11 shows that after the changepoint at τ = 50, the optimal forgetting factor drops almost to 0, before increasing back towards 1. So, for example, in this simulation the optimal forgetting factor that will give the smallest residual for the observations up to the point indicated in Figure 3.11 is approximately 0.8.

3.4.4 Optimal adaptive λ⃗

Analogously to Section 3.4.1, define the optimal adaptive forgetting factor vector λ⃗̂_{τ,N} = ( λ̂_{τ,1}, λ̂_{τ,2}, ..., λ̂_{τ,N} ), which

consists of entries λ̂_{τ,i} unrelated to the optimal fixed λ's of Section 3.4.1, and

λ̂_{τ,N} = arg inf_{λ_{τ,N} ∈ [0,1]} E[ (X̄_{N,λ⃗̂_{τ,N}} − µ_0)² | A_0 ]  for N ≤ τ,
λ̂_{τ,N} = arg inf_{λ_{τ,N} ∈ [0,1]} E[ (X̄_{N,λ⃗̂_{τ,N}} − µ_1)² | A_1 ]  for N > τ,    (3.46)

A_0: x_1, ..., x_N ~ N(µ_0, σ_0²),
A_1: x_1, ..., x_τ ~ N(µ_0, σ_0²), x_{τ+1}, ..., x_N ~ N(µ_1, σ_1²).

Note that in order to define λ⃗̂_{τ,N}, the information λ⃗̂_{τ,N−1} = ( λ̂_{τ,1}, λ̂_{τ,2}, ..., λ̂_{τ,N−1} ) is needed. It will be convenient to use the shorthand E[ (X̄_{N,λ⃗} − µ_{0,τ,N})² | A_{0,1} ] to refer to the expectation in Equation (3.46). For the rest of this section, in order to increase readability, define λ⃗ = λ⃗̂_{τ,N}. In other words, for the rest of this section λ⃗ denotes the optimal adaptive forgetting factor. In order to calculate λ⃗, recall that D = N − τ and define

P_{τ+1}^{τ+D, λ⃗} = ∏_{k=τ+1}^{τ+D} λ_k.    (3.47)

Following the same procedure shown in Appendix A.4 (the calculation of the optimal fixed λ), it can then be shown that

E[ (X̄_{N,λ⃗} − µ_{0,τ,N})² | A_{0,1} ] = [ (1/w_{N,λ⃗}) ( P_{τ+1}^{τ+D,λ⃗} w_{τ,λ⃗} µ_0 + w_{D,λ⃗} µ_1 ) − µ_1 ]²
  + (1/w_{N,λ⃗}²) [ (P_{τ+1}^{τ+D,λ⃗})² v_{τ,λ⃗} σ_0² + v_{D,λ⃗} σ_1² ],    (3.48)

where v_{·,λ⃗} denotes the weight computed with each forgetting factor squared. Now that an expression for the cost function used in Equation (3.46) has been found (note the similarity to the solution for the fixed case given in Equation (3.41)), the optimal λ⃗̂_{τ,N} can

be solved numerically, as in Section 3.4.2. Furthermore, it can also be shown that

E[ X̄_{N,λ⃗} ] = (1/w_{N,λ⃗}) [ P_{τ+1}^{τ+D,λ⃗} w_{τ,λ⃗} µ_0 + w_{D,λ⃗} µ_1 ],    (3.49)
Var[ X̄_{N,λ⃗} ] = (1/w_{N,λ⃗}²) [ (P_{τ+1}^{τ+D,λ⃗})² v_{τ,λ⃗} σ_0² + v_{D,λ⃗} σ_1² ].    (3.50)

Note the similarity to the fixed optimal λ equations. However, these two sets of equations lead to very different results. For the fixed λ case, it is shown in Appendix A.4 (Equations (A.46) and (A.48)) that

E[ X̄_{N,λ} ] = (1/w_{N,λ}) [ λ^D w_{τ,λ} µ_0 + w_{D,λ} µ_1 ],
Var[ X̄_{N,λ} ] = (1/w_{N,λ}²) [ λ^{2D} w_{τ,λ²} σ_0² + w_{D,λ²} σ_1² ].

3.4.5 Results for optimal λ⃗: Example

As in Section 3.4.3, suppose the stream x_1, x_2, ... is generated by the random variables

X_1, X_2, ..., X_50 ~ N(0, 1),    (3.51)
X_51, ..., X_100 ~ N(5, 1).    (3.52)

Again, the vector λ⃗̂_50 = ( λ̂_{50,1}, λ̂_{50,2}, ..., λ̂_{50,100} ) can be solved numerically, and the optimal vector is shown in Figure 3.12. When compared to the fixed case in Figure 3.11, Figure 3.12 suggests that adaptive forgetting can respond to changes faster and recover quicker than fixed forgetting, in principle. The two schemes are compared more closely in the next section.

Figure 3.12: A plot showing the behaviour of the optimal adaptive forgetting factor λ⃗̂ for the random variables given in Equations (3.51) and (3.52).

3.4.6 Comparison of λ̂_{50,N} and λ⃗̂_{50,N}

We compare the values of λ̂_{τ,N}, where λ is either the optimal fixed forgetting factor from Sections 3.4.1-3.4.3 or the optimal adaptive forgetting factor λ⃗ from Section 3.4.4. We do this by looking at two particular examples, where in both cases the observations are

X_1, X_2, ..., X_50 ~ N(0, 1),
X_51, ... ~ N(µ_1, 1),

where µ_1 = 1 or µ_1 = 5 and the changepoint is τ = 50. The tables compare (where λ is either the fixed or the adaptive forgetting factor):

1. λ̂_{50,1}, ..., λ̂_{50,50}, the optimal values of λ before the change-point,
2. λ̂_{50,51}, the optimal value of λ immediately after the change-point,
3. the effective sample size immediately after the change-point, w_{51,λ},
4. the expectation of X̄_{51,λ}, and
5. the variance of X̄_{51,λ},

where the optimal value of the forgetting factor has been used for each observation.

By numerically solving Equations (3.42) and (3.48), we obtain the values summarised in Table 3.2, with columns

µ_1 | λ̂_{50,1}, ..., λ̂_{50,50} | λ̂_{50,51} | w_{51,λ} | E[ X̄_{51,λ} ] | Var[ X̄_{51,λ} ]

for the fixed λ and adaptive λ⃗ cases.

Table 3.2: Comparison of optimal fixed λ and adaptive λ⃗ immediately after a change at τ = 50.

First, we notice that λ̂_{50,1} = ... = λ̂_{50,50} = 1 for both λ and λ⃗, as one might expect, since there is no change-point in this range. We then see that λ̂_{50,51} is close to zero in both experiments, for both pairs, with the adaptive λ⃗̂_{50,51} being closer to zero than the fixed λ̂_{50,51}. However, it is interesting that when we compare the pairs of w_{51,λ}, E[ X̄_{51,λ} ] and Var[ X̄_{51,λ} ], we see all the pairs have similar values. It seems as if there are optimal values for these quantities, regardless of whether the forgetting factor is fixed or adaptive. It is worth highlighting, again, the difference between the two approaches: when µ_1 = 5, the value of λ̂_{50,51} means that E[ (X̄_{51,λ} − µ_{0,τ,N})² ] will be minimised if that fixed forgetting factor is used for observations x_1, ..., x_50, x_51. In the adaptive case, though, E[ (X̄_{51,λ⃗} − µ_{0,τ,N})² ] will be minimised when the forgetting factors are (λ_1, ..., λ_50, λ_51) = (1, ..., 1, 0.005) for observations x_1, ..., x_50, x_51, when µ_1 = 1. Finally, as mentioned in the last section, comparing Figures 3.11 and 3.12 suggests that adaptive forgetting can respond to changes faster and recover quicker than fixed forgetting.

3.5 Discussion

In this chapter we introduced the forgetting factor framework, starting with the fixed forgetting factor scheme. Next we introduced the idea of an adaptive forgetting factor λ⃗, and defined the adaptive forgetting factor mean x̄_{N,λ⃗}. A method for updating λ⃗ was proposed,

which needed the notion of a derivative with respect to λ⃗. All quantities of interest were shown to have recursive update equations, making this framework suitable for sequential change detection. The idea of an optimal adaptive forgetting factor vector was also proposed, following on from the fixed case described in Sections 3.4.1-3.4.3. Finally, a comparison between the optimal fixed and optimal adaptive forgetting factors was provided in Section 3.4.6, which explored the relationship between the two schemes. The performance of λ⃗ seems to behave similarly to the optimal λ. The next chapter embeds this chapter's adaptive estimation methodology into a change detection framework.

Chapter 4

Change detection using adaptive estimation

The previous chapter developed various adaptive estimation procedures for the mean of a data stream. This chapter extends that development by embedding the estimation schemes in a change detector. This is achieved by the imposition of a decision rule (see Chapter 2). Several novel schemes are considered, including: assuming the stream is normally-distributed, distribution-free schemes based on probabilistic bounds, a scheme combining adaptive and fixed forgetting, and a change detector based on monitoring the value of the adaptive forgetting factor λ⃗ itself. Various issues arising with these new detectors are discussed, and crude experimental comparisons are conducted. Recalling that the final objective of this work is continuous monitoring, these comparisons are intended to fix ideas rather than seek optimal solutions. A concern throughout these comparisons is the role of control parameters and their impact on change detection performance measures. Section 4.1 develops the embedding of adaptive estimation into a change detection framework. Additionally, the relationship between FFF and EWMA is illuminated, and the role of the step size is explored. Section 4.2 introduces a variety of distribution-free change detection approaches with forgetting factors. Finally, Section 4.3 provides experimental results comparing these new algorithms with one another, and additionally, with standard approaches.

Label | Description
EWMA | Exponentially weighted moving average scheme (see Chapter 2)
CUSUM | Cumulative sum scheme (see Chapter 2)
FFF | Fixed forgetting factor scheme
AFF | Adaptive forgetting factor scheme
AFFcheby | modification of AFF using Chebyshev's inequality
AFFdrop | modification of AFF; a significant decrease in λ⃗ signals a change
F-AFF | modification of AFF; uses the FFF mean to create control limits
ARL0 | Average number of observations between false alarms
SDRL0 | Standard deviation of ARL0
ARL1 | Average delay in detecting a true changepoint
SDRL1 | Standard deviation of ARL1

Table 4.1: Explanation of labels used for different change detection schemes.

4.1 Change detection using forgetting factors

Suppose the univariate stream x_1, x_2, ... is generated by i.i.d. random variables X_1, X_2, ... with

E[X_i] = µ,  Var[X_i] = σ²,  i = 1, 2, ...    (4.1)

Recall the definition of the AFF mean X̄_{N,λ⃗} from Section 3.3, where it is shown that for such a stream

E[ X̄_{N,λ⃗} ] = µ,  Var[ X̄_{N,λ⃗} ] = (u_{N,λ⃗})² σ²,    (4.2)

where the definition of u_{N,λ⃗} is given in Section 3.3 and its derivation is given in Appendix A.3.7. From this starting point, several change detection schemes can be defined. Table 4.1 provides a reference for the abbreviations of the methods used in the subsequent tables.

4.1.1 Normal streams

Following the tradition of most statistical process control literature, suppose that the random variables X_1, X_2, ... are i.i.d. normal,

X_1, X_2, ... ~ N(µ, σ²).    (4.3)

It then immediately follows from Equation (4.2) that X̄_{N,λ⃗} is also normally distributed,

X̄_{N,λ⃗} ~ N( µ, (u_{N,λ⃗})² σ² ).

Denoting the cumulative distribution function (cdf) of a N(µ, σ²) distribution by F_{N(µ,σ²)}, the quantity

p = F_{N(µ, (u_{N,λ⃗})² σ²)}( x̄_{N,λ⃗} ) ∈ [0, 1],    (4.4)

provides a measure for how well a particular value x̄_{N,λ⃗} follows a N(µ, (u_{N,λ⃗})² σ²) distribution, assuming that the stream is in-control and that all the observations are generated according to Equation (4.3). For a significance level α, a 100(1 − α)% prediction interval could be given by p ∈ (α/2, 1 − α/2). Alternatively, p could be rescaled to be one-sided via p' = 2 min(p, 1 − p), and a decision rule for signalling a change is then given by

p' < α.    (4.5)

We call this the AFF change detection scheme. Note that we could consider p' to be a p-value, since it is a measure of how well x̄_{N,λ⃗} follows a N(µ, (u_{N,λ⃗})² σ²) distribution. This terminology will be revisited in the multivariate setting in Chapter 6. Although only the AFF mean x̄_{N,λ⃗} has been discussed in this section, the same decision rule is derived for the FFF mean x̄_{N,λ}, by simply considering the case when the AFF is λ⃗ = (λ, λ, ...). This is called the FFF change detection scheme. In Section 4.1.2 the FFF and EWMA schemes are shown to be closely related, and their change detection performance is compared.
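A sketch of the decision rule (4.4)-(4.5) is given below, assuming the in-control mean and variance are known (or have been estimated during a burn-in period). The significance level and the particular two-sided rescaling of p are illustrative choices; the function is not the thesis implementation.

```python
# A sketch of the AFF change detection rule in Equations (4.4)-(4.5).
from math import sqrt
from scipy.stats import norm

def aff_signal(xbar, u_N, mu, sigma2, alpha=0.01):
    # Var[xbar_{N,lambda}] = (u_{N,lambda})^2 * sigma^2   (Equation 4.2)
    sd = u_N * sqrt(sigma2)
    p = norm.cdf(xbar, loc=mu, scale=sd)   # Equation (4.4)
    p_prime = 2.0 * min(p, 1.0 - p)        # rescale to a one-sided quantity
    return p_prime < alpha                 # Equation (4.5): signal a change
```

Setting every entry of λ⃗ to a fixed λ, so that xbar and u_N come from the fixed-forgetting recursions, recovers the FFF change detection scheme.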

4.1.2 Comparison between EWMA and FFF

In Section 3.2.5 the relationship between the FFF and EWMA schemes was discussed. In short, the FFF scheme tends to the EWMA scheme as N, the number of observations, tends to infinity. It therefore seems as if the FFF change detector should perform similarly to the EWMA change detector, although one might expect the FFF scheme to be better at detecting changepoints that occur near the start of the monitoring process. This section provides a brief experimental analysis comparing the FFF and EWMA schemes. The experiments will involve repeated Monte Carlo trials for each method attempting to detect a single changepoint. Several locations of τ will be considered, τ ∈ {30, 50, 100, 200, 500}, to test performance for short-, medium- and long-term changes. In this section, and throughout this chapter, the analysis of the change detection methods will follow the standard procedure of assuming that the pre-change mean and variance of the stream are known. However, in later chapters these values will be estimated during a burn-in period; this approach will be crucial when considering multiple changepoints in a stream (Chapter 5).

Algo. | Param. | Param. Val. | ARL0 (SDRL0) | ARL1 (SDRL1)
EWMA (r, L) (0.5, 3.00) (46.07).09 (7.3)
EWMA (r, L) (0.0, 3.00) 0.90 (883.05) 0.0 (7.84)
EWMA (r, L) (0.5,.00) 8.45 (33.5) 5.44 (3.77)
FFF (λ, α) (0.95, 0.0) (403.3) 0.89 (4.93)
FFF (λ, α) (0.95, 0.05) 9.9 (08.0) 7.9 (3.88)
FFF (λ, α) (0.99, 0.0) (76.43) 9.65 (7.70)

Table 4.2: An ARL table for EWMA and FFF over 1000 trials for τ = 100. For ARL1, X_1, ..., X_100 ~ N(0, 1) and X_101, ..., X_200 ~ N(1, 1).

A table summarising the results for τ = 100 is given in Table 4.2, for a selection of control parameters. This table provides our first example showing how ARL0 and ARL1 are coupled; tuning parameters to improve ARL0 has a negative effect on ARL1: if ARL0 increases, ARL1 also increases, and vice versa. This is further demonstrated in Figure 4.4. If the same experiment is repeated for different values of τ, Figure 4.1 is obtained. Noticing the scale on the y-axis, this figure shows that the ARL1 does not vary for dif-

Figure 4.1: EWMA (0.5, 3.00) and FFF (0.95, 0.0) showing the values of ARL1 for different locations of τ.

ferent values of τ, if the pre-change parameters are assumed to be known. The ARL0 of an algorithm does not depend on τ, and so is constant for different choices of τ. Of interest here are the parameter choices where ARL0 approximately matches; in Table 4.2 EWMA (0.5, 3.00) and FFF (0.95, 0.0) both have approximately the same ARL0. Figure 4.1 provides some evidence in support of the claim in Section 3.2.5 that FFF has better ARL1 for short-term changes (smaller values of τ) than EWMA. Admittedly, the variation in ARL1 in Figure 4.1 is small (all values lie within a narrow interval).

4.1.3 Performance of the AFF scheme for different choices of step size

The AFF scheme described in Section 3.3 relies on a gradient descent method to update λ_N → λ_{N+1}. In Section 3.3.6 the value of η, the step size in the update equation given in Equation (3.8), is discussed. Table 4.3 shows that the performance of the AFF algorithm depends on the value of the step size η to some extent. In particular, the false positive rate, as represented by ARL0, is rather variable. While there is a difference in the case of a single changepoint, when there are multiple changepoints, as in Chapter 5, this dependence on η is not as apparent,

as Table 5.2 shows.

Algo. | Param. | Param. Val. | ARL0 (SDRL0) | ARL1 (SDRL1)
AFF (η, α) (0., 0.0) (399.94) 0.4 (5.74)
AFF (η, α) (0.0, 0.0) 60.7 (577.).58 (6.43)
AFF (η, α) (0.00, 0.0) (73.6) 8.09 (6.8)
AFF (η, α) (0., 0.05) (87.44) 7. (4.04)
AFF (η, α) (0.0, 0.05) 4.46 (68.9) 9.35 (4.39)
AFF (η, α) (0.00, 0.05) 36.9 (40.66) 4.95 (6.4)

Table 4.3: An ARL table for AFF for different step sizes η over 1000 trials. For ARL1, X_1, ..., X_100 ~ N(0, 1) and X_101, ..., X_200 ~ N(1, 1).

4.2 Distribution-free forgetting factor methods

Suppose that µ and σ² in Equation (4.1) are known, but the underlying distribution of the stream cannot be assumed to be normal. In this case, it would be preferable to use a decision rule that does not have any distributional assumptions in order to signal a change in the AFF mean. In this section, three methods are introduced that have the potential to provide such a decision rule.

4.2.1 AFF-Chebyshev method

Under reasonable conditions, Chebyshev's inequality [48] states that for any random variable X with E[X] = µ and Var[X] = σ², for any k > 0,

Pr( |X − µ| ≥ kσ ) ≤ 1/k²,    (4.6)

which is equivalent to Pr( |X − µ| < kσ ) > 1 − 1/k². Using the results in Equation (4.2), Chebyshev's inequality can be applied to the AFF equations to yield

Pr( |X̄_{N,λ⃗} − µ| < k (u_{N,λ⃗}) σ ) > 1 − 1/k²,    (4.7)

which for a choice of k provides a decision rule:

|x̄_{N,λ⃗} − µ| < k (u_{N,λ⃗}) σ ⇒ stream is in-control,
|x̄_{N,λ⃗} − µ| ≥ k (u_{N,λ⃗}) σ ⇒ a changepoint has occurred.

A choice of k = 5 will give 1 − 1/k² = 0.96, essentially providing a 96% prediction interval. However, Chebyshev's inequality is conservative, and as Table 4.4 shows, choosing k = 5 will result in a good ARL0, but a very poor ARL1. Other values of k give ARL pairs in a more desirable range, but then relating these values back to the 100(1 − 1/k²)% prediction interval does not provide a good interpretation. So, while this method is appealing, it is not clear how to make a theory-based choice of a value for k. Table 4.4 shows the behaviour of this scheme for some choices of k. This method will not be investigated further in this thesis. As a final point, if µ and σ² are unknown and estimated during a burn-in period (as in Chapter 5), then a minor modification to the above argument should be made. Rather than using Equation (4.6), a version of Chebyshev's inequality based on estimated parameters [36] should be employed. This version of the inequality is used later in the thesis.

4.2.2 Fixed-adaptive forgetting factor method

It was shown in Figure 3.10 that the AFF mean reacts much faster to changes in the mean than the FFF mean when a high fixed value λ is used (e.g. when λ = 0.99). This suggests that if the AFF mean is changing faster than the FFF mean, a changepoint may have occurred. This observation leads to the following decision rule for the AFF mean: for a parameter d > 0, a choice of λ, and supposing the variance in the stream is σ², define

a_N = x̄_{N,λ} − dσ,  b_N = x̄_{N,λ} + dσ.

Then, a decision rule for signalling a changepoint using the AFF mean x̄_{N,λ⃗} is

x̄_{N,λ⃗} ∈ (a_N, b_N) ⇒ stream is in-control,
x̄_{N,λ⃗} ∉ (a_N, b_N) ⇒ a changepoint has occurred.
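The following sketch shows both distribution-free rules side by side. The multiplier names (k for the Chebyshev rule, d for the F-AFF rule) are notation chosen for this illustration; the thesis uses its own control parameters for these quantities.

```python
# Sketches of the two distribution-free decision rules above; illustrative only.
from math import sqrt

def chebyshev_signal(aff_mean, u_N, mu, sigma2, k=5.0):
    # Signal if |xbar - mu| >= k * u_N * sigma, i.e. the bound (4.7) is violated
    return abs(aff_mean - mu) >= k * u_N * sqrt(sigma2)

def f_aff_signal(aff_mean, fff_mean, sigma2, d=1.0):
    # Signal if the AFF mean escapes an interval of half-width d*sigma
    # centred on the slowly-moving FFF mean (e.g. computed with lambda = 0.99)
    a_N = fff_mean - d * sqrt(sigma2)
    b_N = fff_mean + d * sqrt(sigma2)
    return not (a_N < aff_mean < b_N)
```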

Of course, this method requires a choice of λ and d. While setting λ = 0.99 may be straightforward enough, it is not clear what range of values could be used for d. This method, which we abbreviate to F-AFF, bears some resemblance to the Shewhart chart (see Chapter 2), and could be considered to be an adaptive Shewhart chart. Table 4.4 shows the performance of this method for some values of d.

Algo. | Param. | Param. Val. | ARL0 (SDRL0) | ARL1 (SDRL1)
AFFdrop (η, κ) (0.0, 0.65) (7.76) 9.87 (5.90)
AFFdrop (η, κ) (0.0, 0.70) (5.8) 9.4 (5.35)
AFFcheby (η, k) (0.0,.50) (5.70).9 (5.9)
AFFcheby (η, k) (0.0,.00).0 (4.0) 5.64 (3.46)
F-AFF (η, d, λ) (0.0,.00, 0.99) (499.8). (.40)
F-AFF (η, d, λ) (0.0, 0.50, 0.99) (40.03) 5. (7.88)

Table 4.4: An ARL table for the distribution-free AFF schemes over 1000 trials. For ARL1, X_1, ..., X_100 ~ N(0, 1) and X_101, ..., X_200 ~ N(1, 1). A description of the abbreviations used can be found in Table 4.1.

4.2.3 Monitoring the AFF

As Figure 3.8 shows, the AFF λ⃗ reacts to a changepoint by dropping in value. At the time, this was seen as a useful feature that allows the AFF mean to rapidly adjust to the value of the new mean of the stream. However, a natural idea is to use this feature to signal that a changepoint has occurred at time N when λ_N < κ for some threshold κ. The two key parts of this method are the choice of the cost function used for updating λ⃗, and the threshold κ at which a change is signalled. Section 3.3.2 described how the choice of cost function is directly related to the statistic (e.g. mean or variance) that is being monitored for a change. While there are methods that are designed for monitoring the mean and variance simultaneously [54, 6], most methods are designed for monitoring a single statistic. Indeed, there are different versions of CUSUM (and EWMA) for monitoring the mean and for monitoring the variance [04, 30]. Therefore, the choice of cost function is straightforward, since it is based on the statistic being monitored. On the other hand, it is not immediately clear how to set the value of the threshold κ. Clearly, κ ∈ [0, 1), since λ⃗ takes values in [0, 1]. It also appears from Figure 3.8

Figure 4.2: Median value of the AFF λ⃗ (over 1000 simulations) for a stream generated by X_1, ..., X_50 ~ N(0, 1) and X_51, X_52, ... ~ N(µ, 1), where (a) µ = 0.5, (b) µ = 1 and (c) µ = 2. In all cases η = 0.01. Error bars indicating the empirical 70% confidence interval are provided.

and other experiments that when the stream is in control λ⃗ settles down to a value in the range (0.9, 1). While values of 0.8 or 0.7 might seem to be appropriate for κ, the amount that λ⃗ drops after a change in fact depends on the size of the changepoint, as Figure 4.2 shows. Consequently, it is not clear how to set κ when the size of the anticipated change is unknown. Table 4.4 shows this method for a few choices of threshold. One solution to this reliance on choosing κ would be to employ a self-starting method, or perhaps use the AFF-Chebyshev method to monitor the stream λ_1, λ_2, ... While this suggestion may have merit, we do not pursue it further here and leave it for future work.

4.2.4 Summary of distribution-free AFF methods

The methods proposed in this section have the benefit of being free of distributional assumptions. However, they all require control parameters to be selected that are not easy to set. This places these methods in a similar position to CUSUM and EWMA, which also rely on the specification of control parameter values where there is no particular theoretical insight into what the values should be. For this reason, these methods will not be pursued further here and will rather be explored in future work. As a final point, though, it is worth noting that all of the above

Algo. | Param. | Param. Val. | ARL0 (SDRL0) | ARL1 (SDRL1)
CUSUM (k, h) (.00, 4.00) (489.3) 5.68 (9.36)
CUSUM (k, h) (0.5, 8.00) (36.99) 9.87 (4.4)
EWMA (r, L) (0.5, 3.00) (46.07).09 (7.3)
EWMA (r, L) (0.0, 3.00) 0.90 (883.05) 0.0 (7.84)
EWMA (r, L) (0.5,.00) 8.45 (33.5) 5.44 (3.77)
FFF (λ, α) (0.95, 0.0) (403.3) 0.89 (4.93)
FFF (λ, α) (0.95, 0.05) 9.9 (08.0) 7.9 (3.88)
FFF (λ, α) (0.99, 0.0) (76.43) 9.65 (7.70)
AFF (η, α) (0.0, 0.0) 60.7 (577.).58 (6.43)
AFF (η, α) (0.0, 0.05) 4.46 (68.9) 9.35 (4.39)
AFFdrop (η, κ) (0.0, 0.65) (7.76) 9.87 (5.90)
AFFdrop (η, κ) (0.0, 0.70) (5.8) 9.4 (5.35)
AFFcheby (η, k) (0.0,.50) (5.70).9 (5.9)
AFFcheby (η, k) (0.0,.00).0 (4.0) 5.64 (3.46)
F-AFF (η, d, λ) (0.0,.00, 0.99) (499.8). (.40)
F-AFF (η, d, λ) (0.0, 0.50, 0.99) (40.03) 5. (7.88)

Table 4.5: An ARL table for all the algorithms over 1000 trials. For ARL1, X_1, ..., X_100 ~ N(0, 1) and X_101, ..., X_200 ~ N(1, 1). Normal streams, parameters are known. A description of the abbreviations used can be found in Table 4.1.

methods can be deployed using estimated values of the stream's mean and variance, and so they also need not rely on assuming the stream parameters are known.

4.3 Experiments and results

Having embedded forgetting factor estimation methodology in various change detectors, we turn to consider their performance. In this comparison we consider a single changepoint, following the style of work in statistical process control. The purpose of this exercise is to gain a feel for how change detection algorithms behave against a single change when deployed without a view of the post-change distribution. Such a comparison is a precursor to the multiple changepoint context, which we call continuous monitoring, considered in the next chapter. The experiments examine three situations:

- The streams are normally-distributed, and the pre-change mean and variance are known, shown in Table 4.5,

Algo. | Param. | Param. Val. | ARL0 (SDRL0) | ARL1 (SDRL1)
CUSUM (k, h) (.00, 4.00) (77.) 5.3 (.06)
CUSUM (k, h) (0.5, 8.00) 49. (3.54) 0.09 (5.77)
EWMA (r, L) (0.5, 3.00) (556.98). (0.87)
EWMA (r, L) (0.0, 3.00) (78.76) 0.96 (.8)
EWMA (r, L) (0.5,.00) (4.) 5.9 (3.60)
FFF (λ, α) (0.95, 0.0) (385.00).4 (6.79)
FFF (λ, α) (0.95, 0.05) 87.8 (00.6) 8.38 (4.7)
FFF (λ, α) (0.99, 0.0) (64.84) 0.57 (.34)
AFF (η, α) (0.0, 0.0) (500.93) 3.30 (8.84)
AFF (η, α) (0.0, 0.05) (5.96) 9.80 (5.7)
AFFdrop (η, κ) (0.0, 0.65) 88.8 (6.3) 9.7 (5.97)
AFFdrop (η, κ) (0.0, 0.70) 5.73 (35.44) 8.97 (4.74)
AFFcheby (η, k) (0.0,.50) (439.4).60 (8.0)
AFFcheby (η, k) (0.0,.00) (38.86) 7.00 (4.09)
F-AFF (η, d, λ) (0.0,.00, 0.99) (53.34).95 (.)
F-AFF (η, d, λ) (0.0, 0.50, 0.99) (453.5) 4.9 (7.8)

Table 4.6: An ARL table for all the algorithms over 1000 trials. For ARL1, X_1, ..., X_100 ~ N(0, 1) and X_101, ..., X_200 ~ N(1, 1). Normal streams, but parameters are estimated during an initial burn-in period of length B = 50. A description of the abbreviations used can be found in Table 4.1.

- The streams are normally-distributed, and the pre-change mean and variance are unknown and estimated during a burn-in period, shown in Table 4.6,
- The streams are not normally-distributed, but the pre-change mean and variance are known, shown in Table 4.7.

In all cases, the mean increases by one multiple of the standard deviation. To reiterate, the purpose of the experiments is not to seek an optimal change detector, but rather to examine performance under different choices of control parameters. As such, no attempt has been made to match algorithms on ARL0. The notable features in Table 4.5 are that all algorithms' performance depends critically on control parameters and, as has been discussed earlier, an increase in ARL0 leads to an increase in ARL1. Indeed, this is thematic in all the tables arising from this experiment. Table 4.6 shows results for unknown pre-change parameters, which are estimated during a burn-in period. The parameter settings are the same as in Table 4.5. Notably, there

is a decrease in ARL0, while ARL1 remains stable. An obvious explanation for this phenomenon is that the false positive rate suffers as a consequence of estimation in the burn-in phase. An exception to this observation is the F-AFF scheme, which has similar performance in both tables.

Algo. | Param. | Param. Val. | ARL0 (SDRL0) | ARL1 (SDRL1)
CUSUM (k, h) (.00, 4.00) 5.68 (4.87) 5.77 (0.80)
CUSUM (k, h) (0.5, 8.00) (30.8) 0.3 (4.6)
EWMA (r, L) (0.5, 3.00) 8.6 (8.4) 0.49 (6.8)
EWMA (r, L) (0.5,.00) 5.64 (7.86) 5.7 (.70)
FFF (λ, α) (0.95, 0.0) 4.43 (49.86).5 (5.06)
FFF (λ, α) (0.95, 0.05) 6.58 (0.69) 8.73 (3.94)
FFF (λ, α) (0.99, 0.0) (77.4) 0.0 (7.6)
AFF (η, α) (0.0, 0.0) (448.78).97 (6.5)
AFF (η, α) (0.0, 0.05) (97.45) 9.97 (4.46)
AFFdrop (η, κ) (0.0, 0.65) (4.59) 0.77 (8.3)
AFFdrop (η, κ) (0.0, 0.70) (45.3) 9.4 (4.9)
AFFcheby (η, k) (0.0,.50) (40.6).5 (6.)
AFFcheby (η, k) (0.0,.00) 5.0 (6.64) 8.4 (.77)
F-AFF (η, d, λ) (0.0,.00, 0.99) 3.80 (707.5).5 (.0)
F-AFF (η, d, λ) (0.0, 0.50, 0.99) (404.87) 4.77 (7.86)

Table 4.7: An ARL table for all the algorithms over 1000 trials. Non-normal (Gamma) streams, with the pre-change mean and variance assumed known; the mean increases by one multiple of the standard deviation at the changepoint. A description of the abbreviations used can be found in Table 4.1.

Table 4.7 considers Gamma-distributed streams. This choice is made to assess detection performance when assumptions are violated. The natural comparison is with Table 4.5, since the pre-change mean and variance are treated as known in both cases. On one hand, the ARL0 of CUSUM and EWMA appears to suffer dramatically for certain parameter pairs. On the other hand, the forgetting factor methods all degrade more gently (some not at all). Two other interesting features arise from this experiment. First, Figure 4.3 shows how different algorithms' detection delay (ARL1) varies with the size of the change. The algorithms exhibit very similar performance. Second, Figure 4.4 shows how ARL0 and ARL1 are coupled for CUSUM, EWMA, FFF and AFF, for different control parameter settings. An ideal algorithm would manifest in the bottom-right corner of each frame of the figure, having high ARL0 (infrequent false positives) and low ARL1 (fast detections). However,

Figure 4.3: EWMA (0.5, 3.00), FFF (0.95, 0.0), AFF (0., 0.0), CUSUM (0.5, 8.00), AFFdrop (0.0, 0.65), AFFcheby (0.0,.50) and F-AFF (0.0, 0.50, 0.99), showing ARL1 for increasing values of the post-change mean µ. The parameters of the normally-distributed streams are assumed known.

no choice of parameter pair yields this behaviour; if ARL0 increases, so does ARL1.

4.4 Discussion

A collection of change detection schemes utilising the forgetting factor estimation framework has been proposed and explored. The distribution-free methods appear to have some promise, but are compromised by the difficulty of selecting control parameter values. The main conclusions from this exploration are that FFF appears to detect sudden changes more effectively than EWMA, and that forgetting factor methods appear more robust to model misspecification than traditional approaches. Note, however, that these conclusions are based on detecting a single change. In the next chapter certain of the forgetting factor methods are carried forward to the problem of continuous monitoring.

Figure 4.4: ARL0 vs ARL1 for EWMA, FFF, AFF, CUSUM, AFFdrop, AFFcheby and F-AFF for different choices of control parameters. The parameters of the normally-distributed streams are assumed known.

Chapter 5

Continuous Monitoring

The previous chapter explored the utility of the forgetting factor methods for detecting a single changepoint in streaming data. This included consideration of the effects of burn-in for parameter estimation. However, a data stream, as described earlier, is potentially unending and is expected to contain multiple changepoints. In this chapter forgetting factor methods are applied to this more difficult problem, referred to as continuous monitoring, of detecting multiple changepoints in a data stream. This is a subtle and unexamined problem, and it is not clear whether there is any extant method that is well-matched to meet its challenges. While some methods have characteristics that satisfy certain aspects of the problem, there does not seem to be a single method which satisfies all the requirements. The subtleties in continuous monitoring, in relation to existing literature, are discussed in Section 5.1.1. A key finding in this chapter is that the AFF scheme is well-suited to continuous monitoring; it performs comparably to CUSUM and EWMA, yet only requires a single control parameter, which can be easily set. This reduces the burden on the analyst to set control parameters in a streaming data context. Section 5.1 formulates the continuous monitoring framework, reviews some literature, and discusses performance metrics in the continuous monitoring context. Section 5.2 describes how we construct streams for a simulation study, and how the performance metrics are computed. Section 5.3 presents results suggesting that, in the continuous monitoring context, the performance of the AFF scheme does not depend on the choice of step size. Section 5.4 presents results comparing the AFF scheme and restarting CUSUM and

EWMA. Finally, Section 5.5 demonstrates our change detection methodology in an application related to financial data.

5.1 Detecting multiple changepoints in streaming data

Change detection algorithms are usually compared by their ability to find a single changepoint; this was explored in Chapter 4. However, in many real-world situations, such as financial monitoring (exemplified in Section 5.5), multiple changepoints are expected and an algorithm must continue to monitor the process for successive changes. Similar problems occur in certain types of security and surveillance applications [5]. In this section we discuss the multiple changepoint scenario and relevant performance metrics. Denote a stream of observations as x_1, x_2, ..., sampled from i.i.d. random variables X_1, X_2, ..., with changepoints τ_1, τ_2, ..., such that

X_1, X_2, ..., X_{τ_1} ~ F_1,
X_{τ_1+1}, X_{τ_1+2}, ..., X_{τ_2} ~ F_2,    (5.1)
X_{τ_2+1}, X_{τ_2+2}, ..., X_{τ_3} ~ F_3, etc.,

where F_1, F_2, ... represent distributions such that F_k ≠ F_{k+1} for all k. Recall from Chapter 2 that the size of the ith change is defined to be |E[X_{τ_i+1}] − E[X_{τ_i}]| for a change in the mean. As described in Chapter 2, it will be necessary to estimate the stream parameters (mean and variance) for each new regime, i.e. when monitoring starts, and after each detected changepoint. We expect multiple changepoints to occur, and each regime could have a different underlying distribution.

5.1.1 Recent approaches and continuous monitoring

In Chapter 4 the forgetting factor methods were compared to CUSUM and EWMA because they are two of the most basic and well-studied approaches for sequential change detection. Of course, many sophisticated variations have been proposed, each of which typically handles only one of the challenges in continuous monitoring. Considering the requirements of

continuous monitoring provides a convenient way to partition the relevant literature:

(A) Sequential and efficient computation
(B) Handling changes of unknown size
(C) Few control parameters
(D) Self-starting, or detects multiple changes

Requirement (B) has been studied extensively in the context of a single change-point. Much of this work is related to so-called adaptive-CUSUM and adaptive-EWMA; see [5] for a review. Note that we are not aware of any literature where both (B) and re-starting are addressed together. An optimal filtering mechanism is provided in [5] which reduces to standard EWMA in special cases. The approach is shown to be effective for both large and small changes, and so satisfies (B). However, this approach is inadequate for continuous monitoring due to (A) and (C), specifically the large number of coefficients that need to be estimated in the filter. In addition to addressing (B), [8] proposes a method that is suitable for different size shifts in the presence of post-change dependence, in the context of a single change. This sophistication comes at some computational cost, which makes this approach unsuitable for continuous monitoring in relation to (A) and (C). Again, with respect to (B), [78] proposes a hybrid EWMA/CUSUM procedure in the context of a single change. While this approach looks effective in experiments, there are four control parameters to be determined, which violates requirement (C). The issue of self-starting has been addressed in both univariate and multivariate contexts. For example, [6] is an early approach on multivariate self-starting. This method has two parameters, the setting of which is suggested by reference to standard tables, such as those in [99]. Other approaches to self-starting include [46] and [63]. In all these examples, one way or another, there are control parameter settings that are challenging in the context of continuous monitoring, which violates requirement (C). There are self-starting methods [93, 50] that appear to be promising for a sequential analysis context, but require the storage of an increasing window of statistics, and so are

not suitable for a streaming data context. Moreover, it is not clear how these methods could be modified to detect multiple changepoints. Finally, while there are approaches for detecting multiple changepoints in a stream (e.g. [63, 0]), in general these are either non-sequential [0] (violating (A)) or require several control parameters [63] (violating (C)). The methods discussed in this section are all good approaches when considered in the context for which they were designed; however, none of them seem to satisfy all the requirements for detecting changes in streaming data. Furthermore, while most of the traditional SPC literature focuses on detecting a single change after a long period of stationarity, in this chapter we shall consider the scenario where changepoints occur frequently, and so will be using shorter burn-in periods.

5.1.2 Performance measures

Assessment of performance becomes complicated once we depart from the most basic sequential change detection setting. For example, in the context of a multivariate change detection problem, [46] is forced to develop an extra performance measure. Performance assessment is complicated in the continuous monitoring problem, and extends beyond the standard approaches used in the literature. We consider conventional metrics, then performance metrics relevant to the continuous monitoring scenario.

Average Run Length

As described in Chapter 2, two standard performance measures are the Average Run Lengths, ARL0 and ARL1 [6]. ARL0 is computed as the average number of observations until a changepoint is detected, when the algorithm is run over a sequence of observations with no changepoints, while ARL1 is the average number of observations between a changepoint occurring and the change being detected. Note that ARL1 typically refers to a single change of a given magnitude. As noted earlier, the challenge of continuous monitoring involves a sequence of changes of unknown and varying magnitude. These measures alone are insufficient to

characterise detection performance in a continuous monitoring framework. Issues related to calculating the ARLs in a continuous monitoring setting are discussed in Section 5.2.2.

Detection rates

The ARL1 value neither reflects how many changepoints are detected nor how many are missed. Moreover, ARL1 and ARL0 together do not reflect the ratio of true detections to false positives. In a single-change context, these might be difficult to measure, since any reasonable algorithm will detect a change given enough time. However, in a data stream there is a finite amount of time between changepoints, and some changes might not be detected before another changepoint occurs; we then classify these as missed changes. Now, suppose that we have a data stream with C changepoints, and our algorithm makes a total of D detections, T of which are true (correct) detections, while D − T are false detections. We then define:

- CCD = T/C, the proportion of changepoints correctly detected
- DNF = T/D, the proportion of detections that are not false detections

These intuitive definitions are the same as sensitivity and predicted value positive (PVP) in the surveillance literature [55, 5]. Similar metrics are discussed in [85]. Although the complements of CCD and DNF are more intuitively defined (the proportions of missed changepoints and false detections, respectively), these definitions are preferred since the closer CCD and DNF are to 1, the better the performance of the algorithm.

5.2 A simulation study

In developing new change detection methodology, it is customary to consider the case of normally-distributed data (e.g. [7, 6]). This simulation study follows this custom, letting F_i = N(µ_i, σ_i²) for all i. For this simulation, however, σ_i = 1 for all i. In order to obtain randomly spaced changepoints, first sample δ_1, δ_2, ... ~ Pois(θ), for some value θ, to obtain random interval widths, and then pad these values with G and D.

Figure 5.1: (a) Generating the stream. (b) Schematic representation of detection regions. G is a grace period to give the algorithm time to estimate the stream's parameters, and D is a period that allows the algorithm to detect a change.

The changepoints are then specified by:

τ_1 = G + δ_1,
τ_k = τ_{k−1} + D + G + δ_k,  k ∈ {2, 3, ..., M}.

This is schematically represented in Figure 5.1(a). The stream is then generated in blocks [τ_k + 1, τ_{k+1}]. The first block is sampled from a normal distribution with mean µ_0, and then block k is sampled with mean µ_k = µ_{k−1} + Δ_k, where Δ_k is a random jump size in some set S. For the simulations below, the stream is generated with parameters θ = 30, G = 30, D = 30, M = 50000, and the jump sizes Δ_k are uniformly sampled from the set S = {±0.25, ±0.5, ±1, ±3}.
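A sketch of this stream construction is given below. It uses the parameter values quoted above; the jump-size symbol, function name and the random seed are choices made for this illustration rather than part of the thesis.

```python
# A sketch of the stream construction in Section 5.2; illustrative only.
import numpy as np

def simulate_stream(M=1000, theta=30, G=30, D=30,
                    jumps=(0.25, 0.5, 1.0, 3.0), seed=1):
    rng = np.random.default_rng(seed)
    deltas = rng.poisson(theta, size=M)      # random interval widths
    taus = []                                # changepoint locations
    t = G + deltas[0]
    taus.append(t)
    for k in range(1, M):
        t = t + D + G + deltas[k]
        taus.append(t)
    # build the stream block by block; each block is N(mu_k, 1)
    mu, start = 0.0, 0
    xs = []
    for tau in taus:
        xs.append(rng.normal(mu, 1.0, size=tau - start))
        start = tau
        mu += rng.choice(jumps) * rng.choice([-1.0, 1.0])   # jump drawn from S
    return np.concatenate(xs), np.array(taus)
```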

5.2.1 Classifying the detected changes

After running over the stream, an algorithm will return a sequence of detected changepoints {τ̂_1, τ̂_2, ...}, and we must then classify these as correct, missed or false detections. In a simulation setting, we will use the sequence of true changepoints {τ_1, τ_2, ...} to do this. Recall that after detecting a changepoint τ̂_n, our algorithm uses the next B observations in the interval [τ̂_n + 1, τ̂_n + B] as a burn-in region to estimate the parameters of the post-change distribution. The algorithm then monitors the stream until the next true changepoint of the stream occurs at τ_m. Now, if the next detected changepoint is τ̂_{n+1}, then

- if τ̂_{n+1} ∈ [τ̂_n + B, τ_m], then τ̂_{n+1} is a false detection,
- if τ̂_{n+1} ∈ [τ_m + 1, τ_{m+1}], then τ̂_{n+1} is a correct detection,
- if τ_m and τ_{m+1} occur without a detected changepoint in the interval [τ_m, τ_{m+1}], then τ_m is a missed detection.

In order to visualise the situation better, one can imagine that our stream is divided into three regions of different coloured backgrounds: burn-in, waiting, and detection regions. Then, a detected changepoint τ̂_n is classified according to the region in which it lies. For example, in Figure 5.1(b) the first detected changepoint is a correct detection, while the second detected changepoint is a false detection. To clarify the definitions, the waiting region is the interval [τ̂_n + B + 1, τ_{n+1}], i.e. the region between the end of the burn-in period and the next true changepoint. The detection region is the interval [τ_{n+1} + 1, τ̂_{n+1}] or [τ_{n+1} + 1, τ_{n+2}], depending on whether a changepoint is detected or not.

5.2.2 Average run length for a data stream

The calculations of ARL0 and ARL1 are simple using this framework. The ARL0 is the sum of the lengths of the waiting regions between false detections, divided by the number of false detections. The ARL1 is the sum of the lengths of the detection regions between the correctly detected changepoints and their nearest true changepoints. Note that this excludes the detection regions that are between two true changepoints (missed detections).
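The following sketch mirrors the classification rules above and the stream definitions of CCD, DNF, ARL0 and ARL1. The function names, argument layout and treatment of edge cases are illustrative choices, not the thesis code.

```python
# A sketch of the detection bookkeeping in Sections 5.1.2 and 5.2.1-5.2.2.

def classify_detection(d_next, d_prev, B, tau_m, tau_m_next):
    # d_prev: previous detection, B: burn-in length,
    # tau_m < tau_m_next: the true changepoints bracketing the current regime.
    if d_prev + B <= d_next <= tau_m:
        return "false"        # detection falls in the waiting region
    if tau_m + 1 <= d_next <= tau_m_next:
        return "correct"      # detection falls in the detection region
    return "burn-in"          # detections during burn-in are not counted here

def summary_metrics(C, D, T, detection_delays, waiting_lengths):
    # C: true changepoints, D: detections made, T: correct detections
    CCD = T / C                                   # proportion correctly detected
    DNF = T / D if D > 0 else 1.0                 # proportion of detections not false
    ARL1 = sum(detection_delays) / T if T else float("inf")
    ARL0 = sum(waiting_lengths) / (D - T) if D > T else float("inf")
    return CCD, DNF, ARL0, ARL1
```

A true changepoint with no detection in its interval is counted as missed, which lowers CCD but leaves DNF unchanged.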

Label | Description
EWMA | Exponentially weighted moving average scheme (see Chapter 2)
CUSUM | Cumulative sum scheme (see Chapter 2)
AFF | Adaptive forgetting factor scheme
CCD | Proportion of changepoints correctly detected
DNF | Proportion of detections that are not false detections
ARL0 | Average number of observations between false alarms
SDRL0 | Standard deviation of ARL0
ARL1 | Average delay in detecting a true changepoint
SDRL1 | Standard deviation of ARL1

Table 5.1: Explanation of labels used for different change detection schemes and performance metrics.

We also sequentially calculate the variances of ARL0 and ARL1 by recording the sum of squares of the lengths used in ARL0 and ARL1. Care must be exercised in the calculation of the variance of ARL0, however, and we must ensure that we take the square of the sum of the lengths of the waiting regions between false detections. The standard deviations of ARL0 and ARL1 are denoted by SDRL0 and SDRL1, respectively. While these definitions of ARL0 and ARL1 are unconventional, as we are averaging the delays that occur for detecting changes of different sizes, these definitions are one way for us to obtain an estimate of the average run lengths of the detectors. Three algorithms will be compared in the next section: CUSUM, EWMA and AFF. A description of these abbreviations is available in Table 5.1.

5.3 Choice of step size in the continuous monitoring context

Although the AFF scheme described in Section 4.1 only has a single control parameter, Section 4.1.3 shows that the value of the step size η, used in Equation (3.8) to update λ⃗, affects the performance of the AFF scheme when detecting a single change. However, Table 5.2 shows that the AFF algorithm performs relatively consistently, in the continuous monitoring context, for η = 0.1, 0.01, 0.001. It is interesting that slightly different estimation procedures give comparable change detection performance. It now appears that, for practical purposes, the AFF scheme only depends on the single control parameter α, at least in the continuous monitoring context.

Algo | Params | Values | CCD | DNF | ARL1 (SDRL1) | ARL0 (SDRL0)
AFF (η, α) (0.00, 0.005) (0.48) 75.9 (75.45)
AFF (η, α) (0.00, 0.005) (0.55) 48.6 (5.94)
AFF (η, α) (0.00, 0.005) (9.46) (49.94)

Table 5.2: Summary of algorithm performance, over all simulated changepoints, with jump sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. This table shows that AFF has similar performance for η = 0.1, 0.01, 0.001. CCD is the proportion of changepoints correctly detected, and DNF is the proportion of detections that are not false detections. These performance metrics are introduced in Section 5.1.2.

5.4 Experiments and results

Table 5.3 displays exemplar results for the CUSUM, EWMA and AFF algorithms for a choice of parameters, with a burn-in of B = 30, with results averaged over a stream containing many changepoints. For all of the algorithms, after a changepoint is detected the mean and variance of the new regime are estimated during the burn-in period. The algorithms then use these estimates to detect the next change. Parameters have been chosen in Table 5.3 to give each algorithm approximately comparable performance in terms of ARL0. Specifically, the CUSUM parameters used were indicated in [0, Section 8..3] to be common choices of CUSUM parameter pairs. These are almost identical to those recommended in [63, Table ]. For EWMA, it is often recommended that r ∈ [0.05, 0.25] [0, Section 8..], and the parameter pairs used are those recommended in [00]. We choose these default parameter values because we have no other choice in continuous monitoring; when there is no knowledge of the pre- or post-change distribution, or when there will be multiple changepoints between different regimes, there is no opportunity to select the optimal parameter pair. It therefore seems reasonable to try a selection of parameter pairs that have good performance for the single changepoint setting. In Table 5.3 the AFF parameter α was chosen to give comparable ARL0 and ARL1 performance to CUSUM and EWMA. Indeed, comparing AFF with α = 0.005 to CUSUM with (k, h) = (.00,.5) (both in bold), we see that the AFF has almost the same ARL0, slightly higher ARL1, the same DNF value, and a higher CCD. Comparisons with the other two CUSUMs (with (k, h) = (0.50, 4.77) and (k, h) = (0.5, 8.0)) are similar, but the latter CUSUM has a higher CCD value than AFF. AFF with α = 0.01

Algo | Params | Values | CCD | DNF | ARL1 (SDRL1) | ARL0 (SDRL0)
CUSUM (k, h) (.00,.5) (.34) 77.7 (78.4)
CUSUM (k, h) (0.50, 4.77) (9.47) 63.0 (60.99)
CUSUM (k, h) (0.5, 8.0) (8.6) (58.73)
EWMA (r, L) (0.0,.96) (0.03) (7.36)
EWMA (r, L) (0.5,.998) (0.58) 97.9 (94.53)
EWMA (r, L) (0.30, 3.03) (0.99) 0.83 (03.55)
AFF (α) (0.005) (0.48) 75.9 (75.45)
AFF (α) (0.0) (9.97) 9.03 (0.96)

Table 5.3: Summary of algorithm performance, over all simulated changepoints, with jump sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. Parameter values have been chosen to give comparable performance in terms of ARL0. See Table 5.1 for a description of the abbreviations used. Highlighted entries are discussed in the text.

(in red) shows that increasing α decreases both the ARL0 and ARL1, increases CCD and decreases the DNF. The comparison of AFF with EWMA in Table 5.3 is similar. Comparing AFF with α = 0.005 to EWMA with (r, L) = (0.0,.96) (also in bold), AFF has almost the same ARL0, slightly higher ARL1, and broadly the same DNF and CCD. The situation with the other two EWMAs (with (r, L) = (0.5,.998) and (r, L) = (0.30, 3.03)) is similar, except that the EWMAs have higher ARL0, but slightly lower CCD. Table 5.3 therefore indicates that AFF has broadly the same performance as CUSUM and EWMA. However, the benefit of AFF is that it only requires a single control parameter. Table 5.4 shows how CUSUM, EWMA and AFF behave with different parameter pairs. First of all, CUSUM with (k, h) = (.5,.99) and (k, h) = (.50,.6) (in blue), two recommended choices of parameter pairs in [63], have similar performance to those choices in Table 5.3, but with lower CCD. If parameter pairs are mixed, as for (k, h) = (0.5,.5) or (k, h) = (.00, 8.0) (in red), this results in extreme behaviour: either perfect CCD or perfect DNF, but at the expense of poor performance on the remaining metrics. However, a non-standard choice of a parameter pair, (k, h) = (0.50, 8.0) (in green), can result in good (even superior) performance: note how this case compares to CUSUM with (k, h) = (.00,.5) in Table 5.3 (in bold), with comparable CCD and ARL1, but (k, h) = (0.50, 8.0) (in green) has far better DNF and ARL0. This shows that setting the CUSUM parameter pair is non-trivial.

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
CUSUM (k, h) (1.25, 1.99) (.9) 8.0 (84.08)
CUSUM (k, h) (1.50, 1.61) (.85) (88.35)
CUSUM (k, h) (0.25, 2.52) (9.9) 3.46 (.0)
CUSUM (k, h) (1.00, 8.01) (9.73) ( )
CUSUM (k, h) (0.50, 8.01) (9.77) (897.55)
EWMA (r, L) (0.05, 2.615) (.5) . (3.58)
EWMA (r, L) (0.10, 2.814) (5.7) .79 (3.5)
AFF (α) (0.05) (7.54) (44.4)
Table 5.4: Summary of algorithm performance, over multiple changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. Parameter values have been chosen to show how poorly-chosen parameters can lead to poor performance. Highlighted entries are discussed in the text.

Also in Table 5.4 are two parameter pair choices for EWMA that are recommended in [00], but have very poor performance (very high CCD at the expense of poor performance on the other three metrics). This also shows that setting the EWMA parameter pair is non-trivial. Finally, AFF with α = 0.05 is shown to give good CCD, but at the expense of the other three metrics. Since increasing α makes the AFF scheme more sensitive to changes, this behaviour is expected. However, since α is the only control parameter, it is relatively easy to adjust the performance of AFF by increasing or decreasing α. Table 5.5 shows how CUSUM, EWMA and AFF behave when the burn-in is B = 50, instead of B = 30 as in Table 5.3. To make the comparison fair, the grace period G is set to G = 50; otherwise the stream parameters are unchanged. This table shows that the algorithms have broadly the same performance as in Table 5.3, except that ARL0 and ARL1 are increased for all algorithms. Note, however, that CCD and DNF are relatively similar, or only slightly increased. Recall from Chapter 2 that the original EWMA paper [8] shows that the EWMA scheme is sensitive to small change sizes. In the continuous monitoring scenario, changes of different sizes are plausible, hence the preceding tables consider streams with change sizes {±0.25, ±0.5, ±1, ±3} (i.e. both small and large changes). Table 5.6 compares the algorithms when all the changes are relatively large, and shows that, compared to Table 5.3, there is a large increase in CCD, a slight decrease in DNF, a large decrease in ARL1, and similar ARL0. The increase in CCD and decrease in ARL1 should be expected, since the jump sizes are larger.

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
CUSUM (k, h) (1.00, 2.52) (30.48) 6.78 (30.07)
CUSUM (k, h) (0.50, 4.77) (7.0) 5.55 (7.8)
CUSUM (k, h) (0.25, 8.01) (5.0) 4.90 (06.5)
EWMA (r, L) (0.20, 2.962) (7.90) 54. (53.0)
EWMA (r, L) (0.25, 2.998) (8.56) 63.9 (65.30)
EWMA (r, L) (0.30, 3.023) (9.45) 70.0 (73.78)
AFF (α) (0.005) (8.) 4.9 (8.69)
AFF (α) (0.0) (7.3) 48.6 (53.57)
Table 5.5: Summary of algorithm performance, over multiple changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 50. This table can be compared with Table 5.3 to show how the burn-in length affects performance.

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
CUSUM (k, h) (1.00, 2.52) (.6) 78.7 (8.3)
CUSUM (k, h) (0.50, 4.77) (8.3) 6.85 (60.)
CUSUM (k, h) (0.25, 8.01) (8.05) 64.8 (53.5)
EWMA (r, L) (0.20, 2.962) (8.63) .49 (45.)
EWMA (r, L) (0.25, 2.998) (9.33) (94.3)
EWMA (r, L) (0.30, 3.023) (0.09) 0.8 (03.80)
AFF (α) (0.005) (0.0) 7.40 (74.)
AFF (α) (0.00) (0.0) 5.7 (9.68)
Table 5.6: Summary of algorithm performance, over multiple changepoints, with change sizes in {±1, ±2, ±3, ±4} and burn-in B = 30. This table corresponds to Table 5.3, but the stream has a different set of jump sizes.

In conclusion, Tables 5.3, 5.5 and 5.6 show that AFF has similar performance to CUSUM and EWMA when parameters are chosen for all algorithms to have similar ARL0. However, selecting parameter pairs for CUSUM and EWMA is not easy. Table 5.4 shows that standard parameter pair choices can lead to poor performance, and non-standard choices can give better performance. The benefit of AFF is that it only requires a single control parameter, since the adaptive forgetting factor λ is automatically tuned. It can be seen that increasing performance on one metric generally decreases performance on other metrics. In particular, it appears that parameter choices that increase CCD lead to a decrease in DNF, and vice versa. Therefore, α can be increased or decreased to adjust for the desired CCD or DNF performance of AFF.

5.5 Foreign exchange data

As discussed in the introduction, a primary application of continuous monitoring for data streams arises in financial trading. Here, the value of a financial instrument evolves over time as a result of the behaviour of the market. Individual traders need to determine whether the price has made an unexpected change in order to trigger trading actions. However, the data stream continues, uninterrupted, as such trading decisions happen. For illustration, we will consider 5-minute Foreign Exchange (FX) tick data. Specifically, we consider a stream of the Swiss Franc (CHF) and Pound Sterling (GBP). Our objective here is simply to detect changes in the price-ratio that could be used to trigger trading actions. It is well known that FX streams are non-stationary. The standard approach to address this problem is to transform the data and analyse the so-called log-returns, LR_t = log(x_t) − log(x_{t−1}). We perform change detection on the log-returns of the CHF/GBP data for the first 10000 observations. The data covers approximately seven weeks, with one data point every five minutes. We restrict attention to a short sequence simply for the purposes of clarity. Figure 5.1(a) shows the changepoints (vertical lines) detected on the log-returns, superimposed on the raw data stream, for the AFF scheme. Although this section is simply meant to provide an example of the AFF scheme deployed on real data, we also provide a comparison with PELT [87], an optimal offline detection algorithm, in order to provide an indication of the true changepoints. The changepoints detected by PELT are shown in Figure 5.1(b). PELT also has a single control parameter (a penalty), which was chosen for this figure so as to detect a comparable number of changepoints. Figure 5.1 shows a high degree of agreement between the AFF scheme and PELT, with more than half of the changepoints common to both (within 3 observations of each other). This is particularly striking since the AFF scheme is an online method while PELT is an offline method. Similar figures are obtained for different parameter values that increase sensitivity and allow more changepoints to be detected. We use the R implementation of PELT provided in the changepoint package [86].
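The pre-processing and restart-with-burn-in loop used for this example can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the detector interface (update / has_changed) and the re-initialisation from a fixed-length burn-in are assumptions made for exposition, and any one-at-a-time detector (an AFF, CUSUM or EWMA chart) could be plugged in.

```python
# A minimal sketch of the log-return transform and the continuous-monitoring
# loop: after every detection the detector is rebuilt from a fresh burn-in.
import numpy as np

def log_returns(prices):
    """Transform a price series into log-returns: LR_t = log(x_t) - log(x_{t-1})."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

def continuous_monitor(stream, make_detector, burn_in=30):
    """Run a single-changepoint detector repeatedly over a stream.

    make_detector(mean, var) is assumed to return an object with update(x)
    and has_changed() methods; this interface is an illustrative assumption.
    """
    changepoints = []
    buffer = []            # observations collected during the current burn-in
    detector = None
    for t, x in enumerate(stream):
        if detector is None:                     # still in a burn-in period
            buffer.append(x)
            if len(buffer) == burn_in:
                detector = make_detector(np.mean(buffer), np.var(buffer))
                buffer = []
            continue
        detector.update(x)
        if detector.has_changed():
            changepoints.append(t)
            detector = None                      # restart: re-estimate the new regime
    return changepoints
```

In this form, the FX example above amounts to calling continuous_monitor(log_returns(prices), make_detector).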

Figure 5.1: Change detection on a CHF/GBP data stream using AFF and PELT. Panel (a): AFF detections (threshold α); panel (b): PELT detections. The raw data stream is plotted against observation number, with the detected changepoints indicated by vertical dashed lines; black lines indicate that both schemes detect that changepoint (within 3 observations of each other).

5.6 Discussion

This chapter has extended the case of detecting a single changepoint using the forgetting factor framework (Chapter 4) to the context of continuous monitoring, which more closely addresses the challenges of detecting changes in streaming data. Some recent literature is reviewed, and none of the methods seems to satisfy all the requirements of continuous monitoring. Performance metrics for continuous monitoring are then discussed. It is found that the AFF scheme performs similarly for a wide range of step-size values, and so the AFF scheme truly requires only a single control parameter, namely the sensitivity α. An extensive simulation study is performed which shows that the AFF scheme has similar change detection performance to CUSUM and EWMA. However, these two methods require two control parameters while the AFF scheme only requires one. This is important because, on the one hand, simply increasing or decreasing the AFF parameter α will either increase or decrease the algorithm's sensitivity to detecting changes. On the other hand, setting the two control parameters of CUSUM and EWMA is non-trivial. While Table 5.3 shows that some recommended settings give good performance, Table 5.4 shows that other recommended choices can lead to poor performance (EWMA), and that mixing parameter pairs can lead to performance that is either poor or superior to that obtained with the recommended settings (CUSUM). Finally, the AFF scheme is applied to detect changes in the mean of a foreign exchange stream. For comparison, an optimal offline method is run over the same stream, and there is good agreement between the changepoints detected by the two methods. This provides some evidence that the AFF scheme can detect changepoints in real-world data streams. Another application in which this forgetting factor framework has been applied is the detection of relays, a suspicious kind of behaviour in computer network traffic, by extending the framework to an extreme-value scenario [6]. While the preceding chapters have assumed that the data stream consists of univariate observations, the next chapter extends our forgetting factor framework to the detection of changes in multivariate data.

Chapter 6

Multivariate adaptive filtering and change detection

While there are many applications that require the monitoring of a single stream for potential changes, there are situations where it may be desirable to monitor a collection of related streams simultaneously. In a computer network, multiple network traffic ports could be monitored for anomalous behaviour. In the world of finance, a collection of foreign exchange pairs (e.g. see Section 5.5) or a portfolio of share prices could be monitored for an increase in volatility. Indeed, there are scenarios that may not immediately spring to mind, but are no less important; for example, an early reference [7] refers to sampling bomb sites. In this chapter, multivariate forgetting factor schemes are proposed to sequentially detect multiple changepoints in the mean of a multivariate data stream. This is a natural extension of the work discussed in Chapters 3, 4 and 5, although new issues have to be addressed. As before, a particular concern is managing the dependence on control parameters. In the multivariate case this is more complicated, and we settle on an easier interpretation for setting the control parameters, rather than full automation. Section 6.1 reviews some of the multivariate change detection literature. Section 6.2 introduces the notation for the multivariate AFF mean and describes two methods for incorporating an adaptive forgetting factor. Section 6.3 describes a decision rule for detecting a change using this AFF framework; naturally, in the multivariate setting there are more issues to consider when defining such rules. A simulation study in Section 6.4 shows that

this method performs as well as, if not better than, some recently proposed methods, even though it requires only a single control parameter to be specified. Finally, Section 6.5 applies our multivariate methodology to monitoring the volume of port traffic in a computer network.

6.1 Multivariate change detection in the literature

A good overview of multivariate statistical process control can be found in [05]. Several of the univariate charts described in Chapter 2 have been extended to the multivariate setting. For example, there are several multivariate extensions of EWMA [98, 64] and CUSUM [60, 70, 40]. A good comparison of the early methods can be found in []. More recently, a multivariate version of the changepoint model has been proposed in [64, 65]. Besides these, there are methods using regression [3, 5, 4, 66] and LASSO [68]. A recent method [6] using generalised likelihood ratio statistics assumes the streams are independent and normally distributed. Self-starting methods have recently been proposed in [47, 65, 0]. However, all of these methods, while sequential, are only designed to detect a single changepoint. While there are methods explicitly designed for detecting multiple changepoints in multivariate data [9, 0], these methods are usually offline (non-sequential). Again, as in the discussion in Section 5.1, there are no multivariate methods that satisfy all the requirements of continuous monitoring. In Section 6.4, the SSMEWMA method described in [65] will be used as a basis of comparison for our multivariate forgetting factor methods. Although it was only considered in [65] in the context of detecting a single change, it is the method that can be most easily adapted to the continuous monitoring context. However, its methodology relies on sequential regression on the components of the stream after each new data point has been observed. Consequently, it is computationally expensive when the number of components d is large. The multivariate CUSUM method referred to as MC1 in [] will also be used as a benchmark for comparison in Section 6.4.

6.2 Multivariate adaptive forgetting factor mean

Suppose that the process being monitored is now multivariate, and that each observation in the stream x_1, x_2, ... is d-dimensional, i.e.

x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d})^T,   i = 1, 2, ...

This formulation also allows us to consider the sequences x_{1,j}, x_{2,j}, ... for j = 1, 2, ..., d to be a collection of d streams that are being observed simultaneously. For an AFF λ as defined in Section 3.3, the multivariate adaptive forgetting factor (MVAFF) mean \bar{x}_{N,λ} is naturally defined as

\bar{x}_{N,λ} = ( \bar{x}_{N,λ,1}, \bar{x}_{N,λ,2}, ..., \bar{x}_{N,λ,d} )^T,

where each \bar{x}_{N,λ,j} is the AFF mean, as defined in Section 3.3, of the jth component of the stream x_1, x_2, ..., for j = 1, 2, ..., d. To be clear, in this case λ is a single scalar forgetting factor shared by all the component streams. Another formulation is described in Section 6.2.1. The MVAFF mean can be equivalently defined for N ≥ 1 by the vector equations

\bar{x}_{N,λ} = [diag(w_{N,λ})]^{-1} m_{N,λ},   (6.1)
m_{N,λ} = λ_{N-1} m_{N-1,λ} + x_N,   m_{0,λ} = 0_d,
w_{N,λ} = λ_{N-1} w_{N-1,λ} + 1_d,   w_{0,λ} = 0_d,

where diag(w_{N,λ}) denotes the d × d diagonal matrix with the entries of w_{N,λ} on its diagonal (so that each component of m_{N,λ} is divided by the corresponding component of w_{N,λ}), 1_d is a vector of length d with all entries equal to 1, and 0_d is a vector of length d with all entries equal to 0. In terms of updating λ, as in Equation (3.8), one possible cost function would be the multivariate analogue of

L_{N+1,λ} (defined in Equation 3.6), defined by

L_{N+1,λ} = [ \bar{x}_{N,λ} − x_{N+1} ]^T [ \bar{x}_{N,λ} − x_{N+1} ].

Then the AFF λ is updated according to

λ_{N+1} = λ_N − η (∂/∂λ) L_{N+1,λ},   (6.2)

which is the natural multivariate analogue of Equation (3.8). As before, η is the step size. Equation (6.2) becomes

λ_{N+1} = λ_N − η [ (∂/∂λ) \bar{x}_{N,λ} ]^T [ \bar{x}_{N,λ} − x_{N+1} ].

The FFF scheme is defined as in Equation (6.1), but by setting λ_N = λ for all N = 1, 2, ..., for some fixed λ.

6.2.1 Adaptive forgetting factors for each stream

In the formulation above, the same AFF λ is used for each component of the stream x_1, x_2, ..., even though the d component streams x_{1,j}, x_{2,j}, ... may be in different states of control. While the above formulation may be the most straightforward, upon reflection it may seem desirable for each component stream to have its own forgetting factor. The d-dimensional MVAFF is defined as the sequence (Λ_1, Λ_2, ...), where Λ_i is the diagonal matrix

Λ_i = diag( λ_{i,1}, λ_{i,2}, ..., λ_{i,d} ),   i = 1, 2, ...   (6.3)

and the sequential update equations in Equation (6.1) become

\bar{x}_{N,λ} = [diag(w_{N,λ})]^{-1} m_{N,λ},   (6.4)
m_{N,λ} = Λ_{N-1} m_{N-1,λ} + x_N,   m_{0,λ} = 0_d,
w_{N,λ} = Λ_{N-1} w_{N-1,λ} + 1_d,   w_{0,λ} = 0_d.

The jth component of Λ_N is updated by

λ_{N+1,j} = λ_{N,j} − η_j [ (∂/∂λ_j) \bar{x}_{N,λ,j} ] [ \bar{x}_{N,λ,j} − x_{N+1,j} ],   (6.5)

where the derivative of \bar{x}_{N,λ,j} is given by Equation (3.5), and η_j is the step size for the jth component. It is possible that this formulation may appear overly complicated due to the number of subscripts involved. However, this formulation can be viewed as giving each component stream x_{1,j}, x_{2,j}, ... its own AFF (λ_{1,j}, λ_{2,j}, ...), for j = 1, 2, ..., d. Furthermore, the separate forgetting factors allow the step size in the update derived from L_{N+1,λ} to be scaled by an estimate of the variance of each component, as described in Section 3.3.6, which is a decided advantage, since it removes some of the dependence on the value of η. Suppose that the variances of the d components are estimated to be σ̂_1², σ̂_2², ..., σ̂_d²; then the multivariate analogue of Equation (3.3) is

λ_{N+1,j} = λ_{N,j} − (η_j / σ̂_j²) [ (∂/∂λ_j) \bar{x}_{N,λ,j} ] [ \bar{x}_{N,λ,j} − x_{N+1,j} ].   (6.6)

As can be seen in Table 6.2, this scaling gives similar change detection performance for a range of values of η. For this reason, the AFF schemes considered in the simulation study in Section 6.4 utilise a separate forgetting factor for each stream. Note that for the multivariate FFF scheme, there is no difference between having a single forgetting factor and separate forgetting factors. In the next section, we discuss different decision rules for detecting a change in a multivariate stream.
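To make these recursions concrete, here is a minimal sketch of Equations (6.4)–(6.6). It is an illustration rather than the thesis code: the derivative recursions for the components of ∂\bar{x}_{N,λ}/∂λ are written in the standard forgetting-factor form (the thesis defines them in Chapter 3, Equation (3.5)), restricting each λ_j to [0, 1] is an assumed safeguard, and the class and argument names are invented for this example.

```python
# A minimal sketch of the per-component MVAFF mean of Equations (6.4)-(6.6).
import numpy as np

class MVAFFMean:
    def __init__(self, d, eta=0.01, lam0=0.95, var_estimates=None):
        self.lam = np.full(d, lam0)        # one forgetting factor per component
        self.eta = eta                     # step size for the gradient updates
        self.var = np.ones(d) if var_estimates is None else np.asarray(var_estimates)
        self.m = np.zeros(d)               # m_{N,lambda}
        self.w = np.zeros(d)               # w_{N,lambda}
        self.dm = np.zeros(d)              # d m_{N,lambda} / d lambda
        self.dw = np.zeros(d)              # d w_{N,lambda} / d lambda

    def update(self, x):
        x = np.asarray(x, dtype=float)
        grad = None
        if self.w.any():                                   # after the first observation
            xbar = self.m / self.w                         # xbar_{N,lambda}
            dxbar = (self.dm - xbar * self.dw) / self.w    # d xbar / d lambda
            grad = (self.eta / self.var) * dxbar * (xbar - x)
        # recursions for the weighted sums and their derivatives (Eq. 6.4),
        # applied with the current forgetting factors
        self.dm = self.m + self.lam * self.dm
        self.dw = self.w + self.lam * self.dw
        self.m = self.lam * self.m + x
        self.w = self.lam * self.w + 1.0
        if grad is not None:
            # variance-scaled gradient step on each lambda_j (Eqs. 6.5 and 6.6)
            self.lam = np.clip(self.lam - grad, 0.0, 1.0)
        return self.m / self.w             # the MVAFF mean after this observation
```

With var_estimates supplied from a burn-in period, the update corresponds to the variance-scaled step of Equation (6.6); left at its default of ones, it reduces to Equation (6.5).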

6.3 Decision rules for multivariate change detection

Suppose the d-dimensional stream x_1, x_2, ... is distributed according to the multivariate normal distribution N(µ, Σ), with d-dimensional mean vector µ and d × d covariance matrix Σ. There are two cases: the component streams can be assumed to be independent, or the covariance can be taken into account.

6.3.1 Assuming the streams are independent

If the component streams are considered to be independent, each stream can be considered to be distributed as

x_{1,j}, x_{2,j}, ... ~ N(µ_j, σ_j²),   j = 1, 2, ..., d.   (6.7)

Estimates µ̂_j and σ̂_j² can be obtained during a burn-in period. Then, at time N, a value

p_j = F_{N(µ̂_j, σ̂_j²)}( \bar{x}_{N,λ,j} )   (6.8)

can be computed, where F_{N(µ, σ²)} is the cdf of N(µ, σ²). As in Chapter 4, p_j can be turned into a p-value p'_j, and we could say that a change has been detected in the jth component stream if p'_j < α for some α ∈ [0, 1]. So far, this is the same procedure as for the univariate case considered in Chapters 4 and 5. However, we do not want to only detect a change in a single component, but rather a change in the stream x_1, x_2, .... Therefore, we compute the p-values p'_1, p'_2, ..., p'_d, and combine these into an overall p-value. There are two prominent methods for combining several p-values: Fisher's method [49] and Stouffer's method [44] (also known as the Z-method). These two methods are briefly described in Appendix A.1. There has been research [95, 96, 57, 35] comparing the two methods, but they are broadly

similar. For Stouffer's method, the p-values p'_1, p'_2, ..., p'_d are combined to give

p' = 1 − F_{N(0,1)} ( (1/√d) Σ_{i=1}^{d} F_{N(0,1)}^{-1}(1 − p'_i) ),

where F_{N(0,1)} is the cdf and F_{N(0,1)}^{-1} is the inverse of the cdf of the N(0, 1) distribution. A change is then signalled when p' < α. Fisher's method combines the p-values via

p' = 1 − F_{χ²_{2d}} ( −2 Σ_{j=1}^{d} log(p'_j) ),

where F_{χ²_{2d}} is the cdf of the chi-squared distribution with 2d degrees of freedom. Again, p' < α signals a change. Both these schemes assume that the p-values (and hence the streams) are independent. Next we will consider the case when the streams are not assumed to be independent.

6.3.2 Estimating the covariance

The covariance matrix Σ of a multivariate stream x_1, x_2, ... can be estimated from the first N observations by Σ̂_N, which can be computed sequentially [4] via

\bar{x}_N = \bar{x}_{N-1} + (1/N)(x_N − \bar{x}_{N-1}),   \bar{x}_0 = 0_d,
B_N = B_{N-1} + x_N x_N^T,   B_0 = 0,
Σ̂_N = (1/(N−1)) [ B_N − N \bar{x}_N \bar{x}_N^T ].

It is possible to compute a forgetting factor version of Σ̂_N, as in [4], but that raises the question of how to set the value of the forgetting factor. Certainly, the AFF methodology considered in Chapter 3 could be employed with a suitable choice of cost function but, as Chapter 8 will show, implementing AFF estimation for the univariate variance is not straightforward.
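A minimal sketch of this plain sequential estimator is given below, with observations treated as column vectors; it is simply the standard recursive form of the sample mean and covariance, and the class name is invented for the illustration.

```python
# A minimal sketch of the burn-in estimator of the mean vector and covariance
# matrix, updated one observation at a time.
import numpy as np

class SequentialCovariance:
    def __init__(self, d):
        self.n = 0
        self.xbar = np.zeros(d)
        self.B = np.zeros((d, d))        # running sum of outer products x x^T

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.xbar += (x - self.xbar) / self.n
        self.B += np.outer(x, x)

    def covariance(self):
        if self.n < 2:
            raise ValueError("need at least two observations")
        return (self.B - self.n * np.outer(self.xbar, self.xbar)) / (self.n - 1)
```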

There are now at least three possibilities:

1. do not estimate the covariance matrix (assume the streams are independent);
2. estimate the covariance matrix during the burn-in, and assume the covariance remains the same after the burn-in;
3. continue to estimate the covariance matrix continuously.

We either assume the streams are independent, or estimate the covariance during a burn-in period. The case where the covariance matrix is continuously estimated is not considered here.

6.3.3 Taking the covariance into account: Brown's method

In many cases it is not reasonable to assume the component streams are independent, and so the covariance between the streams needs to be taken into account. Suppose the covariance matrix is estimated during a burn-in period, using the equations given in Section 6.3.2. First, the method of Section 6.3.1 is followed until the one-sided p-values p'_1, p'_2, ..., p'_d have been computed. Next, an extension of Fisher's method provides a way of combining the p-values while taking the covariance between the streams into account. This method, originally published by M. B. Brown in [0] and slightly improved in [88], is now briefly described. Start by defining X² to be

X² = −2 Σ_{j=1}^{d} log(p'_j).

This is simply Fisher's statistic, described in Appendix A.1. If the p-values are independent, then X² follows a chi-squared distribution with 2d degrees of freedom. If the p-values are not independent, then

E[X²] = 2d,   (6.9)
Var[X²] = 4d + 2 Σ_{i<j} Cov( −2 log p'_i, −2 log p'_j ).   (6.10)

The covariance terms can then be approximated using Gaussian quadrature [0] in terms of the correlation values ρ_{ij}. Recall that if the (i, j)th entry of the covariance matrix is c_{ij}, then ρ_{ij} = c_{ij} / √(c_{ii} c_{jj}). A third-order approximation, given in [88], is then

Cov( −2 log p'_i, −2 log p'_j ) ≈ 3.263 ρ_{ij} + 0.710 ρ_{ij}² + 0.027 ρ_{ij}³.

This approximation is said to work well [88] as long as −0.98 ≤ ρ_{ij} ≤ 0.98, which is a broad range of values since ρ_{ij} ∈ [−1, 1]. Finally, the first two central moments of X² given in Equations (6.9) and (6.10) are matched to the first two moments of a Γ(k̂, θ̂) distribution (see the Satterthwaite–Welch approximation in Section 7.3) by

k̂ = (E[X²])² / Var[X²],   θ̂ = Var[X²] / E[X²].

Then the combined p-value is p' = 1 − F_{Γ(k̂, θ̂)}(X²), where F_{Γ(k̂, θ̂)} is the cdf of the Γ(k̂, θ̂) distribution and, again, a change is signalled if p' < α.
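The three combination rules of this section (Stouffer, Fisher, and Brown's covariance-adjusted Fisher) can be sketched as follows. This is an illustrative implementation rather than the thesis code: it uses SciPy's distribution functions, and the numerical coefficients in the covariance approximation are those of the published third-order fit quoted above, so they should be checked against [88] before use.

```python
# A minimal sketch of the p-value combination rules: Stouffer, Fisher, and
# Brown's moment-matched Gamma adjustment for dependent streams.
import numpy as np
from scipy.stats import norm, chi2, gamma

def stouffer(pvals):
    z = norm.ppf(1.0 - np.asarray(pvals)).sum() / np.sqrt(len(pvals))
    return norm.sf(z)

def fisher(pvals):
    x2 = -2.0 * np.log(np.asarray(pvals)).sum()
    return chi2.sf(x2, df=2 * len(pvals))

def brown(pvals, corr):
    """Fisher's statistic referred to a moment-matched Gamma(k, theta) distribution.

    corr is the (estimated) correlation matrix of the component streams.
    """
    pvals = np.asarray(pvals)
    d = len(pvals)
    x2 = -2.0 * np.log(pvals).sum()
    mean = 2.0 * d
    var = 4.0 * d
    for i in range(d):
        for j in range(i + 1, d):
            r = corr[i, j]
            var += 2.0 * (3.263 * r + 0.710 * r**2 + 0.027 * r**3)
    k, theta = mean**2 / var, var / mean
    return gamma.sf(x2, a=k, scale=theta)
```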

6.4 A simulation study

We follow the method described in Section 5.2 for generating a single univariate normally-distributed data stream with multiple changepoints, and again use the values

Poisson parameter = 30,   D = 30,   G = 30,   M = 10000,   (6.11)

where the first value is the Poisson parameter, D is the period allowed for the algorithm to detect a change, G is the grace period before a change can possibly occur (to give the algorithm time to estimate the stream parameters) and M is the number of changepoints in the stream.

Label Description
MVFFF-S Fixed forgetting factor, using Stouffer's method, assuming independent streams
MVFFF-F Fixed forgetting factor, with Fisher's method, assuming independent streams
MVFFF-Bcov Fixed forgetting factor, with Brown's method, taking covariance into account
MVAFF-S Adaptive forgetting factor, using Stouffer's method, assuming independent streams
MVAFF-F Adaptive forgetting factor, with Fisher's method, assuming independent streams
MVAFF-Bcov Adaptive forgetting factor, with Brown's method, taking covariance into account
MVCUSUM Multivariate version of CUSUM, described in [] as MC1
SSMEWMA Self-starting multivariate EWMA, described in [65]
Table 6.1: Explanation of labels used for the different change detection schemes.

Again, the size of the change in the mean for each regime is uniformly sampled from {±0.25, ±0.5, ±1, ±3}. Then, we combine this stream with three stationary N(0, 1)-distributed streams (no changepoints) and monitor these four streams for changes using multivariate adaptive estimation as described above. Of course, other formulations are possible, but this formulation is simple and easy to analyse. If there are changes occurring in different streams at different times, difficulties could arise in the analysis; for example, if two changes in different component streams occur close together (in time), and a change is detected soon after the later change, which changepoint is being detected? Therefore, we use the formulation where only a single component stream is changing, as in Chapter 5. The changepoint in the multivariate stream is then the location of the changepoint in the non-stationary univariate stream. The multivariate AFF and FFF estimation schemes are then used with either

1. Stouffer's method,
2. Fisher's method, or
3. Brown's method, i.e. Fisher's method taking covariance into account,

to create multivariate change detection schemes. Consequently, these six schemes are labelled as in Table 6.1. As benchmarks for these forgetting factor methods, multivariate versions of CUSUM and EWMA are implemented. The multivariate CUSUM procedure used is MC1 from [] and is labelled MVCUSUM. The multivariate EWMA is a recent and sophisticated self-starting multivariate EWMA [65] that uses regression between the components of the observations, and is labelled SSMEWMA.
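A minimal sketch of this data-generating scheme is given below. The changepoint locations themselves are produced by the mechanism of Section 5.2, which is not reproduced here, so they are simply taken as an input; the function and argument names are invented for the illustration.

```python
# A minimal sketch of the simulated stream used in this study: one component
# whose mean jumps at the supplied changepoint times (jump sizes drawn
# uniformly from {+-0.25, +-0.5, +-1, +-3}) combined with three stationary
# N(0, 1) components.
import numpy as np

def simulate_stream(n, changepoints, d=4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    jumps = np.array([0.25, 0.5, 1.0, 3.0])
    x = rng.standard_normal((n, d))          # all components start as N(0, 1)
    mean = 0.0
    boundaries = list(changepoints) + [n]
    start = 0
    for end in boundaries:
        x[start:end, 0] += mean              # only the first component changes
        mean += rng.choice(jumps) * rng.choice([-1.0, 1.0])
        start = end
    return x
```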

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
MVAFF-S (η, α) (0.00, 0.0) (.0) (6.4)
MVAFF-S (η, α) (0.00, 0.0) (4.5) (43.6)
MVAFF-S (η, α) (0.00, 0.0) (3.3) (459.08)
MVAFF-S (η, α) (0.00, 0.05) (9.7) (76.5)
MVAFF-S (η, α) (0.00, 0.05) (.6) (7.53)
MVAFF-S (η, α) (0.00, 0.05) (0.89) 8.77 (65.49)
MVAFF-F (η, α) (0.00, 0.0) (.6) (76.53)
MVAFF-F (η, α) (0.00, 0.0) (.64) (73.95)
MVAFF-F (η, α) (0.00, 0.0) (.34) 38.4 (303.7)
MVAFF-F (η, α) (0.00, 0.05) (8.8) (6.78)
MVAFF-F (η, α) (0.00, 0.05) (0.06) 5.37 (0.9)
MVAFF-F (η, α) (0.00, 0.05) (9.64) (4.53)
MVAFF-Bcov (η, α) (0.00, 0.0) (.60) (73.79)
MVAFF-Bcov (η, α) (0.00, 0.0) (.63) (7.8)
MVAFF-Bcov (η, α) (0.00, 0.0) (.8) (308.05)
MVAFF-Bcov (η, α) (0.00, 0.05) (8.89) (65.8)
MVAFF-Bcov (η, α) (0.00, 0.05) (0.) 5.03 (00.5)
MVAFF-Bcov (η, α) (0.00, 0.05) (9.83) 6.85 (45.93)
Table 6.2: Summary of algorithm performance over 10000 changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. This table shows that the different MVAFF schemes have similar performance for α = 0.01 and α = 0.05. CCD is the proportion of changepoints correctly detected, and DNF is the proportion of detections that are not false detections. These performance metrics are introduced in Section 5.2. Highlighted entries are discussed in the text.

These algorithms are all compared in Sections 6.4.1 and 6.4.2, which consider the two cases where the streams are (a) independent or (b) dependent.

6.4.1 Experiments and results: independent streams

In this section the streams are normally distributed with covariance matrix Σ_inde = I_4, the 4 × 4 identity matrix. Therefore, each stream is independent of the other streams, and each stream has variance 1. Recall that one of the streams is non-stationary, while the other three streams are stationary. First it is useful to compare the AFF schemes. Table 6.2 shows that in this scenario, when the streams are independent, for the two smaller step-size values the change detection performance of the MVAFF schemes is very similar.

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
MVAFF-S (η, α) (0.00, 0.0) (.0) (6.4)
MVAFF-F (η, α) (0.00, 0.0) (.64) (73.95)
MVAFF-Bcov (η, α) (0.00, 0.0) (.63) (7.8)
MVAFF-S (η, α) (0.00, 0.05) (.6) (7.53)
MVAFF-F (η, α) (0.00, 0.05) (9.64) (4.53)
MVAFF-Bcov (η, α) (0.00, 0.05) (9.83) 6.85 (45.93)
MVFFF-S (λ, α) (0.99, 0.05) (9.5) 5.3 (96.07)
MVFFF-F (λ, α) (0.99, 0.05) (9.49) (9.43)
MVFFF-Bcov (λ, α) (0.99, 0.05) (9.3) .3 (93.47)
MVAFF-F (η, α) (0.0, 0.05) (0.06) 5.37 (0.9)
MVAFF-Bcov (η, α) (0.0, 0.05) (0.) 5.03 (00.5)
Table 6.3: Summary of algorithm performance over 10000 changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. This table shows that, when the streams are independent, Fisher's method and Brown's method yield very similar results for the MVAFF and MVFFF schemes. Highlighted entries are discussed in the text.

The CCD, DNF and ARL1 values are extremely close, and ARL0 is very similar, although slightly larger for the smaller of the two. For example, looking at MVAFF-S with α = 0.01 and the two smaller step sizes (indicated in bold in Table 6.2), the CCD, DNF and ARL1 values are almost exactly the same, while the ARL0 values are very similar. For the MVAFF-F and MVAFF-Bcov schemes with α = 0.01 there is similar agreement for those two step sizes. However, for all the MVAFF schemes, using the largest of the three step sizes results in different behaviour to using the two smaller values. Therefore, although the value of η may be unimportant as long as it is small enough, larger values of η will produce different performance. This is not quite as strong as the univariate case in Chapter 5, where Table 5.2 shows that step sizes of 0.1, 0.01 and 0.001 all produce very similar change detection performance. However, there is still some freedom with which to choose η and still obtain very similar results. Since the smaller step size appears to produce slightly better ARL0, with all other metrics being equal, this is the value used in Table 6.4 when MVAFF and MVFFF are compared to MVCUSUM and SSMEWMA. Table 6.3 shows that when the streams are independent, using Fisher's method and Brown's method yields almost identical results. For example, for MVAFF-F and MVAFF-Bcov with the parameter pair shown in bold, the CCD, DNF, ARL1 and ARL0 values are virtually identical. Interestingly, Stouffer's method also performs similarly, but at a different value of α.

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
MVCUSUM (k, h) (.00, .490) (.35) 63.7 (65.0)
MVCUSUM (k, h) (0.50, 4.770) (9.) 65.3 (63.0)
MVCUSUM (k, h) (0.5, 8.000) (7.88) 6.47 (54.6)
SSMEWMA (·, h) (0.0, .907) (0.6) (84.09)
SSMEWMA (·, h) (0.0, .9) (0.80) (48.33)
SSMEWMA (·, h) (0.0, .94) (.84) 50.3 (49.64)
MVFFF-Bcov (λ, α) (0.99, 0.050) (9.3) .3 (93.47)
MVFFF-Bcov (λ, α) (0.95, 0.005) (9.78) (87.09)
MVFFF-Bcov (λ, α) (0.95, 0.00) (8.8) (67.9)
MVAFF-Bcov (α) (0.0) (.8) (308.05)
MVAFF-Bcov (α) (0.05) (9.83) 6.85 (45.93)
MVAFF-Bcov (α) (0.0) (8.9) .9 (9.48)
MVAFF-S (α) (0.0) (3.3) (459.08)
MVAFF-S (α) (0.05) (0.89) 8.77 (65.49)
MVAFF-S (α) (0.0) (9.59) 7.3 (00.6)
Table 6.4: Summary of algorithm performance over 10000 changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. This table compares the forgetting factor methods to MVCUSUM and SSMEWMA when the streams are independent. The MVAFF methods use the smaller step size identified above. Highlighted entries are discussed in the text.

Table 6.3 also shows that the MVFFF (fixed forgetting) schemes perform very similarly for Stouffer's, Fisher's and Brown's methods. Table 6.4 compares the forgetting factor schemes to MVCUSUM and SSMEWMA. First, one notices that MVCUSUM has much lower ARL0 than the other methods, without a great improvement in the other metrics. It is more interesting to compare the MVAFF schemes to SSMEWMA. Comparing the first SSMEWMA scheme to MVAFF-Bcov with α = 0.01 (both indicated in bold), we see that these two schemes have similar DNF and ARL0, but the MVAFF-Bcov scheme has higher CCD and ARL1. In fact, this MVAFF-Bcov (in bold) can be compared in the same way with all three SSMEWMA schemes (MVAFF-Bcov has the same or higher CCD, DNF and ARL0, but also has higher ARL1). Comparing the same (bold) SSMEWMA scheme to the MVAFF-S scheme shown in bold, we see that the MVAFF-S has the same CCD, and higher DNF, ARL1 and ARL0. Recall that a lower ARL1 indicates better performance, while higher values for all the other metrics indicate better performance. Therefore, while the MVAFF schemes may have the

same or better values for CCD, DNF and ARL0, they also have higher ARL1 values, which is not an improvement. Similar comparisons can be made for the other SSMEWMA schemes (different parameter choices). Therefore, it is not clear whether MVAFF or SSMEWMA has better performance, but they can at least be said to have comparable performance. Table 6.4 also contains values for the fixed forgetting scheme MVFFF-Bcov. While the MVFFF-Bcov method has lower ARL0 than SSMEWMA, and so may not be directly comparable to it, we can compare MVFFF-Bcov to MVCUSUM. Comparing MVFFF-Bcov with (λ, α) = (0.95, 0.005) (in blue) to MVCUSUM with (k, h) = (0.50, 4.770) (also in blue), we see that they have the same CCD, but MVFFF-Bcov has much higher DNF and ARL0, at the expense of a slightly higher ARL1. Again, there is no clear winner here, but MVFFF-Bcov performs well in comparison to MVCUSUM. To summarise our results when the streams are independent: while the forgetting factor methods may not clearly outperform SSMEWMA and MVCUSUM, they do perform well in comparison. Also, it is relatively easy to set meaningful values for their control parameters, and the MVAFF schemes do not appear to depend on the step size η as long as it is small enough. The next section will consider the case when the component streams are dependent.

6.4.2 Experiments and results: dependent streams

Suppose that a single univariate stream is generated with changes as described above, but now the other streams are generated so that each 4-dimensional observation is generated as before, except that the normally-distributed observations are generated with mean vector

µ_dep = (µ, 0, 0, 0)^T and covariance matrix Σ_dep.   (6.12)

The matrix Σ_dep was obtained by randomly generating normally-distributed values with mean 0.5 for the upper-triangular entries of a 4 × 4 matrix, adding 1s to the diagonal, symmetrising, and then

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
MVAFF-F (η, α) (0.00, 0.0) (.0) (93.)
MVAFF-F (η, α) (0.00, 0.0) (.4) (59.64)
MVAFF-F (η, α) (0.00, 0.0) (.36) (8.7)
MVAFF-Bcov (η, α) (0.00, 0.05) (.00) (9.54)
MVAFF-Bcov (η, α) (0.00, 0.05) (.45) (5.36)
MVAFF-Bcov (η, α) (0.00, 0.05) (.04) 4.07 (30.05)
MVAFF-Bcov (η, α) (0.00, 0.0) (3.3) (69.83)
MVAFF-Bcov (η, α) (0.00, 0.0) (5.3) (4.47)
MVAFF-Bcov (η, α) (0.00, 0.0) (3.6) (585.96)
MVAFF-S (η, α) (0.00, 0.0) (.3) (90.7)
MVAFF-S (η, α) (0.00, 0.0) (3.34) 83.0 (6.90)
MVAFF-S (η, α) (0.00, 0.0) (3.) (36.43)
Table 6.5: Summary of algorithm performance over 10000 changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. This table shows how, for dependent streams, the value of η affects the performance of the AFF schemes. Highlighted entries are discussed in the text.

checking that it is positive-definite. The value of µ, the first component of the mean vector µ_dep, changes to µ + δ_i at the ith changepoint, where, as in Section 6.4.1, δ_i ∈ {±0.25, ±0.5, ±1, ±3}. Table 6.5 shows that when the streams are dependent, as above, the value of η does affect how the MVAFF schemes perform. For example, consider MVAFF-Bcov with α = 0.05 (in bold): as η decreases through the three values shown, CCD decreases while DNF, ARL1 and ARL0 all increase, to varying degrees. The differences may not be great in some cases, such as for MVAFF-Bcov with α = 0.01, which has fairly similar CCD, DNF and ARL1 values for the three step sizes, but there is a significant increase in ARL0. This table shows that we need to take some care in setting η, and again it seems as if smaller values are better, at least in terms of ARL0. For this reason, it is again recommended to set η to the smallest of these values for the MVAFF schemes. Table 6.6 compares the forgetting factor schemes with MVCUSUM and SSMEWMA when the streams are dependent. MVCUSUM has improved CCD, but even worse DNF and ARL0, and so would not be recommended for use in this setting. Again, we focus on comparing the MVAFF schemes to SSMEWMA. The most favourable match is perhaps the third SSMEWMA scheme (in bold) compared to the MVAFF-Bcov scheme with α = 0.05 (in bold). These two

Algo Params Values CCD DNF ARL1 SDRL1 ARL0 SDRL0
MVCUSUM (k, h) (.00, .490) (7.74) 5.35 (5.84)
MVCUSUM (k, h) (0.50, 4.770) (8.05) (37.95)
MVCUSUM (k, h) (0.5, 8.000) (7.56) 54.9 (50.03)
SSMEWMA (·, h) (0.0, .907) (0.3) (95.3)
SSMEWMA (·, h) (0.0, .9) (9.60) 47.0 (48.5)
SSMEWMA (·, h) (0.0, .94) (.33) (45.40)
MVFFF-Bcov (λ, α) (0.99, 0.050) (0.80) (59.85)
MVFFF-Bcov (λ, α) (0.95, 0.005) (.8) 0.5 (94.5)
MVFFF-Bcov (λ, α) (0.95, 0.00) (.09) 5.96 (36.)
MVAFF-Bcov (η, α) (0.00, 0.0) (3.6) (585.96)
MVAFF-Bcov (η, α) (0.00, 0.05) (.04) 4.07 (30.05)
MVAFF-Bcov (η, α) (0.00, 0.0) (0.4) (40.36)
MVAFF-Bcov (η, α) (0.00, 0.05) (.45) (5.36)
Table 6.6: Summary of algorithm performance over 10000 changepoints, with change sizes in {±0.25, ±0.5, ±1, ±3} and burn-in B = 30. This table compares the performance of the forgetting factor schemes with MVCUSUM and SSMEWMA when the streams are dependent. Highlighted entries are discussed in the text.

schemes have similar CCD, but MVAFF-Bcov has significantly higher DNF and ARL0, while also having higher ARL1. A similar comparison can be made with the MVAFF-Bcov scheme highlighted in green. Again, as before, there is no clear winner, but performance is at least comparable. The first SSMEWMA scheme (in red) can be compared to the MVAFF-Bcov scheme in bold, and is found to have similar CCD and DNF, but higher ARL0 and lower ARL1, suggesting this SSMEWMA scheme has better performance. However, it would be difficult to know a priori that this choice of parameters would yield better performance. The (fixed forgetting) MVFFF-Bcov scheme performs surprisingly well in comparison to the SSMEWMA schemes for dependent streams, when one considers that its performance for independent streams was not especially good. For example, MVFFF-Bcov with (λ, α) = (0.99, 0.050) (in blue) compared to the SSMEWMA scheme also highlighted in blue shows similar CCD, better DNF and ARL0, but significantly higher ARL1. In summary, for dependent streams it appears that SSMEWMA in most cases has slightly better performance than the forgetting factor schemes, but there are cases where performance is at least comparable, if not better, for MVAFF. Overall, it appears that MVAFF-

Bcov performs well, whether or not the streams are dependent. In addition, MVAFF-Bcov is more efficient than SSMEWMA because, as mentioned in Section 6.1, SSMEWMA performs sequential linear regression for each new observation, which is at least of order O(d²), where d is the number of components in the stream. On the other hand, once the covariance matrix has been estimated, MVAFF-Bcov is of order O(d). This will make a difference when d is large. In the next section the multivariate MVAFF methodology is applied to detecting changes in computer network traffic. This work originally appeared in [3].

6.5 Monitoring a computer network

Attacks on computer networks usually cause changes in network traffic that are often only observed in the final stages of the attack. Examples include worm-based attacks [55], distributed denial-of-service (DDoS) attacks [08], and port-scanning [53], [8]. Over the last two decades there has been much research into methods that attempt to detect these attacks in their early stages. Intrusion detection systems (IDS) have historically been characterised as either signature-detection systems or anomaly-detection systems [84]. Signature-based methods, which usually operate at network packet level, detect attacks by comparing network behaviour against a database of known attack behaviours, called signatures. Examples of such methods are Bro [9] and Snort [30]. The strength of these IDS is that they often operate at host level, which distributes the computational burden. Anomaly-detection methods attempt to detect any unusual activity in the network by monitoring for deviations from the network's standard behaviour [45]. Examples include D-WARD [09] and MULTOPS [56]. The advantage of these methods over signature-based methods is that anomaly-detectors have the potential to detect a wider variety of attacks, and do not require the compilation (and regular updating) of a signature database. There are anomaly-detection methods in the networks literature, but many require offline processing (e.g. [9]). The analysis in this section is performed using NetFlow data []. NetFlow is a protocol for collecting and storing statistics on the packet volumes of IP flows through a router. It is a much coarser-grained representation than packet capture.

Figure 6.1: An example of (anonymised) NetFlow data, with fields: Date flow start, Duration, Proto, Src IP Addr, Dst IP Addr, Src Pt, Dst Pt, Packets, Bytes.

An (anonymised) example of NetFlow data is given in Figure 6.1. For a specific flow (a collection of packets) between two IP addresses, NetFlow data embodies information about protocols, packet numbers and volumes. NetFlow data allows an organisation-wide view of network traffic, but can be large and unwieldy to handle. Since flows may be characterised by their (source or destination) ports, we apply our multivariate AFF change detector to monitor the volume of network traffic flowing through selected TCP ports. In this way, our anomaly detector will operate at the level of a router, or a computer monitoring a router.

6.5.1 Change detection on NetFlow data

The multivariate change detection methodology developed and tested above is now deployed on real data; specifically, NetFlow data collected on a single router at Imperial College over a 4-day period in 2009. There are numerous options for selecting or designing features for an anomaly detector to run across. In this case, for simplicity, we consider two variables: the volume of traffic on destination port 80 (http), and the volume of traffic on all other ports. This choice was made partially based on knowledge of the router's role. Since NetFlow data is essentially continuous-time, some binning is required for our methodology. In the example we provide, a binning of 100 minutes is used. Thus, the stream consists of sequential 2-vectors reporting the volume of traffic on port 80 and on all other ports. We use a log transformation of both variables since they have only positive support. We do not commit to these being the best or right choices, but rather intend to provide an illustration of the methodology. Figure 6.2 provides a representation of the raw data, with markers to demonstrate changepoints identified by our methodology (MVAFF-S).
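The feature construction just described can be sketched as follows, assuming the NetFlow records have already been parsed into a table. The column names (timestamp, dst_port, n_bytes), the use of pandas, and the guard against empty bins via log1p are all assumptions made for this illustration, not details of the thesis pipeline.

```python
# A minimal sketch of the feature construction: total bytes per fixed-width
# time bin for destination port 80 versus all other ports, on a log scale.
import numpy as np
import pandas as pd

def port_volume_features(flows: pd.DataFrame, bin_width="100min"):
    """flows: one row per flow record; timestamp is assumed to be a datetime column."""
    flows = flows.copy()
    flows["group"] = np.where(flows["dst_port"] == 80, "port80", "other")
    binned = (flows
              .set_index("timestamp")
              .groupby("group")["n_bytes"]
              .resample(bin_width)
              .sum()
              .unstack("group", fill_value=0))
    # log transform (log1p guards against bins with zero traffic)
    return np.log1p(binned)
```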

For this illustration, no attempt is made to handle the obvious seasonality present in the data. Indeed, while it is possible to attempt to track any data seasonalities, the benefit of doing so is not clear: a deviation from a specific seasonality may or may not be an indication of a changepoint. We emphasise that this methodology can be applied to detect changes in d-dimensional streams, where d > 2, and that we have used d = 2 here merely for illustrative purposes. Furthermore, we reiterate that since this is real data, we have no way of knowing the location of the true changepoints (if indeed any occur), and that this section is simply to demonstrate the methodology in action.

Figure 6.2: Detecting changes in NetFlow data traffic across two ports (total bytes per bin, against bin number, for port 80 and for all other ports). Changepoints detected by MVAFF-S are indicated by vertical lines.

In this context of broader application, this analytic is intended to filter NetFlow data in an attempt to reduce the information overload on the network analyst. Thus, the analyst would not be routinely concerned with the raw NetFlow data, but would be presented with summaries in relation to detected anomalies. It is worth noting, in the context of organisation-wide network traffic analysis, that triage of detected anomalies is always required. For example, an organisation-wide software update would very likely result in a detected anomaly, which is naturally explained by the analyst. Consider Figure 6.3 as a simple example of what could be provided to an analyst. Each

Figure 6.3: Activity diagram of nodes in the network when changes are detected. Note that the thicker the arrow, the greater the volume of traffic across that connection.

of the four graphs refers to the flows (source and destination IPs) observed in the four bins which are flagged in Figure 6.2. Additionally, the widths of the edges represent the volume of traffic for those edges. A notable feature is that there are distinct types of anomaly, some involving few nodes, with others on the order of hundreds. It is worth noting that this router handles around 5000 nodes, so this is a significant reduction. Of course, much more refined analytics can be proposed, but such proposals must account for computational aspects, particularly data storage.

6.6 Discussion

In this chapter the AFF scheme has been extended from the univariate to the multivariate setting. Two formulations are proposed for an MVAFF framework, one using a single scalar forgetting factor λ and the other using an AFF for each component of the multivariate stream. The latter


RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

DEPARTMENT OF ECONOMICS ISSN DISCUSSION PAPER 20/07 TWO NEW EXPONENTIAL FAMILIES OF LORENZ CURVES

DEPARTMENT OF ECONOMICS ISSN DISCUSSION PAPER 20/07 TWO NEW EXPONENTIAL FAMILIES OF LORENZ CURVES DEPARTMENT OF ECONOMICS ISSN 1441-549 DISCUSSION PAPER /7 TWO NEW EXPONENTIAL FAMILIES OF LORENZ CURVES ZuXiang Wang * & Russell Smyth ABSTRACT We resent two new Lorenz curve families by using the basic

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

Principles of Computed Tomography (CT)

Principles of Computed Tomography (CT) Page 298 Princiles of Comuted Tomograhy (CT) The theoretical foundation of CT dates back to Johann Radon, a mathematician from Vienna who derived a method in 1907 for rojecting a 2-D object along arallel

More information

q-ary Symmetric Channel for Large q

q-ary Symmetric Channel for Large q List-Message Passing Achieves Caacity on the q-ary Symmetric Channel for Large q Fan Zhang and Henry D Pfister Deartment of Electrical and Comuter Engineering, Texas A&M University {fanzhang,hfister}@tamuedu

More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

Estimating function analysis for a class of Tweedie regression models

Estimating function analysis for a class of Tweedie regression models Title Estimating function analysis for a class of Tweedie regression models Author Wagner Hugo Bonat Deartamento de Estatística - DEST, Laboratório de Estatística e Geoinformação - LEG, Universidade Federal

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

The Binomial Approach for Probability of Detection

The Binomial Approach for Probability of Detection Vol. No. (Mar 5) - The e-journal of Nondestructive Testing - ISSN 45-494 www.ndt.net/?id=7498 The Binomial Aroach for of Detection Carlos Correia Gruo Endalloy C.A. - Caracas - Venezuela www.endalloy.net

More information

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS Electronic Transactions on Numerical Analysis. Volume 44,. 53 72, 25. Coyright c 25,. ISSN 68 963. ETNA ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES

More information

MULTIVARIATE SHEWHART QUALITY CONTROL FOR STANDARD DEVIATION

MULTIVARIATE SHEWHART QUALITY CONTROL FOR STANDARD DEVIATION MULTIVARIATE SHEWHART QUALITY CONTROL FOR STANDARD DEVIATION M. Jabbari Nooghabi, Deartment of Statistics, Faculty of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad-Iran. and H. Jabbari

More information

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP Submitted to the Annals of Statistics arxiv: arxiv:1706.07237 CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP By Johannes Tewes, Dimitris N. Politis and Daniel J. Nordman Ruhr-Universität

More information

INTRODUCTION. Please write to us at if you have any comments or ideas. We love to hear from you.

INTRODUCTION. Please write to us at if you have any comments or ideas. We love to hear from you. Casio FX-570ES One-Page Wonder INTRODUCTION Welcome to the world of Casio s Natural Dislay scientific calculators. Our exeriences of working with eole have us understand more about obstacles eole face

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Chemical Kinetics and Equilibrium - An Overview - Key

Chemical Kinetics and Equilibrium - An Overview - Key Chemical Kinetics and Equilibrium - An Overview - Key The following questions are designed to give you an overview of the toics of chemical kinetics and chemical equilibrium. Although not comrehensive,

More information

A New Asymmetric Interaction Ridge (AIR) Regression Method

A New Asymmetric Interaction Ridge (AIR) Regression Method A New Asymmetric Interaction Ridge (AIR) Regression Method by Kristofer Månsson, Ghazi Shukur, and Pär Sölander The Swedish Retail Institute, HUI Research, Stockholm, Sweden. Deartment of Economics and

More information

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Ketan N. Patel, Igor L. Markov and John P. Hayes University of Michigan, Ann Arbor 48109-2122 {knatel,imarkov,jhayes}@eecs.umich.edu

More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES

VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Journal of Sound and Vibration (998) 22(5), 78 85 VIBRATION ANALYSIS OF BEAMS WITH MULTIPLE CONSTRAINED LAYER DAMPING PATCHES Acoustics and Dynamics Laboratory, Deartment of Mechanical Engineering, The

More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

Estimation of Separable Representations in Psychophysical Experiments

Estimation of Separable Representations in Psychophysical Experiments Estimation of Searable Reresentations in Psychohysical Exeriments Michele Bernasconi (mbernasconi@eco.uninsubria.it) Christine Choirat (cchoirat@eco.uninsubria.it) Raffaello Seri (rseri@eco.uninsubria.it)

More information

Supplementary Materials for Robust Estimation of the False Discovery Rate

Supplementary Materials for Robust Estimation of the False Discovery Rate Sulementary Materials for Robust Estimation of the False Discovery Rate Stan Pounds and Cheng Cheng This sulemental contains roofs regarding theoretical roerties of the roosed method (Section S1), rovides

More information

Optimal Learning Policies for the Newsvendor Problem with Censored Demand and Unobservable Lost Sales

Optimal Learning Policies for the Newsvendor Problem with Censored Demand and Unobservable Lost Sales Otimal Learning Policies for the Newsvendor Problem with Censored Demand and Unobservable Lost Sales Diana Negoescu Peter Frazier Warren Powell Abstract In this aer, we consider a version of the newsvendor

More information

Period-two cycles in a feedforward layered neural network model with symmetric sequence processing

Period-two cycles in a feedforward layered neural network model with symmetric sequence processing PHYSICAL REVIEW E 75, 4197 27 Period-two cycles in a feedforward layered neural network model with symmetric sequence rocessing F. L. Metz and W. K. Theumann Instituto de Física, Universidade Federal do

More information

Guaranteed In-Control Performance for the Shewhart X and X Control Charts

Guaranteed In-Control Performance for the Shewhart X and X Control Charts Guaranteed In-Control Performance for the Shewhart X and X Control Charts ROB GOEDHART, MARIT SCHOONHOVEN, and RONALD J. M. M. DOES University of Amsterdam, Plantage Muidergracht, 08 TV Amsterdam, The

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

Re-entry Protocols for Seismically Active Mines Using Statistical Analysis of Aftershock Sequences

Re-entry Protocols for Seismically Active Mines Using Statistical Analysis of Aftershock Sequences Re-entry Protocols for Seismically Active Mines Using Statistical Analysis of Aftershock Sequences J.A. Vallejos & S.M. McKinnon Queen s University, Kingston, ON, Canada ABSTRACT: Re-entry rotocols are

More information

Slash Distributions and Applications

Slash Distributions and Applications CHAPTER 2 Slash Distributions and Alications 2.1 Introduction The concet of slash distributions was introduced by Kafadar (1988) as a heavy tailed alternative to the normal distribution. Further literature

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

An Improved Generalized Estimation Procedure of Current Population Mean in Two-Occasion Successive Sampling

An Improved Generalized Estimation Procedure of Current Population Mean in Two-Occasion Successive Sampling Journal of Modern Alied Statistical Methods Volume 15 Issue Article 14 11-1-016 An Imroved Generalized Estimation Procedure of Current Poulation Mean in Two-Occasion Successive Samling G. N. Singh Indian

More information

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA Proceedings of the 2011 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelsach, K. P. White, and M. Fu, eds. EFFICIENT RARE EVENT SIMULATION FOR HEAVY-TAILED SYSTEMS VIA CROSS ENTROPY Jose

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

A Parallel Algorithm for Minimization of Finite Automata

A Parallel Algorithm for Minimization of Finite Automata A Parallel Algorithm for Minimization of Finite Automata B. Ravikumar X. Xiong Deartment of Comuter Science University of Rhode Island Kingston, RI 02881 E-mail: fravi,xiongg@cs.uri.edu Abstract In this

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

Robust Predictive Control of Input Constraints and Interference Suppression for Semi-Trailer System

Robust Predictive Control of Input Constraints and Interference Suppression for Semi-Trailer System Vol.7, No.7 (4),.37-38 htt://dx.doi.org/.457/ica.4.7.7.3 Robust Predictive Control of Inut Constraints and Interference Suression for Semi-Trailer System Zhao, Yang Electronic and Information Technology

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation

Modeling and Estimation of Full-Chip Leakage Current Considering Within-Die Correlation 6.3 Modeling and Estimation of Full-Chi Leaage Current Considering Within-Die Correlation Khaled R. eloue, Navid Azizi, Farid N. Najm Deartment of ECE, University of Toronto,Toronto, Ontario, Canada {haled,nazizi,najm}@eecg.utoronto.ca

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

An Analysis of TCP over Random Access Satellite Links

An Analysis of TCP over Random Access Satellite Links An Analysis of over Random Access Satellite Links Chunmei Liu and Eytan Modiano Massachusetts Institute of Technology Cambridge, MA 0239 Email: mayliu, modiano@mit.edu Abstract This aer analyzes the erformance

More information

KEY ISSUES IN THE ANALYSIS OF PILES IN LIQUEFYING SOILS

KEY ISSUES IN THE ANALYSIS OF PILES IN LIQUEFYING SOILS 4 th International Conference on Earthquake Geotechnical Engineering June 2-28, 27 KEY ISSUES IN THE ANALYSIS OF PILES IN LIQUEFYING SOILS Misko CUBRINOVSKI 1, Hayden BOWEN 1 ABSTRACT Two methods for analysis

More information

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling Scaling Multile Point Statistics or Non-Stationary Geostatistical Modeling Julián M. Ortiz, Steven Lyster and Clayton V. Deutsch Centre or Comutational Geostatistics Deartment o Civil & Environmental Engineering

More information

Evaluating Process Capability Indices for some Quality Characteristics of a Manufacturing Process

Evaluating Process Capability Indices for some Quality Characteristics of a Manufacturing Process Journal of Statistical and Econometric Methods, vol., no.3, 013, 105-114 ISSN: 051-5057 (rint version), 051-5065(online) Scienress Ltd, 013 Evaluating Process aability Indices for some Quality haracteristics

More information

START Selected Topics in Assurance

START Selected Topics in Assurance START Selected Toics in Assurance Related Technologies Table of Contents Introduction Statistical Models for Simle Systems (U/Down) and Interretation Markov Models for Simle Systems (U/Down) and Interretation

More information

RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING. 3 Department of Chemical Engineering

RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING. 3 Department of Chemical Engineering Coyright 2002 IFAC 15th Triennial World Congress, Barcelona, Sain RUN-TO-RUN CONTROL AND PERFORMANCE MONITORING OF OVERLAY IN SEMICONDUCTOR MANUFACTURING C.A. Bode 1, B.S. Ko 2, and T.F. Edgar 3 1 Advanced

More information

Performance of lag length selection criteria in three different situations

Performance of lag length selection criteria in three different situations MPRA Munich Personal RePEc Archive Performance of lag length selection criteria in three different situations Zahid Asghar and Irum Abid Quaid-i-Azam University, Islamabad Aril 2007 Online at htts://mra.ub.uni-muenchen.de/40042/

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal Econ 379: Business and Economics Statistics Instructor: Yogesh Ual Email: yual@ysu.edu Chater 9, Part A: Hyothesis Tests Develoing Null and Alternative Hyotheses Tye I and Tye II Errors Poulation Mean:

More information

SPACE situational awareness (SSA) encompasses intelligence,

SPACE situational awareness (SSA) encompasses intelligence, JOURNAL OF GUIDANCE,CONTROL, AND DYNAMICS Vol. 34, No. 6, November December 2011 Gaussian Sum Filters for Sace Surveillance: Theory and Simulations Joshua T. Horwood, Nathan D. Aragon, and Aubrey B. Poore

More information

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population Chater 7 and s Selecting a Samle Point Estimation Introduction to s of Proerties of Point Estimators Other Methods Introduction An element is the entity on which data are collected. A oulation is a collection

More information

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018

Computer arithmetic. Intensive Computation. Annalisa Massini 2017/2018 Comuter arithmetic Intensive Comutation Annalisa Massini 7/8 Intensive Comutation - 7/8 References Comuter Architecture - A Quantitative Aroach Hennessy Patterson Aendix J Intensive Comutation - 7/8 3

More information

arxiv:cond-mat/ v2 25 Sep 2002

arxiv:cond-mat/ v2 25 Sep 2002 Energy fluctuations at the multicritical oint in two-dimensional sin glasses arxiv:cond-mat/0207694 v2 25 Se 2002 1. Introduction Hidetoshi Nishimori, Cyril Falvo and Yukiyasu Ozeki Deartment of Physics,

More information

A Quadratic Cumulative Production Model for the Material Balance of Abnormally-Pressured Gas Reservoirs

A Quadratic Cumulative Production Model for the Material Balance of Abnormally-Pressured Gas Reservoirs A Quadratic Cumulative Production Model for the Material Balance of Abnormally-Pressured as Reservoirs F.E. onale, M.S. Thesis Defense 7 October 2003 Deartment of Petroleum Engineering Texas A&M University

More information

Observer/Kalman Filter Time Varying System Identification

Observer/Kalman Filter Time Varying System Identification Observer/Kalman Filter Time Varying System Identification Manoranjan Majji Texas A&M University, College Station, Texas, USA Jer-Nan Juang 2 National Cheng Kung University, Tainan, Taiwan and John L. Junins

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

A Time-Varying Threshold STAR Model of Unemployment

A Time-Varying Threshold STAR Model of Unemployment A Time-Varying Threshold STAR Model of Unemloyment michael dueker a michael owyang b martin sola c,d a Russell Investments b Federal Reserve Bank of St. Louis c Deartamento de Economia, Universidad Torcuato

More information