Wild Binary Segmentation for multiple change-point detection



1 Wild Binary Segmentation for multiple change-point detection. Piotr Fryzlewicz, Department of Statistics, London School of Economics, UK. Isaac Newton Institute, 14 January 2014.

2 Segmentation in a simple function + noise model. We consider the canonical function + noise model $X_t = f_t + \varepsilon_t$, $t = 1, \ldots, T$, where $f_t$ is piecewise-constant with an unknown number $N$ of change-points, possibly increasing with $T$, and the $\varepsilon_t$ are iid Gaussian (for simplicity; this can be extended to various more complex settings). Objective: to estimate the number and the locations of any change-points in $f_t$.

3 Segmentation in a simple function + noise model (contd). [Figure: log-returns on daily closing values of the S&P 500 over approximately 8 trading years ending 26 October; x-axis: Days.] Volatility removed via a GARCH(1,1) fit. Any change-points here?

4 Existing approaches. A substantial number of techniques exist. A brief literature review:
Least-squares (or, more generally, likelihood-type) fit + AIC- or BIC-type penalty: Yao (1988), Yao and Au (1989), Lee (1995), Lavielle (1999, 2005), Lavielle & Moulines (2000), Lebarbier (2005), Pan & Chen (2006), Boysen et al. (2009).
Minimum Description Length: Davis et al. (2006).
$L_1$-type penalties: Davies & Kovac (2001), Rinaldo (2009), Harchaoui & Lévy-Leduc (2010).
Classical wavelet transform: Wang (1995).
Binary Segmentation: Vostrikova (1981), Venkatraman (1992), Bai (1997), Chen et al. (2011), Fryzlewicz & Subba Rao (2012), Cho & Fryzlewicz (2012, 2013).

5 Existing approaches: criticisms. No technique is perfect. Some comments and criticisms:
Least-squares (or, more generally, likelihood-type) fit + AIC- or BIC-type penalty: slow, with computational cost typically of order $O(T^2)$. There have been some efforts to reduce this, e.g. Rigaill (2010) (but still $O(T^2)$ in the worst case) and Killick et al. (2012) (PELT). Both will be revisited in the simulation study.
MDL: the minimisation is not obvious; Davis et al. (2006) use a genetic algorithm, whose output is often (very) random.
$L_1$-type penalties: not optimal for change-point detection, see Brodsky & Darkhovsky (1993). They often lead to spurious detections.
Classical wavelets: hopeless in noisy settings.
Binary Segmentation: more details soon.

6 Focus on Binary Segmentation. Generic algorithm for Binary Segmentation (BS):
1. Find $\tilde f_t$, a step function with one change-point, minimising $\sum_{t=1}^{T} (X_t - \tilde f_t)^2$.
2. Denote the location of the change-point in $\tilde f_t$ by $b$.
3. Perform similar fitting on $1, \ldots, b$ and $b + 1, \ldots, T$.
4. Continue in the same manner until a certain criterion is satisfied.
In principle, Binary Segmentation is fast (typically $O(T \log T)$), conceptually simple, easy to code, tractable theoretically (with some effort), and easy to transfer to other, more complex settings.
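The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the speaker's code: the function names and the stopping rule (stop when the best split reduces the residual sum of squares by less than a hypothetical `min_gain`) are assumptions, since the slide leaves the stopping criterion open at this point.

```python
def best_single_split(x, s, e):
    """Best one-change-point step fit on x[s..e] (0-based, inclusive).

    Returns (b, gain): the left piece is s..b, and gain is the reduction in
    residual sum of squares compared with fitting a single constant.
    """
    n = e - s + 1
    total = sum(x[s:e + 1])
    best_b, best_gain = None, 0.0
    left_sum = 0.0
    for b in range(s, e):
        left_sum += x[b]
        n_left = b - s + 1
        n_right = n - n_left
        right_sum = total - left_sum
        # RSS(constant) - RSS(step at b) equals the between-group sum of squares.
        gain = (left_sum ** 2 / n_left + right_sum ** 2 / n_right
                - total ** 2 / n)
        if best_b is None or gain > best_gain:
            best_b, best_gain = b, gain
    return best_b, best_gain


def binary_segmentation(x, min_gain, s=0, e=None):
    """Recursive BS; min_gain is a hypothetical stopping threshold (assumption)."""
    if e is None:
        e = len(x) - 1
    if e - s < 1:  # a single point cannot be split
        return []
    b, gain = best_single_split(x, s, e)
    if gain < min_gain:
        return []
    return (binary_segmentation(x, min_gain, s, b) + [b]
            + binary_segmentation(x, min_gain, b + 1, e))


# Noiseless toy signal with change-points after indices 19 and 39.
signal = [0.0] * 20 + [5.0] * 20 + [1.0] * 20
found = binary_segmentation(signal, min_gain=1.0)
```

On this noiseless toy signal the first split lands on the larger jump and the recursion then finds the second one; with noise, the choice of stopping rule matters much more.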

7 Binary Segmentation: Haar wavelet interpretation. Denote by $f_t^{s,b,e}$ a step function (vector) starting at index $s$, with a change-point at $b$, ending at $e$. We have
$$b_0 := \arg\min_b \sum_{t=s}^{e} (X_t - f_t^{s,b,e})^2 = \arg\max_b |\langle X, \psi^{s,b,e} \rangle|,$$
where $\psi^{s,b,e}$ is an Unbalanced Haar vector, i.e. a vector which is constant positive for $i = s, \ldots, b$, is constant negative for $i = b + 1, \ldots, e$, sums to zero and sums to one when squared. Thus, change-point candidates are located by inspecting the maxima of $|\langle X, \psi^{s,b,e} \rangle|$ over $b$.
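As a small check of this definition, the sketch below builds an Unbalanced Haar vector from the two stated constraints (sums to zero, squares sum to one) and uses it to locate a single change-point. The function names and 0-based indexing are illustrative assumptions.

```python
import math


def unbalanced_haar(s, b, e):
    """Return psi^{s,b,e} as a list over indices s..e (inclusive), s <= b < e."""
    n = e - s + 1
    n_left = b - s + 1   # number of positive entries
    n_right = e - b      # number of negative entries
    pos = math.sqrt(n_right / (n * n_left))
    neg = -math.sqrt(n_left / (n * n_right))
    return [pos] * n_left + [neg] * n_right


def cusum(x, s, b, e):
    """Inner product <X, psi^{s,b,e}>, with 0-based inclusive indices into x."""
    psi = unbalanced_haar(s, b, e)
    return sum(xi * pi for xi, pi in zip(x[s:e + 1], psi))


psi = unbalanced_haar(0, 2, 9)

# One change-point after index 4: the |CUSUM| is maximised exactly there.
x = [0.0] * 5 + [1.0] * 5
best_b = max(range(0, 9), key=lambda b: abs(cusum(x, 0, b, 9)))
```

The two constants follow directly from solving the zero-sum and unit-norm conditions for a vector with one positive and one negative level.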

8 Binary Segmentation: when can we expect good performance? Since BS fits a one-step function to the current interval $[s, e]$, we can expect the performance to be good if $[s, e]$ contains no more than one change-point. However, things can go disastrously wrong if this is not the case. In the following example, we demonstrate how BS can (spectacularly) fail if the interval $[s, e]$ contains more than one change-point.

9 Binary Segmentation: good versus bad performance. [Figure: example of global (blue) and local (red) CUSUM $|\langle X, \psi^{s,b,e} \rangle|$ as a function of $b$, on data $X$ in black; x-axis: Time.]

10 Main idea of Wild Binary Segmentation. Clearly, it would have been preferable to use the maximum of the red curve as a locator for a change-point candidate. However, it is not clear a priori which start-point $s$ and end-point $e$ to choose. Motivated by this, we propose the following Wild Binary Segmentation (WBS) locator statistic:
$$\arg\max_{s,b,e} |\langle X, \psi^{s,b,e} \rangle|,$$
where the pairs $s, e$ are drawn uniformly over the current data segment a suitable number of times. Checking all $s, e$ would have resulted in cubic computational complexity, which would be prohibitive; hence the random draws. The $b$ that achieves the above maximum is taken as a change-point candidate.
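A minimal sketch of the locator described above, using only the Python standard library. The names, the way intervals are drawn (two distinct endpoints sampled uniformly) and the decision to always include the full segment among the candidates are assumptions made for illustration; the talk does not spell out these details.

```python
import math
import random


def cusum_abs_max(x, s, e):
    """Return (b, |<X, psi^{s,b,e}>|) maximised over b in s..e-1."""
    n = e - s + 1
    total = sum(x[s:e + 1])
    best_b, best_val, left_sum = s, -1.0, 0.0
    for b in range(s, e):
        left_sum += x[b]
        n_left, n_right = b - s + 1, e - b
        val = abs(math.sqrt(n_right / (n * n_left)) * left_sum
                  - math.sqrt(n_left / (n * n_right)) * (total - left_sum))
        if val > best_val:
            best_b, best_val = b, val
    return best_b, best_val


def wbs_locator(x, s, e, n_draws, rng):
    """Maximise |CUSUM| over b and over randomly drawn sub-intervals of [s, e]."""
    intervals = [(s, e)]  # assumption: the full segment is always a candidate
    for _ in range(n_draws):
        lo, hi = sorted(rng.sample(range(s, e + 1), 2))
        intervals.append((lo, hi))
    best = None
    for lo, hi in intervals:
        b, val = cusum_abs_max(x, lo, hi)
        if best is None or val > best[1]:
            best = (b, val, lo, hi)
    return best


# A short bump in the middle of a long flat signal.
x = [0.0] * 100 + [1.0] * 10 + [0.0] * 100
b, val, lo, hi = wbs_locator(x, 0, len(x) - 1, n_draws=200, rng=random.Random(0))
```

The returned interval `(lo, hi)` shows which draw produced the winning statistic, which is useful when checking that short, well-placed intervals are the ones doing the work.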

11 Motivation for WBS. If the number of draws is large enough, we will be able to guarantee, with high probability, particularly favourable draws for which, for example, $[s, e]$ contains only one change-point (or is sufficiently close to this situation, as in the example above). The number of draws needed to guarantee this is not large, as will be shown later.

12 Stopping criteria for BS and WBS. Two different approaches:
1. Thresholding. In BS combined with the thresholding approach, we stop on the current interval $[s, e]$ when $\max_b |\langle X, \psi^{s,b,e} \rangle| < \zeta_T$. In WBS, we stop when $\max_{s,b,e} |\langle X, \psi^{s,b,e} \rangle| < \zeta_T$. The threshold $\zeta_T$ will be different for the two algorithms.
2. New information criterion for WBS. Alternatively, for WBS, we propose what we call the strengthened Schwarz Information Criterion (sSIC). It works by running WBS to the end, then pruning back to retain only those estimated change-points that correspond to the $k_0$ largest statistics $\max_b |\langle X, \psi^{s,b,e} \rangle|$, where
$$k_0 = \arg\min_{k=0,\ldots,K} \left\{ \frac{T}{2} \log \hat\sigma_k^2 + k \log^\alpha T \right\},$$
with $\hat\sigma_k^2$ being the MLE of the residual variance and $\alpha > 1$.
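The sSIC pruning step can be sketched as follows. This is a hedged illustration: it assumes the candidates arrive as hypothetical `(location, statistic)` pairs, takes $\hat\sigma_k^2$ to be the residual sum of squares of the piecewise-constant fit divided by $T$, and reads $\log^\alpha T$ as $(\log T)^\alpha$.

```python
import math


def piecewise_rss(x, cpts):
    """RSS of the piecewise-constant fit whose segments end at each b in cpts."""
    bounds = [-1] + sorted(cpts) + [len(x) - 1]
    rss = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = x[lo + 1:hi + 1]
        mean = sum(seg) / len(seg)
        rss += sum((v - mean) ** 2 for v in seg)
    return rss


def ssic_prune(x, candidates, alpha=1.01):
    """Keep the k0 candidates with the largest statistics, k0 minimising sSIC.

    candidates: list of (b, statistic) pairs with distinct b < len(x) - 1.
    Assumes the residual variance is strictly positive (noisy data).
    """
    T = len(x)
    ranked = [b for b, _ in sorted(candidates, key=lambda c: -c[1])]

    def ssic(k):
        sigma2 = piecewise_rss(x, ranked[:k]) / T  # MLE of residual variance
        return T / 2 * math.log(sigma2) + k * math.log(T) ** alpha

    k0 = min(range(len(ranked) + 1), key=ssic)
    return sorted(ranked[:k0])


# Step of height 3 after index 19, plus a small deterministic wiggle as "noise".
x = [(3.0 if t >= 20 else 0.0) + 0.1 * (-1) ** t for t in range(40)]
kept = ssic_prune(x, [(9, 0.2), (19, 9.5), (29, 0.3)])
```

Here the two spurious candidates do not reduce the residual variance, so the penalty term rules them out and only the true change-point survives.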

13 Comparison of BS and WBS in theory.
Assumption 1.
1. The random sequence $\varepsilon_t$ is iid Gaussian with mean zero and variance 1.
2. The sequence $f_t$ is bounded, i.e. $|f_t| < \bar f < \infty$.
3. The magnitudes of the change-points are bounded from below, i.e. $\min_{i=1,\ldots,N} |f_{\eta_i} - f_{\eta_i - 1}| > \underline{f} > 0$.
Assumption 2 (for BS). The minimum spacing between change-points satisfies $\min_{i=1,\ldots,N+1} |\eta_i - \eta_{i-1}| > \delta_T$, where $\delta_T = O(T^\Theta)$ with $\Theta \in (3/4, 1]$.
Assumption 3 (for WBS). The minimum spacing between change-points satisfies $\min_{i=1,\ldots,N+1} |\eta_i - \eta_{i-1}| > \delta_T$, where $\delta_T \ge C \log T$ for a large enough $C$.

14 Consistency of the BS algorithm. Theorem (BS). Suppose Assumptions 1 and 2 hold. Let $N$ and $\eta_1, \ldots, \eta_N$ denote, respectively, the number and locations of change-points. Let $\hat N$ denote the number, and $\hat\eta_1, \ldots, \hat\eta_{\hat N}$ the locations, sorted in increasing order, of the change-point estimates obtained by the standard Binary Segmentation algorithm with the thresholding stopping criterion. Let the threshold parameter satisfy $\zeta_T = c_1 T^\theta$ where $\theta \in (1 - \Theta, \Theta - 1/2)$ if $\Theta \in (3/4, 1)$, or $\zeta_T \ge c_2 \log^p T$ ($p > 1/2$) and $\zeta_T \le c_3 T^\theta$ ($\theta < 1/2$) if $\Theta = 1$, for any positive constants $c_1, c_2, c_3$. Then there exists a positive constant $C$ such that $P(A_T) \to 1$, where
$$A_T = \left\{ \hat N = N; \ \max_{i=1,\ldots,N} |\hat\eta_i - \eta_i| \le C \epsilon_T \right\}$$
with $\epsilon_T = \lambda_2^2 T^2 \delta_T^{-2}$, where $\lambda_2$ is such that $P(A_T') \to 1$, where
$$A_T' = \left\{ (e - b + 1)^{-1/2} \left| \sum_{i=b}^{e} \varepsilon_i \right| < \lambda_2 \ \ \forall \, 1 \le b \le e \le T \right\}. \quad (1)$$

15 Consistency of the WBS algorithm. Theorem (WBS). Suppose Assumptions 1 and 3 hold. Let $N$ and $\eta_1, \ldots, \eta_N$ denote, respectively, the number and locations of change-points. Let $\hat N$ denote the number, and $\hat\eta_1, \ldots, \hat\eta_{\hat N}$ the locations, sorted in increasing order, of the change-point estimates obtained by the WBS algorithm with the thresholding stopping criterion. There exist two constants $\underline{C}, \bar{C}$ such that if $\underline{C} \log^{1/2} T \le \zeta_T \le \bar{C} \delta_T^{1/2}$, then $P(A_T) \to 1$, where
$$A_T = \left\{ \hat N = N; \ \max_{i=1,\ldots,N} |\hat\eta_i - \eta_i| \le C \log T \right\}$$
for a certain positive $C$, where the guaranteed speed of convergence of $P(A_T)$ to 1 is no faster than $T \delta_T^{-1} (1 - \delta_T^2 T^{-2}/9)^M$, with $M$ denoting the overall number of random draws. Note: similar results hold for sSIC-BS and sSIC-WBS.

16 Choice of the number of draws $M$. Note that only one set of $M$ intervals needs to be drawn, i.e. we do not need to draw new intervals at each binary stage, as we can just as well reuse the previously drawn intervals that fall within each current interval $[s, e]$. Considering the bound from the WBS consistency theorem, suppose we wish to have $T \delta_T^{-1} (1 - \delta_T^2 T^{-2}/9)^M \le T^{-\alpha}$ for a certain positive $\alpha$. This is practically equivalent to $M \ge 9 T^2 \delta_T^{-2} \log(T^{1+\alpha} \delta_T^{-1})$. In the easy case where $\delta_T$ is of order $T$, this results in a logarithmic number of draws. Naturally, $M$ progressively increases as $\delta_T$ decreases.
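To get a feel for the size of $M$ implied by this bound, the sketch below evaluates both expressions numerically. The function names and the example values of $T$, $\delta_T$ and $\alpha$ are illustrative assumptions. Because $\log(1 - x) \le -x$, any $M$ satisfying the second inequality also satisfies the first.

```python
import math


def min_draws(T, delta, alpha):
    """Smallest integer M with M >= 9 T^2 / delta^2 * log(T^(1+alpha) / delta)."""
    return math.ceil(9 * T ** 2 / delta ** 2 * math.log(T ** (1 + alpha) / delta))


def failure_bound(T, delta, M):
    """The quantity T / delta * (1 - delta^2 / (9 T^2))^M from the WBS theorem."""
    return T / delta * (1 - delta ** 2 / (9 * T ** 2)) ** M


T, delta, alpha = 1000, 100, 1.0   # hypothetical example values
M = min_draws(T, delta, alpha)
bound = failure_bound(T, delta, M)  # should not exceed T ** -alpha
```

With a minimum spacing of one tenth of the sample size, the required number of draws is in the thousands; shrinking the spacing increases it quadratically, up to the logarithmic factor.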

17 Parameter choice in practice.
Choice of $M$: We have tested, and recommend, M = 5000 or M = for datasets of length $T$ not exceeding a few thousand. Part of the algorithm is coded in C, so it takes a fraction of a second on a standard PC. Note that WBS can be fully parallelized, e.g. on a GPU, as each interval can be drawn and processed independently of the others. In this sense, in a parallel computing environment, WBS is actually faster than BS!
Choice of threshold $\zeta_T$: We use multiples of the universal threshold, i.e. $\zeta_T = C \hat\sigma (2 \log T)^{1/2}$, with $C = 1.0$ (which tends to perform well or slightly over-estimate $N$) or $C = 1.3$ (which tends to perform well or slightly under-estimate $N$).
Choice of the $\alpha$ parameter in sSIC-WBS: We use $\alpha = 1.01$ in order to stay close to the standard SIC.
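The threshold formula is straightforward to evaluate. In the sketch below, the function names are illustrative, and the noise-level estimate (a median absolute deviation of first differences, rescaled for Gaussian noise) is an assumption: the slide does not say how $\hat\sigma$ is computed.

```python
import math
import random
import statistics


def mad_sigma(x):
    """Assumed estimator: MAD of first differences, rescaled for Gaussian noise."""
    diffs = [b - a for a, b in zip(x, x[1:])]
    med = statistics.median(diffs)
    mad = statistics.median(abs(d - med) for d in diffs)
    # 0.6745 makes the MAD consistent for a Gaussian standard deviation;
    # sqrt(2) undoes the variance doubling caused by differencing.
    return mad / (0.6745 * math.sqrt(2))


def wbs_threshold(sigma_hat, T, C=1.0):
    """zeta_T = C * sigma_hat * sqrt(2 log T)."""
    return C * sigma_hat * math.sqrt(2 * math.log(T))


rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
sigma_hat = mad_sigma(noise)                      # should be close to 1
zeta_low = wbs_threshold(sigma_hat, len(noise), C=1.0)
zeta_high = wbs_threshold(sigma_hat, len(noise), C=1.3)
```

Differencing before taking the MAD keeps the estimate largely insensitive to the change-points themselves, since a jump affects only one difference.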

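18 Simulation study (1). [Figure: the blocks signal, two panels; x-axis: Time.]

19 Simulation study (2). [Figure: the fms signal, two panels; x-axis: Time.]

20 Simulation study (3). [Figure: the mix signal, two panels; x-axis: Time.]

21 Simulation study (4). [Figure: the teeth10 signal, two panels; x-axis: Time.]

22 Simulation study (5). [Figure: the stairs10 signal, two panels; x-axis: Time.]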

23 Simulation study Best available competitors from R packages publicly available on CRAN: PELT: method from the changepoint package, see Killick et al. (2012), B&P: method from the strucchange package, see Bai and Perron (2003), cumseg: method from the cumseg package, see Muggeo and Adelfio (2011), S3IB: method from the Segmentor3IsBack package, see Rigaill (2010).

24 Simulation study: results for the blocks signal, model (1). [Table with columns $\hat N - N$, Method, Model, MSE; rows: PELT, B&P, cumseg, S3IB, WBS C = (two values), WBS sSIC, BS C = (two values).]

25 Simulation study: results for the fms signal, model (2). [Table with columns $\hat N - N$, Method, Model, MSE; rows: PELT, B&P, cumseg, S3IB, WBS C = (two values), WBS sSIC, BS C = (two values).]

26 Simulation study: results for the mix signal, model (3). [Table with columns $\hat N - N$, Method, Model, MSE; rows: PELT, B&P, cumseg, S3IB, WBS C = (two values), WBS sSIC, BS C = (two values).]

27 Simulation study: results for the teeth10 signal, model (4). [Table with columns $\hat N - N$, Method, Model, MSE; rows: PELT, B&P, cumseg, S3IB, WBS C = (two values), WBS sSIC, BS C = (two values).]

28 Simulation study: results for the stairs10 signal, model (5). [Table with columns $\hat N - N$, Method, Model, MSE; rows: PELT, B&P, cumseg, S3IB, WBS C = (two values), WBS sSIC, BS C = (two values).]

29 Real data example We now revisit the example from the start of the talk. The time-threshold map below shows the estimated change-points depending on the threshold chosen. Blue line: C =

30 Real data example (contd). [Figure: cumulative sum of $X_t$ against Time, with change-points corresponding to sSIC (thick solid vertical lines), $\zeta_T = 3.83$ (thin and thick solid vertical lines) and $\zeta_T = 3.1$ (all vertical lines).]

31 Some final thoughts. Change-point detection is neither an entirely global problem nor an entirely local one, so a multiscale approach, such as that offered by WBS (in that both short and long intervals are used), appears to be helpful. Can similar local-global randomised approaches be used in other nonparametric problems?

32 References.
Wild Binary Segmentation for multiple change-point detection. P. Fryzlewicz (2013). Under revision. Available from
Package wbs. R. Baranowski & P. Fryzlewicz (2014). Available from


The Annals of Statistics 2014, Vol. 42, No. 6, 2243-2281. DOI: 10.1214/14-AOS1245. © Institute of Mathematical Statistics, 2014. arXiv:1411.0858v1 [math.ST] 4 Nov 2014. WILD BINARY SEGMENTATION FOR MULTIPLE