Kernels to detect abrupt changes in time series
1 UMR 8524 CNRS - Université Lille 1; 2 Modal INRIA team-project; 3 SSB group, Paris. Joint work with S. Arlot, Z. Harchaoui, G. Rigaill, and G. Marot. Computational and statistical trade-offs in learning, IHES, Paris, March 22nd, 2016.
Outline
1. Motivating examples and framework (kernels)
2. KCP algorithm and computational complexity
3. Where are the change-points (D fixed)?
4. How many change-points?
Change-point detection: 1-D signal (example)
[Figure: a noisy 1-D signal and its piecewise-constant regression function, plotted against position t]
Detect abrupt changes... General purposes:
1. Detect changes in (features of) the distribution (not only in the mean)
Abrupt changes in high-order moments
[Figure: signal whose mean stays constant while higher-order moments change; detecting changes in the mean is useless here]
Detect abrupt changes... General purposes:
1. Detect changes in (features of) the distribution (not only in the mean)
2. Complex data:
- High-dimensional: measures in R^d, curves, ...
- Structured: audio/video streams, graphs, DNA sequences, ...
Motivating example 1: Structured objects
Description: video sequences from Le grand échiquier, a 70s-80s French talk show. At each time, one observes an image (high-dimensional). Each image is summarized by a histogram.
Motivating example 2: Structured objects
Observe networks over time. Goal: detect abrupt changes in some features of the network.
Detect abrupt changes... General purposes:
1. Detect changes in (features of) the distribution (not only in the mean)
2. Complex data:
- High-dimensional: measures in R^d, curves, ...
- Structured: audio/video streams, graphs, DNA sequences, ...
3. Fusion of heterogeneous data: deal simultaneously with different types of complex data
4. Efficient algorithm allowing to deal with large data sets (Big-data challenge)
Part I: Kernel framework
Kernel and Reproducing Kernel Hilbert Space (RKHS)
X_1, ..., X_n ∈ X: initial observations.
k(·,·): X × X → R: reproducing kernel (Aronszajn (1950)).
H: RKHS associated with k(·,·) (φ: X → H s.t. φ(x) = k(x,·): canonical feature map).
Assets: versatile tool to work with different types of data; handles complex (high-dimensional/structured) data.
Instances of kernels
Gaussian kernel (with R^d-valued data): k_δ(x, y) = exp( −‖x − y‖² / δ ), δ > 0.
χ²-kernel (with histogram-valued data): k_I(p, q) = exp( −Σ_{i=1}^I (p_i − q_i)² / (p_i + q_i) ).
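As an illustration (not the authors' code), both kernels can be written in a few lines; the function names and the zero-bin convention in the χ² kernel are ours:

```python
import numpy as np

def gaussian_kernel(x, y, delta=1.0):
    """Gaussian kernel on R^d: k_delta(x, y) = exp(-||x - y||^2 / delta)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.exp(-np.sum((x - y) ** 2) / delta))

def chi2_kernel(p, q):
    """Chi^2 kernel on histograms: exp(-sum_i (p_i - q_i)^2 / (p_i + q_i))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    num, den = (p - q) ** 2, p + q
    # Convention (ours): bins where p_i + q_i = 0 contribute nothing to the sum.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(np.exp(-terms.sum()))
```

Both return 1 when the two arguments coincide, which is the maximal similarity.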
Model
For 1 ≤ i ≤ n, Y_i = φ(X_i) = μ*_i + ε_i ∈ H, where
μ*_i ∈ H: mean element of P_{X_i} (distribution of X_i);
for all i, ε_i := Y_i − μ*_i, with E[ε_i] = 0 and v_i := E[‖ε_i‖²_H].
Mean element of P_{X_i} (H separable and E[k(X, X)] < +∞):
⟨μ*_i, f⟩_H = E_{X_i}[ ⟨φ(X_i), f⟩_H ], for all f ∈ H.
With characteristic kernels, P_{X_i} ≠ P_{X_j} ⟺ μ*_i ≠ μ*_j.
Estimation rather than identification
Assumption: μ* = (μ*_1, ..., μ*_n) ∈ H^n is piecewise constant.
[Figure: signal Y and regression function s]
Fact: with a finite sample, it is impossible to recover change-points in noisy regions.
Purpose: estimate μ* to recover change-points.
Performance measure: ‖μ̂ − μ*‖² := Σ_{i=1}^n ‖μ̂_i − μ*_i‖²_H.
Part II: Algorithm
Notation
Segmentation with D segments: τ = (τ_0, ..., τ_D), with 0 = τ_0 < τ_1 < τ_2 < ... < τ_D = n.
Quality of a segmentation τ (following Harchaoui and Cappé (2007)):
R_n(τ) = (1/n) Σ_{i=1}^n k(X_i, X_i) − (1/n) Σ_{l=1}^D (1/(τ_l − τ_{l−1})) Σ_{i=τ_{l−1}+1}^{τ_l} Σ_{j=τ_{l−1}+1}^{τ_l} k(X_i, X_j).
Remark: with the linear kernel k(x, x′) = ⟨x, x′⟩ on X = R^d, R_n(τ) reduces to the usual least-squares empirical risk.
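Given a precomputed Gram matrix, R_n(τ) is a one-liner per segment. The sketch below is ours (0-based, half-open segment convention, i.e. segment l covers indices τ_{l−1}..τ_l − 1):

```python
import numpy as np

def kernel_risk(K, tau):
    """Empirical risk R_n(tau) of a segmentation, from the Gram matrix K.

    tau = (tau_0, ..., tau_D) with 0 = tau_0 < ... < tau_D = n.
    """
    n = K.shape[0]
    risk = np.trace(K) / n  # (1/n) sum_i k(X_i, X_i)
    for a, b in zip(tau[:-1], tau[1:]):
        # subtract the normalized within-segment double sum
        risk -= K[a:b, a:b].sum() / ((b - a) * n)
    return float(risk)
```

Sanity check: with the linear kernel K = x xᵀ on 1-D data, this equals (1/n) times the within-segment sum of squares, the least-squares empirical risk.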
KCP Algorithm
Input: observations X_1, ..., X_n ∈ X; kernel k: X × X → R.
Step 1: for 1 ≤ D ≤ D_max, compute (by dynamic programming):
τ̂(D) ∈ Argmin_{τ ∈ T_n^D} { R_n(τ) },
where T_n^D = { (τ_0, ..., τ_D) ∈ N^{D+1} : 0 = τ_0 < τ_1 < τ_2 < ... < τ_D = n }.
Step 2 (model selection): find
D̂ ∈ Argmin_{1 ≤ D ≤ D_max} { R_n(τ̂(D)) + pen(τ̂(D)) }.
Output: sequence of change-points τ̂ = τ̂(D̂).
Computational complexity (naive approach)
Dynamic programming (DP) update rule: for 2 ≤ D ≤ D_max,
L_{D,n} = min_{t ≤ n−1} { L_{D−1,t} + C_{t,n} },
where L_{D−1,t} is the cost of the best segmentation in D−1 segments up to time t, and C_{t,n} is the cost of the segment (t, n]:
C_{s,t} = Σ_{i=s+1}^t k(X_i, X_i) − (1/(t − s)) Σ_{i=s+1}^t Σ_{j=s+1}^t k(X_i, X_j).
Complexity (naive approach):
time: O(D_max n⁴) (computation of {C_{s,t}}_{1 ≤ s,t ≤ n});
space: O(n²) (storage of the cost matrix).
Computational complexity (improvement)
Ideas (with G. Rigaill and G. Marot):
- Never store the cost matrix.
- Update each column C_{·,t+1} from C_{·,t}.
Pseudo-code:
1: for t = 1 to n − 1 do
2:   Compute the (t + 1)-th column C_{·,t+1} from C_{·,t}
3:   for D = 2 to min(t, D_max) do
4:     L_{D,t+1} = min_{s ≤ t} { L_{D−1,s} + C_{s,t+1} }
5:   end for
6: end for
Computational complexity:
space: O(D_max n) (only store C_{·,t} ∈ R^n);
time: O(D_max n²) (update rule + DP complexity).
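The pseudo-code above can be sketched in NumPy. This is our illustrative implementation (function names kcp_dp and backtrack are ours); it keeps only the running segment costs for the current right endpoint t, as in the column-update idea, though it still takes the full Gram matrix K as input:

```python
import numpy as np

def kcp_dp(K, D_max):
    """DP for best segmentations in 1..D_max segments (column-update costs).

    Returns L (L[D, t] = cost of the best D-segment segmentation of X_1..X_t)
    and back-pointers to recover the change-points.
    """
    n = K.shape[0]
    L = np.full((D_max + 1, n + 1), np.inf)
    back = np.zeros((D_max + 1, n + 1), dtype=int)
    L[0, 0] = 0.0
    d = np.zeros(n)   # d[s]  = double sum of K over the segment (s, t]
    tr = np.zeros(n)  # tr[s] = sum of diagonal terms k(X_i, X_i) over (s, t]
    for t in range(1, n + 1):
        x = K[:t, t - 1]                 # k(X_i, X_t) for i = 1..t
        csum = np.cumsum(x[::-1])[::-1]  # csum[s] = sum_{i=s+1}^{t} k(X_i, X_t)
        d[:t] += 2.0 * csum - K[t - 1, t - 1]  # extend each segment (s, t-1] to (s, t]
        tr[:t] += K[t - 1, t - 1]
        C = tr[:t] - d[:t] / (t - np.arange(t))  # C[s] = cost of segment (s, t]
        for D in range(1, min(t, D_max) + 1):
            vals = L[D - 1, :t] + C
            s_best = int(np.argmin(vals))
            L[D, t], back[D, t] = vals[s_best], s_best
    return L, back

def backtrack(back, D, n):
    """Recover tau_hat(D) = (tau_0, ..., tau_D) from the back-pointers."""
    tau = [n]
    for d in range(D, 0, -1):
        tau.append(int(back[d, tau[-1]]))
    return tau[::-1]
```

With a linear kernel on a piecewise-constant 1-D signal, the recovered change-points coincide with the true ones and the 2-segment cost is zero.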
Runtime
Open questions:
- Reduce computation time by low-rank approximations of the Gram matrix.
- Quantify what has been lost by the approximation.
Part III: Where are the change-points for a fixed D?
KCP Algorithm (reminder)
Input: observations X_1, ..., X_n ∈ X; kernel k: X × X → R.
Step 1: for 1 ≤ D ≤ D_max, compute (by dynamic programming): τ̂(D) ∈ Argmin_{τ ∈ T_n^D} { R_n(τ) }.
Step 2 (model selection): find D̂ ∈ Argmin_{1 ≤ D ≤ D_max} { R_n(τ̂(D)) + pen(τ̂(D)) }.
Output: sequence of change-points τ̂ = τ̂(D̂).
Distance between segmentations
Hausdorff distance:
d_H(τ, τ′) = max { max_{1 ≤ i ≤ D_τ − 1} min_{1 ≤ j ≤ D_{τ′} − 1} |τ_i − τ′_j|, max_{1 ≤ j ≤ D_{τ′} − 1} min_{1 ≤ i ≤ D_τ − 1} |τ_i − τ′_j| }.
Frobenius distance:
d_F(τ, τ′) = ‖M^τ − M^{τ′}‖_F = √( Σ_{1 ≤ i,j ≤ n} (M^τ_{i,j} − M^{τ′}_{i,j})² ),
where M^τ_{i,j} = 1{i and j belong to the same segment of τ} / Card(segment of τ containing i and j).
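Both distances are easy to compute; the sketch below is ours (0-based, half-open segments, and the Hausdorff distance assumes each segmentation has at least one interior change-point):

```python
import numpy as np

def hausdorff_dist(tau1, tau2):
    """Hausdorff distance between the interior change-points of two segmentations."""
    a = np.asarray(tau1[1:-1], float)  # drop the fixed endpoints 0 and n
    b = np.asarray(tau2[1:-1], float)
    gaps = np.abs(a[:, None] - b[None, :])
    return float(max(gaps.min(axis=1).max(), gaps.min(axis=0).max()))

def membership_matrix(tau, n):
    """M^tau with M_ij = 1/|segment| if i and j share a segment of tau, else 0."""
    M = np.zeros((n, n))
    for a, b in zip(tau[:-1], tau[1:]):
        M[a:b, a:b] = 1.0 / (b - a)
    return M

def frobenius_dist(tau1, tau2, n):
    """Frobenius distance between the membership matrices of two segmentations."""
    return float(np.linalg.norm(membership_matrix(tau1, n) - membership_matrix(tau2, n)))
```

The Frobenius distance is defined even when one segmentation has no interior change-point, which is convenient for comparing with the trivial one-segment segmentation.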
Empirical assessment
Scenario 1: changes in (mean, variance)
- R-valued X_1, ..., X_n with n = 1000
- True partition of ⟦1, n⟧ into D* = 11 segments
- In each segment, randomly choose a distribution among 7 of them
Scenario 1: changes in (mean, variance) with D = 11
Hausdorff and Frobenius distances
[Figure: Frobenius and Hausdorff distances vs dimension; (a) Gaussian (k_G), (b) Linear (k_Lin)]
k_Lin puts changes in noise.
Scenario 1: changes in (mean, variance) (cont.)
Change-point frequencies for D = D* (500 repetitions)
[Figure: frequency of selected change-points vs position; (a) Gaussian (k_G), (b) Linear (k_Lin)]
k_Lin puts changes in noise.
Empirical assessment
Scenario 2: no change in (mean, variance)
- R-valued X_1, ..., X_n with n = 1000
- True partition of ⟦1, n⟧ into D* = 11 segments
- In each segment, randomly choose a distribution among 3 of them
Scenario 2: no change in (mean, variance)
Hausdorff and Frobenius distances
[Figure: Frobenius and Hausdorff distances vs dimension; (a) Gaussian (k_G), (b) Linear (k_Lin)]
k_Lin puts changes in noise.
Scenario 2: no change in (mean, variance) (cont.)
Change-point frequencies for D = D*
[Figure: frequency of selected change-points vs position; (a) Gaussian (k_G), (b) Linear (k_Lin)]
k_Lin puts changes in noise.
Scenario 2: no change in (mean, variance) (cont.)
Change-point frequencies for D = D*
[Figure: frequency of selected change-points vs position; (a) Gaussian (k_G), (b) Hermite (k_H5)]
k_H5 is less sensitive to changes than k_G (characteristic kernels).
Empirical assessment
Scenario 3: histogram-valued data
- Histogram-valued X_1, ..., X_n with 20 bins and n = 1000
- True partition of ⟦1, n⟧ into D* = 11 segments
- In each segment, randomly choose the parameters (p_1, ..., p_20) of a Dirichlet distribution
Scenario 3: histogram-valued data
Hausdorff and Frobenius distances
[Figure: Frobenius and Hausdorff distances vs dimension; (a) χ² (k_χ²), (b) Gaussian (k_G)]
k_G misses change-points by ignoring the structure of the data.
Scenario 3: histogram-valued data (cont.)
Change-point frequencies for D = D*
[Figure: frequency of selected change-points vs position; (a) χ² (k_χ²), (b) Gaussian (k_G)]
Potential gain in exploiting the structure of the data.
Part IV: How many change-points?
KCP Algorithm (reminder)
Input: observations X_1, ..., X_n ∈ X; kernel k: X × X → R.
Step 1: for 1 ≤ D ≤ D_max, compute (by dynamic programming): τ̂(D) ∈ Argmin_{τ ∈ T_n^D} { R_n(τ) }.
Step 2 (model selection): find D̂ ∈ Argmin_{1 ≤ D ≤ D_max} { R_n(τ̂(D)) + pen(τ̂(D)) }.
Output: sequence of change-points τ̂ = τ̂(D̂).
Empirical risk minimizer
Assumption: for 1 ≤ i ≤ n, Y_i = μ*_i + ε_i, with μ* = (μ*_1, ..., μ*_n) piecewise constant.
Model: for τ = (τ_0, τ_1, ..., τ_D) (with τ_0 = 0 and τ_D = n), define the vector space (model)
F_τ = { (f_1, ..., f_n) ∈ H^n : f_{τ_{l−1}+1} = ... = f_{τ_l}, 1 ≤ l ≤ D_τ }
(D_τ: number of segments of τ).
Estimator of μ*: μ̂_τ = Argmin_{f ∈ F_τ} { ‖Y − f‖² }, with ‖f‖² = Σ_{i=1}^n ‖f_i‖²_H.
μ̂_τ = Π_{F_τ} Y: orthogonal projection of Y onto F_τ.
Choose the number of change-points
Ideal penalty:
τ* ∈ Argmin_{τ ∈ T_n} ‖μ* − μ̂_τ‖²  (oracle segmentation)
= Argmin_{τ ∈ T_n} { ‖Y − μ̂_τ‖² + pen_id(τ) },
with pen_id(τ) := 2‖Π_τ ε‖² − 2⟨(I − Π_τ)μ*, ε⟩.
Strategy:
1. Concentration inequalities for the linear and quadratic terms.
2. Derive a tight upper bound pen ≥ pen_id with high probability.
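The form of pen_id follows from a standard bias-variance decomposition, using that Π_τ is an orthogonal projection (so the cross term between (I − Π_τ)μ* and Π_τ ε vanishes):

```latex
\begin{aligned}
\|\mu^\star - \hat\mu_\tau\|^2
  &= \|(I-\Pi_\tau)\mu^\star\|^2 + \|\Pi_\tau\varepsilon\|^2 ,\\
\|Y - \hat\mu_\tau\|^2
  &= \|(I-\Pi_\tau)\mu^\star\|^2
   + 2\,\langle (I-\Pi_\tau)\mu^\star,\,\varepsilon\rangle
   + \|\varepsilon\|^2 - \|\Pi_\tau\varepsilon\|^2 .
\end{aligned}
```

Subtracting the second identity from the first gives ‖μ* − μ̂_τ‖² = ‖Y − μ̂_τ‖² + pen_id(τ) − ‖ε‖²; since ‖ε‖² does not depend on τ, minimizing ‖Y − μ̂_τ‖² + pen_id(τ) over τ is equivalent to minimizing the oracle risk.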
Concentration of the quadratic term
Assumptions:
(Db): max_i ‖Y_i‖_H ≤ M a.s.
(Vmax): max_i E[‖ε_i‖²_H] ≤ v_max.
Theorem (quadratic term): assuming (Db) and (Vmax), for every τ ∈ T_n, x > 0, θ ∈ (0, 1],
| ‖Π_τ ε‖² − E[‖Π_τ ε‖²] | ≤ θ E[‖Π_τ μ* − μ̂_τ‖²] + θ⁻¹ L v_max x,
with probability at least 1 − 2e⁻ˣ, where L is a constant.
Remarks:
- No Gaussian or constant-variance assumption.
- Deals with Hilbert-valued vectors (not only in R^d).
- The x deviation term allows large collections.
Oracle inequality
Theorem: assume (Db) and (Vmax). For every x > 0, let
τ̂ ∈ Argmin_τ { ‖Y − μ̂_τ‖² + pen(τ) },
where pen(τ) = D_τ [ C₁ ln(n / D_τ) + C₂ ] (C₁, C₂ > 0).
Then, with probability at least 1 − 2e⁻ˣ,
‖μ* − μ̂_τ̂‖² ≤ Δ₁ inf_τ { ‖μ* − μ̂_τ‖² + pen(τ) } + Δ₂,
where Δ₁ ≈ 1 and Δ₂ > 0 is a remainder term.
Remark: in Birgé and Massart (2001), pen(τ) = D_τ [ c₁ ln(n / D_τ) + c₂ ].
Model selection procedure
Algorithm:
1. For every 1 ≤ D ≤ D_max, τ̂(D) ∈ Argmin_{τ : D_τ = D} { ‖Y − μ̂_τ‖² }.
2. Define D̂ = Argmin_D { ‖Y − μ̂_{τ̂(D)}‖² + D [ Ĉ₁ ln(n / D) + Ĉ₂ ] },
where Ĉ₁, Ĉ₂ are computed by simulations (slope heuristics).
3. Final estimator: μ̂ := μ̂_{τ̂(D̂)}.
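Step 2 is a one-dimensional minimization once the empirical risks ‖Y − μ̂_{τ̂(D)}‖² are known. A minimal sketch (function name ours, constants C1, C2 taken as given rather than calibrated by the slope heuristic):

```python
import numpy as np

def select_dimension(emp_risk, n, C1, C2):
    """Pick D_hat minimizing emp_risk[D] + D * (C1 * log(n / D) + C2).

    emp_risk[D] holds ||Y - mu_hat_{tau_hat(D)}||^2 for D = 1..D_max
    (index 0 is unused); C1, C2 would come from the slope heuristic.
    """
    D_max = len(emp_risk) - 1
    crit = [np.inf]  # D = 0 is not allowed
    for D in range(1, D_max + 1):
        crit.append(emp_risk[D] + D * (C1 * np.log(n / D) + C2))
    return int(np.argmin(crit))
```

The empirical risk decreases with D, while the penalty grows roughly linearly in D, so the criterion picks the "elbow" of the risk curve.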
Scenario 1: changes in (mean, variance)
Behavior of the penalized criterion
[Figure: penalized criterion, risk, and empirical risk vs dimension; (a) Gaussian (k_G), (b) Hermite (k_H5)]
crit(τ̂(D)) looks like the risk for both k_G and k_H5.
Scenario 1: changes in (mean, variance) (cont.)
Change-point frequencies and D̂
[Figure: (a) frequencies (exact recovery), (b) selected dimension (D* = 11)]
Scenario 2: no change in (mean, variance)
Behavior of the penalized criterion
[Figure: penalized criterion, risk, and empirical risk vs dimension; (a) Gaussian (k_G), (b) Hermite (k_H5)]
crit(τ̂(D)) looks like the risk for both k_G and k_H5.
Scenario 2: no change in (mean, variance) (cont.)
Change-point frequencies and D̂
[Figure: (a) frequencies (exact recovery), (b) selected dimension (D* = 11)]
Scenario 3: histogram-valued data (cont.)
Behavior of the penalized criterion
[Figure: penalized criterion, risk, and empirical risk vs dimension; (a) χ² (k_χ²), (b) Gaussian (k_G)]
The criterion looks like the risk for both k_G and k_χ².
Concluding remarks
Summary:
- detect changes in the distribution (not only in the mean)
- efficient and theoretically grounded procedure
- deals with both vectorial (R^d) and structured (graphs, ...) objects
Statistical precision/computation trade-offs: open challenges
- Reduce the O(n²) time complexity via approximations of the Gram matrix
- Investigate the link between the kernel and the abrupt changes
- Revisit the slope heuristic to: (i) preserve accuracy, and (ii) save computational resources
Thank you!
Appendix. Scenario 3: histogram-valued data (cont.)
Change-point frequencies and D̂
[Figure: frequency of selected change-points vs position; (a) χ² (k_χ²), (b) Gaussian (k_G)]
Sketch of proof
1. ‖Π_τ ε‖² = Σ_{λ ∈ m} (1/n_λ) ‖Σ_{i ∈ λ} ε_i‖²_H =: Σ_{λ ∈ m} T_λ (sum over the segments λ of τ, with n_λ the length of λ).
2. { ‖Σ_{i ∈ λ} ε_i‖²_H }_{λ ∈ m} are independent random variables.
3. Apply Bernstein's inequality to ‖Π_τ ε‖² (step 1).
4. For every q ≥ 2, upper bound E[T_λ^q].
5. Pinelis-Sakhanenko's inequality on ‖Σ_{i ∈ λ} ε_i‖_H: for all x > 0,
P( ‖Σ_{i ∈ λ} ε_i‖_H > x ) ≤ 2 exp( −x² / (2(σ²_λ + b_λ x)) ),
with b_λ = 2M/3 and σ²_λ = Σ_{i ∈ λ} v_i.
Bernstein rather than Talagrand
Talagrand's inequality: writing ‖Π_τ ε‖ = sup_{f ∈ B_n} ⟨f, Π_τ ε⟩ = sup_{f ∈ B_n} Σ_{i=1}^n ⟨f_i, (Π_τ ε)_i⟩_H,
P( ‖Π_τ ε‖ ≥ E[‖Π_τ ε‖] + √(2vx) + (b/3) x ) ≤ e⁻ˣ,
with v = Σ_{i=1}^n sup_f E[ ⟨f_i, (Π_τ ε)_i⟩²_H ] + 16 b E[‖Π_τ ε‖].
Bernstein's inequality: σ² = sup_f Σ_{i=1}^n E[ ⟨f_i, (Π_τ ε)_i⟩²_H ] = E[‖Π_τ ε‖²].
BOLLETTINO DI GEOFISICA TEORICA ED APPLICATA VOL. 40, N. 3-4, pp.6-66; SEP.-DEC. 999 Two-step data analysis for future satellite gravity field solutions: a simulation study J. KUSCHE, K. H. ILK and S.
More informationSemi-Supervised Learning in Reproducing Kernel Hilbert Spaces Using Local Invariances
Semi-Supervised Learning in Reproducing Kernel Hilbert Spaces Using Local Invariances Wee Sun Lee,2, Xinhua Zhang,2, and Yee Whye Teh Department of Computer Science, National University of Singapore. 2
More informationBrownian Motion. 1 Definition Brownian Motion Wiener measure... 3
Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................
More informationFast learning rates for plug-in classifiers under the margin condition
Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,
More informationShort Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning
Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants
More informationJoint distribution optimal transportation for domain adaptation
Joint distribution optimal transportation for domain adaptation Changhuang Wan Mechanical and Aerospace Engineering Department The Ohio State University March 8 th, 2018 Joint distribution optimal transportation
More informationApproximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,
More informationWhen is MLE appropriate
When is MLE appropriate As a rule of thumb the following to assumptions need to be fulfilled to make MLE the appropriate method for estimation: The model is adequate. That is, we trust that one of the
More informationSupport Vector Machines
EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationKernel methods and the exponential family
Kernel methods and the exponential family Stéphane Canu 1 and Alex J. Smola 2 1- PSI - FRE CNRS 2645 INSA de Rouen, France St Etienne du Rouvray, France Stephane.Canu@insa-rouen.fr 2- Statistical Machine
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationKaggle.
Administrivia Mini-project 2 due April 7, in class implement multi-class reductions, naive bayes, kernel perceptron, multi-class logistic regression and two layer neural networks training set: Project
More informationMonte Carlo methods for sampling-based Stochastic Optimization
Monte Carlo methods for sampling-based Stochastic Optimization Gersende FORT LTCI CNRS & Telecom ParisTech Paris, France Joint works with B. Jourdain, T. Lelièvre, G. Stoltz from ENPC and E. Kuhn from
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationGroup lasso for genomic data
Group lasso for genomic data Jean-Philippe Vert Mines ParisTech / Curie Institute / Inserm Machine learning: Theory and Computation workshop, IMA, Minneapolis, March 26-3, 22 J.P Vert (ParisTech) Group
More informationUpper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1
Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1 Feng Wei 2 University of Michigan July 29, 2016 1 This presentation is based a project under the supervision of M. Rudelson.
More informationMessage passing and approximate message passing
Message passing and approximate message passing Arian Maleki Columbia University 1 / 47 What is the problem? Given pdf µ(x 1, x 2,..., x n ) we are interested in arg maxx1,x 2,...,x n µ(x 1, x 2,..., x
More informationAdvances in Manifold Learning Presented by: Naku Nak l Verm r a June 10, 2008
Advances in Manifold Learning Presented by: Nakul Verma June 10, 008 Outline Motivation Manifolds Manifold Learning Random projection of manifolds for dimension reduction Introduction to random projections
More informationFactor-Adjusted Robust Multiple Test. Jianqing Fan (Princeton University)
Factor-Adjusted Robust Multiple Test Jianqing Fan Princeton University with Koushiki Bose, Qiang Sun, Wenxin Zhou August 11, 2017 Outline 1 Introduction 2 A principle of robustification 3 Adaptive Huber
More informationInformation Recovery from Pairwise Measurements
Information Recovery from Pairwise Measurements A Shannon-Theoretic Approach Yuxin Chen, Changho Suh, Andrea Goldsmith Stanford University KAIST Page 1 Recovering data from correlation measurements A large
More informationThe Multi-Arm Bandit Framework
The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94
More informationComputation time/accuracy trade-off and linear regression
Computation time/accuracy trade-off and linear regression Maxime BRUNIN & Christophe BIERNACKI & Alain CELISSE Laboratoire Paul Painlevé, Université de Lille, Science et Technologie INRIA Lille-Nord Europe,
More information9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee
9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
More informationConvergence rates of spectral methods for statistical inverse learning problems
Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)
More information