Kernel change-point detection


Slide 1: Sylvain Arlot (1,2), joint work with Alain Celisse (3) and Zaïd Harchaoui (4). (1) CNRS; (2) École Normale Supérieure (Paris), DIENS, Équipe Sierra; (3) Université Lille 1; (4) INRIA Grenoble. Workshop "Kernel methods for big data", Lille, 1st April.

Slide 2: 1-D signal (example). [Figure: a noisy signal plotted against position t.]

Slide 3: 1-D signal (example): find abrupt changes in the mean. [Figure: the signal together with a piecewise-constant regression function, plotted against position t.]

Slide 4: Estimation rather than identification. [Figure: signal $Y$ and regression function $s$.] With a finite sample, it is impossible to recover some change-points in noisy regions. Purpose: (1) estimate the regression function; (2) use the quadratic loss $\ell(u, v) = \|u - v\|^2$. Remark: unless the noise is too strong, all change-points are then recovered.

Slides 5-7: Detect abrupt changes... General purposes:
1. Detect changes in the whole distribution (not only in the mean). Mean, homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...; mean, heteroscedastic: A. & Celisse (2011); mean and variance: Picard et al. (2007).
2. Handle high-dimensional data of different natures: vectorial (measures in $\mathbb{R}^d$, curves such as sound recordings), non-vectorial (phenotypic data, graphs, DNA sequences), or a mix of both.
3. An efficient algorithm able to deal with large data sets.

Slide 8: Kernel and Reproducing Kernel Hilbert Space (RKHS). $\mathcal{X}$: initial input space; $X_1, \dots, X_n$: initial observations. $k(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: reproducing kernel ($\mathcal{H}$: its RKHS). $\phi : \mathcal{X} \to \mathcal{H}$ with $\phi(x) = k(x, \cdot)$: canonical feature map. Asset: enables working with high-dimensional, heterogeneous data. Remark: the estimators depend on the data only through the Gram matrix $K := \{k(X_i, X_j)\}_{1 \le i, j \le n}$.
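For concreteness, a minimal NumPy sketch (function name and arguments are illustrative, not from the slides) of the Gram matrix for the Gaussian kernel $k_h(x, y) = \exp(-\|x - y\|^2 / (2h^2))$ used in the experiments below:

    import numpy as np

    def gaussian_gram(X, h):
        # K[i, j] = exp(-||X_i - X_j||^2 / (2 h^2)) for the rows of X.
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * h ** 2))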

Slides 9-11: Model. Mapping of the initial data: for all $1 \le i \le n$, $Y_i = \phi(X_i) \in \mathcal{H}$, and $(t_1, Y_1), \dots, (t_n, Y_n) \in [0, 1] \times \mathcal{H}$ are independent. Write $Y_i = s_i + \varepsilon_i \in \mathcal{H}$, where $s_i \in \mathcal{H}$ is the mean element of $P_{X_i}$ (the distribution of $X_i$), that is, $\langle s_i, f \rangle_{\mathcal{H}} = \mathbb{E}_{X_i}[\langle \phi(X_i), f \rangle_{\mathcal{H}}]$ for all $f \in \mathcal{H}$; for all $i$, $\varepsilon_i := Y_i - s_i$, so that $\mathbb{E}[\varepsilon_i] = 0$, with $v_i := \mathbb{E}[\|\varepsilon_i\|_{\mathcal{H}}^2]$.
Assumptions:
1. $\max_i \|Y_i\|_{\mathcal{H}} \le M$ a.s. (Db).
2. $\max_i v_i \le v_{\max}$ (Vmax).
3. $s = (s_1, \dots, s_n) \in \mathcal{H}^n$ is piecewise constant.
Notation: $\|s - \mu\|^2 := \sum_{i=1}^{n} \|s_i - \mu_i\|_{\mathcal{H}}^2$. Goal: estimate $s$ to recover the change-points.

Slide 12: Least-squares estimator. Empirical risk minimizer over the model $S_m$: $\hat{s}_m \in \operatorname{argmin}_{u \in S_m} \hat{R}_n(u)$, where $\hat{R}_n(u) = \frac{1}{n} \|u - Y\|^2 = \frac{1}{n} \sum_{i=1}^{n} \|u_i - Y_i\|_{\mathcal{H}}^2$. This is a regressogram: $\hat{s}_m = \sum_{\lambda \in m} \hat{\beta}_\lambda \mathbf{1}_\lambda$ with $\hat{\beta}_\lambda = \frac{1}{\mathrm{Card}\{t_i \in \lambda\}} \sum_{t_i \in \lambda} Y_i$.
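Although the $Y_i$ live in $\mathcal{H}$, the empirical risk of a segmentation can be evaluated from the Gram matrix alone: for a segment $\lambda$, $\sum_{i \in \lambda} \|Y_i - \hat{\beta}_\lambda\|_{\mathcal{H}}^2 = \sum_{i \in \lambda} K_{ii} - \frac{1}{\mathrm{Card}(\lambda)} \sum_{i, j \in \lambda} K_{ij}$. A minimal sketch (the helper name is hypothetical):

    import numpy as np

    def segment_cost(K, a, b):
        # Kernel least-squares cost of the segment {a, ..., b-1} (half-open),
        # computed from the Gram matrix only: sum of diagonal entries of the
        # segment block minus (block sum) / (segment length).
        block = K[a:b, a:b]
        return np.trace(block) - block.sum() / (b - a)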

Slides 13-14: Model selection. Models: $\mathcal{M}_n = \{m : m \text{ a segmentation of } \{1, \dots, n\}\}$, with $D_m = \mathrm{Card}(m)$; a segmentation $m$ corresponds to intervals $\{I_1 = [0, t_{m_1}], I_2 = (t_{m_1}, t_{m_2}], \dots, I_{D_m} = (t_{m_{D_m - 1}}, 1]\}$. $S_m = \{\mu : (t_1, \dots, t_n) \to \mathcal{H}, \text{ piecewise constant on every } \lambda \in m\}$, a subspace of $\mathcal{H}^n$. Strategy: $(S_m)_{m \in \mathcal{M}_n} \to (\hat{s}_m)_{m \in \mathcal{M}_n} \to \hat{s}_{\hat{m}}$??? Oracle model: $m^\star \in \operatorname{argmin}_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2$. Goal: an oracle inequality (in expectation, or with large probability): $\|s - \hat{s}_{\hat{m}}\|^2 \le C \inf_{m \in \mathcal{M}_n} \{\|s - \hat{s}_m\|^2 + R(m, n)\}$.

Slide 15: Choose $(D - 1)$ change-points. Assumption (Harchaoui & Cappé (2007)): the number $(D - 1)$ of change-points is known. Question: find the locations of the $(D - 1)$ change-points. Strategy ($D$ given): the best segmentation into $D$ pieces is obtained by applying the ERM algorithm over $\bigcup_{m : D_m = D} S_m$: $\hat{m}_{\mathrm{ERM}}(D) = \operatorname{argmin}_{m : D_m = D} \hat{R}_n(\hat{s}_m)$. Remark: computed by dynamic programming; a sketch follows.
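A naive sketch of that dynamic program, reusing segment_cost from above (names are illustrative; a production implementation would precompute or recursively update the segment costs rather than re-sum blocks):

    import numpy as np

    def best_segmentations(K, D_max):
        # L[D, t]: minimal empirical risk when splitting the first t points
        # into D segments; arg[D, t]: start of the last segment achieving it.
        n = K.shape[0]
        L = np.full((D_max + 1, n + 1), np.inf)
        arg = np.zeros((D_max + 1, n + 1), dtype=int)
        L[0, 0] = 0.0
        for D in range(1, D_max + 1):
            for t in range(D, n + 1):
                for a in range(D - 1, t):
                    c = L[D - 1, a] + segment_cost(K, a, t)
                    if c < L[D, t]:
                        L[D, t], arg[D, t] = c, a
        return L, arg

    def change_points(arg, D, n):
        # Backtrack the (D - 1) change-point locations for D segments.
        cps, t = [], n
        for d in range(D, 1, -1):
            t = arg[d, t]
            cps.append(t)
        return cps[::-1]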

Slide 16: Quality of the segmentations. [Figure only.]

Slides 17-18: Elementary calculations. Ideal criterion ($\Pi_m$: orthogonal projection operator onto $S_m$): $\|s - \hat{s}_m\|^2 = \|s - \Pi_m s\|^2 + \|\Pi_m \varepsilon\|^2$. Empirical risk: $\|Y - \hat{s}_m\|^2 = \|s - \Pi_m s\|^2 - \|\Pi_m \varepsilon\|^2 + 2 \langle (I - \Pi_m) s, \varepsilon \rangle + \|\varepsilon\|^2$. Expectations (with $v_\lambda := \frac{1}{\mathrm{Card}(\lambda)} \sum_{i \in \lambda} v_i$): $\mathbb{E}[\|s - \hat{s}_m\|^2] = \|s - \Pi_m s\|^2 + \sum_{\lambda \in m} v_\lambda$ and $\mathbb{E}[\|Y - \hat{s}_m\|^2] = \|s - \Pi_m s\|^2 - \sum_{\lambda \in m} v_\lambda + \mathrm{Cst}$. Conclusion: ERM prefers models with a large $\sum_{\lambda \in m} v_\lambda$, hence it overfits.

Slide 19: Choose the number of change-points. From $\{\hat{s}_{\hat{m}(D)}\}_D$, choosing $D$ amounts to choosing the best model. Ideal penalty: $m^\star \in \operatorname{argmin}_{m \in \mathcal{M}} \|s - \hat{s}_m\|^2 = \operatorname{argmin}_{m \in \mathcal{M}} \{\|Y - \hat{s}_m\|^2 + \mathrm{pen}_{\mathrm{id}}(m)\}$, with $\mathrm{pen}_{\mathrm{id}}(m) := 2 \|\Pi_m \varepsilon\|^2 - 2 \langle (I - \Pi_m) s, \varepsilon \rangle$. Strategy: (1) concentration inequalities for the linear and quadratic terms; (2) derive a tight upper bound $\mathrm{pen} \ge \mathrm{pen}_{\mathrm{id}}$ holding with high probability. Previous work: Birgé & Massart (2001), under a Gaussian assumption and for real-valued functions; it cannot be extended to the Hilbert framework.

Slide 20: Concentration of the linear term. Theorem (linear term). Assume (Db) and (Vmax) hold. Then, for every segmentation $m \in \mathcal{M}_n$ and every $x > 0$, with probability at least $1 - 2 e^{-x}$, for every $\theta > 0$: $|\langle \Pi_m s - s, \varepsilon \rangle| \le \theta \|\Pi_m s - s\|^2 + \left( \frac{v_{\max}}{\theta} + \frac{4 M^2}{3} \right) x$.

Slide 21: Concentration of the quadratic term. Theorem (quadratic term). Assume (Db), (Vmax), and, for some $\kappa \ge 1$, $0 < \frac{M^2}{\kappa} \le \min_i v_i$ (Vmin). Then, for every $m \in \mathcal{M}_n$, $x > 0$, and $\theta \in (0, 1]$, with probability at least $1 - 2 e^{-x}$: $\left| \|\Pi_m \varepsilon\|^2 - \mathbb{E}[\|\Pi_m \varepsilon\|^2] \right| \le \theta \, \mathbb{E}[\|\Pi_m s - \hat{s}_m\|^2] + \theta^{-1} L(\kappa) v_{\max} x$, where $L(\kappa)$ is a constant. Idea of the proof: Pinelis-Sakhanenko's inequality (for $\|\sum_{i \in \lambda} \varepsilon_i\|_{\mathcal{H}}$) and Bernstein's inequality (to upper bound the moments).

Slide 22: Oracle inequality. Theorem. Assume (Db), (Vmin), and (Vmax), and define $\hat{m} \in \operatorname{argmin}_m \left\{ \frac{1}{n} \|Y - \hat{s}_m\|^2 + \mathrm{pen}(m) \right\}$, where $\mathrm{pen}(m) = \frac{v_{\max} D_m}{n} \left[ C_1 \ln\left( \frac{n}{D_m} \right) + C_2 \right]$ for constants $C_1, C_2 > 0$. Then, for every $x \ge 1$, with probability at least $1 - 2 e^{-x}$: $\frac{1}{n} \|s - \hat{s}_{\hat{m}}\|^2 \le \Delta_1 \inf_{m \in \mathcal{M}_n} \left\{ \frac{1}{n} \|s - \hat{s}_m\|^2 + \mathrm{pen}(m) \right\} + \Delta_2 \frac{v_{\max} x}{n}$, where $\Delta_1 \ge 1$ and $\Delta_2 > 0$ are absolute constants. For comparison, in Birgé & Massart (2001), $\mathrm{pen}(m) = \frac{\sigma^2 D_m}{n} \left[ c_1 \ln\left( \frac{n}{D_m} \right) + c_2 \right]$.

Slide 23: Model selection procedure. Algorithm, with $\mathrm{pen}(m) = \frac{v_{\max} D_m}{n} \left[ C_1 \ln\left( \frac{n}{D_m} \right) + C_2 \right] = \mathrm{pen}(D_m)$:
1. For every $1 \le D \le D_{\max}$, compute $\hat{m}_D \in \operatorname{argmin}_{m : D_m = D} \left\{ \|Y - \hat{s}_m\|^2 \right\}$.
2. Define $\hat{D} \in \operatorname{argmin}_D \left\{ \frac{1}{n} \|Y - \hat{s}_{\hat{m}_D}\|^2 + \frac{v_{\max} D}{n} \left[ C_1 \ln\left( \frac{n}{D} \right) + C_2 \right] \right\}$, where $C_1, C_2$ are computed by simulation experiments.
3. Final estimator: $\hat{s}_{\hat{m}} := \hat{s}_{\hat{m}_{\hat{D}}}$.
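Step 2 in a minimal sketch (assumptions: the risks array comes from the dynamic program above, risks[D - 1] = L[D, n] / n, and v_max, C1, C2 are given):

    import numpy as np

    def select_D(risks, n, v_max, C1, C2):
        # risks[D - 1] = ||Y - s_hat_{m_D}||^2 / n for D = 1, ..., D_max;
        # return the D minimizing the penalized criterion of slide 23.
        D = np.arange(1, len(risks) + 1)
        pen = v_max * D / n * (C1 * np.log(n / D) + C2)
        return int(D[np.argmin(np.asarray(risks) + pen)])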

Slide 24: Changes in the distribution (synthetic data). Description: (1) $n = 1000$, $D - 1 = 9$ change-points, $N_{\mathrm{rep}} = \dots$ replications; (2) in each segment, observations are generated according to one distribution from a pool of 10 distributions sharing the same mean and variance; (3) the kernel-based approach can distinguish them (through higher-order moments); (4) Gaussian kernel: $k_h(x, y) = \exp\left( -\|x - y\|^2 / (2 h^2) \right)$.

Slide 25: Changes in the distribution (synthetic data), continued. Results: [figure comparing the empirical risk, the penalized criterion, and the true risk]. Hausdorff distance: $\dots \pm \dots$.
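The Hausdorff distance used here to score estimated change-points against the true ones is, in a minimal sketch:

    import numpy as np

    def hausdorff(cps_hat, cps_true):
        # Max over each set of the distance to the nearest point of the
        # other set; both sets are assumed non-empty.
        a = np.asarray(cps_hat, dtype=float)
        b = np.asarray(cps_true, dtype=float)
        d = np.abs(a[:, None] - b[None, :])
        return max(d.min(axis=1).max(), d.min(axis=0).max())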

Slide 26: Changes in the distribution (synthetic data), continued. Results: [figure of the estimated number of change-points].

Slide 27: "Le grand échiquier", a French talk show of the 70s-80s. Audio and video recordings. Audio: different situations can be distinguished from the sound recordings (music, applause, speech, ...). Video: different scenes can be distinguished by their backgrounds or by specific actions of the people filmed (clapping hands, discussing, ...).

Slide 28: Audio signal. Description: $n = 500$, $D - 1 = 4$ change-points; at each $t_i$, one observes a multivariate vector of dimension 12. Gaussian kernel: $k_h(x, y) = \exp\left( -\|x - y\|^2 / (2 h^2) \right)$. Results: Hausdorff distance $\dots \pm \dots$.

Slide 29: Video sequence. Description: $n = \dots$, $D - 1 = 4$ change-points; each image is summarized by a histogram with $\dots$ bins. $\chi^2$ kernel: $k_d(x, y) = \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i}$. Results: Hausdorff distance $\dots \pm \dots$.
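A minimal sketch of the pairwise $\chi^2$ statistic between histogram rows, as written on the slide (a possible exponential envelope with a bandwidth, as for the Gaussian kernel, may have been lost in transcription; eps is an added guard against empty bins):

    import numpy as np

    def chi2_gram(X, eps=1e-12):
        # K[i, j] = sum_d (X[i, d] - X[j, d])^2 / (X[i, d] + X[j, d])
        # for histogram rows of X.
        num = (X[:, None, :] - X[None, :, :]) ** 2
        den = X[:, None, :] + X[None, :, :] + eps
        return (num / den).sum(axis=-1)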

Slides 30-31: Conclusion. Take-home message: a change-point detection algorithm for possibly high-dimensional or complex data; a data-driven choice of the number of change-points; a non-asymptotic oracle inequality (a guarantee on the risk); experiments on changes in less usual properties of the distribution, and on audio and video data.
Open questions: (1) influence of the choice of kernel; (2) data-driven choice of the kernel; (3) relaxing the assumption on the variance; (4) extending the model selection theorem to other regression settings.
