Hypothesis Testing with Kernel Embeddings on Interdependent Data
1 Hypothesis Testing with Kernel Embeddings on Interdependent Data
Dino Sejdinovic, Department of Statistics, University of Oxford
Joint work with Kacper Chwialkowski and Arthur Gretton (Gatsby Unit, UCL)
9 April 2015, Dagstuhl
D. Sejdinovic (Statistics, Oxford), Testing with Kernels, 9/4/15, 1 / 19
2-4 Kernel Embedding
Feature map: x ↦ k(·, x) ∈ H_k, instead of x ↦ (φ_1(x), …, φ_s(x)) ∈ R^s.
⟨k(·, x), k(·, y)⟩_{H_k} = k(x, y): inner products easily computed.
Embedding: P ↦ μ_k(P) = E_{X∼P} k(·, X) ∈ H_k, instead of P ↦ (Eφ_1(X), …, Eφ_s(X)) ∈ R^s.
⟨μ_k(P), μ_k(Q)⟩_{H_k} = E_{X,Y} k(X, Y): inner products easily estimated.
μ_k(P) represents expectations w.r.t. P, i.e., E_X f(X) = E_X ⟨f, k(·, X)⟩_{H_k} = ⟨f, μ_k(P)⟩_{H_k} for all f ∈ H_k.
5 Kernel MMD
Definition. Kernel metric (MMD) between P and Q:
MMD²_k(P, Q) = ‖E_X k(·, X) − E_Y k(·, Y)‖²_{H_k} = E_{X,X'} k(X, X') + E_{Y,Y'} k(Y, Y') − 2 E_{X,Y} k(X, Y).
6 Kernel MMD
A polynomial kernel k(x, x') = (1 + x⊤x')^s on R^p captures the difference in the first s (mixed) moments only.
For a certain family of kernels (characteristic/universal), MMD_k(P, Q) = 0 iff P = Q: Gaussian exp(−‖z − z'‖²₂ / (2σ²)), Laplacian, inverse multiquadrics, B_{2n+1}-splines…
Under mild assumptions, kernel MMD metrizes the weak* topology on probability measures (Sriperumbudur, 2010): MMD_k(P_n, P) → 0 ⇔ P_n ⇝ P.
7 Nonparametric two-sample tests
Testing H_0: P = Q vs. H_A: P ≠ Q, based on samples {x_i}_{i=1}^{n_x} ∼ P, {y_i}_{i=1}^{n_y} ∼ Q.
Test statistic is an estimate of MMD²_k(P, Q) = E_{X,X'} k(X, X') + E_{Y,Y'} k(Y, Y') − 2 E_{X,Y} k(X, Y):
MMD̂²_k = 1/(n_x(n_x − 1)) Σ_{i≠j} k(x_i, x_j) + 1/(n_y(n_y − 1)) Σ_{i≠j} k(y_i, y_j) − 2/(n_x n_y) Σ_{i,j} k(x_i, y_j).
Degenerate U-statistic: 1/√n-convergence to MMD² under H_A, 1/n-convergence to 0 under H_0.
O(n²) to compute (n = n_x + n_y); various approximations (block-based, random features) trade computation for power.
Gretton et al (NIPS 2009, JMLR 2012, NIPS 2012)
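The quadratic-time estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the Gaussian kernel and the fixed bandwidth are my choices, as are the helper names.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased estimate of MMD^2_k(P, Q) from samples x ~ P, y ~ Q
    nx, ny = len(x), len(y)
    Kxx = gaussian_kernel(x, x, sigma)
    Kyy = gaussian_kernel(y, y, sigma)
    Kxy = gaussian_kernel(x, y, sigma)
    # within-sample sums run over i != j only (drop the diagonal)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

On two samples from the same distribution the estimate fluctuates around zero; on samples from shifted distributions it concentrates around the positive population MMD².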
8 Test threshold
For i.i.d. data, under H_0: P = Q:
(n_x n_y / (n_x + n_y)) MMD̂²_k ⇝ Σ_{r=1}^∞ λ_r (Z_r² − 1), with {Z_r}_{r=1}^∞ i.i.d. N(0, 1).
{λ_r} depend on both k and P: eigenvalues of T: L₂ → L₂, (Tf)(x) = ∫ f(x') k̃(x, x') dP(x'), where k̃ is the centred kernel.
The asymptotic null distribution is typically estimated using a permutation test.
For interdependent samples, {Z_r}_{r=1}^∞ are correlated, with the correlation structure dependent on the correlation structure within the samples.
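For i.i.d. data, the permutation approach mentioned above can be sketched as follows. This is a toy illustration with names of my choosing (and a biased V-statistic variant of MMD² for brevity); under interdependent observations this is exactly the procedure that miscalibrates.

```python
import numpy as np

def mmd2_biased(x, y, sigma=1.0):
    # Biased (V-statistic) estimate of MMD^2 with a Gaussian kernel
    z = np.concatenate([x, y])
    sq = np.sum(z**2, 1)[:, None] + np.sum(z**2, 1)[None, :] - 2 * z @ z.T
    K = np.exp(-sq / (2 * sigma**2))
    nx = len(x)
    Kxx, Kyy, Kxy = K[:nx, :nx], K[nx:, nx:], K[:nx, nx:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def permutation_threshold(x, y, alpha=0.05, n_perm=200, sigma=1.0, seed=0):
    # Shuffle the pooled sample, recompute the statistic on each random
    # split, and take the (1 - alpha) quantile as the test threshold.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    nx = len(x)
    stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        stats.append(mmd2_biased(pooled[perm[:nx]], pooled[perm[nx:]], sigma))
    return np.quantile(stats, 1 - alpha)
```

The test rejects H_0 when `mmd2_biased(x, y)` exceeds `permutation_threshold(x, y)`; permuting is valid because, under H_0 with i.i.d. data, the pooled sample is exchangeable.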
9-10 Nonparametric independence tests
H_0: X ⊥ Y ⇔ P_{XY} = P_X P_Y; H_A: X ⊥̸ Y ⇔ P_{XY} ≠ P_X P_Y.
Test statistic: HSIC(X, Y) = ‖μ_κ(P̂_{XY}) − μ_κ(P̂_X P̂_Y)‖²_{H_κ}, with κ = k ⊗ l.
Gretton et al (2005, 2008); Smola et al (2007). Related to distance covariance (dCov) in the statistics literature: Székely et al (AoS 2007, AoAS 2009); S. et al (AoS 2013).
11-13 HSIC computation
[Figure: two paired text snippets (descriptions of terrier breeds) fed to kernels k(·, ·) and l(·, ·), giving kernel matrices K and L.]
HSIC measures average similarity between the kernel matrices:
HSIC(X, Y) = (1/n²) ⟨HKH, HLH⟩, H = I − (1/n) 1 1⊤ (centering matrix).
Extensions: conditional independence testing (Fukumizu, Gretton, Sun and Schölkopf, 2008; Zhang, Peters, Janzing and Schölkopf, 2011); three-variable interaction / V-structure discovery (S., Gretton and Bergsma, 2013).
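Given the two kernel matrices, the empirical statistic above is essentially a one-liner. A sketch assuming Gaussian kernels on real-valued data (rather than the slide's text kernels); the helper names are mine.

```python
import numpy as np

def gram(a, sigma=1.0):
    # Gaussian kernel matrix on a 1-d sample given as an (n, 1) column
    sq = (a - a.T) ** 2
    return np.exp(-sq / (2 * sigma**2))

def hsic(K, L):
    # HSIC(X, Y) = (1/n^2) <HKH, HLH>, H = I - (1/n) 1 1^T (centering matrix)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.sum((H @ K @ H) * (H @ L @ H)) / n**2
```

Since HKH and HLH are positive semi-definite, the statistic is always nonnegative, and it is larger for dependent pairs than for independent ones.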
14 Kernel tests on time series
Kacper Chwialkowski, Arthur Gretton
15 Test calibration for dependent observations
Is P the same distribution as Q?
16 Kernel MMD (recap)
Definition. Kernel metric (MMD) between P and Q:
MMD²_k(P, Q) = ‖E_X k(·, X) − E_Y k(·, Y)‖²_{H_k} = E_{X,X'} k(X, X') + E_{Y,Y'} k(Y, Y') − 2 E_{X,Y} k(X, Y).
17-24 Permutation test on AR(1): X_{t+1} = a X_t + √(1 − a²) ε_t
[Figures, for a = 0.1, 0.2, …, 0.8: density of V_n vs. histogram of the permuted statistic V_{n,p}, with correct and bootstrapped thresholds. As a grows, the permutation (bootstrapped) threshold falls increasingly below the correct one.]
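The innovation scale √(1 − a²) in the recursion above keeps the stationary marginal at N(0, 1) for every a, so varying the autoregressive coefficient changes only the dependence within the sample, never the marginal distribution. A quick simulation sketch (my own, for illustration):

```python
import numpy as np

def ar1(n, a, seed=0):
    # X_{t+1} = a X_t + sqrt(1 - a^2) eps_t; stationary marginal is N(0, 1)
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = rng.normal()  # start in stationarity
    eps = rng.normal(size=n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + np.sqrt(1 - a**2) * eps[t]
    return x
```

Whatever a is, the sample standard deviation stays near 1 while the lag-1 autocorrelation is near a, which is exactly why a permutation test (which sees only the marginals) cannot detect the miscalibration.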
25 Wild Bootstrap
Wild bootstrap process (Leucht and Neumann, 2013):
W_{t,n} = e^{−1/l_n} W_{t−1,n} + √(1 − e^{−2/l_n}) ε_t,
where W_{0,n}, ε_1, …, ε_n are i.i.d. N(0, 1), and W̃_{t,n} = W_{t,n} − (1/n) Σ_{j=1}^n W_{j,n}.
MMD̂²_{k,wb} := (1/n_x²) Σ_{i=1}^{n_x} Σ_{j=1}^{n_x} W̃^{(x)}_{i,n_x} W̃^{(x)}_{j,n_x} k(x_i, x_j) + (1/n_y²) Σ_{i=1}^{n_y} Σ_{j=1}^{n_y} W̃^{(y)}_{i,n_y} W̃^{(y)}_{j,n_y} k(y_i, y_j) − (2/(n_x n_y)) Σ_{i,j} W̃^{(x)}_{i,n_x} W̃^{(y)}_{j,n_y} k(x_i, y_j).
Theorem (Chwialkowski, S. and Gretton, 2014). Let k be bounded and Lipschitz continuous, and let {X_t} ∼ P and {Y_t} ∼ Q both be τ-dependent with τ(r) = O(r^{−6−ε}), but independent of each other. Then, under H_0: P = Q, φ( (n_x n_y)/(n_x + n_y) MMD̂²_k, (n_x n_y)/(n_x + n_y) MMD̂²_{k,wb} ) → 0 in probability as n_x, n_y → ∞, where φ is the Prokhorov metric.
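The bootstrap process W_{t,n} is itself just a Gaussian AR(1) with unit marginal variance and correlation length of order l_n. A sketch of generating the centred weights W̃ (function name mine); the bootstrapped statistic MMD̂²_{k,wb} then reweights the three kernel sums with these weights in place of uniform ones.

```python
import numpy as np

def wild_bootstrap_weights(n, l_n, seed=0):
    # W_{t,n} = e^{-1/l_n} W_{t-1,n} + sqrt(1 - e^{-2/l_n}) eps_t:
    # a stationary AR(1) with N(0, 1) marginals and lag-1 correlation
    # e^{-1/l_n}. Returned centred: W~_t = W_t - mean_j W_j.
    rng = np.random.default_rng(seed)
    rho = np.exp(-1.0 / l_n)
    w = np.empty(n)
    w[0] = rng.normal()
    eps = rng.normal(size=n)
    for t in range(1, n):
        w[t] = rho * w[t - 1] + np.sqrt(1 - rho**2) * eps[t]
    return w - w.mean()
```

The point of the correlated (rather than i.i.d.) weights is to mimic the temporal dependence of the data, which is what a plain permutation destroys.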
26-33 Wild Bootstrap on AR(1): X_{t+1} = a X_t + √(1 − a²) ε_t
[Figures, for a = 0.1, 0.2, …, 0.8: density of V_n vs. histogram of the wild-bootstrapped statistic V_{n,w}, with correct and bootstrapped thresholds. The wild-bootstrapped threshold tracks the correct one across all values of a.]
34 Test calibration for dependent observations
[Table: two-sample test rejection rates under H_0 (i.i.d. vs i.i.d., i.i.d. vs Gibbs, Gibbs vs Gibbs) for the permutation and wild bootstrap methods in MCMC diagnostics; numeric entries lost in transcription. Accompanying time-series figures: type I error vs. AR coefficient; type II error for V_{b1}, V_{b2}, Shift, Extinction rate.]
35 Time Series Coupled at a Lag
[Figures: type II error rate and mean distance correlation vs. sample size and lag, for KCSD and HSIC.]
X_t = cos(φ_{t,1}), φ_{t,1} = φ_{t−1,1} + 0.1 ε_{1,t} + 2π f_1 T_s, ε_{1,t} i.i.d. N(0, 1),
Y_t = [2 + C sin(φ_{t,1})] cos(φ_{t,2}), φ_{t,2} = φ_{t−1,2} + 0.1 ε_{2,t} + 2π f_2 T_s, ε_{2,t} i.i.d. N(0, 1).
Parameters: C = 0.4, f_1 = 4 Hz, f_2 = 20 Hz, 1/T_s = 100 Hz.
M. Besserve, N.K. Logothetis, and B. Schölkopf. Statistical analysis of coupled time series with kernel cross-spectral density operators. NIPS 2013.
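In this construction the dependence enters only through the amplitude modulation of Y by the phase of X, so the instantaneous linear correlation between the two series is essentially zero. A simulation sketch of my own (starting both phases at 0; the f_2 and sampling-rate values follow the parameter list above as reconstructed from the transcription):

```python
import numpy as np

def coupled_series(n, C=0.4, f1=4.0, f2=20.0, fs=100.0, seed=0):
    # X_t = cos(phi_{t,1}), phi_{t,1} = phi_{t-1,1} + 0.1 eps_{1,t} + 2 pi f1 Ts
    # Y_t = [2 + C sin(phi_{t,1})] cos(phi_{t,2}), phi_{t,2} likewise with f2
    rng = np.random.default_rng(seed)
    Ts = 1.0 / fs
    phi1 = np.cumsum(0.1 * rng.normal(size=n) + 2 * np.pi * f1 * Ts)
    phi2 = np.cumsum(0.1 * rng.normal(size=n) + 2 * np.pi * f2 * Ts)
    x = np.cos(phi1)
    y = (2 + C * np.sin(phi1)) * np.cos(phi2)
    return x, y
```

By construction |X_t| ≤ 1 and |Y_t| ≤ 2 + C; detecting the dependence requires looking beyond instantaneous correlation, e.g. across lags, which is the point of the KCSD/HSIC comparison.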
36 Summary
Interdependent data lead to incorrect Type I error control for kernel tests (too many false positives).
Consistency of a wild bootstrap procedure under weak long-range dependence (τ-mixing), applicable to both two-sample and independence tests.
Applications: MCMC diagnostics, time series dependence across multiple lags.
37 Open questions
- Interdependent case: how to select the parameters of the wild bootstrap / block bootstrap? Does this require first estimating the mixing properties of the time series?
- Large-scale testing: trade-offs between computation and power.
- How to interpret the discovered differences in distributions / discovered dependence? Do we really care about all possible differences between distributions?
- Tuning parameters: one can select kernels/hyperparameters to directly optimize the relative efficiency of the test, but how does this affect the trade-offs with interdependent data? There is a sensitive interplay between the kernel hyperparameter and the wild bootstrap parameters.
- Multivariate interaction and graphical model selection: approximations?
38 References
- K. Chwialkowski, D. Sejdinovic and A. Gretton. A wild bootstrap for degenerate kernel tests. Advances in Neural Information Processing Systems (NIPS) 27, Dec 2014.
- D. Sejdinovic, B. Sriperumbudur, A. Gretton and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41(5):2263-2291, Oct 2013.
- M. Besserve, N.K. Logothetis and B. Schölkopf. Statistical analysis of coupled time series with kernel cross-spectral density operators. Advances in Neural Information Processing Systems (NIPS) 26, Dec 2013.
- A. Leucht and M.H. Neumann. Dependent wild bootstrap for degenerate U- and V-statistics. J. Multivar. Anal. 117:257-280, 2013.
- A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf and A. Smola. A Kernel Two-Sample Test. J. Mach. Learn. Res. 13(Mar):723-773, 2012.
39 Kernels and characteristic functions
[Figures: scatter plots of Y against X for a smooth and a rough density; Type II error of distance kernels with exponents q = 1/6, 1/3, 2/3, 1, 4/3 vs. the Gaussian kernel (median heuristic), m = 512, α = 0.5.]
E-distance/dCov of Székely and Rizzo (2004, 2005), Székely et al (2007):
V²(X, Y) = E_{XY} E_{X'Y'} ‖X − X'‖₂ ‖Y − Y'‖₂ + E_X E_{X'} ‖X − X'‖₂ E_Y E_{Y'} ‖Y − Y'‖₂ − 2 E_{XY} [E_{X'} ‖X − X'‖₂ E_{Y'} ‖Y − Y'‖₂].
D. Sejdinovic, B. Sriperumbudur, A. Gretton and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics 41(5):2263-2291, 2013.
40 Embeddings in Mercer's Expansion
Mercer's Expansion. For a compact metric space X and a continuous kernel k,
k(x, y) = Σ_{r=1}^∞ λ_r Φ_r(x) Φ_r(y),
with {λ_r, Φ_r}_{r≥1} the eigenvalue/eigenfunction pairs of f ↦ ∫ f(x) k(·, x) dP(x) on L₂(P).
In H_k: k(·, x) ↔ {√λ_r Φ_r(x)}_r ∈ l₂ and μ_k(P) ↔ {√λ_r E Φ_r(X)}_r ∈ l₂, so
‖μ_k(P̂) − μ_k(Q̂)‖²_{H_k} = Σ_{r=1}^∞ λ_r ( (1/n_x) Σ_{t=1}^{n_x} Φ_r(X_t) − (1/n_y) Σ_{t=1}^{n_y} Φ_r(Y_t) )².
41 Wild Bootstrap
ρ_x = n_x/n, ρ_y = n_y/n.
{W_{t,n}}_{1≤t≤n}, with E W_{t,n} = 0 and E[W_{t,n} W_{t',n}] = ζ(|t − t'| / l_n), where lim_{u→0} ζ(u) = 1.
ρ_x ρ_y n MMD̂²_k = Σ_{r=1}^∞ λ_r ( √ρ_y (1/√n_x) Σ_{t=1}^{n_x} Φ_r(X_t) − √ρ_x (1/√n_y) Σ_{t=1}^{n_y} Φ_r(Y_t) )²
ρ_x ρ_y n MMD̂²_{k,wb} = Σ_{r=1}^∞ λ_r ( √ρ_y (1/√n_x) Σ_{t=1}^{n_x} Φ_r(X_t) W̃^{(x)}_{t,n_x} − √ρ_x (1/√n_y) Σ_{t=1}^{n_y} Φ_r(Y_t) W̃^{(y)}_{t,n_y} )²
E[Φ_r(X_1) W_{1,n} Φ_s(X_t) W_{t,n}] = E[Φ_r(X_1) Φ_s(X_t)] ζ((t − 1)/l_n) → E[Φ_r(X_1) Φ_s(X_t)] for all t, r, s,
provided the dependence between X_1 and X_t disappears fast enough (a τ-mixing condition).
42 ICML Workshop on Large-Scale Kernel Learning
Lille, France, 11 July 2015 (co-located with ICML 2015)
- Foundational algorithmic techniques for large-scale kernel learning: matrix factorization, randomization and approximation, variational inference and sampling, inducing variables, random Fourier features, unifying frameworks
- Interface between kernel methods and deep learning architectures
- Trade-offs between statistical and computational efficiency in kernel methods
- Stochastic gradient techniques with kernel methods
- Large-scale multiple kernel learning
- Large-scale representation learning with kernels
- Large-scale kernel methods for complex data types beyond perceptual data
Confirmed speakers: Francis Bach, Neil Lawrence, Russ Salakhutdinov, Marius Kloft, Zaid Harchaoui.
Deadline for submissions: Friday, 1 May 2015, 23:00 UTC.
Generalizing Distance Covariance to Measure and Test Multivariate Mutual Dependence arxiv:1709.02532v5 [math.st] 25 Feb 2018 Ze Jin, David S. Matteson February 27, 2018 Abstract We propose three new measures
More informationCausal Inference in Time Series via Supervised Learning
Causal Inference in Time Series via Supervised Learning Yoichi Chikahara and Akinori Fujino NTT Communication Science Laboratories, Kyoto 619-0237, Japan chikahara.yoichi@lab.ntt.co.jp, fujino.akinori@lab.ntt.co.jp
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationKernel-Based Contrast Functions for Sufficient Dimension Reduction
Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach
More informationTwo-stage Sampled Learning Theory on Distributions
Two-stage Sampled Learning Theory on Distributions Zoltán Szabó (Gatsby Unit, UCL) Joint work with Arthur Gretton (Gatsby Unit, UCL), Barnabás Póczos (ML Department, CMU), Bharath K. Sriperumbudur (Department
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationBickel Rosenblatt test
University of Latvia 28.05.2011. A classical Let X 1,..., X n be i.i.d. random variables with a continuous probability density function f. Consider a simple hypothesis H 0 : f = f 0 with a significance
More informationHigh-dimensional test for normality
High-dimensional test for normality Jérémie Kellner Ph.D Student University Lille I - MODAL project-team Inria joint work with Alain Celisse Rennes - June 5th, 2014 Jérémie Kellner Ph.D Student University
More informationGaussian processes for inference in stochastic differential equations
Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017
More informationThe Perturbed Variation
The Perturbed Variation Maayan Harel Department of Electrical Engineering Technion, Haifa, Israel maayanga@tx.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Haifa, Israel shie@ee.technion.ac.il
More informationBootstrapping high dimensional vector: interplay between dependence and dimensionality
Bootstrapping high dimensional vector: interplay between dependence and dimensionality Xianyang Zhang Joint work with Guang Cheng University of Missouri-Columbia LDHD: Transition Workshop, 2014 Xianyang
More informationThe Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway
More informationDeep Convolutional Neural Networks for Pairwise Causality
Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, Delhi Tata Consultancy Services Ltd. {karamjit.singh,
More informationA Kernel Statistical Test of Independence
A Kernel Statistical Test of Independence Arthur Gretton MPI for Biological Cybernetics Tübingen, Germany arthur@tuebingen.mpg.de Kenji Fukumizu Inst. of Statistical Mathematics Tokyo Japan fukumizu@ism.ac.jp
More informationBivariate Paired Numerical Data
Bivariate Paired Numerical Data Pearson s correlation, Spearman s ρ and Kendall s τ, tests of independence University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html
More information3 Comparison with Other Dummy Variable Methods
Stats 300C: Theory of Statistics Spring 2018 Lecture 11 April 25, 2018 Prof. Emmanuel Candès Scribe: Emmanuel Candès, Michael Celentano, Zijun Gao, Shuangning Li 1 Outline Agenda: Knockoffs 1. Introduction
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationKernel Change-point Analysis
Kernel Change-point Analysis Zaïd Harchaoui LTCI, TELECOM ParisTech and CNRS 46, rue Barrault, 75634 Paris cedex 3, France zaid.harchaoui@enst.fr Francis Bach Willow Project, INRIA-ENS 45, rue d Ulm, 75230
More informationOn prediction and density estimation Peter McCullagh University of Chicago December 2004
On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating
More information10-701/ Recitation : Kernels
10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer
More informationWhat is an RKHS? Dino Sejdinovic, Arthur Gretton. March 11, 2014
What is an RKHS? Dino Sejdinovic, Arthur Gretton March 11, 2014 1 Outline Normed and inner product spaces Cauchy sequences and completeness Banach and Hilbert spaces Linearity, continuity and boundedness
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4 Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton Rasmussen & Williams, Percy Liang) Kernel Regression Basis function regression
More informationHilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems
with Applications to Dynamical Systems Le Song Jonathan Huang School of Computer Science, Carnegie Mellon University Alex Smola Yahoo! Research, Santa Clara, CA, USA Kenji Fukumizu Institute of Statistical
More informationExtreme inference in stationary time series
Extreme inference in stationary time series Moritz Jirak FOR 1735 February 8, 2013 1 / 30 Outline 1 Outline 2 Motivation The multivariate CLT Measuring discrepancies 3 Some theory and problems The problem
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationKernel Principal Component Analysis
Kernel Principal Component Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London
More informationGeometry on Probability Spaces
Geometry on Probability Spaces Steve Smale Toyota Technological Institute at Chicago 427 East 60th Street, Chicago, IL 60637, USA E-mail: smale@math.berkeley.edu Ding-Xuan Zhou Department of Mathematics,
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More information