Big Hypothesis Testing with Kernel Embeddings
1 Big Hypothesis Testing with Kernel Embeddings
Dino Sejdinovic, Department of Statistics, University of Oxford
9 January 2015, UCL Workshop on the Theory of Big Data
D. Sejdinovic (Statistics, Oxford) Big Testing with Kernels 09/01/15 1 / 22

2 Collaborators: Heiko Strathmann, Soumyajit De, Wojciech Zaremba, Matthew Blaschko, Arthur Gretton
3-5 Making Hard Inference Possible
- many dimensions
- low signal-to-noise ratio
- highly non-linear associations
- higher-order interactions
[diagram: variables X1, X2, Y]
We need an expressive model and a very large number of observations, and cannot use batch algorithms.
6-7 Overview
1 Kernel Embeddings and MMD
2 Scaling up Kernel Tests
3 Experiments
8-9 Kernel Embeddings and MMD: Kernel Embedding
Feature map: x ↦ k(·, x) ∈ H_k, instead of x ↦ (ϕ_1(x), ..., ϕ_s(x)) ∈ R^s.
⟨k(·, x), k(·, y)⟩_{H_k} = k(x, y): inner products easily computed.
Embedding: P ↦ µ_k(P) = E_{X∼P} k(·, X) ∈ H_k, instead of P ↦ (Eϕ_1(X), ..., Eϕ_s(X)) ∈ R^s.
⟨µ_k(P), µ_k(Q)⟩_{H_k} = E_{X,Y} k(X, Y): inner products easily estimated.
10 Kernel Embeddings and MMD: Kernel MMD (1)
Definition. Kernel metric (MMD) between P and Q:
MMD²_k(P, Q) = ‖E k(·, X) − E k(·, Y)‖²_{H_k} = E_{X,X′} k(X, X′) + E_{Y,Y′} k(Y, Y′) − 2 E_{X,Y} k(X, Y).
11 Kernel Embeddings and MMD: Kernel MMD (2)
A polynomial kernel k(z, z′) = (1 + z⊤z′)^s captures the difference in the first s moments only.
For a certain family of kernels (characteristic kernels), MMD_k(P, Q) = 0 if and only if P = Q: Gaussian exp(−‖z − z′‖²₂ / (2σ²)), Laplacian, inverse multiquadrics, B_{2n+1}-splines, ...
Under mild assumptions, the k-MMD metrizes the weak* topology on probability measures (Sriperumbudur, 2010): MMD_k(P_n, P) → 0 ⇔ P_n ⇝ P.
12 Kernel Embeddings and MMD: Nonparametric two-sample tests
Testing H₀: P = Q vs. H_A: P ≠ Q, based on samples {x_i}_{i=1}^{n_x} ∼ P and {y_i}_{i=1}^{n_y} ∼ Q.
The test statistic is an estimate of MMD²_k(P, Q) = E_{X,X′} k(X, X′) + E_{Y,Y′} k(Y, Y′) − 2 E_{X,Y} k(X, Y):
MMD̂ = 1/(n_x(n_x − 1)) Σ_{i≠j} k(x_i, x_j) + 1/(n_y(n_y − 1)) Σ_{i≠j} k(y_i, y_j) − 2/(n_x n_y) Σ_{i,j} k(x_i, y_j).
O(n²) to compute (n = n_x + n_y).
Degenerate U-statistic: 1/√n-convergence to MMD under H_A, 1/n-convergence to 0 under H₀. Gretton et al (2009, 2012).
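The unbiased estimator above fits in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the Gaussian kernel, the bandwidth σ = 1, and the sample sizes are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased quadratic-time estimate of MMD^2_k(P, Q): off-diagonal means
    # of K_xx and K_yy, minus twice the mean of the cross kernel matrix K_xy.
    nx, ny = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y_same = rng.normal(0.0, 1.0, size=(500, 2))   # same distribution as X
Y_diff = rng.normal(1.0, 1.0, size=(500, 2))   # mean-shifted distribution
# The shifted pair yields a much larger statistic than the matched pair.
print(mmd2_unbiased(X, Y_same), mmd2_unbiased(X, Y_diff))
```

The O(n²) cost is visible directly: three n × n kernel matrices are formed and averaged.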
13-15 Kernel Embeddings and MMD: Nonparametric independence tests
H₀: X ⫫ Y, i.e., P_XY = P_X P_Y, vs. H_A: X and Y dependent, i.e., P_XY ≠ P_X P_Y.
Test statistic: HSIC(X, Y) = ‖µ_κ(P̂_XY) − µ_κ(P̂_X P̂_Y)‖²_{H_κ}, with κ = k ⊗ l. Gretton et al (2005, 2008); Smola et al (2007).
Extensions: conditional independence testing (Fukumizu, Gretton, Sun and Schölkopf, 2008; Zhang, Peters, Janzing and Schölkopf, 2011), three-variable interaction (DS, Gretton and Bergsma, 2013).
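For intuition, the empirical HSIC statistic with kernels k on X and l on Y reduces (up to the normalization convention) to a trace of centred kernel matrices. A minimal sketch of that standard identity; the Gaussian kernels, bandwidth, and toy data are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def hsic_biased(X, Y, sigma=1.0):
    # Biased (V-statistic) HSIC estimate with product kernel kappa = k (x) l:
    # HSIC = (1/n^2) trace(K H L H), where H = I - (1/n) 1 1^T centres features.
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    L = gaussian_kernel(Y, Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 1))
Y_dep = np.sin(X) + 0.1 * rng.normal(size=(400, 1))   # Y depends on X
Y_ind = rng.normal(size=(400, 1))                     # Y independent of X
# Dependent data produces a clearly larger HSIC value than independent data.
print(hsic_biased(X, Y_dep), hsic_biased(X, Y_ind))
```

Under independence the statistic is small but not exactly zero (it is biased upward by O(1/n)), which is why the test calibrates against a null distribution rather than against zero.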
16 Scaling up Kernel Tests: Outline
1 Kernel Embeddings and MMD
2 Scaling up Kernel Tests
3 Experiments
17 Scaling up Kernel Tests: Test threshold
Under H₀: P = Q:
(n_x n_y / (n_x + n_y)) MMD̂_k ⇝ Σ_{r=1}^∞ λ_r (Z_r² − 1), with {Z_r} i.i.d. N(0, 1).
The {λ_r} depend on both k and P, so threshold computation is expensive:
- estimate the leading λ_r's (requires an eigendecomposition of the kernel matrix): O(n³), or
- permutation test: #shuffles × O(n²).
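The permutation route can be sketched directly: pool the two samples, reshuffle the labels, and recompute the statistic for each shuffle. A minimal illustration, not a production implementation; the kernel, bandwidth, shuffle count, and add-one p-value smoothing are assumptions of this sketch.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased quadratic-time MMD^2 estimate (as on the earlier slide).
    nx, ny = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
            - 2.0 * Kxy.mean())

def mmd_permutation_test(X, Y, sigma=1.0, n_perms=100, seed=0):
    # Permutation null: each shuffle of the pooled sample costs O(n^2).
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled = np.vstack([X, Y])
    nx = len(X)
    exceed = 0
    for _ in range(n_perms):
        idx = rng.permutation(len(pooled))
        if mmd2_unbiased(pooled[idx[:nx]], pooled[idx[nx:]], sigma) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perms)  # add-one smoothed p-value

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(200, 2))
p_same = mmd_permutation_test(X, rng.normal(0.0, 1.0, size=(200, 2)))
p_diff = mmd_permutation_test(X, rng.normal(1.0, 1.0, size=(200, 2)))
print(p_same, p_diff)  # p_diff is tiny; p_same is typically large
```

The loop makes the cost structure concrete: #shuffles quadratic-time evaluations, which is exactly what the mini-batch approach on the next slides avoids.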
18 Scaling up Kernel Tests: Limited data, unlimited time
MMD²_k(P, Q) = E_{X,X′} k(X, X′) + E_{Y,Y′} k(Y, Y′) − 2 E_{X,Y} k(X, Y).
Estimate with MMD̂ = 1/(n_x(n_x − 1)) Σ_{i≠j} k(x_i, x_j) + 1/(n_y(n_y − 1)) Σ_{i≠j} k(y_i, y_j) − 2/(n_x n_y) Σ_{i,j} k(x_i, y_j).
Complexity: O(n²).
19 Scaling up Kernel Tests: Limited time, unlimited data
Process mini-batches of size B at a time (MMD̂_1, MMD̂_2, ...): η̂_k = (B/n) Σ_{b=1}^{n/B} MMD̂_{k,b}.
Complexity: O(nB). Provided B/n → 0: 1/√n-convergence to MMD if MMD ≠ 0; 1/√(nB)-convergence to 0 under H₀.
- A. Gretton, B. Sriperumbudur, DS, H. Strathmann, S. Balakrishnan, M. Pontil and K. Fukumizu, Optimal kernel choice for large-scale two-sample tests, NIPS 2012.
- W. Zaremba, A. Gretton, M. Blaschko, B-test: A Non-Parametric, Low Variance Kernel Two-Sample Test, NIPS 2013.
20 Scaling up Kernel Tests: Full statistic vs. mini-batch statistic

                      U-statistic                     mini-batch
time                  O(n²)                           O(nB)
storage               O(n²)                           O(B²)
null distribution     infinite sum of chi-squares     normal
computing p-value     O(n³) or #shuffles × O(n²)      O(nB)
convergence rate      1/n                             1/√(nB)

Under H₀: (n_x n_y / (n_x + n_y)^{3/2}) √B η̂_k ⇝ N(0, σ²_k).
σ²_k (depends on k and P) can be unbiasedly estimated on each block in O(B²) time.
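The mini-batch column of the table can be sketched as follows: average the unbiased MMD² statistic over n/B blocks, and use the normal null with the empirical per-block variance to get a p-value. A minimal illustration under those slide-level facts; the block size, kernel, bandwidth, and toy data are assumptions.

```python
import math
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased MMD^2 estimate on one block (quadratic only in the block size).
    nx, ny = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
            - 2.0 * Kxy.mean())

def block_mmd_test(X, Y, B=50, sigma=1.0):
    # B-test sketch: average per-block statistics; the average is
    # asymptotically normal under H0, so the p-value comes from a z-score
    # using the empirical std of the per-block statistics.
    n = min(len(X), len(Y))
    blocks = [mmd2_unbiased(X[i:i + B], Y[i:i + B], sigma)
              for i in range(0, n - B + 1, B)]
    eta = float(np.mean(blocks))
    se = float(np.std(blocks, ddof=1)) / math.sqrt(len(blocks))
    p_value = 0.5 * math.erfc((eta / se) / math.sqrt(2.0))  # one-sided tail
    return eta, p_value

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(2000, 2))
Y_diff = rng.normal(1.0, 1.0, size=(2000, 2))
eta, p = block_mmd_test(X, Y_diff)
print(eta, p)  # large statistic, tiny p-value: H0 rejected
```

Storage is O(B²) because only one B × B kernel matrix exists at a time, and the whole test is a single pass over the data, matching the streaming setting.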
21-23 Scaling up Kernel Tests: Asymptotic efficiency criterion
- A. Gretton, B. Sriperumbudur, DS, H. Strathmann, S. Balakrishnan, M. Pontil and K. Fukumizu, Optimal kernel choice for large-scale two-sample tests, NIPS 2012.
Proposition. For given P and Q, let η_k = MMD²_k(P, Q), and let σ²_k be the asymptotic variance of the linear-time statistic η̂_k. Then k* = arg max_{k∈K} η_k/σ_k minimizes the asymptotic Type II error probability on K.
We only have estimates of η_k and σ_k! Will the kernel optimization using plug-in estimates be consistent? Yes!
Over what families of kernels can we perform such optimization efficiently? Linear combinations (MKL).
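A plug-in version of the criterion can be sketched over a simple grid of Gaussian bandwidths: estimate η_k and σ_k from per-block statistics on held-out data and pick the kernel with the best ratio. This is an assumption-laden simplification of the criterion above (the paper optimizes over linear combinations of kernels, not just a grid); block size, bandwidth grid, and data are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    nx, ny = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
            - 2.0 * Kxy.mean())

def select_bandwidth(X_tr, Y_tr, bandwidths, B=50):
    # Plug-in criterion k* = argmax_k eta_hat_k / sigma_hat_k:
    # eta_hat_k is the mean of per-block statistics, sigma_hat_k their std.
    n = min(len(X_tr), len(Y_tr))
    ratios = {}
    for s in bandwidths:
        blocks = [mmd2_unbiased(X_tr[i:i + B], Y_tr[i:i + B], s)
                  for i in range(0, n - B + 1, B)]
        ratios[s] = np.mean(blocks) / (np.std(blocks, ddof=1) + 1e-12)
    return max(ratios, key=ratios.get)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(1000, 2))
Y = rng.normal(1.0, 1.0, size=(1000, 2))
best = select_bandwidth(X, Y, bandwidths=[0.01, 1.0, 100.0])
print(best)  # the degenerate tiny bandwidth sees no signal and is not chosen
```

In a proper test, the selection data must be disjoint from the testing data so that the subsequent normal-null p-value remains valid.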
24 Experiments: Outline
1 Kernel Embeddings and MMD
2 Scaling up Kernel Tests
3 Experiments
25-26 Experiments: Hard-to-detect differences: Gaussian blobs
Difficult problems: the lengthscale of the difference in distributions is not the same as that of the distributions themselves. We distinguish grids of Gaussian blobs with different covariances.
Figure: 3 × 3 grids of Gaussian blobs for P and Q; ratio ε = 3.2 of largest-to-smallest eigenvalues of the blobs in Q.
27-29 Experiments: Gaussian blobs (2)
Blobs with ε = 1.4. Linear-time statistic vs. quadratic-time statistic, fixed kernel.

            m per trial     Type II error    Trials
Quadratic   5,000           [0.7996, ...]    ...
            10,000          [0.5161, ...]    367
            > 10,000        Buy more RAM!
Linear      100,000,000     [0.2250, ...]    ...
            ...,000,000     [0.1873, ...]    ...
            ...,000,...     ... ± ...        ...
30 Experiments: Gaussian blobs (3)
Figure: Type II error vs. median ratio between eigenvalues, for kernel choices Opt, L2, MaxMMD, and Median; m = 10,000; family generated by Gaussian kernels with bandwidths {2^5, ..., 2^15}.
31-33 Experiments: Hard-to-detect differences: UCI HIGGS
- P. Baldi, P. Sadowski, and D. Whiteson, Searching for Exotic Particles in High-energy Physics with Deep Learning, Nature Communications 5, 2014.
A benchmark dataset for distinguishing a signature of the Higgs boson vs. background. Joint distributions of the azimuthal angular momenta ϕ for four particle jets: low-signal, low-level features. Do joint angular momenta carry any discriminating information?

sample size:         1e4    5e4    1e5    5e5    1e6
p-value (gauss-med): ...    ...    ...    ...    ...

train/test size:     2e3/8e3    1e4/4e4    2e4/8e4    1e5/4e5    2e5/8e5
p-value (gauss-opt): ...        ...        ...        ...e-...   ...e-18
34 Experiments: Experiment: Independence Test (sign)
X ∼ N(0, I_d), Y = √(2/d) Σ_{j=1}^{d/2} sign(X_{2j−1} X_{2j}) |Z_j| + Z_{d/2+1}, where Z ∼ N(0, I_{d/2+1}).
Figure: power (1 − Type II error) vs. number of samples on the sign dataset with the gauss-med kernel, for d = 50 and d = 100.
35 Experiments: Experiment: Independence Test (sine)
X₁, X₂ i.i.d. Unif[0, 2π], Y = sin(X₁ + X₂) + 10Z, with Z ∼ N(0, 1).

sine, B = 100           brown: q = 1    brown: opt    gauss: med    gauss: opt
N = 5e5, 1 − Type II    .277 ± ...      ... ± ...     ... ± ...     ... ± .061
N = 5e5, Type I         .035 ± ...      ... ± ...     ... ± ...     ... ± .027
N = 1e6, 1 − Type II    .460 ± ...      ... ± ...     ... ± ...     ... ± .041
N = 1e6, Type I         .055 ± ...      ... ± ...     ... ± ...     ... ± .033
36 Experiments: Shogun
Written in C++ with interfaces to Python, Matlab, Java, R. Google Summer of Code (2012, 2014).
37-40 Experiments: Summary
- Hypothesis testing based on kernel embeddings reveals hard-to-detect differences between distributions and non-linear, low-signal associations.
- A simple mini-batch procedure allows us to run the tests on large-scale problems and on streaming data.
- Kernel parameters can be selected on-the-fly in order to explicitly maximise test power.
- Both kernel selection and testing run in O(n) time and O(1) storage (if B = const).
More informationSTAT5044: Regression and Anova. Inyoung Kim
STAT5044: Regression and Anova Inyoung Kim 2 / 47 Outline 1 Regression 2 Simple Linear regression 3 Basic concepts in regression 4 How to estimate unknown parameters 5 Properties of Least Squares Estimators:
More informationMACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA
1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR
More informationDistribution Regression
Zoltán Szabó (École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) Dagstuhl Seminar 16481
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationUnderstanding the relationship between Functional and Structural Connectivity of Brain Networks
Understanding the relationship between Functional and Structural Connectivity of Brain Networks Sashank J. Reddi Machine Learning Department, Carnegie Mellon University SJAKKAMR@CS.CMU.EDU Abstract Background.
More informationPREDICTING SOLAR GENERATION FROM WEATHER FORECASTS. Chenlin Wu Yuhan Lou
PREDICTING SOLAR GENERATION FROM WEATHER FORECASTS Chenlin Wu Yuhan Lou Background Smart grid: increasing the contribution of renewable in grid energy Solar generation: intermittent and nondispatchable
More informationThe geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan
The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm
More informationMax. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes
Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationDeep Convolutional Neural Networks for Pairwise Causality
Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, Delhi Tata Consultancy Services Ltd. {karamjit.singh,
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationDistinguishing Causes from Effects using Nonlinear Acyclic Causal Models
JMLR Workshop and Conference Proceedings 6:17 164 NIPS 28 workshop on causality Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Kun Zhang Dept of Computer Science and HIIT University
More informationPrincipal Component Analysis
CSci 5525: Machine Learning Dec 3, 2008 The Main Idea Given a dataset X = {x 1,..., x N } The Main Idea Given a dataset X = {x 1,..., x N } Find a low-dimensional linear projection The Main Idea Given
More informationFacebook Friends! and Matrix Functions
Facebook Friends! and Matrix Functions! Graduate Research Day Joint with David F. Gleich, (Purdue), supported by" NSF CAREER 1149756-CCF Kyle Kloster! Purdue University! Network Analysis Use linear algebra
More informationSliced Inverse Regression
Sliced Inverse Regression Ge Zhao gzz13@psu.edu Department of Statistics The Pennsylvania State University Outline Background of Sliced Inverse Regression (SIR) Dimension Reduction Definition of SIR Inversed
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing
More informationTutorial on Gaussian Processes and the Gaussian Process Latent Variable Model
Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model (& discussion on the GPLVM tech. report by Prof. N. Lawrence, 06) Andreas Damianou Department of Neuro- and Computer Science,
More informationOutline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space
to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009 to The The A s s in 1 Motivation Outline 2 The Mapping the
More informationCan we do statistical inference in a non-asymptotic way? 1
Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.
More informationGIST 4302/5302: Spatial Analysis and Modeling
GIST 4302/5302: Spatial Analysis and Modeling Basics of Statistics Guofeng Cao www.myweb.ttu.edu/gucao Department of Geosciences Texas Tech University guofeng.cao@ttu.edu Spring 2015 Outline of This Week
More informationA Kernel on Persistence Diagrams for Machine Learning
A Kernel on Persistence Diagrams for Machine Learning Jan Reininghaus 1 Stefan Huber 1 Roland Kwitt 2 Ulrich Bauer 1 1 Institute of Science and Technology Austria 2 FB Computer Science Universität Salzburg,
More informationNonparametric Bayesian Methods
Nonparametric Bayesian Methods Debdeep Pati Florida State University October 2, 2014 Large spatial datasets (Problem of big n) Large observational and computer-generated datasets: Often have spatial and
More informationComputation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy
Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of
More informationNonparametric Indepedence Tests: Space Partitioning and Kernel Approaches
Nonparametric Indepedence Tests: Space Partitioning and Kernel Approaches Arthur Gretton 1 and László Györfi 2 1. Gatsby Computational Neuroscience Unit London, UK 2. Budapest University of Technology
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationSecond-Order Inference for Gaussian Random Curves
Second-Order Inference for Gaussian Random Curves With Application to DNA Minicircles Victor Panaretos David Kraus John Maddocks Ecole Polytechnique Fédérale de Lausanne Panaretos, Kraus, Maddocks (EPFL)
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University WHOA-PSI Workshop, St Louis, 2017 Quotes from Day 1 and Day 2 Good model or pure model? Occam s razor We really
More informationCorner. Corners are the intersections of two edges of sufficiently different orientations.
2D Image Features Two dimensional image features are interesting local structures. They include junctions of different types like Y, T, X, and L. Much of the work on 2D features focuses on junction L,
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More information