Fast Two Sample Tests using Smooth Random Features


Kacper Chwialkowski, University College London, Computer Science Department (KACPER.CHWIALKOWSKI@GMAIL.COM)
Aaditya Ramdas, Carnegie Mellon University, Machine Learning and Statistics, School of Computer Science (ARAMDAS@CS.CMU.EDU)
Dino Sejdinovic, University of Oxford, Department of Statistics (DINO.SEJDINOVIC@GMAIL.COM)
Arthur Gretton, University College London, Gatsby Computational Neuroscience Unit (ARTHUR.GRETTON@GMAIL.COM)

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We propose a nonparametric two-sample test with cost linear in the number of samples. Our test statistic uses differences in smoothed characteristic functions: these are able to distinguish a larger class of alternatives than the non-smoothed characteristic functions used in previous linear-time tests, while being much faster than the current state-of-the-art tests based on kernels or distances, which are quadratic in the sample size. Experiments on artificial benchmarks and on challenging real-life testing problems demonstrate that our test gives a better time/power tradeoff than competing approaches, including sub-quadratic-time variants of the kernel tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable in low-order statistics.

1. Introduction

Testing whether two random variables are identically distributed, without imposing any parametric assumptions on their distributions, is important in a variety of scientific applications, e.g., data integration in bioinformatics (Borgwardt et al., 2006), benchmarking for steganography (Pevnỳ & Fridrich, 2008), or automated model checking (Lloyd & Ghahramani, 2014). These problems are addressed in the statistics literature via two-sample tests (also known as homogeneity tests).

Kernel two-sample tests, such as those based on the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012a), the Kernel Fisher Discriminant (KFD) (Harchaoui et al., 2008), or block-based MMD (Zaremba et al., 2013), use embeddings of probability distributions into reproducing kernel Hilbert spaces (RKHS) (Sriperumbudur et al., 2010; 2011). For translation-invariant kernel functions, MMD using RKHS embeddings is precisely a weighted distance between empirical characteristic functions (Alba Fernández et al., 2008), where the kernel is the Fourier transform of the weight function used.[1] Epps & Pulley (1983) and Hall & Welsh (1983) used a statistic of this form to create a goodness-of-fit test for normality. Most of these tests have at least quadratic time complexity; the exceptions are the linear-time and sub-quadratic-time block-based tests investigated in (Ho & Shieh, 2006; Gretton et al., 2012a;b; Zaremba et al., 2013).
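For reference, the connection noted above between MMD with a translation-invariant kernel and a weighted distance between characteristic functions can be made explicit. The identity below is a standard consequence of Bochner's theorem, written here in our own notation as a reminder rather than quoted from the paper:

    \mathrm{MMD}^2(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}_k}^2
                         = \int_{\mathbb{R}^d} \big| \varphi_P(w) - \varphi_Q(w) \big|^2 \, \Lambda(w) \, dw,
    \qquad \text{where} \quad k(x, y) = \int_{\mathbb{R}^d} e^{i w^\top (x - y)} \, \Lambda(w) \, dw,

with \mu_P, \mu_Q the kernel mean embeddings, \varphi_P, \varphi_Q the characteristic functions of the two distributions, and \Lambda \geq 0 the spectral (weight) density of the kernel k.
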
Recently, Bochner's theorem has been exploited in the kernel learning literature by Rahimi & Recht (2007) and Le et al. (2013) in order to construct randomized explicit feature representations of data, thus speeding up various kernel learning tasks by running primal algorithms with greatly reduced complexity in the number of observations. The idea of employing empirical characteristic functions evaluated at random frequencies (equivalently, averaged random Fourier features in the sense of Rahimi & Recht (2007)) has a long history in the statistical testing literature. Empirical characteristic functions (ECF) evaluated at a single frequency were studied by Heathcote (1972; 1977) in the context of goodness-of-fit tests, with cost linear in the sample size. They showed that the power of their test can be maximized against fully specified alternative hypotheses, where the power is the probability of correctly rejecting the null hypothesis that the distributions are the same. In other words, if the class of distributions being differentiated is known in advance, then the test can focus on the particular frequencies where the characteristic functions differ the most. This approach was generalized to evaluating the ECF at multiple distinct frequencies by Epps & Singleton (1986), who propose using a Hotelling's t-statistic on such evaluations, thus avoiding the problem of needing to know the best frequency in advance (the test remains linear in the sample size). Our work builds on these ideas: for comparison, we describe a multivariate extension of the test by Epps & Singleton (1986), and compare against it in experiments.

In the present work, we revisit the idea of parsimonious frequency representations for testing. We construct novel two-sample tests that measure differences in smoothed empirical characteristic functions at a set of frequencies, where these smoothed characteristic functions are defined in Section 2. We emphasize that this smoothing can be carried out without a time complexity penalty: the cost remains linear in the number of samples.

Our test has a number of theoretical and computational advantages over previous approaches. Comparing first with Epps & Singleton (1986), our test is consistent for a broader class of distributions, namely those with (1 + ε)-integrable density functions, for some ε > 0 (by consistent, we mean that the power approaches one as we see more samples). By contrast, a test that looks at a fixed set of frequencies in the non-smoothed characteristic functions is consistent only under much more onerous conditions, which are not satisfied, for instance, if the two characteristic functions agree on an interval. This same weakness was used by Alba Fernández et al. (2008) in justifying a test that integrates over the entire frequency domain (albeit at cost quadratic in the sample size). Compared with such quadratic-time tests (including the quadratic-time MMD), our test can be conducted in linear time, although we would expect some loss of power, in line with findings for linear- and sub-quadratic-time MMD-based tests (Ho & Shieh, 2006; Gretton et al., 2012a;b; Zaremba et al., 2013).

The most important advantage of the new test is observed in its performance on experimental benchmarks (Section 4). For challenging artificial data (both high dimensional, and where the difference in distributions is very subtle), our test gives a better power/computation tradeoff than the characteristic function-based tests of Epps & Singleton (1986), the previous sub-quadratic-time MMD tests (Gretton et al., 2012b; Zaremba et al., 2013), and the quadratic-time MMD test (Gretton et al., 2012a). The final case is especially interesting: even though the quadratic-time test is more powerful for a given number of samples, its computational cost is far higher; thus, in the big data regime, it is much better to simply use more data with a smoothed characteristic function-based test. Finally, we compare test performance on distinguishing signatures of the Higgs boson from background noise (Baldi et al., 2014), and on amplitude modulated audio data, which are both challenging multivariate testing problems. Our test again gives the best power/computation tradeoff.

[1] Note that when kernels are not invariant to translation, two-sample tests may still be constructed, although the Bochner argument no longer applies. One class of such tests uses the N-distance or energy distance (Zinger et al., 1992; Székely, 2003; Baringhaus & Franz, 2004), which turns out to be an MMD-based test for a particular family of kernels (Sejdinovic et al., 2013).
2. Smoothed characteristic functions

In this section we introduce the smoothed characteristic function, and show that it distinguishes a substantially larger class of differences in distributions than the classical characteristic function, given that we are able to evaluate these differences only at a finite set of frequencies (and assuming the absence of prior knowledge of where best to place those frequencies).

The characteristic function of a random variable X is the inverse Fourier transform of its probability density function,

    ϕ_X(w) = E[exp(i w^T X)].    (1)

The smoothed characteristic function is a convolution of the characteristic function with a smoothing function f,

    φ_X(t) = ∫_{R^d} ϕ_X(w) f(w - t) dw,    (2)

for t ∈ R^d (we use w to denote the frequency in the characteristic function and t for the smoothed characteristic function). The smoothed characteristic function can be written as the expected value of a function of X, rather than as a convolution.

Proposition 1. Let f be an integrable function and Tf its Fourier transform. Then

    φ_X(t) = E[exp(i t^T X) Tf(X)].    (3)

All proofs are presented in the Appendix.

Advantages of smoothing. The above proposition has two useful implications, one computational and one theoretical. Computationally, the empirical smoothed characteristic function can be calculated in linear time, as discussed in Section 3. As for the theoretical advantage, we will show in Theorem 1 below that smoothed characteristic functions of two different random variables must differ almost everywhere (subject to a mild condition). By contrast, two distinct characteristic functions may nonetheless agree on a finite interval (Lukacs, 1972, p. 11); see also the example below Theorem 1. Before we get to our main result on smoothed characteristic functions, let us first identify a sufficient condition for the characteristic functions of X and Y to differ almost everywhere.
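As an illustration of the computational point, the empirical smoothed characteristic function of Proposition 1 can be computed in a single pass over the data. The following sketch is ours rather than the authors' code; the function name is hypothetical, and a Gaussian Tf (the s = 2 case mentioned in the Note after Theorem 1 below) is assumed for concreteness.

    import numpy as np

    def empirical_smoothed_cf(X, T, Tf=lambda x: np.exp(-0.5 * np.sum(x ** 2, axis=1))):
        """Empirical smoothed characteristic function of Proposition 1 (a sketch).

        X  : (n, d) array of observations.
        T  : (J, d) array of frequencies t_1, ..., t_J.
        Tf : Fourier transform of the smoothing function, evaluated at the data
             (a Gaussian by default; this particular choice is our assumption).

        Returns a length-J complex array with entries
        (1/n) * sum_i exp(i t_j^T X_i) Tf(X_i), at O(n J d) cost, linear in n.
        """
        weights = Tf(X)                    # (n,), computed once and reused for all frequencies
        phases = np.exp(1j * X @ T.T)      # (n, J) matrix of exp(i t_j^T X_i)
        return (phases * weights[:, None]).mean(axis=0)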

The following lemma combines several results from the literature and provides such a condition.

Lemma 1. Let M_{Z,k} = E‖Z‖^k. Suppose that M_{Z,k} is finite for each k, and that lim sup_{k→∞} (M_{Z,k}/k!)^{1/k} is bounded, for Z ∈ {X, Y}. Then, if X and Y have different distributions, the characteristic functions of X and Y differ almost everywhere.

We now proceed to our main result, which explains why smoothing the characteristic function is a good idea from the point of view of two-sample testing.

Theorem 1. Suppose that X and Y have (1 + ε)-integrable densities. Let the Fourier transform of the smoothing function be of the form Tf(x) = e^(-‖x‖^s) for some s > 1. Then X and Y have different distributions if and only if the smoothed characteristic functions of X and Y differ almost everywhere.

Note. Any function of the form e^(-‖x‖^s), for s > 1, has an inverse Fourier transform (Rudin, 1987, Chapter 9). For s = 2 the smoothing function is proportional to a Gaussian density.

Figure 1. Smooth vs non-smooth. The left plot presents the distance between empirical smoothed characteristic functions of two random variables X and Y, and the right plot presents the distance between empirical characteristic functions. The random variables used are illustrated in Figure 4; these are the grids of Gaussian distributions discussed in detail in Section 4. Note the difference between the maximal values of the distance: 14 for the characteristic function and 6 for the smoothed characteristic function.

We discuss implications of this theorem in Appendix C.

3. Proposed Test

We describe in detail the linear-time two-sample test that uses smoothed characteristic functions.

3.1. Difference in smoothed characteristic functions

Our test statistic is based on the differences between smoothed characteristic functions at randomly chosen frequencies. Since our goal is to test whether this vector of differences is all zeros, we make use of Hotelling's t-statistic. Let {Z_i}_{1≤i≤n} be a collection of n random d-dimensional vectors. Let µ_n denote the empirical mean of the sequence {Z_i}_{1≤i≤n}, and Σ_n the empirical covariance matrix. We assume that lim_{n→∞} Σ_n = Σ and lim_{n→∞} µ_n = µ, where the limits hold in probability. Hotelling's t-statistic is

    S_n = n µ_n^T Σ_n^{-1} µ_n.

The relevant asymptotic properties of Hotelling's t-statistic are as follows.

Proposition 2 (Asymptotic behavior of Hotelling's t-statistic). If E Z_i = 0, then under the usual assumptions of the multivariate CLT for independent random variables, the statistic S_n is asymptotically distributed as a χ² random variable with d degrees of freedom (as n → ∞ with d fixed). If E Z_i ≠ 0, then for any fixed r, P(S_n > r) → 1 as n → ∞.

We now apply the above proposition to obtain a statistical test. The empirical characteristic function, for observations {X_i}_{1≤i≤n}, is

    ϕ̂_X(w) = (1/n) Σ_{i=1}^n exp(i w^T X_i).

The empirical smoothed characteristic function, as described in Proposition 1, is

    φ̂_X(t) = (1/n) Σ_{i=1}^n exp(i t^T X_i) Tf(X_i).

Note that this formula is much more computationally efficient than the explicit convolution of f with the empirical characteristic function. In particular, the vector (Tf(X_i))_{1≤i≤n} can be reused for different frequencies.

Test 1 (Smoothed CF test). Let W_i be the difference between the empirical smoothed characteristic function terms at the paired datapoints X_i and Y_i, evaluated at the chosen frequencies t_k, i.e. W_i^k = φ̂_{X_i}(t_k) - φ̂_{Y_i}(t_k) for 1 ≤ k ≤ d. Defining Z_i := (Re(W_i^1), ..., Re(W_i^d), Im(W_i^1), ..., Im(W_i^d)), our test statistic is S_n = n µ_n^T Σ_n^{-1} µ_n, where µ_n is the empirical mean of the sequence Z_i.
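The statistic of Test 1 follows directly from these quantities. Below is a minimal sketch (our own illustration, with hypothetical names and the same Gaussian Tf assumption as above); the number of frequencies, written d in Test 1, appears here as J = T.shape[0], so each Z_i has dimension 2J.

    import numpy as np

    def smoothed_cf_statistic(X, Y, T, Tf=lambda x: np.exp(-0.5 * np.sum(x ** 2, axis=1))):
        """Hotelling-type statistic S_n of Test 1 (a sketch, not the authors' code).

        X, Y : (n, d) paired samples; T : (J, d) frequencies t_1, ..., t_J.
        """
        n = X.shape[0]
        Wx = np.exp(1j * X @ T.T) * Tf(X)[:, None]   # per-sample smoothed CF terms for X
        Wy = np.exp(1j * Y @ T.T) * Tf(Y)[:, None]   # ... and for Y
        W = Wx - Wy                                  # (n, J): the differences W_i^k
        Z = np.hstack([W.real, W.imag])              # (n, 2J): the vectors Z_i
        mu = Z.mean(axis=0)                          # empirical mean mu_n
        Sigma = np.cov(Z, rowvar=False)              # empirical covariance Sigma_n
        return n * mu @ np.linalg.solve(Sigma, mu)   # S_n; assumes Sigma_n is invertible
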
We choose a threshold r_α corresponding to the 1 - α quantile of the asymptotic χ² distribution of Proposition 2 under the null hypothesis (that X and Y have the same distribution), and reject the null whenever S_n is larger than r_α. There are a number of valid choices for the frequencies t_k at which we evaluate the differences in characteristic functions. In our case, we draw these t_k independently and identically from the Fourier transform of the kernel used in the MMD-based test (see the next section for details). In this manner, the Fourier features used by our test are identical to those used by the MMD-based test.
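Putting the pieces together, an end-to-end version of the test might look as follows. The frequencies t_k are drawn i.i.d. from the Fourier transform of a Gaussian kernel (which is itself a Gaussian, by Bochner's theorem), and S_n is compared with the 1 - α quantile of a χ² distribution whose degrees of freedom equal the dimension of Z_i, as in Proposition 2. The kernel choice, bandwidth, and number of frequencies below are placeholder assumptions of ours, not the tuned settings used in the experiments.

    import numpy as np
    from scipy.stats import chi2

    def smoothed_cf_test(X, Y, num_freqs=5, bandwidth=1.0, alpha=0.05, seed=0):
        """End-to-end sketch of the smoothed CF test (our illustration)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Frequencies drawn i.i.d. from the Fourier transform of a Gaussian kernel
        # with the given bandwidth; that spectral distribution is itself a Gaussian.
        T = rng.normal(scale=1.0 / bandwidth, size=(num_freqs, d))
        S_n = smoothed_cf_statistic(X, Y, T)          # statistic from the sketch above
        df = 2 * num_freqs                            # dimension of Z_i (Re and Im parts)
        r_alpha = chi2.ppf(1.0 - alpha, df)           # threshold at the 1 - alpha quantile
        return S_n, S_n > r_alpha                     # reject the null when S_n > r_alpha

As a sanity check, calling smoothed_cf_test on two large samples drawn from the same distribution should reject at a rate close to the nominal α.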

3.2. Other tests

The other tests considered in the Experiments section are the quadratic-time MMD test (Gretton et al., 2012a), the sub-quadratic-time MMD test (Zaremba et al., 2013), and the unsmoothed CF test (Epps & Singleton, 1986). See Appendix B for a detailed description.

4. Experiments

We compare the proposed method with the linear-time two-sample tests described in the previous section, i.e. the B-test and the CF test. Where computationally feasible, we also compare with the quadratic-time MMD test. We evaluate performance on four datasets: high dimensional random variables, grids of Gaussians, amplitude modulated audio signals, and features in the Higgs dataset. To simplify the parameter selection procedure, we set all bandwidths to one (including the width of the smoothing window for the smoothed CF test), and directly scaled the data instead (this distinction is irrelevant for the kernel and unsmoothed tests, but the smoothing window may be suboptimal for our smoothed test). The data scaling was chosen so as to maximize test power on a held-out training set. The full details are described in the Appendix. Note that we did not use the popular median heuristic for kernel bandwidth choice (MMD and B-test), since it gives poor results for the Blobs and AM Audio datasets (Gretton et al., 2012b). The original CF test was proposed without any parameters, but we used the same data scaling (equivalently, kernel width) parameter as in the other tests to ensure a fair comparison.

Simulation 1: High Dimensions. It has recently been shown, in theory and in practice, that the two-sample problem gets more difficult as the number of dimensions on which the distributions do not differ increases (Ramdas et al., 2015; Reddi et al., 2015). As a corollary, for the MMD test, the authors show that the number of samples needs to grow at some rate with the dimensionality of the random variables in order to achieve high power. In the following experiments, we study the power of the two-sample tests as a function of the dimension of the random vector. In both experiments we compare Gaussian random vectors which differ only in the first dimension, i.e.,

    Dataset I:  X ~ N(0_d, I_d),  Y ~ N((1, 0, ..., 0), I_d);
    Dataset II: X ~ N(0_d, I_d),  Y ~ N(0_d, diag(2, 1, ..., 1)),

where 0_d is a d-dimensional vector of zeros, I_d is the d-dimensional identity matrix, and diag creates a diagonal matrix out of a vector. In Dataset I the means differ, and in Dataset II the variances differ. The power of the different two-sample tests is presented in Figure 2. The Smoothed CF test yields the best performance for differences in variances, and performance equal to the unsmoothed CF test for differences in means.

Figure 2. Power comparison for different tests on high dimensional data. Left: dataset II (difference in variances). Right: dataset I (difference in means). The pale envelope indicates 95% confidence intervals for the results, as averaged over repeated runs.

Simulation 2: Blobs. The Blobs dataset is a grid of two-dimensional Gaussian distributions (see Figure 3), which is known to be a challenging two-sample testing task. The difficulty arises from the fact that the difference in distributions is encoded at a much smaller lengthscale than the overall data. In this experiment both X and Y were drawn from a five-by-five grid of Gaussians, where X had unit covariance matrix in each mixture component, while each component of Y had a non-unit covariance matrix; a sampler for these synthetic benchmarks is sketched below.
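For concreteness, the synthetic benchmarks above can be generated as in the sketch below (ours, with hypothetical function names). The grid size, the π/4 rotation, and the standard deviation of 2 are taken from the text and the Figure 3 caption; the grid spacing and the remaining defaults are our own assumptions.

    import numpy as np

    def sample_dataset_I(n, d, rng):
        """Dataset I: X ~ N(0_d, I_d), Y ~ N((1, 0, ..., 0), I_d)."""
        X = rng.standard_normal((n, d))
        Y = rng.standard_normal((n, d))
        Y[:, 0] += 1.0                       # shift the mean of the first coordinate
        return X, Y

    def sample_dataset_II(n, d, rng):
        """Dataset II: X ~ N(0_d, I_d), Y ~ N(0_d, diag(2, 1, ..., 1))."""
        X = rng.standard_normal((n, d))
        Y = rng.standard_normal((n, d))
        Y[:, 0] *= np.sqrt(2.0)              # variance 2 in the first coordinate only
        return X, Y

    def sample_blobs(n, rng, grid=5, spacing=10.0, stretch=2.0, angle=np.pi / 4):
        """Blobs: grid x grid mixture of 2-D Gaussians (spacing is an assumption)."""
        centres_x = rng.integers(0, grid, size=(n, 2)) * spacing
        centres_y = rng.integers(0, grid, size=(n, 2)) * spacing
        X = centres_x + rng.standard_normal((n, 2))          # unit-covariance components
        R = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
        L = R @ np.diag([stretch, 1.0])                       # square root of rotated covariance
        Y = centres_y + rng.standard_normal((n, 2)) @ L.T     # std 2 along the rotated axis
        return X, Y

For example, X, Y = sample_blobs(1000, np.random.default_rng(0)) produces one realization of the Blobs problem.
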
It was shown by Gretton et al. (2012b) that a good choice of kernel is crucial for this task: we used the procedure outlined in the Appendix. Figure 3 presents the results of the various two-sample tests on the Blobs dataset. The full MMD test has the best power as a function of sample size, but a much worse power/execution-time tradeoff than the CF-based tests. The Smoothed CF test has the best power as a function of the sample size among the linear- and sub-quadratic-time tests.

Figure 3. Power and execution time of different two-sample tests on the Blobs dataset. Left: power as a function of the sample size. Center: power vs. execution time. All plotted results are averaged over repeated runs. Right: illustration of the Blobs dataset. Each mixture component in the upper plot is a standard Gaussian, whereas those in the lower plot have the direction of the largest variance rotated by π/4 and amplified so that the standard deviation in this direction is 2.

Real Data 1: Higgs dataset. The next experiment we consider is on the UCI Higgs dataset (Lichman, 2013) described in Baldi et al. (2014); the task is to distinguish signatures of processes which produce Higgs bosons from background processes which do not. We consider a two-sample test on certain extremely low-signal low-level features in the dataset: kinematic properties measured by the particle detectors, i.e., the joint distributions of the azimuthal angular momenta ϕ for four particle jets. We denote by P the jet ϕ-momenta distribution of the background process, and by Q that of the process that produces Higgs bosons (both are distributions on R^4). As discussed in Baldi et al. (2014, Fig. 2), such angular momenta, unlike transverse momenta p_T, individually carry very little discriminating information for signal vs. background benchmark events. Therefore, we would like to test the null hypothesis that the distributions of angular momenta, P and Q, are the same. The results for the different algorithms are presented in Figure 4. We observe that the joint distribution of the angular momenta is in fact a discriminative feature. Clearly, the Smoothed CF test has significantly higher power than the other two tests, both as a function of sample size and of execution time.

Figure 4. Power for two different two-sample tests based on the four explanatory variables from the Higgs dataset. Left: power of the test as a function of sample size. Right: power of the test as a function of execution time. Results are averaged over repeated runs (zooming in will display the extremely thin error envelopes).

Real Data 2: Amplitude Modulated Music. Amplitude modulation is the earliest technique used to transmit voice over the radio. In the following experiment, X and Y were one-thousand-dimensional samples of carrier signals that were modulated with two different input audio signals from the same album, song X and song Y (further details of these data are described by Gretton et al., 2012b, Section 5). We conducted two-sample tests using ten thousand samples from each signal. To increase the difficulty of the testing problem, independent Gaussian noise of increasing variance was added to the signals. The results are presented in Figure 5. The B-test is competitive with the smoothed and unsmoothed CF tests for low noise levels, but both the B-test and the unsmoothed CF test are less powerful than our test for medium to high noise levels.

Figure 5. Left: Power comparison on the music dataset. The pale envelope around the solid lines represents 95% confidence intervals for the results, as averaged over repeated runs. Right: four different realizations of the X and Y variables.

Type I error simulations. In Figure 6, we present the Type I error of the tests on the Blobs and high-dimensional datasets.

Figure 6. Type I error on the Blobs dataset (left) and the high-dimensional dataset (right). The dashed lines denote the 99% confidence interval for the Type I error.

References

Alba Fernández, V., Jiménez-Gamero, M., and Muñoz Garcia, J. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis, 52, 2008.

Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.

Baringhaus, L. and Franz, C. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1), 2004.

Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.-P., Schölkopf, B., and Smola, A. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.

Cuesta-Albertos, J.A., Fraiman, R., and Ransford, T. A sharp form of the Cramér-Wold theorem. Journal of Theoretical Probability, 20(2), 2007.

Epps, T.W. and Pulley, L.B. A test for normality based on the empirical characteristic function. Biometrika, 70(3), 1983.

Epps, T.W. and Singleton, K.J. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177-203, 1986.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems, 2009.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012a.

Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, 2012b.

Hall, P. and Welsh, A.H. A test for normality based on the empirical characteristic function. Biometrika, 70(2), 1983.

Harchaoui, Z., Bach, F.R., and Moulines, E. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems, 2008.

Heathcote, C.R. A test of goodness of fit for symmetric random variables. Australian Journal of Statistics, 14(2), 1972.

Heathcote, C.R. The integrated squared error estimation of parameters. Biometrika, 64(2), 1977.

Ho, H.-C. and Shieh, G. Two-stage U-statistics for hypothesis testing. Scandinavian Journal of Statistics, 33(4), 2006.

Le, Q., Sarlos, T., and Smola, A. Fastfood - computing Hilbert space expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, JMLR W&CP, volume 28, 2013.

Lichman, M. UCI machine learning repository, 2013.

Lloyd, J.R. and Ghahramani, Z. Statistical model criticism using kernel two sample tests. Technical report, 2014.

Lukacs, E. A survey of the theory of characteristic functions. Advances in Applied Probability, 1972.

Lukacs, E. and Szasz, O. On analytic characteristic functions. Pacific Journal of Mathematics.

Pevnỳ, T. and Fridrich, J. Benchmarking for steganography. In Information Hiding. Springer, 2008.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.

Ramdas, A., Reddi, S., Póczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel- and distance-based nonparametric hypothesis tests in high dimensions. In 29th AAAI Conference on Artificial Intelligence, 2015.

Reddi, S., Ramdas, A., Póczos, B., Singh, A., and Wasserman, L. On the high-dimensional power of linear-time kernel two-sample testing under mean-difference alternatives. In 18th International Conference on Artificial Intelligence and Statistics, 2015.

Rudin, W. Real and Complex Analysis. Tata McGraw-Hill Education, 1987.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263-2291, 2013.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517-1561, 2010.

Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389-2410, 2011.

Székely, G.J. E-statistics: The energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report, 2003.

Zaremba, W., Gretton, A., and Blaschko, M. B-test: A nonparametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, 2013.

Zinger, A.A., Kakosyan, A.V., and Klebanov, L.B. A characterization of distributions by mean values of statistics and certain probabilistic metrics. Journal of Mathematical Sciences, 59(4), 1992.
