Fast Two Sample Tests using Smooth Random Features


Kacper Chwialkowski, University College London, Computer Science Department (KACPER.CHWIALKOWSKI@GMAIL.COM)
Aaditya Ramdas, Carnegie Mellon University, Machine Learning and Statistics, School of Computer Science (ARAMDAS@CS.CMU.EDU)
Dino Sejdinovic, University of Oxford, Department of Statistics (DINO.SEJDINOVIC@GMAIL.COM)
Arthur Gretton, University College London, Gatsby Computational Neuroscience Unit (ARTHUR.GRETTON@GMAIL.COM)

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We propose a nonparametric two-sample test with cost linear in the number of samples. Our test statistic uses differences in smoothed characteristic functions: these are able to distinguish a larger class of alternatives than the non-smoothed characteristic functions used in previous linear-time tests, while being much faster than the current state-of-the-art tests based on kernels or distances, which are quadratic in the sample size. Experiments on artificial benchmarks and on challenging real-life testing problems demonstrate that our test gives a better time/power tradeoff than competing approaches, including sub-quadratic-time variants of the kernel tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable in low-order statistics.

1. Introduction

Testing whether two random variables are identically distributed, without imposing any parametric assumptions on their distributions, is important in a variety of scientific applications, e.g., data integration in bioinformatics (Borgwardt et al., 2006), benchmarking for steganography (Pevnỳ & Fridrich, 2008), or automated model checking (Lloyd & Ghahramani, 2014). These problems are addressed in the statistics literature via two-sample tests (also known as homogeneity tests).

Kernel two-sample tests, such as those based on the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012a), the Kernel Fisher Discriminant (KFD) (Harchaoui et al., 2008), or block-based MMD (Zaremba et al., 2013), use embeddings of probability distributions into reproducing kernel Hilbert spaces (RKHS) (Sriperumbudur et al., 2010; 2011). For translation-invariant kernel functions, MMD using RKHS embeddings is precisely a weighted distance between empirical characteristic functions (Alba Fernández et al., 2008), where the kernel is the Fourier transform of the weight function used.[1] Epps & Pulley (1983) and Hall & Welsh (1983) used a statistic of this form to create a goodness-of-fit test for normality. Most of these tests have at least quadratic time complexity; the exceptions are the linear-time and sub-quadratic-time block-based tests investigated in (Ho & Shieh, 2006; Gretton et al., 2012a;b; Zaremba et al., 2013).
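For reference, the connection noted above between MMD with a translation-invariant kernel and a weighted distance between characteristic functions can be made explicit. The identity below is a standard consequence of Bochner's theorem, written here in our own notation as a reminder rather than quoted from the paper:

    \mathrm{MMD}^2(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}_k}^2
                         = \int_{\mathbb{R}^d} \big| \varphi_P(w) - \varphi_Q(w) \big|^2 \, \Lambda(w) \, dw,
    \qquad \text{where} \quad k(x, y) = \int_{\mathbb{R}^d} e^{i w^\top (x - y)} \, \Lambda(w) \, dw,

with \mu_P, \mu_Q the kernel mean embeddings, \varphi_P, \varphi_Q the characteristic functions of the two distributions, and \Lambda \geq 0 the spectral (weight) density of the kernel k.
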
Recently, Bochner's theorem has been exploited in the kernel learning literature by Rahimi & Recht (2007) and Le et al. (2013) in order to construct randomized explicit feature representations of data, thus speeding up various kernel learning tasks by running primal algorithms with greatly reduced complexity in the number of observations. The idea of employing empirical characteristic functions evaluated at random frequencies (equivalently, averaged random Fourier features in the sense of Rahimi & Recht (2007)) has a long history in the statistical testing literature. Empirical characteristic functions (ECF) evaluated at a single frequency were studied by Heathcote (1972; 1977) in the context of goodness-of-fit tests, with cost linear in the sample size. They showed that the power of their test can be maximized against fully specified alternative hypotheses, where the power is the probability of correctly rejecting the null hypothesis that the distributions are the same. In other words, if the class of distributions being differentiated is known in advance, then the test can focus on the particular frequencies where the characteristic functions differ the most. This approach was generalized to evaluating the ECF at multiple distinct frequencies by Epps & Singleton (1986), who propose using a Hotelling's t-statistic on such evaluations, thus avoiding the problem of needing to know the best frequency in advance (the test remains linear in the sample size). Our work builds on these ideas: for comparison, we describe a multivariate extension of the test by Epps & Singleton (1986), and compare against it in experiments.

In the present work, we revisit the idea of parsimonious frequency representations for testing. We construct novel two-sample tests that measure differences in smoothed empirical characteristic functions at a set of frequencies, where these smoothed characteristic functions are defined in Section 2. We emphasize that this smoothing can be carried out without a time complexity penalty: the cost remains linear in the number of samples.

Our test has a number of theoretical and computational advantages over previous approaches. Comparing first with Epps & Singleton (1986), our test is consistent for a broader class of distributions, namely those with (1 + ε)-integrable density functions, for some ε > 0 (by consistent, we mean that the power approaches one as we see more samples). By contrast, a test that looks at a fixed set of frequencies in the non-smoothed characteristic functions is consistent only under much more onerous conditions, which are not satisfied, for instance, if the two characteristic functions agree on an interval. This same weakness was used by Alba Fernández et al. (2008) in justifying a test that integrates over the entire frequency domain (albeit at cost quadratic in the sample size). Compared with such quadratic-time tests (including the quadratic-time MMD), our test can be conducted in linear time, although we would expect some loss of power, in line with findings for linear- and sub-quadratic-time MMD-based tests (Ho & Shieh, 2006; Gretton et al., 2012a;b; Zaremba et al., 2013).

The most important advantage of the new test is observed in its performance on experimental benchmarks (Section 4). For challenging artificial data (both high dimensional, and where the difference in distributions is very subtle), our test gives a better power/computation tradeoff than the characteristic function-based tests of Epps & Singleton (1986), the previous sub-quadratic-time MMD tests (Gretton et al., 2012b; Zaremba et al., 2013), and the quadratic-time MMD test (Gretton et al., 2012a). The final case is especially interesting: even though the quadratic-time test is more powerful for a given number of samples, its computational cost is far higher; thus, in the big data regime, it is much better to simply use more data with a smoothed characteristic function-based test. Finally, we compare test performance on distinguishing signatures of the Higgs boson from background noise (Baldi et al., 2014), and on amplitude modulated audio data, which are both challenging multivariate testing problems. Our test again gives the best power/computation tradeoff.

[1] Note that when kernels are not invariant to translation, two-sample tests may still be constructed, although the Bochner argument no longer applies. One class of such tests uses the N-distance or energy distance (Zinger et al., 1992; Székely, 2003; Baringhaus & Franz, 2004), which turns out to be an MMD-based test for a particular family of kernels (Sejdinovic et al., 2013).
2. Smoothed characteristic functions

In this section we introduce the smoothed characteristic function, and show that it distinguishes a substantially larger class of differences in distributions than the classical characteristic function, given that we are able to evaluate these differences only at a finite set of frequencies (and assuming the absence of prior knowledge of where best to place those frequencies).

The characteristic function of a random variable X is the inverse Fourier transform of its probability density function,

    ϕ_X(w) = E[exp(i w^T X)].    (1)

The smoothed characteristic function is a convolution of the characteristic function with a smoothing function f,

    φ_X(t) = ∫_{R^d} ϕ_X(w) f(w - t) dw,    (2)

for t ∈ R^d (we use w to denote the frequency in the characteristic function and t for the smoothed characteristic function). The smoothed characteristic function can be written as the expected value of a function of X, rather than as a convolution.

Proposition 1. Let f be an integrable function and Tf its Fourier transform. Then

    φ_X(t) = E[exp(i t^T X) Tf(X)].    (3)

All proofs are presented in the Appendix.

Advantages of smoothing. The above proposition has two useful implications, one computational and one theoretical. Computationally, the empirical smoothed characteristic function can be calculated in linear time, as discussed in Section 3. As for the theoretical advantage, we will show in Theorem 1 below that smoothed characteristic functions of two different random variables must differ almost everywhere (subject to a mild condition). By contrast, two distinct characteristic functions may nonetheless agree on a finite interval (Lukacs, 1972, p. 11); see also the example below Theorem 1. Before we get to our main result on smoothed characteristic functions, let us first identify a sufficient condition for the characteristic functions of X and Y to differ almost everywhere.
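As an illustration of the computational point, the empirical smoothed characteristic function of Proposition 1 can be computed in a single pass over the data. The following sketch is ours rather than the authors' code; the function name is hypothetical, and a Gaussian Tf (the s = 2 case mentioned in the Note after Theorem 1 below) is assumed for concreteness.

    import numpy as np

    def empirical_smoothed_cf(X, T, Tf=lambda x: np.exp(-0.5 * np.sum(x ** 2, axis=1))):
        """Empirical smoothed characteristic function of Proposition 1 (a sketch).

        X  : (n, d) array of observations.
        T  : (J, d) array of frequencies t_1, ..., t_J.
        Tf : Fourier transform of the smoothing function, evaluated at the data
             (a Gaussian by default; this particular choice is our assumption).

        Returns a length-J complex array with entries
        (1/n) * sum_i exp(i t_j^T X_i) Tf(X_i), at O(n J d) cost, linear in n.
        """
        weights = Tf(X)                    # (n,), computed once and reused for all frequencies
        phases = np.exp(1j * X @ T.T)      # (n, J) matrix of exp(i t_j^T X_i)
        return (phases * weights[:, None]).mean(axis=0)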

The following lemma combines several results from the literature and provides such a condition.

Lemma 1. Let M_{Z,k} = E‖Z‖^k. Suppose that M_{Z,k} is finite for each k, and that lim sup_{k→∞} (M_{Z,k}/k!)^{1/k} is bounded, for Z ∈ {X, Y}. Then, if X and Y have different distributions, the characteristic functions of X and Y differ almost everywhere.

We now proceed to our main result, which explains why smoothing the characteristic function is a good idea from the point of view of two-sample testing.

Theorem 1. Suppose that X and Y have (1 + ε)-integrable densities. Let the Fourier transform of the smoothing function be of the form Tf(x) = e^(-‖x‖^s) for some s > 1. Then X and Y have different distributions if and only if the smoothed characteristic functions of X and Y differ almost everywhere.

Note. Any function of the form e^(-‖x‖^s), for s > 1, has an inverse Fourier transform (Rudin, 1987, Chapter 9). For s = 2 the smoothing function is proportional to a Gaussian density.

Figure 1. Smooth vs non-smooth. The left plot presents the distance between empirical smoothed characteristic functions of two random variables X and Y, and the right plot presents the distance between empirical characteristic functions. The random variables used are illustrated in Figure 4; these are the grids of Gaussian distributions discussed in detail in Section 4. Note the difference between the maximal values of the distance: 14 for the characteristic function and 6 for the smoothed characteristic function.

We discuss implications of this theorem in Appendix C.

3. Proposed Test

We describe in detail the linear-time two-sample test that uses smoothed characteristic functions.

3.1. Difference in smoothed characteristic functions

Our test statistic is based on the differences between smoothed characteristic functions at randomly chosen frequencies. Since our goal is to test whether this vector of differences is all zeros, we make use of Hotelling's t-statistic. Let {Z_i}_{1≤i≤n} be a collection of n random d-dimensional vectors. Let µ_n denote the empirical mean of the sequence {Z_i}_{1≤i≤n}, and Σ_n the empirical covariance matrix. We assume that lim_{n→∞} Σ_n = Σ and lim_{n→∞} µ_n = µ, where the limits hold in probability. Hotelling's t-statistic is

    S_n = n µ_n^T Σ_n^{-1} µ_n.

The relevant asymptotic properties of Hotelling's t-statistic are as follows.

Proposition 2 (Asymptotic behavior of Hotelling's t-statistic). If E Z_i = 0, then under the usual assumptions of the multivariate CLT for independent random variables, the statistic S_n is asymptotically distributed as a χ² random variable with d degrees of freedom (as n → ∞ with d fixed). If E Z_i ≠ 0, then for any fixed r, P(S_n > r) → 1 as n → ∞.

We now apply the above proposition to obtain a statistical test. The empirical characteristic function, for observations {X_i}_{1≤i≤n}, is

    ϕ̂_X(w) = (1/n) Σ_{i=1}^n exp(i w^T X_i).

The empirical smoothed characteristic function, as described in Proposition 1, is

    φ̂_X(t) = (1/n) Σ_{i=1}^n exp(i t^T X_i) Tf(X_i).

Note that this formula is much more computationally efficient than the explicit convolution of f with the empirical characteristic function. In particular, the vector (Tf(X_i))_{1≤i≤n} can be reused for different frequencies.

Test 1 (Smoothed CF test). Let W_i be the difference between the empirical smoothed characteristic function terms at the paired datapoints X_i and Y_i, evaluated at the chosen frequencies t_k, i.e. W_i^k = φ̂_{X_i}(t_k) - φ̂_{Y_i}(t_k) for 1 ≤ k ≤ d. Defining Z_i := (Re(W_i^1), ..., Re(W_i^d), Im(W_i^1), ..., Im(W_i^d)), our test statistic is S_n = n µ_n^T Σ_n^{-1} µ_n, where µ_n is the empirical mean of the sequence Z_i.
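The statistic of Test 1 follows directly from these quantities. Below is a minimal sketch (our own illustration, with hypothetical names and the same Gaussian Tf assumption as above); the number of frequencies, written d in Test 1, appears here as J = T.shape[0], so each Z_i has dimension 2J.

    import numpy as np

    def smoothed_cf_statistic(X, Y, T, Tf=lambda x: np.exp(-0.5 * np.sum(x ** 2, axis=1))):
        """Hotelling-type statistic S_n of Test 1 (a sketch, not the authors' code).

        X, Y : (n, d) paired samples; T : (J, d) frequencies t_1, ..., t_J.
        """
        n = X.shape[0]
        Wx = np.exp(1j * X @ T.T) * Tf(X)[:, None]   # per-sample smoothed CF terms for X
        Wy = np.exp(1j * Y @ T.T) * Tf(Y)[:, None]   # ... and for Y
        W = Wx - Wy                                  # (n, J): the differences W_i^k
        Z = np.hstack([W.real, W.imag])              # (n, 2J): the vectors Z_i
        mu = Z.mean(axis=0)                          # empirical mean mu_n
        Sigma = np.cov(Z, rowvar=False)              # empirical covariance Sigma_n
        return n * mu @ np.linalg.solve(Sigma, mu)   # S_n; assumes Sigma_n is invertible
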
We choose a threshold r_α corresponding to the 1 - α quantile of the asymptotic χ² distribution of Proposition 2 under the null hypothesis (that X and Y have the same distribution), and reject the null whenever S_n is larger than r_α. There are a number of valid choices for the frequencies t_k at which we evaluate the differences in characteristic functions. In our case, we draw these t_k independently and identically from the Fourier transform of the kernel used in the MMD-based test (see the next section for details). In this manner, the Fourier features used by our test are identical to those used by the MMD-based test.
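Putting the pieces together, an end-to-end version of the test might look as follows. The frequencies t_k are drawn i.i.d. from the Fourier transform of a Gaussian kernel (which is itself a Gaussian, by Bochner's theorem), and S_n is compared with the 1 - α quantile of a χ² distribution whose degrees of freedom equal the dimension of Z_i, as in Proposition 2. The kernel choice, bandwidth, and number of frequencies below are placeholder assumptions of ours, not the tuned settings used in the experiments.

    import numpy as np
    from scipy.stats import chi2

    def smoothed_cf_test(X, Y, num_freqs=5, bandwidth=1.0, alpha=0.05, seed=0):
        """End-to-end sketch of the smoothed CF test (our illustration)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Frequencies drawn i.i.d. from the Fourier transform of a Gaussian kernel
        # with the given bandwidth; that spectral distribution is itself a Gaussian.
        T = rng.normal(scale=1.0 / bandwidth, size=(num_freqs, d))
        S_n = smoothed_cf_statistic(X, Y, T)          # statistic from the sketch above
        df = 2 * num_freqs                            # dimension of Z_i (Re and Im parts)
        r_alpha = chi2.ppf(1.0 - alpha, df)           # threshold at the 1 - alpha quantile
        return S_n, S_n > r_alpha                     # reject the null when S_n > r_alpha

As a sanity check, calling smoothed_cf_test on two large samples drawn from the same distribution should reject at a rate close to the nominal α.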

3.2. Other tests

The other tests considered in the Experiments section are the quadratic-time MMD test (Gretton et al., 2012a), the sub-quadratic-time MMD test (Zaremba et al., 2013), and the unsmoothed CF test (Epps & Singleton, 1986). See Appendix B for a detailed description.

4. Experiments

We compare the proposed method with the linear-time two-sample tests described in the previous section, i.e. the B-test and the CF test. Where computationally feasible, we also compare with the quadratic-time MMD test. We evaluate performance on four datasets: high dimensional random variables, grids of Gaussians, amplitude modulated audio signals, and features in the Higgs dataset. To simplify the parameter selection procedure, we set all bandwidths to one (including the width of the smoothing window for the smoothed CF test), and directly scaled the data instead (this distinction is irrelevant for the kernel and unsmoothed tests, but the smoothing window may be suboptimal for our smoothed test). The data scaling was chosen so as to maximize test power on a held-out training set. The full details are described in the Appendix. Note that we did not use the popular median heuristic for kernel bandwidth choice (MMD and B-test), since it gives poor results for the Blobs and AM Audio datasets (Gretton et al., 2012b). The original CF test was proposed without any parameters, but we used the same data scaling (equivalently, kernel width) parameter as in the other tests to ensure a fair comparison.

Simulation 1: High Dimensions. It has recently been shown, in theory and in practice, that the two-sample problem gets more difficult as the number of dimensions on which the distributions do not differ increases (Ramdas et al., 2015; Reddi et al., 2015). As a corollary, for the MMD test, the authors show that the number of samples needs to grow at some rate with the dimensionality of the random variables in order to achieve high power. In the following experiments, we study the power of the two-sample tests as a function of the dimension of the random vector. In both experiments we compare Gaussian random vectors which differ only in the first dimension, i.e.,

    Dataset I:  X ~ N(0_d, I_d),  Y ~ N((1, 0, ..., 0), I_d);
    Dataset II: X ~ N(0_d, I_d),  Y ~ N(0_d, diag(2, 1, ..., 1)),

where 0_d is a d-dimensional vector of zeros, I_d is the d-dimensional identity matrix, and diag creates a diagonal matrix out of a vector. In Dataset I the means differ, and in Dataset II the variances differ. The power of the different two-sample tests is presented in Figure 2. The Smoothed CF test yields the best performance for differences in variances, and performance equal to the unsmoothed CF test for differences in means.

Figure 2. Power comparison for different tests on high dimensional data. Left: dataset II (difference in variances). Right: dataset I (difference in means). The pale envelope indicates 95% confidence intervals for the results, as averaged over repeated runs.

Simulation 2: Blobs. The Blobs dataset is a grid of two-dimensional Gaussian distributions (see Figure 3), which is known to be a challenging two-sample testing task. The difficulty arises from the fact that the difference in distributions is encoded at a much smaller lengthscale than the overall data. In this experiment both X and Y were drawn from a five-by-five grid of Gaussians, where X had unit covariance matrix in each mixture component, while each component of Y had a non-unit covariance matrix; a sampler for these synthetic benchmarks is sketched below.
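For concreteness, the synthetic benchmarks above can be generated as in the sketch below (ours, with hypothetical function names). The grid size, the π/4 rotation, and the standard deviation of 2 are taken from the text and the Figure 3 caption; the grid spacing and the remaining defaults are our own assumptions.

    import numpy as np

    def sample_dataset_I(n, d, rng):
        """Dataset I: X ~ N(0_d, I_d), Y ~ N((1, 0, ..., 0), I_d)."""
        X = rng.standard_normal((n, d))
        Y = rng.standard_normal((n, d))
        Y[:, 0] += 1.0                       # shift the mean of the first coordinate
        return X, Y

    def sample_dataset_II(n, d, rng):
        """Dataset II: X ~ N(0_d, I_d), Y ~ N(0_d, diag(2, 1, ..., 1))."""
        X = rng.standard_normal((n, d))
        Y = rng.standard_normal((n, d))
        Y[:, 0] *= np.sqrt(2.0)              # variance 2 in the first coordinate only
        return X, Y

    def sample_blobs(n, rng, grid=5, spacing=10.0, stretch=2.0, angle=np.pi / 4):
        """Blobs: grid x grid mixture of 2-D Gaussians (spacing is an assumption)."""
        centres_x = rng.integers(0, grid, size=(n, 2)) * spacing
        centres_y = rng.integers(0, grid, size=(n, 2)) * spacing
        X = centres_x + rng.standard_normal((n, 2))          # unit-covariance components
        R = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
        L = R @ np.diag([stretch, 1.0])                       # square root of rotated covariance
        Y = centres_y + rng.standard_normal((n, 2)) @ L.T     # std 2 along the rotated axis
        return X, Y

For example, X, Y = sample_blobs(1000, np.random.default_rng(0)) produces one realization of the Blobs problem.
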
It was shown by Gretton et al. (2012b) that a good choice of kernel is crucial for this task: we used the procedure outlined in the Appendix. Figure 3 presents the results of the various two-sample tests on the Blobs dataset. The full MMD test has the best power as a function of sample size, but a much worse power/execution-time tradeoff than the CF-based tests. The Smoothed CF test has the best power as a function of the sample size among the linear- and sub-quadratic-time tests.

Figure 3. Power and execution time of different two-sample tests on the Blobs dataset. Left: power as a function of the sample size. Center: power vs. execution time. All plotted results are averaged over repeated runs. Right: illustration of the Blobs dataset. Each mixture component in the upper plot is a standard Gaussian, whereas those in the lower plot have the direction of the largest variance rotated by π/4 and amplified so that the standard deviation in this direction is 2.

Real Data 1: Higgs dataset. The next experiment we consider is on the UCI Higgs dataset (Lichman, 2013) described in Baldi et al. (2014); the task is to distinguish signatures of processes which produce Higgs bosons from background processes which do not. We consider a two-sample test on certain extremely low-signal low-level features in the dataset: kinematic properties measured by the particle detectors, i.e., the joint distributions of the azimuthal angular momenta ϕ for four particle jets. We denote by P the jet ϕ-momenta distribution of the background process, and by Q that of the process that produces Higgs bosons (both are distributions on R^4). As discussed in Baldi et al. (2014, Fig. 2), such angular momenta, unlike transverse momenta p_T, individually carry very little discriminating information for signal vs. background benchmark events. Therefore, we would like to test the null hypothesis that the distributions of angular momenta, P and Q, are the same. The results for the different algorithms are presented in Figure 4. We observe that the joint distribution of the angular momenta is in fact a discriminative feature. Clearly, the Smoothed CF test has significantly higher power than the other two tests, both as a function of sample size and of execution time.

Figure 4. Power for two different two-sample tests based on the four explanatory variables from the Higgs dataset. Left: power of the test as a function of sample size. Right: power of the test as a function of execution time. Results are averaged over repeated runs (zooming in will display the extremely thin error envelopes).

Real Data 2: Amplitude Modulated Music. Amplitude modulation is the earliest technique used to transmit voice over the radio. In the following experiment, X and Y were one-thousand-dimensional samples of carrier signals that were modulated with two different input audio signals from the same album, song X and song Y (further details of these data are described by Gretton et al., 2012b, Section 5). We conducted two-sample tests using ten thousand samples from each signal. To increase the difficulty of the testing problem, independent Gaussian noise of increasing variance was added to the signals. The results are presented in Figure 5. The B-test is competitive with the smoothed and unsmoothed CF tests for low noise levels, but both the B-test and the unsmoothed CF test are less powerful than our test for medium to high noise levels.

Figure 5. Left: Power comparison on the music dataset. The pale envelope around the solid lines represents 95% confidence intervals for the results, as averaged over repeated runs. Right: four different realizations of the X and Y variables.

Type I error simulations. In Figure 6, we present the Type I error of the tests on the Blobs and high-dimensional datasets.

Figure 6. Type I error on the Blobs dataset (left) and the high-dimensional dataset (right). The dashed lines denote the 99% confidence interval for the Type I error.

References

Alba Fernández, V., Jiménez-Gamero, M., and Muñoz Garcia, J. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis, 52, 2008.

Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.

Baringhaus, L. and Franz, C. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1), 2004.

Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.-P., Schölkopf, B., and Smola, A. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.

Cuesta-Albertos, J.A., Fraiman, R., and Ransford, T. A sharp form of the Cramér-Wold theorem. Journal of Theoretical Probability, 20(2), 2007.

Epps, T.W. and Pulley, L.B. A test for normality based on the empirical characteristic function. Biometrika, 70(3), 1983.

Epps, T.W. and Singleton, K.J. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177-203, 1986.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems, 2009.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012a.

Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., and Fukumizu, K. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, 2012b.

Hall, P. and Welsh, A.H. A test for normality based on the empirical characteristic function. Biometrika, 70(2), 1983.

Harchaoui, Z., Bach, F.R., and Moulines, E. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems, 2008.

Heathcote, C.R. A test of goodness of fit for symmetric random variables. Australian Journal of Statistics, 14(2), 1972.

Heathcote, C.R. The integrated squared error estimation of parameters. Biometrika, 64(2), 1977.

Ho, H.-C. and Shieh, G. Two-stage U-statistics for hypothesis testing. Scandinavian Journal of Statistics, 33(4), 2006.

Le, Q., Sarlos, T., and Smola, A. Fastfood - computing Hilbert space expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, JMLR W&CP, volume 28, 2013.

Lichman, M. UCI machine learning repository, 2013.

Lloyd, J.R. and Ghahramani, Z. Statistical model criticism using kernel two sample tests. Technical report, 2014.

Lukacs, E. A survey of the theory of characteristic functions. Advances in Applied Probability, 1972.

Lukacs, E. and Szasz, O. On analytic characteristic functions. Pacific Journal of Mathematics.

Pevnỳ, T. and Fridrich, J. Benchmarking for steganography. In Information Hiding. Springer, 2008.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.

Ramdas, A., Reddi, S., Póczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel- and distance-based nonparametric hypothesis tests in high dimensions. In 29th AAAI Conference on Artificial Intelligence, 2015.

Reddi, S., Ramdas, A., Póczos, B., Singh, A., and Wasserman, L. On the high-dimensional power of linear-time kernel two-sample testing under mean-difference alternatives. In 18th International Conference on Artificial Intelligence and Statistics, 2015.

Rudin, W. Real and Complex Analysis. Tata McGraw-Hill Education, 1987.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263-2291, 2013.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517-1561, 2010.

Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389-2410, 2011.

Székely, G.J. E-statistics: The energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report, 2003.

Zaremba, W., Gretton, A., and Blaschko, M. B-test: A nonparametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, 2013.

Zinger, A.A., Kakosyan, A.V., and Klebanov, L.B. A characterization of distributions by mean values of statistics and certain probabilistic metrics. Journal of Mathematical Sciences, 59(4), 1992.
