Learning features to compare distributions

1 Learning features to compare distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS 2016 Workshop on Adversarial Learning, Barcelona, Spain 1/28

2 Goal of this talk Have: Two collections of samples X, Y from unknown distributions P and Q. Goal: Learn distinguishing features that indicate how P and Q differ. 2/28

4-7 Divergences [Figure-only slides.] 3/28-6/28

8 Divergences Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012) 7/28

9 Overview The maximum mean discrepancy: how to compute and interpret the MMD; how to train the MMD; application to troubleshooting GANs. The ME test statistic: informative, linear-time features for comparing distributions; how to learn these features. TL;DR: Variance matters. 8/28

10 The maximum mean discrepancy Are P and Q different? [Figure: densities P(x) and Q(y).]

11 Maximum mean discrepancy (on sample) 10/28

12 Maximum mean discrepancy (on sample) Observe X = \{x_1, \dots, x_n\} \sim P; observe Y = \{y_1, \dots, y_n\} \sim Q. 10/28

13 Maximum mean discrepancy (on sample) Gaussian kernel on each x_i; Gaussian kernel on each y_i. 10/28

14 Maximum mean discrepancy (on sample) \hat{\mu}_P(v): mean embedding of P; \hat{\mu}_Q(v): mean embedding of Q. \hat{\mu}_P(v) := \frac{1}{n}\sum_{i=1}^{n} k(x_i, v). 10/28

15 Maximum mean discrepancy (on sample) \hat{\mu}_P(v): mean embedding of P; \hat{\mu}_Q(v): mean embedding of Q. witness(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v). 10/28
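A minimal numerical sketch of these empirical mean embeddings and the witness function, assuming NumPy, one-dimensional samples, and a Gaussian kernel with a hand-picked bandwidth sigma (the example distributions are arbitrary and this is not the talk's code):

import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    # k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D inputs
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def witness(v, X, Y, sigma=1.0):
    # witness(v) = mu_P_hat(v) - mu_Q_hat(v)
    mu_P = gauss_kernel(X, v, sigma).mean(axis=0)  # (1/n) sum_i k(x_i, v)
    mu_Q = gauss_kernel(Y, v, sigma).mean(axis=0)  # (1/n) sum_i k(y_i, v)
    return mu_P - mu_Q

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200)   # samples from P
Y = rng.normal(0.5, 1.0, size=200)   # samples from Q
print(witness(np.linspace(-4, 4, 9), X, Y))  # positive where P has more mass than Q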

16 Maximum mean discrepancy (on sample) \widehat{MMD}^2 = \|\mathrm{witness}\|_{\mathcal{F}}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(y_i, y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j). 11/28
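A sketch of this unbiased estimator with a Gaussian kernel for d-dimensional samples; NumPy only, and the bandwidth sigma is assumed rather than learned:

import numpy as np

def sq_dists(A, B):
    # squared Euclidean distances between all rows of A and all rows of B
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def mmd2_unbiased(X, Y, sigma=1.0):
    k = lambda A, B: np.exp(-sq_dists(A, B) / (2 * sigma ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # the i != j sums drop the diagonal of the within-sample Gram matrices
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))   # P
Y = rng.normal(0.3, 1.0, size=(300, 2))   # Q
print(mmd2_unbiased(X, Y))   # near zero when P = Q, positive when P differs from Q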

17 Overview Dogs (= P) and fish (= Q) example revisited. Each entry is one of k(\mathrm{dog}_i, \mathrm{dog}_j), k(\mathrm{dog}_i, \mathrm{fish}_j), or k(\mathrm{fish}_i, \mathrm{fish}_j). 12/28

18 Overview The maximum mean discrepancy: \widehat{MMD}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2}\sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j). 13/28

19 Asymptotics of MMD The MMD: \widehat{MMD}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(y_i, y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j)... but how to choose the kernel? 14/28

20 Asymptotics of MMD The MMD (estimator as on the previous slide)... but how to choose the kernel? Perspective from statistical hypothesis testing: when P = Q, \widehat{MMD}^2 is close to zero; when P \neq Q, \widehat{MMD}^2 is far from zero. A threshold c_\alpha on \widehat{MMD}^2 gives false positive rate \alpha. 14/28
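In practice the threshold c_\alpha is often obtained by permutation. The sketch below is generic: it takes any two-sample statistic function (for example the mmd2_unbiased sketch above) and is an illustration rather than the talk's procedure; alpha, n_perm and the seed are arbitrary defaults.

import numpy as np

def permutation_threshold(stat_fn, X, Y, alpha=0.05, n_perm=200, seed=0):
    # simulate the null P = Q by shuffling the pooled sample and re-splitting it
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y], axis=0)
    n = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        null_stats.append(stat_fn(Z[idx[:n]], Z[idx[n:]]))
    return np.quantile(null_stats, 1.0 - alpha)   # (1 - alpha) quantile under H0

# usage: reject H0 when stat_fn(X, Y) > permutation_threshold(stat_fn, X, Y),
# which gives a false positive rate of approximately alpha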

21-23 A statistical test [Figure: distribution of n\,\widehat{MMD}^2 when P = Q and when P \neq Q; c_\alpha is the (1-\alpha) quantile when P = Q, and the mass of the P \neq Q distribution below c_\alpha gives the false negatives.] Best kernel gives lowest false negative rate (= highest power)... but can you train for this? 15/28

24 Asymptotics of MMD When P \neq Q, the statistic is asymptotically normal: \frac{\widehat{MMD}^2 - MMD^2(P,Q)}{\sqrt{V_n(P,Q)}} \xrightarrow{D} \mathcal{N}(0,1), where MMD^2(P,Q) is the population MMD and V_n(P,Q) = O(n^{-1}) is the variance of the estimator. [Figure: empirical PDF of the MMD under H1 with a Gaussian fit.] 16/28

25 Asymptotics of MMD When P = Q, the statistic has asymptotic distribution n\,\widehat{MMD}^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right], where z_l \sim \mathcal{N}(0,2) i.i.d. and \lambda_i \phi_i(x') = \int \tilde{k}(x, x')\, \phi_i(x)\, dP(x), with \tilde{k} the centred kernel. [Figure: empirical PDF of n\,\widehat{MMD}^2 under H0 with the \chi^2-sum fit.] 17/28
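One rough way to visualise this null distribution is to replace the population eigenvalues \lambda_l with the eigenvalues of the centred Gram matrix over the pooled sample (divided by the pooled sample size) and simulate the sum. A sketch under those assumptions, with an arbitrary bandwidth and sample sizes, not the production test:

import numpy as np

def simulate_null(X, Y, sigma=1.0, n_sim=2000, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])[:, None]                 # pooled 1-D sample
    K = np.exp(-((Z - Z.T) ** 2) / (2 * sigma ** 2))
    n = len(Z)
    H = np.eye(n) - np.ones((n, n)) / n                 # centring matrix
    lam = np.linalg.eigvalsh(H @ K @ H) / n             # plug-in estimates of lambda_l
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_sim, n))  # z_l ~ N(0, 2), i.i.d.
    return (lam * (z ** 2 - 2.0)).sum(axis=1)           # draws of the limit of n * MMD^2

rng = np.random.default_rng(1)
draws = simulate_null(rng.normal(0, 1, 100), rng.normal(0, 1, 100))
print(np.quantile(draws, 0.95))   # an approximate threshold c_alpha at alpha = 0.05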

26 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right). 18/28

27 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right) \to 1 - \Phi\left( \frac{c_\alpha}{n \sqrt{V_n(P,Q)}} - \frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}} \right), where \Phi is the CDF of the standard normal distribution and \hat{c}_\alpha is an estimate of the test threshold c_\alpha. 18/28

28 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right) \to 1 - \Phi\left( \underbrace{\frac{c_\alpha}{n \sqrt{V_n(P,Q)}}}_{O(n^{-1/2})} - \underbrace{\frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}}}_{O(n^{1/2})} \right). First term asymptotically negligible! 18/28

29 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right) \to 1 - \Phi\left( \frac{c_\alpha}{n \sqrt{V_n(P,Q)}} - \frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}} \right). To maximize test power, maximize \frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}}. (Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., in review for ICLR 2017) Code: github.com/dougalsutherland/opt-mmd 18/28
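A toy version of this criterion, selecting a Gaussian-kernel bandwidth by the ratio of the MMD estimate to a crude standard-deviation proxy. The exact variance estimator of Sutherland et al. is more careful, so treat this purely as a sketch; equal sample sizes, the candidate bandwidths, and the example data are all assumptions.

import numpy as np

def mmd2_ratio(X, Y, sigma, eps=1e-8):
    # requires len(X) == len(Y)
    n = len(X)
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = lambda A, B: np.exp(-d2(A, B) / (2 * sigma ** 2))
    Kxy = k(X, Y)
    H = k(X, X) + k(Y, Y) - Kxy - Kxy.T          # U-statistic kernel h(z_i, z_j)
    np.fill_diagonal(H, 0.0)
    mmd2 = H.sum() / (n * (n - 1))
    row_means = H.sum(axis=1) / (n - 1)          # per-sample conditional expectations
    var_proxy = 4.0 * row_means.var() / n + eps  # rough variance of the U-statistic
    return mmd2 / np.sqrt(var_proxy)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(0.2, 1.0, size=(500, 1))
sigmas = [0.25, 0.5, 1.0, 2.0, 4.0]
print(max(sigmas, key=lambda s: mmd2_ratio(X, Y, s)))   # bandwidth with the largest power proxy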

30 Troubleshooting for generative adversarial networks MNIST samples Samples from a GAN 19/28

31 Troubleshooting for generative adversarial networks MNIST samples, ARD map, samples from a GAN. Power for optimized ARD kernel: 1.00 at \alpha = 0.01. Power for optimized RBF kernel: 0.57 at \alpha = 0.01. 19/28
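The ARD map on this slide comes from a kernel with one lengthscale per pixel. A minimal sketch of such an ARD Gaussian kernel; the lengthscales are fixed here for illustration, whereas in the experiments they would presumably be learned by maximizing the power criterion above.

import numpy as np

def ard_kernel(A, B, lengthscales):
    # k(a, b) = exp(-0.5 * sum_d ((a_d - b_d) / lengthscale_d)^2)
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return np.exp(-0.5 * (diff ** 2).sum(-1))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 784)), rng.normal(size=(5, 784))
print(ard_kernel(A, B, lengthscales=np.ones(784)).shape)   # (4, 5)
# a small lengthscale makes its pixel dominate the distance, so the learned
# lengthscale map highlights the pixels where GAN samples differ from MNIST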

32 Benchmarking generative adversarial networks 20/28

33 The ME statistic and test 21/28

34 Distinguishing Feature(s) \hat{\mu}_P(v): mean embedding of P; \hat{\mu}_Q(v): mean embedding of Q. witness(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v). 22/28

35 Distinguishing Feature(s) witness^2(v): take the square of the witness (only its amplitude matters). 23/28

36 Distinguishing Feature(s) New test statistic: witness^2 at a single v; linear time in the number n of samples... but how to choose the best feature v? 23/28

37 Distinguishing Feature(s) Best feature = the v that maximizes witness^2(v)?? 23/28
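A sketch of this naive strategy for a single one-dimensional feature: evaluate witness^2(v) on a grid of candidate locations (linear time in n for each v) and take the argmax. The bandwidth, the grid, and the example distributions are arbitrary choices.

import numpy as np

def witness_sq(v, X, Y, sigma=1.0):
    fx = np.exp(-((X - v) ** 2) / (2 * sigma ** 2)).mean()   # mu_P_hat(v)
    fy = np.exp(-((Y - v) ** 2) / (2 * sigma ** 2)).mean()   # mu_Q_hat(v)
    return (fx - fy) ** 2

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)
Y = rng.normal(1.0, 1.0, 500)
grid = np.linspace(-3.0, 4.0, 71)
v_best = grid[np.argmax([witness_sq(v, X, Y) for v in grid])]
print(v_best)   # as the following slides show, ignoring the variance makes this choice noisy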

38 Distinguishing Feature(s) witness^2(v), sample size n = 3. 24/28

39 Distinguishing Feature(s) Sample size n = 50 24/28

40 Distinguishing Feature(s) Sample size n = /28

41 Distinguishing Feature(s) P(x), Q(y), witness^2(v): the population witness^2 function. 24/28

42 Distinguishing Feature(s) P(x), Q(y), witness^2(v), with the maximising locations v* marked. 24/28

43-48 Variance of witness function Variance at v = variance of X at v + variance of Y at v. ME statistic: \hat{\lambda}_n(v) := n \cdot \frac{\mathrm{witness}^2(v)}{\mathrm{variance\ at\ }v}. [Figure builds: witness^2(v) overlaid in turn with variance_X(v), variance_Y(v), and their sum.] 25/28

49 Variance of witness function ME statistic: \hat{\lambda}_n(v) := n \cdot \frac{\mathrm{witness}^2(v)}{\mathrm{variance\ at\ }v}. Best location is the v^* that maximizes \hat{\lambda}_n(v). Improve performance using multiple locations \{v_j\}_{j=1}^{J}. 25/28
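A sketch of this normalised statistic for a single one-dimensional location, dividing the squared witness by the empirical variance of the features under X and Y; purely illustrative (the slides' test generalises this to several locations v_j), with an assumed bandwidth and example data.

import numpy as np

def me_statistic(v, X, Y, sigma=1.0, eps=1e-8):
    fx = np.exp(-((X - v) ** 2) / (2 * sigma ** 2))   # features k(x_i, v)
    fy = np.exp(-((Y - v) ** 2) / (2 * sigma ** 2))   # features k(y_i, v)
    witness = fx.mean() - fy.mean()
    variance = fx.var(ddof=1) + fy.var(ddof=1)        # variance of X at v + variance of Y at v
    return len(X) * witness ** 2 / (variance + eps)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)
Y = rng.normal(1.0, 1.0, 500)
grid = np.linspace(-3.0, 4.0, 71)
print(grid[np.argmax([me_statistic(v, X, Y) for v in grid])])   # v* balancing signal and variance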

50 Distinguishing Positive/Negative Emotions + : happy, neutral, surprised; - : afraid, angry, disgusted. 35 females and 35 males (Lundqvist et al., 1998). 1632 dimensions, pixel features. Sample size: 402. The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines. 26/28

51-53 Distinguishing Positive/Negative Emotions [Table builds: test power on the + vs. - problem for a random-feature baseline, the proposed test, and the quadratic-time MMD.] The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines. 26/28

54-55 Distinguishing Positive/Negative Emotions + : happy, neutral, surprised; - : afraid, angry, disgusted. Learned feature. The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines. Code: 26/28

56 Final thoughts Witness function approaches: diversity of samples (the MMD test uses pairwise similarities between all samples; the ME test uses similarities to J reference features); disjoint support of generator/data distributions (the witness function is smooth). Other discriminator heuristics: diversity of samples via the minibatch heuristic (add distances to neighbouring samples as features), Salimans et al. (2016); disjoint support treated by adding noise to blur images, Arjovsky and Bottou (2016), Sønderby et al. (2016). 27/28

57 Co-authors Students and postdocs: Kacper Chwialkowski (at Voleon) Wittawat Jitkrittum Heiko Strathmann Dougal Sutherland Collaborators Kenji Fukumizu Krikamol Muandet Bernhard Schoelkopf Bharath Sriperumbudur Zoltan Szabo Questions? 28/28

58 Testing against a probabilistic model 29/28

59 Statistical model criticism MMD(P,Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}} = \sup_{\|f\|_{\mathcal{F}} \le 1} [E_Q f - E_P f]; f^*(x) is the witness function. Can we compute the MMD with samples from Q and a model P? Problem: usually can't compute E_P f in closed form. [Figure: p(x), q(x), and the witness f^*(x).] 30/28

60 Stein idea To get rid of E_P f in \sup_{\|f\|_{\mathcal{F}} \le 1} [E_Q f - E_P f], we define the Stein operator [T_p f](x) = \partial_x f(x) + f(x)\, \partial_x \log p(x). Then E_P\, T_P f = 0 subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016) 31/28
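A quick Monte Carlo check of this identity for p = N(0,1), so that \partial_x \log p(x) = -x, with an arbitrary bounded test function f = tanh:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1_000_000)   # samples from p = N(0, 1)
f = np.tanh(x)                        # a smooth, bounded test function
df = 1.0 - np.tanh(x) ** 2            # f'(x)
score = -x                            # d/dx log p(x) for the standard normal
print(np.mean(df + f * score))        # Stein identity: approximately zero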

61-63 Maximum Stein Discrepancy Stein operator: [T_p f](x) = \partial_x f(x) + f(x)\, \partial_x \log p(x). Maximum Stein Discrepancy (MSD): \mathrm{MSD}(p, q, \mathcal{F}) = \sup_{\|g\|_{\mathcal{F}} \le 1} \left[ E_q T_p g - E_p T_p g \right] = \sup_{\|g\|_{\mathcal{F}} \le 1} E_q T_p g. 32/28

64-65 Maximum Stein Discrepancy [Figure builds: p(x), q(x), and the Stein witness g^*(x).] 32/28

66 Maximum Stein discrepancy Closed-form expression for the MSD: given Z, Z' \sim q, \mathrm{MSD}(p, q, \mathcal{F}) = E_q\, h_p(Z, Z'), where k is the RKHS kernel for \mathcal{F} and h_p(x, y) := \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y) + \partial_y \log p(y)\, \partial_x k(x, y) + \partial_x \log p(x)\, \partial_y k(x, y) + \partial_x \partial_y k(x, y). (Chwialkowski, Strathmann, G., 2016; Liu, Lee, Jordan, 2016) Only depends on the kernel and on \partial_x \log p(x): no need to normalize p, or to sample from it. 33/28
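A sketch of the closed form above in one dimension with a Gaussian kernel, averaged as a V-statistic over samples from q; only the score d/dx log p is needed, so p may be unnormalised. The N(0,1) model, the bandwidth, and the samples from q are illustrative assumptions.

import numpy as np

def stein_stat(Z, score_p, sigma=1.0):
    x, y = Z[:, None], Z[None, :]
    sx, sy = score_p(Z)[:, None], score_p(Z)[None, :]
    K = np.exp(-((x - y) ** 2) / (2 * sigma ** 2))
    dKdx = -(x - y) / sigma ** 2 * K                              # d/dx k(x, y)
    dKdy = (x - y) / sigma ** 2 * K                               # d/dy k(x, y)
    dKdxdy = (1.0 / sigma ** 2 - (x - y) ** 2 / sigma ** 4) * K   # d^2/(dx dy) k(x, y)
    H = sx * sy * K + sy * dKdx + sx * dKdy + dKdxdy              # h_p(x, y)
    return H.mean()

rng = np.random.default_rng(0)
Z = rng.normal(0.5, 1.0, 1000)                  # samples from q
print(stein_stat(Z, score_p=lambda z: -z))      # score of p = N(0, 1); near 0 only if q matches p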

67 Statistical model criticism Test the hypothesis that a Gaussian process model, learned from the data, is a good fit for the test data (example from Lloyd and Ghahramani, 2015). [Figure: normalised solar activity vs. year.] Code: 34/28

68 Statistical model criticism [Figure: histogram of the bootstrapped statistic B_n, with the observed V_n test statistic marked.] Test the hypothesis that a Gaussian process model, learned from the data, is a good fit for the test data. 35/28
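A sketch of a multiplier (wild) bootstrap null in the spirit of the bootstrapped B_n on this slide, applied to a matrix H with entries h_p(z_i, z_j) as computed inside the previous sketch. Plain Rademacher signs are used, which is the i.i.d. case; a wild bootstrap with correlated weights would be needed for dependent data.

import numpy as np

def bootstrap_pvalue(H, n_boot=500, seed=0):
    # H[i, j] = h_p(z_i, z_j); the observed statistic is its mean
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    observed = H.mean()
    null = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=n)              # mean-zero multipliers
        null[b] = (w[:, None] * w[None, :] * H).mean()   # bootstrap replicate B_n
    return (np.sum(null >= observed) + 1.0) / (n_boot + 1.0)

# usage: reject the model p when bootstrap_pvalue(H) < alpha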
