Learning Interpretable Features to Compare Distributions
Transcription
1 Learning Interpretable Features to Compare Distributions. Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. Theory of Big Data.
2–3 Goal of this talk. Given: two collections of samples from unknown distributions $P$ and $Q$. Goal: learn distinguishing features that indicate how $P$ and $Q$ differ.
4 Goal of this talk. Have: two collections of samples $X, Y$ from unknown distributions $P$ and $Q$. Goal: learn distinguishing features that indicate how $P$ and $Q$ differ. [Figures: MNIST samples; samples from a GAN.] Is there a significant difference between the GAN and MNIST?
5–9 Divergences. [Figure-only slides illustrating divergence measures between distributions.] Sriperumbudur, Fukumizu, G., Schoelkopf, Lanckriet (2012).
10 Overview. The maximum mean discrepancy: how to compute and interpret the MMD; training the MMD to maximize test power; application to troubleshooting GANs. The ME test statistic: informative, linear-time features for comparing distributions, and how to learn these features. The maximum Stein discrepancy: a divergence between a sample and a model, needing the model only up to a normalizing constant.
11 Feature mean difference. Simple example: two Gaussians with different means. Answer: the t-test. [Figure: densities of two Gaussians with different means, $P_X$ and $Q_X$.]
12–13 Feature mean difference. Two Gaussians with the same means but different variances. Idea: look at the difference in means of features of the random variables; in the Gaussian case, second-order features of the form $\varphi(x) = x^2$. [Figures: densities of two Gaussians with different variances, and the densities of the feature $X^2$ under each.]
14 Feature mean difference. Gaussian and Laplace distributions with the same mean and the same variance: detecting a difference in means requires higher-order features... which leads to the RKHS. [Figure: Gaussian and Laplace densities.]
15–16 The maximum mean discrepancy. The reproducing property (kernel trick): given $x \in \mathcal X$ for some set $\mathcal X$, define a feature map $\varphi(x) \in \mathcal F$, $\varphi(x) = [\ldots \varphi_i(x) \ldots] \in \ell^2$. For a positive definite kernel $k(x, x')$, $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal F}$, e.g. $k(x, x') = \exp(-\|x - x'\|^2)$. The mean trick: given a Borel probability measure $P$ on $\mathcal X$, define the feature map $\mu_P \in \mathcal F$, $\mu_P = [\ldots \mathbb E_P[\varphi_i(X)] \ldots]$. For a positive definite $k(x, x')$, $\mathbb E_{P,Q}\, k(X, Y) = \langle \mu_P, \mu_Q \rangle_{\mathcal F}$ for $X \sim P$ and $Y \sim Q$.
17–18 The maximum mean discrepancy. The maximum mean discrepancy is the distance between feature means:
$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|^2_{\mathcal F} = \underbrace{\langle \mu_P, \mu_P \rangle_{\mathcal F} + \langle \mu_Q, \mu_Q \rangle_{\mathcal F}}_{(a)} - \underbrace{2 \langle \mu_P, \mu_Q \rangle_{\mathcal F}}_{(b)} = \underbrace{\mathbb E_P\, k(X, X') + \mathbb E_Q\, k(Y, Y')}_{(a)} - \underbrace{2\, \mathbb E_{P,Q}\, k(X, Y)}_{(b)}$,
where (a) = within-distribution similarity and (b) = cross-distribution similarity. Unbiased empirical estimate of the first term (quadratic time): $\widehat{\mathbb E}_P\, k(x, x') = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x_i, x_j)$.
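To make the estimator concrete, here is a minimal NumPy sketch of the quadratic-time unbiased $\widehat{\mathrm{MMD}}^2$; the Gaussian-kernel bandwidth and the helper names (`gaussian_kernel`, `mmd2_unbiased`) are illustrative, not from the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Quadratic-time unbiased estimate of MMD^2(P, Q) from samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Omit the diagonal k(x_i, x_i) terms, as in the sum over j != i above.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

For samples from the same distribution the estimate fluctuates around zero (and can go slightly negative, being unbiased); for different distributions it concentrates on the positive population value.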
19 Illustration of MMD. Dogs ($= P$) and fish ($= Q$) example revisited. Each entry of the Gram matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.
20 Illustration of MMD. The maximum mean discrepancy:
$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$.
21 Witness function interpretation of MMD. Are $P$ and $Q$ different? [Figure: $P(x)$ and $Q(y)$.]
22–26 Witness function interpretation of MMD. Observe $X = \{x_1, \ldots, x_n\} \sim P$ and $Y = \{y_1, \ldots, y_n\} \sim Q$. Place a Gaussian kernel on each $x_i$ and each $y_i$. The empirical mean embeddings are $\hat\mu_P(v) := \frac{1}{m} \sum_{i=1}^m k(x_i, v)$ (the mean embedding of $\widehat P$) and likewise $\hat\mu_Q(v)$ for $\widehat Q$; the witness function is $\mathrm{witness}(v) = \hat\mu_P(v) - \hat\mu_Q(v)$.
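A minimal sketch of the empirical witness on a 1-D grid, matching the definition above; the example distributions (the Gauss-vs-Laplace pair from slide 14) and the bandwidth are illustrative.

```python
import numpy as np

def witness(v, X, Y, sigma=1.0):
    """witness(v) = mu_hat_P(v) - mu_hat_Q(v), with a Gaussian kernel, for 1-D samples."""
    def mean_embedding(S):
        # (1/m) * sum_i k(s_i, v), evaluated at every grid point v
        return np.exp(-(S[:, None] - v[None, :])**2 / (2 * sigma**2)).mean(axis=0)
    return mean_embedding(X) - mean_embedding(Y)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)                # P: Gaussian
Y = rng.laplace(0.0, 1.0 / np.sqrt(2), size=500)  # Q: Laplace, same mean and variance
v = np.linspace(-4.0, 4.0, 201)
w = witness(v, X, Y)  # positive where P has more mass than Q, negative where Q has more
```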
27–28 Function showing difference in distributions. Maximum mean discrepancy: a smooth function witnessing the difference between $P$ and $Q$: $\mathrm{MMD}(P, Q; \mathcal F) := \sup_{f \in \mathcal F} [\mathbb E_P f(X) - \mathbb E_Q f(Y)]$. $\mathrm{MMD}(P, Q; \mathcal F) = 0$ iff $P = Q$ when $\mathcal F$ is the unit ball in a characteristic RKHS. Example revisited: Laplace $P$ vs. Gauss $Q$. [Figure: witness $f$ for the Gaussian and Laplace densities.]
29–30 Asymptotics of MMD. The MMD estimate: $\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$ — but how to choose the kernel? Perspective from statistical hypothesis testing: when $P = Q$, $\widehat{\mathrm{MMD}}^2$ is close to zero; when $P \neq Q$, $\widehat{\mathrm{MMD}}^2$ is far from zero. A threshold $c_\alpha$ on $\widehat{\mathrm{MMD}}^2$ gives false positive rate $\alpha$.
31–33 A statistical test. [Figure: densities of $n\,\widehat{\mathrm{MMD}}^2$ under $P = Q$ and under $P \neq Q$; the threshold $c_\alpha$ is the $1 - \alpha$ quantile under $P = Q$, and the mass of the $P \neq Q$ density below $c_\alpha$ gives the false negatives.] The best kernel gives the lowest false negative rate (= highest power)... but can you train for this?
34 Asymptotics of MMD. When $P \neq Q$, the statistic is asymptotically normal: $\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal N(0, 1)$, where $\mathrm{MMD}^2(P, Q)$ is the population MMD and $V_n(P, Q) = O(n^{-1})$. [Figure: empirical MMD distribution and Gaussian fit under H1.]
35 Asymptotics of MMD. When $P = Q$, the statistic has asymptotic distribution $n\,\widehat{\mathrm{MMD}}^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right]$, where $z_l \sim \mathcal N(0, 2)$ i.i.d. and $\lambda_i \psi_i(x') = \int_{\mathcal X} \tilde k(x, x')\, \psi_i(x)\, dP(x)$, with $\tilde k$ the centred kernel. [Figure: MMD density under H0 with a fitted $\chi^2$ sum.]
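In practice the null quantile $c_\alpha$ is typically estimated by permuting the pooled sample rather than by fitting the chi-squared series; a minimal sketch, reusing `mmd2_unbiased` from above (the permutation count is illustrative):

```python
import numpy as np

def mmd_permutation_test(X, Y, alpha=0.05, n_perms=200, sigma=1.0, seed=0):
    """Reject H0: P = Q when the observed MMD^2 exceeds the permutation 1 - alpha quantile."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(n_perms):
        idx = rng.permutation(len(pooled))  # relabel the samples: valid under H0
        null_stats.append(mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]], sigma))
    c_alpha = np.quantile(null_stats, 1 - alpha)  # estimate of the test threshold
    return observed > c_alpha, observed, c_alpha
```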
36–39 Optimizing test power. The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):
$\Pr_1\left( n\,\widehat{\mathrm{MMD}}^2 > \hat c_\alpha \right) \to \Phi\left( \underbrace{\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}}_{O(n^{1/2})} - \underbrace{\frac{c_\alpha}{n \sqrt{V_n(P, Q)}}}_{O(n^{-3/2})} \right)$,
where $\Phi$ is the CDF of the standard normal distribution and $\hat c_\alpha$ is an estimate of the test threshold $c_\alpha$. The second term is asymptotically negligible, so to maximize test power, maximize $\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$. (Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017.) Code: github.com/dougalsutherland/opt-mmd
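A rough sketch of the resulting kernel-selection recipe: on held-out training data, score each candidate bandwidth by an estimate of $\widehat{\mathrm{MMD}}^2 / \sqrt{\hat V_n}$ and keep the best. Here the variance is estimated crudely by resampling, whereas the ICLR 2017 paper uses a closed-form variance estimate and gradient-based optimization, so treat this only as an illustration of the criterion.

```python
import numpy as np

def power_criterion(X, Y, sigma, n_boot=50, seed=0):
    """Estimate MMD^2 / sqrt(V_n) for one bandwidth, with a resampling variance estimate."""
    rng = np.random.default_rng(seed)
    stat = mmd2_unbiased(X, Y, sigma)
    reps = [mmd2_unbiased(X[rng.integers(0, len(X), len(X))],
                          Y[rng.integers(0, len(Y), len(Y))], sigma)
            for _ in range(n_boot)]
    return stat / (np.std(reps) + 1e-8)  # small constant guards against zero variance

def choose_bandwidth(X_train, Y_train, candidates=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pick the bandwidth maximizing the power criterion on held-out training data."""
    scores = [power_criterion(X_train, Y_train, s) for s in candidates]
    return candidates[int(np.argmax(scores))]
```

The selected kernel is then used for the actual test on fresh data, so the selection step does not invalidate the test level.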
40–41 Troubleshooting for generative adversarial networks. [Figures: MNIST samples; samples from a GAN; ARD map.] Power for the optimized ARD kernel: 1.00 at $\alpha = 0.01$. Power for the optimized RBF kernel: 0.57 at $\alpha = 0.01$.
42 Benchmarking generative adversarial networks.
43 The ME statistic and test.
44 Distinguishing Feature(s). $\hat\mu_P(v)$: mean embedding of $P$; $\hat\mu_Q(v)$: mean embedding of $Q$. $\mathrm{witness}(v) = \hat\mu_P(v) - \hat\mu_Q(v)$.
45 Distinguishing Feature(s). Take the square of the witness, $\mathrm{witness}^2(v)$ (we only care about its amplitude).
46 Distinguishing Feature(s). New test statistic: $\mathrm{witness}^2$ at a single location $v$. Linear time in the number $n$ of samples... but how to choose the best feature $v$?
47 Distinguishing Feature(s). Best feature = the $v$ that maximizes $\mathrm{witness}^2(v)$??
48–52 Distinguishing Feature(s). [Figures: empirical $\mathrm{witness}^2(v)$ at sample sizes $n = 3$, $n = 50$, and larger $n$; the population $\mathrm{witness}^2$ function alongside $P(x)$ and $Q(y)$, with its maximizing locations $v^*$ marked.]
53–59 Variance of witness function. Variance at $v$ = variance of $X$ at $v$ + variance of $Y$ at $v$. ME statistic: $\hat\lambda_n(v) := n\, \frac{\mathrm{witness}^2(v)}{\mathrm{variance}(v)}$. [Figures: $\mathrm{witness}^2(v)$ compared against the variance of $X$ at $v$, the variance of $Y$ at $v$, and their sum.] The best location is the $v$ that maximizes $\hat\lambda_n$; performance improves further using multiple locations $\{v_j\}_{j=1}^J$ (see the sketch after this slide).
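A sketch of the multi-location form of the statistic, following the general recipe in Jitkrittum et al. (2016): stack the per-sample feature differences $z_i = \big(k(x_i, v_j) - k(y_i, v_j)\big)_{j=1}^J$ and normalize the mean by its covariance, $\hat\lambda_n = n\, \bar z^\top (\Sigma + \gamma I)^{-1} \bar z$. The kernel and the regularizer $\gamma$ here are illustrative.

```python
import numpy as np

def me_statistic(X, Y, V, sigma=1.0, gamma=1e-5):
    """ME statistic at J test locations V (rows); assumes len(X) == len(Y).
    Cost is O(n) for fixed J, versus O(n^2) for the full MMD."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))
    Z = k(X, V) - k(Y, V)   # (n, J) per-sample witness contributions
    zbar = Z.mean(axis=0)   # empirical witness at each location
    Sigma = np.cov(Z, rowvar=False).reshape(len(V), len(V)) + gamma * np.eye(len(V))
    return len(X) * zbar @ np.linalg.solve(Sigma, zbar)
```

Under H0 the statistic is asymptotically $\chi^2$ with $J$ degrees of freedom, so the threshold is available in closed form; the locations themselves are chosen by maximizing $\hat\lambda_n$ on a held-out split.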
60–65 Distinguishing Positive/Negative Emotions. Positive class (+): happy, neutral, surprised; negative class (−): afraid, angry, disgusted. 35 females and 35 males (Lundqvist et al., 1998); 1632 dimensions; pixel features; sample size 402. [Figures: a random feature vs. the learned feature; test power for + vs. − using the random feature, the proposed test, and the quadratic-time MMD.] The proposed test achieves maximum test power in time $O(n)$. Informative features: differences at the nose and smile lines. Code:
66 Testing against a probabilistic model.
67 Statistical model criticism. $\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal F} \le 1} [\mathbb E_Q f - \mathbb E_P f]$; the witness function $f^*$ attains the supremum. [Figure: $p(x)$, $q(x)$, $f^*(x)$.] Can we compute the MMD with samples from $Q$ and a model $P$? Problem: usually we can't compute $\mathbb E_P f$ in closed form.
68 Stein idea. To get rid of $\mathbb E_P f$ in $\sup_{\|f\|_{\mathcal F} \le 1} [\mathbb E_Q f - \mathbb E_P f]$, define the Stein operator $T_p f = \partial_x f + f\, (\partial_x \log p)$. Then $\mathbb E_P\, T_P f = 0$, subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)
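A quick numeric check of the claim $\mathbb E_P\, T_P f = 0$, using Monte Carlo with a standard Gaussian $p$ (so $\partial_x \log p(x) = -x$) and a smooth bounded $f$; the choice $f(x) = \sin x$ is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)          # samples from p = N(0, 1)

# (T_p f)(x) = f'(x) + f(x) * d/dx log p(x), with f = sin, f' = cos, score = -x
stein_values = np.cos(x) + np.sin(x) * (-x)
print(stein_values.mean())              # ~ 0, up to Monte Carlo error of about 1e-3
```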
69–72 Proof of Stein idea. Consider the class $\mathcal G = \{ \partial_x f + f\, (\partial_x \log p) \mid f \in \mathcal F \}$. Given $g \in \mathcal G$, integration by parts gives
$\mathbb E_p\, g(X) = \mathbb E_p[\partial_x f(X) + f(X)\, \partial_x \log p(X)] = \int \left[ \partial_x f(x)\, p(x) + f(x)\, \partial_x p(x) \right] dx = \int \partial_x \big( f(x)\, p(x) \big)\, dx = \big[ f(x)\, p(x) \big]_{x=-\infty}^{x=\infty} = 0.$
73–77 Maximum Stein Discrepancy. Stein operator: $T_p f = \partial_x f + f\, \partial_x (\log p)$. Maximum Stein Discrepancy (MSD): $\mathrm{MSD}(p, q; \mathcal F) = \sup_{\|g\|_{\mathcal F} \le 1} \left[ \mathbb E_q\, T_p g - \mathbb E_p\, T_p g \right] = \sup_{\|g\|_{\mathcal F} \le 1} \mathbb E_q\, T_p g$. [Figure: $p(x)$, $q(x)$, and the optimal $g^*(x)$.]
78 Maximum Stein Discrepancy. Closed-form expression for the MSD: given $Z, Z' \sim q$,
$\mathrm{MSD}^2(p, q; \mathcal F) = \mathbb E_q\, h_p(Z, Z')$, where $k$ is the RKHS kernel for $\mathcal F$ and
$h_p(x, y) = \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y) + \partial_y \log p(y)\, \partial_x k(x, y) + \partial_x \log p(x)\, \partial_y k(x, y) + \partial_x \partial_y k(x, y).$
(Chwialkowski, Strathmann, G., 2016; Liu, Lee, Jordan, 2016.) This depends only on the kernel and on $\partial_x \log p(x)$: no need to normalize $p$, or to sample from it.
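A sketch of the corresponding V-statistic estimate in 1-D with a Gaussian kernel: only the score $\partial_x \log p$ enters, so $p$ need only be known up to its normalizing constant. The score function and bandwidth below are illustrative.

```python
import numpy as np

def msd2_vstat(Z, score, sigma=1.0):
    """V-statistic estimate of MSD^2(p, q) from 1-D samples Z ~ q;
    `score` maps x to d/dx log p(x), so p may be unnormalized."""
    d = Z[:, None] - Z[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    dK_dx = -d / sigma**2 * K                        # d/dx k(x, y)
    dK_dy = d / sigma**2 * K                         # d/dy k(x, y)
    dK_dxdy = (1 / sigma**2 - d**2 / sigma**4) * K   # d2/(dx dy) k(x, y)
    sx, sy = score(Z)[:, None], score(Z)[None, :]
    H = sx * sy * K + sy * dK_dx + sx * dK_dy + dK_dxdy   # h_p(x, y) above
    return H.mean()

# Model p = N(0, 1), so score(x) = -x; compare against samples from a shifted q:
rng = np.random.default_rng(0)
print(msd2_vstat(rng.normal(0.5, 1.0, size=500), score=lambda z: -z))  # clearly > 0
print(msd2_vstat(rng.normal(0.0, 1.0, size=500), score=lambda z: -z))  # ~ 0
```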
79 Statistical model criticism. Test the hypothesis that a Gaussian process model, learned from training data, is a good fit for the test data (example from Lloyd and Ghahramani, 2015). [Figure: normalised solar activity by year.] Code:
80 Statistical model criticism. [Figure: histogram of the bootstrapped null statistic $B_n$, with the observed test statistic $V_n$ marked.] Test the hypothesis that a Gaussian process model, learned from data, is a good fit for the test data.
81 Co-authors. Students and postdocs: Kacper Chwialkowski (at Voleon), Wittawat Jitkrittum, Heiko Strathmann, Dougal Sutherland. Collaborators: Kenji Fukumizu, Krikamol Muandet, Bernhard Schoelkopf, Alex Smola, Bharath Sriperumbudur, Zoltan Szabo. Questions?