Learning features to compare distributions

1 Learning features to compare distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS 2016 Workshop on Adversarial Learning, Barcelona, Spain 1/28

2 Goal of this talk Have: Two collections of samples X, Y from unknown distributions P and Q. Goal: Learn distinguishing features that indicate how P and Q differ. 2/28

4-7 Divergences [Figure-only slides.] 3/28-6/28

8 Divergences Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012) 7/28

9 Overview The maximum mean discrepancy: how to compute and interpret the MMD; how to train the MMD; application to troubleshooting GANs. The ME test statistic: informative, linear-time features for comparing distributions; how to learn these features. TL;DR: Variance matters. 8/28

10 The maximum mean discrepancy Are P and Q different? [Figure: densities P(x) and Q(y).]

11 Maximum mean discrepancy (on sample) 10/28

12 Maximum mean discrepancy (on sample) Observe X = \{x_1, \dots, x_n\} \sim P; observe Y = \{y_1, \dots, y_n\} \sim Q. 10/28

13 Maximum mean discrepancy (on sample) Gaussian kernel on each x_i; Gaussian kernel on each y_i. 10/28

14 Maximum mean discrepancy (on sample) \hat{\mu}_P(v): mean embedding of P; \hat{\mu}_Q(v): mean embedding of Q. \hat{\mu}_P(v) := \frac{1}{n}\sum_{i=1}^{n} k(x_i, v). 10/28

15 Maximum mean discrepancy (on sample) \hat{\mu}_P(v): mean embedding of P; \hat{\mu}_Q(v): mean embedding of Q. witness(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v). 10/28
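A minimal numerical sketch of these empirical mean embeddings and the witness function, assuming NumPy, one-dimensional samples, and a Gaussian kernel with a hand-picked bandwidth sigma (the example distributions are arbitrary and this is not the talk's code):

import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    # k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D inputs
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def witness(v, X, Y, sigma=1.0):
    # witness(v) = mu_P_hat(v) - mu_Q_hat(v)
    mu_P = gauss_kernel(X, v, sigma).mean(axis=0)  # (1/n) sum_i k(x_i, v)
    mu_Q = gauss_kernel(Y, v, sigma).mean(axis=0)  # (1/n) sum_i k(y_i, v)
    return mu_P - mu_Q

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=200)   # samples from P
Y = rng.normal(0.5, 1.0, size=200)   # samples from Q
print(witness(np.linspace(-4, 4, 9), X, Y))  # positive where P has more mass than Q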

16 Maximum mean discrepancy (on sample) \widehat{MMD}^2 = \|\mathrm{witness}\|_{\mathcal{F}}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(y_i, y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j). 11/28
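A sketch of this unbiased estimator with a Gaussian kernel for d-dimensional samples; NumPy only, and the bandwidth sigma is assumed rather than learned:

import numpy as np

def sq_dists(A, B):
    # squared Euclidean distances between all rows of A and all rows of B
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def mmd2_unbiased(X, Y, sigma=1.0):
    k = lambda A, B: np.exp(-sq_dists(A, B) / (2 * sigma ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # the i != j sums drop the diagonal of the within-sample Gram matrices
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))   # P
Y = rng.normal(0.3, 1.0, size=(300, 2))   # Q
print(mmd2_unbiased(X, Y))   # near zero when P = Q, positive when P differs from Q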

17 Overview Dogs (= P) and fish (= Q) example revisited. Each entry is one of k(\mathrm{dog}_i, \mathrm{dog}_j), k(\mathrm{dog}_i, \mathrm{fish}_j), or k(\mathrm{fish}_i, \mathrm{fish}_j). 12/28

18 Overview The maximum mean discrepancy: \widehat{MMD}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2}\sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j). 13/28

19 Asymptotics of MMD The MMD: \widehat{MMD}^2 = \frac{1}{n(n-1)}\sum_{i\neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i\neq j} k(y_i, y_j) - \frac{2}{n^2}\sum_{i,j} k(x_i, y_j)... but how to choose the kernel? 14/28

20 Asymptotics of MMD The MMD (estimator as on the previous slide)... but how to choose the kernel? Perspective from statistical hypothesis testing: when P = Q, \widehat{MMD}^2 is close to zero; when P \neq Q, \widehat{MMD}^2 is far from zero. A threshold c_\alpha on \widehat{MMD}^2 gives false positive rate \alpha. 14/28
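In practice the threshold c_\alpha is often obtained by permutation. The sketch below is generic: it takes any two-sample statistic function (for example the mmd2_unbiased sketch above) and is an illustration rather than the talk's procedure; alpha, n_perm and the seed are arbitrary defaults.

import numpy as np

def permutation_threshold(stat_fn, X, Y, alpha=0.05, n_perm=200, seed=0):
    # simulate the null P = Q by shuffling the pooled sample and re-splitting it
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y], axis=0)
    n = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        null_stats.append(stat_fn(Z[idx[:n]], Z[idx[n:]]))
    return np.quantile(null_stats, 1.0 - alpha)   # (1 - alpha) quantile under H0

# usage: reject H0 when stat_fn(X, Y) > permutation_threshold(stat_fn, X, Y),
# which gives a false positive rate of approximately alpha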

21-23 A statistical test [Figure: distribution of n\,\widehat{MMD}^2 when P = Q and when P \neq Q; c_\alpha is the (1-\alpha) quantile when P = Q, and the mass of the P \neq Q distribution below c_\alpha gives the false negatives.] Best kernel gives lowest false negative rate (= highest power)... but can you train for this? 15/28

24 Asymptotics of MMD When P \neq Q, the statistic is asymptotically normal: \frac{\widehat{MMD}^2 - MMD^2(P,Q)}{\sqrt{V_n(P,Q)}} \xrightarrow{D} \mathcal{N}(0,1), where MMD^2(P,Q) is the population MMD and V_n(P,Q) = O(n^{-1}) is the variance of the estimator. [Figure: empirical PDF of the MMD under H1 with a Gaussian fit.] 16/28

25 Asymptotics of MMD When P = Q, the statistic has asymptotic distribution n\,\widehat{MMD}^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right], where z_l \sim \mathcal{N}(0,2) i.i.d. and \lambda_i \phi_i(x') = \int \tilde{k}(x, x')\, \phi_i(x)\, dP(x), with \tilde{k} the centred kernel. [Figure: empirical PDF of n\,\widehat{MMD}^2 under H0 with the \chi^2-sum fit.] 17/28
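One rough way to visualise this null distribution is to replace the population eigenvalues \lambda_l with the eigenvalues of the centred Gram matrix over the pooled sample (divided by the pooled sample size) and simulate the sum. A sketch under those assumptions, with an arbitrary bandwidth and sample sizes, not the production test:

import numpy as np

def simulate_null(X, Y, sigma=1.0, n_sim=2000, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.concatenate([X, Y])[:, None]                 # pooled 1-D sample
    K = np.exp(-((Z - Z.T) ** 2) / (2 * sigma ** 2))
    n = len(Z)
    H = np.eye(n) - np.ones((n, n)) / n                 # centring matrix
    lam = np.linalg.eigvalsh(H @ K @ H) / n             # plug-in estimates of lambda_l
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_sim, n))  # z_l ~ N(0, 2), i.i.d.
    return (lam * (z ** 2 - 2.0)).sum(axis=1)           # draws of the limit of n * MMD^2

rng = np.random.default_rng(1)
draws = simulate_null(rng.normal(0, 1, 100), rng.normal(0, 1, 100))
print(np.quantile(draws, 0.95))   # an approximate threshold c_alpha at alpha = 0.05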

26 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right). 18/28

27 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right) \to 1 - \Phi\left( \frac{c_\alpha}{n \sqrt{V_n(P,Q)}} - \frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}} \right), where \Phi is the CDF of the standard normal distribution and \hat{c}_\alpha is an estimate of the test threshold c_\alpha. 18/28

28 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right) \to 1 - \Phi\left( \underbrace{\frac{c_\alpha}{n \sqrt{V_n(P,Q)}}}_{O(n^{-1/2})} - \underbrace{\frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}}}_{O(n^{1/2})} \right). First term asymptotically negligible! 18/28

29 Optimizing test power The power of our test (\Pr_1 denotes probability under P \neq Q): \Pr_1\left( n\,\widehat{MMD}^2 > \hat{c}_\alpha \right) \to 1 - \Phi\left( \frac{c_\alpha}{n \sqrt{V_n(P,Q)}} - \frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}} \right). To maximize test power, maximize \frac{MMD^2(P,Q)}{\sqrt{V_n(P,Q)}}. (Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., in review for ICLR 2017) Code: github.com/dougalsutherland/opt-mmd 18/28
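A toy version of this criterion, selecting a Gaussian-kernel bandwidth by the ratio of the MMD estimate to a crude standard-deviation proxy. The exact variance estimator of Sutherland et al. is more careful, so treat this purely as a sketch; equal sample sizes, the candidate bandwidths, and the example data are all assumptions.

import numpy as np

def mmd2_ratio(X, Y, sigma, eps=1e-8):
    # requires len(X) == len(Y)
    n = len(X)
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    k = lambda A, B: np.exp(-d2(A, B) / (2 * sigma ** 2))
    Kxy = k(X, Y)
    H = k(X, X) + k(Y, Y) - Kxy - Kxy.T          # U-statistic kernel h(z_i, z_j)
    np.fill_diagonal(H, 0.0)
    mmd2 = H.sum() / (n * (n - 1))
    row_means = H.sum(axis=1) / (n - 1)          # per-sample conditional expectations
    var_proxy = 4.0 * row_means.var() / n + eps  # rough variance of the U-statistic
    return mmd2 / np.sqrt(var_proxy)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(0.2, 1.0, size=(500, 1))
sigmas = [0.25, 0.5, 1.0, 2.0, 4.0]
print(max(sigmas, key=lambda s: mmd2_ratio(X, Y, s)))   # bandwidth with the largest power proxy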

30 Troubleshooting for generative adversarial networks MNIST samples Samples from a GAN 19/28

31 Troubleshooting for generative adversarial networks MNIST samples, ARD map, samples from a GAN. Power for optimized ARD kernel: 1.00 at \alpha = 0.01. Power for optimized RBF kernel: 0.57 at \alpha = 0.01. 19/28
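The ARD map on this slide comes from a kernel with one lengthscale per pixel. A minimal sketch of such an ARD Gaussian kernel; the lengthscales are fixed here for illustration, whereas in the experiments they would presumably be learned by maximizing the power criterion above.

import numpy as np

def ard_kernel(A, B, lengthscales):
    # k(a, b) = exp(-0.5 * sum_d ((a_d - b_d) / lengthscale_d)^2)
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return np.exp(-0.5 * (diff ** 2).sum(-1))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 784)), rng.normal(size=(5, 784))
print(ard_kernel(A, B, lengthscales=np.ones(784)).shape)   # (4, 5)
# a small lengthscale makes its pixel dominate the distance, so the learned
# lengthscale map highlights the pixels where GAN samples differ from MNIST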

32 Benchmarking generative adversarial networks 20/28

33 The ME statistic and test 21/28

34 Distinguishing Feature(s) \hat{\mu}_P(v): mean embedding of P; \hat{\mu}_Q(v): mean embedding of Q. witness(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v). 22/28

35 Distinguishing Feature(s) witness^2(v): take the square of the witness (only its amplitude matters). 23/28

36 Distinguishing Feature(s) New test statistic: witness^2 at a single v; linear time in the number n of samples... but how to choose the best feature v? 23/28

37 Distinguishing Feature(s) Best feature = the v that maximizes witness^2(v)?? 23/28
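A sketch of this naive strategy for a single one-dimensional feature: evaluate witness^2(v) on a grid of candidate locations (linear time in n for each v) and take the argmax. The bandwidth, the grid, and the example distributions are arbitrary choices.

import numpy as np

def witness_sq(v, X, Y, sigma=1.0):
    fx = np.exp(-((X - v) ** 2) / (2 * sigma ** 2)).mean()   # mu_P_hat(v)
    fy = np.exp(-((Y - v) ** 2) / (2 * sigma ** 2)).mean()   # mu_Q_hat(v)
    return (fx - fy) ** 2

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)
Y = rng.normal(1.0, 1.0, 500)
grid = np.linspace(-3.0, 4.0, 71)
v_best = grid[np.argmax([witness_sq(v, X, Y) for v in grid])]
print(v_best)   # as the following slides show, ignoring the variance makes this choice noisy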

38 Distinguishing Feature(s) witness^2(v), sample size n = 3. 24/28

39 Distinguishing Feature(s) Sample size n = 50 24/28

40 Distinguishing Feature(s) Sample size n = /28

41 Distinguishing Feature(s) P(x), Q(y), witness^2(v): the population witness^2 function. 24/28

42 Distinguishing Feature(s) P(x), Q(y), witness^2(v), with the maximising locations v* marked. 24/28

43-48 Variance of witness function Variance at v = variance of X at v + variance of Y at v. ME statistic: \hat{\lambda}_n(v) := n \cdot \frac{\mathrm{witness}^2(v)}{\mathrm{variance\ at\ }v}. [Figure builds: witness^2(v) overlaid in turn with variance_X(v), variance_Y(v), and their sum.] 25/28

49 Variance of witness function ME statistic: \hat{\lambda}_n(v) := n \cdot \frac{\mathrm{witness}^2(v)}{\mathrm{variance\ at\ }v}. Best location is the v^* that maximizes \hat{\lambda}_n(v). Improve performance using multiple locations \{v_j\}_{j=1}^{J}. 25/28
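A sketch of this normalised statistic for a single one-dimensional location, dividing the squared witness by the empirical variance of the features under X and Y; purely illustrative (the slides' test generalises this to several locations v_j), with an assumed bandwidth and example data.

import numpy as np

def me_statistic(v, X, Y, sigma=1.0, eps=1e-8):
    fx = np.exp(-((X - v) ** 2) / (2 * sigma ** 2))   # features k(x_i, v)
    fy = np.exp(-((Y - v) ** 2) / (2 * sigma ** 2))   # features k(y_i, v)
    witness = fx.mean() - fy.mean()
    variance = fx.var(ddof=1) + fy.var(ddof=1)        # variance of X at v + variance of Y at v
    return len(X) * witness ** 2 / (variance + eps)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)
Y = rng.normal(1.0, 1.0, 500)
grid = np.linspace(-3.0, 4.0, 71)
print(grid[np.argmax([me_statistic(v, X, Y) for v in grid])])   # v* balancing signal and variance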

50 Distinguishing Positive/Negative Emotions + : happy, neutral, surprised; - : afraid, angry, disgusted. 35 females and 35 males (Lundqvist et al., 1998). 1632 dimensions, pixel features. Sample size: 402. The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines. 26/28

51-53 Distinguishing Positive/Negative Emotions [Table builds: test power on the + vs. - problem for a random-feature baseline, the proposed test, and the quadratic-time MMD.] The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines. 26/28

54-55 Distinguishing Positive/Negative Emotions + : happy, neutral, surprised; - : afraid, angry, disgusted. Learned feature. The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines. Code: 26/28

56 Final thoughts Witness function approaches: diversity of samples (the MMD test uses pairwise similarities between all samples; the ME test uses similarities to J reference features); disjoint support of generator/data distributions (the witness function is smooth). Other discriminator heuristics: diversity of samples via the minibatch heuristic (add distances to neighbouring samples as features), Salimans et al. (2016); disjoint support treated by adding noise to blur images, Arjovsky and Bottou (2016), Sønderby et al. (2016). 27/28

57 Co-authors Students and postdocs: Kacper Chwialkowski (at Voleon) Wittawat Jitkrittum Heiko Strathmann Dougal Sutherland Collaborators Kenji Fukumizu Krikamol Muandet Bernhard Schoelkopf Bharath Sriperumbudur Zoltan Szabo Questions? 28/28

58 Testing against a probabilistic model 29/28

59 Statistical model criticism MMD(P,Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}} = \sup_{\|f\|_{\mathcal{F}} \le 1} [E_Q f - E_P f]; f^*(x) is the witness function. Can we compute the MMD with samples from Q and a model P? Problem: usually can't compute E_P f in closed form. [Figure: p(x), q(x), and the witness f^*(x).] 30/28

60 Stein idea To get rid of E_P f in \sup_{\|f\|_{\mathcal{F}} \le 1} [E_Q f - E_P f], we define the Stein operator [T_p f](x) = \partial_x f(x) + f(x)\, \partial_x \log p(x). Then E_P\, T_P f = 0 subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016) 31/28
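A quick Monte Carlo check of this identity for p = N(0,1), so that \partial_x \log p(x) = -x, with an arbitrary bounded test function f = tanh:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1_000_000)   # samples from p = N(0, 1)
f = np.tanh(x)                        # a smooth, bounded test function
df = 1.0 - np.tanh(x) ** 2            # f'(x)
score = -x                            # d/dx log p(x) for the standard normal
print(np.mean(df + f * score))        # Stein identity: approximately zero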

61-63 Maximum Stein Discrepancy Stein operator: [T_p f](x) = \partial_x f(x) + f(x)\, \partial_x \log p(x). Maximum Stein Discrepancy (MSD): \mathrm{MSD}(p, q, \mathcal{F}) = \sup_{\|g\|_{\mathcal{F}} \le 1} \left[ E_q T_p g - E_p T_p g \right] = \sup_{\|g\|_{\mathcal{F}} \le 1} E_q T_p g. 32/28

64-65 Maximum Stein Discrepancy [Figure builds: p(x), q(x), and the Stein witness g^*(x).] 32/28

66 Maximum Stein discrepancy Closed-form expression for the MSD: given Z, Z' \sim q, \mathrm{MSD}(p, q, \mathcal{F}) = E_q\, h_p(Z, Z'), where k is the RKHS kernel for \mathcal{F} and h_p(x, y) := \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y) + \partial_y \log p(y)\, \partial_x k(x, y) + \partial_x \log p(x)\, \partial_y k(x, y) + \partial_x \partial_y k(x, y). (Chwialkowski, Strathmann, G., 2016; Liu, Lee, Jordan, 2016) Only depends on the kernel and on \partial_x \log p(x): no need to normalize p, or to sample from it. 33/28
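A sketch of the closed form above in one dimension with a Gaussian kernel, averaged as a V-statistic over samples from q; only the score d/dx log p is needed, so p may be unnormalised. The N(0,1) model, the bandwidth, and the samples from q are illustrative assumptions.

import numpy as np

def stein_stat(Z, score_p, sigma=1.0):
    x, y = Z[:, None], Z[None, :]
    sx, sy = score_p(Z)[:, None], score_p(Z)[None, :]
    K = np.exp(-((x - y) ** 2) / (2 * sigma ** 2))
    dKdx = -(x - y) / sigma ** 2 * K                              # d/dx k(x, y)
    dKdy = (x - y) / sigma ** 2 * K                               # d/dy k(x, y)
    dKdxdy = (1.0 / sigma ** 2 - (x - y) ** 2 / sigma ** 4) * K   # d^2/(dx dy) k(x, y)
    H = sx * sy * K + sy * dKdx + sx * dKdy + dKdxdy              # h_p(x, y)
    return H.mean()

rng = np.random.default_rng(0)
Z = rng.normal(0.5, 1.0, 1000)                  # samples from q
print(stein_stat(Z, score_p=lambda z: -z))      # score of p = N(0, 1); near 0 only if q matches p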

67 Statistical model criticism Test the hypothesis that a Gaussian process model, learned from the data, is a good fit for the test data (example from Lloyd and Ghahramani, 2015). [Figure: normalised solar activity vs. year.] Code: 34/28

68 Statistical model criticism [Figure: histogram of the bootstrapped statistic B_n, with the observed V_n test statistic marked.] Test the hypothesis that a Gaussian process model, learned from the data, is a good fit for the test data. 35/28
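A sketch of a multiplier (wild) bootstrap null in the spirit of the bootstrapped B_n on this slide, applied to a matrix H with entries h_p(z_i, z_j) as computed inside the previous sketch. Plain Rademacher signs are used, which is the i.i.d. case; a wild bootstrap with correlated weights would be needed for dependent data.

import numpy as np

def bootstrap_pvalue(H, n_boot=500, seed=0):
    # H[i, j] = h_p(z_i, z_j); the observed statistic is its mean
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    observed = H.mean()
    null = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=n)              # mean-zero multipliers
        null[b] = (w[:, None] * w[None, :] * H).mean()   # bootstrap replicate B_n
    return (np.sum(null >= observed) + 1.0) / (n_boot + 1.0)

# usage: reject the model p when bootstrap_pvalue(H) < alpha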
