Learning Interpretable Features to Compare Distributions
Transcription
1 Learning Interpretable Features to Compare Distributions. Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. Theory of Big Data.
2–3 Goal of this talk. Given: two collections of samples from unknown distributions $P$ and $Q$. Goal: learn distinguishing features that indicate how $P$ and $Q$ differ.
4 Goal of this talk. Have: two collections of samples $X, Y$ from unknown distributions $P$ and $Q$. Goal: learn distinguishing features that indicate how $P$ and $Q$ differ. [Figures: MNIST samples; samples from a GAN.] Is there a significant difference between the GAN and MNIST?
5–9 Divergences. [Figure-only slides illustrating divergence measures between distributions.] Sriperumbudur, Fukumizu, G., Schoelkopf, Lanckriet (2012).
10 Overview. The maximum mean discrepancy: how to compute and interpret the MMD; training the MMD to maximize test power; application to troubleshooting GANs. The ME test statistic: informative, linear-time features for comparing distributions, and how to learn these features. The maximum Stein discrepancy: a divergence between a sample and a model, needing the model only up to a normalizing constant.
11 Feature mean difference. Simple example: two Gaussians with different means. Answer: the t-test. [Figure: densities of two Gaussians with different means, $P_X$ and $Q_X$.]
12–13 Feature mean difference. Two Gaussians with the same means but different variances. Idea: look at the difference in means of features of the random variables; in the Gaussian case, second-order features of the form $\varphi(x) = x^2$. [Figures: densities of two Gaussians with different variances, and the densities of the feature $X^2$ under each.]
14 Feature mean difference. Gaussian and Laplace distributions with the same mean and the same variance: detecting a difference in means requires higher-order features... which leads to the RKHS. [Figure: Gaussian and Laplace densities.]
15–16 The maximum mean discrepancy. The reproducing property (kernel trick): given $x \in \mathcal X$ for some set $\mathcal X$, define a feature map $\varphi(x) \in \mathcal F$, $\varphi(x) = [\ldots \varphi_i(x) \ldots] \in \ell^2$. For a positive definite kernel $k(x, x')$, $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal F}$, e.g. $k(x, x') = \exp(-\|x - x'\|^2)$. The mean trick: given a Borel probability measure $P$ on $\mathcal X$, define the feature map $\mu_P \in \mathcal F$, $\mu_P = [\ldots \mathbb E_P[\varphi_i(X)] \ldots]$. For a positive definite $k(x, x')$, $\mathbb E_{P,Q}\, k(X, Y) = \langle \mu_P, \mu_Q \rangle_{\mathcal F}$ for $X \sim P$ and $Y \sim Q$.
17–18 The maximum mean discrepancy. The maximum mean discrepancy is the distance between feature means:
$\mathrm{MMD}^2(P, Q) = \|\mu_P - \mu_Q\|^2_{\mathcal F} = \underbrace{\langle \mu_P, \mu_P \rangle_{\mathcal F} + \langle \mu_Q, \mu_Q \rangle_{\mathcal F}}_{(a)} - \underbrace{2 \langle \mu_P, \mu_Q \rangle_{\mathcal F}}_{(b)} = \underbrace{\mathbb E_P\, k(X, X') + \mathbb E_Q\, k(Y, Y')}_{(a)} - \underbrace{2\, \mathbb E_{P,Q}\, k(X, Y)}_{(b)}$,
where (a) = within-distribution similarity and (b) = cross-distribution similarity. Unbiased empirical estimate of the first term (quadratic time): $\widehat{\mathbb E}_P\, k(x, x') = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x_i, x_j)$.
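To make the estimator concrete, here is a minimal NumPy sketch of the quadratic-time unbiased $\widehat{\mathrm{MMD}}^2$; the Gaussian-kernel bandwidth and the helper names (`gaussian_kernel`, `mmd2_unbiased`) are illustrative, not from the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Quadratic-time unbiased estimate of MMD^2(P, Q) from samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Omit the diagonal k(x_i, x_i) terms, as in the sum over j != i above.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

For samples from the same distribution the estimate fluctuates around zero (and can go slightly negative, being unbiased); for different distributions it concentrates on the positive population value.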
19 Illustration of MMD. Dogs ($= P$) and fish ($= Q$) example revisited. Each entry of the Gram matrix is one of $k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{dog}_i, \mathrm{fish}_j)$, or $k(\mathrm{fish}_i, \mathrm{fish}_j)$.
20 Illustration of MMD. The maximum mean discrepancy:
$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)$.
21 Witness function interpretation of MMD. Are $P$ and $Q$ different? [Figure: $P(x)$ and $Q(y)$.]
22–26 Witness function interpretation of MMD. Observe $X = \{x_1, \ldots, x_n\} \sim P$ and $Y = \{y_1, \ldots, y_n\} \sim Q$. Place a Gaussian kernel on each $x_i$ and each $y_i$. The empirical mean embeddings are $\hat\mu_P(v) := \frac{1}{m} \sum_{i=1}^m k(x_i, v)$ (the mean embedding of $\widehat P$) and likewise $\hat\mu_Q(v)$ for $\widehat Q$; the witness function is $\mathrm{witness}(v) = \hat\mu_P(v) - \hat\mu_Q(v)$.
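A minimal sketch of the empirical witness on a 1-D grid, matching the definition above; the example distributions (the Gauss-vs-Laplace pair from slide 14) and the bandwidth are illustrative.

```python
import numpy as np

def witness(v, X, Y, sigma=1.0):
    """witness(v) = mu_hat_P(v) - mu_hat_Q(v), with a Gaussian kernel, for 1-D samples."""
    def mean_embedding(S):
        # (1/m) * sum_i k(s_i, v), evaluated at every grid point v
        return np.exp(-(S[:, None] - v[None, :])**2 / (2 * sigma**2)).mean(axis=0)
    return mean_embedding(X) - mean_embedding(Y)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)                # P: Gaussian
Y = rng.laplace(0.0, 1.0 / np.sqrt(2), size=500)  # Q: Laplace, same mean and variance
v = np.linspace(-4.0, 4.0, 201)
w = witness(v, X, Y)  # positive where P has more mass than Q, negative where Q has more
```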
27–28 Function showing difference in distributions. Maximum mean discrepancy: a smooth function witnessing the difference between $P$ and $Q$: $\mathrm{MMD}(P, Q; \mathcal F) := \sup_{f \in \mathcal F} [\mathbb E_P f(X) - \mathbb E_Q f(Y)]$. $\mathrm{MMD}(P, Q; \mathcal F) = 0$ iff $P = Q$ when $\mathcal F$ is the unit ball in a characteristic RKHS. Example revisited: Laplace $P$ vs. Gauss $Q$. [Figure: witness $f$ for the Gaussian and Laplace densities.]
29–30 Asymptotics of MMD. The MMD estimate: $\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)$ — but how to choose the kernel? Perspective from statistical hypothesis testing: when $P = Q$, $\widehat{\mathrm{MMD}}^2$ is close to zero; when $P \neq Q$, $\widehat{\mathrm{MMD}}^2$ is far from zero. A threshold $c_\alpha$ on $\widehat{\mathrm{MMD}}^2$ gives false positive rate $\alpha$.
31–33 A statistical test. [Figure: densities of $n\,\widehat{\mathrm{MMD}}^2$ under $P = Q$ and under $P \neq Q$; the threshold $c_\alpha$ is the $1 - \alpha$ quantile under $P = Q$, and the mass of the $P \neq Q$ density below $c_\alpha$ gives the false negatives.] The best kernel gives the lowest false negative rate (= highest power)... but can you train for this?
34 Asymptotics of MMD. When $P \neq Q$, the statistic is asymptotically normal: $\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal N(0, 1)$, where $\mathrm{MMD}^2(P, Q)$ is the population MMD and $V_n(P, Q) = O(n^{-1})$. [Figure: empirical MMD distribution and Gaussian fit under H1.]
35 Asymptotics of MMD. When $P = Q$, the statistic has asymptotic distribution $n\,\widehat{\mathrm{MMD}}^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right]$, where $z_l \sim \mathcal N(0, 2)$ i.i.d. and $\lambda_i \psi_i(x') = \int_{\mathcal X} \tilde k(x, x')\, \psi_i(x)\, dP(x)$, with $\tilde k$ the centred kernel. [Figure: MMD density under H0 with a fitted $\chi^2$ sum.]
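In practice the null quantile $c_\alpha$ is typically estimated by permuting the pooled sample rather than by fitting the chi-squared series; a minimal sketch, reusing `mmd2_unbiased` from above (the permutation count is illustrative):

```python
import numpy as np

def mmd_permutation_test(X, Y, alpha=0.05, n_perms=200, sigma=1.0, seed=0):
    """Reject H0: P = Q when the observed MMD^2 exceeds the permutation 1 - alpha quantile."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled = np.concatenate([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(n_perms):
        idx = rng.permutation(len(pooled))  # relabel the samples: valid under H0
        null_stats.append(mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]], sigma))
    c_alpha = np.quantile(null_stats, 1 - alpha)  # estimate of the test threshold
    return observed > c_alpha, observed, c_alpha
```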
36–39 Optimizing test power. The power of our test ($\Pr_1$ denotes probability under $P \neq Q$):
$\Pr_1\left( n\,\widehat{\mathrm{MMD}}^2 > \hat c_\alpha \right) \to \Phi\left( \underbrace{\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}}_{O(n^{1/2})} - \underbrace{\frac{c_\alpha}{n \sqrt{V_n(P, Q)}}}_{O(n^{-3/2})} \right)$,
where $\Phi$ is the CDF of the standard normal distribution and $\hat c_\alpha$ is an estimate of the test threshold $c_\alpha$. The second term is asymptotically negligible, so to maximize test power, maximize $\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}$. (Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017.) Code: github.com/dougalsutherland/opt-mmd
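A rough sketch of the resulting kernel-selection recipe: on held-out training data, score each candidate bandwidth by an estimate of $\widehat{\mathrm{MMD}}^2 / \sqrt{\hat V_n}$ and keep the best. Here the variance is estimated crudely by resampling, whereas the ICLR 2017 paper uses a closed-form variance estimate and gradient-based optimization, so treat this only as an illustration of the criterion.

```python
import numpy as np

def power_criterion(X, Y, sigma, n_boot=50, seed=0):
    """Estimate MMD^2 / sqrt(V_n) for one bandwidth, with a resampling variance estimate."""
    rng = np.random.default_rng(seed)
    stat = mmd2_unbiased(X, Y, sigma)
    reps = [mmd2_unbiased(X[rng.integers(0, len(X), len(X))],
                          Y[rng.integers(0, len(Y), len(Y))], sigma)
            for _ in range(n_boot)]
    return stat / (np.std(reps) + 1e-8)  # small constant guards against zero variance

def choose_bandwidth(X_train, Y_train, candidates=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Pick the bandwidth maximizing the power criterion on held-out training data."""
    scores = [power_criterion(X_train, Y_train, s) for s in candidates]
    return candidates[int(np.argmax(scores))]
```

The selected kernel is then used for the actual test on fresh data, so the selection step does not invalidate the test level.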
40–41 Troubleshooting for generative adversarial networks. [Figures: MNIST samples; samples from a GAN; ARD map.] Power for the optimized ARD kernel: 1.00 at $\alpha = 0.01$. Power for the optimized RBF kernel: 0.57 at $\alpha = 0.01$.
42 Benchmarking generative adversarial networks.
43 The ME statistic and test.
44 Distinguishing Feature(s). $\hat\mu_P(v)$: mean embedding of $P$; $\hat\mu_Q(v)$: mean embedding of $Q$. $\mathrm{witness}(v) = \hat\mu_P(v) - \hat\mu_Q(v)$.
45 Distinguishing Feature(s). Take the square of the witness, $\mathrm{witness}^2(v)$ (we only care about its amplitude).
46 Distinguishing Feature(s). New test statistic: $\mathrm{witness}^2$ at a single location $v$. Linear time in the number $n$ of samples... but how to choose the best feature $v$?
47 Distinguishing Feature(s). Best feature = the $v$ that maximizes $\mathrm{witness}^2(v)$??
48–52 Distinguishing Feature(s). [Figures: empirical $\mathrm{witness}^2(v)$ at sample sizes $n = 3$, $n = 50$, and larger $n$; the population $\mathrm{witness}^2$ function alongside $P(x)$ and $Q(y)$, with its maximizing locations $v^*$ marked.]
53–59 Variance of witness function. Variance at $v$ = variance of $X$ at $v$ + variance of $Y$ at $v$. ME statistic: $\hat\lambda_n(v) := n\, \frac{\mathrm{witness}^2(v)}{\mathrm{variance}(v)}$. [Figures: $\mathrm{witness}^2(v)$ compared against the variance of $X$ at $v$, the variance of $Y$ at $v$, and their sum.] The best location is the $v$ that maximizes $\hat\lambda_n$; performance improves further using multiple locations $\{v_j\}_{j=1}^J$ (see the sketch after this slide).
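A sketch of the multi-location form of the statistic, following the general recipe in Jitkrittum et al. (2016): stack the per-sample feature differences $z_i = \big(k(x_i, v_j) - k(y_i, v_j)\big)_{j=1}^J$ and normalize the mean by its covariance, $\hat\lambda_n = n\, \bar z^\top (\Sigma + \gamma I)^{-1} \bar z$. The kernel and the regularizer $\gamma$ here are illustrative.

```python
import numpy as np

def me_statistic(X, Y, V, sigma=1.0, gamma=1e-5):
    """ME statistic at J test locations V (rows); assumes len(X) == len(Y).
    Cost is O(n) for fixed J, versus O(n^2) for the full MMD."""
    def k(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))
    Z = k(X, V) - k(Y, V)   # (n, J) per-sample witness contributions
    zbar = Z.mean(axis=0)   # empirical witness at each location
    Sigma = np.cov(Z, rowvar=False).reshape(len(V), len(V)) + gamma * np.eye(len(V))
    return len(X) * zbar @ np.linalg.solve(Sigma, zbar)
```

Under H0 the statistic is asymptotically $\chi^2$ with $J$ degrees of freedom, so the threshold is available in closed form; the locations themselves are chosen by maximizing $\hat\lambda_n$ on a held-out split.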
60–65 Distinguishing Positive/Negative Emotions. Positive class (+): happy, neutral, surprised; negative class (−): afraid, angry, disgusted. 35 females and 35 males (Lundqvist et al., 1998); 1632 dimensions; pixel features; sample size 402. [Figures: a random feature vs. the learned feature; test power for + vs. − using the random feature, the proposed test, and the quadratic-time MMD.] The proposed test achieves maximum test power in time $O(n)$. Informative features: differences at the nose and smile lines. Code:
66 Testing against a probabilistic model.
67 Statistical model criticism. $\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal F} \le 1} [\mathbb E_Q f - \mathbb E_P f]$; the witness function $f^*$ attains the supremum. [Figure: $p(x)$, $q(x)$, $f^*(x)$.] Can we compute the MMD with samples from $Q$ and a model $P$? Problem: usually we can't compute $\mathbb E_P f$ in closed form.
68 Stein idea. To get rid of $\mathbb E_P f$ in $\sup_{\|f\|_{\mathcal F} \le 1} [\mathbb E_Q f - \mathbb E_P f]$, define the Stein operator $T_p f = \partial_x f + f\, (\partial_x \log p)$. Then $\mathbb E_P\, T_P f = 0$, subject to appropriate boundary conditions. (Oates, Girolami, Chopin, 2016)
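A quick numeric check of the claim $\mathbb E_P\, T_P f = 0$, using Monte Carlo with a standard Gaussian $p$ (so $\partial_x \log p(x) = -x$) and a smooth bounded $f$; the choice $f(x) = \sin x$ is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)          # samples from p = N(0, 1)

# (T_p f)(x) = f'(x) + f(x) * d/dx log p(x), with f = sin, f' = cos, score = -x
stein_values = np.cos(x) + np.sin(x) * (-x)
print(stein_values.mean())              # ~ 0, up to Monte Carlo error of about 1e-3
```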
69–72 Proof of Stein idea. Consider the class $\mathcal G = \{ \partial_x f + f\, (\partial_x \log p) \mid f \in \mathcal F \}$. Given $g \in \mathcal G$, integration by parts gives
$\mathbb E_p\, g(X) = \mathbb E_p[\partial_x f(X) + f(X)\, \partial_x \log p(X)] = \int \left[ \partial_x f(x)\, p(x) + f(x)\, \partial_x p(x) \right] dx = \int \partial_x \big( f(x)\, p(x) \big)\, dx = \big[ f(x)\, p(x) \big]_{x=-\infty}^{x=\infty} = 0.$
73–77 Maximum Stein Discrepancy. Stein operator: $T_p f = \partial_x f + f\, \partial_x (\log p)$. Maximum Stein Discrepancy (MSD): $\mathrm{MSD}(p, q; \mathcal F) = \sup_{\|g\|_{\mathcal F} \le 1} \left[ \mathbb E_q\, T_p g - \mathbb E_p\, T_p g \right] = \sup_{\|g\|_{\mathcal F} \le 1} \mathbb E_q\, T_p g$. [Figure: $p(x)$, $q(x)$, and the optimal $g^*(x)$.]
78 Maximum Stein Discrepancy. Closed-form expression for the MSD: given $Z, Z' \sim q$,
$\mathrm{MSD}^2(p, q; \mathcal F) = \mathbb E_q\, h_p(Z, Z')$, where $k$ is the RKHS kernel for $\mathcal F$ and
$h_p(x, y) = \partial_x \log p(x)\, \partial_y \log p(y)\, k(x, y) + \partial_y \log p(y)\, \partial_x k(x, y) + \partial_x \log p(x)\, \partial_y k(x, y) + \partial_x \partial_y k(x, y).$
(Chwialkowski, Strathmann, G., 2016; Liu, Lee, Jordan, 2016.) This depends only on the kernel and on $\partial_x \log p(x)$: no need to normalize $p$, or to sample from it.
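A sketch of the corresponding V-statistic estimate in 1-D with a Gaussian kernel: only the score $\partial_x \log p$ enters, so $p$ need only be known up to its normalizing constant. The score function and bandwidth below are illustrative.

```python
import numpy as np

def msd2_vstat(Z, score, sigma=1.0):
    """V-statistic estimate of MSD^2(p, q) from 1-D samples Z ~ q;
    `score` maps x to d/dx log p(x), so p may be unnormalized."""
    d = Z[:, None] - Z[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    dK_dx = -d / sigma**2 * K                        # d/dx k(x, y)
    dK_dy = d / sigma**2 * K                         # d/dy k(x, y)
    dK_dxdy = (1 / sigma**2 - d**2 / sigma**4) * K   # d2/(dx dy) k(x, y)
    sx, sy = score(Z)[:, None], score(Z)[None, :]
    H = sx * sy * K + sy * dK_dx + sx * dK_dy + dK_dxdy   # h_p(x, y) above
    return H.mean()

# Model p = N(0, 1), so score(x) = -x; compare against samples from a shifted q:
rng = np.random.default_rng(0)
print(msd2_vstat(rng.normal(0.5, 1.0, size=500), score=lambda z: -z))  # clearly > 0
print(msd2_vstat(rng.normal(0.0, 1.0, size=500), score=lambda z: -z))  # ~ 0
```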
79 Statistical model criticism. Test the hypothesis that a Gaussian process model, learned from training data, is a good fit for the test data (example from Lloyd and Ghahramani, 2015). [Figure: normalised solar activity by year.] Code:
80 Statistical model criticism. [Figure: histogram of the bootstrapped null statistic $B_n$, with the observed test statistic $V_n$ marked.] Test the hypothesis that a Gaussian process model, learned from data, is a good fit for the test data.
81 Co-authors. Students and postdocs: Kacper Chwialkowski (at Voleon), Wittawat Jitkrittum, Heiko Strathmann, Dougal Sutherland. Collaborators: Kenji Fukumizu, Krikamol Muandet, Bernhard Schoelkopf, Alex Smola, Bharath Sriperumbudur, Zoltan Szabo. Questions?