A Linear-Time Kernel Goodness-of-Fit Test


1 A Linear-Time Kernel Goodness-of-Fit Test. Wittawat Jitkrittum 1, Wenkai Xu 1, Zoltán Szabó 2, Kenji Fukumizu 3, Arthur Gretton 1 (NIPS 2017 best paper). wittawatj@gmail.com. 1 Gatsby Unit, University College London (now at Max Planck Institute for Intelligent Systems). 2 CMAP, École Polytechnique. 3 The Institute of Statistical Mathematics, Tokyo. Workshop on Functional Inference and Machine Intelligence, Tokyo, 20 February.

2 Model Criticism.

4 Model Criticism. Data: robbery events in Chicago.

5 Model Criticism. Is this a good model?

6 Model Criticism. Goals: (1) Test whether a (complicated) model fits the data. (2) If it does not, show a location where it fails.

8 Goodness-of-fit Testing. Given: (1) a sample $\{x_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} q$ (unknown) on $\mathbb{R}^d$; (2) an unnormalized density $p$ (the known model). Test $H_0: p = q$ against $H_1: p \neq q$. We want a test that is (1) nonparametric, (2) linear-time: runtime is $O(n)$, fast, and (3) interpretable: model criticism by finding a test location at which the model fails.

11 Maximum Mean Discrepancy (MMD) Witness Function [Gretton et al., 2012]. Observe $X = \{x_1, \dots, x_n\} \sim q$ and $Y = \{y_1, \dots, y_n\} \sim p$, and place a Gaussian kernel $k_v$ on each $x_i$ and each $y_i$. Mean embeddings: $\mu_q(v) = \mathbb{E}_{x \sim q}\, k_v(x)$ and $\mu_p(v) = \mathbb{E}_{y \sim p}\, k_v(y)$ (the mean embedding of $p$). The witness function is $\mathrm{witness}(v) = \mu_q(v) - \mu_p(v)$, and $\mathrm{MMD}(q, p) = \|\mathrm{witness}\|$. The squared witness $\mathrm{witness}^2(v) = (\mu_q(v) - \mu_p(v))^2$ can be used to find a good test location $v^*$.

19 Model Criticism by the MMD Witness. Find a location $v$ at which $q$ and $p$ differ most (ME test) [Jitkrittum et al., 2016]:
$\mathrm{witness}(v) = \mathbb{E}_{x\sim q}[k_v(x)] - \mathbb{E}_{y\sim p}[k_v(y)]$,
$\mathrm{score}(v) = \dfrac{\mathrm{witness}^2(v)}{\mathrm{noise}(v)} = \dfrac{\mathrm{witness}^2(v)}{\sqrt{\mathbb{V}_{x\sim q}[k_v(x)] + \mathbb{V}_{y\sim p}[k_v(y)]}}$.
Sweeping the kernel $k_v$ across the domain, the score rises (e.g. from 1.6 to 13) and peaks (at 25) at the best $v$. Problem: there is no sample from $p$, and one may be difficult to generate.
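For concreteness, a minimal sketch (illustrative, not the authors' code) of the empirical witness and signal-to-noise score at a single location $v$, given samples $X \sim q$ and $Y \sim p$ and a Gaussian kernel:

```python
import numpy as np

def gauss_kernel(X, v, sigma2=1.0):
    """k_v(x) = exp(-||x - v||^2 / (2 sigma2)), evaluated for each row of X."""
    d2 = np.sum((X - v) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma2))

def me_score(X, Y, v, sigma2=1.0, reg=1e-8):
    """Signal-to-noise score witness^2(v) / noise(v) at a single location v.

    witness(v) = mean_x k_v(x) - mean_y k_v(y)
    noise(v)   = sqrt( Var_x[k_v(x)] + Var_y[k_v(y)] )  (plus a small regulariser)
    """
    kx, ky = gauss_kernel(X, v, sigma2), gauss_kernel(Y, v, sigma2)
    witness = kx.mean() - ky.mean()
    noise = np.sqrt(kx.var() + ky.var()) + reg
    return witness ** 2 / noise

# Toy usage: q = N(1, 1), p = N(0, 1); the score is larger near the discrepancy.
rng = np.random.default_rng(0)
X = rng.normal(1.0, 1.0, size=(500, 1))   # sample from q
Y = rng.normal(0.0, 1.0, size=(500, 1))   # sample from p (the model)
print(me_score(X, Y, v=np.array([1.5])), me_score(X, Y, v=np.array([-3.0])))
```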

28 The Stein Witness Function [Liu et al., 2016, Chwialkowski et al., 2016]. Problem: there is no sample from $p$, so $\mathbb{E}_{y\sim p}[k_v(y)]$ cannot be estimated. Stein witness: $\mathrm{witness}(v) = \mathbb{E}_{x\sim q}[(T_p k_v)(x)] - \mathbb{E}_{y\sim p}[(T_p k_v)(y)]$. Idea: define $T_p$ such that $\mathbb{E}_{y\sim p}(T_p k_v)(y) = 0$ for any $v$, so that $\mathrm{witness}(v) = \mathbb{E}_{x\sim q}[(T_p k_v)(x)]$ depends only on the sample from $q$. Proposal: a good $v$ should have a high signal-to-noise ratio $\mathrm{score}(v) = \mathrm{witness}^2(v)/\mathrm{noise}(v)$, and $\mathrm{score}(v)$ can be estimated in linear time. Goodness-of-fit test: (1) find $v^* = \arg\max_v \mathrm{score}(v)$; (2) reject $H_0$ if $\mathrm{witness}^2(v^*) >$ threshold.

39 Proposal: Model Criticism with the Stein Witness. $\mathrm{score}(v) = \mathrm{witness}^2(v)/\mathrm{noise}(v)$. The Stein-transformed kernel $(T_p k_v)(x)$ is plotted at the current location $v$; sweeping $v$ across the domain, the estimated score varies between roughly 0.16 and 0.49, and the $v$ with the highest score marks where the model fits the data worst.

63 Theory: (1) What is $T_p k_v$? (2) The test statistic. (3) Distributions of the test statistic; the test threshold. (4) What does $v^* = \arg\max_v \mathrm{score}(v)$ do theoretically?

64 (1) What is $T_p k_v$? Recall $\mathrm{witness}(v) = \mathbb{E}_{x\sim q}(T_p k_v)(x) - \mathbb{E}_{y\sim p}(T_p k_v)(y)$. Define
$(T_p k_v)(y) = \frac{1}{p(y)} \frac{d}{dy}\left[k_v(y)\, p(y)\right]$ (the normalizer of $p$ cancels). Then $\mathbb{E}_{y\sim p}(T_p k_v)(y) = 0$ [Liu et al., 2016, Chwialkowski et al., 2016].
Proof: $\mathbb{E}_{y\sim p}[(T_p k_v)(y)] = \int_{-\infty}^{\infty} \frac{1}{p(y)} \frac{d}{dy}[k_v(y) p(y)]\, p(y)\, dy = \int_{-\infty}^{\infty} \frac{d}{dy}[k_v(y) p(y)]\, dy = \left[k_v(y) p(y)\right]_{y=-\infty}^{y=\infty} = 0$, assuming $\lim_{|y|\to\infty} k_v(y) p(y) = 0$.
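Expanding the operator gives $(T_p k_v)(x) = k_v(x)\,\frac{d}{dx}\log p(x) + \frac{d}{dx} k_v(x)$, so only the derivative of $\log p$ is needed and the normalizer never appears. A small sketch (my own illustration, not the authors' code) that checks the Stein identity numerically for $p = \mathcal{N}(0,1)$ and a Gaussian kernel:

```python
import numpy as np

def stein_feature(y, v, sigma2=1.0):
    """(T_p k_v)(y) = k_v(y) * d/dy log p(y) + d/dy k_v(y) for the model p = N(0, 1)."""
    k = np.exp(-(y - v) ** 2 / (2.0 * sigma2))   # Gaussian kernel k_v(y)
    dlogp = -y                                   # d/dy log N(y; 0, 1)
    dk = -(y - v) / sigma2 * k                   # d/dy k_v(y)
    return k * dlogp + dk

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=200000)            # y ~ p = N(0, 1)
x = rng.normal(1.0, 1.0, size=200000)            # x ~ q = N(1, 1)
print(stein_feature(y, v=0.5).mean())            # ~ 0: Stein identity under p
print(stein_feature(x, v=0.5).mean())            # != 0: the witness g(v) under q
```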

74 (2) Proposal: The Finite Set Stein Discrepancy (FSSD). Recall the Stein witness $g(v) := \mathbb{E}_{x\sim q}\left[\frac{1}{p(x)} \frac{d}{dx}\left[k_v(x) p(x)\right]\right]$. FSSD statistic: evaluate $g^2$ at $J$ test locations $V = \{v_1, \dots, v_J\}$. Population FSSD: $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$. An unbiased estimator $\widehat{\mathrm{FSSD}^2}$ is computable in $O(d^2 J n)$ time ($d$ = input dimension).
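A minimal numpy sketch of this estimator under the stated assumptions (Gaussian kernel, score function $\nabla_x \log p$ available); function names are illustrative and this is not the released code:

```python
import numpy as np

def stein_features(X, V, grad_log_p, sigma2=1.0):
    """xi(x, v) = k_v(x) * grad_x log p(x) + grad_x k_v(x), stacked over the J locations.

    X: (n, d) sample from q;  V: (J, d) test locations;
    grad_log_p: function mapping (n, d) -> (n, d), the score of the (unnormalized) model.
    Returns Tau with shape (n, d*J).
    """
    n, d = X.shape
    G = grad_log_p(X)                                        # (n, d)
    feats = []
    for v in V:
        diff = X - v                                         # (n, d)
        k = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma2))  # (n,)
        dk = -diff / sigma2 * k[:, None]                     # grad_x k_v(x), (n, d)
        feats.append(k[:, None] * G + dk)                    # (n, d)
    return np.hstack(feats)                                  # (n, d*J)

def fssd2_unbiased(X, V, grad_log_p, sigma2=1.0):
    """Unbiased estimate of FSSD^2 = ||E_q tau(x)||^2 / (dJ); runtime linear in n."""
    Tau = stein_features(X, V, grad_log_p, sigma2)           # (n, dJ)
    n, dJ = Tau.shape
    s = Tau.sum(axis=0)
    u_stat = (s @ s - np.sum(Tau ** 2)) / (n * (n - 1))      # mean over i != j of tau_i . tau_j
    return u_stat / dJ

# Toy usage: model p = N(0, I_2), data drawn from a shifted Gaussian q.
rng = np.random.default_rng(0)
X = rng.normal([0.8, 0.0], 1.0, size=(1000, 2))
V = rng.normal(0.0, 2.0, size=(5, 2))                        # J = 5 random locations
print(fssd2_unbiased(X, V, grad_log_p=lambda x: -x))         # > 0 when p != q
```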

78 (2) FSSD is a Discrepancy Measure. $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$. Theorem 1 (FSSD is a discrepancy measure). Main conditions: (1) (Nice kernel) the kernel $k$ is $C_0$-universal and real analytic, e.g. the Gaussian kernel. (2) (Vanishing boundary) $\lim_{\|x\|\to\infty} p(x) k_v(x) = 0$. (3) (Avoid "blind spots") the locations $v_1, \dots, v_J$ are drawn from a distribution which has a density. Then, for any $J \ge 1$, almost surely (over the draw of the locations), $\mathrm{FSSD}^2 = 0 \iff p = q$. Summary: evaluating the witness at a few random locations is sufficient to detect the discrepancy between $p$ and $q$.

79 (2) What Are "Blind Spots"? Recall $g(v) := \mathbb{E}_{x\sim q}\left[\frac{1}{p(x)} \frac{d}{dx}\left[k_v(x) p(x)\right]\right]$. Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(0, \sigma_q^2)$, with a unit-width Gaussian kernel. Then $g(v)$ has a closed form proportional to $\frac{v\,(\sigma_q^2 - 1)\, \exp\!\left(-\frac{v^2}{2 + 2\sigma_q^2}\right)}{(\sigma_q^2 + 1)^{3/2}}$ (plotted for $p = \mathcal{N}(0, 1)$, $q = \mathcal{N}(0, 4)$). If $v = 0$, then $\mathrm{FSSD}^2 = g^2(v) = 0$ regardless of $\sigma_q^2$. If $g \ne 0$ and $k$ is real analytic, then $R = \{v \mid g(v) = 0\}$ (the blind spots) has zero Lebesgue measure. So if $v \sim$ a distribution with a density, then $v \notin R$ almost surely.
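A quick Monte-Carlo check of the blind spot (my own illustration, not from the slides): for $p = \mathcal{N}(0,1)$, $q = \mathcal{N}(0,4)$ and a unit-width Gaussian kernel, the estimated $g(v)$ is near zero at $v = 0$ but clearly nonzero away from it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=200000)         # x ~ q = N(0, 4), i.e. standard deviation 2

def g_hat(v, sigma2=1.0):
    """Monte-Carlo estimate of the Stein witness g(v) for the model p = N(0, 1)."""
    k = np.exp(-(x - v) ** 2 / (2 * sigma2))  # unit-width Gaussian kernel
    return np.mean(k * (-x) - (x - v) / sigma2 * k)

print(g_hat(0.0))   # ~ 0 even though p != q: v = 0 is a blind spot
print(g_hat(2.0))   # clearly nonzero
```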

84 (3) Asymptotic Distributions of $\hat\lambda_n := n\, \widehat{\mathrm{FSSD}^2}$. Under $H_0: p = q$, asymptotically $\hat\lambda_n := n\, \widehat{\mathrm{FSSD}^2} \xrightarrow{d} \sum_{i=1}^{dJ} (Z_i^2 - 1)\, \omega_i$, where $Z_1, \dots, Z_{dJ} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and $\{\omega_i\}_{i=1}^{dJ}$ are non-negative, computable quantities; the test threshold $T_\alpha$ is taken from this null distribution $P_{H_0}(\hat\lambda_n)$. Under $H_1: p \ne q$, asymptotically $\sqrt{n}\,(\widehat{\mathrm{FSSD}^2} - \mathrm{FSSD}^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{H_1}^2)$, where $\sigma_{H_1}$ plays the role of $\mathrm{noise}(V)$ in $\mathrm{score}(V) = \mathrm{witness}^2(V)/\mathrm{noise}(V)$.
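A minimal sketch (my own, not the authors' implementation) of simulating this weighted chi-square null to get a threshold and p-value; it plugs in the empirical feature covariance computed from the sample itself, in the spirit of the consistency result quoted in the appendix slides (Proposition 2).

```python
import numpy as np

def null_simulate(Tau, n_sim=3000, alpha=0.05, seed=1):
    """Simulate the asymptotic null sum_i (Z_i^2 - 1) * omega_i; return (threshold, draws).

    Tau: (n, dJ) matrix of Stein features tau(x_i) (see the FSSD sketch above).
    The omega_i are taken as eigenvalues of the empirical feature covariance,
    used here as a stand-in for the unknown covariance under p.
    """
    n, dJ = Tau.shape
    cov = np.atleast_2d(np.cov(Tau, rowvar=False)) / dJ     # match the 1/(dJ) scaling of FSSD^2
    omegas = np.maximum(np.linalg.eigvalsh(cov), 0.0)       # non-negative weights
    rng = np.random.default_rng(seed)
    Z2 = rng.chisquare(df=1, size=(n_sim, omegas.size))     # Z_i^2 with Z_i ~ N(0, 1)
    draws = (Z2 - 1.0) @ omegas
    return np.quantile(draws, 1.0 - alpha), draws

# Usage (continuing the toy example above): reject H0 if n * FSSD2_hat > threshold.
# Tau = stein_features(X, V, grad_log_p=lambda x: -x)
# thresh, draws = null_simulate(Tau)
# p_value = np.mean(draws >= Tau.shape[0] * fssd2_unbiased(X, V, lambda x: -x))
```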

90 (4) What Does $\arg\max_v \mathrm{score}(v)$ Do? Proposition 1 (Asymptotic test power). For large $n$, the test power is
$\mathbb{P}(\text{reject } H_0 \mid H_1 \text{ true}) = \mathbb{P}_{H_1}\!\left(n\, \widehat{\mathrm{FSSD}^2} > T_\alpha\right) \approx \Phi\!\left(\sqrt{n}\, \frac{\mathrm{FSSD}^2}{\sigma_{H_1}} - \frac{T_\alpha}{\sqrt{n}\, \sigma_{H_1}}\right)$,
where $\Phi$ is the CDF of $\mathcal{N}(0, 1)$. For large $n$, the term $\sqrt{n}\, \mathrm{FSSD}^2/\sigma_{H_1}$ dominates, so
$\arg\max_{V, \sigma_k^2} \mathbb{P}_{H_1}\!\left(n\, \widehat{\mathrm{FSSD}^2} > T_\alpha\right) \approx \arg\max_{V, \sigma_k^2} \left[\frac{\widehat{\mathrm{FSSD}^2}}{\hat\sigma_{H_1}}\right] = \arg\max_{V, \sigma_k^2} \mathrm{score}(V, \sigma_k^2)$.
Maximizing $\mathrm{score}(V, \sigma_k^2)$ thus maximizes the test power. In practice, split $\{x_i\}_{i=1}^n$ into independent training/test sets: optimize on the training set, then run the goodness-of-fit test on the test set.
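A crude illustration of this train/test recipe, reusing stein_features from the FSSD sketch above; the paper tunes $(V, \sigma_k^2)$ by gradient ascent, whereas this sketch merely random-searches, and the constants in the score are not matched exactly.

```python
import numpy as np

def score_hat(X, V, grad_log_p, sigma2, reg=1e-6):
    """Estimated power criterion FSSD2_hat / sigma_H1_hat (up to constants)."""
    Tau = stein_features(X, V, grad_log_p, sigma2)   # helper from the FSSD sketch above
    n, dJ = Tau.shape
    mu = Tau.mean(axis=0)
    fssd2 = (mu @ mu) / dJ                           # biased plug-in is fine for tuning
    sigma_h1 = np.sqrt(4.0 * mu @ np.cov(Tau, rowvar=False) @ mu) / dJ
    return fssd2 / (sigma_h1 + reg)

def tune_by_random_search(X_tr, grad_log_p, J=5, n_trials=50, seed=2):
    """Crude stand-in for the paper's gradient-based tuning: random-search V and sigma2."""
    rng = np.random.default_rng(seed)
    best = (-np.inf, None, None)
    for _ in range(n_trials):
        idx = rng.choice(len(X_tr), size=J, replace=False)
        V = X_tr[idx] + 0.5 * rng.normal(size=(J, X_tr.shape[1]))   # jittered data points
        sigma2 = np.exp(rng.uniform(np.log(0.1), np.log(10.0)))
        s = score_hat(X_tr, V, grad_log_p, sigma2)
        if s > best[0]:
            best = (s, V, sigma2)
    return best  # (score, V, sigma2), to be used on the held-out test split only

# Usage: X_tr, X_te = X[:200], X[200:]; tune on X_tr, then run the FSSD test on X_te.
```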

93 Related Works.

94 Kernel Stein Discrepancy (KSD) [Liu et al., 2016, Chwialkowski et al., 2016]. Recall the Stein witness $g(v) := \mathbb{E}_{x\sim q}\left[\frac{1}{p(x)} \frac{d}{dx}\left[k_v(x) p(x)\right]\right]$. KSD takes the RKHS norm of the whole witness, $\mathrm{KSD}^2 = \|g\|^2_{\mathrm{RKHS}}$: good when the difference between $p$ and $q$ is spatially diffuse. The proposed FSSD evaluates the witness at $J$ locations, $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$: good when the difference between $p$ and $q$ is local.

97 Kernel Stein Discrepancy (KSD): closed-form expression [Liu et al., 2016, Chwialkowski et al., 2016].
$\mathrm{KSD}^2 = \|g\|^2_{\mathrm{RKHS}} = \underbrace{\mathbb{E}_{x\sim q}\mathbb{E}_{y\sim q}\, h_p(x, y)}_{\text{double sums}}$,
where $h_p(x, y) := [\partial_x \log p(x)]\, k(x, y)\, [\partial_y \log p(y)] + [\partial_y \log p(y)]\, \partial_x k(x, y) + [\partial_x \log p(x)]\, \partial_y k(x, y) + \partial_x \partial_y k(x, y)$,
and $k$ is a kernel. The double sums make the estimator $O(d^2 n^2)$. Slow.

99 Linear-Time Kernel Stein Discrepancy (LKS). [Liu et al., 2016] also proposed a linear-time version of KSD. For $\{x_i\}_{i=1}^n \sim q$, the KSD test statistic is $\frac{2}{n(n-1)} \sum_{i<j} h_p(x_i, x_j)$, while the LKS test statistic is a running average $\frac{2}{n} \sum_{i=1}^{n/2} h_p(x_{2i-1}, x_{2i})$. Both are unbiased. LKS has $O(d^2 n)$ runtime, the same as the proposed FSSD, but it has high variance and hence poor test power.
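A hedged 1-D sketch of both estimators for a Gaussian kernel, using only $\partial_x \log p$; names are illustrative, not the authors' code.

```python
import numpy as np

def h_p(x, y, dlogp, sigma2=1.0):
    """U-statistic kernel h_p(x, y) of KSD for 1-D data and a Gaussian kernel."""
    d = x - y
    k = np.exp(-d ** 2 / (2 * sigma2))
    dkdx, dkdy = -d / sigma2 * k, d / sigma2 * k
    dkdxdy = (1.0 / sigma2 - d ** 2 / sigma2 ** 2) * k
    sx, sy = dlogp(x), dlogp(y)
    return sx * k * sy + sy * dkdx + sx * dkdy + dkdxdy

def ksd2_quadratic(x, dlogp, sigma2=1.0):
    """Quadratic-time unbiased KSD^2: (2 / (n(n-1))) * sum_{i<j} h_p(x_i, x_j)."""
    n = len(x)
    H = h_p(x[:, None], x[None, :], dlogp, sigma2)
    return (H.sum() - np.trace(H)) / (n * (n - 1))

def lks_linear(x, dlogp, sigma2=1.0):
    """Linear-time running average: (2/n) * sum_i h_p(x_{2i-1}, x_{2i})."""
    m = len(x) // 2
    return h_p(x[0:2 * m:2], x[1:2 * m:2], dlogp, sigma2).mean()

rng = np.random.default_rng(0)
x = rng.laplace(0.0, 1.0 / np.sqrt(2), size=2000)   # q = Laplace with unit variance
print(ksd2_quadratic(x, dlogp=lambda t: -t),        # model p = N(0, 1)
      lks_linear(x, dlogp=lambda t: -t))
```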

100 Simulation Settings. Gaussian kernel $k(x, v) = \exp\!\left(-\frac{\|x - v\|_2^2}{2\sigma_k^2}\right)$. Methods compared:
(1) FSSD-opt: proposed, with optimization, $J = 5$.
(2) FSSD-rand: proposed, random test locations.
(3) KSD: quadratic-time kernel Stein discrepancy [Liu et al., 2016, Chwialkowski et al., 2016].
(4) LKS: linear-time running-average version of KSD.
(5) MMD-opt: MMD two-sample test [Gretton et al., 2012], with optimization.
(6) ME-test: Mean Embeddings two-sample test [Jitkrittum et al., 2016], with optimization.
The two-sample tests need to draw a sample from $p$. Tests with optimization use 20% of the data. Significance level $\alpha = 0.05$.

103 Gaussian vs. Laplace. $p$ = Gaussian, $q$ = Laplace, with the same mean and variance; the high-order moments differ. (Plots: rejection rate vs. sample size $n$ and vs. dimension $d$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.) Optimization increases the power. Two-sample tests can perform well in this case, since $p$ and $q$ clearly differ.

104 Gaussian-Bernoulli Restricted Boltzmann Machine (RBM). $p(x)$ is the marginal of $p(x, h) = \frac{1}{Z} \exp\!\left(x^\top B h + b^\top x + c^\top h - \frac{1}{2}\|x\|^2\right)$, where $x \in \mathbb{R}^{50}$ and $h \in \{-1, 1\}^{40}$ is latent. $B$, $b$, $c$ are picked randomly. $q(x) = p(x)$ with i.i.d. $\mathcal{N}(0, \sigma_{\mathrm{per}}^2)$ noise added to all entries of $B$. Sample size $n$ fixed. (Plot: rejection rate vs. perturbation SD $\sigma_{\mathrm{per}}$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.) KSD ($O(n^2)$) and FSSD-opt ($O(n)$) are comparable; LKS has low power.
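Assuming the Gaussian-Bernoulli form above (with the $c^\top h$ term), a hedged sketch of the model score $\nabla_x \log p(x)$, which is all that FSSD and KSD need; the normalizer $Z$ never appears.

```python
import numpy as np

def rbm_grad_log_p(X, B, b, c):
    """grad_x log p(x) for the Gaussian-Bernoulli RBM with h in {-1, +1}^dh.

    Marginalizing h gives log p(x) = b^T x - ||x||^2 / 2
                                     + sum_j log cosh((B^T x)_j + c_j) + const,
    so the score is b - x + B tanh(B^T x + c).
    X: (n, d), B: (d, dh), b: (d,), c: (dh,).  Returns (n, d).
    """
    return b - X + np.tanh(X @ B + c) @ B.T

# Toy usage: this score can be plugged directly into the FSSD / KSD sketches above.
rng = np.random.default_rng(0)
d, dh = 50, 40
B, b, c = rng.normal(size=(d, dh)), rng.normal(size=d), rng.normal(size=dh)
X = rng.normal(size=(5, d))
print(rbm_grad_log_p(X, B, b, c).shape)   # (5, 50)
```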

108 Interpretable Test Locations: Chicago Crime. The data are robbery events in Chicago, in lat/long coordinates, treated as the sample from $q$. The spatial density is modelled with Gaussian mixtures.

111 Model $p$ = 2-component Gaussian mixture. The score surface over locations $v$ is computed; ★ = the optimized $v$, which falls in Lake Michigan, where there are no robberies.

115 Model $p$ = 10-component Gaussian mixture. This captures the right tail better, but still does not capture the left tail. The learned test locations are interpretable.
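The slides model the spatial density with Gaussian mixtures; below is a hedged sketch (not the authors' code, and assuming scikit-learn as the fitting tool) of how a fitted mixture's score $\nabla_x \log p(x)$ could be obtained and plugged into the FSSD/KSD sketches above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_grad_log_p(X, gmm):
    """grad_x log p(x) for a fitted sklearn GaussianMixture (full covariances).

    log p(x) = log sum_k w_k N(x; mu_k, S_k)  =>  grad = sum_k r_k(x) S_k^{-1} (mu_k - x),
    where r_k(x) are the posterior responsibilities.
    """
    resp = gmm.predict_proba(X)                          # (n, K) responsibilities
    grad = np.zeros_like(X)
    for k in range(gmm.n_components):
        Sk_inv = np.linalg.inv(gmm.covariances_[k])
        grad += resp[:, [k]] * ((gmm.means_[k] - X) @ Sk_inv)
    return grad

# Toy usage: fit a 2-component mixture to 2-D points standing in for lat/long data.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.3, (300, 2)), rng.normal([2, 1], 0.5, (300, 2))])
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(pts)
print(gmm_grad_log_p(pts[:3], gmm))
```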

119 Bahadur Slope and Bahadur Efficiency. The Bahadur slope measures the rate at which the p-value goes to 0 under $H_1$ as $n \to \infty$, i.e. a test's sensitivity to the departure from $H_0: \theta = 0$ vs. $H_1: \theta \ne 0$. Typically $\mathrm{pval}_n \approx \exp\!\left(-\frac{1}{2} c(\theta)\, n\right)$, where $c(\theta) > 0$ under $H_1$ and $c(0) = 0$ [Bahadur, 1960]; a higher $c(\theta)$ means a more sensitive test. Good. The Bahadur slope is $c(\theta) := -2\, \operatorname{plim}_{n\to\infty} \frac{\log\left(1 - F(T_n)\right)}{n}$, where $F$ is the CDF of the test statistic $T_n$ under $H_0$. The Bahadur efficiency is the ratio of the slopes of two tests. (Plot: p-values of two statistics $T_n^{(1)}$, $T_n^{(2)}$ decaying at different rates in $n$.)
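A short worked consequence of the approximation above (my own arithmetic, not from the slides):

```latex
\text{If } \mathrm{pval}_n \approx \exp\!\left(-\tfrac{1}{2}\, c(\theta)\, n\right)
\ \text{ then }\ -\tfrac{2}{n}\,\log \mathrm{pval}_n \;\longrightarrow\; c(\theta),
\qquad\text{and}\qquad
\exp\!\left(-\tfrac{1}{2}\,(2c)\,\tfrac{n}{2}\right) \;=\; \exp\!\left(-\tfrac{1}{2}\,c\,n\right).
```

That is, a test with twice the Bahadur slope reaches the same p-value with roughly half the sample, which is how Theorem 2 below should be read.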

123 Gaussian Mean Shift Problem. Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, 1)$, and assume $J = 1$ location for $n\, \widehat{\mathrm{FSSD}^2}$. With Gaussian kernels (bandwidth $\sigma_k^2$ for FSSD and $\sigma^2$ for LKS), the Bahadur slopes $c^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)$ and $c^{(\mathrm{LKS})}(\mu_q, \sigma^2)$ admit closed-form expressions.
Theorem 2 (FSSD is at least two times more efficient). Fix $\sigma_k^2 = 1$ for $n\, \widehat{\mathrm{FSSD}^2}$. Then for all $\mu_q \ne 0$ there exists $v \in \mathbb{R}$ such that for all $\sigma^2 > 0$, the Bahadur efficiency satisfies $\frac{c^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)}{c^{(\mathrm{LKS})}(\mu_q, \sigma^2)} > 2$.

127 Conclusion. Proposed the Finite Set Stein Discrepancy (FSSD). The goodness-of-fit test based on FSSD is (1) nonparametric, (2) linear-time, (3) tunable (parameters automatically tuned), and (4) interpretable. A Linear-Time Kernel Goodness-of-Fit Test. Wittawat Jitkrittum, Wenkai Xu, Zoltán Szabó, Kenji Fukumizu, Arthur Gretton. NIPS 2017 (best paper award). Python code is available.

128 Questions? Thank you.

129 Illustration: Score Surface. Consider $J = 1$ location. The plots show $\mathrm{score}(v) = \widehat{\mathrm{FSSD}^2}(v)/\hat\sigma_{H_1}(v)$ in gray, $p$ as a wireframe, $\{x_i\}_{i=1}^n \sim q$ in purple, and ★ = the best $v$. Examples: two zero-mean Gaussians with different covariances, and $p = \mathcal{N}(0, I)$ vs. $q$ = Laplace with the same mean and variance.

131 FSSD and KSD in the 1-D Gaussian Case. Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2)$, and assume $J = 1$ feature for $n\, \widehat{\mathrm{FSSD}^2}$, with a Gaussian kernel (bandwidth $\sigma_k^2$). $\mathrm{FSSD}^2$ has a closed form, and if $\mu_q \ne 0$ and $\sigma_q^2 \ne 1$ there is a particular $v$ (depending on $\mu_q$, $\sigma_q^2$, $\sigma_k^2$) at which $\mathrm{FSSD}^2 = 0$. This is why $v$ should be drawn from a distribution with a density. KSD with a Gaussian kernel (bandwidth $\sigma^2$) also has a closed form.

133 FSSD is a Discrepancy Measure. Theorem 3. Let $V = \{v_1, \dots, v_J\} \subset \mathbb{R}^d$ be drawn i.i.d. from a distribution $\eta$ which has a density. Let $\mathcal{X}$ be a connected open set in $\mathbb{R}^d$. Assume (1) (Nice RKHS) the kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is $C_0$-universal and real analytic; (2) (Stein witness not too rough) $\|g\|^2_{\mathcal{F}} < \infty$; (3) (Finite Fisher divergence) $\mathbb{E}_{x\sim q}\|\nabla_x \log \frac{p(x)}{q(x)}\|^2 < \infty$; (4) (Vanishing boundary) $\lim_{\|x\|\to\infty} p(x) g(x) = 0$. Then, for any $J \ge 1$, $\eta$-almost surely, $\mathrm{FSSD}^2 = 0$ if and only if $p = q$. With the Gaussian kernel $k(x, v) = \exp\!\left(-\frac{\|x - v\|_2^2}{2\sigma_k^2}\right)$, $J = 1$ or $J = 5$ works in practice.

134 Asymptotic Distributions of $\widehat{\mathrm{FSSD}^2}$. Recall $\xi(x, v) := \frac{1}{p(x)} \nabla_x\left[k(x, v)\, p(x)\right] \in \mathbb{R}^d$, and let $\tau(x)$ be the vertical stacking of $\xi(x, v_1), \dots, \xi(x, v_J) \in \mathbb{R}^{dJ}$, the feature vector of $x$. Mean feature: $\mu := \mathbb{E}_{x\sim q}[\tau(x)]$. Covariances: $\Sigma_r := \mathrm{cov}_{x\sim r}[\tau(x)] \in \mathbb{R}^{dJ \times dJ}$ for $r \in \{p, q\}$.
Proposition 2 (Asymptotic distributions). Let $Z_1, \dots, Z_{dJ} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and let $\{\omega_i\}_{i=1}^{dJ}$ be the eigenvalues of $\Sigma_p$.
(1) Under $H_0: p = q$, asymptotically $n\, \widehat{\mathrm{FSSD}^2} \xrightarrow{d} \sum_{i=1}^{dJ} (Z_i^2 - 1)\, \omega_i$. Easy to simulate to get a p-value; the simulation cost is independent of $n$.
(2) Under $H_1: p \ne q$, $\sqrt{n}\,(\widehat{\mathrm{FSSD}^2} - \mathrm{FSSD}^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{H_1}^2)$ where $\sigma_{H_1}^2 := 4\, \mu^\top \Sigma_q\, \mu$. This implies $\mathbb{P}(\text{reject } H_0) \to 1$ as $n \to \infty$.
But how to estimate $\Sigma_p$? There is no sample from $p$! Theorem: using $\hat\Sigma_q$ (computed with $\{x_i\}_{i=1}^n \sim q$) still leads to a consistent test.

138 Bahadur Slopes of FSSD and LKS. Theorem 4. The Bahadur slope of $n\, \widehat{\mathrm{FSSD}^2}$ is $c^{(\mathrm{FSSD})} := \mathrm{FSSD}^2/\omega_1$, where $\omega_1$ is the maximum eigenvalue of $\Sigma_p := \mathrm{cov}_{x\sim p}[\tau(x)]$.
Theorem 5. The Bahadur slope of the linear-time kernel Stein (LKS) statistic $\sqrt{n}\, \widehat{S}^2_l$ is $c^{(\mathrm{LKS})} = \frac{1}{2}\, \frac{\left[\mathbb{E}_q\, h_p(x, x')\right]^2}{\mathbb{E}_p\!\left[h_p^2(x, x')\right]}$, where $h_p$ is the U-statistic kernel of the KSD statistic.

140 Harder RBM Problem. Perturb only one entry of $B \in \mathbb{R}^{50 \times 40}$ in the RBM: $B_{1,1} \leftarrow B_{1,1} + \mathcal{N}(0, \sigma_{\mathrm{per}}^2 = 0.1^2)$. (Plots: rejection rate and runtime vs. sample size $n$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.) Two-sample tests fail: samples from $p$ and $q$ look roughly the same. FSSD-opt is comparable to KSD at low $n$, and is one order of magnitude faster.

142 References.
Bahadur, R. R. (1960). Stochastic comparison of tests. The Annals of Mathematical Statistics, 31(2).
Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In ICML.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. JMLR, 13.
Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. (2016). Interpretable Distribution Features with Maximum Testing Power. In NIPS.
Liu, Q., Lee, J., and Jordan, M. (2016). A Kernelized Stein Discrepancy for Goodness-of-fit Tests. In ICML.


More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

From graph to manifold Laplacian: The convergence rate

From graph to manifold Laplacian: The convergence rate Appl. Comput. Harmon. Anal. 2 (2006) 28 34 www.elsevier.com/locate/acha Letter to the Editor From graph to manifold Laplacian: The convergence rate A. Singer Department of athematics, Yale University,

More information

Markov Chain Monte Carlo Methods for Stochastic Optimization

Markov Chain Monte Carlo Methods for Stochastic Optimization Markov Chain Monte Carlo Methods for Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge U of Toronto, MIE,

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

5 Introduction to the Theory of Order Statistics and Rank Statistics

5 Introduction to the Theory of Order Statistics and Rank Statistics 5 Introduction to the Theory of Order Statistics and Rank Statistics This section will contain a summary of important definitions and theorems that will be useful for understanding the theory of order

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

1 Exponential manifold by reproducing kernel Hilbert spaces

1 Exponential manifold by reproducing kernel Hilbert spaces 1 Exponential manifold by reproducing kernel Hilbert spaces 1.1 Introduction The purpose of this paper is to propose a method of constructing exponential families of Hilbert manifold, on which estimation

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

March 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University

March 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels March 13, 2008 Paper: R.R. Coifman, S. Lafon, maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels Figure: Example Application from [LafonWWW] meaningful geometric

More information

Kernel Measures of Conditional Dependence

Kernel Measures of Conditional Dependence Kernel Measures of Conditional Dependence Kenji Fukumizu Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku Tokyo 6-8569 Japan fukumizu@ism.ac.jp Arthur Gretton Max-Planck Institute for

More information

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages 27 April 2015 Wittawat Jitkrittum 1, Arthur Gretton 1, Nicolas Heess, S. M. Ali Eslami, Balaji Lakshminarayanan 1, Dino Sejdinovic

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Practice Problem - Skewness of Bernoulli Random Variable. Lecture 7: Joint Distributions and the Law of Large Numbers. Joint Distributions - Example

Practice Problem - Skewness of Bernoulli Random Variable. Lecture 7: Joint Distributions and the Law of Large Numbers. Joint Distributions - Example A little more E(X Practice Problem - Skewness of Bernoulli Random Variable Lecture 7: and the Law of Large Numbers Sta30/Mth30 Colin Rundel February 7, 014 Let X Bern(p We have shown that E(X = p Var(X

More information

Robust Support Vector Machines for Probability Distributions

Robust Support Vector Machines for Probability Distributions Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,

More information

Autoencoders and Score Matching. Based Models. Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas

Autoencoders and Score Matching. Based Models. Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas On for Energy Based Models Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas Toronto Machine Learning Group Meeting, 2011 Motivation Models Learning Goal: Unsupervised

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

19 : Slice Sampling and HMC

19 : Slice Sampling and HMC 10-708: Probabilistic Graphical Models 10-708, Spring 2018 19 : Slice Sampling and HMC Lecturer: Kayhan Batmanghelich Scribes: Boxiang Lyu 1 MCMC (Auxiliary Variables Methods) In inference, we are often

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression JMLR: Workshop and Conference Proceedings : 9, 08. Symposium on Advances in Approximate Bayesian Inference Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

More information

Monte Carlo and cold gases. Lode Pollet.

Monte Carlo and cold gases. Lode Pollet. Monte Carlo and cold gases Lode Pollet lpollet@physics.harvard.edu 1 Outline Classical Monte Carlo The Monte Carlo trick Markov chains Metropolis algorithm Ising model critical slowing down Quantum Monte

More information