A Linear-Time Kernel Goodness-of-Fit Test


1 A Linear-Time Kernel Goodness-of-Fit Test. Wittawat Jitkrittum 1, Wenkai Xu 1, Zoltán Szabó 2, Kenji Fukumizu 3, Arthur Gretton 1 (NIPS 2017 best paper). wittawatj@gmail.com. 1 Gatsby Unit, University College London (now at Max Planck Institute for Intelligent Systems). 2 CMAP, École Polytechnique. 3 The Institute of Statistical Mathematics, Tokyo. Workshop on Functional Inference and Machine Intelligence, Tokyo, 20 February.

2 Model Criticism.

4 Model Criticism. Data: robbery events in Chicago.

5 Model Criticism. Is this a good model?

6 Model Criticism. Goals: (1) Test whether a (complicated) model fits the data. (2) If it does not, show a location where it fails.

8 Goodness-of-fit Testing. Given: (1) a sample $\{x_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} q$ (unknown) on $\mathbb{R}^d$; (2) an unnormalized density $p$ (the known model). Test $H_0: p = q$ against $H_1: p \neq q$. We want a test that is (1) nonparametric, (2) linear-time: runtime is $O(n)$, fast, and (3) interpretable: model criticism by finding a test location at which the model fails.

11 Maximum Mean Discrepancy (MMD) Witness Function [Gretton et al., 2012]. Observe $X = \{x_1, \dots, x_n\} \sim q$ and $Y = \{y_1, \dots, y_n\} \sim p$, and place a Gaussian kernel $k_v$ on each $x_i$ and each $y_i$. Mean embeddings: $\mu_q(v) = \mathbb{E}_{x \sim q}\, k_v(x)$ and $\mu_p(v) = \mathbb{E}_{y \sim p}\, k_v(y)$ (the mean embedding of $p$). The witness function is $\mathrm{witness}(v) = \mu_q(v) - \mu_p(v)$, and $\mathrm{MMD}(q, p) = \|\mathrm{witness}\|$. The squared witness $\mathrm{witness}^2(v) = (\mu_q(v) - \mu_p(v))^2$ can be used to find a good test location $v^*$.

19 Model Criticism by the MMD Witness. Find a location $v$ at which $q$ and $p$ differ most (ME test) [Jitkrittum et al., 2016]:
$\mathrm{witness}(v) = \mathbb{E}_{x\sim q}[k_v(x)] - \mathbb{E}_{y\sim p}[k_v(y)]$,
$\mathrm{score}(v) = \dfrac{\mathrm{witness}^2(v)}{\mathrm{noise}(v)} = \dfrac{\mathrm{witness}^2(v)}{\sqrt{\mathbb{V}_{x\sim q}[k_v(x)] + \mathbb{V}_{y\sim p}[k_v(y)]}}$.
Sweeping the kernel $k_v$ across the domain, the score rises (e.g. from 1.6 to 13) and peaks (at 25) at the best $v$. Problem: there is no sample from $p$, and one may be difficult to generate.
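For concreteness, a minimal sketch (illustrative, not the authors' code) of the empirical witness and signal-to-noise score at a single location $v$, given samples $X \sim q$ and $Y \sim p$ and a Gaussian kernel:

```python
import numpy as np

def gauss_kernel(X, v, sigma2=1.0):
    """k_v(x) = exp(-||x - v||^2 / (2 sigma2)), evaluated for each row of X."""
    d2 = np.sum((X - v) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma2))

def me_score(X, Y, v, sigma2=1.0, reg=1e-8):
    """Signal-to-noise score witness^2(v) / noise(v) at a single location v.

    witness(v) = mean_x k_v(x) - mean_y k_v(y)
    noise(v)   = sqrt( Var_x[k_v(x)] + Var_y[k_v(y)] )  (plus a small regulariser)
    """
    kx, ky = gauss_kernel(X, v, sigma2), gauss_kernel(Y, v, sigma2)
    witness = kx.mean() - ky.mean()
    noise = np.sqrt(kx.var() + ky.var()) + reg
    return witness ** 2 / noise

# Toy usage: q = N(1, 1), p = N(0, 1); the score is larger near the discrepancy.
rng = np.random.default_rng(0)
X = rng.normal(1.0, 1.0, size=(500, 1))   # sample from q
Y = rng.normal(0.0, 1.0, size=(500, 1))   # sample from p (the model)
print(me_score(X, Y, v=np.array([1.5])), me_score(X, Y, v=np.array([-3.0])))
```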

28 The Stein Witness Function [Liu et al., 2016, Chwialkowski et al., 2016]. Problem: there is no sample from $p$, so $\mathbb{E}_{y\sim p}[k_v(y)]$ cannot be estimated. Stein witness: $\mathrm{witness}(v) = \mathbb{E}_{x\sim q}[(T_p k_v)(x)] - \mathbb{E}_{y\sim p}[(T_p k_v)(y)]$. Idea: define $T_p$ such that $\mathbb{E}_{y\sim p}(T_p k_v)(y) = 0$ for any $v$, so that $\mathrm{witness}(v) = \mathbb{E}_{x\sim q}[(T_p k_v)(x)]$ depends only on the sample from $q$. Proposal: a good $v$ should have a high signal-to-noise ratio $\mathrm{score}(v) = \mathrm{witness}^2(v)/\mathrm{noise}(v)$, and $\mathrm{score}(v)$ can be estimated in linear time. Goodness-of-fit test: (1) find $v^* = \arg\max_v \mathrm{score}(v)$; (2) reject $H_0$ if $\mathrm{witness}^2(v^*) >$ threshold.

39 Proposal: Model Criticism with the Stein Witness. $\mathrm{score}(v) = \mathrm{witness}^2(v)/\mathrm{noise}(v)$. The Stein-transformed kernel $(T_p k_v)(x)$ is plotted at the current location $v$; sweeping $v$ across the domain, the estimated score varies between roughly 0.16 and 0.49, and the $v$ with the highest score marks where the model fits the data worst.

63 Theory: (1) What is $T_p k_v$? (2) The test statistic. (3) Distributions of the test statistic; the test threshold. (4) What does $v^* = \arg\max_v \mathrm{score}(v)$ do theoretically?

64 (1) What is $T_p k_v$? Recall $\mathrm{witness}(v) = \mathbb{E}_{x\sim q}(T_p k_v)(x) - \mathbb{E}_{y\sim p}(T_p k_v)(y)$. Define
$(T_p k_v)(y) = \frac{1}{p(y)} \frac{d}{dy}\left[k_v(y)\, p(y)\right]$ (the normalizer of $p$ cancels). Then $\mathbb{E}_{y\sim p}(T_p k_v)(y) = 0$ [Liu et al., 2016, Chwialkowski et al., 2016].
Proof: $\mathbb{E}_{y\sim p}[(T_p k_v)(y)] = \int_{-\infty}^{\infty} \frac{1}{p(y)} \frac{d}{dy}[k_v(y) p(y)]\, p(y)\, dy = \int_{-\infty}^{\infty} \frac{d}{dy}[k_v(y) p(y)]\, dy = \left[k_v(y) p(y)\right]_{y=-\infty}^{y=\infty} = 0$, assuming $\lim_{|y|\to\infty} k_v(y) p(y) = 0$.
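Expanding the operator gives $(T_p k_v)(x) = k_v(x)\,\frac{d}{dx}\log p(x) + \frac{d}{dx} k_v(x)$, so only the derivative of $\log p$ is needed and the normalizer never appears. A small sketch (my own illustration, not the authors' code) that checks the Stein identity numerically for $p = \mathcal{N}(0,1)$ and a Gaussian kernel:

```python
import numpy as np

def stein_feature(y, v, sigma2=1.0):
    """(T_p k_v)(y) = k_v(y) * d/dy log p(y) + d/dy k_v(y) for the model p = N(0, 1)."""
    k = np.exp(-(y - v) ** 2 / (2.0 * sigma2))   # Gaussian kernel k_v(y)
    dlogp = -y                                   # d/dy log N(y; 0, 1)
    dk = -(y - v) / sigma2 * k                   # d/dy k_v(y)
    return k * dlogp + dk

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=200000)            # y ~ p = N(0, 1)
x = rng.normal(1.0, 1.0, size=200000)            # x ~ q = N(1, 1)
print(stein_feature(y, v=0.5).mean())            # ~ 0: Stein identity under p
print(stein_feature(x, v=0.5).mean())            # != 0: the witness g(v) under q
```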

74 (2) Proposal: The Finite Set Stein Discrepancy (FSSD). Recall the Stein witness $g(v) := \mathbb{E}_{x\sim q}\left[\frac{1}{p(x)} \frac{d}{dx}\left[k_v(x) p(x)\right]\right]$. FSSD statistic: evaluate $g^2$ at $J$ test locations $V = \{v_1, \dots, v_J\}$. Population FSSD: $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$. An unbiased estimator $\widehat{\mathrm{FSSD}^2}$ is computable in $O(d^2 J n)$ time ($d$ = input dimension).
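A minimal numpy sketch of this estimator under the stated assumptions (Gaussian kernel, score function $\nabla_x \log p$ available); function names are illustrative and this is not the released code:

```python
import numpy as np

def stein_features(X, V, grad_log_p, sigma2=1.0):
    """xi(x, v) = k_v(x) * grad_x log p(x) + grad_x k_v(x), stacked over the J locations.

    X: (n, d) sample from q;  V: (J, d) test locations;
    grad_log_p: function mapping (n, d) -> (n, d), the score of the (unnormalized) model.
    Returns Tau with shape (n, d*J).
    """
    n, d = X.shape
    G = grad_log_p(X)                                        # (n, d)
    feats = []
    for v in V:
        diff = X - v                                         # (n, d)
        k = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma2))  # (n,)
        dk = -diff / sigma2 * k[:, None]                     # grad_x k_v(x), (n, d)
        feats.append(k[:, None] * G + dk)                    # (n, d)
    return np.hstack(feats)                                  # (n, d*J)

def fssd2_unbiased(X, V, grad_log_p, sigma2=1.0):
    """Unbiased estimate of FSSD^2 = ||E_q tau(x)||^2 / (dJ); runtime linear in n."""
    Tau = stein_features(X, V, grad_log_p, sigma2)           # (n, dJ)
    n, dJ = Tau.shape
    s = Tau.sum(axis=0)
    u_stat = (s @ s - np.sum(Tau ** 2)) / (n * (n - 1))      # mean over i != j of tau_i . tau_j
    return u_stat / dJ

# Toy usage: model p = N(0, I_2), data drawn from a shifted Gaussian q.
rng = np.random.default_rng(0)
X = rng.normal([0.8, 0.0], 1.0, size=(1000, 2))
V = rng.normal(0.0, 2.0, size=(5, 2))                        # J = 5 random locations
print(fssd2_unbiased(X, V, grad_log_p=lambda x: -x))         # > 0 when p != q
```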

78 (2) FSSD is a Discrepancy Measure. $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$. Theorem 1 (FSSD is a discrepancy measure). Main conditions: (1) (Nice kernel) the kernel $k$ is $C_0$-universal and real analytic, e.g. the Gaussian kernel. (2) (Vanishing boundary) $\lim_{\|x\|\to\infty} p(x) k_v(x) = 0$. (3) (Avoid "blind spots") the locations $v_1, \dots, v_J$ are drawn from a distribution which has a density. Then, for any $J \ge 1$, almost surely (over the draw of the locations), $\mathrm{FSSD}^2 = 0 \iff p = q$. Summary: evaluating the witness at a few random locations is sufficient to detect the discrepancy between $p$ and $q$.

79 (2) What Are "Blind Spots"? Recall $g(v) := \mathbb{E}_{x\sim q}\left[\frac{1}{p(x)} \frac{d}{dx}\left[k_v(x) p(x)\right]\right]$. Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(0, \sigma_q^2)$, with a unit-width Gaussian kernel. Then $g(v)$ has a closed form proportional to $\frac{v\,(\sigma_q^2 - 1)\, \exp\!\left(-\frac{v^2}{2 + 2\sigma_q^2}\right)}{(\sigma_q^2 + 1)^{3/2}}$ (plotted for $p = \mathcal{N}(0, 1)$, $q = \mathcal{N}(0, 4)$). If $v = 0$, then $\mathrm{FSSD}^2 = g^2(v) = 0$ regardless of $\sigma_q^2$. If $g \ne 0$ and $k$ is real analytic, then $R = \{v \mid g(v) = 0\}$ (the blind spots) has zero Lebesgue measure. So if $v \sim$ a distribution with a density, then $v \notin R$ almost surely.
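A quick Monte-Carlo check of the blind spot (my own illustration, not from the slides): for $p = \mathcal{N}(0,1)$, $q = \mathcal{N}(0,4)$ and a unit-width Gaussian kernel, the estimated $g(v)$ is near zero at $v = 0$ but clearly nonzero away from it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=200000)         # x ~ q = N(0, 4), i.e. standard deviation 2

def g_hat(v, sigma2=1.0):
    """Monte-Carlo estimate of the Stein witness g(v) for the model p = N(0, 1)."""
    k = np.exp(-(x - v) ** 2 / (2 * sigma2))  # unit-width Gaussian kernel
    return np.mean(k * (-x) - (x - v) / sigma2 * k)

print(g_hat(0.0))   # ~ 0 even though p != q: v = 0 is a blind spot
print(g_hat(2.0))   # clearly nonzero
```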

84 (3) Asymptotic Distributions of $\hat\lambda_n := n\, \widehat{\mathrm{FSSD}^2}$. Under $H_0: p = q$, asymptotically $\hat\lambda_n := n\, \widehat{\mathrm{FSSD}^2} \xrightarrow{d} \sum_{i=1}^{dJ} (Z_i^2 - 1)\, \omega_i$, where $Z_1, \dots, Z_{dJ} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and $\{\omega_i\}_{i=1}^{dJ}$ are non-negative, computable quantities; the test threshold $T_\alpha$ is taken from this null distribution $P_{H_0}(\hat\lambda_n)$. Under $H_1: p \ne q$, asymptotically $\sqrt{n}\,(\widehat{\mathrm{FSSD}^2} - \mathrm{FSSD}^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{H_1}^2)$, where $\sigma_{H_1}$ plays the role of $\mathrm{noise}(V)$ in $\mathrm{score}(V) = \mathrm{witness}^2(V)/\mathrm{noise}(V)$.
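A minimal sketch (my own, not the authors' implementation) of simulating this weighted chi-square null to get a threshold and p-value; it plugs in the empirical feature covariance computed from the sample itself, in the spirit of the consistency result quoted in the appendix slides (Proposition 2).

```python
import numpy as np

def null_simulate(Tau, n_sim=3000, alpha=0.05, seed=1):
    """Simulate the asymptotic null sum_i (Z_i^2 - 1) * omega_i; return (threshold, draws).

    Tau: (n, dJ) matrix of Stein features tau(x_i) (see the FSSD sketch above).
    The omega_i are taken as eigenvalues of the empirical feature covariance,
    used here as a stand-in for the unknown covariance under p.
    """
    n, dJ = Tau.shape
    cov = np.atleast_2d(np.cov(Tau, rowvar=False)) / dJ     # match the 1/(dJ) scaling of FSSD^2
    omegas = np.maximum(np.linalg.eigvalsh(cov), 0.0)       # non-negative weights
    rng = np.random.default_rng(seed)
    Z2 = rng.chisquare(df=1, size=(n_sim, omegas.size))     # Z_i^2 with Z_i ~ N(0, 1)
    draws = (Z2 - 1.0) @ omegas
    return np.quantile(draws, 1.0 - alpha), draws

# Usage (continuing the toy example above): reject H0 if n * FSSD2_hat > threshold.
# Tau = stein_features(X, V, grad_log_p=lambda x: -x)
# thresh, draws = null_simulate(Tau)
# p_value = np.mean(draws >= Tau.shape[0] * fssd2_unbiased(X, V, lambda x: -x))
```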

90 (4) What Does $\arg\max_v \mathrm{score}(v)$ Do? Proposition 1 (Asymptotic test power). For large $n$, the test power is
$\mathbb{P}(\text{reject } H_0 \mid H_1 \text{ true}) = \mathbb{P}_{H_1}\!\left(n\, \widehat{\mathrm{FSSD}^2} > T_\alpha\right) \approx \Phi\!\left(\sqrt{n}\, \frac{\mathrm{FSSD}^2}{\sigma_{H_1}} - \frac{T_\alpha}{\sqrt{n}\, \sigma_{H_1}}\right)$,
where $\Phi$ is the CDF of $\mathcal{N}(0, 1)$. For large $n$, the term $\sqrt{n}\, \mathrm{FSSD}^2/\sigma_{H_1}$ dominates, so
$\arg\max_{V, \sigma_k^2} \mathbb{P}_{H_1}\!\left(n\, \widehat{\mathrm{FSSD}^2} > T_\alpha\right) \approx \arg\max_{V, \sigma_k^2} \left[\frac{\widehat{\mathrm{FSSD}^2}}{\hat\sigma_{H_1}}\right] = \arg\max_{V, \sigma_k^2} \mathrm{score}(V, \sigma_k^2)$.
Maximizing $\mathrm{score}(V, \sigma_k^2)$ thus maximizes the test power. In practice, split $\{x_i\}_{i=1}^n$ into independent training/test sets: optimize on the training set, then run the goodness-of-fit test on the test set.
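A crude illustration of this train/test recipe, reusing stein_features from the FSSD sketch above; the paper tunes $(V, \sigma_k^2)$ by gradient ascent, whereas this sketch merely random-searches, and the constants in the score are not matched exactly.

```python
import numpy as np

def score_hat(X, V, grad_log_p, sigma2, reg=1e-6):
    """Estimated power criterion FSSD2_hat / sigma_H1_hat (up to constants)."""
    Tau = stein_features(X, V, grad_log_p, sigma2)   # helper from the FSSD sketch above
    n, dJ = Tau.shape
    mu = Tau.mean(axis=0)
    fssd2 = (mu @ mu) / dJ                           # biased plug-in is fine for tuning
    sigma_h1 = np.sqrt(4.0 * mu @ np.cov(Tau, rowvar=False) @ mu) / dJ
    return fssd2 / (sigma_h1 + reg)

def tune_by_random_search(X_tr, grad_log_p, J=5, n_trials=50, seed=2):
    """Crude stand-in for the paper's gradient-based tuning: random-search V and sigma2."""
    rng = np.random.default_rng(seed)
    best = (-np.inf, None, None)
    for _ in range(n_trials):
        idx = rng.choice(len(X_tr), size=J, replace=False)
        V = X_tr[idx] + 0.5 * rng.normal(size=(J, X_tr.shape[1]))   # jittered data points
        sigma2 = np.exp(rng.uniform(np.log(0.1), np.log(10.0)))
        s = score_hat(X_tr, V, grad_log_p, sigma2)
        if s > best[0]:
            best = (s, V, sigma2)
    return best  # (score, V, sigma2), to be used on the held-out test split only

# Usage: X_tr, X_te = X[:200], X[200:]; tune on X_tr, then run the FSSD test on X_te.
```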

93 Related Works.

94 Kernel Stein Discrepancy (KSD) [Liu et al., 2016, Chwialkowski et al., 2016]. Recall the Stein witness $g(v) := \mathbb{E}_{x\sim q}\left[\frac{1}{p(x)} \frac{d}{dx}\left[k_v(x) p(x)\right]\right]$. KSD takes the RKHS norm of the whole witness, $\mathrm{KSD}^2 = \|g\|^2_{\mathrm{RKHS}}$: good when the difference between $p$ and $q$ is spatially diffuse. The proposed FSSD evaluates the witness at $J$ locations, $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$: good when the difference between $p$ and $q$ is local.

97 Kernel Stein Discrepancy (KSD): closed-form expression [Liu et al., 2016, Chwialkowski et al., 2016].
$\mathrm{KSD}^2 = \|g\|^2_{\mathrm{RKHS}} = \underbrace{\mathbb{E}_{x\sim q}\mathbb{E}_{y\sim q}\, h_p(x, y)}_{\text{double sums}}$,
where $h_p(x, y) := [\partial_x \log p(x)]\, k(x, y)\, [\partial_y \log p(y)] + [\partial_y \log p(y)]\, \partial_x k(x, y) + [\partial_x \log p(x)]\, \partial_y k(x, y) + \partial_x \partial_y k(x, y)$,
and $k$ is a kernel. The double sums make the estimator $O(d^2 n^2)$. Slow.

99 Linear-Time Kernel Stein Discrepancy (LKS). [Liu et al., 2016] also proposed a linear-time version of KSD. For $\{x_i\}_{i=1}^n \sim q$, the KSD test statistic is $\frac{2}{n(n-1)} \sum_{i<j} h_p(x_i, x_j)$, while the LKS test statistic is a running average $\frac{2}{n} \sum_{i=1}^{n/2} h_p(x_{2i-1}, x_{2i})$. Both are unbiased. LKS has $O(d^2 n)$ runtime, the same as the proposed FSSD, but it has high variance and hence poor test power.
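A hedged 1-D sketch of both estimators for a Gaussian kernel, using only $\partial_x \log p$; names are illustrative, not the authors' code.

```python
import numpy as np

def h_p(x, y, dlogp, sigma2=1.0):
    """U-statistic kernel h_p(x, y) of KSD for 1-D data and a Gaussian kernel."""
    d = x - y
    k = np.exp(-d ** 2 / (2 * sigma2))
    dkdx, dkdy = -d / sigma2 * k, d / sigma2 * k
    dkdxdy = (1.0 / sigma2 - d ** 2 / sigma2 ** 2) * k
    sx, sy = dlogp(x), dlogp(y)
    return sx * k * sy + sy * dkdx + sx * dkdy + dkdxdy

def ksd2_quadratic(x, dlogp, sigma2=1.0):
    """Quadratic-time unbiased KSD^2: (2 / (n(n-1))) * sum_{i<j} h_p(x_i, x_j)."""
    n = len(x)
    H = h_p(x[:, None], x[None, :], dlogp, sigma2)
    return (H.sum() - np.trace(H)) / (n * (n - 1))

def lks_linear(x, dlogp, sigma2=1.0):
    """Linear-time running average: (2/n) * sum_i h_p(x_{2i-1}, x_{2i})."""
    m = len(x) // 2
    return h_p(x[0:2 * m:2], x[1:2 * m:2], dlogp, sigma2).mean()

rng = np.random.default_rng(0)
x = rng.laplace(0.0, 1.0 / np.sqrt(2), size=2000)   # q = Laplace with unit variance
print(ksd2_quadratic(x, dlogp=lambda t: -t),        # model p = N(0, 1)
      lks_linear(x, dlogp=lambda t: -t))
```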

100 Simulation Settings. Gaussian kernel $k(x, v) = \exp\!\left(-\frac{\|x - v\|_2^2}{2\sigma_k^2}\right)$. Methods compared:
(1) FSSD-opt: proposed, with optimization, $J = 5$.
(2) FSSD-rand: proposed, random test locations.
(3) KSD: quadratic-time kernel Stein discrepancy [Liu et al., 2016, Chwialkowski et al., 2016].
(4) LKS: linear-time running-average version of KSD.
(5) MMD-opt: MMD two-sample test [Gretton et al., 2012], with optimization.
(6) ME-test: Mean Embeddings two-sample test [Jitkrittum et al., 2016], with optimization.
The two-sample tests need to draw a sample from $p$. Tests with optimization use 20% of the data. Significance level $\alpha = 0.05$.

103 Gaussian vs. Laplace. $p$ = Gaussian, $q$ = Laplace, with the same mean and variance; the high-order moments differ. (Plots: rejection rate vs. sample size $n$ and vs. dimension $d$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.) Optimization increases the power. Two-sample tests can perform well in this case, since $p$ and $q$ clearly differ.

104 Gaussian-Bernoulli Restricted Boltzmann Machine (RBM). $p(x)$ is the marginal of $p(x, h) = \frac{1}{Z} \exp\!\left(x^\top B h + b^\top x + c^\top h - \frac{1}{2}\|x\|^2\right)$, where $x \in \mathbb{R}^{50}$ and $h \in \{-1, 1\}^{40}$ is latent. $B$, $b$, $c$ are picked randomly. $q(x) = p(x)$ with i.i.d. $\mathcal{N}(0, \sigma_{\mathrm{per}}^2)$ noise added to all entries of $B$. Sample size $n$ fixed. (Plot: rejection rate vs. perturbation SD $\sigma_{\mathrm{per}}$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.) KSD ($O(n^2)$) and FSSD-opt ($O(n)$) are comparable; LKS has low power.
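Assuming the Gaussian-Bernoulli form above (with the $c^\top h$ term), a hedged sketch of the model score $\nabla_x \log p(x)$, which is all that FSSD and KSD need; the normalizer $Z$ never appears.

```python
import numpy as np

def rbm_grad_log_p(X, B, b, c):
    """grad_x log p(x) for the Gaussian-Bernoulli RBM with h in {-1, +1}^dh.

    Marginalizing h gives log p(x) = b^T x - ||x||^2 / 2
                                     + sum_j log cosh((B^T x)_j + c_j) + const,
    so the score is b - x + B tanh(B^T x + c).
    X: (n, d), B: (d, dh), b: (d,), c: (dh,).  Returns (n, d).
    """
    return b - X + np.tanh(X @ B + c) @ B.T

# Toy usage: this score can be plugged directly into the FSSD / KSD sketches above.
rng = np.random.default_rng(0)
d, dh = 50, 40
B, b, c = rng.normal(size=(d, dh)), rng.normal(size=d), rng.normal(size=dh)
X = rng.normal(size=(5, d))
print(rbm_grad_log_p(X, B, b, c).shape)   # (5, 50)
```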

108 Interpretable Test Locations: Chicago Crime. The data are robbery events in Chicago, in lat/long coordinates, treated as the sample from $q$. The spatial density is modelled with Gaussian mixtures.

111 Model $p$ = 2-component Gaussian mixture. The score surface over locations $v$ is computed; ★ = the optimized $v$, which falls in Lake Michigan, where there are no robberies.

115 Model $p$ = 10-component Gaussian mixture. This captures the right tail better, but still does not capture the left tail. The learned test locations are interpretable.
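The slides model the spatial density with Gaussian mixtures; below is a hedged sketch (not the authors' code, and assuming scikit-learn as the fitting tool) of how a fitted mixture's score $\nabla_x \log p(x)$ could be obtained and plugged into the FSSD/KSD sketches above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_grad_log_p(X, gmm):
    """grad_x log p(x) for a fitted sklearn GaussianMixture (full covariances).

    log p(x) = log sum_k w_k N(x; mu_k, S_k)  =>  grad = sum_k r_k(x) S_k^{-1} (mu_k - x),
    where r_k(x) are the posterior responsibilities.
    """
    resp = gmm.predict_proba(X)                          # (n, K) responsibilities
    grad = np.zeros_like(X)
    for k in range(gmm.n_components):
        Sk_inv = np.linalg.inv(gmm.covariances_[k])
        grad += resp[:, [k]] * ((gmm.means_[k] - X) @ Sk_inv)
    return grad

# Toy usage: fit a 2-component mixture to 2-D points standing in for lat/long data.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.3, (300, 2)), rng.normal([2, 1], 0.5, (300, 2))])
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(pts)
print(gmm_grad_log_p(pts[:3], gmm))
```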

119 Bahadur Slope and Bahadur Efficiency. The Bahadur slope measures the rate at which the p-value goes to 0 under $H_1$ as $n \to \infty$, i.e. a test's sensitivity to the departure from $H_0: \theta = 0$ vs. $H_1: \theta \ne 0$. Typically $\mathrm{pval}_n \approx \exp\!\left(-\frac{1}{2} c(\theta)\, n\right)$, where $c(\theta) > 0$ under $H_1$ and $c(0) = 0$ [Bahadur, 1960]; a higher $c(\theta)$ means a more sensitive test. Good. The Bahadur slope is $c(\theta) := -2\, \operatorname{plim}_{n\to\infty} \frac{\log\left(1 - F(T_n)\right)}{n}$, where $F$ is the CDF of the test statistic $T_n$ under $H_0$. The Bahadur efficiency is the ratio of the slopes of two tests. (Plot: p-values of two statistics $T_n^{(1)}$, $T_n^{(2)}$ decaying at different rates in $n$.)
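A short worked consequence of the approximation above (my own arithmetic, not from the slides):

```latex
\text{If } \mathrm{pval}_n \approx \exp\!\left(-\tfrac{1}{2}\, c(\theta)\, n\right)
\ \text{ then }\ -\tfrac{2}{n}\,\log \mathrm{pval}_n \;\longrightarrow\; c(\theta),
\qquad\text{and}\qquad
\exp\!\left(-\tfrac{1}{2}\,(2c)\,\tfrac{n}{2}\right) \;=\; \exp\!\left(-\tfrac{1}{2}\,c\,n\right).
```

That is, a test with twice the Bahadur slope reaches the same p-value with roughly half the sample, which is how Theorem 2 below should be read.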

123 Gaussian Mean Shift Problem. Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, 1)$, and assume $J = 1$ location for $n\, \widehat{\mathrm{FSSD}^2}$. With Gaussian kernels (bandwidth $\sigma_k^2$ for FSSD and $\sigma^2$ for LKS), the Bahadur slopes $c^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)$ and $c^{(\mathrm{LKS})}(\mu_q, \sigma^2)$ admit closed-form expressions.
Theorem 2 (FSSD is at least two times more efficient). Fix $\sigma_k^2 = 1$ for $n\, \widehat{\mathrm{FSSD}^2}$. Then for all $\mu_q \ne 0$ there exists $v \in \mathbb{R}$ such that for all $\sigma^2 > 0$, the Bahadur efficiency satisfies $\frac{c^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)}{c^{(\mathrm{LKS})}(\mu_q, \sigma^2)} > 2$.

127 Conclusion. Proposed the Finite Set Stein Discrepancy (FSSD). The goodness-of-fit test based on FSSD is (1) nonparametric, (2) linear-time, (3) tunable (parameters automatically tuned), and (4) interpretable. A Linear-Time Kernel Goodness-of-Fit Test. Wittawat Jitkrittum, Wenkai Xu, Zoltán Szabó, Kenji Fukumizu, Arthur Gretton. NIPS 2017 (best paper award). Python code is available.

128 Questions? Thank you.

129 Illustration: Score Surface. Consider $J = 1$ location. The plots show $\mathrm{score}(v) = \widehat{\mathrm{FSSD}^2}(v)/\hat\sigma_{H_1}(v)$ in gray, $p$ as a wireframe, $\{x_i\}_{i=1}^n \sim q$ in purple, and ★ = the best $v$. Examples: two zero-mean Gaussians with different covariances, and $p = \mathcal{N}(0, I)$ vs. $q$ = Laplace with the same mean and variance.

131 FSSD and KSD in the 1-D Gaussian Case. Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2)$, and assume $J = 1$ feature for $n\, \widehat{\mathrm{FSSD}^2}$, with a Gaussian kernel (bandwidth $\sigma_k^2$). $\mathrm{FSSD}^2$ has a closed form, and if $\mu_q \ne 0$ and $\sigma_q^2 \ne 1$ there is a particular $v$ (depending on $\mu_q$, $\sigma_q^2$, $\sigma_k^2$) at which $\mathrm{FSSD}^2 = 0$. This is why $v$ should be drawn from a distribution with a density. KSD with a Gaussian kernel (bandwidth $\sigma^2$) also has a closed form.

133 FSSD is a Discrepancy Measure. Theorem 3. Let $V = \{v_1, \dots, v_J\} \subset \mathbb{R}^d$ be drawn i.i.d. from a distribution $\eta$ which has a density. Let $\mathcal{X}$ be a connected open set in $\mathbb{R}^d$. Assume (1) (Nice RKHS) the kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is $C_0$-universal and real analytic; (2) (Stein witness not too rough) $\|g\|^2_{\mathcal{F}} < \infty$; (3) (Finite Fisher divergence) $\mathbb{E}_{x\sim q}\|\nabla_x \log \frac{p(x)}{q(x)}\|^2 < \infty$; (4) (Vanishing boundary) $\lim_{\|x\|\to\infty} p(x) g(x) = 0$. Then, for any $J \ge 1$, $\eta$-almost surely, $\mathrm{FSSD}^2 = 0$ if and only if $p = q$. With the Gaussian kernel $k(x, v) = \exp\!\left(-\frac{\|x - v\|_2^2}{2\sigma_k^2}\right)$, $J = 1$ or $J = 5$ works in practice.

134 Asymptotic Distributions of $\widehat{\mathrm{FSSD}^2}$. Recall $\xi(x, v) := \frac{1}{p(x)} \nabla_x\left[k(x, v)\, p(x)\right] \in \mathbb{R}^d$, and let $\tau(x)$ be the vertical stacking of $\xi(x, v_1), \dots, \xi(x, v_J) \in \mathbb{R}^{dJ}$, the feature vector of $x$. Mean feature: $\mu := \mathbb{E}_{x\sim q}[\tau(x)]$. Covariances: $\Sigma_r := \mathrm{cov}_{x\sim r}[\tau(x)] \in \mathbb{R}^{dJ \times dJ}$ for $r \in \{p, q\}$.
Proposition 2 (Asymptotic distributions). Let $Z_1, \dots, Z_{dJ} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ and let $\{\omega_i\}_{i=1}^{dJ}$ be the eigenvalues of $\Sigma_p$.
(1) Under $H_0: p = q$, asymptotically $n\, \widehat{\mathrm{FSSD}^2} \xrightarrow{d} \sum_{i=1}^{dJ} (Z_i^2 - 1)\, \omega_i$. Easy to simulate to get a p-value; the simulation cost is independent of $n$.
(2) Under $H_1: p \ne q$, $\sqrt{n}\,(\widehat{\mathrm{FSSD}^2} - \mathrm{FSSD}^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{H_1}^2)$ where $\sigma_{H_1}^2 := 4\, \mu^\top \Sigma_q\, \mu$. This implies $\mathbb{P}(\text{reject } H_0) \to 1$ as $n \to \infty$.
But how to estimate $\Sigma_p$? There is no sample from $p$! Theorem: using $\hat\Sigma_q$ (computed with $\{x_i\}_{i=1}^n \sim q$) still leads to a consistent test.

138 Bahadur Slopes of FSSD and LKS. Theorem 4. The Bahadur slope of $n\, \widehat{\mathrm{FSSD}^2}$ is $c^{(\mathrm{FSSD})} := \mathrm{FSSD}^2/\omega_1$, where $\omega_1$ is the maximum eigenvalue of $\Sigma_p := \mathrm{cov}_{x\sim p}[\tau(x)]$.
Theorem 5. The Bahadur slope of the linear-time kernel Stein (LKS) statistic $\sqrt{n}\, \widehat{S}^2_l$ is $c^{(\mathrm{LKS})} = \frac{1}{2}\, \frac{\left[\mathbb{E}_q\, h_p(x, x')\right]^2}{\mathbb{E}_p\!\left[h_p^2(x, x')\right]}$, where $h_p$ is the U-statistic kernel of the KSD statistic.

140 Harder RBM Problem. Perturb only one entry of $B \in \mathbb{R}^{50 \times 40}$ in the RBM: $B_{1,1} \leftarrow B_{1,1} + \mathcal{N}(0, \sigma_{\mathrm{per}}^2 = 0.1^2)$. (Plots: rejection rate and runtime vs. sample size $n$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.) Two-sample tests fail: samples from $p$ and $q$ look roughly the same. FSSD-opt is comparable to KSD at low $n$, and is one order of magnitude faster.

142 References.
Bahadur, R. R. (1960). Stochastic comparison of tests. The Annals of Mathematical Statistics, 31(2).
Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In ICML.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. JMLR, 13.
Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. (2016). Interpretable Distribution Features with Maximum Testing Power. In NIPS.
Liu, Q., Lee, J., and Jordan, M. (2016). A Kernelized Stein Discrepancy for Goodness-of-fit Tests. In ICML.


More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

From graph to manifold Laplacian: The convergence rate

From graph to manifold Laplacian: The convergence rate Appl. Comput. Harmon. Anal. 2 (2006) 28 34 www.elsevier.com/locate/acha Letter to the Editor From graph to manifold Laplacian: The convergence rate A. Singer Department of athematics, Yale University,

More information

Markov Chain Monte Carlo Methods for Stochastic Optimization

Markov Chain Monte Carlo Methods for Stochastic Optimization Markov Chain Monte Carlo Methods for Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge U of Toronto, MIE,

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

5 Introduction to the Theory of Order Statistics and Rank Statistics

5 Introduction to the Theory of Order Statistics and Rank Statistics 5 Introduction to the Theory of Order Statistics and Rank Statistics This section will contain a summary of important definitions and theorems that will be useful for understanding the theory of order

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

1 Exponential manifold by reproducing kernel Hilbert spaces

1 Exponential manifold by reproducing kernel Hilbert spaces 1 Exponential manifold by reproducing kernel Hilbert spaces 1.1 Introduction The purpose of this paper is to propose a method of constructing exponential families of Hilbert manifold, on which estimation

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

March 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University

March 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels March 13, 2008 Paper: R.R. Coifman, S. Lafon, maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels Figure: Example Application from [LafonWWW] meaningful geometric

More information

Kernel Measures of Conditional Dependence

Kernel Measures of Conditional Dependence Kernel Measures of Conditional Dependence Kenji Fukumizu Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku Tokyo 6-8569 Japan fukumizu@ism.ac.jp Arthur Gretton Max-Planck Institute for

More information

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages 27 April 2015 Wittawat Jitkrittum 1, Arthur Gretton 1, Nicolas Heess, S. M. Ali Eslami, Balaji Lakshminarayanan 1, Dino Sejdinovic

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Practice Problem - Skewness of Bernoulli Random Variable. Lecture 7: Joint Distributions and the Law of Large Numbers. Joint Distributions - Example

Practice Problem - Skewness of Bernoulli Random Variable. Lecture 7: Joint Distributions and the Law of Large Numbers. Joint Distributions - Example A little more E(X Practice Problem - Skewness of Bernoulli Random Variable Lecture 7: and the Law of Large Numbers Sta30/Mth30 Colin Rundel February 7, 014 Let X Bern(p We have shown that E(X = p Var(X

More information

Robust Support Vector Machines for Probability Distributions

Robust Support Vector Machines for Probability Distributions Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,

More information

Autoencoders and Score Matching. Based Models. Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas

Autoencoders and Score Matching. Based Models. Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas On for Energy Based Models Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas Toronto Machine Learning Group Meeting, 2011 Motivation Models Learning Goal: Unsupervised

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

19 : Slice Sampling and HMC

19 : Slice Sampling and HMC 10-708: Probabilistic Graphical Models 10-708, Spring 2018 19 : Slice Sampling and HMC Lecturer: Kayhan Batmanghelich Scribes: Boxiang Lyu 1 MCMC (Auxiliary Variables Methods) In inference, we are often

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression JMLR: Workshop and Conference Proceedings : 9, 08. Symposium on Advances in Approximate Bayesian Inference Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

More information

Monte Carlo and cold gases. Lode Pollet.

Monte Carlo and cold gases. Lode Pollet. Monte Carlo and cold gases Lode Pollet lpollet@physics.harvard.edu 1 Outline Classical Monte Carlo The Monte Carlo trick Markov chains Metropolis algorithm Ising model critical slowing down Quantum Monte

More information