A Linear-Time Kernel Goodness-of-Fit Test
Wittawat Jitkrittum¹, Wenkai Xu¹, Zoltán Szabó², Kenji Fukumizu³, Arthur Gretton¹ — NIPS 2017 best paper. wittawatj@gmail.com
¹ Gatsby Unit, University College London (now at Max Planck Institute for Intelligent Systems)
² CMAP, École Polytechnique
³ The Institute of Statistical Mathematics, Tokyo
Workshop on Functional Inference and Machine Intelligence, Tokyo, 20 February
Model Criticism
Data: robbery events in Chicago.
Is this a good model?
Goals:
1. Test if a (complicated) model fits the data.
2. If it does not, show a location where it fails.
Goodness-of-fit Testing
Given:
1. a sample $\{x_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} q$ (unknown) on $\mathbb{R}^d$,
2. an unnormalized density $p$ (known model).
Test $H_0: p = q$ against $H_1: p \neq q$.
We want a test that is:
1. Nonparametric.
2. Linear-time: runtime is $O(n)$. Fast.
3. Interpretable: model criticism by finding a test location where the model fails.
Maximum Mean Discrepancy (MMD) Witness Function (Gretton et al., 2012)
Observe $X = \{x_1, \ldots, x_n\} \sim q$ and $Y = \{y_1, \ldots, y_n\} \sim p$.
Place a Gaussian kernel $k_v$ on each $x_i$ and each $y_i$, and average:
$\mu_q(v) = \mathbb{E}_{x \sim q}\, k_v(x)$ and $\mu_p(v) = \mathbb{E}_{y \sim p}\, k_v(y)$ (mean embedding of $p$).
$\text{witness}(v) = \mu_q(v) - \mu_p(v)$, and $\mathrm{MMD}(q, p) = \|\text{witness}\|$.
$\text{witness}^2(v)$ can be used to find a good test location $v$.
Model Criticism by the MMD Witness
Find a location $v$ at which $q$ and $p$ differ most (ME test) [Jitkrittum et al., 2016]:
$\text{witness}(v) = \mathbb{E}_{x \sim q}[k_v(x)] - \mathbb{E}_{y \sim p}[k_v(y)]$,
$\text{score}(v) = \dfrac{\text{witness}^2(v)}{\text{noise}(v)} = \dfrac{\text{witness}^2(v)}{\sqrt{\mathbb{V}_{x \sim q}[k_v(x)] + \mathbb{V}_{y \sim p}[k_v(y)]}}.$
[Animation: as $v$ sweeps the domain, the score rises from 1.6 to 13 to 25 at the best $v$.]
Problem: there is no sample from $p$, and $p$ can be difficult to generate from.
The Stein Witness Function [Liu et al., 2016, Chwialkowski et al., 2016]
Problem: no sample from $p$. Cannot estimate $\mathbb{E}_{y \sim p}[k_v(y)]$.
(Stein) $\text{witness}(v) = \mathbb{E}_{x \sim q}[(T_p k_v)(x)] - \mathbb{E}_{y \sim p}[(T_p k_v)(y)]$.
Idea: define the operator $T_p$ such that $\mathbb{E}_{y \sim p}(T_p k_v)(y) = 0$ for any $v$. Then
$\text{witness}(v) = \mathbb{E}_{x \sim q}[(T_p k_v)(x)]$,
which needs no sample from $p$.
Proposal: a good $v$ should have a high signal-to-noise ratio
$\text{score}(v) = \dfrac{\text{witness}^2(v)}{\text{noise}(v)}.$
$\text{score}(v)$ can be estimated in linear time.
Goodness-of-fit test:
1. Find $v^* = \arg\max_v \text{score}(v)$.
2. Reject $H_0$ if $\text{witness}^2(v^*) >$ threshold.
Proposal: Model Criticism with the Stein Witness
$\text{score}(v) = \dfrac{\text{witness}^2(v)}{\text{noise}(v)}.$
[Animation: $(T_p k_v)(x)$ as $v$ sweeps the domain; the score varies between roughly 0.16 and 0.49 and peaks where the model fits worst.]
Theory
1. What is $T_p k_v$?
2. The test statistic.
3. Distributions of the test statistic; the test threshold.
4. What does $v^* = \arg\max_v \text{score}(v)$ do theoretically?
(1) What is $T_p k_v$?
Recall $\text{witness}(v) = \mathbb{E}_{x \sim q}(T_p k_v)(x) - \mathbb{E}_{y \sim p}(T_p k_v)(y)$. Define
$(T_p k_v)(y) = \frac{1}{p(y)} \frac{d}{dy}[k_v(y)\, p(y)].$
The normalizer of $p$ cancels. Then $\mathbb{E}_{y \sim p}(T_p k_v)(y) = 0$ [Liu et al., 2016, Chwialkowski et al., 2016].
Proof:
$\mathbb{E}_{y \sim p}[(T_p k_v)(y)] = \int_{-\infty}^{\infty} \frac{1}{p(y)} \frac{d}{dy}[k_v(y)\, p(y)]\; p(y)\, dy = \int_{-\infty}^{\infty} \frac{d}{dy}[k_v(y)\, p(y)]\, dy = \big[k_v(y)\, p(y)\big]_{y=-\infty}^{y=\infty} = 0,$
assuming $\lim_{|y| \to \infty} k_v(y)\, p(y) = 0$.
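The zero-mean property above is easy to check by Monte Carlo. A minimal sketch for $p = \mathcal{N}(0, 1)$ and a unit-bandwidth Gaussian kernel, where the Stein operator reduces to $(T_p k_v)(y) = (v - 2y)\,e^{-(y-v)^2/2}$; the location $v = 1.5$ and the sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
v = 1.5                               # test location (arbitrary choice)
y = rng.standard_normal(200_000)      # y ~ p = N(0, 1)

# Gaussian kernel with unit bandwidth: k_v(y) = exp(-(y - v)^2 / 2)
k = np.exp(-0.5 * (y - v) ** 2)

# Stein operator for p = N(0,1):
# (T_p k_v)(y) = k_v'(y) + k_v(y) d/dy log p(y) = (v - y) k_v(y) - y k_v(y)
stein = (v - 2.0 * y) * k

print(abs(stein.mean()))              # close to 0, as the proof predicts
```

The empirical mean stays near zero for any $v$, because the integrand is an exact derivative that vanishes at the boundary.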
(2) Proposal: The Finite Set Stein Discrepancy (FSSD)
Recall the Stein witness: $g(v) := \mathbb{E}_{x \sim q}\!\left[\frac{1}{p(x)} \frac{d}{dx}[k_v(x)\, p(x)]\right]$.
FSSD statistic: evaluate $g^2$ at $J$ test locations $V = \{v_1, \ldots, v_J\}$. Population FSSD:
$\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2.$
The unbiased estimator $\widehat{\mathrm{FSSD}^2}$ is computable in $O(d^2 J n)$ time ($d$ = input dimension).
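To make the estimator concrete, here is a sketch of an unbiased $\widehat{\mathrm{FSSD}^2}$ for $d = 1$ with a Gaussian kernel. The function name `fssd2_hat`, the choice $p = \mathcal{N}(0, 1)$, and the test locations are illustrative assumptions, not the authors' code:

```python
import numpy as np

def fssd2_hat(x, V, dlogp, bw=1.0):
    """Unbiased linear-time FSSD^2 estimate for 1-d data.

    x: (n,) sample from q; V: (J,) test locations;
    dlogp: callable, d/dx log p(x); bw: kernel bandwidth sigma_k^2.
    """
    n, J = x.shape[0], V.shape[0]
    # Feature matrix Xi[i, j] = xi(x_i, v_j)
    #   = d/dx k(x_i, v_j) + k(x_i, v_j) * dlogp(x_i)
    D = x[:, None] - V[None, :]                  # (n, J)
    K = np.exp(-D**2 / (2.0 * bw))
    Xi = (-D / bw) * K + K * dlogp(x)[:, None]
    tau = Xi / np.sqrt(J)                        # d = 1, so dJ = J
    # U-statistic: average tau_i . tau_l over i != l
    s = tau.sum(axis=0)
    return (s @ s - (tau * tau).sum()) / (n * (n - 1))

rng = np.random.default_rng(1)
dlogp = lambda x: -x                             # p = N(0, 1)
V = np.array([-1.0, 0.0, 1.0])

same = fssd2_hat(rng.standard_normal(5000), V, dlogp)        # q = p
diff = fssd2_hat(rng.standard_normal(5000) + 1.0, V, dlogp)  # q = N(1, 1)
print(same, diff)   # same near 0, diff clearly positive
```

Each data point contributes one $J$-dimensional feature, so the cost is $O(Jn)$; no pairwise $n \times n$ computation is needed.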
(2) FSSD is a Discrepancy Measure
$\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2.$
Theorem 1 (FSSD is a discrepancy measure). Main conditions:
1. (Nice kernel) The kernel $k$ is $C_0$-universal and real analytic, e.g., the Gaussian kernel.
2. (Vanishing boundary) $\lim_{\|x\| \to \infty} p(x)\, k_v(x) = 0$.
3. (Avoid blind spots) The locations $v_1, \ldots, v_J$ are drawn from a distribution which has a density.
Then, for any $J \geq 1$, almost surely, $\mathrm{FSSD}^2 = 0 \iff p = q$.
Summary: evaluating the witness at random locations is sufficient to detect the discrepancy between $p$ and $q$.
(2) What Are Blind Spots?
Recall $g(v) := \mathbb{E}_{x \sim q}\!\left[\frac{1}{p(x)} \frac{d}{dx}[k_v(x)\, p(x)]\right]$.
Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(0, \sigma_q^2)$, with a unit-width Gaussian kernel. Then
$g(v) = \dfrac{v\,(1 - \sigma_q^2)}{(\sigma_q^2 + 1)^{3/2}} \exp\!\left(-\dfrac{v^2}{2 + 2\sigma_q^2}\right).$
[Plot: $p = \mathcal{N}(0, 1)$, $q = \mathcal{N}(0, 4)$, and the witness $g$.]
If $v = 0$, then $\mathrm{FSSD}^2 = g^2(v) = 0$ regardless of $\sigma_q^2$.
If $g \neq 0$ and $k$ is real analytic, then $R = \{v \mid g(v) = 0\}$ (the blind spots) has zero Lebesgue measure. So if $v \sim$ a distribution with a density, then $v \notin R$ almost surely.
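The blind spot at $v = 0$ is easy to verify numerically. A sketch for $p = \mathcal{N}(0, 1)$, $q = \mathcal{N}(0, 4)$, and a unit-bandwidth Gaussian kernel, estimating $g(v)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)
x = 2.0 * rng.standard_normal(400_000)        # x ~ q = N(0, 4)

def g_hat(v):
    # Stein witness estimate for p = N(0,1), unit-bandwidth Gaussian kernel:
    # xi(x, v) = d/dx k_v(x) + k_v(x) d/dx log p(x) = (v - 2x) exp(-(x-v)^2/2)
    return np.mean((v - 2.0 * x) * np.exp(-0.5 * (x - v) ** 2))

print(g_hat(0.0), g_hat(1.0))   # g(0) is ~0 (blind spot); g(1) is clearly nonzero
```

Even though $q$ has four times the variance of $p$, the witness vanishes at $v = 0$ by symmetry, which is exactly why a single fixed location can miss the discrepancy while a randomly drawn one almost surely does not.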
(3) Asymptotic Distributions of $\hat\lambda_n := n\, \widehat{\mathrm{FSSD}^2}$
[Plot: null density $P_{H_0}(\hat\lambda_n)$ with threshold $T_\alpha$, and alternative density $P_{H_1}(\hat\lambda_n)$.]
Under $H_0: p = q$, asymptotically
$\hat\lambda_n := n\, \widehat{\mathrm{FSSD}^2} \xrightarrow{d} \sum_{i=1}^{dJ} (Z_i^2 - 1)\, \omega_i$, where $Z_1, \ldots, Z_{dJ} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$
and $\{\omega_i\}_{i=1}^{dJ}$ are non-negative, computable quantities.
Under $H_1: p \neq q$, asymptotically
$\sqrt{n}\,(\widehat{\mathrm{FSSD}^2} - \mathrm{FSSD}^2) \xrightarrow{d} \mathcal{N}(0, \sigma_{H_1}^2).$
Here $\mathrm{FSSD}^2$ and $\sigma_{H_1}$ play the roles of $\text{witness}^2(V)$ and $\text{noise}(V)$.
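Once the $\omega_i$ are in hand, the null distribution can be simulated cheaply to obtain the threshold $T_\alpha$; the cost does not depend on $n$. A sketch with made-up eigenvalues (the values in `omega` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
omega = np.array([0.8, 0.5, 0.2])   # hypothetical eigenvalues {omega_i}
sims = 100_000

# Draw from the null law  sum_i (Z_i^2 - 1) * omega_i,  Z_i ~ N(0, 1) i.i.d.
Z = rng.standard_normal((sims, omega.size))
null_draws = (Z**2 - 1.0) @ omega

T_alpha = np.quantile(null_draws, 0.95)   # test threshold at alpha = 0.05
print(T_alpha)
```

The test then rejects $H_0$ whenever the observed $\hat\lambda_n$ exceeds `T_alpha`.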
(4) What Does $\arg\max_v \text{score}(v)$ Do?
Proposition 1 (Asymptotic test power). For large $n$, the test power is
$P(\text{reject } H_0 \mid H_1 \text{ true}) = P_{H_1}\!\left(n\, \widehat{\mathrm{FSSD}^2} > T_\alpha\right) \approx \Phi\!\left(\sqrt{n}\, \frac{\mathrm{FSSD}^2}{\sigma_{H_1}} - \frac{T_\alpha}{\sqrt{n}\, \sigma_{H_1}}\right),$
where $\Phi$ = CDF of $\mathcal{N}(0, 1)$.
For large $n$, the term $\sqrt{n}\, \mathrm{FSSD}^2 / \sigma_{H_1}$ dominates, so
$\arg\max_{V,\, \sigma_k^2} P_{H_1}\!\left(n\, \widehat{\mathrm{FSSD}^2} > T_\alpha\right) \approx \arg\max_{V,\, \sigma_k^2} \frac{\widehat{\mathrm{FSSD}^2}}{\hat\sigma_{H_1}} = \text{score}(V, \sigma_k^2).$
Maximizing $\text{score}(V, \sigma_k^2)$ $\iff$ maximizing test power.
In practice, split $\{x_i\}_{i=1}^n$ into independent training/test sets: optimize on the training set, run the goodness-of-fit test on the test set.
Related Work
Kernel Stein Discrepancy (KSD) [Liu et al., 2016, Chwialkowski et al., 2016]
Recall the Stein witness: $g(v) := \mathbb{E}_{x \sim q}\!\left[\frac{1}{p(x)} \frac{d}{dx}[k_v(x)\, p(x)]\right]$.
KSD: $\mathrm{KSD}^2 = \|g\|_{\text{RKHS}}^2$ (RKHS norm). Good when the difference between $p$ and $q$ is spatially diffuse.
Proposed FSSD: $\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2$. Good when the difference between $p$ and $q$ is local.
Kernel Stein Discrepancy (KSD) [Liu et al., 2016, Chwialkowski et al., 2016]
Closed-form expression for KSD:
$\mathrm{KSD}^2 = \|g\|_{\text{RKHS}}^2 = \underbrace{\mathbb{E}_{x \sim q}\, \mathbb{E}_{y \sim q}\; h_p(x, y)}_{\text{double sums}},$
where
$h_p(x, y) := [\partial_x \log p(x)]\, k(x, y)\, [\partial_y \log p(y)] + [\partial_y \log p(y)]\, \partial_x k(x, y) + [\partial_x \log p(x)]\, \partial_y k(x, y) + \partial_x \partial_y k(x, y)$
and $k$ is a kernel. The double sums make it $O(d^2 n^2)$. Slow.
Linear-Time Kernel Stein Discrepancy (LKS)
[Liu et al., 2016] also proposed a linear version of KSD. For $\{x_i\}_{i=1}^n \sim q$, the KSD test statistic is
$\frac{2}{n(n-1)} \sum_{i < j} h_p(x_i, x_j).$
The LKS test statistic is a running average
$\frac{2}{n} \sum_{i=1}^{n/2} h_p(x_{2i-1}, x_{2i}).$
Both are unbiased. LKS has $O(d^2 n)$ runtime, the same as the proposed FSSD, but it has high variance and poor test power.
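The two estimators differ only in which pairs they average over. A sketch for $d = 1$, $p = \mathcal{N}(0, 1)$, and a unit-bandwidth Gaussian kernel; the shifted sample $q = \mathcal{N}(0.5, 1)$ is an illustrative choice:

```python
import numpy as np

def h_p(x, y):
    """Stein kernel h_p for p = N(0,1), unit-bandwidth Gaussian kernel (1-d)."""
    d = x - y
    k = np.exp(-0.5 * d**2)
    sx, sy = -x, -y                       # score function d/dx log p(x) = -x
    # s(x) s(y) k  +  s(y) dk/dx  +  s(x) dk/dy  +  d^2 k / dx dy
    return sx * sy * k + sy * (-d) * k + sx * d * k + (1.0 - d**2) * k

rng = np.random.default_rng(3)
x = rng.standard_normal(1000) + 0.5       # q = N(0.5, 1), so q != p

# Quadratic-time KSD^2 estimate: average h_p over all pairs i != j -> O(n^2)
H = h_p(x[:, None], x[None, :])
n = x.size
ksd2 = (H.sum() - np.trace(H)) / (n * (n - 1))

# Linear-time LKS estimate: average over disjoint pairs -> O(n)
lks = h_p(x[0::2], x[1::2]).mean()

print(ksd2, lks)
```

Both estimate the same population quantity, but LKS sees only $n/2$ pairs instead of $O(n^2)$, which is the source of its higher variance.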
Simulation Settings
Gaussian kernel $k(x, v) = \exp\!\left(-\frac{\|x - v\|^2}{2\sigma_k^2}\right)$.
1. FSSD-opt: proposed, with optimization. $J = 5$.
2. FSSD-rand: proposed, random test locations.
3. KSD: quadratic-time kernel Stein discrepancy [Liu et al., 2016, Chwialkowski et al., 2016].
4. LKS: linear-time running-average version of KSD.
5. MMD-opt: MMD two-sample test [Gretton et al., 2012], with optimization.
6. ME-test: Mean Embeddings two-sample test [Jitkrittum et al., 2016], with optimization.
Two-sample tests need to draw a sample from $p$. Tests with optimization use 20% of the data. Significance level $\alpha = 0.05$.
Gaussian vs. Laplace
$p$ = Gaussian, $q$ = Laplace, with the same mean and variance. High-order moments differ.
[Plot: rejection rate vs. sample size $n$ and dimension $d$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.]
Optimization increases the power. Two-sample tests can perform well in this case ($p$ and $q$ clearly differ).
Gaussian-Bernoulli Restricted Boltzmann Machine (RBM)
$p(x)$ is the marginal of
$p(x, h) = \frac{1}{Z} \exp\!\left(x^\top B h + b^\top x + c^\top h - \frac{1}{2}\|x\|^2\right),$
where $x \in \mathbb{R}^{50}$ and $h \in \{-1, 1\}^{40}$ is latent. Randomly pick $B, b, c$.
$q(x) = p(x)$ with i.i.d. $\mathcal{N}(0, \sigma_{\text{per}}^2)$ noise added to all entries of $B$.
[Plot: rejection rate vs. perturbation SD $\sigma_{\text{per}}$ and sample size $n$ for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.]
KSD ($O(n^2)$) and FSSD-opt ($O(n)$) are comparable. LKS has low power.
Interpretable Test Locations: Chicago Crime
Data: robbery events in Chicago, in lat/long coordinates = sample from $q$. Model the spatial density with Gaussian mixtures.
Model $p$ = 2-component Gaussian mixture. [Plot: score surface.] ★ = optimized $v^*$. The optimized location falls in Lake Michigan: there is no robbery in the lake, but the model places mass there.
Model $p$ = 10-component Gaussian mixture. Captures the right tail better. Still does not capture the left tail.
Learned test locations are interpretable.
Bahadur Slope and Bahadur Efficiency
Bahadur slope ≈ the rate at which the p-value → 0 under $H_1$ as $n \to \infty$. It measures a test's sensitivity to the departure from $H_0$.
$H_0: \theta = 0$ vs. $H_1: \theta \neq 0$. Typically
$\text{pval}_n \approx \exp\!\left(-\tfrac{1}{2} c(\theta)\, n\right),$
where $c(\theta) > 0$ under $H_1$ and $c(0) = 0$ [Bahadur, 1960]. Higher $c(\theta)$ ⟹ more sensitive. Good.
[Plot: p-values of two tests $T_n^{(1)}$ and $T_n^{(2)}$ decaying with $n$.]
Bahadur slope: $c(\theta) := -2\, \underset{n \to \infty}{\text{plim}}\ \frac{\log\!\left(1 - F(T_n)\right)}{n},$
where $F(t)$ = CDF of $T_n$ under $H_0$. Bahadur efficiency = ratio of the slopes of two tests.
Gaussian Mean Shift Problem
Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, 1)$. Assume $J = 1$ location for $n\, \widehat{\mathrm{FSSD}^2}$, with a Gaussian kernel (bandwidth $\sigma_k^2$). The slope $c^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)$ has a closed form; so does the LKS slope $c^{(\mathrm{LKS})}(\mu_q, \sigma^2)$ with a Gaussian kernel (bandwidth $\sigma^2$).
Theorem 2 (FSSD is at least two times more efficient). Fix $\sigma_k^2 = 1$ for $n\, \widehat{\mathrm{FSSD}^2}$. Then for all $\mu_q \neq 0$ there exists $v \in \mathbb{R}$ such that for all $\sigma^2 > 0$, the Bahadur efficiency
$\frac{c^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)}{c^{(\mathrm{LKS})}(\mu_q, \sigma^2)} > 2.$
Conclusion
Proposed the Finite Set Stein Discrepancy (FSSD). The goodness-of-fit test based on FSSD is
1. nonparametric,
2. linear-time,
3. tunable (parameters automatically tuned),
4. interpretable.
A Linear-Time Kernel Goodness-of-Fit Test. Wittawat Jitkrittum, Wenkai Xu, Zoltán Szabó, Kenji Fukumizu, Arthur Gretton. NIPS 2017 (best paper award). Python code is available.
Questions? Thank you.
Illustration: Score Surface
Consider $J = 1$ location. Plots show $\text{score}(v) = \widehat{\mathrm{FSSD}^2}(v) / \hat\sigma_{H_1}(v)$ (gray), $p$ in wireframe, $\{x_i\}_{i=1}^n \sim q$ in purple, ★ = best $v$.
[Plot 1: $p$ and $q$ both Gaussian with different covariances.]
[Plot 2: $p = \mathcal{N}(0, I)$ vs. $q$ = Laplace with the same mean and variance.]
FSSD and KSD in the 1D Gaussian Case
Consider $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2)$. Assume $J = 1$ feature for $n\, \widehat{\mathrm{FSSD}^2}$, with a Gaussian kernel (bandwidth $\sigma_k^2$). $\mathrm{FSSD}^2$ has a closed form.
If $\mu_q \neq 0$, $\sigma_q^2 \neq 1$, and $v = \frac{(\sigma_k^2 + 1)\, \mu_q}{\sigma_q^2 - 1}$, then $\mathrm{FSSD}^2 = 0$! This is why $v$ should be drawn from a distribution with a density.
For KSD with a Gaussian kernel (bandwidth $\sigma^2$), $S^2 = \mathrm{KSD}^2$ also has a closed form in $\mu_q$ and $\sigma_q^2$.
FSSD is a Discrepancy Measure
Theorem 3. Let $V = \{v_1, \ldots, v_J\} \subset \mathbb{R}^d$ be drawn i.i.d. from a distribution which has a density. Let $\mathcal{X}$ be a connected open set in $\mathbb{R}^d$. Assume
1. (Nice RKHS) The kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is $C_0$-universal and real analytic.
2. (Stein witness not too rough) $\|g\|_{\mathcal{F}}^2 < \infty$.
3. (Finite Fisher divergence) $\mathbb{E}_{x \sim q} \left\|\nabla_x \log \frac{p(x)}{q(x)}\right\|^2 < \infty$.
4. (Vanishing boundary) $\lim_{\|x\| \to \infty} p(x)\, g(x) = 0$.
Then, for any $J \geq 1$, almost surely, $\mathrm{FSSD}^2 = 0$ if and only if $p = q$.
The Gaussian kernel $k(x, v) = \exp\!\left(-\frac{\|x - v\|^2}{2\sigma_k^2}\right)$ works. In practice, $J = 1$ or $J = 5$.
134 Asymptotic Distributions of FSSD̂² Recall ξ(x, v) := (1/p(x)) ∇_x[k(x, v) p(x)] ∈ R^d. τ(x) := vertically stack ξ(x, v_1), …, ξ(x, v_J) ∈ R^{dJ}. Feature vector of x. Mean feature: μ := E_{x∼q}[τ(x)]. Σ_r := cov_{x∼r}[τ(x)] ∈ R^{dJ×dJ} for r ∈ {p, q}. Proposition 2 (Asymptotic distributions). Let Z_1, …, Z_{dJ} ∼ i.i.d. N(0, 1), and let {ω_i}_{i=1}^{dJ} be the eigenvalues of Σ_p. 1 Under H_0 : p = q, asymptotically n·FSSD̂² →_d Σ_{i=1}^{dJ} (Z_i² − 1) ω_i. Easy to simulate to get a p-value. Simulation cost independent of n. 2 Under H_1 : p ≠ q, we have √n (FSSD̂² − FSSD²) →_d N(0, σ²_H1) where σ²_H1 := 4 μ^⊤ Σ_q μ. Implies P(reject H_0) → 1 as n → ∞. But how to estimate Σ_p? No sample from p! Theorem: using Σ̂_q (computed with {x_i}_{i=1}^n ∼ q) still leads to a consistent test. 30/25
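The weighted chi-square null limit is indeed easy to simulate. Here is a hedged Python sketch (the eigenvalues and the observed statistic below are made-up numbers, for illustration only): draw Z_i ∼ N(0, 1), form Σ_i (Z_i² − 1) ω_i, and read off a p-value; each simulated draw costs O(dJ), independent of n.

```python
import random

def null_pvalue(stat, omegas, n_sim=20000, seed=0):
    # Simulate the H0 limit of n * FSSD^2-hat: sum_i (Z_i^2 - 1) * omega_i,
    # with Z_i iid N(0,1) and omegas the eigenvalues of the (estimated)
    # feature covariance. Returns the fraction of draws exceeding `stat`.
    rng = random.Random(seed)
    draws = [
        sum((rng.gauss(0.0, 1.0) ** 2 - 1.0) * w for w in omegas)
        for _ in range(n_sim)
    ]
    return sum(d >= stat for d in draws) / n_sim

# Hypothetical eigenvalues (dJ = 3) and observed statistic, for illustration:
omegas = [2.0, 0.5, 0.1]
p_val = null_pvalue(4.2, omegas)
```

In the actual test the ω_i come from Σ̂_q (as in the proposition's consistency remark), but the simulation step itself never touches the n data points again.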
138 Bahadur Slopes of FSSD and LKS Theorem 4. The Bahadur slope of n·FSSD̂² is c^(FSSD) := FSSD²/ω_1, where ω_1 is the maximum eigenvalue of Σ_p := cov_{x∼p}[τ(x)]. Theorem 5. The Bahadur slope of the linear-time kernel Stein (LKS) statistic √n·Ŝ²_l is c^(LKS) = (1/2) [E_q h_p(x, x′)]² / E_p[h_p²(x, x′)], where h_p is the U-statistic kernel of the KSD statistic. 31/25
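To make Theorem 4 concrete, here is a small Python sketch (my own illustration, with arbitrary parameter choices) estimating c^(FSSD) in the 1-D Gaussian case with J = d = 1, where Σ_p reduces to the scalar ω_1 = Var_{x∼p}[ξ(x, v)], estimated by Monte Carlo; FSSD² uses the closed form for p = N(0, 1), q = N(μ_q, σ_q²).

```python
import math
import random

def fssd2(v, mu_q, sq2, sk2):
    # Closed-form FSSD^2 for p = N(0,1), q = N(mu_q, sq2), bandwidth sk2.
    return (sk2 * math.exp(-(v - mu_q) ** 2 / (sk2 + sq2))
            * ((sk2 + 1.0) * mu_q + v * (sq2 - 1.0)) ** 2
            / (sk2 + sq2) ** 3)

def xi(x, v, sk2):
    # Stein feature of p = N(0,1).
    return math.exp(-(x - v) ** 2 / (2.0 * sk2)) * (-x + (v - x) / sk2)

random.seed(1)
v, sk2 = 0.0, 1.0
mu_q, sq2 = 0.5, 1.0                       # mean-shift alternative
# omega_1 = Var_p[xi(x, v)], estimated from samples of p = N(0,1).
xs = [random.gauss(0.0, 1.0) for _ in range(100000)]
feats = [xi(x, v, sk2) for x in xs]
m = sum(feats) / len(feats)                # Stein identity: E_p[xi] = 0
omega1 = sum((f - m) ** 2 for f in feats) / (len(feats) - 1)
slope = fssd2(v, mu_q, sq2, sk2) / omega1  # Bahadur slope c^(FSSD), Theorem 4
```

A larger slope means the p-value decays faster under this alternative; comparing such slopes is how the paper argues FSSD's efficiency against LKS.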
140 Harder RBM Problem Perturb only one entry of the matrix B (in the RBM): B_{1,1} ← B_{1,1} + N(0, σ²_per = 0.1²). [Plot: rejection rate vs. sample size n for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.] Two-sample tests fail: samples from p and q look roughly the same. FSSD-opt is comparable to KSD at low n, and one order of magnitude faster. 32/25
141 Harder RBM Problem Perturb only one entry of the matrix B (in the RBM): B_{1,1} ← B_{1,1} + N(0, σ²_per = 0.1²). [Plot: runtime in seconds vs. sample size n for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, ME-opt.] Two-sample tests fail: samples from p and q look roughly the same. FSSD-opt is comparable to KSD at low n, and one order of magnitude faster. 32/25
142 References I Bahadur, R. R. (1960). Stochastic comparison of tests. The Annals of Mathematical Statistics, 31(2). Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In ICML. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. JMLR, 13. Jitkrittum, W., Szabó, Z., Chwialkowski, K. P., and Gretton, A. (2016). Interpretable Distribution Features with Maximum Testing Power. In NIPS. 33/25
143 References II Liu, Q., Lee, J., and Jordan, M. (2016). A Kernelized Stein Discrepancy for Goodness-of-fit Tests. In ICML. 34/25