Measuring Sample Quality with Stein's Method

1 Measuring Sample Quality with Stein's Method
Lester Mackey, Microsoft Research
Joint work with Jackson Gorham, Stanford University
July 4, 2017

3 Motivation: Large-scale Posterior Inference
Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?
MCMC benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) Σ_{i=1}^n h(x_i)
Problem: Each point x_i requires iterating over the entire dataset!
Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
- Approximate the standard MCMC procedure in a manner that makes use of only a small subset of datapoints per sample
- Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
- Introduces asymptotic bias: the target distribution is not stationary
- Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced

9 Motivation: Large-scale Posterior Inference
Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
- Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced
- Introduces new challenges:
  - How do we compare and evaluate samples from approximate MCMC procedures?
  - How do we select samplers and their tuning parameters?
  - How do we quantify the bias-variance trade-off explicitly?
Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias
This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples

16 Quality Measures for Samples
Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution
Given:
- Continuous target distribution P with support X = R^d and density p; p is known up to normalization, and integration under P is intractable
- Sample points x_1, ..., x_n ∈ X defining a discrete distribution Q_n with, for any function h, E_{Q_n}[h(X)] = (1/n) Σ_{i=1}^n h(x_i), used to approximate E_P[h(Z)]; we make no assumption about the provenance of the x_i
Goal: Quantify how well E_{Q_n} approximates E_P in a manner that
I. Detects when a sample sequence is converging to the target
II. Detects when a sample sequence is not converging to the target
III. Is computationally feasible

22 Integral Probability Metrics
Goal: Quantify how well E_{Q_n} approximates E_P
Idea: Consider an integral probability metric (IPM) [Müller, 1997]
  d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|
- Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
- When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies that (Q_n)_{n≥1} converges weakly to P (Requirement II)
Problem: Integration under P is intractable! Most IPMs cannot be computed in practice
Idea: Only consider functions with E_P[h(Z)] known a priori to be 0; then the IPM computation depends only on Q_n!
- How do we select this class of test functions?
- Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
- How do we solve the resulting optimization problem in practice?

27 Stein's Method
Stein's method for bounding IPMs [Stein, 1972] proceeds in 3 steps:
1. Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G. Together, T and G define the Stein discrepancy
   S(Q_n, T, G) ≜ sup_{g ∈ G} E_{Q_n}[(T g)(X)] = d_{TG}(Q_n, P),
   an IPM-type measure with no explicit integration under P
2. Lower bound S(Q_n, T, G) by a reference IPM d_H(Q_n, P), so that S(Q_n, T, G) → 0 only if (Q_n)_{n≥1} converges to P (Req. II). Performed once, in advance, for large classes of distributions
3. Upper bound S(Q_n, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I)
Standard use: as an analytical tool to prove convergence
Our goal: develop the Stein discrepancy into a practical quality measure

30 Identifying a Stein Operator T
Goal: Identify an operator T for which E_P[(T g)(Z)] = 0 for all g ∈ G
Approach: Generator method of Barbour [1988, 1990], Götze [1991]
- Identify a Markov process (Z_t)_{t≥0} with stationary distribution P
- Under mild conditions, its infinitesimal generator (A u)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x))/t satisfies E_P[(A u)(Z)] = 0
- Overdamped Langevin diffusion: dZ_t = (1/2) ∇ log p(Z_t) dt + dW_t
- Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇ log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩
Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]
- Depends on P only through ∇ log p; computable even if p cannot be normalized! (A numerical check of the zero-mean property follows below.)
- Multivariate generalization of the density method operator (T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
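
As a quick illustration (not from the talk), the following sketch checks numerically that the Langevin Stein operator has expectation zero under P. The choices here are assumptions made for the example: P = N(0, I_d), so ∇ log p(x) = −x, and g(x) = tanh(x) applied coordinatewise, whose divergence is Σ_j (1 − tanh(x_j)^2).

# Illustrative check of E_P[(T_P g)(Z)] = 0 for the Langevin Stein operator
#   (T_P g)(x) = <g(x), grad log p(x)> + <grad, g(x)>
# under the assumed target P = N(0, I_d) and test function g(x) = tanh(x).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 1_000_000
Z = rng.standard_normal((n, d))                    # draws from P = N(0, I_d)

grad_log_p = -Z                                    # score of the standard Gaussian
g = np.tanh(Z)                                     # test function g: R^d -> R^d
div_g = np.sum(1.0 - np.tanh(Z) ** 2, axis=1)      # divergence <grad, g(x)>

stein_values = np.sum(g * grad_log_p, axis=1) + div_g
print(stein_values.mean())                         # approximately 0, up to Monte Carlo error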

35 Identifying a Stein Set G
Goal: Identify a set G for which E_P[(T_P g)(Z)] = 0 for all g ∈ G
Approach: Reproducing kernels k : X × X → R
- A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (Σ_{i,l} c_i c_l k(z_i, z_l) ≥ 0 for all z_i ∈ X, c_i ∈ R), e.g., the Gaussian kernel k(x, y) = e^{−‖x−y‖_2^2/2} (a small numerical check of positive semidefiniteness follows below)
- Yields the function space K_k = {f : f(x) = Σ_{i=1}^m c_i k(z_i, x), m ∈ N} with norm ‖f‖_{K_k} = (Σ_{i,l=1}^m c_i c_l k(z_i, z_l))^{1/2}
We define the kernel Stein set of vector-valued g : X → R^d as
  G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) : ‖v‖ ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}.
- Each g_j belongs to the reproducing kernel Hilbert space (RKHS) K_k
- The component norms v_j = ‖g_j‖_{K_k} are jointly bounded by 1
- E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,‖·‖} whenever k ∈ C_b^{(1,1)} and ∇ log p is integrable under P [Gorham and Mackey, 2017]
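
For concreteness, here is a small numerical check (an illustration, not part of the talk) that the inverse multiquadric kernel used later is positive semidefinite in the sense defined above; the sample points are arbitrary.

# Illustrative check: the IMQ kernel k(x, y) = (1 + ||x - y||_2^2)^(-1/2) yields a
# Gram matrix with no negative eigenvalues (up to numerical tolerance).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3))                          # arbitrary points in R^3
sqd = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, -1)    # pairwise squared distances
K = (1.0 + sqd) ** (-0.5)                                 # IMQ Gram matrix
print(np.linalg.eigvalsh(K).min() >= -1e-10)              # True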

40 Computing the Kernel Stein Discrepancy
Kernel Stein discrepancy (KSD): S(Q_n, T_P, G_{k,‖·‖})
- Stein operator (T_P g)(x) ≜ ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩
- Stein set G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) : ‖v‖ ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}
Benefit: Computable in closed form [Gorham and Mackey, 2017]:
  S(Q_n, T_P, G_{k,‖·‖}) = ‖w‖ for w_j ≜ (Σ_{i,i'=1}^n k_0^j(x_i, x_{i'}))^{1/2} / n
- Reduces to parallelizable pairwise evaluations of the Stein kernels k_0^j(x, y) ≜ (1/(p(x)p(y))) ∂_{x_j} ∂_{y_j} (p(x) k(x, y) p(y)) (a computational sketch follows below)
- Stein set choice inspired by the control functional kernels k_0 = Σ_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]
- When ‖·‖ = ‖·‖_2, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016] and Liu, Lee, and Jordan [2016]
- To ease notation, we will use G_k ≜ G_{k,‖·‖_2} in the remainder of the talk
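
A minimal computational sketch of this closed form, assuming the ℓ2 Stein set G_k, the IMQ kernel k(x, y) = (c^2 + ‖x − y‖_2^2)^β with β = −1/2 and c = 1, and a target whose score ∇ log p can be evaluated (here a standard Gaussian, so ∇ log p(x) = −x). It uses the standard expansion k_0(x, y) = ⟨∇_x, ∇_y⟩k + ⟨∇_x k, ∇ log p(y)⟩ + ⟨∇_y k, ∇ log p(x)⟩ + k(x, y)⟨∇ log p(x), ∇ log p(y)⟩ and S(Q_n, T_P, G_k)^2 = (1/n^2) Σ_{i,i'} k_0(x_i, x_{i'}); treat it as an illustration rather than the authors' implementation.

# Sketch of the closed-form kernel Stein discrepancy with an IMQ base kernel.
import numpy as np

def imq_stein_kernel_matrix(X, score, c=1.0, beta=-0.5):
    """Pairwise Stein kernel values k_0(x_i, x_j) for k(x, y) = (c^2 + ||x - y||^2)^beta."""
    n, d = X.shape
    B = score(X)                                    # rows are grad log p(x_i)
    diffs = X[:, None, :] - X[None, :, :]           # (n, n, d) array of x_i - x_j
    sqdists = np.sum(diffs ** 2, axis=-1)
    s = c ** 2 + sqdists
    K = s ** beta                                   # base kernel values
    gradx_K = 2 * beta * s[..., None] ** (beta - 1) * diffs       # grad_x k(x_i, x_j)
    trace = (-4 * beta * (beta - 1) * s ** (beta - 2) * sqdists
             - 2 * beta * d * s ** (beta - 1))                    # <grad_x, grad_y> k
    cross = (np.einsum('ijk,jk->ij', gradx_K, B)                  # <grad_x k, score(x_j)>
             - np.einsum('ijk,ik->ij', gradx_K, B))               # <grad_y k, score(x_i)>
    return trace + cross + K * (B @ B.T)

def ksd(X, score, **kw):
    K0 = imq_stein_kernel_matrix(X, score, **kw)
    return np.sqrt(K0.sum()) / X.shape[0]           # S(Q_n, T_P, G_k) for uniform weights

rng = np.random.default_rng(0)
score = lambda X: -X                                # standard Gaussian target N(0, I_d)
X_on = rng.standard_normal((200, 2))                # on-target sample
print(ksd(X_on, score))                             # small
print(ksd(X_on + 3.0, score))                       # larger: the same draws shifted off target

Only ∇ log p enters, so the same routine applies when p is known only up to normalization.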

43 Detecting Non-convergence
Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P
Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017]): Suppose d ≥ 3, P = N(0, I_d), and α > 2d/(d−2). If k(x, y) and its derivatives decay at an o(‖x − y‖_2^{−α}) rate as ‖x − y‖_2 → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.
- Gaussian (k(x, y) = e^{−‖x−y‖_2^2/2}) and Matérn kernels fail for d ≥ 3
- Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖_2^2)^β) with β < −1 fail for d > 2β/(1+β)
- The violating sample sequences (Q_n)_{n≥1} are simple to construct
Problem: Kernels with light tails ignore excess mass in the tails

45 Detecting Non-convergence
Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c^2 + ‖x − y‖_2^2)^β for some β < 0, c ∈ R.
- The IMQ KSD fails to detect non-convergence when β < −1
- However, the IMQ KSD detects non-convergence when β ∈ (−1, 0)
Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017]): Suppose P ∈ 𝒫 (the target class defined on slide 65) and k(x, y) = (c^2 + ‖x − y‖_2^2)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
- No extra assumptions on the sample sequence (Q_n)_{n≥1} are needed
- Proof sketch: the slow decay rate of the kernel yields unbounded (coercive) test functions in T_P G_k, which detect excess mass in the tails

47 Detecting Convergence
Goal: Show S(Q_n, T_P, G_k) → 0 when Q_n converges to P
Proposition (KSD detects convergence [Gorham and Mackey, 2017]): If k ∈ C_b^{(2,2)} and ∇ log p is Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W_2}(Q_n, P) → 0.
Covers the Gaussian, Matérn, IMQ, and other common bounded kernels k

48 A Simple Example
[Figure, left: discrepancy value vs. number of sample points n for points drawn i.i.d. from the mixture target P and i.i.d. from a single mixture component; right: recovered functions g and h = T_P g vs. x]
Left plot: For the target p(x) ∝ e^{−(x+1.5)^2/2} + e^{−(x−1.5)^2/2}, compare an i.i.d. sample from P with an i.i.d. sample from a single mixture component (sketched in code below)
- Expect the discrepancy to converge to 0 for the on-target sample and not for the off-target sample
- Compare the IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance
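
A sketch of this comparison (with illustrative sample sizes), reusing the ksd() helper from the earlier KSD sketch; the score of the unnormalized mixture density follows directly from differentiation.

# Sketch of the bimodal-mixture example: on-target vs. single-component samples.
import numpy as np

def mixture_score(X):                               # X has shape (n, 1)
    a = np.exp(-0.5 * (X + 1.5) ** 2)
    b = np.exp(-0.5 * (X - 1.5) ** 2)
    return (-(X + 1.5) * a - (X - 1.5) * b) / (a + b)   # grad log p for the mixture

rng = np.random.default_rng(0)
n = 1000
means = rng.choice([-1.5, 1.5], size=(n, 1))
on_target = means + rng.standard_normal((n, 1))     # i.i.d. from the mixture target
off_target = 1.5 + rng.standard_normal((n, 1))      # i.i.d. from a single component
print(ksd(on_target, mixture_score))                # shrinks as n grows
print(ksd(off_target, mixture_score))               # stays bounded away from 0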

49 A Simple Example
[Figure: recovered optimal Stein functions g (top) and test functions h = T_P g (bottom) as functions of x]
Right plot: For n = 10^3 sample points,
- (Top) the recovered optimal Stein functions g ∈ G_k
- (Bottom) the associated test functions h ≜ T_P g which best discriminate the sample Q_n from the target P

50 The Importance of Kernel Choice
[Figure: kernel Stein discrepancy vs. number of sample points n for the Gaussian, Matérn, and inverse multiquadric kernels in dimensions d = 5, 8, 20, comparing an i.i.d. sample from the target P with an off-target sample]
- Target P = N(0, I_d); the off-target Q_n has all ‖x_i‖_2 ≤ 2 n^{1/d} log n and ‖x_i − x_j‖_2 ≥ 2 log n
- The Gaussian and Matérn KSDs are driven to 0 by this off-target sequence, which does not converge to P
- The IMQ KSD (β = −1/2, c = 1) does not have this deficiency

51 Selecting Sampler Hyperparameters
Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
- Approximate MCMC procedure designed for scalability
- A random subset of datapoints is used to approximate each sampling step
- Target P is not the stationary distribution
- Tolerance parameter ε controls the number of datapoints evaluated: ε too small → too few sample points generated; ε too large → sampling from a very different distribution
- Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias

52 Selecting Sampler Hyperparameters
[Figure: ESS (higher is better) and IMQ KSD (lower is better) as functions of the tolerance parameter ε, with posterior samples shown for ε = 0 (n = 230), ε = 10^{-2} (n = 416), and ε = 10^{-1} (n = 1000)]
- ESS is maximized at tolerance ε = 10^{-1}
- IMQ KSD is minimized at tolerance ε = 10^{-2}

53 Selecting Samplers
Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]
- Approximate MCMC procedure designed for scalability
- Approximates the Metropolis-adjusted Langevin algorithm but does not use a Metropolis-Hastings correction
- Target P is not the stationary distribution
Goal: Choose between two variants
- SGFS-f inverts a d × d matrix for each new sample point
- SGFS-d inverts a diagonal matrix to reduce sampling time
MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: images with 51 features and a binary label indicating whether the image shows a 7 or a 9; Bayesian logistic regression posterior P

54 Selecting Samplers
[Figure, left: IMQ kernel Stein discrepancy vs. number of sample points n for the samplers SGFS-d and SGFS-f; right: bivariate marginals of SGFS-d and SGFS-f sample points that align best and worst with the surrogate ground truth]
Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with the surrogate ground truth sample (red)
Both suggest that the small speed-up of SGFS-d (0.0017s per sample) over SGFS-f is outweighed by the loss in inferential accuracy

56 Beyond Sample Quality Comparison
One-sample hypothesis testing
- Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Q_n, T_P, G_k) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016]); a test sketch follows below
- The test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased
- We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)
- For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, where z_i iid N(0, I_d) and u_i iid Unif[0, 1]; target P = N(0, I_d)
- Compare with the standard normality test of Baringhaus and Henze [1988]
[Table: mean power of the multivariate normality tests (B&H, Gaussian KSD, IMQ KSD) across 400 simulations for d = 2, 5, 10, 15, 20, 25]
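
A hedged sketch of such a one-sample test, reusing imq_stein_kernel_matrix() from the earlier KSD sketch: the statistic is the V-statistic n · S(Q_n, T_P, G_k)^2, and its null distribution is approximated here with a Rademacher multiplier bootstrap, a simplified stand-in for the wild bootstrap of Chwialkowski, Strathmann, and Gretton [2016] (which also handles correlated samples), so treat it as illustrative.

# Sketch of a KSD goodness-of-fit test with a multiplier-bootstrap null approximation.
import numpy as np

def ksd_gof_pvalue(X, score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K0 = imq_stein_kernel_matrix(X, score)
    stat = K0.sum() / n                             # n * S(Q_n)^2
    boots = np.empty(n_boot)
    for b in range(n_boot):
        eps = rng.choice([-1.0, 1.0], size=n)       # Rademacher multipliers
        boots[b] = eps @ K0 @ eps / n               # bootstrap replicate of the null statistic
    return np.mean(boots >= stat)

rng = np.random.default_rng(1)
d, n = 5, 500
Z = rng.standard_normal((n, d))                     # H0 true: sample from N(0, I_d)
X = Z.copy()
X[:, 0] += rng.uniform(0.0, 1.0, size=n)            # perturbed sample, as in the experiment above
print(ksd_gof_pvalue(Z, lambda Y: -Y))              # typically large
print(ksd_gof_pvalue(X, lambda Y: -Y))              # typically small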

57 Beyond Sample Quality Comparison
Improving sample quality
- Given sample points (x_i)_{i=1}^n, we can minimize the KSD S(Q̃_n, T_P, G_k) over all weighted samples Q̃_n = Σ_{i=1}^n q_n(x_i) δ_{x_i}, for q_n a probability mass function (an optimization sketch follows below)
- Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−(1/h)‖x−y‖_2^2}, with the bandwidth h set to the median of the squared Euclidean distance between pairs of sample points
- We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)‖x−y‖_2^2)^{−1/2}
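
A hedged sketch of such a reweighting, assuming K0 is the n × n Stein kernel matrix from the earlier KSD sketch: the squared weighted KSD is the convex quadratic q^T K0 q, and this illustrative optimizer (not the authors' method) runs exponentiated gradient descent over the probability simplex.

# Sketch: choose sample weights q on the simplex to reduce the weighted KSD.
import numpy as np

def minimize_ksd_weights(K0, n_iters=500, step=0.5):
    n = K0.shape[0]
    q = np.full(n, 1.0 / n)                         # start from the uniform weighting Q_n
    for _ in range(n_iters):
        grad = 2.0 * K0 @ q                         # gradient of q^T K0 q
        scaled = step * (grad - grad.min()) / (np.ptp(grad) + 1e-12)
        q *= np.exp(-scaled)                        # multiplicative (mirror descent) update
        q /= q.sum()                                # renormalize onto the simplex
    return q

# Usage: q = minimize_ksd_weights(K0); np.sqrt(q @ K0 @ q) is typically smaller than the
# uniform-weight KSD np.sqrt(K0.mean()).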

58 Improving Sample Quality
[Figure: average MSE ‖E_P[Z] − E_{Q̃_n}[X]‖_2^2 vs. dimension d for the initial sample Q_n and for the Gaussian KSD and IMQ KSD reweightings]
- MSE averaged over 500 simulations (±2 standard errors)
- Target P = N(0, I_d)
- Starting sample Q_n = (1/n) Σ_{i=1}^n δ_{x_i} for x_i drawn i.i.d. from P, n = 100

59 Future Directions
Many opportunities for future development:
1. Improve the scalability of the KSD while maintaining its convergence-determining properties
   - Low-rank, sparse, or stochastic approximations of the kernel matrix
   - Subsampling of likelihood terms in ∇ log p
2. Addressing other inferential tasks
   - Design of control variates [Oates, Girolami, and Chopin, 2016, Oates and Girolami, 2016]
   - Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
   - Training generative adversarial networks [Wang and Liu, 2016]
3. Exploring the impact of Stein operator choice
   - An infinite number of operators T characterize P. How is the discrepancy impacted? How do we select the best T?
   - Thm: If ∇ log p is bounded and k ∈ C_0^{(1,1)}, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
   - Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be more appropriate for these heavy-tailed targets

60 References I
S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, 2012.
A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., Special Vol. 25A, 1988. A celebration of applied probability.
A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3), 1990.
L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1), 1988.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, 2016.
C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, 2014.
J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Adv. NIPS 28. Curran Associates, Inc., 2015.
J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv preprint, Mar. 2017.
J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv preprint, Nov. 2016.
F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2), 1991.
A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, 2014.
Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv preprint, 2016.
Q. Liu and J. Lee. Black-box importance sampling. arXiv preprint, Oct. 2016. To appear in AISTATS 2017.
Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv preprint, Aug. 2016.
Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, 2016.
L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv preprint.
A. Müller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2), 1997.

61 References II
C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In Proc. 19th AISTATS, volume 51 of Proceedings of Machine Learning Research, pages 56-65, Cadiz, Spain, 2016. PMLR.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. II: Probability theory. Univ. California Press, Berkeley, Calif., 1972.
C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH, 2004.
D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint, Nov. 2016.
M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

62 Comparing Discrepancies
[Figure: discrepancy value and computation time (sec) vs. number of sample points n for the IMQ KSD, graph Stein discrepancy, and Wasserstein distance, in d = 1 and d = 4, for samples drawn i.i.d. from the mixture target P or from a single mixture component]
Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(x+1.5)^2/2} + e^{−(x−1.5)^2/2} or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.

63 Selecting Sampler Hyperparameters
Setup [Welling and Teh, 2011]
- Consider the posterior distribution P induced by datapoints y_l drawn i.i.d. from the Gaussian mixture likelihood Y_l | X ~ (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2) under Gaussian priors on the parameters X ∈ R^2: X_1 ~ N(0, 10), X_2 ~ N(0, 1)
- Draw 100 datapoints y_l with parameters (x_1, x_2) = (0, 1); this induces a posterior with a second mode at (x_1, x_2) = (1, −1)
- For a range of tolerance parameters ε, run approximate slice sampling for a fixed budget of datapoint likelihood evaluations and store the resulting posterior sample Q_n
- Use the minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε (see the sketch below)
- Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
- Compute the median of each diagnostic over 50 random sequences
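
A hedged sketch of this selection rule: ksd() comes from the earlier KSD sketch, run_sampler() is a hypothetical placeholder for the approximate slice sampler (not a real API), and the standard-Gaussian score used here stands in for the actual posterior score.

# Sketch: pick the tolerance whose sample attains the smallest KSD.
import numpy as np

def run_sampler(eps, budget):
    # Placeholder (hypothetical): substitute the approximate MCMC procedure of interest,
    # returning an (n, d) array of draws produced under tolerance eps with the given
    # likelihood-evaluation budget.
    rng = np.random.default_rng(abs(hash(eps)) % 2**32)
    return rng.standard_normal((max(budget // 1000, 10), 2))

candidate_eps = [0.0, 1e-6, 1e-4, 1e-2, 1e-1]
samples = {eps: run_sampler(eps, budget=10 ** 5) for eps in candidate_eps}
best_eps = min(candidate_eps, key=lambda eps: ksd(samples[eps], score=lambda X: -X))
print(best_eps)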

64 Selecting Samplers
Setup
- MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: images with 51 features and a binary label indicating whether the image shows a 7 or a 9
- Bayesian logistic regression posterior P: L independent observations (y_l, v_l) ∈ {−1, 1} × R^d with P(Y_l = 1 | v_l, X) = 1/(1 + exp(−⟨v_l, X⟩)) and a flat improper prior on the parameters X ∈ R^d (the corresponding score function is sketched below)
- Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10^5 sample points and discarding the first half as burn-in
- For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10^5 sample points [Ahn, Korattikara, and Welling, 2012]
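
To make the KSD concrete for this model, here is a hedged sketch of the posterior score ∇_x log p(x) that the discrepancy needs, assuming labels y_l ∈ {−1, 1}, feature vectors v_l, and the flat prior stated above; V and y are assumed data arrays, not part of the talk.

# Sketch of the Bayesian logistic regression score under a flat prior:
#   grad_x log p(x | data) = sum_l y_l * sigmoid(-y_l <v_l, x>) * v_l.
import numpy as np

def logistic_posterior_score(X, V, y):
    """Rows are grad_x log p(x_i | data) for each sample point x_i (X is n x d, V is L x d)."""
    margins = y[None, :] * (X @ V.T)                # (n, L) array of y_l <v_l, x_i>
    weights = 1.0 / (1.0 + np.exp(margins))         # sigmoid(-margin)
    return (weights * y[None, :]) @ V               # (n, d) matrix of scores

# This plugs into the earlier KSD sketch, e.g.
#   ksd(X_sgfs_f, lambda X: logistic_posterior_score(X, V, y))
# to score SGFS-f and SGFS-d samples against the same posterior.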

65 Lower Bounding the Kernel Stein Discrepancy
Goal: Show S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges to P
- Let 𝒫 be the set of targets P with Lipschitz ∇ log p and distant strong log concavity (⟨∇ log(p(x)/p(y)), y − x⟩ / ‖x − y‖_2^2 ≥ k for ‖x − y‖_2 ≥ r)
- Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...
- For a different Stein set G, Gorham, Duncan, Vollmer, and Mackey [2016] showed that (Q_n)_{n≥1} converges to P if P ∈ 𝒫 and S(Q_n, T_P, G) → 0
New contribution [Gorham and Mackey, 2017]:
Theorem (Univariate KSD detects non-convergence): Suppose P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C^2 with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.
- Justifies the use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case

66 The Importance of Tightness
Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
- A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(‖X‖_2 > R(ε)) ≤ ε
- Intuitively, no mass in the sequence escapes to infinity
Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017]): Suppose that P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C^2 with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
Good news, but, ideally, the KSD would detect non-tight sequences automatically...


Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Kernel Learning via Random Fourier Representations

Kernel Learning via Random Fourier Representations Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random

More information

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Changyou Chen Department of Electrical and Computer Engineering, Duke University cc448@duke.edu Duke-Tsinghua Machine Learning Summer

More information

The Recycling Gibbs Sampler for Efficient Learning

The Recycling Gibbs Sampler for Efficient Learning The Recycling Gibbs Sampler for Efficient Learning L. Martino, V. Elvira, G. Camps-Valls Universidade de São Paulo, São Carlos (Brazil). Télécom ParisTech, Université Paris-Saclay. (France), Universidad

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Learning Interpretable Features to Compare Distributions

Learning Interpretable Features to Compare Distributions Learning Interpretable Features to Compare Distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London Theory of Big Data, 2017 1/41 Goal of this talk Given: Two collections

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Learning features to compare distributions

Learning features to compare distributions Learning features to compare distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS 2016 Workshop on Adversarial Learning, Barcelona Spain 1/28 Goal of this

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

Differential Stein operators for multivariate continuous distributions and applications

Differential Stein operators for multivariate continuous distributions and applications Differential Stein operators for multivariate continuous distributions and applications Gesine Reinert A French/American Collaborative Colloquium on Concentration Inequalities, High Dimensional Statistics

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information