Measuring Sample Quality with Stein's Method

1 Measuring Sample Quality with Stein's Method
Lester Mackey, Microsoft Research
Joint work with Jackson Gorham, Stanford University
July 4, 2017

3 Motivation: Large-scale Posterior Inference
Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?
MCMC benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) Σ_{i=1}^n h(x_i)
Problem: Each point x_i requires iterating over the entire dataset!
Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
- Approximate the standard MCMC procedure in a manner that makes use of only a small subset of datapoints per sample
- Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
- Introduces asymptotic bias: the target distribution is not stationary
- Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced

9 Motivation: Large-scale Posterior Inference
Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]
- Hope that, for a fixed amount of sampling time, the variance reduction will outweigh the bias introduced
- Introduces new challenges:
  - How do we compare and evaluate samples from approximate MCMC procedures?
  - How do we select samplers and their tuning parameters?
  - How do we quantify the bias-variance trade-off explicitly?
Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias
This talk: Introduce new quality measures suitable for comparing the quality of approximate MCMC samples

16 Quality Measures for Samples
Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution
Given:
- Continuous target distribution P with support X = R^d and density p; p is known up to normalization, and integration under P is intractable
- Sample points x_1, ..., x_n ∈ X defining a discrete distribution Q_n with, for any function h, E_{Q_n}[h(X)] = (1/n) Σ_{i=1}^n h(x_i), used to approximate E_P[h(Z)]; we make no assumption about the provenance of the x_i
Goal: Quantify how well E_{Q_n} approximates E_P in a manner that
I. Detects when a sample sequence is converging to the target
II. Detects when a sample sequence is not converging to the target
III. Is computationally feasible

22 Integral Probability Metrics
Goal: Quantify how well E_{Q_n} approximates E_P
Idea: Consider an integral probability metric (IPM) [Müller, 1997]
  d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|
- Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
- When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies that (Q_n)_{n≥1} converges weakly to P (Requirement II)
Problem: Integration under P is intractable! Most IPMs cannot be computed in practice
Idea: Only consider functions with E_P[h(Z)] known a priori to be 0; then the IPM computation depends only on Q_n!
- How do we select this class of test functions?
- Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
- How do we solve the resulting optimization problem in practice?

27 Stein's Method
Stein's method for bounding IPMs [Stein, 1972] proceeds in 3 steps:
1. Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G. Together, T and G define the Stein discrepancy
   S(Q_n, T, G) ≜ sup_{g ∈ G} E_{Q_n}[(T g)(X)] = d_{TG}(Q_n, P),
   an IPM-type measure with no explicit integration under P
2. Lower bound S(Q_n, T, G) by a reference IPM d_H(Q_n, P), so that S(Q_n, T, G) → 0 only if (Q_n)_{n≥1} converges to P (Req. II). Performed once, in advance, for large classes of distributions
3. Upper bound S(Q_n, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I)
Standard use: as an analytical tool to prove convergence
Our goal: develop the Stein discrepancy into a practical quality measure

30 Identifying a Stein Operator T
Goal: Identify an operator T for which E_P[(T g)(Z)] = 0 for all g ∈ G
Approach: Generator method of Barbour [1988, 1990], Götze [1991]
- Identify a Markov process (Z_t)_{t≥0} with stationary distribution P
- Under mild conditions, its infinitesimal generator (A u)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x))/t satisfies E_P[(A u)(Z)] = 0
- Overdamped Langevin diffusion: dZ_t = (1/2) ∇ log p(Z_t) dt + dW_t
- Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇ log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩
Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]
- Depends on P only through ∇ log p; computable even if p cannot be normalized! (A numerical check of the zero-mean property follows below.)
- Multivariate generalization of the density method operator (T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
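
As a quick illustration (not from the talk), the following sketch checks numerically that the Langevin Stein operator has expectation zero under P. The choices here are assumptions made for the example: P = N(0, I_d), so ∇ log p(x) = −x, and g(x) = tanh(x) applied coordinatewise, whose divergence is Σ_j (1 − tanh(x_j)^2).

# Illustrative check of E_P[(T_P g)(Z)] = 0 for the Langevin Stein operator
#   (T_P g)(x) = <g(x), grad log p(x)> + <grad, g(x)>
# under the assumed target P = N(0, I_d) and test function g(x) = tanh(x).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 1_000_000
Z = rng.standard_normal((n, d))                    # draws from P = N(0, I_d)

grad_log_p = -Z                                    # score of the standard Gaussian
g = np.tanh(Z)                                     # test function g: R^d -> R^d
div_g = np.sum(1.0 - np.tanh(Z) ** 2, axis=1)      # divergence <grad, g(x)>

stein_values = np.sum(g * grad_log_p, axis=1) + div_g
print(stein_values.mean())                         # approximately 0, up to Monte Carlo error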

35 Identifying a Stein Set G
Goal: Identify a set G for which E_P[(T_P g)(Z)] = 0 for all g ∈ G
Approach: Reproducing kernels k : X × X → R
- A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (Σ_{i,l} c_i c_l k(z_i, z_l) ≥ 0 for all z_i ∈ X, c_i ∈ R), e.g., the Gaussian kernel k(x, y) = e^{−‖x−y‖_2^2/2} (a small numerical check of positive semidefiniteness follows below)
- Yields the function space K_k = {f : f(x) = Σ_{i=1}^m c_i k(z_i, x), m ∈ N} with norm ‖f‖_{K_k} = (Σ_{i,l=1}^m c_i c_l k(z_i, z_l))^{1/2}
We define the kernel Stein set of vector-valued g : X → R^d as
  G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) : ‖v‖ ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}.
- Each g_j belongs to the reproducing kernel Hilbert space (RKHS) K_k
- The component norms v_j = ‖g_j‖_{K_k} are jointly bounded by 1
- E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,‖·‖} whenever k ∈ C_b^{(1,1)} and ∇ log p is integrable under P [Gorham and Mackey, 2017]
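
For concreteness, here is a small numerical check (an illustration, not part of the talk) that the inverse multiquadric kernel used later is positive semidefinite in the sense defined above; the sample points are arbitrary.

# Illustrative check: the IMQ kernel k(x, y) = (1 + ||x - y||_2^2)^(-1/2) yields a
# Gram matrix with no negative eigenvalues (up to numerical tolerance).
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3))                          # arbitrary points in R^3
sqd = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, -1)    # pairwise squared distances
K = (1.0 + sqd) ** (-0.5)                                 # IMQ Gram matrix
print(np.linalg.eigvalsh(K).min() >= -1e-10)              # True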

40 Computing the Kernel Stein Discrepancy
Kernel Stein discrepancy (KSD): S(Q_n, T_P, G_{k,‖·‖})
- Stein operator (T_P g)(x) ≜ ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩
- Stein set G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) : ‖v‖ ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}
Benefit: Computable in closed form [Gorham and Mackey, 2017]:
  S(Q_n, T_P, G_{k,‖·‖}) = ‖w‖ for w_j ≜ (Σ_{i,i'=1}^n k_0^j(x_i, x_{i'}))^{1/2} / n
- Reduces to parallelizable pairwise evaluations of the Stein kernels k_0^j(x, y) ≜ (1/(p(x)p(y))) ∂_{x_j} ∂_{y_j} (p(x) k(x, y) p(y)) (a computational sketch follows below)
- Stein set choice inspired by the control functional kernels k_0 = Σ_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]
- When ‖·‖ = ‖·‖_2, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016] and Liu, Lee, and Jordan [2016]
- To ease notation, we will use G_k ≜ G_{k,‖·‖_2} in the remainder of the talk
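
A minimal computational sketch of this closed form, assuming the ℓ2 Stein set G_k, the IMQ kernel k(x, y) = (c^2 + ‖x − y‖_2^2)^β with β = −1/2 and c = 1, and a target whose score ∇ log p can be evaluated (here a standard Gaussian, so ∇ log p(x) = −x). It uses the standard expansion k_0(x, y) = ⟨∇_x, ∇_y⟩k + ⟨∇_x k, ∇ log p(y)⟩ + ⟨∇_y k, ∇ log p(x)⟩ + k(x, y)⟨∇ log p(x), ∇ log p(y)⟩ and S(Q_n, T_P, G_k)^2 = (1/n^2) Σ_{i,i'} k_0(x_i, x_{i'}); treat it as an illustration rather than the authors' implementation.

# Sketch of the closed-form kernel Stein discrepancy with an IMQ base kernel.
import numpy as np

def imq_stein_kernel_matrix(X, score, c=1.0, beta=-0.5):
    """Pairwise Stein kernel values k_0(x_i, x_j) for k(x, y) = (c^2 + ||x - y||^2)^beta."""
    n, d = X.shape
    B = score(X)                                    # rows are grad log p(x_i)
    diffs = X[:, None, :] - X[None, :, :]           # (n, n, d) array of x_i - x_j
    sqdists = np.sum(diffs ** 2, axis=-1)
    s = c ** 2 + sqdists
    K = s ** beta                                   # base kernel values
    gradx_K = 2 * beta * s[..., None] ** (beta - 1) * diffs       # grad_x k(x_i, x_j)
    trace = (-4 * beta * (beta - 1) * s ** (beta - 2) * sqdists
             - 2 * beta * d * s ** (beta - 1))                    # <grad_x, grad_y> k
    cross = (np.einsum('ijk,jk->ij', gradx_K, B)                  # <grad_x k, score(x_j)>
             - np.einsum('ijk,ik->ij', gradx_K, B))               # <grad_y k, score(x_i)>
    return trace + cross + K * (B @ B.T)

def ksd(X, score, **kw):
    K0 = imq_stein_kernel_matrix(X, score, **kw)
    return np.sqrt(K0.sum()) / X.shape[0]           # S(Q_n, T_P, G_k) for uniform weights

rng = np.random.default_rng(0)
score = lambda X: -X                                # standard Gaussian target N(0, I_d)
X_on = rng.standard_normal((200, 2))                # on-target sample
print(ksd(X_on, score))                             # small
print(ksd(X_on + 3.0, score))                       # larger: the same draws shifted off target

Only ∇ log p enters, so the same routine applies when p is known only up to normalization.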

43 Detecting Non-convergence
Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P
Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017]): Suppose d ≥ 3, P = N(0, I_d), and α > 2d/(d−2). If k(x, y) and its derivatives decay at an o(‖x − y‖_2^{−α}) rate as ‖x − y‖_2 → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.
- Gaussian (k(x, y) = e^{−‖x−y‖_2^2/2}) and Matérn kernels fail for d ≥ 3
- Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖_2^2)^β) with β < −1 fail for d > 2β/(1+β)
- The violating sample sequences (Q_n)_{n≥1} are simple to construct
Problem: Kernels with light tails ignore excess mass in the tails

45 Detecting Non-convergence
Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
Consider the inverse multiquadric (IMQ) kernel k(x, y) = (c^2 + ‖x − y‖_2^2)^β for some β < 0, c ∈ R.
- The IMQ KSD fails to detect non-convergence when β < −1
- However, the IMQ KSD detects non-convergence when β ∈ (−1, 0)
Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017]): Suppose P ∈ 𝒫 (the target class defined on slide 65) and k(x, y) = (c^2 + ‖x − y‖_2^2)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
- No extra assumptions on the sample sequence (Q_n)_{n≥1} are needed
- Proof sketch: the slow decay rate of the kernel yields unbounded (coercive) test functions in T_P G_k, which detect excess mass in the tails

47 Detecting Convergence
Goal: Show S(Q_n, T_P, G_k) → 0 when Q_n converges to P
Proposition (KSD detects convergence [Gorham and Mackey, 2017]): If k ∈ C_b^{(2,2)} and ∇ log p is Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W_2}(Q_n, P) → 0.
Covers the Gaussian, Matérn, IMQ, and other common bounded kernels k

48 A Simple Example
[Figure, left: discrepancy value vs. number of sample points n for points drawn i.i.d. from the mixture target P and i.i.d. from a single mixture component; right: recovered functions g and h = T_P g vs. x]
Left plot: For the target p(x) ∝ e^{−(x+1.5)^2/2} + e^{−(x−1.5)^2/2}, compare an i.i.d. sample from P with an i.i.d. sample from a single mixture component (sketched in code below)
- Expect the discrepancy to converge to 0 for the on-target sample and not for the off-target sample
- Compare the IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance
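
A sketch of this comparison (with illustrative sample sizes), reusing the ksd() helper from the earlier KSD sketch; the score of the unnormalized mixture density follows directly from differentiation.

# Sketch of the bimodal-mixture example: on-target vs. single-component samples.
import numpy as np

def mixture_score(X):                               # X has shape (n, 1)
    a = np.exp(-0.5 * (X + 1.5) ** 2)
    b = np.exp(-0.5 * (X - 1.5) ** 2)
    return (-(X + 1.5) * a - (X - 1.5) * b) / (a + b)   # grad log p for the mixture

rng = np.random.default_rng(0)
n = 1000
means = rng.choice([-1.5, 1.5], size=(n, 1))
on_target = means + rng.standard_normal((n, 1))     # i.i.d. from the mixture target
off_target = 1.5 + rng.standard_normal((n, 1))      # i.i.d. from a single component
print(ksd(on_target, mixture_score))                # shrinks as n grows
print(ksd(off_target, mixture_score))               # stays bounded away from 0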

49 A Simple Example
[Figure: recovered optimal Stein functions g (top) and test functions h = T_P g (bottom) as functions of x]
Right plot: For n = 10^3 sample points,
- (Top) the recovered optimal Stein functions g ∈ G_k
- (Bottom) the associated test functions h ≜ T_P g which best discriminate the sample Q_n from the target P

50 The Importance of Kernel Choice
[Figure: kernel Stein discrepancy vs. number of sample points n for the Gaussian, Matérn, and inverse multiquadric kernels in dimensions d = 5, 8, 20, comparing an i.i.d. sample from the target P with an off-target sample]
- Target P = N(0, I_d); the off-target Q_n has all ‖x_i‖_2 ≤ 2 n^{1/d} log n and ‖x_i − x_j‖_2 ≥ 2 log n
- The Gaussian and Matérn KSDs are driven to 0 by this off-target sequence, which does not converge to P
- The IMQ KSD (β = −1/2, c = 1) does not have this deficiency

51 Selecting Sampler Hyperparameters
Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]
- Approximate MCMC procedure designed for scalability
- A random subset of datapoints is used to approximate each sampling step
- Target P is not the stationary distribution
- Tolerance parameter ε controls the number of datapoints evaluated: ε too small → too few sample points generated; ε too large → sampling from a very different distribution
- Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias

52 Selecting Sampler Hyperparameters
[Figure: ESS (higher is better) and IMQ KSD (lower is better) as functions of the tolerance parameter ε, with posterior samples shown for ε = 0 (n = 230), ε = 10^{-2} (n = 416), and ε = 10^{-1} (n = 1000)]
- ESS is maximized at tolerance ε = 10^{-1}
- IMQ KSD is minimized at tolerance ε = 10^{-2}

53 Selecting Samplers
Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]
- Approximate MCMC procedure designed for scalability
- Approximates the Metropolis-adjusted Langevin algorithm but does not use a Metropolis-Hastings correction
- Target P is not the stationary distribution
Goal: Choose between two variants
- SGFS-f inverts a d × d matrix for each new sample point
- SGFS-d inverts a diagonal matrix to reduce sampling time
MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: images with 51 features and a binary label indicating whether the image shows a 7 or a 9; Bayesian logistic regression posterior P

54 Selecting Samplers
[Figure, left: IMQ kernel Stein discrepancy vs. number of sample points n for the samplers SGFS-d and SGFS-f; right: bivariate marginals of SGFS-d and SGFS-f sample points that align best and worst with the surrogate ground truth]
Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)
Right: SGFS sample points with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with the surrogate ground truth sample (red)
Both suggest that the small speed-up of SGFS-d (0.0017s per sample) over SGFS-f is outweighed by the loss in inferential accuracy

56 Beyond Sample Quality Comparison
One-sample hypothesis testing
- Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Q_n, T_P, G_k) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016]); a test sketch follows below
- The test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased
- We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)
- For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, where z_i iid N(0, I_d) and u_i iid Unif[0, 1]; target P = N(0, I_d)
- Compare with the standard normality test of Baringhaus and Henze [1988]
[Table: mean power of the multivariate normality tests (B&H, Gaussian KSD, IMQ KSD) across 400 simulations for d = 2, 5, 10, 15, 20, 25]
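
A hedged sketch of such a one-sample test, reusing imq_stein_kernel_matrix() from the earlier KSD sketch: the statistic is the V-statistic n · S(Q_n, T_P, G_k)^2, and its null distribution is approximated here with a Rademacher multiplier bootstrap, a simplified stand-in for the wild bootstrap of Chwialkowski, Strathmann, and Gretton [2016] (which also handles correlated samples), so treat it as illustrative.

# Sketch of a KSD goodness-of-fit test with a multiplier-bootstrap null approximation.
import numpy as np

def ksd_gof_pvalue(X, score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K0 = imq_stein_kernel_matrix(X, score)
    stat = K0.sum() / n                             # n * S(Q_n)^2
    boots = np.empty(n_boot)
    for b in range(n_boot):
        eps = rng.choice([-1.0, 1.0], size=n)       # Rademacher multipliers
        boots[b] = eps @ K0 @ eps / n               # bootstrap replicate of the null statistic
    return np.mean(boots >= stat)

rng = np.random.default_rng(1)
d, n = 5, 500
Z = rng.standard_normal((n, d))                     # H0 true: sample from N(0, I_d)
X = Z.copy()
X[:, 0] += rng.uniform(0.0, 1.0, size=n)            # perturbed sample, as in the experiment above
print(ksd_gof_pvalue(Z, lambda Y: -Y))              # typically large
print(ksd_gof_pvalue(X, lambda Y: -Y))              # typically small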

57 Beyond Sample Quality Comparison
Improving sample quality
- Given sample points (x_i)_{i=1}^n, we can minimize the KSD S(Q̃_n, T_P, G_k) over all weighted samples Q̃_n = Σ_{i=1}^n q_n(x_i) δ_{x_i}, for q_n a probability mass function (an optimization sketch follows below)
- Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−(1/h)‖x−y‖_2^2}, with the bandwidth h set to the median of the squared Euclidean distance between pairs of sample points
- We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)‖x−y‖_2^2)^{−1/2}
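
A hedged sketch of such a reweighting, assuming K0 is the n × n Stein kernel matrix from the earlier KSD sketch: the squared weighted KSD is the convex quadratic q^T K0 q, and this illustrative optimizer (not the authors' method) runs exponentiated gradient descent over the probability simplex.

# Sketch: choose sample weights q on the simplex to reduce the weighted KSD.
import numpy as np

def minimize_ksd_weights(K0, n_iters=500, step=0.5):
    n = K0.shape[0]
    q = np.full(n, 1.0 / n)                         # start from the uniform weighting Q_n
    for _ in range(n_iters):
        grad = 2.0 * K0 @ q                         # gradient of q^T K0 q
        scaled = step * (grad - grad.min()) / (np.ptp(grad) + 1e-12)
        q *= np.exp(-scaled)                        # multiplicative (mirror descent) update
        q /= q.sum()                                # renormalize onto the simplex
    return q

# Usage: q = minimize_ksd_weights(K0); np.sqrt(q @ K0 @ q) is typically smaller than the
# uniform-weight KSD np.sqrt(K0.mean()).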

58 Improving Sample Quality
[Figure: average MSE ‖E_P[Z] − E_{Q̃_n}[X]‖_2^2 vs. dimension d for the initial sample Q_n and for the Gaussian KSD and IMQ KSD reweightings]
- MSE averaged over 500 simulations (±2 standard errors)
- Target P = N(0, I_d)
- Starting sample Q_n = (1/n) Σ_{i=1}^n δ_{x_i} for x_i drawn i.i.d. from P, n = 100

59 Future Directions
Many opportunities for future development:
1. Improve the scalability of the KSD while maintaining its convergence-determining properties
   - Low-rank, sparse, or stochastic approximations of the kernel matrix
   - Subsampling of likelihood terms in ∇ log p
2. Addressing other inferential tasks
   - Design of control variates [Oates, Girolami, and Chopin, 2016, Oates and Girolami, 2016]
   - Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]
   - Training generative adversarial networks [Wang and Liu, 2016]
3. Exploring the impact of Stein operator choice
   - An infinite number of operators T characterize P. How is the discrepancy impacted? How do we select the best T?
   - Thm: If ∇ log p is bounded and k ∈ C_0^{(1,1)}, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
   - Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be more appropriate for these heavy-tailed targets

60 References I
S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, 2012.
A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., Special Vol. 25A, 1988. A celebration of applied probability.
A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3), 1990.
L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1), 1988.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, 2016.
C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, 2014.
J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Adv. NIPS 28. Curran Associates, Inc., 2015.
J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv preprint, Mar. 2017.
J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv preprint, Nov. 2016.
F. Götze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2), 1991.
A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, 2014.
Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv preprint, 2016.
Q. Liu and J. Lee. Black-box importance sampling. arXiv preprint, Oct. 2016. To appear in AISTATS 2017.
Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv preprint, Aug. 2016.
Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, 2016.
L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv preprint.
A. Müller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2), 1997.

61 References II
C. Oates and M. Girolami. Control functionals for quasi-Monte Carlo integration. In Proc. 19th AISTATS, volume 51 of Proceedings of Machine Learning Research, pages 56-65, Cadiz, Spain, 2016. PMLR.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability, Vol. II: Probability theory. Univ. California Press, Berkeley, Calif., 1972.
C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH, 2004.
D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint, Nov. 2016.
M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

62 Comparing Discrepancies
[Figure: discrepancy value and computation time (sec) vs. number of sample points n for the IMQ KSD, graph Stein discrepancy, and Wasserstein distance, in d = 1 and d = 4, for samples drawn i.i.d. from the mixture target P or from a single mixture component]
Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(x+1.5)^2/2} + e^{−(x−1.5)^2/2} or a single mixture component.
Right: Discrepancy computation time using d cores in d dimensions.

63 Selecting Sampler Hyperparameters
Setup [Welling and Teh, 2011]
- Consider the posterior distribution P induced by datapoints y_l drawn i.i.d. from the Gaussian mixture likelihood Y_l | X ~ (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2) under Gaussian priors on the parameters X ∈ R^2: X_1 ~ N(0, 10), X_2 ~ N(0, 1)
- Draw 100 datapoints y_l with parameters (x_1, x_2) = (0, 1); this induces a posterior with a second mode at (x_1, x_2) = (1, −1)
- For a range of tolerance parameters ε, run approximate slice sampling for a fixed budget of datapoint likelihood evaluations and store the resulting posterior sample Q_n
- Use the minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε (see the sketch below)
- Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
- Compute the median of each diagnostic over 50 random sequences
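
A hedged sketch of this selection rule: ksd() comes from the earlier KSD sketch, run_sampler() is a hypothetical placeholder for the approximate slice sampler (not a real API), and the standard-Gaussian score used here stands in for the actual posterior score.

# Sketch: pick the tolerance whose sample attains the smallest KSD.
import numpy as np

def run_sampler(eps, budget):
    # Placeholder (hypothetical): substitute the approximate MCMC procedure of interest,
    # returning an (n, d) array of draws produced under tolerance eps with the given
    # likelihood-evaluation budget.
    rng = np.random.default_rng(abs(hash(eps)) % 2**32)
    return rng.standard_normal((max(budget // 1000, 10), 2))

candidate_eps = [0.0, 1e-6, 1e-4, 1e-2, 1e-1]
samples = {eps: run_sampler(eps, budget=10 ** 5) for eps in candidate_eps}
best_eps = min(candidate_eps, key=lambda eps: ksd(samples[eps], score=lambda X: -X))
print(best_eps)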

64 Selecting Samplers
Setup
- MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]: images with 51 features and a binary label indicating whether the image shows a 7 or a 9
- Bayesian logistic regression posterior P: L independent observations (y_l, v_l) ∈ {−1, 1} × R^d with P(Y_l = 1 | v_l, X) = 1/(1 + exp(−⟨v_l, X⟩)) and a flat improper prior on the parameters X ∈ R^d (the corresponding score function is sketched below)
- Use the IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10^5 sample points and discarding the first half as burn-in
- For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10^5 sample points [Ahn, Korattikara, and Welling, 2012]
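
To make the KSD concrete for this model, here is a hedged sketch of the posterior score ∇_x log p(x) that the discrepancy needs, assuming labels y_l ∈ {−1, 1}, feature vectors v_l, and the flat prior stated above; V and y are assumed data arrays, not part of the talk.

# Sketch of the Bayesian logistic regression score under a flat prior:
#   grad_x log p(x | data) = sum_l y_l * sigmoid(-y_l <v_l, x>) * v_l.
import numpy as np

def logistic_posterior_score(X, V, y):
    """Rows are grad_x log p(x_i | data) for each sample point x_i (X is n x d, V is L x d)."""
    margins = y[None, :] * (X @ V.T)                # (n, L) array of y_l <v_l, x_i>
    weights = 1.0 / (1.0 + np.exp(margins))         # sigmoid(-margin)
    return (weights * y[None, :]) @ V               # (n, d) matrix of scores

# This plugs into the earlier KSD sketch, e.g.
#   ksd(X_sgfs_f, lambda X: logistic_posterior_score(X, V, y))
# to score SGFS-f and SGFS-d samples against the same posterior.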

65 Lower Bounding the Kernel Stein Discrepancy
Goal: Show S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges to P
- Let 𝒫 be the set of targets P with Lipschitz ∇ log p and distant strong log concavity (⟨∇ log(p(x)/p(y)), y − x⟩ / ‖x − y‖_2^2 ≥ k for ‖x − y‖_2 ≥ r)
- Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...
- For a different Stein set G, Gorham, Duncan, Vollmer, and Mackey [2016] showed that (Q_n)_{n≥1} converges to P if P ∈ 𝒫 and S(Q_n, T_P, G) → 0
New contribution [Gorham and Mackey, 2017]:
Theorem (Univariate KSD detects non-convergence): Suppose P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C^2 with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.
- Justifies the use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case

66 The Importance of Tightness
Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P
- A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(‖X‖_2 > R(ε)) ≤ ε
- Intuitively, no mass in the sequence escapes to infinity
Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017]): Suppose that P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C^2 with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.
Good news, but, ideally, the KSD would detect non-tight sequences automatically...


Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)

The Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Kernel Learning via Random Fourier Representations

Kernel Learning via Random Fourier Representations Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random

More information

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Changyou Chen Department of Electrical and Computer Engineering, Duke University cc448@duke.edu Duke-Tsinghua Machine Learning Summer

More information

The Recycling Gibbs Sampler for Efficient Learning

The Recycling Gibbs Sampler for Efficient Learning The Recycling Gibbs Sampler for Efficient Learning L. Martino, V. Elvira, G. Camps-Valls Universidade de São Paulo, São Carlos (Brazil). Télécom ParisTech, Université Paris-Saclay. (France), Universidad

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Learning Interpretable Features to Compare Distributions

Learning Interpretable Features to Compare Distributions Learning Interpretable Features to Compare Distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London Theory of Big Data, 2017 1/41 Goal of this talk Given: Two collections

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Learning features to compare distributions

Learning features to compare distributions Learning features to compare distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS 2016 Workshop on Adversarial Learning, Barcelona Spain 1/28 Goal of this

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

Differential Stein operators for multivariate continuous distributions and applications

Differential Stein operators for multivariate continuous distributions and applications Differential Stein operators for multivariate continuous distributions and applications Gesine Reinert A French/American Collaborative Colloquium on Concentration Inequalities, High Dimensional Statistics

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information