Scaling up Bayesian Inference

David Dunson
Departments of Statistical Science, Mathematics & ECE, Duke University
May 1, 2017

Outline
- Motivation & background
- EP-MCMC
- aMCMC
- Discussion

Complex & high-dimensional data
- Interest in developing new methods for analyzing & interpreting complex, high-dimensional data
- Such data arise routinely across the sciences, engineering & even the arts & humanities
- Despite huge interest in big data, vast gaps have fundamentally limited progress in many fields

Typical approaches to big data
- There is an increasingly immense literature focused on big data
- Most of the focus has been on optimization-style methods
- Rapidly obtaining a point estimate even when the sample size n & overall size of the data are immense
- Bandwagons: many people work on quite similar problems, while critical open problems remain untouched

My focus: probability models
- General probabilistic inference algorithms for complex data
- We would like to handle arbitrarily complex probability models
- Algorithms scalable to huge data, potentially using many computers
- Accurate uncertainty quantification (UQ) is a critical issue
- Robustness of inferences is also crucial
- Particular emphasis on scientific applications with limited labeled data

Bayes approaches
- Bayesian methods offer an attractive general approach for modeling complex data
- Choosing a prior $\pi(\theta)$ & likelihood $L(Y^{(n)} \mid \theta)$, the posterior is
  $$\pi_n(\theta \mid Y^{(n)}) = \frac{\pi(\theta)\,L(Y^{(n)} \mid \theta)}{\int \pi(\theta)\,L(Y^{(n)} \mid \theta)\,d\theta} = \frac{\pi(\theta)\,L(Y^{(n)} \mid \theta)}{L(Y^{(n)})}.$$
- Often $\theta$ is moderate to high-dimensional & the integral in the denominator is intractable
- Accurate analytic approximations to the posterior have proven elusive outside of narrow settings
- Markov chain Monte Carlo (MCMC) & other posterior sampling algorithms remain the standard
- Scaling MCMC to big & complex settings is challenging
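
To make the normalization issue concrete, here is a minimal sketch (a toy one-dimensional model of my own, not from the talk) that computes a posterior by brute-force quadrature; the sum approximating the denominator is exactly the integral that becomes intractable once theta is high-dimensional.

```python
import numpy as np

# Toy 1-d model (not from the talk): N(theta, 1) likelihood, N(0, 1) prior.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=100)            # observed data Y^(n)

theta = np.linspace(-4.0, 6.0, 2001)                     # grid over theta
dtheta = theta[1] - theta[0]
log_prior = -0.5 * theta**2                              # N(0, 1) prior, up to a constant
log_lik = np.array([-0.5 * np.sum((y - t) ** 2) for t in theta])

log_post = log_prior + log_lik
log_post -= log_post.max()                               # stabilize before exponentiating
unnorm = np.exp(log_post)
Z = unnorm.sum() * dtheta                                # the troublesome normalizing integral
posterior = unnorm / Z

print("posterior mean:", np.sum(theta * posterior) * dtheta)
```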

MCMC & computational bottlenecks
- MCMC constructs a Markov chain with stationary distribution $\pi_n(\theta \mid Y^{(n)})$
- A transition kernel is carefully chosen & iterative sampling proceeds
- Time per iteration increases with the number of parameters/unknowns
- Mixing worsens as the dimension of the data increases
- Storing & basic processing of big data sets is problematic
- Usually multiple likelihood and/or gradient evaluations at each iteration

Solutions
- Embarrassingly parallel (EP) MCMC: run MCMC in parallel on different subsets of the data & combine
- Approximate MCMC (aMCMC): approximate expensive-to-evaluate transition kernels
- Hybrid algorithms: run MCMC for a subset of the parameters & use a fast estimate for the others
- Designer MCMC: define clever kernels that solve mixing problems in high dimensions
- I'll focus on EP-MCMC & aMCMC in the remainder

EP-MCMC

Embarrassingly parallel MCMC
- Divide a large sample size n data set into many smaller data sets stored on different machines
- Draw posterior samples for each subset posterior in parallel
- Magically combine the results quickly & simply
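
A minimal sketch of the embarrassingly parallel pattern, assuming a hypothetical run_subset_mcmc worker (here a stand-in that returns synthetic draws rather than a real sampler): shard the data, run one chain per shard, then hand the per-shard draws to a combination step such as WASP or PIE.

```python
import numpy as np
from multiprocessing import Pool

def run_subset_mcmc(subset):
    # Placeholder for a real MCMC run on one data shard (Stan, PyMC, or a
    # hand-written sampler targeting the subset posterior); here it returns
    # synthetic "draws" so the orchestration pattern is runnable end to end.
    rng = np.random.default_rng()
    return rng.normal(loc=subset.mean(), scale=1.0 / np.sqrt(len(subset)), size=1000)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.normal(size=100_000)                      # full data set
    k = 8                                             # number of shards / machines
    shards = np.array_split(rng.permutation(y), k)

    with Pool(processes=k) as pool:                   # embarrassingly parallel step
        subset_draws = pool.map(run_subset_mcmc, shards)
    # subset_draws[j] holds draws from the j-th subset posterior, ready to be
    # combined (e.g., by WASP or PIE, described next).
```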

Toy example: logistic regression

[Figure: subset posteriors, full-data MCMC & WASP for a regression coefficient $\beta_1$]

$$\mathrm{pr}(y_i = 1 \mid x_{i1}, \ldots, x_{ip}, \theta) = \frac{\exp\!\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)}{1 + \exp\!\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)}.$$

- Subset posteriors: noisy approximations of the full-data posterior
- Averaging the subset posteriors reduces this noise & leads to an accurate posterior approximation

Stochastic approximation
- Full-data posterior density for independent, non-identically distributed (inid) data $Y^{(n)}$:
  $$\pi_n(\theta \mid Y^{(n)}) = \frac{\prod_{i=1}^{n} p_i(y_i \mid \theta)\,\pi(\theta)}{\int_\Theta \prod_{i=1}^{n} p_i(y_i \mid \theta)\,\pi(\theta)\,d\theta}.$$
- Divide the full data $Y^{(n)}$ into k subsets of size m: $Y^{(n)} = (Y_{[1]}, \ldots, Y_{[j]}, \ldots, Y_{[k]})$
- Subset posterior density for the jth data subset:
  $$\pi_m^\gamma(\theta \mid Y_{[j]}) = \frac{\prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^\gamma\,\pi(\theta)}{\int_\Theta \prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^\gamma\,\pi(\theta)\,d\theta}.$$
- $\gamma = O(k)$, chosen to minimize approximation error
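
As a sketch of the stochastic approximation, the function below (toy normal model and prior of my own choosing) evaluates the powered subset log-posterior with gamma = k, so that each subset behaves as if it had seen roughly the full sample size.

```python
import numpy as np

def powered_subset_log_post(theta, y_subset, k):
    """Log of the stochastic-approximation subset posterior, up to a constant:
    gamma * sum_{i in subset} log p_i(y_i | theta) + log pi(theta), with gamma = k
    so the subset 'sees' roughly the full sample size.
    Toy model: N(theta, 1) likelihood, N(0, 10^2) prior."""
    gamma = k
    log_lik = -0.5 * np.sum((y_subset - theta) ** 2)
    log_prior = -0.5 * theta**2 / 100.0
    return gamma * log_lik + log_prior
```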

Barycenter in metric spaces

[Figures: illustration of barycenters in metric spaces]

WAsserstein barycenter of Subset Posteriors (WASP) [Srivastava, Li & Dunson (2015)]
- 2-Wasserstein distance between $\mu, \nu \in \mathcal{P}_2(\Theta)$:
  $$W_2(\mu,\nu) = \inf\left\{ \left(E[d^2(X,Y)]\right)^{1/2} : \mathrm{law}(X) = \mu,\ \mathrm{law}(Y) = \nu \right\}.$$
- The subset posteriors $\Pi_m^\gamma(\cdot \mid Y_{[j]})$, $j = 1, \ldots, k$, are combined through their Wasserstein barycenter (WASP):
  $$\overline{\Pi}_n^\gamma(\cdot \mid Y^{(n)}) = \arg\min_{\Pi \in \mathcal{P}_2(\Theta)} \frac{1}{k} \sum_{j=1}^{k} W_2^2\!\left(\Pi, \Pi_m^\gamma(\cdot \mid Y_{[j]})\right). \quad \text{[Agueh & Carlier (2011)]}$$
- Plugging in empirical versions of $\Pi_m^\gamma(\cdot \mid Y_{[j]})$, $j = 1, \ldots, k$, a linear program (LP) gives fast estimation of an atomic approximation
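
In one dimension, and for two equally weighted samples of the same size, the 2-Wasserstein distance reduces to matching order statistics; a small sketch of this special case (for illustration only, not the general barycenter computation):

```python
import numpy as np

def wasserstein2_1d(x, y):
    """2-Wasserstein distance between two equally weighted empirical measures
    with the same number of atoms on the real line: the optimal coupling simply
    matches order statistics."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape
    return np.sqrt(np.mean((x - y) ** 2))

# e.g., distance between two subset-posterior samples of a scalar functional
rng = np.random.default_rng(2)
print(wasserstein2_1d(rng.normal(0.0, 1.0, 5000), rng.normal(0.1, 1.0, 5000)))
```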

LP estimation of WASP
- Minimizing the Wasserstein distance is the solution to a discrete optimal transport problem
- Let $\mu = \sum_{j=1}^{J_1} a_j \delta_{\theta_{1j}}$, $\nu = \sum_{l=1}^{J_2} b_l \delta_{\theta_{2l}}$ & $M_{12} \in \mathbb{R}^{J_1 \times J_2}$ = matrix of squared differences between the atoms $\{\theta_{1j}\}, \{\theta_{2l}\}$
- Optimal transport polytope: $\mathcal{T}(a,b)$ = set of nonnegative matrices with row sums $a$ & column sums $b$
- Objective is to find $T \in \mathcal{T}(a,b)$ minimizing $\mathrm{tr}(T^\top M_{12})$
- For WASP, generalize to a multimargin optimal transport problem; entropy smoothing has been used previously
- We can avoid such smoothing & use sparse LP solvers; the computational cost is negligible compared to sampling
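
A toy sketch of the two-margin transport LP (my own example, solved with SciPy's HiGHS backend rather than the sparse solvers referenced in the talk): minimize tr(T'M) over the transport polytope for two small atomic measures.

```python
import numpy as np
from scipy.optimize import linprog

def transport_plan(a, b, atoms_mu, atoms_nu):
    """Discrete optimal transport LP: minimize tr(T' M) over nonnegative T
    with row sums a and column sums b; M holds squared differences between
    the atoms of the two measures. Returns the plan and the squared W2."""
    J1, J2 = len(a), len(b)
    M = (atoms_mu[:, None] - atoms_nu[None, :]) ** 2
    c = M.ravel()                                  # objective on vec(T), row-major

    A_rows = np.zeros((J1, J1 * J2))               # sum_l T[j, l] = a[j]
    for j in range(J1):
        A_rows[j, j * J2:(j + 1) * J2] = 1.0
    A_cols = np.zeros((J2, J1 * J2))               # sum_j T[j, l] = b[l]
    for l in range(J2):
        A_cols[l, l::J2] = 1.0

    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.x.reshape(J1, J2), res.fun

a, b = np.full(4, 0.25), np.full(3, 1.0 / 3.0)
T, w2_sq = transport_plan(a, b, np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.5, 1.5, 2.5]))
print(np.round(T, 3), w2_sq)
```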

WASP: theorems

Theorem (Subset posteriors). Under the usual regularity conditions, there exists a constant $C_1$, independent of the subset posteriors, such that for large m,
$$E_{P^{[j]}_{\theta_0}}\, W_2^2\!\left\{ \Pi_m^\gamma(\cdot \mid Y_{[j]}),\ \delta_{\theta_0} \right\} \le C_1 \left(\frac{\log^2 m}{m}\right)^{1/\alpha}, \quad j = 1, \ldots, k.$$

Theorem (WASP). Under the usual regularity conditions and for large m,
$$W_2\!\left\{ \overline{\Pi}_n^\gamma(\cdot \mid Y^{(n)}),\ \delta_{\theta_0} \right\} = O_{P^{(n)}_{\theta_0}}\!\left( \frac{\log^{2/\alpha} m}{(km)^{1/\alpha}} \right).$$

Simple & fast Posterior Interval Estimation (PIE) [Li, Srivastava & Dunson (2015)]
- We usually report point & interval estimates for various 1-d functionals; a multidimensional posterior is difficult to interpret
- The WASP has an explicit relationship with the subset posteriors in 1-d
- Quantiles of the WASP are simple averages of the quantiles of the subset posteriors
- Leads to a very simple algorithm: run MCMC for each subset & average the quantiles, reminiscent of the bag of little bootstraps
- Strong theory showing accuracy of the resulting approximation
- Can be implemented in Stan, which allows powered likelihoods
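
PIE itself is nearly a one-liner; a sketch assuming subset_draws holds one array of MCMC draws per subset for a scalar functional (as produced, e.g., by the EP-MCMC sketch earlier):

```python
import numpy as np

def pie_interval(subset_draws, probs=(0.025, 0.5, 0.975)):
    """PIE for a scalar functional: quantiles of the combined (1-d WASP)
    posterior are simple averages of the corresponding subset-posterior
    quantiles. subset_draws is a list of 1-d arrays of MCMC draws, one per subset."""
    q = np.array([np.quantile(draws, probs) for draws in subset_draws])  # k x len(probs)
    return q.mean(axis=0)

# usage with the per-shard draws from the EP-MCMC sketch earlier:
# lower, median, upper = pie_interval(subset_draws)
```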

Theory on PIE / 1-d WASP
- We show the 1-d WASP $\overline{\Pi}_n(\xi \mid Y^{(n)})$ is a highly accurate approximation to the exact posterior $\Pi_n(\xi \mid Y^{(n)})$
- As the subset sample size m increases, the $W_2$ distance between them decreases faster than the parametric rate, i.e. it is $o_p(n^{-1/2})$
- The theorem allows $k = O(n^c)$ and $m = O(n^{1-c})$ for any $c \in (0,1)$, so m can increase very slowly relative to k (recall $n = mk$)
- Their biases, variances & quantiles differ only in higher orders of the total sample size
- Conditions: standard, mild conditions on the likelihood, plus a prior with finite 2nd moment & uniform integrability of the subset posteriors

Results
- We have implemented this for a rich variety of data & models
- Logistic & linear random effects models, mixture models, matrix & tensor factorizations, Gaussian process regression
- Nonparametric models, dependence, hierarchical models, etc.
- We compare to long runs of MCMC (when feasible) & to variational Bayes (VB)
- WASP/PIE is much faster than MCMC & highly accurate
- Carefully designed VB implementations often do very well

aMCMC

aMCMC [Johndrow, Mattingly, Mukherjee & Dunson (2015)]
- A different way to speed up MCMC: replace expensive transition kernels with approximations
- For example, approximate a conditional distribution in a Gibbs sampler with a Gaussian, or use a subsample of the data
- Can potentially vastly speed up MCMC sampling in high-dimensional settings
- The original MCMC sampler converges to a stationary distribution corresponding to the exact posterior
- It is not clear what happens when we start substituting in approximations; the chain may diverge, etc.
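
One generic example of such an approximating kernel (a simplified sketch of my own, not the exact construction analyzed in the paper): when a Gibbs step needs a latent count whose exact simulation requires touching every observation, a moment-matched Gaussian draw is essentially free.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_latent_count(n, p, exact=True):
    """Gibbs step for a latent Binomial(n, p) count. Drawing n individual
    latent indicators and summing them costs O(n); a moment-matched Gaussian
    (CLT) draw of the count is O(1) and introduces a small, controllable error."""
    if exact:
        return rng.binomial(n, p)   # stand-in for summing n per-observation draws
    mean, sd = n * p, np.sqrt(n * p * (1.0 - p))
    return int(np.clip(np.rint(rng.normal(mean, sd)), 0, n))

print(draw_latent_count(10_000_000, 0.3, exact=False))
```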

aMCMC overview
- aMCMC is used routinely in an essentially ad hoc manner
- Our goal: obtain theoretical guarantees & use them to guide the design of algorithms
- Define an exact MCMC algorithm, which is computationally intractable but has good mixing
- The exact chain converges to a stationary distribution corresponding to the exact posterior
- Approximate the kernel in the exact chain with a more computationally tractable alternative

Sketch of theory
- Define $s_\epsilon = \tau_1(P)/\tau_1(P_\epsilon)$ = computational speed-up, where $\tau_1(P)$ = time for one step with transition kernel $P$
- Interest: optimizing the computational time-accuracy tradeoff for estimators of $\Pi f = \int_\Theta f(\theta)\,\Pi(d\theta \mid x)$
- We provide tight, finite-sample bounds on the $L_2$ error
- aMCMC estimators win for low computational budgets but have asymptotic bias
- Often a larger approximation error gives a larger $s_\epsilon$, & rougher approximations are better when speed is paramount

Ex 1: approximations using subsets
- Replace the full-data likelihood with
  $$L_\epsilon(x \mid \theta) = \left( \prod_{i \in V} L(x_i \mid \theta) \right)^{n/|V|}$$
  for a randomly chosen subset $V \subset \{1, \ldots, n\}$
- Applied to Pólya-Gamma data augmentation for logistic regression
- Using a different $V$ at each iteration is a trivial modification to the Gibbs sampler
- The assumptions hold with high probability for subsets above a minimal size (with respect to the distribution of subsets, data & kernel)
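
A sketch of the subsampled, powered log-likelihood (with a toy normal likelihood standing in for the Pólya-Gamma logistic model; the n/|V| power rescales the subset so its information content matches the full data):

```python
import numpy as np

def approx_log_lik(theta, x, subset_size, rng):
    """Approximate full-data log-likelihood from a random subset V:
    log L_eps(x | theta) = (n / |V|) * sum_{i in V} log L(x_i | theta).
    Toy N(theta, 1) likelihood; a fresh V is drawn at every iteration."""
    n = len(x)
    V = rng.choice(n, size=subset_size, replace=False)
    return (n / subset_size) * np.sum(-0.5 * (x[V] - theta) ** 2)
```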

Application to the SUSY data set
- n = 5,000,000 (0.5 million held out for testing), binary outcome & 18 continuous covariates
- Considered subset sizes ranging from |V| = 1,000 to 4,500,000
- Considered different losses as a function of |V|
- The rate at which the loss goes to 0 with $\epsilon$ depends heavily on the loss
- For a small computational budget & a focus on posterior mean estimation, small subsets are preferred
- As the budget increases & the loss focuses more on the tails (e.g., for interval estimation), the optimal |V| increases

Application 2: mixture models & tensor factorizations
- We also considered a nonparametric Bayes model,
  $$\mathrm{pr}(y_{i1} = c_1, \ldots, y_{ip} = c_p) = \sum_{h=1}^{k} \lambda_h \prod_{j=1}^{p} \psi^{(j)}_{h c_j},$$
  a very useful model for multivariate categorical data
- Dunson & Xing (2009) give a data augmentation Gibbs sampler
- Sampling the latent classes is computationally prohibitive for huge n
- Use an adaptive Gaussian approximation to avoid sampling individual latent classes
- We have verified Assumptions 1-2; the Assumption 2 result is more general than this setting
- Improved computational performance for large n
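
A small sketch of evaluating a cell probability under this model (toy dimensions and randomly generated parameters, purely for illustration):

```python
import numpy as np

def cell_probability(c, lam, psi):
    """pr(y_1 = c_1, ..., y_p = c_p) = sum_h lam[h] * prod_j psi[j][h, c_j]
    for the mixture of product-multinomials (tensor factorization) model.
    lam: (k,) mixture weights; psi: list of p arrays, each of shape (k, d_j)."""
    p = len(c)
    return sum(lam[h] * np.prod([psi[j][h, c[j]] for j in range(p)])
               for h in range(len(lam)))

# toy check: k = 2 components, p = 3 binary variables
rng = np.random.default_rng(4)
lam = np.array([0.6, 0.4])
psi = [rng.dirichlet(np.ones(2), size=2) for _ in range(3)]  # rows sum to 1
print(cell_probability((0, 1, 0), lam, psi))
```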

Application 3: low-rank approximation to a GP
- Gaussian process regression: $y_i = f(x_i) + \eta_i$, $\eta_i \sim N(0, \sigma^2)$
- $f$ is given a GP prior with covariance $\tau^2 \exp(-\phi \|x_1 - x_2\|^2)$
- Discrete-uniform prior on $\phi$ & gamma priors on $\tau^2, \sigma^2$
- A marginal MCMC sampler updates $\phi, \tau^2, \sigma^2$
- We show Assumption 1 holds under mild regularity conditions on the truth, & Assumption 2 holds for a partial rank-r eigen approximation to $\Sigma$
- Less accurate approximations are clearly superior in practice for a small computational budget
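
A sketch of the rank-r eigen approximation to the squared-exponential covariance (for clarity it computes a full eigendecomposition and truncates; a practical implementation would use a partial eigensolver):

```python
import numpy as np

def lowrank_gp_cov(x, tau2, phi, r, sigma2):
    """Rank-r eigen approximation to the squared-exponential GP covariance
    tau2 * exp(-phi * (x_i - x_j)^2), plus noise variance on the diagonal.
    The rank-r factor lets downstream linear algebra work with an n x r matrix;
    a real implementation would use a partial eigensolver rather than the full
    eigendecomposition shown here."""
    K = tau2 * np.exp(-phi * (x[:, None] - x[None, :]) ** 2)
    evals, evecs = np.linalg.eigh(K)                 # eigenvalues in ascending order
    evals_r, evecs_r = evals[-r:], evecs[:, -r:]     # keep the top r eigenpairs
    K_r = (evecs_r * evals_r) @ evecs_r.T            # rank-r reconstruction
    return K_r + sigma2 * np.eye(len(x))

x = np.linspace(0.0, 1.0, 200)
Sigma_approx = lowrank_gp_cov(x, tau2=1.0, phi=10.0, r=20, sigma2=0.1)
```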

Applications: general conclusions
- Achieving uniform control of the approximation error $\epsilon$ requires approximations that adapt to the current state of the chain
- More accurate approximations are needed farther from the high-probability region of the posterior; this is good, as the chain is rarely there
- Approximations to conditionals of vector parameters are highly sensitive to the 2nd moment
- Smaller condition numbers for the covariance matrix of vector parameters mean less accurate approximations can be used

Discussion

Discussion
- We have proposed very general classes of scalable Bayes algorithms
- EP-MCMC & aMCMC: fast & scalable, with guarantees
- Interest in improving the theory: avoiding the reliance on asymptotics in EP-MCMC & weakening the assumptions in aMCMC
- Useful to combine algorithms, e.g. run aMCMC for each subset
- Looking at algorithms through this theoretical lens suggests new & improved algorithms
- Also very interested in hybrid frequentist-Bayes algorithms

Hybrid high-dimensional density estimation [Ye, Canale & Dunson (2016, AISTATS)]
- $y_i = (y_{i1}, \ldots, y_{ip})^T \sim f$ with p large & f an unknown density
- Could potentially use Dirichlet process mixtures of factor models
- That approach does not scale well at all with p
- Instead use a hybrid of Gibbs sampling & fast multiscale SVD
- Scalable, with excellent mixing & empirical/predictive performance

What about mixing?

[Figure: autocorrelation functions (ACF) vs. lag for the DA, MH, MVN & CDA samplers]

- So far we have put aside the mixing issues that can arise with big samples
- Slow mixing means we need many more MCMC samples for the same Monte Carlo error
- Common data augmentation algorithms for discrete data fail badly for large imbalanced data (Johndrow et al. 2016)
- But such problems can be fixed via calibration (Duan et al. 2016)
- Interesting area for further research

Primary References
- Duan L, Johndrow J & Dunson DB (2017). Calibrated data augmentation for scalable Markov chain Monte Carlo. arXiv preprint.
- Johndrow J, Mattingly J, Mukherjee S & Dunson DB (2015). Approximations of Markov chains and Bayesian inference. arXiv preprint.
- Johndrow J, Smith A, Pillai N & Dunson DB (2016). Inefficiency of data augmentation for large sample imbalanced data. arXiv preprint.
- Li C, Srivastava S & Dunson DB (2016). Simple, scalable and accurate posterior interval estimation. arXiv preprint; Biometrika, in press.


More information

Large-scale Ordinal Collaborative Filtering

Large-scale Ordinal Collaborative Filtering Large-scale Ordinal Collaborative Filtering Ulrich Paquet, Blaise Thomson, and Ole Winther Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark ulripa@microsoft.com,brmt2@cam.ac.uk,owi@imm.dtu.dk

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Bayesian Calibration of Simulators with Structured Discretization Uncertainty

Bayesian Calibration of Simulators with Structured Discretization Uncertainty Bayesian Calibration of Simulators with Structured Discretization Uncertainty Oksana A. Chkrebtii Department of Statistics, The Ohio State University Joint work with Matthew T. Pratola (Statistics, The

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Fitting Narrow Emission Lines in X-ray Spectra

Fitting Narrow Emission Lines in X-ray Spectra Outline Fitting Narrow Emission Lines in X-ray Spectra Taeyoung Park Department of Statistics, University of Pittsburgh October 11, 2007 Outline of Presentation Outline This talk has three components:

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

An example of Bayesian reasoning Consider the one-dimensional deconvolution problem with various degrees of prior information.

An example of Bayesian reasoning Consider the one-dimensional deconvolution problem with various degrees of prior information. An example of Bayesian reasoning Consider the one-dimensional deconvolution problem with various degrees of prior information. Model: where g(t) = a(t s)f(s)ds + e(t), a(t) t = (rapidly). The problem,

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

New Insights into History Matching via Sequential Monte Carlo

New Insights into History Matching via Sequential Monte Carlo New Insights into History Matching via Sequential Monte Carlo Associate Professor Chris Drovandi School of Mathematical Sciences ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Sequential Monte Carlo Methods for Bayesian Computation

Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

More information

Bayesian inference for multivariate extreme value distributions

Bayesian inference for multivariate extreme value distributions Bayesian inference for multivariate extreme value distributions Sebastian Engelke Clément Dombry, Marco Oesting Toronto, Fields Institute, May 4th, 2016 Main motivation For a parametric model Z F θ of

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Practical Bayesian Optimization of Machine Learning. Learning Algorithms

Practical Bayesian Optimization of Machine Learning. Learning Algorithms Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that

More information

Application of new Monte Carlo method for inversion of prestack seismic data. Yang Xue Advisor: Dr. Mrinal K. Sen

Application of new Monte Carlo method for inversion of prestack seismic data. Yang Xue Advisor: Dr. Mrinal K. Sen Application of new Monte Carlo method for inversion of prestack seismic data Yang Xue Advisor: Dr. Mrinal K. Sen Overview Motivation Introduction Bayes theorem Stochastic inference methods Methodology

More information

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF) Case Study 4: Collaborative Filtering Review: Probabilistic Matrix Factorization Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 2 th, 214 Emily Fox 214 1 Probabilistic

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing

Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing George Papandreou and Alan Yuille Department of Statistics University of California, Los Angeles ICCV Workshop on Information

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information