Scaling up Bayesian Inference

David Dunson
Departments of Statistical Science, Mathematics & ECE, Duke University
May 1, 2017

Outline
- Motivation & background
- EP-MCMC
- aMCMC
- Discussion

Complex & high-dimensional data
- Interest in developing new methods for analyzing & interpreting complex, high-dimensional data
- Such data arise routinely across the sciences, engineering & even the arts & humanities
- Despite huge interest in big data, vast gaps have fundamentally limited progress in many fields

Typical approaches to big data
- There is an increasingly immense literature focused on big data
- Most of the focus has been on optimization-style methods
- Rapidly obtaining a point estimate even when the sample size n & overall size of the data are immense
- Bandwagons: many people work on quite similar problems, while critical open problems remain untouched

My focus: probability models
- General probabilistic inference algorithms for complex data
- We would like to handle arbitrarily complex probability models
- Algorithms scalable to huge data, potentially using many computers
- Accurate uncertainty quantification (UQ) is a critical issue
- Robustness of inferences is also crucial
- Particular emphasis on scientific applications with limited labeled data

Bayes approaches
- Bayesian methods offer an attractive general approach for modeling complex data
- Choosing a prior $\pi(\theta)$ & likelihood $L(Y^{(n)} \mid \theta)$, the posterior is
  $$\pi_n(\theta \mid Y^{(n)}) = \frac{\pi(\theta)\,L(Y^{(n)} \mid \theta)}{\int \pi(\theta)\,L(Y^{(n)} \mid \theta)\,d\theta} = \frac{\pi(\theta)\,L(Y^{(n)} \mid \theta)}{L(Y^{(n)})}.$$
- Often $\theta$ is moderate to high-dimensional & the integral in the denominator is intractable
- Accurate analytic approximations to the posterior have proven elusive outside of narrow settings
- Markov chain Monte Carlo (MCMC) & other posterior sampling algorithms remain the standard
- Scaling MCMC to big & complex settings is challenging
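
To make the normalization issue concrete, here is a minimal sketch (a toy one-dimensional model of my own, not from the talk) that computes a posterior by brute-force quadrature; the sum approximating the denominator is exactly the integral that becomes intractable once theta is high-dimensional.

```python
import numpy as np

# Toy 1-d model (not from the talk): N(theta, 1) likelihood, N(0, 1) prior.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=100)            # observed data Y^(n)

theta = np.linspace(-4.0, 6.0, 2001)                     # grid over theta
dtheta = theta[1] - theta[0]
log_prior = -0.5 * theta**2                              # N(0, 1) prior, up to a constant
log_lik = np.array([-0.5 * np.sum((y - t) ** 2) for t in theta])

log_post = log_prior + log_lik
log_post -= log_post.max()                               # stabilize before exponentiating
unnorm = np.exp(log_post)
Z = unnorm.sum() * dtheta                                # the troublesome normalizing integral
posterior = unnorm / Z

print("posterior mean:", np.sum(theta * posterior) * dtheta)
```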

MCMC & computational bottlenecks
- MCMC constructs a Markov chain with stationary distribution $\pi_n(\theta \mid Y^{(n)})$
- A transition kernel is carefully chosen & iterative sampling proceeds
- Time per iteration increases with the number of parameters/unknowns
- Mixing worsens as the dimension of the data increases
- Storing & basic processing of big data sets is problematic
- Usually multiple likelihood and/or gradient evaluations at each iteration

Solutions
- Embarrassingly parallel (EP) MCMC: run MCMC in parallel on different subsets of the data & combine
- Approximate MCMC (aMCMC): approximate expensive-to-evaluate transition kernels
- Hybrid algorithms: run MCMC for a subset of the parameters & use a fast estimate for the others
- Designer MCMC: define clever kernels that solve mixing problems in high dimensions
- I'll focus on EP-MCMC & aMCMC in the remainder

EP-MCMC

Embarrassingly parallel MCMC
- Divide a large sample size n data set into many smaller data sets stored on different machines
- Draw posterior samples for each subset posterior in parallel
- Magically combine the results quickly & simply
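
A minimal sketch of the embarrassingly parallel pattern, assuming a hypothetical run_subset_mcmc worker (here a stand-in that returns synthetic draws rather than a real sampler): shard the data, run one chain per shard, then hand the per-shard draws to a combination step such as WASP or PIE.

```python
import numpy as np
from multiprocessing import Pool

def run_subset_mcmc(subset):
    # Placeholder for a real MCMC run on one data shard (Stan, PyMC, or a
    # hand-written sampler targeting the subset posterior); here it returns
    # synthetic "draws" so the orchestration pattern is runnable end to end.
    rng = np.random.default_rng()
    return rng.normal(loc=subset.mean(), scale=1.0 / np.sqrt(len(subset)), size=1000)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.normal(size=100_000)                      # full data set
    k = 8                                             # number of shards / machines
    shards = np.array_split(rng.permutation(y), k)

    with Pool(processes=k) as pool:                   # embarrassingly parallel step
        subset_draws = pool.map(run_subset_mcmc, shards)
    # subset_draws[j] holds draws from the j-th subset posterior, ready to be
    # combined (e.g., by WASP or PIE, described next).
```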

Toy example: logistic regression

[Figure: subset posteriors, full-data MCMC & WASP for a regression coefficient $\beta_1$]

$$\mathrm{pr}(y_i = 1 \mid x_{i1}, \ldots, x_{ip}, \theta) = \frac{\exp\!\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)}{1 + \exp\!\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)}.$$

- Subset posteriors: noisy approximations of the full-data posterior
- Averaging the subset posteriors reduces this noise & leads to an accurate posterior approximation

Stochastic approximation
- Full-data posterior density for independent, non-identically distributed (inid) data $Y^{(n)}$:
  $$\pi_n(\theta \mid Y^{(n)}) = \frac{\prod_{i=1}^{n} p_i(y_i \mid \theta)\,\pi(\theta)}{\int_\Theta \prod_{i=1}^{n} p_i(y_i \mid \theta)\,\pi(\theta)\,d\theta}.$$
- Divide the full data $Y^{(n)}$ into k subsets of size m: $Y^{(n)} = (Y_{[1]}, \ldots, Y_{[j]}, \ldots, Y_{[k]})$
- Subset posterior density for the jth data subset:
  $$\pi_m^\gamma(\theta \mid Y_{[j]}) = \frac{\prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^\gamma\,\pi(\theta)}{\int_\Theta \prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^\gamma\,\pi(\theta)\,d\theta}.$$
- $\gamma = O(k)$, chosen to minimize approximation error
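
As a sketch of the stochastic approximation, the function below (toy normal model and prior of my own choosing) evaluates the powered subset log-posterior with gamma = k, so that each subset behaves as if it had seen roughly the full sample size.

```python
import numpy as np

def powered_subset_log_post(theta, y_subset, k):
    """Log of the stochastic-approximation subset posterior, up to a constant:
    gamma * sum_{i in subset} log p_i(y_i | theta) + log pi(theta), with gamma = k
    so the subset 'sees' roughly the full sample size.
    Toy model: N(theta, 1) likelihood, N(0, 10^2) prior."""
    gamma = k
    log_lik = -0.5 * np.sum((y_subset - theta) ** 2)
    log_prior = -0.5 * theta**2 / 100.0
    return gamma * log_lik + log_prior
```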

Barycenter in metric spaces

[Figures: illustration of barycenters in metric spaces]

WAsserstein barycenter of Subset Posteriors (WASP) [Srivastava, Li & Dunson (2015)]
- 2-Wasserstein distance between $\mu, \nu \in \mathcal{P}_2(\Theta)$:
  $$W_2(\mu,\nu) = \inf\left\{ \left(E[d^2(X,Y)]\right)^{1/2} : \mathrm{law}(X) = \mu,\ \mathrm{law}(Y) = \nu \right\}.$$
- The subset posteriors $\Pi_m^\gamma(\cdot \mid Y_{[j]})$, $j = 1, \ldots, k$, are combined through their Wasserstein barycenter (WASP):
  $$\overline{\Pi}_n^\gamma(\cdot \mid Y^{(n)}) = \arg\min_{\Pi \in \mathcal{P}_2(\Theta)} \frac{1}{k} \sum_{j=1}^{k} W_2^2\!\left(\Pi, \Pi_m^\gamma(\cdot \mid Y_{[j]})\right). \quad \text{[Agueh & Carlier (2011)]}$$
- Plugging in empirical versions of $\Pi_m^\gamma(\cdot \mid Y_{[j]})$, $j = 1, \ldots, k$, a linear program (LP) gives fast estimation of an atomic approximation
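
In one dimension, and for two equally weighted samples of the same size, the 2-Wasserstein distance reduces to matching order statistics; a small sketch of this special case (for illustration only, not the general barycenter computation):

```python
import numpy as np

def wasserstein2_1d(x, y):
    """2-Wasserstein distance between two equally weighted empirical measures
    with the same number of atoms on the real line: the optimal coupling simply
    matches order statistics."""
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape
    return np.sqrt(np.mean((x - y) ** 2))

# e.g., distance between two subset-posterior samples of a scalar functional
rng = np.random.default_rng(2)
print(wasserstein2_1d(rng.normal(0.0, 1.0, 5000), rng.normal(0.1, 1.0, 5000)))
```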

LP estimation of WASP
- Minimizing the Wasserstein distance is the solution to a discrete optimal transport problem
- Let $\mu = \sum_{j=1}^{J_1} a_j \delta_{\theta_{1j}}$, $\nu = \sum_{l=1}^{J_2} b_l \delta_{\theta_{2l}}$ & $M_{12} \in \mathbb{R}^{J_1 \times J_2}$ = matrix of squared differences between the atoms $\{\theta_{1j}\}, \{\theta_{2l}\}$
- Optimal transport polytope: $\mathcal{T}(a,b)$ = set of nonnegative matrices with row sums $a$ & column sums $b$
- Objective is to find $T \in \mathcal{T}(a,b)$ minimizing $\mathrm{tr}(T^\top M_{12})$
- For WASP, generalize to a multimargin optimal transport problem; entropy smoothing has been used previously
- We can avoid such smoothing & use sparse LP solvers; the computational cost is negligible compared to sampling
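
A toy sketch of the two-margin transport LP (my own example, solved with SciPy's HiGHS backend rather than the sparse solvers referenced in the talk): minimize tr(T'M) over the transport polytope for two small atomic measures.

```python
import numpy as np
from scipy.optimize import linprog

def transport_plan(a, b, atoms_mu, atoms_nu):
    """Discrete optimal transport LP: minimize tr(T' M) over nonnegative T
    with row sums a and column sums b; M holds squared differences between
    the atoms of the two measures. Returns the plan and the squared W2."""
    J1, J2 = len(a), len(b)
    M = (atoms_mu[:, None] - atoms_nu[None, :]) ** 2
    c = M.ravel()                                  # objective on vec(T), row-major

    A_rows = np.zeros((J1, J1 * J2))               # sum_l T[j, l] = a[j]
    for j in range(J1):
        A_rows[j, j * J2:(j + 1) * J2] = 1.0
    A_cols = np.zeros((J2, J1 * J2))               # sum_j T[j, l] = b[l]
    for l in range(J2):
        A_cols[l, l::J2] = 1.0

    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.x.reshape(J1, J2), res.fun

a, b = np.full(4, 0.25), np.full(3, 1.0 / 3.0)
T, w2_sq = transport_plan(a, b, np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.5, 1.5, 2.5]))
print(np.round(T, 3), w2_sq)
```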

WASP: theorems

Theorem (Subset posteriors). Under the usual regularity conditions, there exists a constant $C_1$, independent of the subset posteriors, such that for large m,
$$E_{P^{[j]}_{\theta_0}}\, W_2^2\!\left\{ \Pi_m^\gamma(\cdot \mid Y_{[j]}),\ \delta_{\theta_0} \right\} \le C_1 \left(\frac{\log^2 m}{m}\right)^{1/\alpha}, \quad j = 1, \ldots, k.$$

Theorem (WASP). Under the usual regularity conditions and for large m,
$$W_2\!\left\{ \overline{\Pi}_n^\gamma(\cdot \mid Y^{(n)}),\ \delta_{\theta_0} \right\} = O_{P^{(n)}_{\theta_0}}\!\left( \frac{\log^{2/\alpha} m}{(km)^{1/\alpha}} \right).$$

Simple & fast Posterior Interval Estimation (PIE) [Li, Srivastava & Dunson (2015)]
- We usually report point & interval estimates for various 1-d functionals; a multidimensional posterior is difficult to interpret
- The WASP has an explicit relationship with the subset posteriors in 1-d
- Quantiles of the WASP are simple averages of the quantiles of the subset posteriors
- Leads to a very simple algorithm: run MCMC for each subset & average the quantiles, reminiscent of the bag of little bootstraps
- Strong theory showing accuracy of the resulting approximation
- Can be implemented in Stan, which allows powered likelihoods
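
PIE itself is nearly a one-liner; a sketch assuming subset_draws holds one array of MCMC draws per subset for a scalar functional (as produced, e.g., by the EP-MCMC sketch earlier):

```python
import numpy as np

def pie_interval(subset_draws, probs=(0.025, 0.5, 0.975)):
    """PIE for a scalar functional: quantiles of the combined (1-d WASP)
    posterior are simple averages of the corresponding subset-posterior
    quantiles. subset_draws is a list of 1-d arrays of MCMC draws, one per subset."""
    q = np.array([np.quantile(draws, probs) for draws in subset_draws])  # k x len(probs)
    return q.mean(axis=0)

# usage with the per-shard draws from the EP-MCMC sketch earlier:
# lower, median, upper = pie_interval(subset_draws)
```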

Theory on PIE / 1-d WASP
- We show the 1-d WASP $\overline{\Pi}_n(\xi \mid Y^{(n)})$ is a highly accurate approximation to the exact posterior $\Pi_n(\xi \mid Y^{(n)})$
- As the subset sample size m increases, the $W_2$ distance between them decreases faster than the parametric rate, i.e. it is $o_p(n^{-1/2})$
- The theorem allows $k = O(n^c)$ and $m = O(n^{1-c})$ for any $c \in (0,1)$, so m can increase very slowly relative to k (recall $n = mk$)
- Their biases, variances & quantiles differ only in higher orders of the total sample size
- Conditions: standard, mild conditions on the likelihood, plus a prior with finite 2nd moment & uniform integrability of the subset posteriors

Results
- We have implemented this for a rich variety of data & models
- Logistic & linear random effects models, mixture models, matrix & tensor factorizations, Gaussian process regression
- Nonparametric models, dependence, hierarchical models, etc.
- We compare to long runs of MCMC (when feasible) & to variational Bayes (VB)
- WASP/PIE is much faster than MCMC & highly accurate
- Carefully designed VB implementations often do very well

aMCMC

aMCMC [Johndrow, Mattingly, Mukherjee & Dunson (2015)]
- A different way to speed up MCMC: replace expensive transition kernels with approximations
- For example, approximate a conditional distribution in a Gibbs sampler with a Gaussian, or use a subsample of the data
- Can potentially vastly speed up MCMC sampling in high-dimensional settings
- The original MCMC sampler converges to a stationary distribution corresponding to the exact posterior
- It is not clear what happens when we start substituting in approximations; the chain may diverge, etc.
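
One generic example of such an approximating kernel (a simplified sketch of my own, not the exact construction analyzed in the paper): when a Gibbs step needs a latent count whose exact simulation requires touching every observation, a moment-matched Gaussian draw is essentially free.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_latent_count(n, p, exact=True):
    """Gibbs step for a latent Binomial(n, p) count. Drawing n individual
    latent indicators and summing them costs O(n); a moment-matched Gaussian
    (CLT) draw of the count is O(1) and introduces a small, controllable error."""
    if exact:
        return rng.binomial(n, p)   # stand-in for summing n per-observation draws
    mean, sd = n * p, np.sqrt(n * p * (1.0 - p))
    return int(np.clip(np.rint(rng.normal(mean, sd)), 0, n))

print(draw_latent_count(10_000_000, 0.3, exact=False))
```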

aMCMC overview
- aMCMC is used routinely in an essentially ad hoc manner
- Our goal: obtain theoretical guarantees & use them to guide the design of algorithms
- Define an exact MCMC algorithm, which is computationally intractable but has good mixing
- The exact chain converges to a stationary distribution corresponding to the exact posterior
- Approximate the kernel in the exact chain with a more computationally tractable alternative

Sketch of theory
- Define $s_\epsilon = \tau_1(P)/\tau_1(P_\epsilon)$ = computational speed-up, where $\tau_1(P)$ = time for one step with transition kernel $P$
- Interest: optimizing the computational time-accuracy tradeoff for estimators of $\Pi f = \int_\Theta f(\theta)\,\Pi(d\theta \mid x)$
- We provide tight, finite-sample bounds on the $L_2$ error
- aMCMC estimators win for low computational budgets but have asymptotic bias
- Often a larger approximation error gives a larger $s_\epsilon$, & rougher approximations are better when speed is paramount

Ex 1: approximations using subsets
- Replace the full-data likelihood with
  $$L_\epsilon(x \mid \theta) = \left( \prod_{i \in V} L(x_i \mid \theta) \right)^{n/|V|}$$
  for a randomly chosen subset $V \subset \{1, \ldots, n\}$
- Applied to Pólya-Gamma data augmentation for logistic regression
- Using a different $V$ at each iteration is a trivial modification to the Gibbs sampler
- The assumptions hold with high probability for subsets above a minimal size (with respect to the distribution of subsets, data & kernel)
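
A sketch of the subsampled, powered log-likelihood (with a toy normal likelihood standing in for the Pólya-Gamma logistic model; the n/|V| power rescales the subset so its information content matches the full data):

```python
import numpy as np

def approx_log_lik(theta, x, subset_size, rng):
    """Approximate full-data log-likelihood from a random subset V:
    log L_eps(x | theta) = (n / |V|) * sum_{i in V} log L(x_i | theta).
    Toy N(theta, 1) likelihood; a fresh V is drawn at every iteration."""
    n = len(x)
    V = rng.choice(n, size=subset_size, replace=False)
    return (n / subset_size) * np.sum(-0.5 * (x[V] - theta) ** 2)
```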

Application to the SUSY data set
- n = 5,000,000 (0.5 million held out for testing), binary outcome & 18 continuous covariates
- Considered subset sizes ranging from |V| = 1,000 to 4,500,000
- Considered different losses as a function of |V|
- The rate at which the loss goes to 0 with $\epsilon$ depends heavily on the loss
- For a small computational budget & a focus on posterior mean estimation, small subsets are preferred
- As the budget increases & the loss focuses more on the tails (e.g., for interval estimation), the optimal |V| increases

Application 2: mixture models & tensor factorizations
- We also considered a nonparametric Bayes model,
  $$\mathrm{pr}(y_{i1} = c_1, \ldots, y_{ip} = c_p) = \sum_{h=1}^{k} \lambda_h \prod_{j=1}^{p} \psi^{(j)}_{h c_j},$$
  a very useful model for multivariate categorical data
- Dunson & Xing (2009) give a data augmentation Gibbs sampler
- Sampling the latent classes is computationally prohibitive for huge n
- Use an adaptive Gaussian approximation to avoid sampling individual latent classes
- We have verified Assumptions 1-2; the Assumption 2 result is more general than this setting
- Improved computational performance for large n
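
A small sketch of evaluating a cell probability under this model (toy dimensions and randomly generated parameters, purely for illustration):

```python
import numpy as np

def cell_probability(c, lam, psi):
    """pr(y_1 = c_1, ..., y_p = c_p) = sum_h lam[h] * prod_j psi[j][h, c_j]
    for the mixture of product-multinomials (tensor factorization) model.
    lam: (k,) mixture weights; psi: list of p arrays, each of shape (k, d_j)."""
    p = len(c)
    return sum(lam[h] * np.prod([psi[j][h, c[j]] for j in range(p)])
               for h in range(len(lam)))

# toy check: k = 2 components, p = 3 binary variables
rng = np.random.default_rng(4)
lam = np.array([0.6, 0.4])
psi = [rng.dirichlet(np.ones(2), size=2) for _ in range(3)]  # rows sum to 1
print(cell_probability((0, 1, 0), lam, psi))
```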

Application 3: low-rank approximation to a GP
- Gaussian process regression: $y_i = f(x_i) + \eta_i$, $\eta_i \sim N(0, \sigma^2)$
- $f$ is given a GP prior with covariance $\tau^2 \exp(-\phi \|x_1 - x_2\|^2)$
- Discrete-uniform prior on $\phi$ & gamma priors on $\tau^2, \sigma^2$
- A marginal MCMC sampler updates $\phi, \tau^2, \sigma^2$
- We show Assumption 1 holds under mild regularity conditions on the truth, & Assumption 2 holds for a partial rank-r eigen approximation to $\Sigma$
- Less accurate approximations are clearly superior in practice for a small computational budget
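
A sketch of the rank-r eigen approximation to the squared-exponential covariance (for clarity it computes a full eigendecomposition and truncates; a practical implementation would use a partial eigensolver):

```python
import numpy as np

def lowrank_gp_cov(x, tau2, phi, r, sigma2):
    """Rank-r eigen approximation to the squared-exponential GP covariance
    tau2 * exp(-phi * (x_i - x_j)^2), plus noise variance on the diagonal.
    The rank-r factor lets downstream linear algebra work with an n x r matrix;
    a real implementation would use a partial eigensolver rather than the full
    eigendecomposition shown here."""
    K = tau2 * np.exp(-phi * (x[:, None] - x[None, :]) ** 2)
    evals, evecs = np.linalg.eigh(K)                 # eigenvalues in ascending order
    evals_r, evecs_r = evals[-r:], evecs[:, -r:]     # keep the top r eigenpairs
    K_r = (evecs_r * evals_r) @ evecs_r.T            # rank-r reconstruction
    return K_r + sigma2 * np.eye(len(x))

x = np.linspace(0.0, 1.0, 200)
Sigma_approx = lowrank_gp_cov(x, tau2=1.0, phi=10.0, r=20, sigma2=0.1)
```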

Applications: general conclusions
- Achieving uniform control of the approximation error $\epsilon$ requires approximations that adapt to the current state of the chain
- More accurate approximations are needed farther from the high-probability region of the posterior; this is good, as the chain is rarely there
- Approximations to conditionals of vector parameters are highly sensitive to the 2nd moment
- Smaller condition numbers for the covariance matrix of vector parameters mean less accurate approximations can be used

Discussion

Discussion
- We have proposed very general classes of scalable Bayes algorithms
- EP-MCMC & aMCMC: fast & scalable, with guarantees
- Interest in improving the theory: avoiding the reliance on asymptotics in EP-MCMC & weakening the assumptions in aMCMC
- Useful to combine algorithms, e.g. run aMCMC for each subset
- Looking at algorithms through this theoretical lens suggests new & improved algorithms
- Also very interested in hybrid frequentist-Bayes algorithms

Hybrid high-dimensional density estimation [Ye, Canale & Dunson (2016, AISTATS)]
- $y_i = (y_{i1}, \ldots, y_{ip})^T \sim f$ with p large & f an unknown density
- Could potentially use Dirichlet process mixtures of factor models
- That approach does not scale well at all with p
- Instead use a hybrid of Gibbs sampling & fast multiscale SVD
- Scalable, with excellent mixing & empirical/predictive performance

What about mixing?

[Figure: autocorrelation functions (ACF) vs. lag for the DA, MH, MVN & CDA samplers]

- So far we have put aside the mixing issues that can arise with big samples
- Slow mixing means we need many more MCMC samples for the same Monte Carlo error
- Common data augmentation algorithms for discrete data fail badly for large imbalanced data (Johndrow et al. 2016)
- But such problems can be fixed via calibration (Duan et al. 2016)
- Interesting area for further research

Primary References
- Duan L, Johndrow J & Dunson DB (2017). Calibrated data augmentation for scalable Markov chain Monte Carlo. arXiv preprint.
- Johndrow J, Mattingly J, Mukherjee S & Dunson DB (2015). Approximations of Markov chains and Bayesian inference. arXiv preprint.
- Johndrow J, Smith A, Pillai N & Dunson DB (2016). Inefficiency of data augmentation for large sample imbalanced data. arXiv preprint.
- Li C, Srivastava S & Dunson DB (2016). Simple, scalable and accurate posterior interval estimation. arXiv preprint; Biometrika, in press.


More information

Large-scale Ordinal Collaborative Filtering

Large-scale Ordinal Collaborative Filtering Large-scale Ordinal Collaborative Filtering Ulrich Paquet, Blaise Thomson, and Ole Winther Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark ulripa@microsoft.com,brmt2@cam.ac.uk,owi@imm.dtu.dk

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Bayesian Calibration of Simulators with Structured Discretization Uncertainty

Bayesian Calibration of Simulators with Structured Discretization Uncertainty Bayesian Calibration of Simulators with Structured Discretization Uncertainty Oksana A. Chkrebtii Department of Statistics, The Ohio State University Joint work with Matthew T. Pratola (Statistics, The

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Fitting Narrow Emission Lines in X-ray Spectra

Fitting Narrow Emission Lines in X-ray Spectra Outline Fitting Narrow Emission Lines in X-ray Spectra Taeyoung Park Department of Statistics, University of Pittsburgh October 11, 2007 Outline of Presentation Outline This talk has three components:

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

An example of Bayesian reasoning Consider the one-dimensional deconvolution problem with various degrees of prior information.

An example of Bayesian reasoning Consider the one-dimensional deconvolution problem with various degrees of prior information. An example of Bayesian reasoning Consider the one-dimensional deconvolution problem with various degrees of prior information. Model: where g(t) = a(t s)f(s)ds + e(t), a(t) t = (rapidly). The problem,

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

New Insights into History Matching via Sequential Monte Carlo

New Insights into History Matching via Sequential Monte Carlo New Insights into History Matching via Sequential Monte Carlo Associate Professor Chris Drovandi School of Mathematical Sciences ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Sequential Monte Carlo Methods for Bayesian Computation

Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

More information

Bayesian inference for multivariate extreme value distributions

Bayesian inference for multivariate extreme value distributions Bayesian inference for multivariate extreme value distributions Sebastian Engelke Clément Dombry, Marco Oesting Toronto, Fields Institute, May 4th, 2016 Main motivation For a parametric model Z F θ of

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Practical Bayesian Optimization of Machine Learning. Learning Algorithms

Practical Bayesian Optimization of Machine Learning. Learning Algorithms Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that

More information

Application of new Monte Carlo method for inversion of prestack seismic data. Yang Xue Advisor: Dr. Mrinal K. Sen

Application of new Monte Carlo method for inversion of prestack seismic data. Yang Xue Advisor: Dr. Mrinal K. Sen Application of new Monte Carlo method for inversion of prestack seismic data Yang Xue Advisor: Dr. Mrinal K. Sen Overview Motivation Introduction Bayes theorem Stochastic inference methods Methodology

More information

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF) Case Study 4: Collaborative Filtering Review: Probabilistic Matrix Factorization Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 2 th, 214 Emily Fox 214 1 Probabilistic

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing

Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing George Papandreou and Alan Yuille Department of Statistics University of California, Los Angeles ICCV Workshop on Information

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information