Scaling up Bayesian Inference
Scaling up Bayesian Inference
David Dunson
Departments of Statistical Science, Mathematics & ECE, Duke University
May 1, 2017
Outline
- Motivation & background
- EP-MCMC
- amcmc
- Discussion
Complex & high-dimensional data
- Interest in developing new methods for analyzing & interpreting complex, high-dimensional data
- Such data arise routinely in broad fields of the sciences, engineering & even the arts & humanities
- Despite huge interest in big data, there are vast gaps that have fundamentally limited progress in many fields
Typical approaches to big data
- There is an increasingly immense literature focused on big data
- Most of the focus has been on optimization-style methods
- Rapidly obtaining a point estimate even when the sample size n & overall size of the data are immense
- Bandwagons: many people work on quite similar problems, while critical open problems remain untouched
My focus - probability models
- General probabilistic inference algorithms for complex data
- We would like to be able to handle arbitrarily complex probability models
- Algorithms scalable to huge data - potentially using many computers
- Accurate uncertainty quantification (UQ) is a critical issue
- Robustness of inferences also crucial
- Particular emphasis on scientific applications - limited labeled data
Bayes approaches
- Bayesian methods offer an attractive general approach for modeling complex data
- Choosing a prior π(θ) & likelihood L(Y^(n) | θ), the posterior is

  π_n(θ | Y^(n)) = π(θ) L(Y^(n) | θ) / ∫_Θ π(θ) L(Y^(n) | θ) dθ = π(θ) L(Y^(n) | θ) / L(Y^(n))

- Often θ is moderate to high-dimensional & the integral in the denominator is intractable
- Accurate analytic approximations to the posterior have proven elusive outside of narrow settings
- Markov chain Monte Carlo (MCMC) & other posterior sampling algorithms remain the standard
- Scaling MCMC to big & complex settings is challenging
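Not from the talk - a minimal numerical sketch of why the denominator matters: in one dimension the normalizing integral can be brute-forced on a grid, which is exactly the step that becomes intractable as dim(θ) grows. The conjugate normal example is hypothetical, chosen so the answer can be checked in closed form.

```python
import math

def grid_posterior(logprior, loglik, lo, hi, npts=2001):
    """Normalize prior x likelihood on a 1-d grid: the denominator
    integral is done by a simple Riemann sum (feasible only in low dim)."""
    step = (hi - lo) / (npts - 1)
    grid = [lo + i * step for i in range(npts)]
    w = [math.exp(logprior(t) + loglik(t)) for t in grid]
    Z = sum(w) * step                      # numerical normalizing constant
    return grid, [wi / Z for wi in w]

# Example: N(0, 1) prior, 10 observations from N(theta, 1), all equal to 0.8.
data = [0.8] * 10
logprior = lambda t: -0.5 * t * t
loglik = lambda t: sum(-0.5 * (y - t) ** 2 for y in data)
grid, dens = grid_posterior(logprior, loglik, -3.0, 3.0)
# Conjugacy gives the exact posterior N(n*ybar/(n+1), 1/(n+1)): mean 8/11.
post_mean = sum(g * d for g, d in zip(grid, dens)) * (grid[1] - grid[0])
```

With a p-dimensional θ, the same grid would need npts^p evaluations, which is the curse of dimensionality the slides allude to.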
MCMC & computational bottlenecks
- MCMC constructs a Markov chain with stationary distribution π_n(θ | Y^(n))
- A transition kernel is carefully chosen & iterative sampling proceeds
- Time per iteration increases with the # of parameters/unknowns
- Mixing worsens as the dimension of the data increases
- Storing & basic processing on big data sets is problematic
- Usually multiple likelihood and/or gradient evaluations at each iteration
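To make the per-iteration cost concrete, here is a generic random-walk Metropolis sketch (not code from the talk): every step pays at least one full log-posterior evaluation, which scans all n observations. The Gaussian-mean example is a hypothetical stand-in for an expensive likelihood.

```python
import math, random

def metropolis(logpost, theta0, nsteps, scale=0.5, seed=1):
    """Random-walk Metropolis. Each iteration costs one full log-posterior
    evaluation -- the bottleneck that grows with n and dim(theta)."""
    rng = random.Random(seed)
    theta, lp = theta0, logpost(theta0)
    chain, n_evals = [], 1
    for _ in range(nsteps):
        prop = theta + rng.gauss(0.0, scale)
        lp_prop = logpost(prop)            # expensive when data are big
        n_evals += 1
        if math.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta)
    return chain, n_evals

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(200)]
logpost = lambda t: sum(-0.5 * (y - t) ** 2 for y in data)  # flat prior
chain, evals = metropolis(logpost, 0.0, 4000)
```

Counting `evals` makes the scaling issue explicit: 4000 iterations force 4001 passes over the data, regardless of how quickly the chain mixes.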
Solutions
- Embarrassingly parallel (EP) MCMC: run MCMC in parallel for different subsets of data & combine.
- Approximate MCMC: approximate expensive-to-evaluate transition kernels.
- Hybrid algorithms: run MCMC for a subset of the parameters & use a fast estimate for the others.
- Designer MCMC: define clever kernels that solve mixing problems in high dimensions.
- I'll focus on EP-MCMC & amcmc in the remainder.
Outline
- Motivation & background
- EP-MCMC
- amcmc
- Discussion
Embarrassingly parallel MCMC
- Divide a large sample size n data set into many smaller data sets stored on different machines
- Draw posterior samples for each subset posterior in parallel
- Magically combine the results quickly & simply
Toy Example: Logistic Regression
[figure: subset posteriors vs. full-data MCMC & WASP for a coefficient β]

  pr(y_i = 1 | x_i1, ..., x_ip, θ) = exp(Σ_{j=1}^p x_ij β_j) / (1 + exp(Σ_{j=1}^p x_ij β_j))

- Subset posteriors: noisy approximations of the full data posterior.
- Averaging of subset posteriors reduces this noise & leads to an accurate posterior approximation.
Stochastic Approximation
- Full data posterior density of inid (independent, non-identically distributed) data Y^(n):

  π_n(θ | Y^(n)) = ∏_{i=1}^n p_i(y_i | θ) π(θ) / ∫_Θ ∏_{i=1}^n p_i(y_i | θ) π(θ) dθ

- Divide the full data Y^(n) into k subsets of size m: Y^(n) = (Y^[1], ..., Y^[j], ..., Y^[k]).
- Subset posterior density for the jth data subset:

  π_m^γ(θ | Y^[j]) = ∏_{i∈[j]} (p_i(y_i | θ))^γ π(θ) / ∫_Θ ∏_{i∈[j]} (p_i(y_i | θ))^γ π(θ) dθ

- γ = O(k) - chosen to minimize approximation error
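A sketch of why the power γ = k is needed, in a hypothetical conjugate case (not from the talk): for a N(θ, σ²) mean with a flat prior, the γ-powered subset posterior is available in closed form, and γ = k makes its variance σ²/(km) = σ²/n, matching the full-data posterior spread instead of being k times too wide.

```python
def powered_subset_posterior(y_subset, gamma, sigma2=1.0):
    """Subset posterior for a N(theta, sigma2) mean under a flat prior,
    with the likelihood raised to the power gamma: proportional to
    exp(-gamma * sum_i (y_i - theta)^2 / (2 sigma2)),
    i.e. N(mean(y_subset), sigma2 / (gamma * m))."""
    m = len(y_subset)
    return sum(y_subset) / m, sigma2 / (gamma * m)

# k subsets of size m; gamma = k rescales each subset posterior so its
# variance matches the full-data posterior N(ybar, sigma2 / n), n = k*m.
k, m = 10, 50
vals = [0.1 * (i % 7) for i in range(k * m)]        # deterministic toy data
subsets = [vals[j * m:(j + 1) * m] for j in range(k)]
means_vars = [powered_subset_posterior(s, gamma=k) for s in subsets]
```

With γ = 1 each subset posterior would have variance σ²/m, i.e. inflated by k relative to the target, which is exactly the miscalibration the stochastic approximation corrects.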
Barycenter in Metric Spaces
[figures: barycenter illustrations]
WAsserstein barycenter of Subset Posteriors (WASP) Srivastava, Li & Dunson (2015)
- 2-Wasserstein distance between μ, ν ∈ P_2(Θ):

  W_2(μ, ν) = inf { (E[d²(X, Y)])^{1/2} : law(X) = μ, law(Y) = ν }

- Π_m^γ(· | Y^[j]) for j = 1, ..., k are combined through the WASP

  Π̄_n^γ(· | Y^(n)) = argmin_{Π ∈ P_2(Θ)} (1/k) Σ_{j=1}^k W_2²(Π, Π_m^γ(· | Y^[j]))   [Agueh & Carlier (2011)]

- Plugging in Π_m^γ(· | Y^[j]) for j = 1, ..., k, a linear program (LP) can be used for fast estimation of an atomic approximation
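A small illustrative helper, not from the slides: in one dimension the optimal W_2 coupling between two equal-size empirical measures simply matches order statistics, so W_2² is the mean squared difference of sorted samples. This is the special structure the later PIE method exploits.

```python
def w2_empirical(xs, ys):
    """2-Wasserstein distance between equal-size 1-d empirical measures:
    the optimal coupling pairs order statistics, so
    W2^2 = mean of squared differences of the sorted samples."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return (sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)) ** 0.5

a = [0.0, 1.0, 2.0, 3.0]
b = [x + 0.5 for x in a]   # a pure location shift by 0.5
```

A pure location shift of size c moves an empirical measure exactly c in W_2, which makes this a handy sanity check.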
LP Estimation of WASP
- Minimizing Wasserstein is the solution to a discrete optimal transport problem
- Let μ = Σ_{j=1}^{J_1} a_j δ_{θ_1j}, ν = Σ_{l=1}^{J_2} b_l δ_{θ_2l} & M_12 ∈ R^{J_1×J_2} = matrix of squared differences in the atoms {θ_1j}, {θ_2l}.
- Optimal transport polytope: T(a, b) = set of doubly stochastic matrices w/ row sums a & column sums b
- Objective is to find T ∈ T(a, b) minimizing tr(T^T M_12)
- For WASP, generalize to a multimargin optimal transport problem - entropy smoothing has been used previously
- We can avoid such smoothing & use sparse LP solvers - negligible computation cost compared to sampling
WASP: Theorems
Theorem (Subset Posteriors). Under usual regularity conditions, there exists a constant C_1 independent of the subset posteriors such that, for large m,

  E_{P_θ0} [ W_2² { Π_m^γ(· | Y^[j]), δ_θ0 } ] ≤ C_1 log² m / m^{1-α},   j = 1, ..., k.

Theorem (WASP). Under usual regularity conditions and for large m,

  W_2 { Π̄_n^γ(· | Y^(n)), δ_θ0 } = O_{P_θ0^(n)} ( log^{2/α} m / (km)^{1/α} ).
Simple & Fast Posterior Interval Estimation (PIE) Li, Srivastava & Dunson (2015)
- Usually report point & interval estimates for different 1-d functionals - a multidimensional posterior is difficult to interpret
- WASP has an explicit relationship with subset posteriors in 1-d
- Quantiles of the WASP are simple averages of quantiles of the subset posteriors
- Leads to a super trivial algorithm - run MCMC for each subset & average quantiles - reminiscent of the bag of little bootstraps
- Strong theory showing accuracy of the resulting approximation
- Can be implemented in Stan, which allows powered likelihoods
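The whole PIE combination step fits in a few lines. A minimal sketch (names and the toy subset draws are hypothetical, and the quantile estimator is a simple plug-in rather than whatever interpolation a real implementation would use):

```python
import statistics

def quantile(draws, p):
    """Empirical p-quantile via sorted draws (simple plug-in estimator)."""
    s = sorted(draws)
    idx = min(len(s) - 1, max(0, round(p * (len(s) - 1))))
    return s[idx]

def pie_quantile(subset_draws, p):
    """PIE: the p-quantile of the 1-d WASP is the average of the
    p-quantiles of the k subset posteriors."""
    return statistics.fmean(quantile(d, p) for d in subset_draws)

# Two toy "subset posteriors", offset from each other:
sub1 = [i / 100 for i in range(101)]         # roughly Uniform(0, 1)
sub2 = [0.2 + i / 100 for i in range(101)]   # roughly Uniform(0.2, 1.2)
med = pie_quantile([sub1, sub2], 0.5)        # -> 0.6
```

Calling `pie_quantile` at p = 0.025 and p = 0.975 gives the averaged interval endpoints, so the only MCMC work is the k independent subset runs.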
Theory on PIE/1-d WASP
- We show the 1-d WASP Π̄_n(ξ | Y^(n)) is a highly accurate approximation to the exact posterior Π_n(ξ | Y^(n))
- As the subset sample size m increases, the W_2 distance between them decreases faster than the parametric rate: o_p(n^{-1/2})
- The theorem allows k = O(n^c) and m = O(n^{1-c}) for any c ∈ (0, 1), so m can increase very slowly relative to k (recall n = mk)
- Their biases, variances & quantiles only differ in high orders of the total sample size
- Conditions: standard, mild conditions on the likelihood + prior finite 2nd moment & uniform integrability of subset posteriors
Results
- We have implemented for a rich variety of data & models
- Logistic & linear random effects models, mixture models, matrix & tensor factorizations, Gaussian process regression
- Nonparametric models, dependence, hierarchical models, etc.
- We compare to long runs of MCMC (when feasible) & VB
- WASP/PIE is much faster than MCMC & highly accurate
- Carefully designed VB implementations often do very well
Outline
- Motivation & background
- EP-MCMC
- amcmc
- Discussion
amcmc Johndrow, Mattingly, Mukherjee & Dunson (2015)
- A different way to speed up MCMC - replace expensive transition kernels with approximations
- For example, approximate a conditional distribution in a Gibbs sampler with a Gaussian or using a subsample of data
- Can potentially vastly speed up MCMC sampling in high-dimensional settings
- The original MCMC sampler converges to a stationary distribution corresponding to the exact posterior
- Not clear what happens when we start substituting in approximations - may diverge, etc.
amcmc Overview
- amcmc is used routinely in an essentially ad hoc manner
- Our goal: obtain theory guarantees & use these to target the design of algorithms
- Define an exact MCMC algorithm, which is computationally intractable but has good mixing
- The exact chain converges to the stationary distribution corresponding to the exact posterior
- Approximate the kernel in the exact chain with a more computationally tractable alternative
Sketch of theory
- Define s_ɛ = τ_1(P)/τ_1(P_ɛ) = computational speed-up, where τ_1(P) = time for one step with transition kernel P
- Interest: optimizing the computational time-accuracy tradeoff for estimators of Πf = ∫_Θ f(θ) π(dθ | x)
- We provide tight, finite sample bounds on the L_2 error
- amcmc estimators win for low computational budgets but have asymptotic bias
- Often larger approximation error ⇒ larger s_ɛ, & rougher approximations are better when speed is super important
Ex 1: Approximations using subsets
- Replace the full data likelihood with

  L_ɛ(x | θ) = ( ∏_{i∈V} L(x_i | θ) )^{n/|V|}

  for a randomly chosen subset V ⊂ {1, ..., n}.
- Applied to Pólya-Gamma data augmentation for logistic regression
- Different V at each iteration → trivial modification to Gibbs
- Assumptions hold with high probability for subsets above a minimal size (wrt the distribution of subsets, data & kernel).
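On the log scale the power n/|V| is just a rescaling of the subset log-likelihood, which makes it a one-line change to an existing sampler. A minimal sketch (not the Pólya-Gamma application; the Gaussian likelihood is a hypothetical stand-in):

```python
import random

def subsampled_loglik(theta, data, subset_size, rng):
    """Approximate the full-data Gaussian log-likelihood by a random
    subset V, raised to the power n/|V| (a rescaling on the log scale)."""
    V = rng.sample(data, subset_size)      # fresh V each call / iteration
    scale = len(data) / subset_size
    return scale * sum(-0.5 * (y - theta) ** 2 for y in V)

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(10000)]
# With |V| = n the approximation is exact:
full = subsampled_loglik(0.0, data, len(data), rng)
exact = sum(-0.5 * y * y for y in data)
# Averaging many small-subset evaluations concentrates near the exact value:
est = sum(subsampled_loglik(0.0, data, 1000, rng) for _ in range(200)) / 200
```

Each evaluation is unbiased for the full log-likelihood but noisy, which is precisely why the kernel using it is an approximate one and needs the perturbation theory sketched above.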
Application to SUSY dataset
- n = 5,000,000 (0.5 million test), binary outcome & 18 continuous covariates
- Considered subset sizes ranging from |V| = 1,000 to 4,500,000
- Considered different losses as a function of |V|
- The rate at which the loss → 0 with ɛ is heavily dependent on the loss
- For a small computational budget & a focus on posterior mean estimation, small subsets are preferred
- As the budget increases & the loss focuses more on the tails (e.g., for interval estimation), the optimal |V| increases
Application 2: Mixture models & tensor factorizations
- We also considered a nonparametric Bayes model:

  pr(y_i1 = c_1, ..., y_ip = c_p) = Σ_{h=1}^k λ_h ∏_{j=1}^p ψ^(j)_{h c_j},

  a very useful model for multivariate categorical data
- Dunson & Xing (2009) - a data augmentation Gibbs sampler
- Sampling the latent classes is computationally prohibitive for huge n
- Use an adaptive Gaussian approximation - avoid sampling individual latent classes
- We have shown Assumptions 1-2 hold; the Assumption 2 result is more general than this setting
- Improved computational performance for large n
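The factorized cell probability above is cheap to evaluate directly. A small sketch (the two-class, two-variable parameters are made up for illustration) showing the sum-over-classes, product-over-variables structure, and that the cell probabilities form a valid distribution:

```python
import math
from itertools import product

def cell_prob(c, lam, psi):
    """pr(y_1 = c_1, ..., y_p = c_p) = sum_h lam[h] * prod_j psi[j][h][c_j]."""
    p = len(c)
    return sum(lam[h] * math.prod(psi[j][h][c[j]] for j in range(p))
               for h in range(len(lam)))

# k = 2 latent classes, p = 2 categorical variables with 2 levels each.
lam = [0.6, 0.4]                  # class weights, sum to 1
psi = [                           # psi[j][h] = category probs of var j in class h
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.5, 0.5]],
]
total = sum(cell_prob(c, lam, psi) for c in product(range(2), repeat=2))
```

The Gibbs sampler's bottleneck is not this evaluation but imputing a latent class h for each of n observations, which is what the adaptive Gaussian approximation avoids.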
111 Application 3: Low rank approximation to GP
- Gaussian process regression, y_i = f(x_i) + η_i, η_i ~ N(0, σ²)
- f ~ GP prior with covariance τ² exp(−φ ‖x₁ − x₂‖²)
- Discrete-uniform prior on φ & gamma priors on τ², σ²
- Marginal MCMC sampler updates φ, τ², σ²
- We show Assumption 1 holds under mild regularity conditions on the truth, and Assumption 2 holds for a partial rank-r eigen approximation to Σ
- Less accurate approximations are clearly superior in practice for a small computational budget
amcmc 25
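A minimal sketch of the rank-r eigen approximation referred to above: build the squared-exponential covariance matrix, keep its r largest eigenpairs, and check how close the truncation is in spectral norm. The sizes (`n`, `r`) and parameter values (`tau2`, `phi`) are assumed for illustration only.

```python
import numpy as np

# Partial rank-r eigen approximation to a squared-exponential GP covariance
# Sigma_ij = tau^2 * exp(-phi * (x_i - x_j)^2).  All numbers are illustrative.
rng = np.random.default_rng(1)
n, tau2, phi, r = 200, 1.0, 2.0, 20

x = rng.uniform(size=(n, 1))
Sigma = tau2 * np.exp(-phi * (x - x.T) ** 2)

# Keep the r largest eigenpairs: Sigma_r = U_r diag(w_r) U_r^T.
w, U = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
w_r, U_r = w[-r:], U[:, -r:]
Sigma_r = U_r @ np.diag(w_r) @ U_r.T

# The spectral error of the truncation is the (r+1)-th largest eigenvalue,
# which decays very fast for this smooth kernel.
rel_err = np.linalg.norm(Sigma - Sigma_r, 2) / np.linalg.norm(Sigma, 2)
```

The fast eigenvalue decay of smooth kernels is why even quite small r gives a tiny relative error here, and why a cheap, less accurate approximation can dominate when the computational budget is small.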
116 Applications: General Conclusions
- Achieving uniform control of the approximation error ε requires approximations that adapt to the current state of the chain
- More accurate approximations are needed farther from the high-probability region of the posterior; this is good, as the chain is rarely there
- Approximations to conditionals of vector parameters are highly sensitive to the 2nd moment
- Smaller condition numbers for the covariance matrix of vector parameters mean less accurate approximations can be used
amcmc 26
117 Outline Motivation & background EP-MCMC amcmc Discussion Discussion 27
123 Discussion
- Proposed very general classes of scalable Bayes algorithms
- EP-MCMC & amcmc - fast & scalable with guarantees
- Interest in improving the theory - avoiding reliance on asymptotics in EP-MCMC & weakening the assumptions in amcmc
- Useful to combine algorithms - e.g., run amcmc for each subset
- Viewing existing algorithms through this theoretical lens suggests new & improved algorithms
- Also, very interested in hybrid frequentist-Bayes algorithms
Discussion 27
128 Hybrid high-dimensional density estimation
- Ye, Canale & Dunson (2016, AISTATS)
- y_i = (y_{i1}, ..., y_{ip})^T ~ f with p large & f an unknown density
- Potentially use Dirichlet process mixtures of factor models
- That approach doesn't scale well at all with p
- Instead use a hybrid of Gibbs sampling & fast multiscale SVD
- Scalable, with excellent mixing & empirical/predictive performance
Discussion 28
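The dimension-reduction half of such a hybrid can be sketched as follows: project the p-dimensional observations onto their top singular subspace, so the expensive Bayesian sampler only ever works in the low-dimensional coordinates. The target rank `m`, the synthetic data, and all names are assumptions for illustration; neither the multiscale refinement nor the Gibbs sampler from the paper is shown.

```python
import numpy as np

# Sketch: reduce y_i in R^p to m coordinates via truncated SVD before
# density estimation.  Data are simulated near an m-dimensional subspace.
rng = np.random.default_rng(2)
n, p, m = 500, 100, 5

scores = rng.normal(size=(n, m))
basis = np.linalg.qr(rng.normal(size=(p, m)))[0]      # orthonormal p x m
Y = scores @ basis.T + 0.01 * rng.normal(size=(n, p)) # signal + small noise

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Z = Y @ Vt[:m].T      # low-dimensional coordinates for the sampler
recon = Z @ Vt[:m]    # lift back to R^p to check fidelity
rel_recon_err = np.linalg.norm(Y - recon) / np.linalg.norm(Y)
```

When the data really concentrate near a low-dimensional subspace, `rel_recon_err` is small, so little is lost by running MCMC on `Z` instead of `Y` while the per-iteration cost drops from order p to order m.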
133 What about mixing?
[Figure: autocorrelation (ACF) vs. lag for the DA, MH, MVN & CDA samplers]
- In the above we have put aside the mixing issues that can arise with big samples
- Slow mixing means we need many more MCMC samples for the same MC error
- Common data augmentation algorithms for discrete data fail badly for large imbalanced data (Johndrow et al. 2016)
- But such problems can be fixed via calibration (Duan et al. 2016)
- An interesting area for further research
Discussion 29
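The cost of slow mixing can be made concrete with a toy AR(1) chain: at lag-1 autocorrelation ρ, the effective sample size is roughly n(1−ρ)/(1+ρ), so a chain with ρ = 0.95 delivers only about one independent-equivalent draw per 39 iterations. The chain and all numbers below are illustrative, not from the talk.

```python
import numpy as np

# AR(1) chain with stationary N(0,1) marginal and lag-1 correlation rho.
rng = np.random.default_rng(3)
n, rho = 100_000, 0.95

x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho ** 2) * rng.normal()

def acf(x, lag):
    """Empirical autocorrelation of x at the given positive lag."""
    xc = x - x.mean()
    return float(np.dot(xc[:-lag], xc[lag:]) / np.dot(xc, xc))

# Effective sample size under the AR(1) approximation.
ess = n * (1 - rho) / (1 + rho)
```

Here `ess` is about 2,564 out of 100,000 draws: the same Monte Carlo error as roughly 2,500 independent samples, which is exactly why badly mixing data augmentation samplers become unusable at scale.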
134 Primary References
- Duan L, Johndrow J, Dunson DB (2017). Calibrated data augmentation for scalable Markov chain Monte Carlo. arXiv preprint.
- Johndrow J, Mattingly J, Mukherjee S, Dunson DB (2015). Approximations of Markov chains and Bayesian inference. arXiv preprint.
- Johndrow J, Smith A, Pillai N, Dunson DB (2016). Inefficiency of data augmentation for large sample imbalanced data. arXiv preprint.
- Li C, Srivastava S, Dunson DB (2016). Simple, scalable and accurate posterior interval estimation. arXiv preprint; Biometrika, in press.
Discussion 30
More informationEfficient Variational Inference in Large-Scale Bayesian Compressed Sensing
Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing George Papandreou and Alan Yuille Department of Statistics University of California, Los Angeles ICCV Workshop on Information
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationExercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters
Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for
More information