HYBRID DETERMINISTIC-STOCHASTIC GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING


COMMUNICATIONS IN INFORMATION AND SYSTEMS, © 2012 International Press, Vol. 12, No. 3, pp. 221-232, 2012

QI HE AND JACK XIN

Department of Mathematics, University of California, Irvine 92697, qhe@uci.edu.
Department of Mathematics, University of California, Irvine 92697, jxin@math.uci.edu.

Abstract. We propose a new algorithm for sampling a Bayesian posterior distribution by hybrid deterministic-stochastic gradient Langevin dynamics. To speed up convergence and reduce computational cost, it is common to approximate the full gradient by a stochastic gradient evaluated on a sampled subset of a large dataset. Stochastic gradient methods make fast progress initially, but they often slow down in the late stage as the iterates approach the desired solution. Conventional full gradient methods converge better eventually, but at the expense of evaluating the full gradient at each iteration. Our hybrid method combines the advantages of both approaches for constructing the Bayesian posterior distribution. We prove that the algorithm converges by weak convergence methods, and illustrate numerically its effectiveness and improved accuracy.

Key words: Langevin dynamics, Stochastic gradient, Bayesian learning.

AMS Subject Classification. 60J7, 93E0, 34E

1. Introduction. This work focuses on Bayesian learning based on hybrid deterministic-stochastic gradient Langevin dynamics. There has been increasing interest in large scale datasets for machine learning, ranging from network data, signal processing and data mining to bioinformatics. Large scale data significantly increase the computational complexity of the underlying optimization algorithm. One of the successful ways to overcome this difficulty is the stochastic gradient method, which is efficient and simple to implement.

In the literature, stochastic gradient methods have been widely used for large scale machine learning. For example, [1, section 3.] and [2] studied the incremental-gradient method, in which each iteration evaluates the gradient along a single index only; [3] applied a stochastic gradient algorithm to large scale linear prediction problems; [4] explored hybrid deterministic-stochastic methods for large scale optimization. For more applications of stochastic gradient methods in large scale machine learning, we refer to [5] and the references therein. The convergence of stochastic gradient algorithms has also been studied extensively in the stochastic approximation literature, such as [2, 6], where convergence holds under mild conditions, including the case where the loss function is not everywhere differentiable; see also [7] for the convergence of a simultaneous perturbation gradient approximation and [8] for the convergence of a stochastic gradient expectation maximization (EM) algorithm.

Along another line, there are few studies of large scale Bayesian learning by stochastic gradient methods. [9] seems to be the first to propose a stochastic gradient method for Bayesian learning by Langevin dynamics. Later, [10] improved the method by adding a preconditioner using stochastic Fisher scoring. Although the algorithm works efficiently in numerical experiments, there is no theoretical analysis of the convergence of stochastic gradient Bayesian learning yet. To bridge this gap, in this paper we study Bayesian learning by a hybrid deterministic-stochastic gradient method motivated by [4] and provide a convergence proof. Our setting covers the model in [9] as a special case and so also proves convergence of the algorithm in [9]. To the best of our knowledge, this is the first work that presents a theoretical analysis of the algorithm for stochastic gradient Bayesian learning.

It is worth mentioning that the convergence analysis of the hybrid deterministic-stochastic method for data fitting in [4] does not apply directly to the Bayesian learning problem here. The algorithm for Bayesian learning converges to a probability distribution rather than to a deterministic number, so the traditional error analysis is not suitable. Instead, we employ the weak convergence method [11], which treats both data fitting and Bayesian learning algorithms and also requires weaker conditions for convergence.

The rest of the paper is organized as follows. Section 2 begins with certain preliminary results. Section 3 provides the proof of convergence for the algorithm. Section 4 shows the advantages of our proposed method by numerical simulations. Finally, concluding remarks are given in Section 5.

The work was partially supported by NSF grants DMS-1507, and DMS. The authors thank Mr. Sungjin Ahn for helpful conversations on Bayesian learning and related data sets.

2. Preliminary Results. The Markov chain Monte Carlo (MCMC) method is a very popular tool for computational problems in Bayesian statistics; see [12] and [13]. To sample from the target density, MCMC methods construct a Markov chain whose stationary distribution is the target density.
Under a suitable ergodicity condition on the Markov chain, statistical quantities based on the target density can be calculated by simulating the Markov chain and computing time-averaged quantities.

A basic sampling method in MCMC is Langevin dynamics [13]. Assume that $\pi$ is a target density on $\mathbb{R}^r$. Then the Langevin diffusion $\theta_t$ is defined by the stochastic differential equation

$$ d\theta_t = \frac{1}{2}\nabla \log \pi(\theta_t)\,dt + dW_t, \tag{2.1} $$

where $W_t$ is a standard Brownian motion. When $\pi$ is suitably smooth, it can be shown that $\theta_t$ has $\pi$ as a stationary distribution [14]. Denote the density function of the diffusion $\theta_t$ by $\rho(t,x)$; then $\lim_{t\to\infty}\rho(t,x) = \pi(x)$.

A variety of MCMC algorithms are based on Langevin dynamics. The simplest approach is to discretize the Langevin dynamics (2.1) by the Euler-Maruyama method [15]. In addition, a Metropolis step can be added to remove bias if the step size is large enough for the bias to be significant. There are also variants of the method, for example, preconditioning the dynamics by a positive definite matrix $A$ to obtain

$$ d\theta_t = \frac{1}{2}A\nabla \log \pi(\theta_t)\,dt + A^{1/2}\,dW_t. \tag{2.2} $$

This dynamic also has $\pi$ as its stationary distribution.

To apply the Langevin dynamics of the MCMC method to Bayesian learning, we consider the following model. Let $\theta$ denote a parameter vector with prior distribution $p(\theta)$, and let $p(x \mid \theta)$ be the conditional density of the data $x$ given $\theta$. The posterior distribution of the parameter $\theta$ given a set of data $X = (x_i, 1 \le i \le N)$ is

$$ p(\theta \mid X) \propto p(\theta)\prod_{i=1}^{N} p(x_i \mid \theta). \tag{2.3} $$

Replacing $\pi$ in (2.1) by $p(\theta \mid X)$, we obtain the following Langevin dynamics for $\theta$:

$$
\begin{aligned}
d\theta_t &= \frac{1}{2}\nabla \log p(\theta_t \mid X)\,dt + dW_t \\
&= \frac{1}{2}\Big(\nabla \log p(\theta_t) + \sum_{i=1}^{N}\nabla \log p(x_i \mid \theta_t)\Big)dt + dW_t.
\end{aligned}
$$

Hence the limit distribution of $\theta_t$ is $p(\theta \mid X)$ [14]. To approach the posterior distribution by Langevin dynamics, we discretize by the Euler-Maruyama method [15]. The algorithm is

$$ \theta_{k+1} = \theta_k + \frac{\varepsilon_k}{2}\Big(\nabla \log p(\theta_k) + \sum_{i=1}^{N}\nabla \log p(x_i \mid \theta_k)\Big) + \sqrt{\varepsilon_k}\,\eta_k, \tag{2.4} $$

where $\{\eta_k\}$ is an i.i.d. sequence of standard normal random vectors, and the step size $\varepsilon_k$ satisfies $\sum_{k=1}^{\infty}\varepsilon_k = \infty$ and $\sum_{k=1}^{\infty}\varepsilon_k^2 < \infty$. These two conditions on $\varepsilon_k$ are very common for stochastic approximation with decreasing step size; see [6], [11] and [16]. Typically, the step size is $\varepsilon_k = a(b+k)^{-\gamma}$, which decays algebraically with $\gamma \in (0.5, 1]$.

To correct for discretization error, one can use (2.4) as a proposal distribution and modify it by the Metropolis-Hastings method [17]. Note that as the step size $\varepsilon_k$ decreases, the discretization error decreases to zero, so that the rejection rate approaches zero. Hence, we can simply ignore the Metropolis-Hastings acceptance step.

In the algorithm (2.4), we need to evaluate the gradient of $\log p(x \mid \theta)$ over the whole data set, which is time consuming. To speed up convergence, [9] used the stochastic gradient method, i.e., at each step the gradient is estimated over a subset of the data only.
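Before subsampling is introduced, the full-gradient iteration (2.4) can be illustrated with a minimal Python sketch. It is not taken from the paper: the toy model (a standard normal prior on a scalar $\theta$ and unit-variance Gaussian observations, whose posterior is $N(\sum_i x_i/(N+1),\, 1/(N+1))$), the function names, the constants `a`, `b`, `gamma`, and the step-size-weighted average used as a check are all illustrative assumptions.

```python
import numpy as np

def step_size(k, a=0.05, b=1.0, gamma=0.55):
    """Decaying step size eps_k = a * (b + k) ** (-gamma), with gamma in (0.5, 1].
    The constants are assumed values chosen for this toy problem."""
    return a * (b + k) ** (-gamma)

def grad_log_prior(theta):
    # Standard normal prior: d/dtheta log p(theta) = -theta
    return -theta

def grad_log_lik(theta, x):
    # Unit-variance Gaussian likelihood: d/dtheta log p(x_i | theta) = x_i - theta
    return x - theta

def langevin_full_gradient(x, num_iters=20000, theta0=0.0, seed=0):
    """Euler-Maruyama iteration in the form of (2.4), using the full data set each step."""
    rng = np.random.default_rng(seed)
    theta = theta0
    thetas = np.empty(num_iters)
    eps = np.empty(num_iters)
    for k in range(num_iters):
        eps[k] = step_size(k)
        drift = grad_log_prior(theta) + grad_log_lik(theta, x).sum()
        # Drift scaled by eps_k / 2, injected noise scaled by sqrt(eps_k)
        theta = theta + 0.5 * eps[k] * drift + np.sqrt(eps[k]) * rng.standard_normal()
        thetas[k] = theta
    return thetas, eps

# Toy data (assumed): 20 observations centred at 1.0
data_rng = np.random.default_rng(1)
x = data_rng.normal(loc=1.0, scale=1.0, size=20)

thetas, eps = langevin_full_gradient(x)
posterior_mean = x.sum() / (len(x) + 1)               # exact posterior mean for this toy model
weighted_mean = np.sum(eps * thetas) / np.sum(eps)    # step-size-weighted average of the iterates
print(posterior_mean, weighted_mean)
```

Because the step sizes decrease, the iterates are averaged with weights $\varepsilon_k$ rather than uniformly; this weighting is a design choice of the sketch, and the two printed numbers should be close for this toy posterior.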
Assume that the subset has size $n < N$. The algorithm is then

$$ \theta_{k+1} = \theta_k + \frac{\varepsilon_k}{2}\Big(\nabla \log p(\theta_k) + \frac{N}{n}\sum_{i=1}^{n}\nabla \log p(x_i \mid \theta_k)\Big) + \sqrt{\varepsilon_k}\,\eta_k, $$

where the sum runs over the $n$ data points sampled at iteration $k$. Although it is shown in [9] that this stochastic gradient Bayesian learning works well in some numerical simulations, the convergence improves slowly after a certain number of iterations. The reason is that the stochastic gradient method makes good progress initially, but is slow to improve the accuracy as the iterates approach the limiting solution. To overcome this disadvantage, [4] proposed a hybrid deterministic-stochastic method for optimization, in which the size of the sampling subset increases after each iteration. It is shown that this method exhibits the benefits of both the stochastic gradient method and the full gradient method.

Motivated by [4], we propose a hybrid deterministic-stochastic gradient method for Bayesian learning. The new ingredient is to inject additional noise according to Langevin dynamics into the algorithm, so that convergence to the full posterior distribution holds. The hybrid deterministic-stochastic gradient Bayesian learning algorithm is given by

$$ \theta_{k+1} = \theta_k + \frac{\varepsilon_k}{2}\Big(\nabla \log p(\theta_k) + \frac{N}{n_k}\sum_{i=1}^{n_k}\nabla \log p(x_i \mid \theta_k)\Big) + \sqrt{\varepsilon_k}\,\eta_k, \tag{2.5} $$

where $n_k$ is the size of the sample subset at the $k$-th iteration and the sum runs over the sampled points (denoted $x_{ki}$ in Section 3). The sequence $n_k$ is nondecreasing and $\lim_{k\to\infty} n_k \le N$.

3. Weak convergence method. In this section, we prove that the distribution of the algorithm (2.5) converges to the posterior distribution $p(\theta \mid X)$. We introduce the following conditions.

Assumption A. Assume that $\nabla \log p(\theta)$ and $\nabla \log p(x \mid \theta)$ satisfy linear growth and Lipschitz conditions in $\theta$, and that $p(x)$ is continuously differentiable.

Remark 3.1. The linear growth and Lipschitz conditions in Assumption A can be relaxed to local linear growth and local Lipschitz conditions by applying truncation techniques; for more details we refer to [6].

Define $t_k = \sum_{i=0}^{k-1}\varepsilon_i$, $m(t) = \max\{k : t_k \le t\}$, and the continuous-time interpolations

$$ \theta^0(t) = \theta_k \ \text{ for } t \in [t_k, t_{k+1}), \qquad \theta^k(t) = \theta^0(t + t_k). \tag{3.1} $$

We first obtain an estimate on the $p$th moment of $\{\theta_k\}$. This is stated as follows.

Lemma 3.2. Under Assumption A, for any fixed $p$ and $T > 0$,

$$ \sup_{0 \le k \le m(T)} E|\theta_k|^p \le \big(|\theta_0|^p + KT\big)\exp(KT) < \infty, \tag{3.2} $$

for some constant $K > 0$.

Proof.
Define $U(\theta) = |\theta|^p$ and use $E_k$ to denote the conditional expectation with respect to the $\sigma$-algebra $\mathcal{G}_k$, where $\mathcal{G}_k = \sigma(\theta_1, \dots, \theta_k)$. In the following, we use the notation $'$ for the transpose operation. Thus

$$
\begin{aligned}
E_k U(\theta_{k+1}) - U(\theta_k)
&= E_k \nabla U(\theta_k)'[\theta_{k+1} - \theta_k] + \frac{1}{2}E_k(\theta_{k+1} - \theta_k)'\nabla^2 U(\theta_k^+)(\theta_{k+1} - \theta_k) \\
&\le \frac{\varepsilon_k}{2} E_k \nabla U(\theta_k)'\Big(\nabla \log p(\theta_k) + \frac{N}{n_k}\sum_{i=1}^{n_k}\nabla \log p(x_{ki} \mid \theta_k)\Big) + K\varepsilon_k|\theta_k|^{p-2}\big(1 + |\theta_k|^2\big) \\
&\le K\varepsilon_k\big(1 + |\theta_k|^p\big),
\end{aligned} \tag{3.3}
$$

where $\nabla U$ and $\nabla^2 U$ denote the gradient and the Hessian of $U$ with respect to its argument, and $\theta_k^+$ denotes a vector on the line segment joining $\theta_k$ and $\theta_{k+1}$. Note that we have used the linear growth in $\theta$ of both $\nabla \log p(\theta)$ and $\nabla \log p(x \mid \theta)$ in the last line of (3.3). Since $U(\theta_k) = |\theta_k|^p$, we obtain

$$ E_k|\theta_{k+1}|^p \le |\theta_k|^p + K\varepsilon_k + K\varepsilon_k|\theta_k|^p. $$

Taking the expectation on both sides and iterating on the resulting recursion, we have

$$ E|\theta_{n+1}|^p \le |\theta_0|^p + K\sum_{k=0}^{n}\varepsilon_k + K\sum_{k=0}^{n}\varepsilon_k E|\theta_k|^p. $$

An application of Gronwall's inequality yields

$$ E|\theta_{n+1}|^p \le \big(|\theta_0|^p + KT\big)\exp(KT), $$

as desired.

To proceed, let us recall the definition of tightness. Assume that $B$ is a metric space and that $\mathcal{B}$ is a $\sigma$-algebra of subsets of $B$. A set of probability measures $\{P_\alpha\}$ on $(B, \mathcal{B})$ is tight if for each $\varepsilon > 0$ there is a compact set $B_\varepsilon \subset B$ such that

$$ \inf_\alpha P_\alpha\{B_\varepsilon\} \ge 1 - \varepsilon. $$

A set of random variables $\{x_\alpha\}$ is tight if its corresponding set of measures $\{P_\alpha\}$ is tight. For more details on tightness, we refer readers to the book [11]. In view of the estimate above, $\{\theta_k : 0 \le k \le m(T)\}$ is tight in $\mathbb{R}^r$ by the well-known Chebyshev inequality. That is, for each $\delta > 0$, there is a $K_\delta$ satisfying $K_\delta > 1/\delta$ such that

$$ P(|\theta_k| > K_\delta) \le \frac{\sup_{0 \le n \le m(T)} E|\theta_n|}{K_\delta} \le K\delta. $$

Next, we show that $\{\theta^k(\cdot)\}$ is tight in a suitable function space.

Lemma 3.3. Under Assumption A, $\{\theta^k(\cdot)\}$ is tight in $D^r[0, \infty)$, the space of functions that are right continuous and have left limits, endowed with the Skorohod topology.

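Proof. For any $\eta > 0$, $t \ge 0$, $0 \le s \le \eta$, we have

$$
\begin{aligned}
E|\theta^k(t+s) - \theta^k(t)|^2
&= E\bigg|\sum_{j=m(t+t_k)}^{m(t+t_k+s)-1}\frac{\varepsilon_j}{2}\Big(\nabla \log p(\theta_j) + \frac{N}{n_j}\sum_{i=1}^{n_j}\nabla \log p(x_{ji} \mid \theta_j)\Big) + \sum_{j=m(t+t_k)}^{m(t+t_k+s)-1}\sqrt{\varepsilon_j}\,\eta_j\bigg|^2 \\
&\le K\sum_{j=m(t+t_k)}^{m(t+t_k+s)-1}\varepsilon_j\big(1 + E|\theta_j|^2\big) + K\sum_{j=m(t+t_k)}^{m(t+t_k+s)-1}\varepsilon_j E|\eta_j|^2 \\
&\le K\sum_{j=m(t+t_k)}^{m(t+t_k+s)-1}\varepsilon_j\Big(1 + \sup_{m(t+t_k) \le j \le m(t+t_k+s)-1} E|\theta_j|^2\Big) + K\sum_{j=m(t+t_k)}^{m(t+t_k+s)-1}\varepsilon_j \\
&= O\big((t+s) - t\big) = O(s).
\end{aligned} \tag{3.4}
$$

In the above, we have used Lemma 3.2 to ensure that $\sup_j E|\theta_j|^2 < \infty$ over the indices in the sums, together with the step-size conditions $\sum \varepsilon_k^2 < \infty$ and $\sum \varepsilon_k = \infty$. Therefore, (3.4) leads to

$$ \lim_{\eta \to 0}\limsup_{k \to \infty} E|\theta^k(t+s) - \theta^k(t)|^2 = 0. $$

The tightness of $\{\theta^k(\cdot)\}$ then follows from [11, p. 47].

3.1. Weak Convergence. Since $\{\theta^k(\cdot)\}$ is tight, by Prohorov's Theorem (see [6]) we may select a convergent subsequence. For simplicity, we still denote the subsequence by $\{\theta^k(\cdot)\}$, with limit denoted by $\tilde\theta(\cdot)$.

Theorem 3.4. Under Assumption A, the sequence $\{\theta^k(\cdot)\}$ converges weakly to $\tilde\theta(\cdot)$, which is the solution of the Langevin dynamics with target density $p(\theta \mid X)$ given by (2.3).

Proof. By the Skorohod representation (see [6]), without loss of generality and without changing notation, we may assume that $\{\theta^k(\cdot)\}$ converges to $\tilde\theta(\cdot)$ almost surely, and that the convergence is uniform on each bounded interval. We proceed to characterize the limiting process.

Step 1: We show that the algorithm converges to the solution of the Langevin dynamics for (2.3). Define the operator

$$ Lg(\theta) = \frac{1}{2}\Big\langle \nabla g(\theta),\ \nabla \log p(\theta) + \sum_{i=1}^{N}\nabla \log p(x_i \mid \theta)\Big\rangle + \frac{1}{2}\mathrm{tr}\big[\nabla^2 g(\theta)\big], \tag{3.5} $$

for any suitable function $g$. For each $t > 0$ and $s > 0$, each positive integer $\kappa$, each $0 \le t_\iota \le t$ with $\iota \le \kappa$, each bounded and continuous function $\rho(\cdot)$, and each twice continuously differentiable function $h(\cdot)$ with compact support, we shall show that

$$ E\rho(\tilde\theta(t_\iota); \iota \le \kappa)\Big[h(\tilde\theta(t+s)) - h(\tilde\theta(t)) - \int_t^{t+s} Lh(\tilde\theta(u))\,du\Big] = 0. \tag{3.6} $$

This yields that $h(\tilde\theta(t)) - \int_0^t Lh(\tilde\theta(u))\,du$ is a continuous-time martingale, which in turn implies that $\tilde\theta(\cdot)$ is a solution of the martingale problem with operator $L$ defined in (3.5).

To establish the desired result, we work with the sequence $\theta^k(\cdot)$. By virtue of the weak convergence and the Skorohod representation, it is readily seen that

$$ E\rho(\theta^k(t_\iota); \iota \le \kappa)\big[h(\theta^k(t+s)) - h(\theta^k(t))\big] \to E\rho(\tilde\theta(t_\iota); \iota \le \kappa)\big[h(\tilde\theta(t+s)) - h(\tilde\theta(t))\big] \tag{3.7} $$

as $k \to \infty$. On the other hand, given a small $\Delta > 0$ and writing $m_l = m(t + t_k + l\Delta)$ for brevity, direct calculation shows that

$$ E\rho(\theta^k(t_\iota); \iota \le \kappa)\big[h(\theta^k(t+s)) - h(\theta^k(t))\big] = E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\big[h(\theta_{m_{l+1}}) - h(\theta_{m_l})\big]\bigg\}. \tag{3.8} $$

Step 2: For simplicity, we define

$$ f(\theta) = \nabla \log p(\theta) + \sum_{i=1}^{N}\nabla \log p(x_i \mid \theta), \qquad f_k(\theta) = \nabla \log p(\theta) + \frac{N}{n_k}\sum_{i=1}^{n_k}\nabla \log p(x_{ki} \mid \theta). \tag{3.9} $$

For the terms on the last line of (3.8), we have

$$
\begin{aligned}
&\lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\big[h(\theta_{m_{l+1}}) - h(\theta_{m_l})\big]\bigg\} \\
&\quad = \lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\bigg[\nabla h(\theta_{m_l})'\sum_{i=m_l}^{m_{l+1}-1}\frac{\varepsilon_i}{2}f_i(\theta_i) + \frac{1}{2}\sum_{i=m_l}^{m_{l+1}-1}\varepsilon_i\,\mathrm{tr}\big[\nabla^2 h(\theta_{m_l})\big]\bigg]\bigg\}.
\end{aligned}
$$

By the continuity of $f(\cdot)$ and the fact that $Ef(\theta) = Ef_k(\theta)$,

$$
\begin{aligned}
&\lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\nabla h(\theta_{m_l})'\sum_{i=m_l}^{m_{l+1}-1}\varepsilon_i\big[f(\theta_i) - f_i(\theta_{m_l})\big]\bigg\} \\
&\quad = \lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\nabla h(\theta_{m_l})'\sum_{i=m_l}^{m_{l+1}-1}\varepsilon_i\big[f(\theta_i) - f_i(\theta_i)\big]\bigg\} \\
&\qquad + \lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\nabla h(\theta_{m_l})'\sum_{i=m_l}^{m_{l+1}-1}\varepsilon_i\big[f_i(\theta_i) - f_i(\theta_{m_l})\big]\bigg\} \\
&\quad = 0.
\end{aligned}
$$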

Thus, in evaluating the limit, $f_i(\theta_i)$ can be replaced by $f(\theta_{m(t+t_k+l\Delta)})$. Consequently, by the weak convergence and the Skorohod representation,

$$
\begin{aligned}
&\lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\nabla h(\theta_{m(t+t_k+l\Delta)})'\sum_{i=m(t+t_k+l\Delta)}^{m(t+t_k+l\Delta+\Delta)-1}\frac{\varepsilon_i}{2}f(\theta_{m(t+t_k+l\Delta)})\bigg\} \\
&\quad = \lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\frac{\Delta}{2}\nabla h(\theta_{m(t+t_k+l\Delta)})'f(\theta_{m(t+t_k+l\Delta)})\bigg\} \\
&\quad = \lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\frac{\Delta}{2}\nabla h(\theta^k(t+l\Delta))'f(\theta^k(t+l\Delta))\bigg\} \\
&\quad = E\rho(\tilde\theta(t_\iota); \iota \le \kappa)\bigg\{\int_t^{t+s}\frac{1}{2}\nabla h(\tilde\theta(u))'f(\tilde\theta(u))\,du\bigg\}.
\end{aligned} \tag{3.10}
$$

In the above, in treating such terms as $f(\theta^k(t + l\Delta))$, we can approximate $\theta^k(\cdot)$ by a process taking finitely many values using a standard approximation argument (see, for example, [6, p. 169] for more details). Similar to (3.10), we also obtain

$$
\begin{aligned}
&\lim E\rho(\theta^k(t_\iota); \iota \le \kappa)\bigg\{\sum_{l=0}^{s/\Delta - 1}\frac{1}{2}\sum_{i=m(t+t_k+l\Delta)}^{m(t+t_k+l\Delta+\Delta)-1}\varepsilon_i\,\mathrm{tr}\big[\nabla^2 h(\theta_{m(t+t_k+l\Delta)})\big]\bigg\} \\
&\quad = E\rho(\tilde\theta(t_\iota); \iota \le \kappa)\bigg\{\int_t^{t+s}\frac{1}{2}\mathrm{tr}\big[\nabla^2 h(\tilde\theta(u))\big]\,du\bigg\}.
\end{aligned} \tag{3.11}
$$

Step 3: Combining Steps 1 and 2, we obtain that $\tilde\theta(\cdot)$, the weak limit of $\theta^k(\cdot)$, is a solution of the martingale problem with operator $L$ defined in (3.5). Using characteristic functions, we can show as in [18, Lemma 7.18] that $\tilde\theta(\cdot)$, the solution of the martingale problem with operator $L$, is unique in the sense of distribution. Thus $\theta^k(\cdot)$ converges to $\tilde\theta(\cdot)$ as desired, which concludes the proof of the theorem.

Remark 3.5. The proof can also be extended to the algorithm based on (2.2), which includes the preconditioner matrix and is more common in practice.

4. Numerical simulations. In this section, we show the performance of the hybrid deterministic-stochastic gradient method for binary logistic regression and multinomial logistic regression models of data classification, in comparison with the stochastic gradient descent method.

4.1. Binary logistic regression. We apply our hybrid deterministic-stochastic gradient method to a Bayesian logistic regression model and compare with the result of the stochastic gradient method. Logistic regression models [20] have been used widely in a vast number of applications for binary classification. Assume that we have data with $N$ examples of input-output pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^m$ is a vector of $m$ features, and $y_i \in \{-1, 1\}$ is the binary outcome. The goal is to find a linear classifier such that, given the features $x_i$ and a vector of parameters $\beta$, the sign of the inner product $\beta^T x_i$ gives $y_i$. The probability of the outcome given the features $x_i$ is

$$ p_1(y_i \mid x_i) = \sigma(y_i\beta^T x_i), \qquad \text{where } \sigma(z) = \frac{1}{1 + \exp(-z)}. $$

The bias parameter is absorbed into $\beta$ by including 1 as an entry in $x_i$. We use the standard normal distribution as the prior distribution for $\beta$. Then the gradient of the log likelihood is

$$ \nabla_\beta \log p_1(y_i \mid x_i) = \sigma(-y_i\beta^T x_i)\,y_i x_i, $$

and the gradient of the log of the prior is $\nabla_\beta \log p(\beta) = -\beta$.

We applied our algorithm to the a9a dataset used in [9], which is derived from the UCI Adult dataset. It consists of 32561 observations and 123 features. We used 80% of the data as the training data, i.e. $N = 26049$, and the remaining 20% as the test data. For the hybrid deterministic-stochastic method, we set the size of the sampling subset to $n_k = \min(1.1\,n_{k-1}, N)$, $k = 1, 2, \dots$, with $n_1 = 1$. For the stochastic gradient method, we use a constant number of samples, $n = 4$. We run each algorithm 50 times and take the average.

Figure 1 shows the average log joint probability per data item. It can be seen that the stochastic gradient algorithm leads to a larger likelihood at the beginning, while the hybrid deterministic-stochastic gradient algorithm dominates the stochastic gradient algorithm once the sample size grows.

Fig. 1. Comparison of the log probability of logistic regression for stochastic and hybrid deterministic-stochastic methods. (Plot of average log joint probability per data item against iterations; curves: stochastic, hybrid stochastic.)

Figure 2 shows the accuracy of the two methods on the testing data set. The accuracy is measured as the percentage of data correctly predicted over the testing data set.
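The binary logistic regression experiment above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses small synthetic data instead of a9a, rounds the batch-size rule $n_k = \min(1.1\,n_{k-1}, N)$ up to an integer (the rounding is an assumption, since the rule as stated is not integer valued), and picks hypothetical step-size constants `a`, `b`, `gamma`. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_sizes(num_iters, N, growth=1.1, n1=1):
    """Nondecreasing subset sizes n_k = min(ceil(growth * n_{k-1}), N).
    Rounding up with ceil is an assumption of this sketch."""
    sizes = [n1]
    for _ in range(num_iters - 1):
        sizes.append(min(int(np.ceil(growth * sizes[-1])), N))
    return sizes

def grad_log_post_estimate(beta, X, y, idx):
    """Prior gradient -beta plus the rescaled subset likelihood gradient,
    (N / n_k) * sum_i sigma(-y_i beta^T x_i) y_i x_i over the sampled rows idx."""
    N = X.shape[0]
    Xb, yb = X[idx], y[idx]
    w = sigmoid(-yb * (Xb @ beta))
    grad_lik = (w * yb) @ Xb
    return -beta + (N / len(idx)) * grad_lik

def hybrid_sgld(X, y, num_iters=300, a=0.01, b=10.0, gamma=0.55, seed=0):
    """Hybrid iteration in the form of (2.5): growing subsets plus Langevin noise."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    beta = np.zeros(m)
    sizes = batch_sizes(num_iters, N)
    for k, n_k in enumerate(sizes):
        eps = a * (b + k) ** (-gamma)                  # assumed step-size schedule
        idx = rng.choice(N, size=n_k, replace=False)   # subset sampled without replacement
        g = grad_log_post_estimate(beta, X, y, idx)
        beta = beta + 0.5 * eps * g + np.sqrt(eps) * rng.standard_normal(m)
    return beta, sizes

# Synthetic stand-in for the training data (assumed, not a9a)
data_rng = np.random.default_rng(42)
X = data_rng.standard_normal((500, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = np.where(X @ beta_true >= 0, 1.0, -1.0)

beta, sizes = hybrid_sgld(X, y)
accuracy = np.mean(np.sign(X @ beta) == y)
print(sizes[:12], sizes[-1], accuracy)
```

Setting `growth=1.0` with a fixed `n1` keeps the subset size constant, which corresponds to the plain stochastic gradient method used for comparison.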
The hybrid nature can be seen again: the stochastic gradient algorithm makes more rapid initial progress, while the hybrid deterministic-stochastic gradient algorithm gains more accuracy as the sample size grows with the iterations.

Fig. 2. Comparison of the accuracy of logistic regression on test set for stochastic and hybrid deterministic-stochastic methods. (Plot of accuracy against iterations; curves: stochastic, hybrid stochastic.)

4.2. Multinomial logistic regression. In this subsection, we apply both algorithms to a more general logistic regression model. Multinomial logistic regression extends the binary model to allow each outcome $y_i$ to take any value from a set of classes $C = \{1, 2, 3, \dots, n\}$ [20]. In this model, we have a separate parameter $\beta_j$ for each class $j \in C$. Given data $x_i \in \mathbb{R}^m$ with $m$ features, we model the probability of the outcome by

$$ p_2\big(y_i = j \mid x_i, (\beta_j)_{j \in C}\big) = \frac{\exp(\beta_j^T x_i)}{\sum_{l \in C}\exp(\beta_l^T x_i)}. $$

The gradient of the log likelihood with respect to $\beta_j$ is

$$ \nabla_{\beta_j}\log p_2\big(y_i \mid x_i, (\beta_l)_{l \in C}\big) = \Big(I_{\{y_i = j\}} - p_2\big(y_i = j \mid x_i, (\beta_l)_{l \in C}\big)\Big)x_i, $$

where $I_{\{\cdot\}}$ is the indicator function.

We run the experiments for multinomial logistic regression on the well-known MNIST data set [4], containing 70000 examples of $28 \times 28$ images of digits, where each digit is classified as one of the integers from 0 to 9. We used 60000 examples as the training data and the remaining 10000 as the test data. We set the sample size $n = 10$ for the stochastic gradient algorithm and $n_k = \min(1.1\,n_{k-1}, N)$ for the hybrid deterministic-stochastic gradient algorithm. Assume that the prior distributions for the parameters $\beta_j$, $j \in C$, are standard normal distributions. Then the gradient of the logarithm of the prior for $\beta_j$ is $\nabla_{\beta_j}\log p(\beta_j) = -\beta_j$. We run each algorithm 20 times and take the average.

Figure 3 shows the accuracy of the two methods. As in the experiments above, the stochastic gradient algorithm is better initially. However, the hybrid deterministic-stochastic gradient algorithm gains more accuracy as the sample size grows.

5. Remarks. In this work we developed a hybrid deterministic-stochastic gradient algorithm of Langevin dynamics for Bayesian learning. We extended the hybrid deterministic-stochastic gradient optimization method of [4] to Bayesian learning, and advanced the stochastic Bayesian learning method of [9]. A comprehensive weak convergence analysis of these algorithms is presented, based on stochastic approximation methods [6].
We showed by numerical simulations that our proposed method has the advantages of both stochastic gradient Bayesian learning and full gradient Bayesian learning. More efficient MCMC sampling methods based on hybrid deterministic-stochastic gradients are interesting directions for future work, for example, the study of more sophisticated Hamiltonian Monte Carlo approaches based on hybrid deterministic-stochastic gradients.

Fig. 3. Comparison of the accuracy of multi-logistic regression on test dataset for stochastic and hybrid deterministic-stochastic methods. (Plot of accuracy against iterations; curves: stochastic, hybrid stochastic.)

REFERENCES

[1] D.P. Bertsekas and J.N. Tsitsiklis, Neuro-dynamic programming: An overview, in: Proceedings of the 34th IEEE Conference on Decision and Control, 1995, vol. 1.
[2] L. Bottou, Online learning and stochastic approximations, On-line Learning in Neural Networks, 17 (1998).
[3] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
[4] M.P. Friedlander and M. Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM Journal on Scientific Computing, 34:3 (2012).
[5] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT 2010, 2010.
[6] H.J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, vol. 35, Springer, 2003.
[7] J.C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Transactions on Automatic Control, 37:3 (1992).
[8] B. Delyon, M. Lavielle, and E. Moulines, Convergence of a stochastic approximation version of the EM algorithm, Annals of Statistics, 27 (1999).
[9] M. Welling and Y.W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in: Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
[10] S. Ahn, A. Korattikara, and M. Welling, Bayesian posterior sampling via stochastic gradient Fisher scoring, in: Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[11] H.J. Kushner, Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory, vol. 6, MIT Press.
[12] A.M. Stuart, J. Voss, and P. Wiberg, Conditional path sampling of SDEs and the Langevin MCMC method, Communications in Mathematical Sciences, 2:4 (2004).
[13] C.P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 2010.
[14] G.O. Roberts and R.L. Tweedie, Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli.
[15] P.E. Kloeden and E. Platen, Numerical Solution of Stochastic Differential Equations, vol. 23, Springer, 1992.
[16] H. Robbins and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, in: Herbert Robbins Selected Papers, Springer.
[17] G.O. Roberts and O. Stramer, Langevin diffusions and Metropolis-Hastings algorithms, Methodology and Computing in Applied Probability, 4:4 (2002).
[18] G. Yin and Q. Zhang, Continuous-Time Markov Chains and Applications, vol. 37, Springer, 2013.
[19] J. Nocedal and S.J. Wright, Numerical Optimization, Springer, 2006.
[20] D.W. Hosmer and S. Lemeshow, Applied Logistic Regression, vol. 354, John Wiley & Sons, 2004.
