HYBRID DETERMINISTIC-STOCHASTIC GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING

Size: px
Start display at page:

Download "HYBRID DETERMINISTIC-STOCHASTIC GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING"

Transcription

1 COMMUNICATIONS IN INFORMATION AND SYSTEMS c 01 International Press Vol. 1, No. 3, pp. 1-3, HYBRID DETERMINISTIC-STOCHASTIC GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING QI HE AND JACK XIN Abstract. We propose a new algorithm to obtain Bayesian posterior distribution by a hybrid deterministic-stochastic gradient Langevin dynamics. To speed up convergence and reduce computational costs, it is common to use stochastic gradient method to approximate the full gradient by sampling a subset of the large dataset. Stochastic gradient methods make progress fast initially, however, they often become slow in the late stage as the iterations approach the desired solution. The conventional gradient methods converge better eventually however at the expense of evaluating the full gradient at each iteration. Our hybrid method has the advantages of both approaches for constructing the Bayesian posterior distribution. We prove that our algorithm converges based on the weak convergence methods, and illustrate numerically its effectiveness and improved accuracy. Key words: Langevin dynamics, Stochastic gradient, Bayesian learning. AMS Subject Classification. 60J7, 93E0, 34E Introduction. This work focuses on Bayesian learning based on a hybrid deterministic-stochastic gradient descent Langevin dynamics. There has been increasing interest in large scale datasets for machine learning, ranging from network data, signal processing and data mining to bioinformatics. The large scale data significantly increase the computational complexity of the underlying optimization algorithm. One of the successful methods to overcome this difficulty is stochastic gradient method, which is efficient and simple to implement. In the literature, stochastic gradient methods have been widely used for large scale machine learning. For example, [1, section 3.] and [] studied incremental-gradient method in which each iteration only evaluates the gradient along one single index; [3] applied stochastic gradient algorithm to large scale linear prediction problems; [4] explored hybrid deterministic-stochastic methods for large scale optimization. For more applications of stochastic gradient methods in large scale machine learning, we refer to [5] and references therein. The convergence of stochastic gradient algorithm has also been extensively studied in stochastic approximation literature, such as[, 6], where convergence holds under some mild conditions including the case where the loss function is not everywhere differentiable; see also [7] for the convergence of a simultaneous perturbation gradient approximation and [8] for the convergence of stochastic gradient expectation maximization (EM) algorithm. Along another line, there are few studies for large scale Bayesian learning by stochastic gradient method. [9] seems to be the first to propose stochastic gradient method for Bayesian learning by Langevin dynamics. Later, [10] improved the method Department of Mathematics, University of California, Irvine 9697, qhe@uci.edu. Department of Mathematics, University of California, Irvine 9697, jxin@math.uci.edu 1

2 QI HE AND JACK XIN by adding preconditioner using stochastic Fisher scoring. Although the algorithm works efficiently in numerical experiments, there is no theoretical analysis for the convergence of the stochastic gradient Bayesian learning yet. To bridge this gap, in this paper, we study Bayesian learning by hybrid deterministic-stochastic gradient method motivated by [4] and provide a convergence proof. It covers the model in [9] as a special case and so gives the proof of the convergence for algorithm in [9]. This is the first work, to the best of our knowledge, that presents the theoretical analysis of the algorithm for stochastic gradient Bayesian learning. It is worth mentioning that the convergence analysis of the hybrid deterministicstochastic method for data fitting in [4] does not apply directly to the Bayesian learning problem here. The algorithm for Bayesian learning converges to a probability distribution rather than to a deterministic number. The traditional error analysis is not suitable for Bayesian learning. Instead we employ weak convergence method [11] to treat both data fitting and Bayesian learning algorithms. The weak convergence method also requires weaker conditions for convergence. The rest of the paper is organized as follows. Section begins with certain preliminary results. Section 3 provides the proof of convergence for the algorithm. Section 4 shows advantages of our proposed method by numerical simulations. Finally, concluding remarks are given in Section 5. The work was partially supported by NSF grants DMS-1507, and DMS The authors thank Mr. Sungjin Ahn for helpful conversations on Bayesian learning and related data sets.. Preliminary Results. The Markov chain Monte Carlo(MCMC) method is a very popular tool for computational problems in Bayesian statistics, see [1] and [13]. To sample from the target density, MCMC methods construct a Markov chain whose stationary distribution is the target density. Under some suitable ergodic condition of the Markov chain, statistical quantities based on the target density can be calculated by simulating the Markov chain and computing time averaged quantities. A basic sampling method in MCMC is Langevin dynamics [13]. Assume that π is a target density on R r. Then the Langevin diffusion θ t is defined by the stochastic differential equation (.1) dθ t = 1 logπ(θ t)dt+dw t, wherew t isastandardbrownianmotion. When π is suitablysmooth, it can be shown that θ t has π as a stationary distribution [14]. Denote the density function of the diffusion θ t by ρ(t,x), then lim t ρ(t,x) = π(x). A variety of MCMC algorithms are based on Langevin dynamics. For example, the simplest way is to discretize Langevin dynamics (.1) by the Euler-Maruyama method [15]. In addition, the Metropolis step can be added to remove bias if the step size is such that bias is significant. There

3 HYBRID GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING 3 are also some variants of the method, for example, pre-conditioning the dynamic by a positive definite matrix A to obtain (.) dθ t = 1 A logπ(θ t)dt+a 1/ dw t. This dynamic also has π as its stationary distribution. To apply Langevin dynamics of MCMC method to Bayesian learning, we consider the following model. Let θ denote a parameter vector, with p(θ) a prior distribution, and p(x θ) is the conditional distribution density of data x given θ. The posterior distribution of parameter θ given a set of data X = (x i,1 i N) is (.3) p(θ X) p(θ)π N i p(x i θ). Replacing π in (.1) by p(θ X), we have the following Langevin dynamic for θ dθ t = 1 logp(θ t X)dt+dW(t), = 1 ( N ) logp(θ t )+ logp(x i θ t ) dt+dw(t). Hence, we have that the limit distribution of θ t is p(θ X), [14]. In order to approach the posterior distribution by Langevin dynamics, we use Euler-Maruyama method for discretization [15]. The algorithm is as follows (.4) θ k+1 = ε ( k logp(θ k )+ N ) logp(x i θ k ) + ε k η k, where η k is an i.i.d random variable sequence with normal distribution, and step size ε k satisfies k=1 ε k = and k=1 ε k <. These two conditions for ε k are very common for stochastic approximation with decreasing step size, see [6], [11] and [16]. Typically, step size ε k = a(b +t) γ which decays algebraically with γ (0.5,1]. To correct for discretization error, one can use (.4) as a proposal distribution and modify it by the Metropolis-Hasting method [17]. Note that as we decrease the step size ε k, the discretization error decreases to zero so that the rejection rate approaches zero. Hence, we can simply ignore the Metropolis-Hasting acceptance step. In the algorithm (.4), we need to evaluate the gradient of log(x θ) over all the data set, which is time consuming. To speed up convergence, [9] used the stochastic gradient method, i.e., at each step only estimate the gradient over a subset of the data. Assume that the size of the subset of the data is n < N, the algorithm is as follows θ k+1 = ε ( k logp(θ k )+ N n n ) logp(x i θ k ) + ε k η k. Although it is shown in [9] that this stochastic gradient Bayesian learning works well for some numerical simulations, the convergence improves slowly after a certain

4 4 QI HE AND JACK XIN number of iterations. The reason for this is that the stochastic gradient method makes good progress initially, but it is slow in improving the accuracy as the iterates approach the limiting solution. To overcome this disadvantage, [4] proposed a hybrid deterministic-stochastic method for optimization, in which the size of the sampling subset is increasing after each iteration. It is shown that this method exhibits benefits of both the stochastic gradient method and full gradient method. Motivated by [4], we propose a hybrid deterministic-stochastic gradient method for Bayesian learning. The new ingredient is to inject additional noise according to Langevin dynamics into the algorithm so that convergence to the full posterior distribution holds. The hybrid deterministic-stochastic gradient Bayesian learning algorithm is given by (.5) θ k+1 = ε k ( logp(θ k )+ N n k ) logp(x i θ k ) + ε k η k, n k wheren k isthenumberofsizeofasamplesubsetatthek-iteration. Itisnondecreasing and lim k n k N. 3. Weak convergence method. In this section, we prove that the distribution of the algorithm (.5) converges to the posterior distribution p(θ X). We introduce the following conditions. Assumption A. Assume that logp(θ) and logp(θ x) satisfy linear growth and Lipschitz conditions, and that p(x) is continuously differentiable. Remark 3.1. The linear growth and Lipschitz conditions in assumption A can be extended to local linear growth and local Lipschitz conditions by applying truncation techniques, for more details we refer to [6]. Define t k = k 1 i=0 ε i,m(t) = max{k : t k t}, and continuous-time interpolations (3.1) θ 0 (t) = θ k, for t [t k,t k+1 ), θ k (t) = θ 0 (t+t k ). We first obtain an estimate on the pth moment of {θ k }. This is stated as follows. Lemma 3.. Under assumption A, for any fixed p and T > 0, (3.) sup E θ k p ( θ 0 p +KT)exp(KT) <, 0 k m(t) for some constant K > 0. Proof. Define U(θ) = θ p and use E k to denote the conditional expectation with respect to the σ-algebra G k, where G k = σ(θ 1,...,θ k ). In the following, we use

5 HYBRID GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING 5 notation for the transpose operation. Thus (3.3) E k U(θ k+1 ) U(θ k ) = E k U (θ k )[θ k+1 θ k ] +E k (θ k+1 θ k ) U(θ + k )(θ k+1 θ k ) ε k U (θ k ) ( logp(θ k )+ N n k ) logp(x ki θ k ) n k +Kε k θ k p (1+ θ k ) Kε k (1+ θ k p ), where U and U denote the gradient and the Hessian of U w.r.t. to x, and θ + k denotes a vector on the line segment joining θ k and θ k+1. Note that we have used the linear growth in θ for both logp(θ) and logp(θ x) in the last line of (3.3). Since U(θ k ) = θ k p, we obtain E k θ k+1 p θ k p +Kε k +Kε k θ k p. Taking the expectation on both sides and iterating on the resulting recursion, we have n E θ n+1 p θ 0 p +Kε k n+kε k E θ k p. k=0 An application of the Gronwall s inequality yields E θ n+1 p ( θ 0 p +KT)exp(KT) as desired. To proceed, let us recall the definition of tightness. Assume that B is a metric space and that B is a σ algebra of subsets of B. A set of probability measures {P α } on (B,B) istight ifforeachε > 0there isacompactset B ε B suchthat inf α P α {B α } 1 ε. A set of random variable {x α } is tight if its corresponding set of measures {P α } is tight. For more details of tightness, we refer readers to the book [11]. In view of the estimate above, {θ k : 0 k m(t)} is tight in R r by applying the well-known Chebyshev inequality. That is, for each δ > 0, there is a K δ satisfying K δ > (1/δ) such that P( θ k > K δ ) sup E θ k 0 n m(t) Kδ Kδ. Next, we show that {θ k ( )} is tight in suitable function spaces. Lemma 3.3. Under assumption A, {θ k ( )} is tight in D r [0, ), the space of functions that are right continuous and have left limits, endowed with the Skorohod topology.

6 6 QI HE AND JACK XIN Proof. For any η > 0, t 0, 0 s η, we have (3.4) E θ k (t+s) θ k (t) m(t+s) 1 = E 1 n k ε k ( logp(θ k )+ K K k=m(t) m(t+s) 1 k=m(t) m(t+s) 1 k=m(t) ε k(1+e θ k )+K ε k(1+ O(t+s t) = O(s). sup m(t) k m(t+s) 1 N ) logp(θ k x ki ) + n k m(t+s) 1 k=m(t) ε k E η k E θ k )+K m(t+s) 1 k=m(t) m(t+s) 1 ε k k=m(t) εk η k In the above, we have used Lemma 3. to ensure that sup m(t) k m(t+s) E θ k < and the conditions of the stepsize ε k < and ε k =. Therefore, (3.4) leads to lim limsup E θ k (t+s) θ k (t) = 0. η 0 k The tightness of {θ k ( )} then follows from [11, p. 47] Weak Convergence. Since {θ k ( )} is tight, by Prohorov s Theorem (see [6]), we may select a convergent subsequence. For simplicity, we still denote the subsequence by {θ k ( )} with limit denoted by { θ( )}. Theorem 3.4. Under assumption A, the sequence {θ k ( )} converges weakly to θ( ), which is the solution of Langevin dynamic given by (.3). Proof. By Skorohod representation(see[6]), without loss of generality and without changing notation, we may assume that {θ k ( )} converges to θ( ) almost surely, and the convergence is uniform on each bounded interval. We proceed to characterize the limiting process. Step 1: We show that the algorithm converges to the solution of (.3). Define the operator (3.5) Lg(θ) = 1 g(θ), logp(θ)+ n logp(θ x i ) + 1 tr[ g(θ)], for any suitable function g. For each t > 0 and s > 0, each positive integer κ, each 0 t ι t with ι κ, each bounded and continuous function ρ 0 ( ), and for each twice continuously differentiable function h( ) with compact support, we shall show that [ (3.6) Eρ( θ(t ι );ι κ) h( θ(t+s)) h( θ t ) This yields that h( θ t ) t 0 t+s t ] Lh( θ(u))du = 0. Lh( θ(u))du is a continuous-time Martingale,

7 HYBRID GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING 7 which in turn implies that θ( ) is a solution of the Martingale problem with operator L defined in (3.5). To establish the desired result, we work with the sequence θ k ( ). By virtue of the weak convergence and the Skorohod representation, it is readily seen that (3.7) Eρ(θ k (t ι );ι κ)[h(θ k (t+s)) h(θ [ k (t))] ] Eρ( θ(t ι );ι κ) h( θ(t+s)) h( θ(t)) as ε 0. On the other hand, given a small, direct calculation shows that [ ] Eρ(θ k (t ι );ι κ) h(θ k (t+s)) h(θ k (t)) (3.8) = Eρ(θ k (t ι );ι κ) { s/ 1 Step : For simplicity, we define f(θ) = logp(θ)+ (3.9) [ h(θm(t+tk +l + )) h(θ m(t+tk +l )) ]}. N logp(θ x i ), and f k (θ) = logp(θ)+ N n k logp(θ x ki ). n k For the terms on the last line of (3.8), we have lim = lim s/ 1 [h(θ m(t+tk +l + )) h(θ m(t+tk +l ))] { s/ 1 + [ h (θ m(t+tk +l )) By the continuity of f( ) and the fact that Ef(θ) = Ef k (θ), { s/ 1 lim h (θ m(t+tk +l )) = lim + lim = 0. { s/ 1 h (θ m(t+tk +l )) { s/ 1 h (θ m(t+tk +l )) ε i tr[ h(θ m(t+tk +l ))] ε i f i(θ i ) ]}. ε i [ f(θi ) f i (θ m(t+tk +l )) ]} ε i [ f(θi ) f i (θ i ) ]} ε i [ fi (θ i ) f i (θ m(t+tk +l )) ]}

8 8 QI HE AND JACK XIN Thus, in evaluating the limit, f i (θ i ) can be replaced by f(θ m(t+tk +l )). Consequently, by the weak convergence and the Skorohod representation, (3.10) { s/ 1 lim h (θ m(t+tk +l )) = lim { s/ 1 { s/ 1 ε } i f(θ m(t+t k +l )) h (θ m(t+tk +l )) } f(θ m(t+t k +l )) } = lim h (θ k (t+l ))f(θ k (t+l )) { t+s } = Eρ( θ(t ι );ι κ) h ( θ(u))f( θ(u))du. t In the above, treating such terms as f(θ k (t + lδ ε )), we can approximate θ k ( ) by a process taking finitely many values using a standard approximation argument (see example, [6, p. 169] for more details). Similar to (3.10), we also obtain (3.11) lim { s/ 1 { t+s = Eρ( θ(t ι );ι κ) t 1 tr[ h( θ(u)) ] du ε i tr[ h(θ t+tk +l ) ]} }. Step 3: Combining Steps 1, we obtain that θ( ), the weak limit of θ k ( ), is a solution of the Martingale problem with operator L defined in (3.5). Using characteristic functions, we show as in [18, Lemma 7.18], θ( ) the solution of the Martingale problem with operator L, is unique in the sense of distribution. Thus θ k ( ) converges to θ( ) as desired, which concludes the proof of the theorem. Remark 3.5. The proof can also be extended to the algorithm (.) including the preconditioner matrix, which is more common in practice. 4. Numerical simulations. In this section, we show the performance of the hybrid deterministic-stochastic gradient method for binary logistic regression and multinomial logistic regression models of data classification, in comparison with the stochastic gradient descent method Binary logistic regression. We apply our hybrid deterministic-stochastic gradient method to Bayesian logistic regression model and compare with the result of stochastic gradient method. Logistic regression models [0] have been used widely in a vast number of applications for the problem of binary classification. Assume that we have data with N examples of input-output pairs (x i,y i ), where x i R m is a

9 HYBRID GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING 9 vector of m features, and y i { 1,1} is the binary outcome. The goal is to find a linear classifier that given the features x i and a vector of parameters β, the sign of the inner product β T x i gives y i. The probability of the outcome given features x i is p 1 (y i x i ) = σ(y i β T x i ), 1 where σ(z) = 1+exp( z). The bias parameter is absorbed into β by including 1 as an entry in x i. We use standard normal distribution as the prior distribution for β. Then the gradient of the log likelihood is: β logp 1(y i x i ) = σ( y i β T x i )y i x i, and the gradient of the log of the prior is β logp(β) = β. We applied our algorithm tothea9adatasetusedin[9]fromuciadultdataset. It consistsof3561observations and 13 features. We used 80% of the data as the training data, i.e. N = 6049 and the rest 0% as the test data. For hybrid deterministic-stochastic method, we set the number of the size of the sampling subset as n k = min(1.1n k 1,N),k = 1,, with n 1 = 1. While for the stochastic gradient method, we use the constant number of samples with n = 4. We run each algorithm 50 times and take the average. Figure 1 shows the average log joint probability per data item. It can be seen that stochastic gradient algorithm leads to large likelihood at the beginning, while the hybrid stochastic-deterministic gradient algorithm dominates the stochastic gradient when the sample size grows. 0.3 Comparison of average log joint probability per data iterm for stochastic and hybrid stochastic methods log probability stochastic hybrid stochastic Iterations Fig. 1. Comparison of the log probability of logistic regression for stochastic and hybrid deterministic-stochastic methods. Figure shows the accuracy by comparing these two methods on the testing data set. The accuracy is in the sense of data percentages correctly predicted over the testing data set. The hybrid nature can be seen again: the stochastic gradient algorithm has more rapid initial progress, while the hybrid deterministic-stochastic gradient algorithm gains more accuracy when the sample size grows with the iteration. 4.. Multinomial logistic regression. In this subsection, we apply both algorithms to a more general logistic regression model. Multinomial logistic regression extends the binary requirement to allow each outcome y i to take any value from a set

10 30 QI HE AND JACK XIN 0.84 Comparison of accuracy of stochastic and Hybrid stochastic methods Accuracy Stochastic hybrid stochastic Iterations Fig.. Comparison of the accuracy of logistic regression on test set for stochastic and hybrid deterministic-stochastic methods. of classes C = {1,,3,,n}, [0]. In this model, we have a separate parameter β j for each class j C. Given data x i R m with m features, we model the probability of the outcome by: p (y i = j x i,(β j ) j C ) = exp(β T j x i) l C exp(βt l x l). The gradient of the log likelihood is: p (y i = j x i,(β j ) j C ) = σ( y i β l I {yi=l}x i )y i x i, β j l C where I { } is the indicator function. We run the experiments for multinomial logistic regression on the well-known MNIST data set [4], containing examples of 8 8 images of digits, where each digit is classified as one of the integers between 0 to 9. We used examples as the training data and the rest 10000as the test data. We set the sample size n = 10 for stochastic gradient algorithm and n k = min(1.1n k,n) for hybrid deterministicstochasticgradientalgorithm. Assumethatpriordistributionsforparameterβ j,j C are standard normal distributions. Then the gradient of the logarithm of the prior for β j is beta j logp(β j ) = β j. We run each algorithm 0 times and take the average. Figure 3 shows the accuracy of these two methods. As the results of the above experiments, the stochastic gradient algorithm is better initially. However, hybrid deterministic-stochastic gradient algorithm gains more accuracy as the sample size grows. 5. Remarks. Our work developed hybrid deterministic-stochastic gradient algorithm of Langevin dynamics to Bayesian learning. We further developed the hybrid deterministic-stochastic gradient optimization method [4] for Bayesian learning, and advanced the stochastic Bayesian learning method in [9]. A comprehensive weak convergence analysis of these algorithms is presented based on stochastic approximations method [6]. We showed by numerical simulations that our proposed method has both the advantages of the stochastic gradient Bayesian learning and the full gradient Bayesian learning. More efficient MCMC sampling methods based on hybrid

11 HYBRID GRADIENT LANGEVIN DYNAMICS FOR BAYESIAN LEARNING Comparison of accuaracy on test set for multilogistic regression for stochastic and hybrid stochastic method Accuracy stochastic hybrid stochastic Iterations Fig. 3. Comparison of the accuracy of multi-logistic regression on test dataset for stochastic and hybrid deterministic-stochastic methods. deterministic-stochastic gradients are interesting directions in the future, for example, the study of the more sophisticated Hamiltonian Monte Carlo approaches based on the hybrid deterministic-stochastic gradients. REFERENCES [1] D.P. Bertsekas and J.N. Tsitsiklis, Neuro-dynamic programming: An overview, in: Proceedings of the 34th IEEE Conference on Decision and Control, 1995, vol. 1, pp [] L. Bottou, Online learning and stochastic approximations, On-line learning in neural networks, 17(1998), pp [3] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 1th international conference on Machine learning(icml), 004, pp [4] M.P. Friedlander and M. Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM Journal on Scientific Computing, 34:3(01), pp [5] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT 010, 010, pp [6] H.J. Kushner and G. Yin, Stochastic approximation and recursive algorithms and applications, vol. 35, Springer, 003. [7] J.C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Transactions on Automatic Control, 37:3(199), pp [8] B. Delyon, M. Lavielle, and E. Moulines, Convergence of a stochastic approximation version of the EM algorithm, Annals of Statistics, 7(1999), pp [9] M. Welling and Y.W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in: Proceedings of the 8th International Conference on Machine Learning (ICML), 011, pp [10] S. Ahn, A. Korattikara, and M. Welling, Bayesian posterior sampling via stochastic gradient Fisher scoring, in: Proceedings of the 9th International Conference on Machine Learning (ICML), 01, pp [11] H.J. Kushner, Approximation and Weak Convergence Methods for Random Processes With Application to Stochastics Systems Theory, vol. 6, MIT press, [1] A.M. Stuart, J. Voss, and P. Wilberg, Conditional path sampling of SDEs and the Langevin MCMC method, Communications in Mathematical Sciences, :4(004), pp [13] C.P. Robert and G. Casella, Monte Carlo statistical methods, Springer, 010. [14] G.O. Roberts and R.L. Tweedie, Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli, pp , [15] P.E. Kloeden and E. Platen, Numerical solution of stochastic differential equations, vol. 3,

12 3 QI HE AND JACK XIN Springer, 199. [16] H. Robbins and D. Siegmund, A convergence theorem for non negative almost supermartingales and some applications, in: Herbert Robbins Selected Papers, pp Springer, [17] G.O. Roberts and O. Stramer, Langevin diffusions and Metropolis-Hastings algorithms, Methodology and computing in applied probability, 4:4(00), pp [18] G. Yin and Q. Zhang, Continuous-time Markov chains and applications, vol. 37, Springer, 013. [19] J. Nocedal and S.J. Wright, Numerical optimization, Springer, 006. [0] D.W. Hosmer and S. Lemeshow, Applied logistic regression, vol. 354, John Wiley & Sons, 004.

Monte Carlo in Bayesian Statistics

Monte Carlo in Bayesian Statistics Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview

More information

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C.

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Spall John Wiley and Sons, Inc., 2003 Preface... xiii 1. Stochastic Search

More information

Riemann Manifold Methods in Bayesian Statistics

Riemann Manifold Methods in Bayesian Statistics Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes

More information

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Afternoon Meeting on Bayesian Computation 2018 University of Reading Gabriele Abbati 1, Alessra Tosi 2, Seth Flaxman 3, Michael A Osborne 1 1 University of Oxford, 2 Mind Foundry Ltd, 3 Imperial College London Afternoon Meeting on Bayesian Computation 2018 University of

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget Anoop Korattikara, Yutian Chen and Max Welling,2 Department of Computer Science, University of California, Irvine 2 Informatics Institute,

More information

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling Christophe Dupuy INRIA - Technicolor christophe.dupuy@inria.fr Francis Bach INRIA - ENS francis.bach@inria.fr Abstract

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring Sungjin Ahn Dept. of Computer Science, UC Irvine, Irvine, CA 92697-3425, USA Anoop Korattikara Dept. of Computer Science, UC Irvine, Irvine, CA 92697-3425, USA Max Welling Dept. of Computer Science, UC

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Hamiltonian Monte Carlo for Scalable Deep Learning

Hamiltonian Monte Carlo for Scalable Deep Learning Hamiltonian Monte Carlo for Scalable Deep Learning Isaac Robson Department of Statistics and Operations Research, University of North Carolina at Chapel Hill isrobson@email.unc.edu BIOS 740 May 4, 2018

More information

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization Changyou Chen, David Carlson, Zhe Gan, Chunyuan Li, Lawrence Carin May 2, 2016 1 Changyou Chen Bridging the Gap between Stochastic

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Some Results on the Ergodicity of Adaptive MCMC Algorithms

Some Results on the Ergodicity of Adaptive MCMC Algorithms Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Perturbed Proximal Gradient Algorithm

Perturbed Proximal Gradient Algorithm Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Markov Chain Monte Carlo. Simulated Annealing.

Markov Chain Monte Carlo. Simulated Annealing. Aula 10. Simulated Annealing. 0 Markov Chain Monte Carlo. Simulated Annealing. Anatoli Iambartsev IME-USP Aula 10. Simulated Annealing. 1 [RC] Stochastic search. General iterative formula for optimizing

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

Stochastic Gradient Descent in Continuous Time

Stochastic Gradient Descent in Continuous Time Stochastic Gradient Descent in Continuous Time Justin Sirignano University of Illinois at Urbana Champaign with Konstantinos Spiliopoulos (Boston University) 1 / 27 We consider a diffusion X t X = R m

More information

A System Theoretic Perspective of Learning and Optimization

A System Theoretic Perspective of Learning and Optimization A System Theoretic Perspective of Learning and Optimization Xi-Ren Cao* Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong eecao@ee.ust.hk Abstract Learning and optimization

More information

Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER 204 205 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE hour Please note that the rubric of this paper is made different from many other papers.

More information

Machine Learning. Lecture 3: Logistic Regression. Feng Li.

Machine Learning. Lecture 3: Logistic Regression. Feng Li. Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

Approximate Slice Sampling for Bayesian Posterior Inference

Approximate Slice Sampling for Bayesian Posterior Inference Approximate Slice Sampling for Bayesian Posterior Inference Anonymous Author 1 Anonymous Author 2 Anonymous Author 3 Unknown Institution 1 Unknown Institution 2 Unknown Institution 3 Abstract In this paper,

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Least Mean Square Algorithms With Markov Regime-Switching Limit

Least Mean Square Algorithms With Markov Regime-Switching Limit IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 577 Least Mean Square Algorithms With Markov Regime-Switching Limit G. George Yin, Fellow, IEEE, and Vikram Krishnamurthy, Fellow, IEEE

More information

Branching Processes II: Convergence of critical branching to Feller s CSB

Branching Processes II: Convergence of critical branching to Feller s CSB Chapter 4 Branching Processes II: Convergence of critical branching to Feller s CSB Figure 4.1: Feller 4.1 Birth and Death Processes 4.1.1 Linear birth and death processes Branching processes can be studied

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Changyou Chen Department of Electrical and Computer Engineering, Duke University cc448@duke.edu Duke-Tsinghua Machine Learning Summer

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Logistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017

Logistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017 1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function

More information

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms

Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Almost Sure Convergence of Two Time-Scale Stochastic Approximation Algorithms Vladislav B. Tadić Abstract The almost sure convergence of two time-scale stochastic approximation algorithms is analyzed under

More information

Thermostatic Controls for Noisy Gradient Systems and Applications to Machine Learning

Thermostatic Controls for Noisy Gradient Systems and Applications to Machine Learning Thermostatic Controls for Noisy Gradient Systems and Applications to Machine Learning Ben Leimkuhler University of Edinburgh Joint work with C. Matthews (Chicago), G. Stoltz (ENPC-Paris), M. Tretyakov

More information

A SCALED STOCHASTIC NEWTON ALGORITHM FOR MARKOV CHAIN MONTE CARLO SIMULATIONS

A SCALED STOCHASTIC NEWTON ALGORITHM FOR MARKOV CHAIN MONTE CARLO SIMULATIONS A SCALED STOCHASTIC NEWTON ALGORITHM FOR MARKOV CHAIN MONTE CARLO SIMULATIONS TAN BUI-THANH AND OMAR GHATTAS Abstract. We propose a scaled stochastic Newton algorithm ssn) for local Metropolis-Hastings

More information

Approximate Slice Sampling for Bayesian Posterior Inference

Approximate Slice Sampling for Bayesian Posterior Inference Approximate Slice Sampling for Bayesian Posterior Inference Christopher DuBois GraphLab, Inc. Anoop Korattikara Dept. of Computer Science UC Irvine Max Welling Informatics Institute University of Amsterdam

More information

Outline. A Central Limit Theorem for Truncating Stochastic Algorithms

Outline. A Central Limit Theorem for Truncating Stochastic Algorithms Outline A Central Limit Theorem for Truncating Stochastic Algorithms Jérôme Lelong http://cermics.enpc.fr/ lelong Tuesday September 5, 6 1 3 4 Jérôme Lelong (CERMICS) Tuesday September 5, 6 1 / 3 Jérôme

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Logistic Regression. Seungjin Choi

Logistic Regression. Seungjin Choi Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

On the fast convergence of random perturbations of the gradient flow.

On the fast convergence of random perturbations of the gradient flow. On the fast convergence of random perturbations of the gradient flow. Wenqing Hu. 1 (Joint work with Chris Junchi Li 2.) 1. Department of Mathematics and Statistics, Missouri S&T. 2. Department of Operations

More information

Variational Inference via Stochastic Backpropagation

Variational Inference via Stochastic Backpropagation Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation

More information

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016 Lecture 7 Logistic Regression Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 11, 2016 Luigi Freda ( La Sapienza University) Lecture 7 December 11, 2016 1 / 39 Outline 1 Intro Logistic

More information

Bayesian Deep Learning

Bayesian Deep Learning Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference

More information

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17 MCMC for big data Geir Storvik BigInsight lunch - May 2 2018 Geir Storvik MCMC for big data BigInsight lunch - May 2 2018 1 / 17 Outline Why ordinary MCMC is not scalable Different approaches for making

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Adaptive Rejection Sampling with fixed number of nodes

Adaptive Rejection Sampling with fixed number of nodes Adaptive Rejection Sampling with fixed number of nodes L. Martino, F. Louzada Institute of Mathematical Sciences and Computing, Universidade de São Paulo, Brazil. Abstract The adaptive rejection sampling

More information

Weak Convergence: Introduction

Weak Convergence: Introduction 7 Weak Convergence: Introduction 7.0 Outline of Chapter Up to now, we have concentrated on the convergence of {θ n } or of {θ n ( )} to an appropriate limit set with probability one. In this chapter, we

More information

dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés

dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Inférence pénalisée dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Gersende Fort Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier Toulouse,

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Statistical Optimality of Stochastic Gradient Descent through Multiple Passes Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Loucas Pillaud-Vivien

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Rate of Convergence of Learning in Social Networks

Rate of Convergence of Learning in Social Networks Rate of Convergence of Learning in Social Networks Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar Massachusetts Institute of Technology Cambridge, MA Abstract We study the rate of convergence

More information

1 Geometry of high dimensional probability distributions

1 Geometry of high dimensional probability distributions Hamiltonian Monte Carlo October 20, 2018 Debdeep Pati References: Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.11 (2011): 2. Betancourt, Michael. A conceptual

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

LEARNING SPARSE STRUCTURED ENSEMBLES WITH STOCASTIC GTADIENT MCMC SAMPLING AND NETWORK PRUNING

LEARNING SPARSE STRUCTURED ENSEMBLES WITH STOCASTIC GTADIENT MCMC SAMPLING AND NETWORK PRUNING LEARNING SPARSE STRUCTURED ENSEMBLES WITH STOCASTIC GTADIENT MCMC SAMPLING AND NETWORK PRUNING Yichi Zhang Zhijian Ou Speech Processing and Machine Intelligence (SPMI) Lab Department of Electronic Engineering

More information

Ergodic Subgradient Descent

Ergodic Subgradient Descent Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September

More information

Scalable Deep Poisson Factor Analysis for Topic Modeling: Supplementary Material

Scalable Deep Poisson Factor Analysis for Topic Modeling: Supplementary Material : Supplementary Material Zhe Gan ZHEGAN@DUKEEDU Changyou Chen CHANGYOUCHEN@DUKEEDU Ricardo Henao RICARDOHENAO@DUKEEDU David Carlson DAVIDCARLSON@DUKEEDU Lawrence Carin LCARIN@DUKEEDU Department of Electrical

More information

Patterns of Scalable Bayesian Inference Background (Session 1)

Patterns of Scalable Bayesian Inference Background (Session 1) Patterns of Scalable Bayesian Inference Background (Session 1) Jerónimo Arenas-García Universidad Carlos III de Madrid jeronimo.arenas@gmail.com June 14, 2017 1 / 15 Motivation. Bayesian Learning principles

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Average Reward Parameters

Average Reward Parameters Simulation-Based Optimization of Markov Reward Processes: Implementation Issues Peter Marbach 2 John N. Tsitsiklis 3 Abstract We consider discrete time, nite state space Markov reward processes which depend

More information

Asymptotic and finite-sample properties of estimators based on stochastic gradients

Asymptotic and finite-sample properties of estimators based on stochastic gradients Asymptotic and finite-sample properties of estimators based on stochastic gradients Panagiotis (Panos) Toulis panos.toulis@chicagobooth.edu Econometrics and Statistics University of Chicago, Booth School

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information