Approximate Bayesian computation for the parameters of PRISM programs

James Cussens
Department of Computer Science & York Centre for Complex Systems Analysis
University of York, Heslington, York, YO10 5DD, UK

Abstract. Probabilistic logic programming formalisms permit the definition of potentially very complex probability distributions. This complexity can often make learning hard, even when structure is fixed and learning reduces to parameter estimation. In this paper an approximate Bayesian computation (ABC) method is presented which computes approximations to the posterior distribution over PRISM parameters. The key to ABC approaches is that the likelihood function need not be computed; instead, a distance between the observed data and synthetic data generated by candidate parameter values is used to drive the learning. This makes ABC highly appropriate for PRISM programs, which can have an intractable likelihood function but from which synthetic data can be readily generated. The algorithm is experimentally shown to work well on an easy problem, but further work is required to produce acceptable results on harder ones.

1 Introduction

In the Bayesian approach to parameter estimation a prior distribution for the parameters is combined with observed data to produce a posterior distribution. A key feature of the Bayesian approach is that the posterior provides a full picture of the information contained in prior and data: with an uninformative prior and little data we are not in a position to make confident estimates of the parameters, and this will be reflected in a flat posterior. In contrast, when much data is available the posterior will concentrate probability mass in small regions of the parameter space, reflecting greater confidence in parameter estimates. Despite its attractive features, the Bayesian approach is problematic because in many cases computing or even representing the posterior distribution is very difficult.
One area in which this is often the case is statistical relational learning (SRL). SRL formalisms combine probabilistic models with rich representation languages (often logical) which allows highly complex probabilistic models to be described. In particular, the likelihood function (the probability of observed data as a function of the model's parameters) is often intractable. This makes Bayesian and non-Bayesian parameter estimation difficult since the likelihood function plays a key role in both.

In this paper an approximate Bayesian computation (ABC) method is presented which approximates the posterior distribution over the parameters for a PRISM program with a given structure. The key feature of ABC approaches is that the likelihood function is never calculated. Instead synthetic datasets are generated and compared with the actually observed data. If a candidate parameter set generates synthetic datasets which are mostly close to the real data then it will tend to end up with high posterior probability.

The rest of the paper is set out as follows. In Section 2 an account of approximate Bayesian computation is given. In Section 3 the essentials of PRISM programs are explained. Section 4 is the core of the paper where it is shown how to apply ABC to PRISM. Section 5 reports on initial experimental results and the paper concludes with Section 6 which includes pointers to future work.

2 Approximate Bayesian computation

The ABC method applied in this paper is the ABC sequential Monte Carlo (ABC SMC) algorithm devised by Toni et al [1] and so the basic ideas of ABC will be explained using the notation of that paper. ABC approaches are motivated when the likelihood function is intractable but it is straightforward to sample synthetic data using any given candidate parameter set $\theta$. The simplest ABC algorithm is a rejection sampling approach described by Marjoram et al [2]. Since this is a Bayesian approach there must be a user-defined prior distribution $\pi(\theta)$ over the model parameters. Let $x_0$ be the observed data. If it is possible to readily sample from $\pi(\theta)$ then it is possible to sample from the posterior distribution $\pi(\theta \mid x_0)$ as follows: (1) sample $\theta^*$ from $\pi$, (2) sample synthetic data $x^*$ from the model with its parameters set to $\theta^*$ (i.e. sample from $f(x \mid \theta^*)$ where $f$ is the likelihood function), (3) if $x_0 = x^*$ accept $\theta^*$.
The problem, of course, with this algorithm is that in most real situations the probability of sampling synthetic data which is exactly equal to the observed data will be tiny. A somewhat more realistic option is to define a distance function $d(x_0, x^*)$ which measures how close synthetic data $x^*$ is to the real data $x_0$. $\theta^*$ is now accepted at stage (3) above when $d(x_0, x^*) \le \epsilon$ for some user-defined $\epsilon$. With this adaptation the rejection sampling approach will produce samples from $\pi(\theta \mid d(x_0, x^*) \le \epsilon)$. As long as $\epsilon$ is reasonably small, this will be a good approximation to $\pi(\theta \mid x_0)$.

Choosing a value for $\epsilon$ is crucial: too big and the approximation to the posterior will be poor, too small and very few synthetic datasets will be accepted. A way out of this conundrum is to choose not one value for $\epsilon$, but a sequence of decreasing values: $\epsilon_1, \dots, \epsilon_T$ ($\epsilon_1 > \dots > \epsilon_T$). This is the key idea behind the ABC sequential Monte Carlo (ABC SMC) algorithm:

"In ABC SMC, a number of sampled parameter values (called particles) $\{\theta^{(1)}, \dots, \theta^{(N)}\}$, sampled from the prior distribution $\pi(\theta)$, is propagated through a sequence of intermediate distributions $\pi(\theta \mid d(x_0, x^*) \le \epsilon_t)$, $t = 1, \dots, T-1$, until it represents a sample from the target distribution $\pi(\theta \mid d(x_0, x^*) \le \epsilon_T)$." [1]
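The tolerance-based rejection sampler described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the model is a coin with unknown P(heads), the prior is Uniform(0, 1), and the distance is the absolute difference in the proportion of heads. All function names are hypothetical.

```python
import random


def simulate_heads(p_heads, n, rng):
    """Sample n coin tosses from the model and return the number of heads."""
    return sum(1 for _ in range(n) if rng.random() < p_heads)


def abc_rejection(observed_heads, n, epsilon, n_accept, rng):
    """Draw n_accept values of theta from pi(theta | d(x0, x*) <= epsilon)."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.random()                      # (1) sample theta* from the prior (assumed uniform)
        heads = simulate_heads(theta, n, rng)     # (2) sample synthetic data x* from f(x | theta*)
        distance = abs(heads - observed_heads) / n  # assumed distance: gap in proportion of heads
        if distance <= epsilon:                   # (3) accept theta* when d(x0, x*) <= epsilon
            accepted.append(theta)
    return accepted


rng = random.Random(0)
sample = abc_rejection(observed_heads=67, n=100, epsilon=0.02, n_accept=500, rng=rng)
posterior_mean = sum(sample) / len(sample)
```

Setting `epsilon=0` recovers the exact-match sampler of steps (1) to (3), at the cost of a much lower acceptance rate.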

An important problem is how to move from sampling from $\pi(\theta \mid d(x_0, x^*) \le \epsilon_t)$ to sampling from $\pi(\theta \mid d(x_0, x^*) \le \epsilon_{t+1})$. In ABC SMC this is addressed via importance sampling. Samples are, in fact, not sampled from $\pi(\theta \mid d(x_0, x^*) \le \epsilon_t)$ but from a different sequence of distributions $\eta_t(\theta)$. Each such sample $\theta_t$ is then weighted as follows:

$$w_t(\theta_t) = \frac{\pi(\theta_t \mid d(x_0, x^*) \le \epsilon_t)}{\eta_t(\theta_t)}.$$

$\eta_1$, the first distribution sampled from, is chosen to be the prior $\pi$. Subsequent distributions $\eta_t$ are generated via user-defined perturbation kernels $K_t(\theta_{t-1}, \theta_t)$ which perform moves around the parameter space. These are the basic ideas of the ABC SMC algorithm; full details are supplied by Toni et al [1] (particularly Appendix A). As a convenience the description of the ABC SMC algorithm supplied in that paper is reproduced (almost verbatim) in Fig 1.

S1   Initialise $\epsilon_1, \dots, \epsilon_T$. Set the population indicator $t = 0$.
S2.0 Set the particle indicator $i = 1$.
S2.1 If $t = 0$, sample $\theta^{**}$ independently from $\pi(\theta)$.
     If $t > 0$, sample $\theta^*$ from the previous population $\{\theta_{t-1}^{(i)}\}$ with weights $w_{t-1}$ and perturb the particle to obtain $\theta^{**} \sim K_t(\theta \mid \theta^*)$, where $K_t$ is a perturbation kernel.
     If $\pi(\theta^{**}) = 0$, return to S2.1.
     Simulate a candidate dataset $x^*_{(b)} \sim f(x \mid \theta^{**})$ $B_t$ times ($b = 1, \dots, B_t$) and calculate $b_t(\theta^{**}) = \sum_{b=1}^{B_t} \mathbf{1}(d(x_0, x^*_{(b)}) \le \epsilon_t)$.
     If $b_t(\theta^{**}) = 0$, return to S2.1.
S2.2 Set $\theta_t^{(i)} = \theta^{**}$ and calculate the weight for particle $\theta_t^{(i)}$:
     $$w_t^{(i)} = \begin{cases} b_t(\theta_t^{(i)}) & \text{if } t = 0, \\ \dfrac{\pi(\theta_t^{(i)})\, b_t(\theta_t^{(i)})}{\sum_{j=1}^{N} w_{t-1}^{(j)} K_t(\theta_{t-1}^{(j)}, \theta_t^{(i)})} & \text{if } t > 0. \end{cases}$$
     If $i < N$ set $i = i + 1$, go to S2.1.
S3   Normalize the weights. If $t < T$, set $t = t + 1$, go to S2.0.

Fig. 1. ABC SMC algorithm reproduced from Toni et al [1].

Note, from Fig 1, that rather than generate a single dataset from $f(x \mid \theta^{**})$, $B_t$ datasets are sampled where $B_t$ is set by the user. The quantity $b_t(\theta^{**})$ is the count of synthetic datasets which are within distance $\epsilon_t$ of the observed data $x_0$.
The intuitive idea behind ABC SMC is that a particle $\theta$ that generates many synthetic datasets close to $x_0$ will get high weight and is thus more likely to be sampled for use at the next iteration.
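To make the loop of Fig 1 concrete, here is a minimal Python sketch for a one-parameter coin model. It is an illustration under assumptions, not the PRISM implementation: the prior is Uniform(0, 1), so its density is 1 inside the unit interval; the perturbation kernel is assumed to be a Beta distribution with parameters proportional to the parent particle; and the distance is the gap in the proportion of heads. Function names and the experimental settings are hypothetical.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, used here as the kernel density K_t."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def count_close(theta, observed_heads, n, eps, n_datasets, rng):
    """b_t(theta**): how many of the B_t synthetic datasets fall within eps."""
    hits = 0
    for _ in range(n_datasets):
        heads = sum(1 for _ in range(n) if rng.random() < theta)
        if abs(heads - observed_heads) / n <= eps:
            hits += 1
    return hits


def abc_smc(observed_heads, n, epsilons, n_particles, n_datasets, alpha, rng):
    particles, weights = [], []
    for t, eps in enumerate(epsilons):
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            if t == 0:
                theta = rng.random()  # S2.1: sample from the prior
            else:
                parent = rng.choices(particles, weights=weights)[0]
                theta = rng.betavariate(alpha * parent, alpha * (1 - parent))
            if not 0.0 < theta < 1.0:
                continue  # treat boundary values as having zero prior density
            b = count_close(theta, observed_heads, n, eps, n_datasets, rng)
            if b == 0:
                continue  # S2.1: no synthetic dataset was close enough
            if t == 0:
                w = float(b)  # S2.2, first population
            else:
                denom = sum(
                    w_j * beta_pdf(theta, alpha * p_j, alpha * (1 - p_j))
                    for p_j, w_j in zip(particles, weights)
                )
                w = b / denom  # S2.2 with prior density equal to 1
            new_particles.append(theta)
            new_weights.append(w)
        total = sum(new_weights)
        particles = new_particles
        weights = [w / total for w in new_weights]  # S3: normalise
    return particles, weights


rng = random.Random(1)
particles, weights = abc_smc(67, 100, [0.1, 0.05, 0.02], 100, 10, 20.0, rng)
weighted_mean = sum(p * w for p, w in zip(particles, weights))
```

The denominator in the weight is the mixture density from which the perturbed particle was effectively drawn, which is why every particle of the previous population contributes to it.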

3 PRISM

PRISM (PRogramming In Statistical Modelling) [3] is a well-known SRL formalism which defines probability distributions over possible worlds (Herbrand models). The probabilistic element of a PRISM program is supplied using switches. A switch is syntactically defined using declarations such as those given in Fig 2 for switches init, tr(s0), tr(s1), out(s0) and out(s1). The declaration in Fig 2 for init, for example, defines an infinite collection of independent and identically distributed binary random variables $\mathit{init}_1, \mathit{init}_2, \dots$ with values s0 and s1 and with distribution $P(\mathit{init}_i = \mathrm{s0}) = 0.9$, $P(\mathit{init}_i = \mathrm{s1}) = 0.1$ for all $i$.

The rest of a PRISM program is essentially a Prolog program. Switches provide the probabilistic element via the built-in predicate msw/2. Each time a goal such as :- msw(init,S) is called, the variable S is instantiated to s0 with probability 0.9 and s1 with probability 0.1. If this were the $i$th call to this goal then this amounts to sampling from the variable $\mathit{init}_i$.

Although later versions of PRISM do not require this, it is convenient to specify a target predicate where queries to the target predicate will lead to calls to msw/2 goals (usually via intermediate predicates). The target predicate is thus a probabilistic predicate: the PRISM program defines a distribution over instantiations of the variables in target predicate goals. For example, in the PRISM program in Fig 2, which implements a hidden Markov model, hmm/1 would be the target predicate. A query such as :- hmm(X). will lead to instantiations such as X = [a,a,b,a,a].

    values(init,[s0,s1]).     % state initialization
    values(out(_),[a,b]).     % symbol emission
    values(tr(_),[s0,s1]).    % state transition

    hmm(L):-                  % To observe a string L:
        msw(init,S),          % Choose an initial state randomly
        hmm(1,5,S,L).         % Start stochastic transition (loop)

    hmm(T,N,_,[]):- T>N,!.    % Stop the loop
    hmm(T,N,S,[Ob|Y]) :-      % Loop: current state is S, current time is T
        msw(out(S),Ob),       % Output Ob at the state S
        msw(tr(S),Next),      % Transit from S to Next.
        T1 is T+1,            % Count up time
        hmm(T1,N,Next,Y).     % Go next (recursion)

    :- set_sw(init, [0.9,0.1]),
       set_sw(tr(s0), [0.2,0.8]),
       set_sw(tr(s1), [0.8,0.2]),
       set_sw(out(s0),[0.5,0.5]),
       set_sw(out(s1),[0.6,0.4]).

Fig. 2. PRISM encoding of a simple 2-state hidden Markov model (this example is distributed with the PRISM system).
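For readers who do not use PRISM, the sampling behaviour of the program in Fig 2 can be imitated in plain Python. The sketch below is an illustration only: the dictionary layout and the `msw` and `hmm` helper functions are hypothetical stand-ins for PRISM's switch declarations and built-in predicate, not part of any PRISM API.

```python
import random

# Outcome spaces, mirroring the values/2 declarations of Fig 2.
VALUES = {"init": ["s0", "s1"], "out": ["a", "b"], "tr": ["s0", "s1"]}

# Switch probabilities, mirroring the set_sw/2 calls of Fig 2.
PROBS = {
    "init": [0.9, 0.1],
    ("tr", "s0"): [0.2, 0.8],
    ("tr", "s1"): [0.8, 0.2],
    ("out", "s0"): [0.5, 0.5],
    ("out", "s1"): [0.6, 0.4],
}


def msw(switch, rng):
    """Sample one outcome of a switch; each call is an independent draw."""
    name = switch if isinstance(switch, str) else switch[0]
    return rng.choices(VALUES[name], weights=PROBS[switch])[0]


def hmm(rng, length=5):
    """Sample one output string, as the query :- hmm(X). would."""
    state = msw("init", rng)                   # choose an initial state
    outputs = []
    for _ in range(length):                    # time steps 1..5 in Fig 2
        outputs.append(msw(("out", state), rng))  # emit a symbol in the current state
        state = msw(("tr", state), rng)           # move to the next state
    return outputs


rng = random.Random(0)
synthetic_data = [tuple(hmm(rng)) for _ in range(100)]
```

A list such as `synthetic_data` plays the role of a synthetic dataset: a collection of ground instances of the target predicate sampled under one candidate setting of the switch probabilities.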

The switch probabilities are the parameters of a PRISM program and the data used for parameter estimation in PRISM will be a collection of ground instances of the target predicates which are imagined to have been sampled from the unknown true PRISM program with the true parameters. PRISM contains a built-in EM algorithm for maximum likelihood and maximum a posteriori (MAP) parameter estimation [4]. In both cases a point estimate for each parameter is provided. In contrast, here an approximate sample from the posterior distribution over parameters is provided by a population of particles.

4 ABC for PRISM

To apply the ABC SMC algorithm it is necessary to choose: (1) a prior distribution for the parameters, (2) a distance function, (3) a perturbation kernel and (4) the specific experimental parameters such as the sequence of $\epsilon_t$, etc. The first three of these are dealt with in the following three sections (4.1 to 4.3). The choice of experimental parameters is addressed in Section 5.

4.1 Choice of prior distribution

The obvious prior distribution is chosen. Each switch has a user-defined Dirichlet prior distribution and the full joint prior distribution is just a product of these. This is the same as the prior used for MAP estimation in PRISM [4, 4.7.2]. To sample from this prior it is enough to sample from each Dirichlet independently. Sampling from each Dirichlet is achieved by exploiting the relationship between Dirichlet and Gamma distributions. To produce a sample $(p_1, \dots, p_k)$ from a Dirichlet with parameters $(\alpha_1, \dots, \alpha_k)$, values $z_i$ are sampled from $\mathrm{Gamma}(\alpha_i, 1)$ and then $p_i$ is set to $z_i / (z_1 + \dots + z_k)$. The $z_i$ are sampled using the algorithm of Cheng and Feast [5] for $\alpha_i > 1$ and the algorithm of Ahrens and Dieter [6] for $\alpha_i \le 1$. Both these algorithms are given in [7].
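The Gamma-based construction of a Dirichlet sample can be sketched as follows. This is a minimal illustration that relies on the Gamma sampler in Python's standard library rather than on the specific algorithms cited above; the function names and the `hyperparameters` dictionary are assumptions made for the example.

```python
import random


def sample_dirichlet(alphas, rng):
    """Sample (p_1, ..., p_k) from Dirichlet(alpha_1, ..., alpha_k)."""
    z = [rng.gammavariate(alpha, 1.0) for alpha in alphas]  # z_i ~ Gamma(alpha_i, 1)
    total = sum(z)
    return [z_i / total for z_i in z]                       # p_i = z_i / (z_1 + ... + z_k)


def sample_prior(hyperparameters, rng):
    """Sample every switch independently from its own Dirichlet prior."""
    return {switch: sample_dirichlet(alphas, rng)
            for switch, alphas in hyperparameters.items()}


rng = random.Random(0)
# Hypothetical hyperparameters: a Dir(1, 1) prior on each of the five HMM switches.
hyperparameters = {name: [1.0, 1.0]
                   for name in ["init", "tr(s0)", "tr(s1)", "out(s0)", "out(s1)"]}
theta = sample_prior(hyperparameters, rng)
```

Each entry of `theta` is a probability vector for one switch, so a single call to `sample_prior` yields one particle drawn from the joint prior.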
For Dirichlet distributions containing small values of $\alpha_i$, numerical problems sometimes produced samples where $p_i = 0$ for some $i$, which is wrong since the Dirichlet has density 0 for any probability distribution containing a zero value. This problem was solved by the simple expedient of not choosing small values for the $\alpha_i$!

4.2 Choice of distance function

The basic ABC approach leads to a sample drawn from $\pi(\theta \mid d(x_0, x^*) \le \epsilon)$ rather than $\pi(\theta \mid x_0)$. For this to be a good approximation it is enough that $f(x^* \mid \theta) \approx f(x_0 \mid \theta)$ for all $x^*$ where $d(x_0, x^*) \le \epsilon$. With this in mind $d$ is defined as follows. Let $P(x_0)$ be the empirical distribution defined by the real data $x_0$. $P(x_0)$ assigns a probability to every possible ground instance of the target predicate. This probability is just the frequency of the ground instance in the data divided by the total number of datapoints. $P(x^*)$ is the corresponding empirical distribution for fake data $x^*$. Both $P(x_0)$ and $P(x^*)$ can be viewed as (possibly countably infinite) vectors of real numbers. The distance between $x_0$ and $x^*$ is then defined to be the squared Euclidean distance between $P(x_0)$ and $P(x^*)$. Formally:

$$d(x_0, x^*) = \sum_{i \in I} \bigl(P(x_0)(i) - P(x^*)(i)\bigr)^2 \qquad (1)$$

where $I$ is just some (possibly countably infinite) index set for the set of all ground instances of the target predicate. In practice most terms in the sum on the RHS of (1) will be zero, since typically most ground instances appear neither in the real data nor in fake datasets.

4.3 Choice of perturbation kernel

Recall that each particle $\theta$ defines a multinomial distribution of the appropriate dimension for each switch of the PRISM program. The perturbation kernel $K_t(\theta \mid \theta^*)$ has two stages. Firstly, Dirichlet distributions are derived from $\theta^*$ by multiplying each probability in $\theta^*$ by a global value $\alpha_t$ where $\alpha_t > 0$. Secondly, a new particle is sampled from this product of Dirichlets using exactly the same procedure as was used for sampling from the original prior distribution. Large values of $\alpha_t$ will make small moves in the parameter space likely (since the Dirichlet distributions will be concentrated around $\theta^*$) and small values of $\alpha_t$ will encourage larger moves. An attractive option is to start with small values of $\alpha_t$ to encourage exploration of parameter space and to progressively increase $\alpha_t$ in the hope of convergence to a stable set of particles giving a good approximation to the posterior.

5 Experimental results

The ABC SMC algorithm has been implemented as a PRISM program which is supplied in the supplementary materials. PRISM 2.0 beta 4, kindly supplied by the PRISM developers, was used.

As an initial test, ABC was applied to the simplest possible parameter estimation problem. A PRISM program representing a biased coin (P(heads) = 0.7, P(tails) = 0.3) was written and data of 100 simulated tosses were produced. This resulted in 67 heads and 33 tails.
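Using the coin data just described as input, the distance of Equation (1) and the two-stage perturbation kernel of Section 4.3 can be sketched in Python. This is a hedged illustration rather than the PRISM implementation: ground atoms are represented as hashable Python values, particles as dictionaries from switch names to probability lists, and all function names are hypothetical.

```python
import random
from collections import Counter


def empirical_distribution(data):
    """Map each distinct ground atom to its relative frequency in the data."""
    counts = Counter(data)
    return {atom: count / len(data) for atom, count in counts.items()}


def distance(x0, x_star):
    """Squared Euclidean distance between the two empirical distributions."""
    p0 = empirical_distribution(x0)
    p1 = empirical_distribution(x_star)
    # Atoms absent from both datasets contribute zero, so only the union is summed.
    return sum((p0.get(atom, 0.0) - p1.get(atom, 0.0)) ** 2
               for atom in set(p0) | set(p1))


def perturb(theta_star, alpha_t, rng):
    """Sample a new particle from the product of Dirichlet(alpha_t * theta*)."""
    new_particle = {}
    for switch, probs in theta_star.items():
        z = [rng.gammavariate(alpha_t * p, 1.0) for p in probs]  # needs every p > 0
        total = sum(z)
        new_particle[switch] = [z_i / total for z_i in z]
    return new_particle


x0 = ["heads"] * 67 + ["tails"] * 33       # the observed tosses
x_star = ["heads"] * 70 + ["tails"] * 30   # one hypothetical synthetic dataset
d = distance(x0, x_star)                   # 0.03**2 + 0.03**2, up to rounding

rng = random.Random(1)
moved = perturb({"coin": [0.7, 0.3]}, alpha_t=10.0, rng=rng)
```

Because the distance compares relative frequencies only, it treats every pair of distinct ground atoms as equally different, whatever their internal structure.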
ABC was run several times with the following (more or less arbitrarily chosen) parameters: prior distribution $\pi(\theta) = \mathrm{Dir}(1, 1)$, sequence of thresholds $\epsilon = (0)$, number of synthetic datasets $B_t = 50$, perturbation kernel parameter $\alpha_t = 2$, number of particles T = 50 and population size N = 50. As expected, the final population of (weighted) particles was always concentrated around the maximum likelihood estimate P(heads) = 0.67, P(tails) = 0.33. Here are the 4 most heavily weighted particles with their weights from one particular run: (0.646, ) (w = 0.051), (0.647, 0.353) (w = 0.044), (0.667, 0.332) (w = 0.044), (0.62, 0.38) (w = 0.037). Estimates of the posterior mean were similar for different ABC runs. Here are such estimates from 5 runs: (0.663, 0.337), (0.655, 0.345), (0.664, 0.336), (0.667, 0.333), (0.677, 0.323). Note that, in this trivial problem, successful parameter estimation was possible by going directly for a zero distance threshold $\epsilon = (0)$.

For a more substantial test, 100 ground hmm/1 atoms were sampled from the PRISM-encoded HMM shown in Fig 2. Dir(1, 1) priors were used for all 5 switches. The experimental parameters were $\alpha_t = 10$, $B_t = 100$, $N = 200$ and $\epsilon = (0.1, 0.05)$. Ideally, different runs of ABC should generate similar approximations to posterior quantities. To look into this, marginal posterior distributions for the probability of the first value of each of the 5 switches were estimated using 3 different ABC runs. The results are shown in Fig 3. These plots were produced using the density function in R with the final weighted population of particles as input. There is evident variation between the results of the 3 runs, but similarities also. All 3 densities for init contain two local modes, all 3 for out(s1) put most mass in the middle, and all 3 for tr(s0) have a fairly even spread apart from extreme values.

[Figure: five density plots, one panel each for init, out(s0), out(s1), tr(s0) and tr(s1); N = 200 in every panel.]

Fig. 3. Posterior distributions for HMM switch probabilities init, out(s0), out(s1), tr(s0), tr(s1) as estimated by three different runs of ABC.

6 Conclusions and future work

This paper has described ABC SMC for PRISM programs and has shown some initial results for a working implementation. Evidently, considerably more experimentation and theoretical analysis is required to provide reliable approximations to posterior quantities using ABC SMC. In the experiments reported above the perturbation kernel $K_t$ did not vary with $t$. It is likely that better results are possible by reducing the probability of big perturbations as $t$ increases. In addition, the choice for the sequence of $\epsilon_t$ thresholds was fairly arbitrary. Finally, it may be that better results are achievable by throwing more computational resources at the problem: most obviously by increasing the number of particles, but also by lengthening the sequence of $\epsilon_t$ thresholds to effect a smoother convergence to the posterior.

Another avenue for improvement is the choice of distance function. The function introduced in Section 4.2 is a generic function that is applicable to any PRISM program. It seems likely that domain knowledge could be used to choose domain-specific distance functions which reflect the real difference between different ground atoms. The function used here treats all distinct pairs of ground atoms as equally different, which will not be appropriate in many cases.

References

1. Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M.P.: Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6(31) (2009)
2. Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S.: Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100 (December 2003)
3. Sato, T., Kameya, Y.: Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research 15 (2001)
4. Sato, T., Zhou, N.F., Kameya, Y., Izumi, Y.: PRISM User's Manual (Version ). (2009)
5. Cheng, B., Feast, G.: Some simple gamma variate generators. Applied Statistics 28 (1979)
6. Ahrens, J., Dieter, U.: Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing 12 (1974)
7. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Second edn. Springer, New York (2004)

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information
