Approximate Bayesian computation for the parameters of PRISM programs


James Cussens
Department of Computer Science & York Centre for Complex Systems Analysis
University of York, Heslington, York, YO10 5DD, UK

Abstract. Probabilistic logic programming formalisms permit the definition of potentially very complex probability distributions. This complexity can often make learning hard, even when structure is fixed and learning reduces to parameter estimation. In this paper an approximate Bayesian computation (ABC) method is presented which computes approximations to the posterior distribution over PRISM parameters. The key to ABC approaches is that the likelihood function need not be computed; instead a distance between the observed data and synthetic data generated by candidate parameter values is used to drive the learning. This makes ABC highly appropriate for PRISM programs, which can have an intractable likelihood function but from which synthetic data can be readily generated. The algorithm is experimentally shown to work well on an easy problem, but further work is required to produce acceptable results on harder ones.

1 Introduction

In the Bayesian approach to parameter estimation a prior distribution for the parameters is combined with observed data to produce a posterior distribution. A key feature of the Bayesian approach is that the posterior provides a full picture of the information contained in the prior and the data: with an uninformative prior and little data we are not in a position to make confident estimates of the parameters, and this will be reflected in a flat posterior. In contrast, when much data is available the posterior will concentrate probability mass in small regions of the parameter space, reflecting greater confidence in parameter estimates.

Despite its attractive features the Bayesian approach is problematic because in many cases computing, or even representing, the posterior distribution is very difficult. One area in which this is often the case is statistical relational learning (SRL). SRL formalisms combine probabilistic models with rich representation languages (often logical), which allows highly complex probabilistic models to be described. In particular, the likelihood function (the probability of the observed data as a function of the model's parameters) is often intractable. This makes Bayesian and non-Bayesian parameter estimation difficult since the likelihood function plays a key role in both.

In this paper an approximate Bayesian computation (ABC) method is presented which approximates the posterior distribution over the parameters of a PRISM program with a given structure. The key feature of ABC approaches is that the likelihood function is never calculated. Instead, synthetic datasets are generated and compared with the actually observed data. If a candidate parameter set generates synthetic datasets which are mostly close to the real data then it will tend to end up with high posterior probability.

The rest of the paper is set out as follows. In Section 2 an account of approximate Bayesian computation is given. In Section 3 the essentials of PRISM programs are explained. Section 4 is the core of the paper, where it is shown how to apply ABC to PRISM. Section 5 reports on initial experimental results and the paper concludes with Section 6, which includes pointers to future work.

2 Approximate Bayesian computation

The ABC method applied in this paper is the ABC sequential Monte Carlo (ABC SMC) algorithm devised by Toni et al [1], and so the basic ideas of ABC will be explained using the notation of that paper. ABC approaches are motivated by situations where the likelihood function is intractable but it is straightforward to sample synthetic data using any given candidate parameter set θ.

The simplest ABC algorithm is a rejection sampling approach described by Marjoram et al [2]. Since this is a Bayesian approach there must be a user-defined prior distribution π(θ) over the model parameters. Let x_0 be the observed data. If it is possible to readily sample from π(θ) then it is possible to sample from the posterior distribution π(θ | x_0) as follows: (1) sample θ* from π, (2) sample synthetic data x* from the model with its parameters set to θ* (i.e. sample from f(x | θ*) where f is the likelihood function), (3) if x_0 = x* accept θ*. The problem, of course, with this algorithm is that in most real situations the probability of sampling synthetic data which is exactly equal to the observed data will be tiny. A somewhat more realistic option is to define a distance function d(x_0, x*) which measures how close synthetic data x* is to the real data x_0. θ* is now accepted at stage (3) above when d(x_0, x*) ≤ ε for some user-defined ε. With this adaptation the rejection sampling approach will produce samples from π(θ | d(x_0, x*) ≤ ε). As long as ε is reasonably small, this will be a good approximation to π(θ | x_0).

Choosing a value for ε is crucial: too big and the approximation to the posterior will be poor, too small and very few synthetic datasets will be accepted. A way out of this conundrum is to choose not one value for ε, but a sequence of decreasing values ε_1, ..., ε_T (ε_1 > ... > ε_T). This is the key idea behind the ABC sequential Monte Carlo (ABC SMC) algorithm:

   In ABC SMC, a number of sampled parameter values (called particles) {θ^(1), ..., θ^(N)}, sampled from the prior distribution π(θ), is propagated through a sequence of intermediate distributions π(θ | d(x_0, x*) ≤ ε_t), t = 1, ..., T−1, until it represents a sample from the target distribution π(θ | d(x_0, x*) ≤ ε_T). [1]
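To make the tolerance-based rejection scheme concrete, the following is a minimal Python sketch (not the author's PRISM implementation); sample_prior, simulate and distance are user-supplied placeholders rather than PRISM built-ins, and the coin example at the bottom is purely illustrative.

import random

def abc_rejection(sample_prior, simulate, distance, x0, eps, n_accept):
    """Tolerance-based ABC rejection sampling (after Marjoram et al [2]).

    sample_prior() -> theta, simulate(theta) -> synthetic dataset,
    distance(x0, x_star) -> non-negative real.  All three are
    problem-specific placeholders supplied by the user.
    """
    accepted = []
    while len(accepted) < n_accept:
        theta = sample_prior()              # (1) draw a candidate from the prior
        x_star = simulate(theta)            # (2) draw synthetic data given theta
        if distance(x0, x_star) <= eps:     # (3) keep theta if the data are close enough
            accepted.append(theta)
    return accepted                         # approximate draws from pi(theta | d <= eps)

# Toy usage: coin with unknown P(heads); x0 is the observed number of heads in 100 tosses.
if __name__ == "__main__":
    x0 = 67
    sample_prior = lambda: random.random()                          # uniform prior on [0,1]
    simulate = lambda p: sum(random.random() < p for _ in range(100))
    distance = lambda a, b: abs(a - b) / 100.0
    draws = abc_rejection(sample_prior, simulate, distance, x0, eps=0.02, n_accept=200)
    print(sum(draws) / len(draws))          # rough posterior mean for P(heads)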

An important problem is how to move from sampling from π(θ | d(x_0, x*) ≤ ε_t) to sampling from π(θ | d(x_0, x*) ≤ ε_{t+1}). In ABC SMC this is addressed via importance sampling. Samples are, in fact, not drawn from π(θ | d(x_0, x*) ≤ ε_t) but from a different sequence of distributions η_t(θ). Each such sample θ_t is then weighted as follows:

   w_t(θ_t) = π(θ_t | d(x_0, x*) ≤ ε_t) / η_t(θ_t).

η_1, the first distribution sampled from, is chosen to be the prior π. Subsequent distributions η_t are generated via user-defined perturbation kernels K_t(θ_{t−1}, θ_t) which perform moves around the parameter space. These are the basic ideas of the ABC SMC algorithm; full details are supplied by Toni et al [1] (particularly Appendix A). As a convenience the description of the ABC SMC algorithm supplied in that paper is reproduced (almost verbatim) in Fig 1.

S1    Initialise ε_1, ..., ε_T. Set the population indicator t = 0.
S2.0  Set the particle indicator i = 1.
S2.1  If t = 0, sample θ** independently from π(θ).
      If t > 0, sample θ* from the previous population {θ_{t−1}^{(i)}} with weights w_{t−1} and
      perturb the particle to obtain θ** ~ K_t(θ | θ*), where K_t is a perturbation kernel.
      If π(θ**) = 0, return to S2.1.
      Simulate a candidate dataset x*(b) ~ f(x | θ**) B_t times (b = 1, ..., B_t) and calculate
      b_t(θ**) = Σ_{b=1}^{B_t} 1(d(x_0, x*(b)) ≤ ε_t).
      If b_t(θ**) = 0, return to S2.1.
S2.2  Set θ_t^{(i)} = θ** and calculate the weight for particle θ_t^{(i)}:
         w_t^{(i)} = b_t(θ_t^{(i)}),                                                            if t = 0
         w_t^{(i)} = π(θ_t^{(i)}) b_t(θ_t^{(i)}) / Σ_{j=1}^{N} w_{t−1}^{(j)} K_t(θ_{t−1}^{(j)}, θ_t^{(i)}),   if t > 0
      If i < N, set i = i + 1 and go to S2.1.
S3    Normalize the weights. If t < T, set t = t + 1 and go to S2.0.

Fig. 1. ABC SMC algorithm, reproduced from Toni et al [1].

Note, from Fig 1, that rather than generating a single dataset from f(x | θ**), B_t datasets are sampled, where B_t is set by the user. The quantity b_t(θ**) is the count of synthetic datasets which are within ε_t. The intuitive idea behind ABC SMC is that a particle θ that generates many synthetic datasets close to x_0 will get a high weight and is thus more likely to be sampled for use at the next iteration.
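The control flow of Fig 1 can also be written compactly as follows. This is a Python sketch under the assumption that the prior (sampler and density), perturbation kernel (sampler and density), simulator and distance function are supplied as ordinary callables; the names prior_sample, prior_pdf, perturb, kernel_pdf, simulate and distance are placeholders, not anything provided by PRISM or by the implementation described later in Section 5.

import random

def abc_smc(prior_sample, prior_pdf, perturb, kernel_pdf,
            simulate, distance, x0, epsilons, N, B):
    """Compact sketch of the ABC SMC loop in Fig. 1 (after Toni et al [1]).

    Returns the final weighted particle population as (particles, weights).
    """
    particles, weights = [], []
    for t, eps in enumerate(epsilons):                      # population indicator t
        new_particles, new_weights = [], []
        while len(new_particles) < N:                       # particle indicator i
            if t == 0:
                theta = prior_sample()                      # S2.1, t = 0
            else:
                base = random.choices(particles, weights=weights, k=1)[0]
                theta = perturb(base)                       # S2.1, t > 0: resample then perturb
                if prior_pdf(theta) == 0:
                    continue
            # b_t(theta): how many of the B synthetic datasets fall within eps
            b = sum(distance(x0, simulate(theta)) <= eps for _ in range(B))
            if b == 0:
                continue
            if t == 0:                                      # S2.2: importance weight
                w = float(b)
            else:
                denom = sum(wj * kernel_pdf(pj, theta)
                            for pj, wj in zip(particles, weights))
                w = prior_pdf(theta) * b / denom
            new_particles.append(theta)
            new_weights.append(w)
        total = sum(new_weights)                            # S3: normalise the weights
        particles = new_particles
        weights = [w / total for w in new_weights]
    return particles, weights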

3 PRISM

PRISM (PRogramming In Statistical Modelling) [3] is a well-known SRL formalism which defines probability distributions over possible worlds (Herbrand models). The probabilistic element of a PRISM program is supplied using switches. A switch is syntactically defined using declarations such as those given in Fig 2 for the switches init, tr(s0), tr(s1), out(s0) and out(s1). The declaration in Fig 2 for init, for example, defines an infinite collection of independent and identically distributed binary random variables init_1, init_2, ... with values s0 and s1 and with distribution P(init_i = s0) = 0.9, P(init_i = s1) = 0.1 for all i.

The rest of a PRISM program is essentially a Prolog program. Switches provide the probabilistic element via the built-in predicate msw/2. Each time a goal such as :- msw(init,S) is called, the variable S is instantiated to s0 with probability 0.9 and to s1 with probability 0.1. If this were the ith call to this goal then this amounts to sampling from the variable init_i. Although later versions of PRISM do not require this, it is convenient to specify a target predicate where queries to the target predicate will lead to calls to msw/2 goals (usually via intermediate predicates). The target predicate is thus a probabilistic predicate: the PRISM program defines a distribution over instantiations of the variables in target predicate goals. For example, in the PRISM program in Fig 2, which implements a hidden Markov model, hmm/1 is the target predicate. A query such as :- hmm(X). will lead to instantiations such as X = [a,a,b,a,a].

values(init,[s0,s1]).        % state initialization
values(out(_),[a,b]).        % symbol emission
values(tr(_),[s0,s1]).       % state transition

hmm(L):-                     % To observe a string L:
   msw(init,S),              % Choose an initial state randomly
   hmm(1,5,S,L).             % Start stochastic transition (loop)

hmm(T,N,_,[]):- T>N,!.       % Stop the loop
hmm(T,N,S,[Ob|Y]) :-         % Loop: current state is S, current time is T
   msw(out(S),Ob),           % Output Ob at the state S
   msw(tr(S),Next),          % Transit from S to Next.
   T1 is T+1,                % Count up time
   hmm(T1,N,Next,Y).         % Go next (recursion)

:- set_sw(init,   [0.9,0.1]),
   set_sw(tr(s0), [0.2,0.8]),
   set_sw(tr(s1), [0.8,0.2]),
   set_sw(out(s0),[0.5,0.5]),
   set_sw(out(s1),[0.6,0.4]).

Fig. 2. PRISM encoding of a simple 2-state hidden Markov model (this example is distributed with the PRISM system).
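For readers unfamiliar with PRISM's sampling semantics, the following Python sketch mimics drawing ground hmm/1 atoms from the program in Fig 2: each call samples one length-5 string, which is exactly the kind of synthetic data an ABC run needs. It is an illustrative re-implementation of the generative process, not the PRISM sampler itself; the switch probabilities are copied from the set_sw/2 directives above.

import random

# Switch distributions copied from the set_sw/2 directives in Fig. 2.
INIT = {"s0": 0.9, "s1": 0.1}
TR = {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}}
OUT = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.6, "b": 0.4}}

def msw(dist):
    """Analogue of a single msw/2 call: sample one value of a switch."""
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def hmm(n=5):
    """Sample one length-n string, mirroring the hmm/1 clauses in Fig. 2."""
    state, string = msw(INIT), []
    for _ in range(n):
        string.append(msw(OUT[state]))   # emit a symbol at the current state
        state = msw(TR[state])           # transit to the next state
    return string

# A synthetic dataset of 100 ground hmm/1 atoms, e.g. ['a','a','b','a','a'].
synthetic = [hmm() for _ in range(100)]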

The switch probabilities are the parameters of a PRISM program, and the data used for parameter estimation in PRISM will be a collection of ground instances of the target predicate which are imagined to have been sampled from the unknown true PRISM program with the true parameters. PRISM contains a built-in EM algorithm for maximum likelihood and maximum a posteriori (MAP) parameter estimation [4]. In both cases a point estimate for each parameter is provided. In contrast, here an approximate sample from the posterior distribution over parameters is provided by a population of particles.

4 ABC for PRISM

To apply the ABC SMC algorithm it is necessary to choose: (1) a prior distribution for the parameters, (2) a distance function, (3) a perturbation kernel and (4) the specific experimental parameters such as the sequence of ε_t, etc. The first three of these are dealt with in the following three sections (4.1–4.3). The choice of experimental parameters is addressed in Section 5.

4.1 Choice of prior distribution

The obvious prior distribution is chosen. Each switch has a user-defined Dirichlet prior distribution and the full joint prior distribution is just a product of these. This is the same as the prior used for MAP estimation in PRISM [4, §4.7.2]. To sample from this prior it is enough to sample from each Dirichlet independently. Sampling from each Dirichlet is achieved by exploiting the relationship between Dirichlet and Gamma distributions. To produce a sample (p_1, ..., p_k) from a Dirichlet with parameters (α_1, ..., α_k), values z_i are sampled from Gamma(α_i, 1) and then p_i is set to z_i / (z_1 + ... + z_k). The z_i are sampled using the algorithm of Cheng and Feast [5] for α_i > 1 and the algorithm of Ahrens and Dieter [6] for α_i ≤ 1. Both these algorithms are given in [7]. For Dirichlet distributions containing small values of α_i, numerical problems sometimes produced samples where p_i = 0 for some i, which is wrong since the Dirichlet has density 0 for any probability distribution containing a zero value. This problem was solved by the simple expedient of not choosing small values for the α_i!
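A minimal Python sketch of this prior sampler follows. Here random.gammavariate stands in for the Cheng-Feast [5] and Ahrens-Dieter [6] generators cited above (the normalised result is a Dirichlet draw either way), and representing a particle as a dictionary mapping switch names to probability vectors is an assumption made purely for illustration.

import random

def sample_dirichlet(alphas):
    """Draw one sample from Dirichlet(alphas) via independent Gamma variates."""
    z = [random.gammavariate(a, 1.0) for a in alphas]   # z_i ~ Gamma(alpha_i, 1)
    total = sum(z)
    return [zi / total for zi in z]                     # p_i = z_i / (z_1 + ... + z_k)

def sample_prior(switch_alphas):
    """Sample a full PRISM parameter set: one independent Dirichlet draw per switch.

    switch_alphas maps each switch name to its Dirichlet parameters, e.g. the
    HMM of Fig. 2 with uniform Dir(1,1) priors on every switch.
    """
    return {sw: sample_dirichlet(a) for sw, a in switch_alphas.items()}

theta = sample_prior({"init": [1, 1], "tr(s0)": [1, 1], "tr(s1)": [1, 1],
                      "out(s0)": [1, 1], "out(s1)": [1, 1]})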

4.2 Choice of distance function

The basic ABC approach leads to a sample drawn from π(θ | d(x_0, x*) ≤ ε) rather than π(θ | x_0). For this to be a good approximation it is enough that f(x* | θ) ≈ f(x_0 | θ) for all x* with d(x_0, x*) ≤ ε. With this in mind, d is defined as follows. Let P(x_0) be the empirical distribution defined by the real data x_0. P(x_0) assigns a probability to every possible ground instance of the target predicate. This probability is just the frequency of the ground instance in the data divided by the total number of datapoints. P(x*) is the corresponding empirical distribution for fake data x*. Both P(x_0) and P(x*) can be viewed as (possibly countably infinite) vectors of real numbers. The distance between x_0 and x* is then defined to be the squared Euclidean distance between P(x_0) and P(x*). Formally:

   d(x_0, x*) = Σ_{i ∈ I} (P(x_0)(i) − P(x*)(i))^2     (1)

where I is just some (possibly countably infinite) index set for the set of all ground instances of the target predicate. In practice most terms in the sum on the RHS of (1) will be zero, since typically most ground instances appear neither in the real data nor in fake datasets.

4.3 Choice of perturbation kernel

Recall that each particle θ defines a multinomial distribution of the appropriate dimension for each switch of the PRISM program. The perturbation kernel K_t(θ, θ*) has two stages. Firstly, Dirichlet distributions are derived from θ by multiplying each probability in θ by a global value α_t, where α_t > 0. Secondly, a new particle is sampled from this product of Dirichlets using exactly the same procedure as was used for sampling from the original prior distribution. Large values of α_t will make small moves in the parameter space likely (since the Dirichlet distributions will be concentrated around θ) and small values of α_t will encourage larger moves. An attractive option is to start with small values of α_t to encourage exploration of the parameter space and to progressively increase α_t in the hope of convergence to a stable set of particles giving a good approximation to the posterior.
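The following sketch implements the distance of equation (1) and the two-stage perturbation kernel of Section 4.3 in Python. It assumes ground atoms are represented as hashable values (e.g. tuples of emitted symbols for the HMM sketch above) and particles as the switch-to-probability-vector dictionaries used in the earlier sketches; sample_dirichlet is the helper defined above. None of these representations come from the paper's implementation.

from collections import Counter

def empirical(dataset):
    """Empirical distribution of a dataset of ground atoms (here: hashable values)."""
    counts = Counter(dataset)
    n = len(dataset)
    return {atom: c / n for atom, c in counts.items()}

def distance(x0, x_star):
    """Squared Euclidean distance between the two empirical distributions, eq. (1).

    Only atoms appearing in at least one of the two datasets contribute; all other
    terms of the (countably infinite) sum are zero.
    """
    p0, p1 = empirical(x0), empirical(x_star)
    support = set(p0) | set(p1)
    return sum((p0.get(a, 0.0) - p1.get(a, 0.0)) ** 2 for a in support)

def perturb(theta, alpha_t):
    """Perturbation kernel of Section 4.3: scale each switch distribution by alpha_t
    to obtain Dirichlet parameters, then resample each switch from that Dirichlet."""
    return {sw: sample_dirichlet([alpha_t * p for p in probs])
            for sw, probs in theta.items()}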

5 Experimental results

The ABC SMC algorithm has been implemented as a PRISM program which is supplied in the supplementary materials. PRISM 2.0 beta 4, kindly supplied by the PRISM developers, was used.

As an initial test, ABC was done for the simplest possible parameter estimation problem. A PRISM program representing a biased coin (P(heads) = 0.7, P(tails) = 0.3) was written and data of 100 simulated tosses were produced. This resulted in 67 heads and 33 tails. ABC was run several times with the following (more or less arbitrarily chosen) parameters: prior distribution π(θ) = Dir(1,1), sequence of thresholds ε = (0), number of synthetic datasets B_t = 50, perturbation kernel parameter α_t = 2, number of particles T = 50 and population size N = 50. As expected, the final population of (weighted) particles was always concentrated around the maximum likelihood estimate P(heads) = 0.67, P(tails) = 0.33. Here are the 4 most heavily weighted particles with their weights from one particular run: (0.646, ) (w = 0.051), (0.647, 0.353) (w = 0.044), (0.667, 0.332) (w = 0.044), (0.62, 0.38) (w = 0.037). Estimates of the posterior mean were similar for different ABC runs; here are such estimates from 5 runs: (0.663, 0.337), (0.655, 0.345), (0.664, 0.336), (0.667, 0.333), (0.677, 0.323). Note that, in this trivial problem, successful parameter estimation was possible by going directly for a zero distance threshold ε = (0).

For a more substantial test, 100 ground hmm/1 atoms were sampled from the PRISM-encoded HMM shown in Fig 2. Dir(1,1) priors were used for all 5 switches. The experimental parameters were α_t = 10, B_t = 100, N = 200 and ε = (0.1, 0.05). Ideally, different runs of ABC should generate similar approximations to posterior quantities. To look into this, marginal posterior distributions for the probability of the first value of each of the 5 switches were estimated using 3 different ABC runs. The results are shown in Fig 3. These plots were produced using the density function in R with the final weighted population of particles as input. There is evident variation between the results of the 3 runs, but similarities also. All 3 densities for init contain two local modes, all 3 for out(s1) put most mass in the middle, and all 3 for tr(s0) have a fairly even spread apart from extreme values.

[Fig. 3 panels: density estimates for init, out(s0), out(s1), tr(s0) and tr(s1), each based on N = 200 particles per run.]

Fig. 3. Posterior distributions for HMM switch probabilities init, out(s0), out(s1), tr(s0), tr(s1) as estimated by three different runs of ABC.
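Posterior summaries such as the means quoted above can be read straight off the final weighted population. The short sketch below shows one way to do this, again using the particle representation assumed in the earlier sketches; it is an illustrative computation, not code from the paper's implementation.

def posterior_mean(particles, weights):
    """Weighted posterior mean of each switch probability from the final population.

    particles: list of dicts mapping switch names to probability vectors;
    weights: the corresponding normalised particle weights.
    """
    means = {}
    for sw in particles[0]:
        k = len(particles[0][sw])
        means[sw] = [sum(w * p[sw][j] for p, w in zip(particles, weights))
                     for j in range(k)]
    return means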

6 Conclusions and future work

This paper has described ABC SMC for PRISM programs and has shown some initial results for a working implementation. Evidently, considerably more experimentation and theoretical analysis are required to provide reliable approximations to posterior quantities using ABC SMC. In the experiments reported above the perturbation kernel K_t did not vary with t. It is likely that better results are possible by reducing the probability of big perturbations as t increases. In addition, the choice of the sequence of ε_t thresholds was fairly arbitrary. Finally, it may be that better results are achievable by throwing more computational resources at the problem: most obviously by increasing the number of particles, but also by lengthening the sequence of ε_t thresholds to effect a smoother convergence to the posterior.

Another avenue for improvement is the choice of distance function. The function introduced in Section 4.2 is a generic function that is applicable to any PRISM program. It seems likely that domain knowledge could be used to choose domain-specific distance functions which reflect the real difference between different ground atoms. The function used here treats all distinct pairs of ground atoms as equally different, which will not be appropriate in many cases.

References

1. Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M.P.: Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6(31) (2009)
2. Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S.: Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100 (December 2003)
3. Sato, T., Kameya, Y.: Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research 15 (2001)
4. Sato, T., Zhou, N.F., Kameya, Y., Izumi, Y.: PRISM User's Manual (Version ). (2009)
5. Cheng, B., Feast, G.: Some simple gamma variate generators. Applied Statistics 28 (1979)
6. Ahrens, J., Dieter, U.: Computer methods for sampling from gamma, beta, Poisson and binomial distributions. Computing 12 (1974)
7. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Second edn. Springer, New York (2004)
