MCMC and likelihood-free methods

1 MCMC and likelihood-free methods
Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
Université de Besançon, November 22, 2012

2 Computational issues in Bayesian cosmology
Outline of the talk:
- Computational issues in Bayesian cosmology
- The Metropolis-Hastings Algorithm
- The Gibbs Sampler
- Approximate Bayesian computation

3 Statistical problems in cosmology
- Potentially high-dimensional parameter space [not considered here]
- Immensely slow computation of likelihoods, e.g. WMAP, CMB, because of numerically costly spectral transforms [the data is a Fortran program]
- Nonlinear dependence and degeneracies between parameters, introduced by physical constraints or theoretical assumptions

4 Cosmological data
Posterior distribution of cosmological parameters for recent observational data on CMB anisotropies (differences in temperature across directions) [WMAP], SNIa, and cosmic shear.
Combination of three likelihoods, some of which are available as public (Fortran) code, and of a uniform prior on a hypercube.

5 Cosmology parameters
Parameters for the cosmology likelihood (C = CMB, S = SNIa, L = lensing). The Minimum and Maximum entries of the table are missing from this transcript.

Symbol          Description                     Experiment
$\Omega_b$      Baryon density                  C, L
$\Omega_m$      Total matter density            C, S, L
$w$             Dark-energy eq. of state        C, S, L
$n_s$           Primordial spectral index       C, L
$\Delta^2_R$    Normalization (large scales)    C
$\sigma_8$      Normalization (small scales)    C, L
$h$             Hubble constant                 C, L
$\tau$          Optical depth                   C
$M$             Absolute SNIa magnitude         S
$\alpha$        Colour response                 S
$\beta$         Stretch response                S
$a$, $b$, $c$   Galaxy z-distribution fit       L

For WMAP5, $\sigma_8$ is a deduced quantity that depends on the other parameters.

6 Adaptation of importance function [Benabed et al., MNRAS, 2010]

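7 Estimates
[Table: PMC and MCMC estimates for $\Omega_b$, $\Omega_m$, $\tau$, $w$, $n_s$, $\Delta^2_R$, $h$, $a$, $b$, $c$, $M$, $\alpha$, $\beta$ and $\sigma_8$; the numerical entries are missing from this transcript.]
Means and 68% credible intervals using lensing, SNIa and CMB.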

9 Evidence / marginal likelihood / integrated likelihood
Central quantity of interest in (Bayesian) model choice:
$$E = \int \pi(x)\,dx = \int \frac{\pi(x)}{q(x)}\, q(x)\,dx,$$
expressed as an expectation under any density $q$ with large enough support.
Importance sampling provides a sample $x_1, \ldots, x_N \sim q$ and an approximation of the above integral,
$$E \approx \frac{1}{N} \sum_{n=1}^N w_n, \qquad w_n = \frac{\pi(x_n)}{q(x_n)},$$
where the $w_n$ are the (unnormalised) importance weights.
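The importance sampling approximation of the evidence can be sketched on a toy problem. The unnormalised target $\pi(x) = \exp(-x^2/2)$ (whose integral is $\sqrt{2\pi}$), the Gaussian importance function and all function names below are assumptions made for illustration; they are unrelated to the cosmology likelihoods of the talk.

```python
import math
import random


def unnormalised_target(x):
    # Toy choice (assumption): pi(x) = exp(-x^2 / 2), so E = sqrt(2 * pi).
    return math.exp(-0.5 * x * x)


def proposal_density(x, scale):
    # Density of the N(0, scale^2) importance function q.
    return math.exp(-0.5 * (x / scale) ** 2) / (scale * math.sqrt(2.0 * math.pi))


def evidence_estimate(n_samples, scale=2.0, seed=0):
    """Average of the unnormalised weights w_n = pi(x_n) / q(x_n), x_n ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, scale)
        total += unnormalised_target(x) / proposal_density(x, scale)
    return total / n_samples
```

Calling `evidence_estimate(20000)` should return a value close to `math.sqrt(2 * math.pi)`. The proposal is deliberately wider than the target so that the weights stay bounded.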

11 Back to cosmology questions
Standard cosmology is successful in explaining recent observations, such as CMB, SNIa, galaxy clustering, cosmic shear, galaxy cluster counts, and Lyα forest clustering.
Flat ΛCDM model with only six free parameters ($\Omega_m$, $\Omega_b$, $h$, $n_s$, $\tau$, $\sigma_8$).
Extensions to ΛCDM may be based on independent evidence (massive neutrinos from oscillation experiments), predicted by compelling hypotheses (primordial gravitational waves from inflation), or reflect ignorance about fundamental physics (dynamical dark energy).
Testing for dark energy, curvature, and inflationary models.

12 Extended models
Focus on the dark-energy equation-of-state parameter, modeled as

Model      Equation of state
ΛCDM       $w = -1$
wCDM       $w = w_0$
w(z)CDM    $w = w_0 + w_1 (1 - a)$

In addition, the curvature parameter $\Omega_K$ for each of the above is either $\Omega_K = 0$ ("flat") or $\Omega_K \neq 0$ ("curved").
This choice represents the simplest models beyond a cosmological constant model that are able to explain the observed, recent accelerated expansion of the Universe.

13 Cosmology priors
Prior ranges for dark energy and curvature models. In the case of $w(a)$ models, the prior on $w_1$ depends on $w_0$. The bounds for $\Omega_m$, $\Omega_b$ and $h$ are missing from this transcript.

Parameter     Description                  Min.          Max.
$\Omega_m$    Total matter density
$\Omega_b$    Baryon density
$h$           Hubble parameter
$\Omega_K$    Curvature                    $-1$          $1$
$w_0$         Constant dark-energy par.    $-1$          $-1/3$
$w_1$         Linear dark-energy par.      $-1 - w_0$    $\frac{-1/3 - w_0}{1 - a_{\rm acc}}$

14 Results
- In most cases, evidence in favour of the standard model, especially when more datasets/experiments are combined.
- Largest evidence is $\ln B_{12} = 1.8$, for the w(z)CDM model and CMB alone. This is a case where a large part of the prior range is still allowed by the data, and a region of comparable size is excluded.
- Hence weak evidence that both $w_0$ and $w_1$ are required, but excluded when adding SNIa and BAO datasets.
- Results on the curvature are compatible with current findings: non-flat Universe(s) strongly disfavoured for the three dark-energy cases.

15 Evidence

16 Posterior outcome
Posterior on dark-energy parameters $w_0$ and $w_1$ as 68% and 95% credible regions for WMAP (solid blue lines), WMAP+SNIa (dashed green) and WMAP+SNIa+BAO (dotted red curves). Allowed prior range as red straight lines.

17 The Metropolis-Hastings Algorithm
Outline: Computational issues in Bayesian cosmology; The Metropolis-Hastings Algorithm; The Gibbs Sampler; Approximate Bayesian computation

18 Monte Carlo basics: General purpose
A major computational issue in Bayesian statistics: given a density $\pi$ known up to a normalizing constant, $\pi \propto \tilde\pi$, and an integrable function $h$, compute
$$\Pi(h) = \int h(x)\,\pi(x)\,\mu(dx) = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)}$$
when $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable.

20 Monte Carlo 101
Generate an iid sample $x_1, \ldots, x_N$ from $\pi$ and estimate $\Pi(h)$ by
$$\hat\Pi^{MC}_N(h) = N^{-1} \sum_{i=1}^N h(x_i).$$
LLN: $\hat\Pi^{MC}_N(h) \xrightarrow{\text{as}} \Pi(h)$.
If $\Pi(h^2) = \int h^2(x)\,\pi(x)\,\mu(dx) < \infty$,
CLT: $\sqrt{N}\,\big(\hat\Pi^{MC}_N(h) - \Pi(h)\big) \xrightarrow{\mathcal L} \mathcal N\big(0, \Pi\{[h - \Pi(h)]^2\}\big)$.
Caveat leading to MCMC: it is often impossible or inefficient to simulate directly from $\Pi$.
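A minimal sketch of the Monte Carlo average and of the standard error suggested by the CLT follows. It assumes a case where $\pi$ can be simulated directly (here a standard normal, chosen only for illustration); `monte_carlo_estimate` and `standard_normal` are hypothetical names.

```python
import math
import random


def monte_carlo_estimate(h, sampler, n_samples, seed=0):
    """Return the Monte Carlo average of h and its CLT-based standard error."""
    rng = random.Random(seed)
    values = [h(sampler(rng)) for _ in range(n_samples)]
    mean = sum(values) / n_samples
    # Empirical variance of h(x_i), estimating Pi{[h - Pi(h)]^2}.
    variance = sum((v - mean) ** 2 for v in values) / (n_samples - 1)
    return mean, math.sqrt(variance / n_samples)


def standard_normal(rng):
    # Assumption: pi is N(0, 1), which can be simulated directly.
    return rng.gauss(0.0, 1.0)
```

For instance, `monte_carlo_estimate(lambda x: x * x, standard_normal, 20000)` approximates $\Pi(h) = 1$ for $h(x) = x^2$, with a standard error near $\sqrt{2/N}$.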

24 Monte Carlo methods based on Markov chains: Running Monte Carlo via Markov chains (MCMC)
It is not necessary to use a sample from the distribution $f$ to approximate the integral
$$I = \int h(x)\, f(x)\,dx.$$
[Notation warning: $\pi$ turned to $f$!]
We can obtain $X_1, \ldots, X_n \sim f$ (approximately) without directly simulating from $f$, using an ergodic Markov chain with stationary distribution $f$.
[Portrait: Andreï Markov]

27 Running Monte Carlo via Markov chains (2)
Idea: for an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution $f$.
- An irreducible Markov chain with stationary distribution $f$ is ergodic with limiting distribution $f$ under weak conditions, hence convergence in distribution of $(X^{(t)})$ to a random variable from $f$.
- For $T_0$ large enough, $X^{(T_0)}$ is distributed from $f$.
- The Markov sequence is a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$ generated from $f$.
- Birkhoff's ergodic theorem extends the LLN, which is sufficient for most approximation purposes.
Problem: how can one build a Markov chain with a given stationary distribution?

28 The Metropolis-Hastings algorithm
Arguments: the algorithm uses the objective (target) density $f$ and a conditional density $q(y \mid x)$, called the instrumental (or proposal) distribution.
[Portrait: Nicholas Metropolis]

29 The MH algorithm
Algorithm (Metropolis-Hastings). Given $x^{(t)}$,
1. Generate $Y_t \sim q(y \mid x^{(t)})$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t), \end{cases}$$
where
$$\rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\, \frac{q(x \mid y)}{q(y \mid x)},\ 1 \right\}.$$
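The two steps of the algorithm can be sketched generically as below. The Gamma-shaped target, the independent exponential proposal with rate 0.3, and the helper names (`log_f`, `propose`, `log_q`) are assumptions picked for illustration, not part of the original slides.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, x0, n_iter, seed=0):
    """Generic MH: log_q(a, b) is the log density of proposing a from b."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        y = propose(x, rng)
        # log of f(y) q(x | y) / (f(x) q(y | x)); constants of f cancel.
        log_ratio = log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y  # accept the proposed value
        chain.append(x)  # on rejection the current value is repeated
    return chain


# Toy setting (assumption): f(x) proportional to x^2 exp(-x) on x > 0,
# i.e. a Gamma(3, 1) target, with an independent Exp(0.3) proposal.
RATE = 0.3


def log_f(x):
    return 2.0 * math.log(x) - x if x > 0 else float("-inf")


def propose(x, rng):
    return rng.expovariate(RATE)


def log_q(a, b):
    # Independent proposal: the density of a does not depend on b.
    return math.log(RATE) - RATE * a
```

With this target, the average of `metropolis_hastings(log_f, propose, log_q, 1.0, 20000)` should be close to 3, the mean of a Gamma(3, 1) distribution. Working with logarithms avoids overflow in the ratio.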

30 Features
- Independent of normalizing constants for both $f$ and $q(\cdot \mid x)$ (i.e., those constants independent of $x$)
- Never moves to values with $f(y) = 0$
- The chain $(x^{(t)})_t$ may take the same value several times in a row, even though $f$ is a density with respect to Lebesgue measure
- The sequence $(y_t)_t$ is usually not a Markov chain

33 Convergence properties
1. The M-H Markov chain is reversible, with invariant/stationary density $f$, since it satisfies the detailed balance condition
$$f(y)\, K(y, x) = f(x)\, K(x, y).$$
2. As $f$ is a probability measure, the chain is positive recurrent.
3. If
$$\Pr\left[ \frac{f(Y_t)\, q(X^{(t)} \mid Y_t)}{f(X^{(t)})\, q(Y_t \mid X^{(t)})} \geq 1 \right] < 1, \qquad (1)$$
that is, the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain is aperiodic.

34 Random-walk Metropolis-Hastings algorithms: Random walk Metropolis-Hastings
Use of a local perturbation as proposal,
$$Y_t = X^{(t)} + \varepsilon_t,$$
where $\varepsilon_t \sim g$, independent of $X^{(t)}$.
The instrumental density is of the form $g(y - x)$, and the Markov chain is a random walk if we take $g$ to be symmetric, $g(x) = g(-x)$.

35 Random walk Metropolis-Hastings [code]
Algorithm (Random walk Metropolis). Given $x^{(t)}$,
1. Generate $Y_t \sim g(y - x^{(t)})$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{ 1, \dfrac{f(Y_t)}{f(x^{(t)})} \right\}, \\ x^{(t)} & \text{otherwise.} \end{cases}$$
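A possible rendering of the random walk Metropolis algorithm with a Gaussian perturbation is given below. The standard normal target and the function names are assumptions for illustration; the sampler also reports the empirical acceptance rate.

```python
import math
import random


def random_walk_metropolis(log_f, x0, scale, n_iter, seed=0):
    """Gaussian random walk Metropolis; returns the chain and acceptance rate."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)  # symmetric perturbation
        # The proposal is symmetric, so only f(y) / f(x) enters the ratio.
        if rng.random() < math.exp(min(0.0, log_f(y) - log_f(x))):
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / n_iter


def log_standard_normal(x):
    # Toy target (assumption): N(0, 1), up to an additive constant.
    return -0.5 * x * x
```

For example, `random_walk_metropolis(log_standard_normal, 0.0, 2.4, 20000)` returns a chain whose mean and variance should be near 0 and 1. Changing `scale` moves the acceptance rate up (small steps) or down (large steps).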

36 Extensions: Langevin algorithms
Proposal based on the Langevin diffusion $L_t$, defined by the stochastic differential equation
$$dL_t = dB_t + \frac{1}{2} \nabla \log f(L_t)\,dt,$$
where $B_t$ is the standard Brownian motion.
Theorem: the Langevin diffusion is the only non-explosive diffusion which is reversible with respect to $f$.

38 Discretization
Instead, consider the sequence
$$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma \varepsilon_t, \qquad \varepsilon_t \sim \mathcal N_p(0, I_p),$$
where $\sigma^2$ corresponds to the discretization step.
Unfortunately, the discretized chain may be transient, for instance when
$$\lim_{x \to \pm\infty} \left| \sigma^2\, \nabla \log f(x)\, |x|^{-1} \right| > 1.$$

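39 MH correction
Accept the new value $Y_t$ with probability
$$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp\left\{ -\left\| x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t) \right\|^2 \big/ 2\sigma^2 \right\}}{\exp\left\{ -\left\| Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) \right\|^2 \big/ 2\sigma^2 \right\}} \wedge 1,$$
which is the ratio $f(y)\,q(x \mid y) / \{f(x)\,q(y \mid x)\}$ of the MH algorithm for the discretized Langevin proposal.
Choice of the scaling factor $\sigma$: it should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of $x$ are uncorrelated). [Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]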
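The discretized Langevin proposal with its MH correction can be sketched in one dimension as follows. The function `mala` and its arguments are hypothetical names; the caller is assumed to supply the log target and its derivative.

```python
import math
import random


def mala(log_f, grad_log_f, x0, sigma, n_iter, seed=0):
    """One-dimensional Metropolis-adjusted Langevin sampler (sketch)."""
    rng = random.Random(seed)

    def log_q(a, b):
        # Log density (up to a constant) of proposing a from b.
        drift = b + 0.5 * sigma ** 2 * grad_log_f(b)
        return -((a - drift) ** 2) / (2.0 * sigma ** 2)

    x, chain = x0, []
    for _ in range(n_iter):
        # Discretized Langevin move started at the current value x.
        y = x + 0.5 * sigma ** 2 * grad_log_f(x) + sigma * rng.gauss(0.0, 1.0)
        # MH correction: log of f(y) q(x | y) / (f(x) q(y | x)).
        log_ratio = log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        chain.append(x)
    return chain
```

As a usage example under the assumption of a standard normal target, `mala(lambda x: -0.5 * x * x, lambda x: -x, 0.0, 1.0, 20000)` should produce draws with mean near 0 and variance near 1. Without the correction step, the same recursion would settle on a variance of 4/3 for this target.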

40 Optimizing the acceptance rate
Problem of choosing the transition kernel $q$ from a practical point of view. Most common solutions:
(a) a fully automated algorithm like ARMS; [Gilks & Wild, 1992]
(b) an instrumental density $g$ which approximates $f$, such that $f/g$ is bounded for uniform ergodicity to apply;
(c) a random walk.
In both cases (b) and (c), the choice of $g$ is critical.

42 Case of the random walk
Different approach to acceptance rates: a high acceptance rate does not indicate that the algorithm is moving correctly, since it may instead indicate that the random walk is moving too slowly on the surface of $f$.
If $x^{(t)}$ and $y_t$ are close, i.e. $f(x^{(t)}) \simeq f(y_t)$, then $y_t$ is accepted with probability
$$\min\left( \frac{f(y_t)}{f(x^{(t)})},\ 1 \right) \simeq 1.$$
For multimodal densities with well-separated modes, the negative effect of limited moves on the surface of $f$ clearly shows.

43 Case of the random walk (2)
If the average acceptance rate is low, the successive values of $f(y_t)$ tend to be small compared with $f(x^{(t)})$, which means that the random walk moves quickly on the surface of $f$, since it often reaches the borders of the support of $f$.

45 Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, aim at an average acceptance rate of 25%. [Gelman, Gilks and Roberts, 1995]
Warning: rule to be taken with a pinch of salt!

47 Role of scale
Example (Noisy AR(1)). Hidden Markov chain from a regular AR(1) model,
$$x_{t+1} = \varphi x_t + \epsilon_{t+1}, \qquad \epsilon_t \sim \mathcal N(0, \tau^2),$$
and observables
$$y_t \mid x_t \sim \mathcal N(x_t^2, \sigma^2).$$
The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is proportional to
$$\exp\left( -\frac{1}{2\tau^2} \left\{ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2} (y_t - x_t^2)^2 \right\} \right).$$

48 Role of scale
Example (Noisy AR(1), continued). For a Gaussian random walk with scale $\omega$ small enough, the random walk never jumps to the other mode. But if the scale $\omega$ is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
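The effect of the scale can be reproduced on the conditional distribution of the noisy AR(1) example. All numerical values below ($\varphi$, $\tau$, $\sigma$, the neighbouring states, the observation, and the two scales tried in the usage note) are assumptions chosen so that the two modes are well separated; they are not the settings behind the figures of the talk.

```python
import math
import random

# Illustrative values (assumptions, not those behind the slides' figures).
PHI, TAU, SIGMA = 0.5, 1.0, 0.5
X_PREV, X_NEXT, Y_OBS = 0.0, 0.0, 4.0


def log_conditional(x):
    """Log density of x_t given x_{t-1}, x_{t+1}, y_t, up to a constant."""
    quad = (x - PHI * X_PREV) ** 2 + (X_NEXT - PHI * x) ** 2
    quad += (TAU ** 2 / SIGMA ** 2) * (Y_OBS - x * x) ** 2
    return -quad / (2.0 * TAU ** 2)


def random_walk_chain(scale, n_iter, x0=2.0, seed=0):
    """Gaussian random walk Metropolis on the conditional of x_t."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        y = x + scale * rng.gauss(0.0, 1.0)
        log_ratio = log_conditional(y) - log_conditional(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        chain.append(x)
    return chain


def share_negative(chain):
    # Fraction of iterations spent in the mode located at negative x.
    return sum(1 for x in chain if x < 0) / len(chain)
```

With these values the target has two symmetric modes near $\pm 2$. A chain run with `scale=0.1` from `x0=2.0` stays in the positive mode, while `scale=3.0` lets the chain visit both modes in comparable proportions.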

49 Role of scale
[Figure: Markov chain based on a random walk with scale $\omega = 0.1$.]

50 Role of scale
[Figure: Markov chain based on a random walk with scale $\omega = 0.5$.]

51 The Gibbs Sampler
Outline: Computational issues in Bayesian cosmology; The Metropolis-Hastings Algorithm; The Gibbs Sampler; Approximate Bayesian computation

54 General principles
A very specific simulation algorithm based on the target distribution $f$:
1. Uses the conditional densities $f_1, \ldots, f_p$ from $f$.
2. Start with the random variable $X = (X_1, \ldots, X_p)$.
3. Simulate from the conditional densities,
$$X_i \mid x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p \sim f_i(x_i \mid x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p),$$
for $i = 1, 2, \ldots, p$.

55 Gibbs code
Algorithm (Gibbs sampler). Given $x^{(t)} = (x_1^{(t)}, \ldots, x_p^{(t)})$, generate
1. $X_1^{(t+1)} \sim f_1(x_1 \mid x_2^{(t)}, \ldots, x_p^{(t)})$;
2. $X_2^{(t+1)} \sim f_2(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)})$;
...
p. $X_p^{(t+1)} \sim f_p(x_p \mid x_1^{(t+1)}, \ldots, x_{p-1}^{(t+1)})$.
Then $X^{(t+1)} \to X \sim f$.
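The systematic scan above can be illustrated with $p = 2$ on a standard bivariate normal target with correlation $\rho$, whose full conditionals are $X_1 \mid x_2 \sim \mathcal N(\rho x_2, 1 - \rho^2)$ and symmetrically for $X_2$. This target and the function names are assumptions for the sake of the example.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Two-stage Gibbs sampler for a standard bivariate normal, correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x1, x2 = 0.0, 0.0  # arbitrary starting value
    draws = []
    for _ in range(n_iter):
        # X1 | x2 ~ N(rho * x2, 1 - rho^2), using the current x2.
        x1 = rng.gauss(rho * x2, cond_sd)
        # X2 | x1 ~ N(rho * x1, 1 - rho^2), using the freshly updated x1.
        x2 = rng.gauss(rho * x1, cond_sd)
        draws.append((x1, x2))
    return draws


def correlation(draws):
    """Empirical correlation of the simulated pairs."""
    n = len(draws)
    m1 = sum(a for a, _ in draws) / n
    m2 = sum(b for _, b in draws) / n
    cov = sum((a - m1) * (b - m2) for a, b in draws) / n
    v1 = sum((a - m1) ** 2 for a, _ in draws) / n
    v2 = sum((b - m2) ** 2 for _, b in draws) / n
    return cov / math.sqrt(v1 * v2)
```

For instance, `correlation(gibbs_bivariate_normal(0.8, 20000))` should be close to 0.8. Only univariate normal draws are needed, even though the target is bivariate.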

56 Properties
The full conditional densities $f_1, \ldots, f_p$ are the only densities used for simulation. Thus, even in a high-dimensional problem, all of the simulations may be univariate.

58 Toy example: iid $\mathcal N(\mu, \sigma^2)$ variates
When $Y_1, \ldots, Y_n \overset{\text{iid}}{\sim} \mathcal N(y \mid \mu, \sigma^2)$ with both $\mu$ and $\sigma$ unknown, the posterior in $(\mu, \sigma^2)$ is conjugate outside a standard family.
But...
$$\mu \mid Y_{1:n}, \sigma^2 \sim \mathcal N\left( \mu \,\Big|\, \frac{1}{n} \sum_{i=1}^n Y_i,\ \frac{\sigma^2}{n} \right)$$
$$\sigma^2 \mid Y_{1:n}, \mu \sim \mathcal{IG}\left( \sigma^2 \,\Big|\, \frac{n}{2} - 1,\ \frac{1}{2} \sum_{i=1}^n (Y_i - \mu)^2 \right)$$
assuming constant (improper) priors on both $\mu$ and $\sigma^2$.
Hence we may use the Gibbs sampler for simulating from the posterior of $(\mu, \sigma^2)$.

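59 Toy example: R code
Gibbs sampler for Gaussian posterior

    # y: numeric vector of observations
    n = length(y); S = sum(y); mu = S/n
    for (i in 1:500) {
      S2 = sum((y - mu)^2)                   # sum of squares around the current mu
      sigma2 = 1/rgamma(1, n/2 - 1, S2/2)    # sigma^2 | y, mu ~ IG(n/2 - 1, S2/2)
      mu = S/n + sqrt(sigma2/n)*rnorm(1)     # mu | y, sigma^2 ~ N(mean(y), sigma^2/n)
    }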

69 Example of results with n = 10 observations from the $\mathcal N(0, 1)$ distribution
[Figure sequence: Gibbs sampler output after 1, 2, 3, 4, 5, 10, 25, 50, 100 and 500 iterations.]

73 Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with acceptance rate uniformly equal to 1. The Gibbs sampler
1. limits the choice of instrumental distributions;
2. requires some knowledge of $f$;
3. is, by construction, multidimensional;
4. does not apply to problems where the number of parameters varies, as the resulting chain is not irreducible.

75 A wee problem
[Figure: two panels, "Gibbs started at random" and "Gibbs stuck at the wrong mode"; horizontal axis $\mu_1$.]

77 Slice sampler as generic Gibbs
If $f(\theta)$ can be written as a product
$$\prod_{i=1}^k f_i(\theta),$$
it can be completed as
$$\prod_{i=1}^k \mathbb I_{0 \le \omega_i \le f_i(\theta)},$$
leading to the following Gibbs algorithm:

78 Slice sampler (code)
Algorithm (Slice sampler). Simulate
1. $\omega_1^{(t+1)} \sim \mathcal U_{[0, f_1(\theta^{(t)})]}$;
...
k. $\omega_k^{(t+1)} \sim \mathcal U_{[0, f_k(\theta^{(t)})]}$;
k+1. $\theta^{(t+1)} \sim \mathcal U_{A^{(t+1)}}$, with
$$A^{(t+1)} = \{ y;\ f_i(y) \ge \omega_i^{(t+1)},\ i = 1, \ldots, k \}.$$
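With a single factor ($k = 1$), the slice sampler can be sketched for a normal density with mean $-3$ and unit variance restricted to an interval. The restriction to $[0, 1]$ is an assumption made here so that the set $A^{(t+1)}$ has a closed form; the function name is hypothetical.

```python
import math
import random


def slice_sampler_truncated_normal(n_iter, x0=0.5, seed=0):
    """Slice sampler for f(x) proportional to exp(-(x + 3)^2 / 2) on [0, 1]."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        f_x = math.exp(-0.5 * (x + 3.0) ** 2)
        # omega ~ U(0, f(x)]; 1 - random() lies in (0, 1], so omega > 0.
        omega = f_x * (1.0 - rng.random())
        # {y in [0, 1] : f(y) >= omega} = [0, min(1, -3 + sqrt(-2 log omega))].
        upper = min(1.0, -3.0 + math.sqrt(-2.0 * math.log(omega)))
        upper = max(0.0, upper)  # guard against rounding just below zero
        x = rng.uniform(0.0, upper)
        chain.append(x)
    return chain
```

Each iteration only needs two uniform draws, since $f$ is decreasing on $[0, 1]$ and the slice is therefore an interval starting at 0. The average of a long chain should approach the mean of the truncated distribution, roughly 0.26.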

85 Example of results with a truncated $\mathcal N(-3, 1)$ distribution
[Figure sequence (axes $x$ and $y$): slice sampler output after 2, 3, 4, 5, 10, 50 and 100 iterations.]

86 Approximate Bayesian computation
Outline: Computational issues in Bayesian cosmology; The Metropolis-Hastings Algorithm; The Gibbs Sampler; Approximate Bayesian computation

87 ABC basics: Regular Bayesian computation issues
Recap: when faced with a non-standard posterior distribution
$$\pi(\theta \mid y) \propto \pi(\theta)\, L(\theta \mid y),$$
the standard solution is to use simulation (Monte Carlo) to produce a sample $\theta_1, \ldots, \theta_T$ from $\pi(\theta \mid y)$ (or approximately, by Markov chain Monte Carlo methods). [Robert & Casella, 2004]

88 Intractable likelihoods
Cases when the likelihood function $f(y \mid \theta)$ is unavailable (in analytic and numerical senses) and when the completion step
$$f(y \mid \theta) = \int_{\mathcal Z} f(y, z \mid \theta)\,dz$$
is impossible or too costly because of the dimension of $z$.
Hence MCMC cannot be implemented!

89 Illustration
Phylogenetic tree: in population genetics, reconstitution of a common ancestor from a sample of genes via a phylogenetic tree that is close to impossible to integrate out [100 processor days with 4 parameters]. [Cornuet et al., 2009, Bioinformatics]

90 Illustration: demo-genetic inference
Genetic model of evolution from a common ancestor (MRCA), characterized by a set of parameters that cover historical, demographic, and genetic factors.
Dataset of polymorphism (DNA sample) observed at the present time.
[Figure: different possible scenarios, with the choice of scenario made by ABC. Scenario 1a is strongly supported over the others, which argues for a common origin of the Pygmy populations of West Africa. Verdu et al.]

91 Illustration: Pygmy population demo-genetics
Pygmy populations:
- Do they have a common origin?
- When and how did they split from non-Pygmy populations?
- Were there more recent interactions between Pygmy and non-Pygmy populations?

94 The ABC method
Bayesian setting: target is $\pi(\theta)\, f(x \mid \theta)$.
When the likelihood $f(x \mid \theta)$ is not in closed form, use a likelihood-free rejection technique.
ABC algorithm: for an observation $y \sim f(y \mid \theta)$, under the prior $\pi(\theta)$, keep jointly simulating
$$\theta' \sim \pi(\theta), \qquad z \sim f(z \mid \theta'),$$
until the auxiliary variable $z$ is equal to the observed value, $z = y$. [Tavaré et al., 1997]

95 Why does it work?!
The proof is trivial:
$$f(\theta_i) \propto \sum_{z \in \mathcal D} \pi(\theta_i)\, f(z \mid \theta_i)\, \mathbb I_y(z) \propto \pi(\theta_i)\, f(y \mid \theta_i) = \pi(\theta_i \mid y).$$
[Accept-Reject 101]
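The exact-match argument can be checked numerically on a discrete toy model, sketched below under assumptions that are not in the slides: a uniform prior on $\theta$ and a binomial observation, for which the true posterior is a Beta$(y + 1, n - y + 1)$ distribution. The function name is hypothetical.

```python
import random


def exact_abc_binomial(y_obs, n_trials, n_accept, seed=0):
    """Keep theta ~ U(0, 1) whenever a simulated binomial count equals y_obs."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.random()  # draw from the uniform prior
        # Simulate z ~ Binomial(n_trials, theta) as a sum of Bernoulli draws.
        z = sum(1 for _ in range(n_trials) if rng.random() < theta)
        if z == y_obs:  # exact match: z = y
            accepted.append(theta)
    return accepted
```

With `y_obs=3` and `n_trials=10`, the accepted values should average close to the Beta(4, 8) posterior mean, 1/3, even though the binomial likelihood is never evaluated.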

97 A as approximative
When $y$ is a continuous random variable, the equality $z = y$ is replaced with a tolerance condition,
$$\rho(y, z) \le \epsilon,$$
where $\rho$ is a distance.
Output distributed from
$$\pi(\theta)\, P_\theta\{\rho(y, z) < \epsilon\} \propto \pi(\theta \mid \rho(y, z) < \epsilon).$$

98 ABC algorithm
Algorithm 1: Likelihood-free rejection sampler
for $i = 1$ to $N$ do
  repeat
    generate $\theta'$ from the prior distribution $\pi(\cdot)$
    generate $z$ from the likelihood $f(\cdot \mid \theta')$
  until $\rho\{\eta(z), \eta(y)\} \le \epsilon$
  set $\theta_i = \theta'$
end for
where $\eta(y)$ defines a (not necessarily sufficient) statistic.
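Algorithm 1 can be sketched as follows for a toy model. The model (normal data with unknown mean and unit variance), the uniform prior on $[-5, 5]$, the sample mean as summary statistic $\eta$, and the absolute difference as distance $\rho$ are all assumptions made for this illustration.

```python
import random


def sample_mean(values):
    return sum(values) / len(values)


def abc_rejection(y_obs, n_accept, epsilon, seed=0):
    """Likelihood-free rejection sampler for the mean of N(theta, 1) data."""
    rng = random.Random(seed)
    eta_obs = sample_mean(y_obs)  # summary statistic eta(y)
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(-5.0, 5.0)  # prior (assumption): U(-5, 5)
        z = [rng.gauss(theta, 1.0) for _ in y_obs]  # pseudo-data from f(. | theta)
        if abs(sample_mean(z) - eta_obs) <= epsilon:  # rho{eta(z), eta(y)} <= eps
            accepted.append(theta)
    return accepted
```

Here the sample mean happens to be sufficient, so the accepted values approximate the true posterior as `epsilon` shrinks, at the price of a lower acceptance rate (about 2% for `epsilon=0.1` with this prior).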

100 Output
The likelihood-free algorithm samples from the marginal in $z$ of
$$\pi_\epsilon(\theta, z \mid y) = \frac{\pi(\theta)\, f(z \mid \theta)\, \mathbb I_{A_{\epsilon,y}}(z)}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta)\, f(z \mid \theta)\,dz\,d\theta},$$
where $A_{\epsilon,y} = \{ z \in \mathcal D \mid \rho(\eta(z), \eta(y)) < \epsilon \}$.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
$$\pi_\epsilon(\theta \mid y) = \int \pi_\epsilon(\theta, z \mid y)\,dz \approx \pi(\theta \mid \eta(y)).$$

101 Pima Indian benchmark
Figure: Comparison between density estimates of the marginals on $\beta_1$ (left), $\beta_2$ (center) and $\beta_3$ (right) from ABC rejection samples (red) and MCMC samples (black).

105 ABC advances
Simulating from the prior is often poor in efficiency.
- Either modify the proposal distribution on $\theta$ to increase the density of $x$'s within the vicinity of $y$... [Marjoram et al., 2003; Bortot et al., 2007; Sisson et al., 2007]
- ...or view the problem as a conditional density estimation and develop techniques to allow for larger $\epsilon$ [Beaumont et al., 2002]
- ...or even include $\epsilon$ in the inferential framework [ABC$_\mu$] [Ratmann et al., 2009]

107 ABC-MCMC
The Markov chain $(\theta^{(t)})$ created via the transition function
$$\theta^{(t+1)} = \begin{cases} \theta' \sim K_\omega(\theta' \mid \theta^{(t)}) & \text{if } x \sim f(x \mid \theta') \text{ is such that } x = y \text{ and } u \sim \mathcal U(0, 1) \le \dfrac{\pi(\theta')\, K_\omega(\theta^{(t)} \mid \theta')}{\pi(\theta^{(t)})\, K_\omega(\theta' \mid \theta^{(t)})}, \\ \theta^{(t)} & \text{otherwise,} \end{cases}$$
has the posterior $\pi(\theta \mid y)$ as stationary distribution. [Marjoram et al., 2003]

108 ABC-MCMC (2)
Algorithm 2: Likelihood-free MCMC sampler
Use Algorithm 1 to get $(\theta^{(0)}, z^{(0)})$
for $t = 1$ to $N$ do
  generate $\theta'$ from $K_\omega(\cdot \mid \theta^{(t-1)})$,
  generate $z'$ from the likelihood $f(\cdot \mid \theta')$,
  generate $u$ from $\mathcal U_{[0,1]}$,
  if $u \le \dfrac{\pi(\theta')\, K_\omega(\theta^{(t-1)} \mid \theta')}{\pi(\theta^{(t-1)})\, K_\omega(\theta' \mid \theta^{(t-1)})}\, \mathbb I_{A_{\epsilon,y}}(z')$ then
    set $(\theta^{(t)}, z^{(t)}) = (\theta', z')$
  else
    $(\theta^{(t)}, z^{(t)}) = (\theta^{(t-1)}, z^{(t-1)})$
  end if
end for
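Algorithm 2 can be sketched on a toy model. The assumptions here, none of which come from the slides, are a $\mathcal N(\theta, 1)$ sampling model, a $\mathcal N(0, 5^2)$ prior, a Gaussian random-walk kernel $K_\omega$ and the sample mean as summary statistic. Only the simulated summary is stored, and the chain keeps $\theta$ alone.

```python
import math
import random


def abc_mcmc(eta_obs, n_obs, n_iter, epsilon, step, seed=0):
    """Likelihood-free MCMC for the mean of N(theta, 1) data (sketch).

    Assumptions: N(0, 5^2) prior, Gaussian random-walk kernel K_omega with
    scale `step`, and the sample mean as summary statistic.
    """
    rng = random.Random(seed)

    def simulate_summary(theta):
        return sum(rng.gauss(theta, 1.0) for _ in range(n_obs)) / n_obs

    def log_prior(theta):
        return -theta ** 2 / (2.0 * 25.0)

    # Initialisation by plain rejection, playing the role of Algorithm 1.
    while True:
        theta = rng.gauss(0.0, 5.0)
        if abs(simulate_summary(theta) - eta_obs) <= epsilon:
            break

    chain = []
    for _ in range(n_iter):
        proposal = theta + step * rng.gauss(0.0, 1.0)
        eta_sim = simulate_summary(proposal)
        # Symmetric kernel: the K_omega ratio cancels, leaving the prior ratio.
        log_ratio = log_prior(proposal) - log_prior(theta)
        inside = abs(eta_sim - eta_obs) <= epsilon  # indicator of A_{eps, y}
        if inside and rng.random() <= math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain
```

For example, `abc_mcmc(1.2, 20, 10000, 0.1, 0.3)` gives a chain centred near the observed summary 1.2. Compared with the rejection sampler, proposals are made near the current value rather than from the prior, which raises the chance that the simulated summary falls within the tolerance.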

109 Sequential Monte Carlo
SMC is a simulation technique to approximate a sequence of related probability distributions $\pi_n$, with $\pi_0$ "easy" and $\pi_T$ as target.
Iterated IS as in PMC: particles are moved from time $n - 1$ to time $n$ via a kernel $K_n$, with use of a sequence of extended targets $\tilde\pi_n$,
$$\tilde\pi_n(z_{0:n}) = \pi_n(z_n) \prod_{j=0}^{n-1} L_j(z_{j+1}, z_j),$$
where the $L_j$'s are backward Markov kernels [check that $\pi_n(z_n)$ is a marginal]. [Del Moral, Doucet & Jasra, Series B, 2006]

110 Sequential Monte Carlo (2)
Algorithm 3: SMC sampler [Del Moral, Doucet & Jasra, Series B, 2006]
sample $z_i^{(0)} \sim \gamma_0(x)$ $(i = 1, \ldots, N)$
compute weights $w_i^{(0)} = \pi_0(z_i^{(0)}) / \gamma_0(z_i^{(0)})$
for $t = 1$ to $T$ do
  if $\mathrm{ESS}(w^{(t-1)}) < N_T$ then
    resample $N$ particles $z^{(t-1)}$ and set weights to 1
  end if
  generate $z_i^{(t)} \sim K_t(z_i^{(t-1)}, \cdot)$ and set weights to
  $$w_i^{(t)} = w_i^{(t-1)}\, \frac{\pi_t(z_i^{(t)})\, L_{t-1}(z_i^{(t)}, z_i^{(t-1)})}{\pi_{t-1}(z_i^{(t-1)})\, K_t(z_i^{(t-1)}, z_i^{(t)})}$$
end for
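The resampling test in Algorithm 3 relies on an effective sample size. The slides do not spell out its definition; the sketch below assumes the usual one, $\mathrm{ESS}(w) = (\sum_i w_i)^2 / \sum_i w_i^2$, together with multinomial resampling. Both choices and the function names are assumptions.

```python
import random


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2, between 1 and the number of particles."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def resample_if_needed(particles, weights, threshold, rng):
    """Multinomial resampling when the ESS drops below `threshold`."""
    if effective_sample_size(weights) < threshold:
        particles = rng.choices(particles, weights=weights, k=len(particles))
        weights = [1.0] * len(particles)  # weights are reset to 1
    return particles, weights
```

Equal weights give an ESS equal to the number of particles, while a single non-zero weight gives an ESS of 1, which would trigger resampling for any threshold above 1.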

111 ABC-SMC [Del Moral, Doucet & Jasra, 2009]
True derivation of an SMC-ABC algorithm.
Use of a kernel $K_n$ associated with target $\pi_{\epsilon_n}$ and derivation of the backward kernel
$$L_{n-1}(z, z') = \frac{\pi_{\epsilon_n}(z')\, K_n(z', z)}{\pi_n(z)}.$$
Update of the weights
$$w_{in} \propto w_{i(n-1)}\, \frac{\sum_{m=1}^M \mathbb I_{A_{\epsilon_n}}(x^m_{in})}{\sum_{m=1}^M \mathbb I_{A_{\epsilon_{n-1}}}(x^m_{i(n-1)})}$$
when $x^m_{in} \sim K(x_{i(n-1)}, \cdot)$.
