Sampling from complex probability distributions


Sampling from complex probability distributions
Louis J. M. Aslett
Department of Mathematical Sciences, Durham University
UTOPIAE Training School II, 4 July

Motivation

Sampling from probability distributions: why?

Monte Carlo essentially avoids the quandary of choosing an accurate but intractable model versus a simple but computable one.

May want to answer:
- Probabilistic questions: simulate physical random processes, concerned with some corresponding random outcome; the randomness may be inherent or perceived, e.g. simulation of a shuttle launch.
- Deterministic questions: most often this boils down to computation of high-dimensional integrals; use experimental methods to answer a theoretical question.

Bayesian inference (recall Georgios Karagiannis' talk)

Data: $t = \{t_1, \dots, t_n\}$.
Model: $t$ is the realisation of a random vector $T$ having probability density $\pi_{T \mid \Psi}(\cdot \mid \psi)$, where $\psi$ is an unknown parameter; $\pi_{T \mid \Psi}(t \mid \cdot)$ is the likelihood.
Prior: all knowledge about $\psi$ which is not contained in $t$ is expressed via the prior density $\pi_\Psi(\psi)$.
Posterior: Bayes' Theorem enables us to rationally update the prior to our posterior belief in light of the new evidence (data):
$$\pi_{\Psi \mid T}(\psi \mid t) = \frac{\pi_{T \mid \Psi}(t \mid \psi)\,\pi_\Psi(\psi)}{\int_\Omega \pi_{T \mid \Psi}(t \mid \psi)\,\pi_\Psi(\mathrm{d}\psi)} \propto \pi_{T \mid \Psi}(t \mid \psi)\,\pi_\Psi(\psi)$$

Everything is about expectations

Recall, for a random variable $X \in \Omega$ having probability density $\pi(x)$,
$$E[f(X)] := \int_\Omega f(x)\,\pi(\mathrm{d}x) \equiv \mu$$
Pretty much all statements of probability can be phrased in terms of expectations:
$$X \in \mathbb{R}: \quad P(X < a) = \int_{-\infty}^{a} \pi(\mathrm{d}x) = \int_\Omega I_{(-\infty,a)}(x)\,\pi(\mathrm{d}x) = E[I_{(-\infty,a)}(X)]$$
$$X \in \Omega,\ A \subseteq \Omega: \quad P(X \in A) = \int_A \pi(\mathrm{d}x) = \int_\Omega I_A(x)\,\pi(\mathrm{d}x) = E[I_A(X)]$$
i.e. for statements of probability, define $f(x) := I_A(x)$.

Bayesian inference again

May want samples directly from the posterior; marginal kernel density estimates; posterior predictive simulation; etc.

Or may want to answer a question about a probability under the posterior:
$$P(\psi \in A \mid t) = \int_A \pi(\mathrm{d}\psi \mid t) = \int_\Omega I_A(\psi)\,\pi(\mathrm{d}\psi \mid t) = \frac{\int_\Omega I_A(\psi)\,\pi(t \mid \psi)\,\pi(\mathrm{d}\psi)}{\int_\Omega \pi(t \mid \psi)\,\pi(\mathrm{d}\psi)}$$

So just do numerical integration?

Midpoint Riemann integration in one dimension using $n$ evaluations:
$$\int_a^b f(x)\,\pi(\mathrm{d}x) = \int_a^b g(x)\,\mathrm{d}x \approx \frac{b-a}{n}\sum_{i=1}^n g(x_i), \quad \text{where } x_i := a + \frac{b-a}{n}\left(i - \frac{1}{2}\right)$$
It is easy to show the error is:
$$\left|\int_a^b g(x)\,\mathrm{d}x - \frac{b-a}{n}\sum_{i=1}^n g(x_i)\right| \le \frac{(b-a)^3}{24 n^2}\max_{a \le z \le b}|g''(z)|$$
Clearly $\frac{(b-a)^3}{24}\max_{a\le z\le b}|g''(z)|$ is fixed by the problem, so we achieve the desired accuracy by controlling $n^{-2}$.

So just do numerical integration?

Error in the midpoint Riemann integral in 1 dimension: $O(n^{-2})$.
But error in the midpoint Riemann integral in $d$ dimensions: $O(n^{-2/d})$, the so-called curse of dimensionality.
Error in Monte Carlo integration: $O(n^{-1/2})$, i.e. independent of dimension.

Simpson's rule improves this to $O(n^{-4/d})$, but in general Bakhvalov's Theorem bounds all possible quadrature methods by $O(n^{-r/d})$: quadrature can't beat the curse of dimensionality.

Order of error

[Figure: order of error against $n$ for Monte Carlo (any dimension), 2-dim Riemann and 6-dim Riemann integration. Note: this is the order of error, not absolute error!]

Monte Carlo to the rescue?

Monte Carlo integration in $d$ dimensions using $n$ evaluations:
$$\mu \equiv \int f(x)\,\pi(\mathrm{d}x) \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \equiv \hat\mu, \quad \text{where } x_i \sim \pi(\cdot)$$
The root mean square error is:
$$\sqrt{E\left[\left(\int f(x)\,\pi(\mathrm{d}x) - \frac{1}{n}\sum_{i=1}^n f(x_i)\right)^2\right]} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma^2 = \mathrm{Var}_\pi(f(X))$. Again, $\sigma$ is (mostly) inherent to the problem, so we achieve the desired accuracy by controlling $n^{-1/2}$.

Recall we can set $f(x) := I_A(x)$ to compute probabilities.
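
As a concrete illustration (my example, not from the slides), a minimal R sketch of this estimator, taking $\pi(\cdot)$ to be a standard Normal and $f(x) = x^2$ so the true answer is 1:

```r
# Plain Monte Carlo: estimate mu = E[f(X)] with X ~ N(0,1) and f(x) = x^2,
# plus the sigma/sqrt(n) standard error estimated from the same draws.
set.seed(1)
n <- 1e5
f <- function(x) x^2
x <- rnorm(n)                   # x_i ~ pi(.)
mu.hat <- mean(f(x))            # (1/n) * sum_i f(x_i)
se.hat <- sd(f(x)) / sqrt(n)    # estimated sigma / sqrt(n)
c(estimate = mu.hat, std.error = se.hat)
```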

Monte Carlo: the practicality

Remarkably:
- No dependence on $d$.
- No dependence on smoothness of the integrand.
- $\sigma$ can itself be directly estimated from the $n$ samples drawn.

But the crux of the last slide was: $x_i \sim \pi(\cdot)$. Methodological research in Monte Carlo is largely preoccupied with how to achieve this for complex probability distributions.

Monte Carlo: achieving desired accuracy

A simple application of Chebyshev's inequality allows us to bound how certain we are, in a fully quantified way:
$$P(|\hat\mu - \mu| \ge \varepsilon) \le \frac{E[(\hat\mu - \mu)^2]}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}$$
Or indeed, invoking the iid central limit theorem, we can asymptotically state
$$P\left(\frac{\hat\mu - \mu}{\sigma n^{-1/2}} \le z\right) \xrightarrow{n \to \infty} \Phi(z)$$
and so form confidence intervals for $\mu$ based on large-$n$ samples.

Simple MC

The setting

Almost all Monte Carlo procedures start from the assumption that we have available an unlimited stream of independent uniformly distributed values, typically on the interval $[0,1]$. We now want to study how to convert a stream $u_i \sim \mathrm{Unif}(0,1)$ into a stream $x_j \sim \pi(\cdot)$, where $x_j$ is generated by some algorithm depending on one or more $u_i$. In more advanced methods (see MCMC), $x_j$ may also depend on $x_{j-1}$ or even $x_1, \dots, x_{j-1}$.

Inverse sampling

Let $F(x) := P(X \le x)$ be the cumulative distribution function for our target probability density function $\pi(\cdot)$.

Inverse Sampling
1. Sample $U \sim \mathrm{Unif}(0,1)$.
2. Set $X = F^{-1}(U)$.

Why is $X \sim \pi(\cdot)$?
$$P(X \le x) = P(F^{-1}(U) \le x) = P(F(F^{-1}(U)) \le F(x)) = P(U \le F(x)) = F(x)$$
To avoid problems with discrete distributions, we must define
$$F^{-1}(u) = \inf\{x : F(x) \ge u\}, \quad u \in [0,1]$$
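
A minimal R sketch of the idea (assuming a target with closed-form inverse CDF, here Exponential(rate = 2), where $F^{-1}(u) = -\log(1-u)/2$):

```r
# Inverse sampling: draw from Exponential(rate = 2) via U ~ Unif(0,1).
set.seed(1)
n <- 1e5
u <- runif(n)
x <- -log(1 - u) / 2   # X = F^{-1}(U)
mean(x)                # sanity check against the known mean 1/2
```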

Rejection sampling

Seek a density $\tilde\pi(\cdot)$ we can sample from and such that $\pi(x) \le c\,\tilde\pi(x)\ \forall x$, where $c < \infty$. $\pi(\cdot)$ and $\tilde\pi(\cdot)$ need not be normalised.

Rejection Sampling
1. Sample $Y \sim \tilde\pi(\cdot)$ and $U \sim \mathrm{Unif}(0,1)$.
2. If $U \le \frac{\pi(Y)}{c\,\tilde\pi(Y)}$, return $Y$; else return to 1.

This is not perfectly efficient, as we must iterate steps 1 & 2 a random number of times until acceptance, with $P(\text{accept}) = \frac{1}{c}$.
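
An illustrative R sketch (my example, not from the slides: target Beta(2,2), proposal Unif(0,1), and $c = 1.5$ since the Beta(2,2) density is bounded by 1.5):

```r
# Rejection sampling: draw from Beta(2,2) using a Unif(0,1) proposal, c = 1.5.
rejection.sample <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      y <- runif(1)                                 # Y ~ proposal
      u <- runif(1)                                 # U ~ Unif(0,1)
      if (u <= dbeta(y, 2, 2) / (1.5 * dunif(y))) { # accept w.p. pi(y)/(c*ptilde(y))
        out[i] <- y
        break
      }
    }
  }
  out
}
x <- rejection.sample(1e4)   # expected acceptance rate is 1/c = 2/3
```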

Rejection sampling: caution, a low-dimensional method

Consider a multivariate Normal distribution centred at 0,
$$\pi(x) = \det(2\pi\Sigma)^{-1/2}\exp\left(-\frac{1}{2}x^T\Sigma^{-1}x\right)$$
Say we want to produce samples for a target where $\Sigma = Q^T\Lambda Q$, using a proposal $\tilde\pi(\cdot)$ with covariance $\tilde\Sigma = \tilde\sigma I$. If $\tilde\sigma < \max\{\lambda_i\}$ then $c = \infty$; $c$ is minimal for $\tilde\sigma = \max\{\lambda_i\}$.

[Figure: acceptance probability against dimension on a log scale (1e-01 down to 1e-04 and beyond), showing the acceptance probability decaying rapidly as dimension grows.]

Rejection sampling demo

Shiny demo for rejection sampling with
$$\pi(x) \propto (x-5)^2\cos(x - 1/4) \quad \text{and} \quad \tilde\pi(x) = N(\mu = 4, \sigma = 3)$$
Thus $c \approx 4.9$, and so $P(\text{accept}) = 1/c \approx 0.2$.

Importance sampling (I)

If we have a probability density $\tilde\pi(\cdot)$ which is close to $\pi(\cdot)$, then we can produce a weighted set of samples.

Importance Sampling
1. Sample $X \sim \tilde\pi(\cdot)$.
2. The sample is $x_i = X$, with weight $w_i = \frac{\pi(x_i)}{\tilde\pi(x_i)}$.

The Monte Carlo estimator is slightly modified to account for the weights:
$$\mu \equiv \int f(x)\,\pi(\mathrm{d}x) \approx \frac{1}{n}\sum_{i=1}^n w_i f(x_i) \equiv \hat\mu$$

Standard Importance Sampling Properties
$$E[\hat\mu] = \mu, \qquad \mathrm{Var}(\hat\mu) = \frac{\sigma^2_{\tilde\pi}}{n}, \quad \text{where } \sigma^2_{\tilde\pi} = \int \frac{\left(f(x)\pi(x) - \mu\,\tilde\pi(x)\right)^2}{\tilde\pi(x)}\,\mathrm{d}x$$
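
A minimal R sketch (my example: target $N(0,1)$, heavier-tailed Student-$t_5$ proposal, $f(x) = x^2$ so the true answer is 1):

```r
# Importance sampling: estimate E[X^2] under N(0,1) using t_5 proposal draws.
set.seed(1)
n <- 1e5
x <- rt(n, df = 5)               # x_i ~ proposal
w <- dnorm(x) / dt(x, df = 5)    # w_i = pi(x_i) / ptilde(x_i)
mu.hat <- mean(w * x^2)          # (1/n) * sum_i w_i f(x_i)
mu.hat
```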

Importance sampling (II)

Consequently, one can show the optimal proposal for importance sampling is:
$$\tilde\pi(x)_{\text{opt}} = \frac{|f(x)|\,\pi(x)}{\int_\Omega |f(x)|\,\pi(\mathrm{d}x)}$$
Hence importance sampling shows how to beat naïve Monte Carlo when estimating expectations of non-identity functionals. In practice we can never compute the optimal $\tilde\pi(\cdot)$, but it can still provide a nice guide.

Importance sampling: unnormalised π(·)

We can still perform importance sampling if $\pi(\cdot)$ and $\tilde\pi(\cdot)$ are only known up to a normalising constant. The algorithm for sampling is unchanged, but the self-normalised importance sampling estimate becomes:
$$\int f(x)\,\pi(\mathrm{d}x) \approx \frac{\sum_{i=1}^n f(x_i)\,w_i}{\sum_{i=1}^n w_i}$$

Self-normalised Importance Sampling Properties
$$E[\hat\mu] = \mu + \frac{\mu\,\mathrm{Var}(W)}{n} - \frac{\mathrm{Cov}(W, W f(X))}{n} + O(n^{-2})$$
$$\mathrm{Var}(\hat\mu) \approx \sum_{i=1}^n w_i^2\,(f(x_i) - \hat\mu)^2 \quad \text{and} \quad \tilde\pi(x)_{\text{opt}} \propto |f(x) - \mu|\,\pi(x)$$
(with the weights here normalised to sum to 1)

Importance sampling: a simple diagnostic

Equate the variance of the importance sampling estimate to the plain Monte Carlo variance for a fixed sample size $n_e$:
$$\mathrm{Var}\left(\frac{\sum_{i=1}^n f(x_i) w_i}{\sum_{i=1}^n w_i}\right) = \frac{\sigma^2}{n_e} \;\Rightarrow\; \frac{\mathrm{Var}\left(\sum_{i=1}^n f(x_i) w_i\right)}{\left(\sum_{i=1}^n w_i\right)^2} = \frac{\sigma^2}{n_e} \;\Rightarrow\; \frac{\sigma^2\sum_{i=1}^n w_i^2}{\left(\sum_{i=1}^n w_i\right)^2} = \frac{\sigma^2}{n_e} \;\Rightarrow\; n_e = \frac{n\,\bar w^2}{\overline{w^2}}$$
Balanced weights are desirable:
- small $n_e$: diagnose a problem with IS;
- large $n_e$: all is OK with IS.
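
This effective sample size is one line of R (a sketch; `w` is the weight vector from the importance sampler above):

```r
# Effective sample size: n * (mean weight)^2 / mean(squared weight).
ess <- function(w) length(w) * mean(w)^2 / mean(w^2)
ess(w)   # close to n means balanced weights; much smaller signals a problem
```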

MCMC

Markov Chain Monte Carlo

Standard Monte Carlo methods indeed have the nice $O(n^{-1/2})$ convergence rate with no dependence on dimension $d$, but the constant in the error term still depends on dimension: no completely free lunch. There are, however, methods which control the error term better than standard Monte Carlo. MCMC, introduced in 1953, constructs a Markov chain whose stationary distribution is the target distribution of interest, $\pi(\cdot)$.

Markov Chains

We saw Markov chains in an imprecise probability context yesterday morning (Gert de Cooman's talk). Recall, a process $(X_1, X_2, \dots)$ is a continuous state space, discrete time Markov chain if
$$P(X_t \in A \mid X_1 = x_1, \dots, X_{t-1} = x_{t-1}) \equiv P(X_t \in A \mid X_{t-1} = x_{t-1})$$
The transition probabilities from a current state are defined by a kernel function $K(x, \cdot)$, such that
$$P(X_t \in A \mid X_{t-1} = x_{t-1}) = \int_A K(x_{t-1}, \mathrm{d}y) \equiv K(x_{t-1}, A)$$
Under certain conditions, these chains will have a stationary distribution. We are interested in constructing Markov chains with the stationary distribution we want to target, i.e.
$$\int_\Omega \pi(\mathrm{d}x)\,K(x, y) = \pi(y)$$

Diving straight in

There is a rich and interesting theory of Markov chains, but we'll fast-forward to the action for today.

Metropolis-Hastings is a method to algorithmically construct $K(x, \cdot)$ such that $\pi(\cdot)$ will be the stationary distribution.

Metropolis-Hastings
Specify a target, $\pi(\cdot)$, a proposal, $q(\cdot \mid x)$, and a starting point $x_1$. To sample the Markov chain, repeat:
1. Sample $x' \sim q(\cdot \mid x_{t-1})$.
2. Compute
$$\alpha(x' \mid x_{t-1}) = \min\left\{1, \frac{\pi(x')\,q(x_{t-1} \mid x')}{\pi(x_{t-1})\,q(x' \mid x_{t-1})}\right\}$$
3. Sample $u \sim \mathrm{Unif}(0,1)$. Set
$$x_t = \begin{cases} x' & \text{if } u \le \alpha(x' \mid x_{t-1}) \\ x_{t-1} & \text{otherwise} \end{cases}$$
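
A minimal R sketch of this algorithm (my illustrative assumptions: unnormalised target $\pi(x) \propto e^{-x^2/2}$ and a symmetric $N(x_{t-1}, 2^2)$ proposal, so the $q$ terms cancel in the ratio):

```r
# Metropolis-Hastings with a symmetric Normal proposal, on the log scale.
set.seed(1)
log.pi <- function(x) -x^2 / 2           # unnormalised log target
n <- 1e4
x <- numeric(n)
x[1] <- 0                                # starting point x_1
for (t in 2:n) {
  xp <- rnorm(1, mean = x[t - 1], sd = 2)        # x' ~ q(. | x_{t-1})
  log.alpha <- log.pi(xp) - log.pi(x[t - 1])     # q terms cancel (symmetric)
  x[t] <- if (log(runif(1)) < log.alpha) xp else x[t - 1]
}
c(mean(x), var(x))   # should be near 0 and 1 for the N(0,1) target
```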

Metropolis-Hastings: common proposals

Random-walk MH: choose some spherically symmetric distribution $g(\cdot)$ and propose $x' = x + \varepsilon$, where $\varepsilon \sim g(\cdot)$.
- The spherical symmetry means the acceptance probability simplifies:
$$\alpha(x' \mid x_{t-1}) = \min\left\{1, \frac{\pi(x')}{\pi(x_{t-1})}\right\}$$
- Often $g(\cdot)$ is a zero-mean multivariate Normal.

Independent MH: any choice $q(x' \mid x) = g(x')$, where the proposal does not depend on the current state.
- Generally not a good choice: it is easy to construct non-ergodic chains.

Convergence results

We use the same estimator as standard Monte Carlo,
$$\mu \equiv \int f(x)\,\pi(\mathrm{d}x) \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \equiv \hat\mu$$
where now the $x_i$ are MCMC draws.

However, we no longer have iid samples from $\pi(\cdot)$, so the standard Monte Carlo convergence results do not apply. Under some mild assumptions, we can state similar results for MCMC:
$$\sqrt{n}(\hat\mu - \mu) \xrightarrow{n \to \infty} N(0, \sigma^2)$$
where
$$\sigma^2 = \mathrm{Var}(f(X_1)) + 2\sum_{i=2}^{\infty}\mathrm{Cov}(f(X_1), f(X_i))$$

Estimating the variance

It is hard to estimate $\sigma^2$ in the MCMC setting, but essential in order to quantify the accuracy of estimates.
Simple option: always examine autocorrelation plots. These will alert you to situations where the infinite sum is contributing substantially to the variance of your estimate.
Better option: use methods such as batch means to estimate $\sigma^2$ from the Markov chain output. See the mcmcse R package for easy functions.
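
For the chain `x` simulated above, both checks are short (a sketch, assuming the mcmcse package is installed):

```r
# Diagnostics for the Markov chain 'x' from the sampler above.
acf(x)          # slowly decaying autocorrelation warns of inflated variance
library(mcmcse)
mcse(x)         # batch-means estimate of the mean and its Monte Carlo std error
```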

Choosing a proposal

Counter-intuitively, high acceptance rates in MCMC are a bad thing! Strongly correlated draws reduce the efficiency of the estimator by inflating the variance. But we need to move enough to explore the target: long-range jumps which reduce correlation have very low acceptance rates.

Need to balance these concerns: a famous result shows that in the limit as $d \to \infty$, the optimal acceptance rate for a symmetric product-form target density is 0.234. Empirically this works well in lower dimensions and for other targets, though for very small $d$ it should be increased (e.g. 0.44 in 1D).

Demo

Enough talk!
1. Example Metropolis-Hastings sampler in R (MCMC.R)
2. MCMC convergence Shiny demo (Shiny/MCMC.R)

Adaptive MCMC

Clearly there is an issue: we may get terrible results by making a poor choice of proposal. We can learn from the samples we have already seen to automatically improve our proposal distribution:
$$q_t(\cdot \mid x_{t-1}) = q(\cdot \mid x_{t-1}, \{x_1, \dots, x_{t-1}\})$$
Warning: this breaks the Markov property!

Adaptive MCMC Conditions
- Stationarity: $\pi(\cdot)$ must be stationary for $q_t(\cdot \mid x_{t-1})\ \forall t$.
- Diminishing adaptation: $\lim_{t \to \infty}\sup_{x \in \Omega}\left\|K_t(\cdot \mid x) - K_{t+1}(\cdot \mid x)\right\| = 0$.
- Containment: the time to stationarity from any point in the chain with the adapted kernel is bounded in probability.

Adaptive MCMC: Haario et al. / Roberts & Rosenthal

Idea?
$$q_t(\cdot \mid x_{t-1}) \sim N\left(x_{t-1}, \frac{2.38^2}{d}\hat\Sigma_t\right)$$
This doesn't quite work; to satisfy the adaptive conditions, use
$$q_t(\cdot \mid x_{t-1}) = (1-\beta)\,N\left(x_{t-1}, \frac{2.38^2}{d}\hat\Sigma_t\right) + \beta\,N\left(x_{t-1}, \frac{0.1^2}{d}I\right)$$
$2.38^2/d$ is the optimal scaling in certain theoretical circumstances. Alternatively, scale to target an acceptance rate:
$$q_t(\cdot \mid x_{t-1}) = (1-\beta)\,N\left(x_{t-1}, e^{\gamma_b}\hat\Sigma_t\right) + \beta\,N\left(x_{t-1}, \frac{0.1^2}{d}I\right)$$
where we split into batches $b$ of size 50, say, with
$$\gamma_b = \gamma_{b-1} + (-1)^{I(\alpha_{b-1} < 0.44)}\min\{0.01, n^{-1/2}\}$$
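
A toy one-dimensional sketch of the batch acceptance-rate adaptation (my simplification of the scheme above, not the full covariance-adapting algorithm):

```r
# Adapt the log proposal scale gamma in batches of 50, targeting a 0.44
# acceptance rate in 1D, for an unnormalised N(0,1) target.
set.seed(1)
log.pi <- function(x) -x^2 / 2
n <- 1e4; batch <- 50
x <- numeric(n); gamma <- 0; acc <- 0
for (t in 2:n) {
  xp <- rnorm(1, x[t - 1], sd = exp(gamma))      # random-walk proposal
  if (log(runif(1)) < log.pi(xp) - log.pi(x[t - 1])) {
    x[t] <- xp; acc <- acc + 1
  } else x[t] <- x[t - 1]
  if (t %% batch == 0) {                         # end of a batch
    gamma <- gamma + ifelse(acc / batch < 0.44, -1, 1) * min(0.01, 1 / sqrt(t))
    acc <- 0
  }
}
exp(gamma)   # adapted proposal standard deviation
```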

In Practice

Software

- mcmc R package: metrop for the kind of MCMC shown today; temper to handle multi-modality.
- Stan: Hamiltonian Monte Carlo; available from several languages.
- Birch: Sequential Monte Carlo; a brand new and particularly exciting probabilistic programming language.
