Computer intensive statistical methods


1 Lecture 13: MCMC, Hybrid chains October 13, 2015 Jonas Wallin Chalmers, Gothenburg University

3 MH algorithm, Chap. 6.3 The Metropolis-Hastings algorithm requires three objects: the distribution of interest π, a conditional density q (which does not need to depend on π), and an initial value x^(0) (or an initial distribution). The first iteration (the generation of X^(1)) goes as follows:
1 Generate X* ~ q(x | x^(0)).
2 Take X^(1) = X* with probability ρ(x^(0), X*), and X^(1) = x^(0) with probability 1 − ρ(x^(0), X*), where
$$\rho(x, y) = \min\left(\frac{\pi(y)\,q(x \mid y)}{\pi(x)\,q(y \mid x)},\; 1\right).$$
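As a hedged illustration, the two steps above can be sketched in Python (the course code is in R; the standard-normal target and random-walk q below are illustrative choices, not from the slides):

```python
import math
import random

rng = random.Random(1)

def mh(log_pi, draw_q, log_q, x0, n_iter):
    """One-dimensional Metropolis-Hastings.

    Implements rho(x, y) = min(pi(y) q(x|y) / (pi(x) q(y|x)), 1),
    computed on the log scale for numerical stability."""
    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = draw_q(x)                                   # step 1: X* ~ q(. | x)
        log_rho = (log_pi(y) + log_q(x, y)) - (log_pi(x) + log_q(y, x))
        if math.log(rng.random()) < min(log_rho, 0.0):  # step 2: accept w.p. rho
            x = y
        chain.append(x)
    return chain

# Illustration: standard-normal target, normal random walk q(y | x) = N(y; x, 1).
log_pi = lambda x: -0.5 * x * x               # log pi, up to a constant
draw_q = lambda x: x + rng.gauss(0.0, 1.0)
log_q = lambda y, x: -0.5 * (y - x) ** 2      # log q(y | x), up to a constant
chain = mh(log_pi, draw_q, log_q, 0.0, 20000)
mean = sum(chain) / len(chain)
```

Since the random walk is symmetric, the q-terms here cancel; they are kept explicit only to mirror the general formula.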

4 The transition kernel of the Metropolis-Hastings algorithm Recall that
$$P(X^{(t)} \in A \mid X^{(t-1)} = x) = \int_A K(y \mid x)\,dy.$$
Lemma (MH transition kernel). The MH algorithm is a Markov chain with the following transition kernel:
$$K(y \mid x) = \rho(x, y)\,q(y \mid x) + p_R(x)\,\delta_x(y),$$
where $p_R(x) = 1 - \int \rho(x, z)\,q(z \mid x)\,dz$ and $\rho(x, y) = 1 \wedge \frac{\pi(y)\,q(x \mid y)}{\pi(x)\,q(y \mid x)}$. Note that δ_x(y) = δ(y − x) = δ(x − y), where δ is the so-called Dirac delta function.

5 Convergence of the MH algorithm Theorem (stationary distribution of MH*). The chain (X^(k)) generated by the MH sampler has π as its stationary distribution. π-irreducibility depends on q, but as we shall see it is often very easy to verify.

6 Different types of proposal kernels There are a number of different ways of constructing the proposal kernel q. The three main classes are independent proposals, multiplicative proposals, and symmetric proposals.

10 Independent proposal Independent proposals are characterized as follows (recall the accept-reject method). Draw the candidate from q(y), i.e. independently of the current state x. The acceptance probability reduces to
$$\rho(x, y) = 1 \wedge \frac{\pi(y)\,q(x)}{\pi(x)\,q(y)}.$$
Here it is required that $\{x : \pi(x) > 0\} \subseteq \{x : q(x) > 0\}$ to ensure convergence (π-irreducibility). If we take q(x) = π(x), which is of course infeasible in general, the acceptance probability reduces to 1 and we get independent samples from π.

11 Multiplicative proposals An easy way to obtain an asymmetric proposal where the size of the jump depends on the current state x_k is to take X* = εx_k, where ε is drawn from some density p. The proposal kernel now becomes q(y | x) = p(y/x)/x and the acceptance probability becomes
$$\rho(x, y) = 1 \wedge \frac{\pi(y)\,p(x/y)/y}{\pi(x)\,p(y/x)/x}.$$
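A minimal Python sketch of a multiplicative proposal (illustrative choices of my own, not from the slides: an Exponential(1) target with a log-normal multiplier ε):

```python
import math
import random

rng = random.Random(2)
SIGMA = 0.8   # spread of the multiplier (an illustrative choice)

def log_p_eps(e):
    """Log density (up to a constant) of eps = exp(SIGMA * Z), Z ~ N(0,1)."""
    return -math.log(e) - 0.5 * (math.log(e) / SIGMA) ** 2

def log_pi(x):
    """Target: Exponential(1) on (0, inf)."""
    return -x

x, total, n_iter = 1.0, 0.0, 50000
for _ in range(n_iter):
    y = x * math.exp(SIGMA * rng.gauss(0.0, 1.0))   # X* = eps * x
    # rho(x, y) = 1 ^ [pi(y) p(x/y)/y] / [pi(x) p(y/x)/x]
    log_rho = (log_pi(y) + log_p_eps(x / y) - math.log(y)) \
            - (log_pi(x) + log_p_eps(y / x) - math.log(x))
    if math.log(rng.random()) < min(log_rho, 0.0):
        x = y
    total += x
mean = total / n_iter
```

Note that the chain stays on (0, ∞) automatically, which is the main attraction of multiplicative jumps for positive parameters.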

14 Symmetric proposal Symmetric proposals are characterized by the following: it holds that q(y | x) = q(x | y). In this case the acceptance probability reduces to
$$\rho(x, y) = 1 \wedge \frac{\pi(y)}{\pi(x)}.$$
Commonly this corresponds to X* = x_k + ε (random walk proposal) with, e.g., ε ~ N(0, σ²), ε ~ σt(1), ε ~ U[−a, a] or ε ~ U{−a, −a + 1, ..., a − 1, a}.

15 Symmetric proposal for multivariate distributions, algorithm Choose an initial distribution π₀ (π₀(x) = δ(x) for a fixed starting point), and use ε ~ N(0, Σ):

draw X^(0) ~ π₀
for l = 1, ..., N do
  draw X* ~ N(x; X^(l−1), Σ)
  draw U ~ U[0, 1]
  if U ≤ π(X*) / π(X^(l−1)) then
    set X^(l) = X*
  else
    set X^(l) = X^(l−1)
  end if
end for
return X = (X^(0), ..., X^(N))
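The algorithm above, written out in Python for an illustrative bivariate normal target (the correlation 0.5 and step size below are assumed values, not from the slides):

```python
import math
import random

rng = random.Random(3)

def log_pi(x):
    # Bivariate normal: mu = 0, unit variances, correlation r = 0.5 (assumed).
    r = 0.5
    return -0.5 * (x[0] ** 2 - 2 * r * x[0] * x[1] + x[1] ** 2) / (1 - r ** 2)

sigma_mh = 1.0
x = [0.0, 0.0]
chain = []
for _ in range(40000):
    x_star = [xi + rng.gauss(0.0, sigma_mh) for xi in x]  # eps ~ N(0, sigma_MH^2 I)
    # symmetric proposal: accept with probability min(pi(X*) / pi(X^(l-1)), 1)
    if math.log(rng.random()) < min(log_pi(x_star) - log_pi(x), 0.0):
        x = x_star
    chain.append(x)
m0 = sum(c[0] for c in chain) / len(chain)
m1 = sum(c[1] for c in chain) / len(chain)
```

The sample means and cross-moment should recover the target's mean 0 and covariance 0.5.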

16 A simple example Using the symmetric MH algorithm for sampling from a bivariate Gaussian distribution, that is, π(x) = N(x; µ, Σ) with µ = 0. We use a symmetric proposal with ε ~ N(0, σ²_MH · I).

20 Mark-capture-recapture Estimation of population sizes is a tricky problem. Suppose there exist N animals and one observes n animals. If each animal is observed independently, then the likelihood is a binomial distribution, n ~ Bin(N, p), where p is unknown. The issue is that this likelihood does not contain information about N and p jointly. To remedy this issue one marks the captured animals, releases them again, and sees how many are captured the next time.

26 Mark-capture-recapture Suppose one tries to capture a set of animals k times. Each time, the captured animals are marked and released. Let y_i denote the number of times captured animal i is captured. Let N denote the total number of animals, n the number of unique animals captured, and p the probability of capturing an animal at one instance. Then the likelihood can be derived as follows:
1 The probability of detecting n animals is n ~ Bin(N, p_k), where p_k = 1 − (1 − p)^k.
2 Given N and n, the probability of y_i is given by y_i − 1 ~ Bin(k − 1, p).

28 Mark-capture-recapture Thus the posterior distribution of N, p is
$$\pi(N, p \mid y) \propto \binom{N}{n}\,p_k^{\,n}\,(1 - p_k)^{N - n} \prod_{i=1}^{n} p^{y_i - 1}(1 - p)^{k - y_i}\;\pi(N, p).$$
To sample from this distribution one can use an MH algorithm. There exists a large number of possible proposal distributions; we are going to see two:
$$q_1(N', p' \mid N, p) = U\{N - a, \dots, N + a\}\; I_{[0,1]}(p')$$
and
$$q_2(N', p' \mid N, p) = U\{N - a, \dots, N + a\}\; N(p, \sigma_p^2).$$
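A hedged Python sketch of sampling this posterior with proposal q₂ (the capture histories and the flat prior π(N, p) ∝ 1 below are my own assumptions for illustration; the slides use real hare data):

```python
import math
import random

rng = random.Random(4)

# Hypothetical capture histories: y[i] = number of times animal i was captured
# over k = 6 occasions (made-up data, not the hare data set from the slides).
y = [1, 1, 2, 1, 3, 1, 2, 1, 1, 2, 1, 1, 4, 1, 2]
k, n = 6, len(y)

def log_post(N, p):
    """Log posterior pi(N, p | y), here with a flat prior pi(N, p) ~ 1 (assumed)."""
    if N < n or not 0.0 < p < 1.0:
        return -math.inf
    pk = 1.0 - (1.0 - p) ** k            # P(an animal is caught at least once)
    out = math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)  # log C(N,n)
    out += n * math.log(pk) + (N - n) * math.log(1.0 - pk)
    out += sum((yi - 1) * math.log(p) + (k - yi) * math.log(1.0 - p) for yi in y)
    return out

# Proposal q2: N' ~ U{N-a, ..., N+a}, p' ~ N(p, sigma_p^2); both parts are
# symmetric, so the acceptance probability is min(pi(N', p' | y)/pi(N, p | y), 1).
N, p, a, sigma_p = 30, 0.3, 3, 0.05
Ns = []
for _ in range(20000):
    N_star = N + rng.randint(-a, a)
    p_star = p + rng.gauss(0.0, sigma_p)
    if math.log(rng.random()) < min(log_post(N_star, p_star) - log_post(N, p), 0.0):
        N, p = N_star, p_star
    Ns.append(N)
post_mean_N = sum(Ns) / len(Ns)
```

Moves to N < n or p outside (0, 1) get log posterior −∞ and are always rejected, which keeps the chain in the support.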

29 Mark-capture-recapture We apply the model to a real data set where hares were captured and marked six times. [Figures: trace plots of p against iteration, and the density and histogram of N.]

33 Multivariate Gaussian example By far the most used proposal distribution is the MH random walk with a normal proposal. If there is a lot of dependence in the posterior distribution, it is important that the MH proposal captures this.

34 Poisson regression A classical data set (first studied in 1898) is the number of deaths in the Prussian army due to kicks from horses. A question of interest is whether there is a trend in the deaths over time. Bayesian model (using uninformative priors): β_j ~ N(0, 10²), y_i ~ Po(λ_i) with λ_i = exp(X_i β), where X_i = [1, t_i] and t_i is the year the observation was made.

35 Poisson regression, cont. The posterior distribution of β is, through Bayes' formula, proportional to
$$\pi(\beta \mid y, X) \propto \underbrace{e^{-\beta^T\beta/(2\cdot 10^2)}}_{\pi(\beta)} \underbrace{\prod_{i=1}^{n} \exp\big(y_i X_i\beta - \exp(X_i\beta)\big)}_{\pi(y \mid \beta)}.$$
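The log posterior above translates directly to code. A hedged Python sketch (the years and counts below are made up to exercise the function; they are not the horse-kick data):

```python
import math

def log_post_beta(beta, t, y):
    """Unnormalized log posterior for the Poisson regression:
    -beta^T beta / (2 * 10^2) + sum_i [ y_i * X_i beta - exp(X_i beta) ],
    with X_i = [1, t_i]."""
    lp = -(beta[0] ** 2 + beta[1] ** 2) / (2.0 * 10.0 ** 2)  # N(0, 10^2) priors
    for ti, yi in zip(t, y):
        eta = beta[0] + beta[1] * ti                          # X_i beta
        lp += yi * eta - math.exp(eta)
    return lp

# Made-up centered years and counts, purely to exercise the function.
t = [-2, -1, 0, 1, 2]
y = [0, 1, 1, 2, 1]
val = log_post_beta([0.0, 0.0], t, y)   # prior term 0, each data term -1 => -5
```

This function is what a random-walk MH sampler would evaluate at the current and proposed β.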

36 Poisson regression, acf Proposal covariance Σ = d·I. [Figure: ACF of the chain for β₁ against lag.]

38 Poisson regression, hist [Figure: histograms of the components of β.]

39 Poisson regression, acf 2 Proposal covariance Σ = d·Diag(C_π[β]). [Figure: ACF of the chain for β₁ against lag; acf(1) ≈ 0.9.]

41 Poisson regression, acf 3 Proposal covariance Σ = d·C_π[β]. [Figure: ACF of the chain for β₁ against lag.]

46 Choosing Σ For large dimensions, Σ plays a crucial role for convergence. So far I have manually tuned Σ; this is annoying to impossible in higher dimensions. Optimal scaling: Σ = d·C_π[X].

48 Choosing Σ But we don't have C_π[X]. Let's estimate it!
$$\hat\Sigma_N = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T.$$

49 Symmetric proposal for multivariate distributions, adaptive algorithm Choose a Σ₀ and an initial mean x̄:

draw X^(0) ~ π₀
for l = 1, ..., N do
  draw X* ~ N(y; X^(l−1), d·Σ_{l−1})
  draw U ~ U[0, 1]
  if U ≤ π(X*) / π(X^(l−1)) then
    set X^(l) = X*
  else
    set X^(l) = X^(l−1)
  end if
  update x̄ = ((l − 1)/l)·x̄ + (1/l)·X^(l)
  update Σ_l = ((l − 1)/l)·Σ_{l−1} + (1/l)·(X^(l) − x̄)(X^(l) − x̄)^T
end for
return X = (X^(0), ..., X^(N))
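A Python sketch of this adaptive scheme (illustrative target with correlation 0.9; as an assumption of mine, I delay adaptation for the first 100 iterations and add a small jitter εI, a common safeguard the pseudocode does not spell out):

```python
import math
import random

rng = random.Random(6)

def log_pi(x):
    # Illustrative target: bivariate normal, correlation 0.9 (assumed values).
    r = 0.9
    return -0.5 * (x[0] ** 2 - 2 * r * x[0] * x[1] + x[1] ** 2) / (1 - r ** 2)

def chol2(S):
    """Cholesky factor of a 2x2 symmetric positive-definite matrix."""
    l11 = math.sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = math.sqrt(max(S[1][1] - l21 ** 2, 1e-12))
    return l11, l21, l22

d = 2.38 ** 2 / 2     # optimal-scaling constant ~ 2.38^2 / dim
eps = 1e-6            # jitter keeping the proposal covariance positive definite
x = [0.0, 0.0]
xbar = [0.0, 0.0]
Sigma = [[1.0, 0.0], [0.0, 1.0]]            # Sigma_0 = I
for l in range(1, 20001):
    if l <= 100:
        prop = [[1.0, 0.0], [0.0, 1.0]]     # warm-up: fixed Sigma_0
    else:
        prop = [[d * Sigma[0][0] + eps, d * Sigma[0][1]],
                [d * Sigma[1][0], d * Sigma[1][1] + eps]]
    l11, l21, l22 = chol2(prop)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x_star = [x[0] + l11 * z1, x[1] + l21 * z1 + l22 * z2]  # X* ~ N(x, prop)
    if math.log(rng.random()) < min(log_pi(x_star) - log_pi(x), 0.0):
        x = x_star
    # running mean and running covariance, updated every iteration
    xbar = [(l - 1) / l * xbar[i] + x[i] / l for i in range(2)]
    dev = [x[i] - xbar[i] for i in range(2)]
    for i in range(2):
        for j in range(2):
            Sigma[i][j] = (l - 1) / l * Sigma[i][j] + dev[i] * dev[j] / l
```

After many iterations the estimated Σ should pick up the strong positive correlation of the target, so the proposals become aligned with it.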

50 Mathematical issues Issue: this is no longer a Markov chain, but it converges to one. However, the adaptation can sometimes fail (on rare occasions)!

51 Monte Carlo and credibility interval [Figures: the confidence interval for E[β₂ | y₁, ..., y₁₀] for an increasing number of MCMC samples N. Black line: π(β₂ | y₁, ..., y₁₀); blue lines: 95% credibility interval for β₂; red lines: 95% confidence interval of the mean.]

54 Monte Carlo and credibility interval When the amount of data increases, the posterior distribution changes. [Figure: blue line π(β₂ | y₁, ..., y₁₀), red line π(β₂ | y₁, ..., y₁₀₀), black line π(β₂ | y₁, ..., y₁₀₀₀), green line π(β₂ | y₁, ..., y₅₀₀₀).]

57 MH within Gibbs Going back to Gibbs: often Gibbs samplers have better convergence than MH. However, in larger models there are typically one or more parameters for which the conditional distribution is not explicit. An idea is to combine Gibbs with MH, where the intractable conditional sampling is replaced with an MH step.

59 Hybrid chains (theoretical motivation) Assume without loss of generality that we have two blocks and want to sample from π(x₁, x₂). The hybrid MCMC goes like this, given X^(t−1) = x = (x₁^(t−1), x₂^(t−1)):
1 Draw X₁^(t) ~ q(x₁ | x₁^(t−1), x₂^(t−1)), with corresponding MH kernel K_MH for π(x₁ | x₂ = x₂^(t−1)).
2 Draw X₂^(t) ~ π(x₂ | x₁ = x₁^(t)), i.e. a standard Gibbs step.

60 Hybrid chains (theoretical motivation) Theorem (stationary distribution of the hybrid chain*). The chain (X^(k)) generated by the hybrid sampler has π as its stationary distribution.

61 Global balance equation With K(z | x) = π(z₂ | z₁) K_MH(z₁ | x₁, x₂):
Proof.
$$\begin{aligned}
\iint K(z_1, z_2 \mid x_1, x_2)\,\pi(x_1, x_2)\,dx_1\,dx_2
&= \iint \pi(z_2 \mid z_1)\,K_{MH}(z_1 \mid x_1, x_2)\,\pi(x_1 \mid x_2)\,\pi(x_2)\,dx_1\,dx_2 \\
&= \pi(z_2 \mid z_1) \int \left[\int K_{MH}(z_1 \mid x_1, x_2)\,\pi(x_1 \mid x_2)\,dx_1\right] \pi(x_2)\,dx_2 \\
&= \pi(z_2 \mid z_1) \int \pi(z_1 \mid x_2)\,\pi(x_2)\,dx_2 \qquad \text{(global balance for MH)} \\
&= \pi(z_2 \mid z_1)\,\pi(z_1) \qquad \text{(law of total probability)} \\
&= \pi(z_1, z_2).
\end{aligned}$$

63 Example: rats The joint posterior distribution is given by
$$\pi(\theta, \alpha, \beta \mid y) \propto \underbrace{\prod_{i=1}^{n} \theta_i^{y_i}(1 - \theta_i)^{n_i - y_i}}_{\pi(y \mid \theta)} \underbrace{\prod_{i=1}^{n} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta_i^{\alpha - 1}(1 - \theta_i)^{\beta - 1}}_{\pi(\theta \mid \alpha, \beta)} \underbrace{e^{-\alpha - \beta}}_{\pi(\alpha, \beta)}.$$
The (t + 1)-th iteration of the Gibbs sampler follows as
$$\theta_j^{(t+1)} \mid \dots \sim \mathrm{Beta}(\alpha^{(t)} + y_j,\; \beta^{(t)} + n_j - y_j),$$
$$\alpha^{(t+1)} \mid \dots \sim \pi(\alpha \mid \dots) \propto \left(\frac{\Gamma(\alpha + \beta^{(t)})}{\Gamma(\alpha)}\right)^{n} e^{\alpha \sum_j \log \theta_j^{(t+1)}} e^{-\alpha},$$
$$\beta^{(t+1)} \mid \dots \sim \pi(\beta \mid \dots) \propto \left(\frac{\Gamma(\alpha^{(t+1)} + \beta)}{\Gamma(\beta)}\right)^{n} e^{\beta \sum_j \log(1 - \theta_j^{(t+1)})} e^{-\beta}.$$
The densities of α and β are both log-concave. For log-concave densities there are good rejection algorithms.

64 Example: rats (acf) However, it turns out that the acf for α, β is such that each sample is only worth 1/171 of a regular MC sample. [Figure: ACF of α and β against lag.]

65 Hybrid chains, rat data Recall
$$\pi(\theta, \alpha, \beta \mid y) \propto \underbrace{\prod_{i=1}^{n} \theta_i^{y_i}(1 - \theta_i)^{n_i - y_i}}_{\pi(y \mid \theta)} \underbrace{\prod_{i=1}^{n} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta_i^{\alpha - 1}(1 - \theta_i)^{\beta - 1}}_{\pi(\theta \mid \alpha, \beta)} \underbrace{(\alpha + \beta)^{-5/2}}_{\pi(\alpha, \beta)}.$$
The (t + 1)-th iteration of the hybrid sampler follows as
$$\theta_j^{(t+1)} \mid (\alpha^{(t)}, \beta^{(t)}) \sim \mathrm{Beta}(\alpha^{(t)} + y_j,\; \beta^{(t)} + n_j - y_j),$$
$$(\alpha^{(t+1)}, \beta^{(t+1)}) \mid \theta^{(t+1)} \sim \pi(\alpha, \beta \mid \theta^{(t+1)}) \propto \left(\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^{n} e^{\alpha \sum_j \log \theta_j^{(t+1)} + \beta \sum_j \log(1 - \theta_j^{(t+1)})}\,(\alpha + \beta)^{-5/2},$$
where the second step is updated through an MH random walk.

66 R-code R code for the log posterior density of (α, β) given θ:

post_dens_alpha_beta <- function(alpha_beta, theta) {
  if (min(alpha_beta) <= 0) { return(-Inf) }
  alpha <- alpha_beta[1]
  beta <- alpha_beta[2]
  n <- length(theta)
  log_f <- -5/2 * log(alpha + beta)  # prior
  log_f <- log_f + n * (lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta))
  log_f <- log_f + alpha * sum(log(theta)) + beta * sum(log(1 - theta))
  return(log_f)
}

67 R-code The MH within Gibbs:

for (i in 1:sim) {
  theta <- rbeta(n, ab[1] + dat[, 1], ab[2] + dat[, 2] - dat[, 1])
  lik_old <- post_dens_alpha_beta(ab, theta = theta)
  ab_star <- rmvnorm(1, mean = ab, sigma = Sigma)  # from the mvtnorm package
  lik_star <- post_dens_alpha_beta(ab_star, theta = theta)
  if (log(runif(1)) < lik_star - lik_old)
    ab <- ab_star
}

68 Example: rats (acf) The acf for α, β: each sample is now worth 1/40 resp. 1/38 of a regular MC sample. [Figure: ACF of α and β against lag.]

69 Histograms [Figure: histograms of α and β.]

71 Gibbs within Metropolis We have seen that Metropolis within Gibbs works, and also often is quite easy to implement. However, if there is strong dependence between parameters it is often better to use Gibbs within Metropolis, or block updating (although often a bit harder to implement).

75 Gibbs within Metropolis (block sampler) Suppose we want to sample (α, θ), and we can sample directly from θ | α. The t-th iteration of the Gibbs within MH then goes as follows:
1 Sample α* ~ N(α, σ²_MH).
2 Sample θ* ~ π(· | α*).
3 Accept or reject (α*, θ*) jointly:
$$(\alpha, \theta)^{(t)} = \begin{cases} (\alpha^*, \theta^*) & \text{with probability } \rho((\alpha, \theta)^{(t-1)}, (\alpha^*, \theta^*)), \\ (\alpha, \theta)^{(t-1)} & \text{with probability } 1 - \rho((\alpha, \theta)^{(t-1)}, (\alpha^*, \theta^*)), \end{cases}$$
where
$$\rho((\alpha, \theta), (\alpha^*, \theta^*)) = \min\left(\frac{\pi(\alpha^*, \theta^*)\,\pi(\theta \mid \alpha)}{\pi(\alpha, \theta)\,\pi(\theta^* \mid \alpha^*)},\; 1\right).$$
Of course one can generalize this beyond one dimension and the MH random-walk kernel, since all we do is construct a complicated proposal kernel.
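A minimal Python sketch of this block sampler on a toy model of my own (an illustrative stand-in, not the rats model), chosen so the exact posterior means are known:

```python
import math
import random

rng = random.Random(7)

# Toy hierarchical model:
#   alpha ~ N(0, 1),  theta | alpha ~ N(alpha, 1),  y | theta ~ N(theta, 1),
# with observed y = 1. Exact posterior means: E[theta|y] = 2/3, E[alpha|y] = 1/3.
y_obs = 1.0
log_prior_alpha = lambda a: -0.5 * a * a
log_lik = lambda th: -0.5 * (y_obs - th) ** 2

# Block proposal: alpha* ~ N(alpha, s^2), theta* ~ pi(theta | alpha*).
# In the acceptance ratio the pi(theta | alpha) factors cancel against the
# proposal, leaving pi(alpha*) L(theta*) / (pi(alpha) L(theta)).
alpha, theta, s = 0.0, 0.0, 1.0
n_iter = 40000
sum_a = sum_t = 0.0
for _ in range(n_iter):
    alpha_star = alpha + rng.gauss(0.0, s)           # RW step on alpha
    theta_star = alpha_star + rng.gauss(0.0, 1.0)    # theta* ~ pi(. | alpha*)
    log_rho = (log_prior_alpha(alpha_star) + log_lik(theta_star)) \
            - (log_prior_alpha(alpha) + log_lik(theta))
    if math.log(rng.random()) < min(log_rho, 0.0):   # joint accept/reject
        alpha, theta = alpha_star, theta_star
    sum_a += alpha
    sum_t += theta
mean_alpha, mean_theta = sum_a / n_iter, sum_t / n_iter
```

The cancellation in the acceptance ratio is the whole point of the construction: the exactly-sampled block drops out, and only the remaining factors of the posterior are compared.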

76 R-code gibbs_within_mh.r

77 Example: rats (acf) again However, it turns out that the acf for α, β is such that each sample is worth 1/14 resp. 1/14 of a regular MC sample, a big difference from 1/171. [Figure: ACF of α and β against lag.]

78 Theorem (Central limit theorem) Let X^(t) be a geometrically ergodic Markov chain with stationary distribution π. Then for any h such that E_π[|h(X)|^{2+ε}] < ∞ for some ε > 0,
$$\sqrt{T}\,\big(\tau_T - E_\pi[h(X)]\big) \xrightarrow{d} N(0, \sigma_h^2), \qquad \tau_T = \frac{1}{T}\sum_{t=1}^{T} h(X^{(t)}).$$
Under some additional assumptions,
$$\sigma_h^2 = V_\pi[h(X_0)] + 2\sum_{i=1}^{\infty} C_\pi[h(X_0), h(X_i)].$$

79 Effective sample size, MCMC style
1 σ²_h is typically not known.
2 Sometimes an AR(1) process is a good approximation of the behavior of an MCMC chain.
3 Approximate the Markov chain h(X^(t)) with an AR(1) process, so that Corr(h(X^(t)), h(X^(t+k))) = ρ^k.
4 Then the variance of the estimator is
$$V\left[\frac{1}{T}\sum_{i=1}^{T} h(X^{(i)})\right] \approx \frac{1 + \rho}{1 - \rho}\cdot\frac{1}{T}\, V[h(X^{(t)})].$$
Then T·(1 − ρ)/(1 + ρ) is the effective sample size.
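The AR(1) effective-sample-size recipe in Python (a sketch; the synthetic AR(1) chain is only there to check the formula against its known answer):

```python
import random

def ess_ar1(chain):
    """Effective sample size T * (1 - rho) / (1 + rho), with rho estimated
    as the lag-1 autocorrelation of the chain (AR(1) approximation)."""
    T = len(chain)
    m = sum(chain) / T
    var = sum((c - m) ** 2 for c in chain) / T
    cov1 = sum((chain[i] - m) * (chain[i + 1] - m) for i in range(T - 1)) / T
    rho = cov1 / var
    return T * (1 - rho) / (1 + rho)

# Synthetic check: an AR(1) chain with rho = 0.5 has ESS ~ T*(1-0.5)/(1+0.5) = T/3.
rng = random.Random(8)
x, chain = 0.0, []
for _ in range(10000):
    x = 0.5 * x + rng.gauss(0.0, 1.0)
    chain.append(x)
ess = ess_ar1(chain)
```

For an i.i.d. chain ρ ≈ 0 and the ESS is close to T, as expected.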

80 Estimating the asymptotic variance using blocking An alternative is to block the data as follows:
$$\tau_N = \frac{1}{T}\sum_{k=1}^{T} h(X_k) = \frac{1}{n}\sum_{l=1}^{n} Y_l,$$
where G = T/n is the block length and
$$Y_l \stackrel{\text{def}}{=} \frac{1}{G}\sum_{m=(l-1)G+1}^{lG} h(X_m), \qquad l = 1, 2, \dots, n.$$
If the blocks are large enough we can view these as close to independent and identically distributed.

81 Estimating the asymptotic variance using blocking (cont.) We may thus expect the CLT to hold at least approximately, implying that
$$V(\tau_N) = V\left(\frac{1}{n}\sum_{l=1}^{n} Y_l\right) \approx \frac{V(Y_1)}{n},$$
where V(Y₁) can be estimated using the standard estimator
$$V(Y_1) \approx \frac{1}{n - 1}\sum_{m=1}^{n} (Y_m - \bar{Y}_n)^2,$$
with $\bar{Y}_n = \sum_{m=1}^{n} Y_m / n$ denoting the sample mean.
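The full blocking estimate in Python (a sketch, checked on a synthetic AR(1) chain whose asymptotic variance is known; block length G = 100 as on the next slide):

```python
import random

def blocked_var(h_chain, G=100):
    """Blocking estimate of V(tau_N): form block means Y_l over blocks of
    length G, then return var(Y) / n using the (n-1)-denominator estimator."""
    n = len(h_chain) // G
    Y = [sum(h_chain[l * G:(l + 1) * G]) / G for l in range(n)]
    Ybar = sum(Y) / n
    varY = sum((yl - Ybar) ** 2 for yl in Y) / (n - 1)
    return varY / n

# Synthetic check on an AR(1) chain with rho = 0.5: here sigma_h^2 = 4,
# so V(tau_N) should be about 4 / T.
rng = random.Random(9)
x, chain = 0.0, []
for _ in range(100000):
    x = 0.5 * x + rng.gauss(0.0, 1.0)
    chain.append(x)
v = blocked_var(chain)
```

Any leftover samples beyond n·G are simply dropped, mirroring the `floor(T/G)` in the R code.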

82 Estimating the asymptotic variance using blocking in R

T <- length(X)
G <- 100
n <- floor(T / G)
Y <- sapply(1:n, function(k) mean(X[((k - 1) * G + 1):(k * G)]))

83 Blocking, rats (cont.) G = 100 for the Gibbs within MH for α in the rats example. [Figure: ACF of the block means for α against lag.]


More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture February Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 13-28 February 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Limitations of Gibbs sampling. Metropolis-Hastings algorithm. Proof

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

Beta statistics. Keywords. Bayes theorem. Bayes rule

Beta statistics. Keywords. Bayes theorem. Bayes rule Keywords Beta statistics Tommy Norberg tommy@chalmers.se Mathematical Sciences Chalmers University of Technology Gothenburg, SWEDEN Bayes s formula Prior density Likelihood Posterior density Conjugate

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02 Spring 13 1 Recap: Monte Carlo Method If U is a universe of items, and G is a subset satisfying some property,

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

Control Variates for Markov Chain Monte Carlo

Control Variates for Markov Chain Monte Carlo Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Introduction to Bayesian Statistics 1

Introduction to Bayesian Statistics 1 Introduction to Bayesian Statistics 1 STA 442/2101 Fall 2018 1 This slide show is an open-source document. See last slide for copyright information. 1 / 42 Thomas Bayes (1701-1761) Image from the Wikipedia

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods John Geweke University of Iowa, USA 2005 Institute on Computational Economics University of Chicago - Argonne National Laboaratories July 22, 2005 The problem p (θ, ω I)

More information

Bayesian Methods and Uncertainty Quantification for Nonlinear Inverse Problems

Bayesian Methods and Uncertainty Quantification for Nonlinear Inverse Problems Bayesian Methods and Uncertainty Quantification for Nonlinear Inverse Problems John Bardsley, University of Montana Collaborators: H. Haario, J. Kaipio, M. Laine, Y. Marzouk, A. Seppänen, A. Solonen, Z.

More information

Down by the Bayes, where the Watermelons Grow

Down by the Bayes, where the Watermelons Grow Down by the Bayes, where the Watermelons Grow A Bayesian example using SAS SUAVe: Victoria SAS User Group Meeting November 21, 2017 Peter K. Ott, M.Sc., P.Stat. Strategic Analysis 1 Outline 1. Motivating

More information

Markov Chain Monte Carlo, Numerical Integration

Markov Chain Monte Carlo, Numerical Integration Markov Chain Monte Carlo, Numerical Integration (See Statistics) Trevor Gallen Fall 2015 1 / 1 Agenda Numerical Integration: MCMC methods Estimating Markov Chains Estimating latent variables 2 / 1 Numerical

More information

The Recycling Gibbs Sampler for Efficient Learning

The Recycling Gibbs Sampler for Efficient Learning The Recycling Gibbs Sampler for Efficient Learning L. Martino, V. Elvira, G. Camps-Valls Universidade de São Paulo, São Carlos (Brazil). Télécom ParisTech, Université Paris-Saclay. (France), Universidad

More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Likelihood-free inference and approximate Bayesian computation for stochastic modelling

Likelihood-free inference and approximate Bayesian computation for stochastic modelling Likelihood-free inference and approximate Bayesian computation for stochastic modelling Master Thesis April of 2013 September of 2013 Written by Oskar Nilsson Supervised by Umberto Picchini Centre for

More information

MSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at

MSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at MSc MT15. Further Statistical Methods: MCMC Lecture 5-6: Markov chains; Metropolis Hastings MCMC Notes and Practicals available at www.stats.ox.ac.uk\ nicholls\mscmcmc15 Markov chain Monte Carlo Methods

More information

Advanced Statistical Modelling

Advanced Statistical Modelling Markov chain Monte Carlo (MCMC) Methods and Their Applications in Bayesian Statistics School of Technology and Business Studies/Statistics Dalarna University Borlänge, Sweden. Feb. 05, 2014. Outlines 1

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL

PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL PSEUDO-MARGINAL METROPOLIS-HASTINGS APPROACH AND ITS APPLICATION TO BAYESIAN COPULA MODEL Xuebin Zheng Supervisor: Associate Professor Josef Dick Co-Supervisor: Dr. David Gunawan School of Mathematics

More information

Markov-Chain Monte Carlo

Markov-Chain Monte Carlo Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. References Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can

More information

F denotes cumulative density. denotes probability density function; (.)

F denotes cumulative density. denotes probability density function; (.) BAYESIAN ANALYSIS: FOREWORDS Notation. System means the real thing and a model is an assumed mathematical form for the system.. he probability model class M contains the set of the all admissible models

More information

Lattice Gaussian Sampling with Markov Chain Monte Carlo (MCMC)

Lattice Gaussian Sampling with Markov Chain Monte Carlo (MCMC) Lattice Gaussian Sampling with Markov Chain Monte Carlo (MCMC) Cong Ling Imperial College London aaa joint work with Zheng Wang (Huawei Technologies Shanghai) aaa September 20, 2016 Cong Ling (ICL) MCMC

More information

13 Notes on Markov Chain Monte Carlo

13 Notes on Markov Chain Monte Carlo 13 Notes on Markov Chain Monte Carlo Markov Chain Monte Carlo is a big, and currently very rapidly developing, subject in statistical computation. Many complex and multivariate types of random data, useful

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference

Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference Charles J. Geyer April 6, 2009 1 The Problem This is an example of an application of Bayes rule that requires some form of computer analysis.

More information

Sampling from complex probability distributions

Sampling from complex probability distributions Sampling from complex probability distributions Louis J. M. Aslett (louis.aslett@durham.ac.uk) Department of Mathematical Sciences Durham University UTOPIAE Training School II 4 July 2017 1/37 Motivation

More information

References. Markov-Chain Monte Carlo. Recall: Sampling Motivation. Problem. Recall: Sampling Methods. CSE586 Computer Vision II

References. Markov-Chain Monte Carlo. Recall: Sampling Motivation. Problem. Recall: Sampling Methods. CSE586 Computer Vision II References Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can

More information

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition Preface Preface to the First Edition xi xiii 1 Basic Probability Theory 1 1.1 Introduction 1 1.2 Sample Spaces and Events 3 1.3 The Axioms of Probability 7 1.4 Finite Sample Spaces and Combinatorics 15

More information

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics

Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Minicourse on: Markov Chain Monte Carlo: Simulation Techniques in Statistics Eric Slud, Statistics Program Lecture 1: Metropolis-Hastings Algorithm, plus background in Simulation and Markov Chains. Lecture

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

Simulation - Lectures - Part III Markov chain Monte Carlo

Simulation - Lectures - Part III Markov chain Monte Carlo Simulation - Lectures - Part III Markov chain Monte Carlo Julien Berestycki Part A Simulation and Statistical Programming Hilary Term 2018 Part A Simulation. HT 2018. J. Berestycki. 1 / 50 Outline Markov

More information

Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo

Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo Introduction to Computational Biology Lecture # 14: MCMC - Markov Chain Monte Carlo Assaf Weiner Tuesday, March 13, 2007 1 Introduction Today we will return to the motif finding problem, in lecture 10

More information