Computer intensive statistical methods
Lecture 11: Markov Chain Monte Carlo (cont.)
October 6, 2015
Jonas Wallin, Chalmers, Gothenburg University
The two stage Gibbs sampler

If the conditional distributions are easy to sample from, one can use the Gibbs sampler:

Start with $X^{(0)}_{1:2} = (X^{(0)}_1, X^{(0)}_2)$
for $l = 1, \ldots, N$ do
    draw $X^{(l)}_1 \sim \pi(x_1 \mid X^{(l-1)}_2)$
    draw $X^{(l)}_2 \sim \pi(x_2 \mid X^{(l)}_1)$
end for
return $X = (X^{(0)}_{1:2}, \ldots, X^{(N)}_{1:2})$

The output of the algorithm is a Markov chain.
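As an illustration (my own sketch, not from the slides), here is the two-stage Gibbs sampler for a standard bivariate normal target with correlation $\rho$, where both full conditionals are normal:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Two-stage Gibbs sampler for a standard bivariate normal with
    correlation rho; each full conditional is N(rho * other, 1 - rho^2)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0                     # X^(0)
    chain = [(x1, x2)]
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)      # draw X1 | X2 = x2
        x2 = rng.gauss(rho * x1, sd)      # draw X2 | X1 = x1
        chain.append((x1, x2))
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n_iter=20000)
mean_x1 = sum(x for x, _ in chain) / len(chain)
```

The returned sequence is a realization of a Markov chain whose stationary distribution is the joint target.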
Repetition: Markov chains

A Markov chain on $\chi \subseteq \mathbb{R}^d$ is a family of random variables (a stochastic process) $(X_k)_{k \ge 0}$, where each $X_k$ takes values in $\chi$, and
$$P(X_{k+1} \in B \mid X_0, X_1, \ldots, X_k) = P(X_{k+1} \in B \mid X_k).$$
The density $q$ of the distribution of $X_{k+1}$ given $X_k = x$ is called the transition density of $(X_k)$. Consequently,
$$P(X_{k+1} \in B \mid X_k = x_k) = \int_B q(x_{k+1} \mid x_k)\, dx_{k+1}.$$
As a first example we considered an AR(1) process: $X_0 = 0$, $X_{k+1} = \alpha X_k + \epsilon_{k+1}$, where $\alpha$ is a constant and $(\epsilon_k)$ are i.i.d. variables.
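This AR(1) chain is easy to simulate; a small sketch (my own, with standard normal innovations) illustrating that for $|\alpha| < 1$ the chain settles around the stationary variance $1/(1 - \alpha^2)$:

```python
import random

def simulate_ar1(alpha, n, seed=1):
    """Simulate X_0 = 0, X_{k+1} = alpha * X_k + eps_{k+1}, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(n):
        x = alpha * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

path = simulate_ar1(alpha=0.5, n=50000)
burn = path[1000:]                      # discard the transient from X_0 = 0
var_hat = sum(v * v for v in burn) / len(burn)
# for alpha = 0.5 the stationary variance is 1 / (1 - 0.25) = 4/3
```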
Stationary Markov chains

A distribution $\pi(x)$ is said to be stationary if
$$\int q(x \mid z)\,\pi(z)\,dz = \pi(x) \quad \text{(global balance)}.$$

For a stationary distribution $\pi$ it holds that, with $f_0 = \pi$,
$$f_1(x_1) = \int q(x_1 \mid x_0) f_0(x_0)\,dx_0 = \int q(x_1 \mid x_0)\pi(x_0)\,dx_0 = \pi(x_1),$$
$$f_2(x_2) = \int q(x_2 \mid x_1) f_1(x_1)\,dx_1 = \int q(x_2 \mid x_1)\pi(x_1)\,dx_1 = \pi(x_2),$$
$$\ldots$$
$$f_n(x_n) = \pi(x_n) \quad \text{for all } n.$$

Thus, if the chain starts in the stationary distribution, it will always stay in the stationary distribution. In this case we also call the chain stationary.
Stationary distribution of the Gibbs sampler*

Theorem (Global balance, Gibbs). The joint density $\pi$ is a stationary distribution for the Markov chain $X^{(t)}$ generated by the Gibbs sampler.
Irreducibility

π-irreducibility. A Markov chain is said to be π-irreducible if for all points $x \in \chi$ and all measurable sets $A$ with $\pi(A) = \int_A \pi(x)\,dx > 0$ there exists some $t$ such that
$$\int_A q^t(y \mid x)\,dy > 0.$$
If this condition holds with $t = 1$, then the chain is said to be strongly π-irreducible.
π-irreducibility fails

π-irreducibility is not always satisfied for the Gibbs sampler. (Figure: density plots of a target distribution for which the Gibbs sampler is not π-irreducible.)
π-irreducibility: sufficient conditions

Positivity condition. A distribution with density $\pi(x_1, \ldots, x_d)$ satisfies the positivity condition if $\pi(x_1, \ldots, x_d) > 0$ for all $x_1, \ldots, x_d$ with $\pi(x_i) > 0$.

Proposition. If $\pi$ satisfies the positivity condition, the Gibbs sampler generates a π-irreducible (and recurrent) Markov chain.
π-irreducibility: sufficient conditions

The positivity condition is sufficient, but not necessary. (Figure: density plots of a target that violates positivity but for which the Gibbs sampler is still π-irreducible.)
Korsbetning

In 1361 the Danish king Valdemar Atterdag conquered Gotland and captured the rich Hanseatic town of Visby. Later the gravesite was excavated, and a total of 493 femurs (237 right, 256 left) were found. How many people were buried there?
Korsbetning: model

A reasonable (?) prior could be
$$N \sim U\{260, \ldots, 2000\}, \qquad p \sim \text{Beta}(2, 2).$$

The likelihood given the parameters is then
$$\pi(y \mid N, p) = \text{Bin}(y_1; N, p)\,\text{Bin}(y_2; N, p).$$
Korsbetning: Gibbs

To generate samples from the posterior we use a Gibbs sampler.

First, the conditional distribution for $p$ is given by
$$\pi(p \mid N, y) = \text{Beta}(p;\; y_1 + y_2 + 2,\; 2N - y_1 - y_2 + 2).$$

The conditional distribution of $N$ is
$$\pi(N \mid y, p) \propto \text{Bin}(y_1; N, p)\,\text{Bin}(y_2; N, p).$$
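A sketch of this Gibbs sampler in Python (my own implementation, not from the slides; $y_1 = 237$, $y_2 = 256$ are the femur counts, and the $N$-update samples the discrete conditional over the prior support $\{260, \ldots, 2000\}$):

```python
import math
import random

y1, y2 = 237, 256                       # right and left femurs
N_grid = range(260, 2001)               # prior support for N
rng = random.Random(0)

# n-dependent part of log Bin(y1; n, p) + log Bin(y2; n, p), precomputed
# once; the terms -lgamma(y1+1) - lgamma(y2+1) cancel in the normalization
log_coef = [2.0 * math.lgamma(n + 1)
            - math.lgamma(n - y1 + 1) - math.lgamma(n - y2 + 1)
            for n in N_grid]

def sample_N(p):
    """Draw N from pi(N | y, p) ∝ Bin(y1; N, p) Bin(y2; N, p) on the grid."""
    l1p = math.log(1.0 - p)
    # the common factor p^(y1 + y2) also cancels after normalization
    logw = [c + (2 * n - y1 - y2) * l1p for n, c in zip(N_grid, log_coef)]
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    u, s = rng.random() * sum(w), 0.0
    for n, wi in zip(N_grid, w):
        s += wi
        if s >= u:
            return n
    return N_grid[-1]

N = 1000                                # initial state
Ns = []
for _ in range(1000):
    p = rng.betavariate(y1 + y2 + 2, 2 * N - y1 - y2 + 2)   # p | N, y
    N = sample_N(p)                                          # N | y, p
    Ns.append(N)

post_mean_N = sum(Ns[200:]) / len(Ns[200:])
```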
Korsbetning: results

The posterior distribution of $N$ is shown below. (Figure: posterior density of $N$, the number of persons.)
Geometric ergodicity

Definition (Geometric ergodicity). $X^{(t)}$ is a geometrically ergodic Markov chain if there exist $\rho < 1$ and a function $C_x > 0$ such that, if $X^{(0)} = x$, then
$$\sup_{A \subseteq \chi} \left| P(X^{(t)} \in A) - \pi(A) \right| \le C_x \rho^t.$$
If there exists a constant $C > 0$ such that $C_x < C$ for all $x$, then the Markov chain is uniformly ergodic.

Uniform ergodicity of a Markov chain (often) implies that if $E_\pi[h^2(X)] < \infty$ then
$$\left| C(h(X^{(m)}), h(X^{(n)})) \right| \le C \rho^{|n - m|}.$$
Law of Large Numbers v2

Geometric ergodicity gives an LLN; we will however show a weak version of the LLN:

Theorem (Law of large numbers for Markov chains). Let $X^{(t)}$ be a stationary Markov chain with stationary distribution $\pi$, and $X^{(0)} \sim \pi$. If $h$ is a function such that
$$\left| C[h(X^{(k)}), h(X^{(i)})] \right| \le C\rho^{|k - i|} \quad \text{for } k \ne i,$$
then
$$\tau_T \stackrel{\text{def}}{=} \frac{1}{T}\sum_{i=1}^{T} h(X^{(i)}) \stackrel{P}{\longrightarrow} E_\pi(h(X)).$$
Proof of LLN

Proof. First, by stationarity,
$$E[\tau_T] = E\Big[\frac{1}{T}\sum_{i=1}^{T} h(X^{(i)})\Big] = \frac{1}{T}\sum_{i=1}^{T} E_\pi[h(X)] = E_\pi[h(X)].$$
Second, we bound the variance. Denote $S_T = \sum_{i=1}^{T} h(X^{(i)})$ (so that $\tau_T = S_T / T$) and $\mu = E_\pi[h(X)]$; then
$$V[\tau_T] = E\Big[\frac{1}{T^2}(S_T - T\mu)^2\Big] = \frac{1}{T^2}\sum_{i=1}^{T}\sum_{j=1}^{T} E[(h(X^{(i)}) - \mu)(h(X^{(j)}) - \mu)].$$
Proof of LLN, cont.

Proof. By ergodicity,
$$\frac{1}{T^2}\sum_{i=1}^{T}\sum_{j=1}^{T} E[(h(X^{(i)}) - \mu)(h(X^{(j)}) - \mu)] \le \frac{1}{T^2}\sum_{i=1}^{T}\sum_{j=1}^{T} C\rho^{|i - j|} \le \frac{1}{T^2}\sum_{j=1}^{T} C_1 = \frac{C_1}{T}.$$
Here $C_1, C$ are positive constants. The LLN now follows directly from Chebyshev's inequality.
CLT

Theorem (Central limit theorem). Let $X^{(t)}$ be a geometrically ergodic Markov chain with stationary distribution $\pi$. Then for any $h$ such that $E_\pi[|h(X)|^{2+\epsilon}] < \infty$, where $\epsilon > 0$,
$$\sqrt{T}\,(\tau_T - E_\pi(h(X))) \longrightarrow N(0, \sigma_h^2),$$
where, under some additional assumptions,
$$\sigma_h^2 = V_\pi[h(X_0)] + 2\sum_{i=1}^{\infty} C_\pi[h(X_0), h(X_i)].$$

As a side note, Häggström (2005) showed that the $\epsilon$ cannot be dropped.
Effective sample size, MCMC style

1. $\sigma_h^2$ is typically not known.
2. Often an AR(1) process is a good approximation of the behavior of an MCMC chain.
3. Approximate the Markov chain $h(X^{(t)})$ with an AR(1) process, where $C(h(X^{(t)}), h(X^{(t+h)})) = \rho^h$ (here $C$ is the correlation function).
4. Then the variance of the estimator satisfies
$$V\Big[\frac{1}{T}\sum_{i=1}^{T} h(X^{(i)})\Big] \approx \frac{1 + \rho}{1 - \rho} \cdot \frac{1}{T}\, V[h(X^{(t)})].$$

Then $T\,\frac{1 - \rho}{1 + \rho}$ is the effective sample size.
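A minimal sketch (my own) of this AR(1)-based effective sample size, estimating $\rho$ by the lag-1 sample autocorrelation:

```python
import random

def ess_ar1(x):
    """Effective sample size T * (1 - rho) / (1 + rho), with rho estimated
    by the lag-1 sample autocorrelation of the chain x."""
    T = len(x)
    mean = sum(x) / T
    c0 = sum((v - mean) ** 2 for v in x) / T
    c1 = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(T - 1)) / T
    rho = c1 / c0
    return T * (1.0 - rho) / (1.0 + rho)

rng = random.Random(0)

# near-independent chain: rho ~ 0, so ESS ~ T
iid = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ess_iid = ess_ar1(iid)

# strongly autocorrelated AR(1) chain: ESS is much smaller than T
x, ar = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    ar.append(x)
ess_ar = ess_ar1(ar)
```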
Korsbetning: acf

Going back to the Gibbs sampler for $N, p$: the acf (acf in R) for $N$ looks as follows. (Figure: ACF of the $N$ chain against lag.)

From the fitted $\rho$, the factor $\frac{1 - \rho}{1 + \rho}$ tells how many independent samples one MCMC sample is worth.
Data augmentation MCMC

Suppose we augment with $z$; then we know that
$$\pi(\theta \mid y) = \int \pi(\theta, z \mid y)\,dz.$$
To solve this integral using MCMC, one samples the Markov chain $(\theta, z)^{(t)}$ but stores only $\theta^{(t)}$. Done! Of course, there is no free lunch: adding extra variables increases the variance.
Censored data

Censored data occur frequently in statistics, especially in survival analysis. Suppose $y$ follows a standard distribution $f(y \mid \theta)$ (with corresponding, unknown or intractable, probability $P_\theta(A) = P(y \in A \mid \theta)$), but the data are only observed up to $\alpha$.

The posterior distribution for $\theta$ is then
$$\pi(\theta \mid y) \propto \pi(\theta) \prod_{i=1}^{n-m} f(y_i \mid \theta) \prod_{j=1}^{m} P_\theta(y \ge \alpha),$$
where $m$ is the number of censored observations. This is typically not possible to sample from.

If one observed the censored components, sampling $\theta$ would be simple. Thus introduce the augmented variables $z_j$, which correspond to the actual values of the censored observations: $Z_j = y_j \mid y_j > \alpha$ (slight abuse of index notation).
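As an illustration (my own example, not from the slides), for an Exponential($\theta$) lifetime model the augmented draw $Z_j = y_j \mid y_j > \alpha$ can be sampled exactly: by the memoryless property, $Z_j$ is $\alpha$ plus a fresh Exponential($\theta$) draw:

```python
import random

rng = random.Random(0)

def draw_censored_exponential(theta, alpha, size):
    """Draw Z ~ Exp(theta) conditioned on Z > alpha; by the memoryless
    property this is alpha plus a fresh Exp(theta) draw."""
    return [alpha + rng.expovariate(theta) for _ in range(size)]

z = draw_censored_exponential(theta=2.0, alpha=1.0, size=10000)
mean_z = sum(z) / len(z)   # should be close to alpha + 1/theta = 1.5
```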
Censored regression (Tobit)

Censored regression is a regression model that is applied when the response is only observed within a certain region. This can be modeled as
$$y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I),$$
with
$$y_i = \begin{cases} z_i & \text{if } z_i \in [a, b], \\ b & \text{if } z_i \ge b, \\ a & \text{if } z_i \le a. \end{cases}$$

This model fits perfectly into data augmentation when augmenting with $z$.
Ignoring the censoring

To make inference from a Bayesian model we set
$$\pi(\beta) = N(\beta; 0, \Sigma_\beta), \qquad \pi(\sigma^2) = IG(\sigma^2; \alpha, \beta).$$
The first thing we do is remove the censored observations.
Example data set

Assume $X_i = [1, t_i]$. (Figure: scatter plot of $y$ against $t$.)
Censored regression example

To include the censored variables, we apply the following Gibbs sampler:
Sample $z_i \mid \beta, \sigma^2 \sim N_{z > b}(X_i\beta, \sigma^2)$ if $y_i = b$ (a normal truncated to $z > b$),
Sample $\beta, \sigma^2 \mid z$.
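A sketch (my own) of the $z$-step for an observation censored at the upper bound $b$: draw from $N(X_i\beta, \sigma^2)$ truncated to $z > b$ by naive rejection sampling. This is fine when $b$ is not far out in the tail; otherwise dedicated truncated-normal samplers are preferable:

```python
import random

rng = random.Random(0)

def truncated_normal_above(mu, sigma, b, max_tries=100000):
    """Draw z ~ N(mu, sigma^2) conditioned on z > b by rejection sampling."""
    for _ in range(max_tries):
        z = rng.gauss(mu, sigma)
        if z > b:
            return z
    raise RuntimeError("truncation point too far in the tail for naive rejection")

# e.g. the z-step for one censored observation with X_i beta = 0, sigma = 1, b = 0.5
draws = [truncated_normal_above(mu=0.0, sigma=1.0, b=0.5) for _ in range(5000)]
mean_draw = sum(draws) / len(draws)
```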
Posterior distribution

(Figure: posterior densities of $\beta_1$ and $\beta_2$.)
Hierarchical model: definition

Often the parameters in a model are connected by some structure. For a statistical model to be reasonable, it should be able to incorporate such structures.

Recall that we have
$$\pi(y, \theta) = \pi(y \mid \theta)\pi(\theta);$$
we can impose a prior on the prior by
$$\pi(y, \theta) = \pi(y \mid \theta_1)\pi(\theta_1 \mid \theta_0)\pi(\theta_0).$$
The prior $\pi(\theta_0)$ is often called a hyperprior.
Hierarchical model: example

A classical data set is the study of rat tumors: 71 experiments are conducted, and afterwards one studies the number of rats that developed a tumor.

One could estimate each experiment independently:
$$\pi(\theta \mid y) \propto \prod_{i=1}^{n} \text{Binom}(n_i, y_i, \theta_i)\pi(\theta_i);$$
however, this leads to problems, since the number of observations per experiment is small.

A simplistic approach is to assume one joint risk for all experiments:
$$\pi(\theta \mid y) \propto \pi(\theta) \prod_{i=1}^{n} \text{Binom}(n_i, y_i, \theta);$$
this is easy to fit, but too simplistic: it ignores the variation among the experiments.
Independent model DAG

$\theta_1 \;\; \theta_2 \;\; \ldots \;\; \theta_{70} \;\; \theta_{71} \;\; \theta_{72}$
$y_1 \;\; y_2 \;\; \ldots \;\; y_{70} \;\; y_{71}$

$$\pi(\theta \mid y) \propto \pi(\theta_{72}) \prod_{i=1}^{71} \text{Binom}(n_i, y_i, \theta_i)\pi(\theta_i).$$

It is hard to make inference and prediction for $\theta_{72}$ (a future experiment) or $y_{72}$ (a future observation).
Joint model DAG

$\theta$
$y_1 \;\; y_2 \;\; \ldots \;\; y_{70} \;\; y_{71}$

$$\pi(\theta \mid y) \propto \pi(\theta) \prod_{i=1}^{71} \text{Binom}(n_i, y_i, \theta).$$

Too simplistic: it does not allow for variation among the experiments.
Hierarchical model DAG

$\alpha, \beta$
$\theta_1 \;\; \theta_2 \;\; \ldots \;\; \theta_{70} \;\; \theta_{71} \;\; \theta_{72}$
$y_1 \;\; y_2 \;\; \ldots \;\; y_{70} \;\; y_{71}$

$$\pi(\alpha, \beta) \propto \exp(-\alpha - \beta), \qquad \pi(\theta_i) = \text{Beta}(\alpha, \beta), \qquad \pi(y \mid \theta) = \prod_{i=1}^{n} \text{Binom}(n_i, y_i, \theta_i),$$
$$\pi(\theta \mid y) \propto \int \prod_{i=1}^{71} \text{Binom}(n_i, y_i, \theta_i)\,\pi(\theta_i \mid \alpha, \beta)\,\pi(\alpha, \beta)\,d\alpha\,d\beta.$$
Hierarchical model DAG, compact

$\alpha \quad \beta$
$\theta_j \to y_j, \quad j = 1, \ldots, 71$

$$\pi(\theta \mid y) \propto \int \prod_{i=1}^{71} \text{Binom}(n_i, y_i, \theta_i)\,\pi(\theta_i \mid \alpha, \beta)\,\pi(\alpha, \beta)\,d\alpha\,d\beta.$$
Example: rats

The joint posterior distribution is given by
$$\pi(\theta, \alpha, \beta \mid y) \propto \underbrace{\prod_{i=1}^{n} \theta_i^{y_i}(1 - \theta_i)^{n_i - y_i}}_{\pi(y \mid \theta)} \; \underbrace{\prod_{i=1}^{n} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_i^{\alpha - 1}(1 - \theta_i)^{\beta - 1}}_{\pi(\theta \mid \alpha, \beta)} \; \underbrace{e^{-\alpha - \beta}}_{\pi(\alpha, \beta)}.$$

Iteration $t + 1$ of the Gibbs sampler is
$$\theta_j^{(t+1)} \mid \ldots \sim \text{Beta}(\alpha^{(t)} + y_j,\; \beta^{(t)} + n_j - y_j),$$
$$\alpha^{(t+1)} \mid \ldots \sim \pi(\alpha \mid \ldots) \propto \Big(\frac{\Gamma(\alpha + \beta^{(t)})}{\Gamma(\alpha)}\Big)^{n} e^{\sum_j \log(\theta_j^{(t+1)})\,\alpha}\, e^{-\alpha},$$
$$\beta^{(t+1)} \mid \ldots \sim \pi(\beta \mid \ldots) \propto \Big(\frac{\Gamma(\alpha^{(t+1)} + \beta)}{\Gamma(\beta)}\Big)^{n} e^{\sum_j \log(1 - \theta_j^{(t+1)})\,\beta}\, e^{-\beta}.$$

The densities of $\alpha$ and $\beta$ are both log-concave. For log-concave densities there are good rejection algorithms.
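The slides leave the $\alpha$- and $\beta$-updates to a rejection algorithm; as a simple stand-in (my own sketch, not adaptive rejection sampling), a one-dimensional conditional like $\pi(\alpha \mid \ldots)$ can be drawn approximately by inverse-CDF sampling on a grid of the unnormalized log-density:

```python
import math
import random

rng = random.Random(0)

def grid_sample(log_density, lo, hi, n_grid=500):
    """Approximate draw from an unnormalized 1-D log-density by
    discretizing [lo, hi] and inverting the empirical CDF."""
    step = (hi - lo) / n_grid
    xs = [lo + (i + 0.5) * step for i in range(n_grid)]
    logw = [log_density(x) for x in xs]
    m = max(logw)                        # stabilize before exponentiating
    w = [math.exp(v - m) for v in logw]
    u, s = rng.random() * sum(w), 0.0
    for x, wi in zip(xs, w):
        s += wi
        if s >= u:
            return x
    return xs[-1]

# illustrative log-concave target: Gamma(3, 2), log-density 2*log(a) - 2*a
draws = [grid_sample(lambda a: 2.0 * math.log(a) - 2.0 * a, 1e-6, 10.0)
         for _ in range(2000)]
mean_a = sum(draws) / len(draws)        # Gamma(3, 2) has mean 3/2
```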
Example: rats, posterior predictive

Below is the posterior predictive of $\theta_{72}$ for the hierarchical model vs. $\theta$ for the simplistic model. (Figure: the two posterior densities of $\theta$.)
Example: rats (acf)

However, it turns out that the acf for $\alpha, \beta$ is not good. (Figure: autocorrelation functions $C(h)$ against lag $h$ for the $\alpha$ and $\beta$ chains.)
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationIntroduction to Markov Chain Monte Carlo & Gibbs Sampling
Introduction to Markov Chain Monte Carlo & Gibbs Sampling Prof. Nicholas Zabaras Sibley School of Mechanical and Aerospace Engineering 101 Frank H. T. Rhodes Hall Ithaca, NY 14853-3801 Email: zabaras@cornell.edu
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationOn Reparametrization and the Gibbs Sampler
On Reparametrization and the Gibbs Sampler Jorge Carlos Román Department of Mathematics Vanderbilt University James P. Hobert Department of Statistics University of Florida March 2014 Brett Presnell Department
More informationPARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.
PARAMETER ESTIMATION: BAYESIAN APPROACH. These notes summarize the lectures on Bayesian parameter estimation.. Beta Distribution We ll start by learning about the Beta distribution, since we end up using
More informationLecture 7 and 8: Markov Chain Monte Carlo
Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationThe Polya-Gamma Gibbs Sampler for Bayesian. Logistic Regression is Uniformly Ergodic
he Polya-Gamma Gibbs Sampler for Bayesian Logistic Regression is Uniformly Ergodic Hee Min Choi and James P. Hobert Department of Statistics University of Florida August 013 Abstract One of the most widely
More informationLecture 4: Dynamic models
linear s Lecture 4: s Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu
More informationLearning in graphical models & MC methods Fall Cours 8 November 18
Learning in graphical models & MC methods Fall 2015 Cours 8 November 18 Enseignant: Guillaume Obozinsi Scribe: Khalife Sammy, Maryan Morel 8.1 HMM (end) As a reminder, the message propagation algorithm
More informationBeta statistics. Keywords. Bayes theorem. Bayes rule
Keywords Beta statistics Tommy Norberg tommy@chalmers.se Mathematical Sciences Chalmers University of Technology Gothenburg, SWEDEN Bayes s formula Prior density Likelihood Posterior density Conjugate
More informationStochastic Simulation
Stochastic Simulation Idea: probabilities samples Get probabilities from samples: X count x 1 n 1. x k total. n k m X probability x 1. n 1 /m. x k n k /m If we could sample from a variable s (posterior)
More informationBAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci
More informationStat 516, Homework 1
Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball
More informationMarkov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017
Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the
More informationControl Variates for Markov Chain Monte Carlo
Control Variates for Markov Chain Monte Carlo Dellaportas, P., Kontoyiannis, I., and Tsourti, Z. Dept of Statistics, AUEB Dept of Informatics, AUEB 1st Greek Stochastics Meeting Monte Carlo: Probability
More informationWeakness of Beta priors (or conjugate priors in general) They can only represent a limited range of prior beliefs. For example... There are no bimodal beta distributions (except when the modes are at 0
More informationKazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies. Abstract
Bayesian Estimation of A Distance Functional Weight Matrix Model Kazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies Abstract This paper considers the distance functional weight
More informationADVANCED FINANCIAL ECONOMETRICS PROF. MASSIMO GUIDOLIN
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance ADVANCED FINANCIAL ECONOMETRICS PROF. MASSIMO GUIDOLIN a.a. 14/15 p. 1 LECTURE 3: REVIEW OF BASIC ESTIMATION METHODS: GMM AND OTHER EXTREMUM
More informationComputer Intensive Methods in Mathematical Statistics
Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 5 Sequential Monte Carlo methods I 31 March 2017 Computer Intensive Methods (1) Plan of today s lecture
More informationMarkov Chain Monte Carlo, Numerical Integration
Markov Chain Monte Carlo, Numerical Integration (See Statistics) Trevor Gallen Fall 2015 1 / 1 Agenda Numerical Integration: MCMC methods Estimating Markov Chains Estimating latent variables 2 / 1 Numerical
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationLecture 6: Markov Chain Monte Carlo
Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationGibbs Sampling in Latent Variable Models #1
Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor
More informationMarkov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018
Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling
More informationMonte Carlo Methods in Bayesian Inference: Theory, Methods and Applications
University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 1-016 Monte Carlo Methods in Bayesian Inference: Theory, Methods and Applications Huarui Zhang University of Arkansas, Fayetteville
More information
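The two-stage Gibbs sampler described above can be made concrete on the simplest non-trivial target: a bivariate normal with correlation ρ, where both full conditionals are univariate normals and hence easy to sample from. The sketch below is illustrative (the function name and parameters are not from the lecture); it follows the algorithm exactly, alternating draws of X₁ | X₂ and X₂ | X₁.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, seed=0):
    """Two-stage Gibbs sampler targeting a bivariate normal with
    zero means, unit variances and correlation rho.
    The full conditionals are X1 | X2 = x2 ~ N(rho*x2, 1 - rho^2),
    and symmetrically for X2 | X1."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0               # starting point X^(0)
    s = np.sqrt(1.0 - rho**2)       # conditional standard deviation
    samples = np.empty((n_iter, 2))
    for l in range(n_iter):
        x1 = rho * x2 + s * rng.standard_normal()  # draw X1 | X2^(l-1)
        x2 = rho * x1 + s * rng.standard_normal()  # draw X2 | X1^(l)
        samples[l] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_iter=20000)
# After discarding a burn-in, the empirical correlation is close to 0.8.
print(np.corrcoef(samples[5000:].T)[0, 1])
```

The output of the loop is a Markov chain whose stationary distribution is the target bivariate normal, so long-run averages over the samples approximate expectations under it.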