5 Mixture models: Mixture models, MCMC approaches, Label switching, MCMC for variable dimension models


1 Chapter 5, Mixture models. Outline: Mixture models, MCMC approaches, Label switching, MCMC for variable dimension models

2 Missing variable models. The complexity of a model may originate from the fact that some piece of information is missing. Examples: the Arnason-Schwarz model with missing zones; the probit model with a missing normal variate. Generic representation: $f(x\mid\theta)=\int_{\mathcal{Z}} g(x,z\mid\theta)\,dz$

3 Mixture models. Models of mixtures of distributions: $x\sim f_j$ with probability $p_j$, for $j=1,2,\dots,k$, with overall density $p_1 f_1(x)+\cdots+p_k f_k(x)$. Usual case: parameterised components, $\sum_{i=1}^{k} p_i\, f(x\mid\theta_i)$, where the weights $p_i$ are distinguished from the other parameters

4 Motivations. Dataset made of several latent/missing/unobserved strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering), the parameters $(\theta_i,p_i)$, or the number of groups. Semiparametric perspective where mixtures are basis approximations of unknown distributions

5 License. Dataset derived from a license plate image. Grey levels concentrated on 256 values, later jittered

6 Likelihood. For a sample of independent random variables $(x_1,\dots,x_n)$, the likelihood is $\prod_{i=1}^{n}\{p_1 f_1(x_i)+\cdots+p_k f_k(x_i)\}$. Expanding this product involves $k^n$ elementary terms: prohibitive to compute in large samples. But the likelihood is still computable [pointwise] in $O(kn)$ time.
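The $O(kn)$ pointwise evaluation can be checked directly; below is a minimal Python sketch (not part of the slides), assuming a normal mixture with unit component variances. The function name mixture_loglik and the test data are my own choices.

```python
# Minimal sketch: stable O(kn) evaluation of a normal-mixture log-likelihood.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mixture_loglik(x, p, mu):
    """log prod_i sum_j p_j N(x_i; mu_j, 1), computed in O(kn)."""
    x = np.asarray(x)[:, None]                                    # n x 1
    logs = np.log(p)[None, :] + norm.logpdf(x, loc=mu[None, :])   # n x k
    return logsumexp(logs, axis=1).sum()                          # sum_i log sum_j

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 70), rng.normal(2.5, 1, 30)])
    print(mixture_loglik(x, p=np.array([0.7, 0.3]), mu=np.array([0.0, 2.5])))
```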

7 Normal mean benchmark. Normal mixture $p\,\mathcal{N}(\mu_1,1)+(1-p)\,\mathcal{N}(\mu_2,1)$ with only the means unknown (2-D representation possible). Identifiability: the parameters $\mu_1$ and $\mu_2$ are identifiable, since $\mu_1$ cannot be confused with $\mu_2$ when $p$ is different from 0.5. Presence of a spurious mode, understood by letting $p$ go to 0.5

8 Bayesian inference. For any prior $\pi(\theta,p)$, the posterior distribution of $(\theta,p)$ is available up to a multiplicative constant, $\pi(\theta,p\mid x)\propto\Big[\prod_{i=1}^{n}\sum_{j=1}^{k} p_j f(x_i\mid\theta_j)\Big]\,\pi(\theta,p)$, at a cost of order $O(kn)$. Difficulty: despite this, the derivation of posterior characteristics like posterior expectations is only possible in an exponential time of order $O(k^n)$!

9 Missing variable representation. Associate with each $x_i$ a missing/latent variable $z_i$ that indicates its component: $z_i\mid p\sim\mathcal{M}_k(p_1,\dots,p_k)$ and $x_i\mid z_i,\theta\sim f(\cdot\mid\theta_{z_i})$. Completed likelihood: $\ell(\theta,p\mid x,z)=\prod_{i=1}^{n} p_{z_i} f(x_i\mid\theta_{z_i})$, and $\pi(\theta,p\mid x,z)\propto\Big[\prod_{i=1}^{n} p_{z_i} f(x_i\mid\theta_{z_i})\Big]\,\pi(\theta,p)$, where $z=(z_1,\dots,z_n)$.

10 Partition sets. Denote by $\mathcal{Z}=\{1,\dots,k\}^n$ the set of the $k^n$ possible vectors $z$. $\mathcal{Z}$ is decomposed into a partition of sets $\mathcal{Z}=\bigcup_{j=1}^{r}\mathcal{Z}_j$. For a given allocation size vector $(n_1,\dots,n_k)$, where $n_1+\cdots+n_k=n$, the partition sets are $\mathcal{Z}_j=\big\{z:\sum_{i=1}^{n}\mathbb{I}_{z_i=1}=n_1,\dots,\sum_{i=1}^{n}\mathbb{I}_{z_i=k}=n_k\big\}$, i.e. all allocations with the given allocation sizes $(n_1,\dots,n_k)$, the labels $j=j(n_1,\dots,n_k)$ being defined by lexicographical ordering on the $(n_1,\dots,n_k)$'s.

11 Posterior closed-form representations. $\pi(\theta,p\mid x)=\sum_{i=1}^{r}\sum_{z\in\mathcal{Z}_i}\omega(z)\,\pi(\theta,p\mid x,z)$, where $\omega(z)$ represents the marginal posterior probability of the allocation $z$ conditional on $x$ [derived by integrating out the parameters $\theta$ and $p$]. Bayes estimator of $(\theta,p)$: $\sum_{i=1}^{r}\sum_{z\in\mathcal{Z}_i}\omega(z)\,\mathbb{E}^{\pi}[\theta,p\mid x,z]$. Too costly: $2^n$ terms

12 MCMC approaches: general Gibbs sampling for mixture models. Take advantage of the missing data structure.
Algorithm. Initialization: choose $p^{(0)}$ and $\theta^{(0)}$ arbitrarily. Step $t$ ($t=1,\dots$):
1. Generate $z_i^{(t)}$ ($i=1,\dots,n$) from $P\big(z_i^{(t)}=j\mid p_j^{(t-1)},\theta_j^{(t-1)},x_i\big)\propto p_j^{(t-1)} f\big(x_i\mid\theta_j^{(t-1)}\big)$ ($j=1,\dots,k$),
2. Generate $p^{(t)}$ from $\pi(p\mid z^{(t)})$,
3. Generate $\theta^{(t)}$ from $\pi(\theta\mid z^{(t)},x)$.
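A hedged sketch of Step 1 (the allocation simulation), assuming a normal mixture with unit variances; the helper name sample_allocations and the inverse-cdf sampling are my own choices, not from the slides.

```python
# Sketch of the allocation step: draw z_i from P(z_i = j | p, theta, x_i) ∝ p_j f(x_i | theta_j).
import numpy as np
from scipy.stats import norm

def sample_allocations(x, p, theta, rng):
    """Draw one allocation per observation from its discrete full conditional."""
    probs = p[None, :] * norm.pdf(x[:, None], loc=theta[None, :], scale=1.0)  # n x k
    probs /= probs.sum(axis=1, keepdims=True)
    cum = probs.cumsum(axis=1)
    u = rng.uniform(size=(len(x), 1))
    return (u > cum).sum(axis=1)          # component index in {0, ..., k-1}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.array([-0.2, 0.1, 2.4, 2.6])
    z = sample_allocations(x, p=np.array([0.5, 0.5]), theta=np.array([0.0, 2.5]), rng=rng)
    print(z)
```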

13 MCMC approaches: exponential families. When $f(x\mid\theta)=h(x)\exp\{R(\theta)\cdot T(x)-\psi(\theta)\}$, simulation of both $p$ and $\theta$ is usually straightforward: conjugate prior on $\theta_j$ given by $\pi_j(\theta)\propto\exp\{R(\theta)\cdot\alpha_j-\beta_j\,\psi(\theta)\}$, where $\alpha_j\in\mathbb{R}^k$ and $\beta_j>0$ are hyperparameters, and $p\sim\mathcal{D}(\gamma_1,\dots,\gamma_k)$ [Dirichlet distribution]

14 MCMC approaches: Gibbs sampling for exponential family mixtures.
Algorithm. Initialization: choose $p^{(0)}$ and $\theta^{(0)}$. Step $t$ ($t=1,\dots$):
1. Generate $z_i^{(t)}$ ($i=1,\dots,n$, $j=1,\dots,k$) from $P\big(z_i^{(t)}=j\mid p_j^{(t-1)},\theta_j^{(t-1)},x_i\big)\propto p_j^{(t-1)} f\big(x_i\mid\theta_j^{(t-1)}\big)$,
2. Compute $n_j^{(t)}=\sum_{i=1}^{n}\mathbb{I}_{z_i^{(t)}=j}$ and $s_j^{(t)}=\sum_{i=1}^{n}\mathbb{I}_{z_i^{(t)}=j}\,T(x_i)$,
3. Generate $p^{(t)}$ from $\mathcal{D}(\gamma_1+n_1,\dots,\gamma_k+n_k)$,
4. Generate $\theta_j^{(t)}$ ($j=1,\dots,k$) from $\pi(\theta_j\mid z^{(t)},x)\propto\exp\big\{R(\theta_j)\cdot(\alpha_j+s_j^{(t)})-\psi(\theta_j)(n_j+\beta_j)\big\}$.

15 MCMC approaches: normal mean example. For a mixture of two normal distributions with unknown means, $p\,\mathcal{N}(\mu_1,\tau^2)+(1-p)\,\mathcal{N}(\mu_2,\sigma^2)$, and a normal prior $\mathcal{N}(\delta,1/\lambda)$ on both $\mu_1$ and $\mu_2$,

16 MCMC approaches: normal mean example (cont'd).
Algorithm. Initialization: choose $\mu_1^{(0)}$ and $\mu_2^{(0)}$. Step $t$ ($t=1,\dots$):
1. Generate $z_i^{(t)}$ ($i=1,\dots,n$) from $P\big(z_i^{(t)}=1\big)=1-P\big(z_i^{(t)}=2\big)\propto p\exp\big\{-\tfrac12\big(x_i-\mu_1^{(t-1)}\big)^2\big\}$,
2. Compute $n_j^{(t)}=\sum_{i=1}^{n}\mathbb{I}_{z_i^{(t)}=j}$ and $(s_j^{x})^{(t)}=\sum_{i=1}^{n}\mathbb{I}_{z_i^{(t)}=j}\,x_i$,
3. Generate $\mu_j^{(t)}$ ($j=1,2$) from $\mathcal{N}\Big(\dfrac{\lambda\delta+(s_j^{x})^{(t)}}{\lambda+n_j^{(t)}},\ \dfrac{1}{\lambda+n_j^{(t)}}\Big)$.
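The following Python sketch implements this two-mean Gibbs sampler, assuming a known weight p, unit component variances and the $\mathcal{N}(\delta,1/\lambda)$ prior on both means; the argument names (delta, lam, n_iter) and the demo data are mine.

```python
# Hedged sketch of the two-mean Gibbs sampler of this slide.
import numpy as np

def gibbs_two_means(x, p=0.5, delta=0.0, lam=1.0, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.normal(delta, 1.0, size=2)              # mu_1^(0), mu_2^(0)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 1: allocations from P(z_i = 1) ∝ p exp{-(x_i - mu_1)^2 / 2}
        w1 = p * np.exp(-0.5 * (x - mu[0]) ** 2)
        w2 = (1 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
        z = (rng.uniform(size=x.size) > w1 / (w1 + w2)).astype(int)
        # Steps 2-3: sufficient statistics and conjugate normal updates
        for j in range(2):
            nj = np.sum(z == j)
            sj = np.sum(x[z == j])
            var = 1.0 / (lam + nj)
            mu[j] = rng.normal((lam * delta + sj) * var, np.sqrt(var))
        draws[t] = mu
    return draws

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2.5, 1, 100)])
    print(gibbs_two_means(x, p=0.5, n_iter=2000)[-5:])
```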

17 MCMC approaches: normal mean example (cont'd). [Figure: Gibbs sample paths in the $(\mu_1,\mu_2)$ plane, (a) initialised at random, (b) initialised close to the lower mode]

18 MCMC approaches: License. Consider $k=3$ components, a $\mathcal{D}_3(1/2,1/2,1/2)$ prior for the weights, a $\mathcal{N}(\bar{x},\hat{\sigma}^2/3)$ prior on the means $\mu_i$ and a $\mathcal{G}a(10,\hat{\sigma}^2)$ prior on the precisions $\sigma_i^{-2}$, where $\bar{x}$ and $\hat{\sigma}^2$ are the empirical mean and variance of License [Empirical Bayes]

19 MCMC approaches: Metropolis-Hastings alternative. For the Gibbs sampler, completion of $z$ increases the dimension of the simulation space and reduces the mobility of the parameter chain. A Metropolis-Hastings algorithm is available since the posterior is known in closed form, as long as $q$ provides a correct exploration of the posterior surface, since the acceptance ratio $\dfrac{\pi(\theta',p'\mid x)}{\pi(\theta,p\mid x)}\,\dfrac{q(\theta,p\mid\theta',p')}{q(\theta',p'\mid\theta,p)}\wedge 1$ is computable in $O(kn)$ time

20 MCMC approaches: random walk Metropolis-Hastings. Proposal distribution for the new value: $\tilde{\theta}_j=\theta_j^{(t-1)}+u_j$ where $u_j\sim\mathcal{N}(0,\zeta^2)$. In the mean mixture case, the Gaussian random walk proposal is $\tilde{\mu}_1\sim\mathcal{N}\big(\mu_1^{(t-1)},\zeta^2\big)$ and $\tilde{\mu}_2\sim\mathcal{N}\big(\mu_2^{(t-1)},\zeta^2\big)$

21 MCMC approaches: random walk Metropolis-Hastings for means.
Algorithm. Initialization: choose $\mu_1^{(0)}$ and $\mu_2^{(0)}$. Iteration $t$ ($t\ge1$):
1. Generate $\tilde{\mu}_1$ from $\mathcal{N}\big(\mu_1^{(t-1)},\zeta^2\big)$,
2. Generate $\tilde{\mu}_2$ from $\mathcal{N}\big(\mu_2^{(t-1)},\zeta^2\big)$,
3. Compute $r=\pi(\tilde{\mu}_1,\tilde{\mu}_2\mid x)\big/\pi\big(\mu_1^{(t-1)},\mu_2^{(t-1)}\mid x\big)$,
4. Generate $u\sim\mathcal{U}_{[0,1]}$: if $u<r$, then $\big(\mu_1^{(t)},\mu_2^{(t)}\big)=(\tilde{\mu}_1,\tilde{\mu}_2)$; else $\big(\mu_1^{(t)},\mu_2^{(t)}\big)=\big(\mu_1^{(t-1)},\mu_2^{(t-1)}\big)$.
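A corresponding Python sketch of the random-walk sampler, under the same unit-variance mixture and $\mathcal{N}(\delta,1/\lambda)$ prior assumptions as the Gibbs sketch above; the helper log_post and the proposal scale zeta are my own names.

```python
# Hedged sketch of random-walk Metropolis-Hastings on (mu_1, mu_2).
import numpy as np
from scipy.special import logsumexp

def log_post(mu, x, p=0.5, delta=0.0, lam=1.0):
    """Unnormalised log posterior of (mu_1, mu_2) for the two-mean mixture."""
    comp = np.stack([np.log(p) - 0.5 * (x - mu[0]) ** 2,
                     np.log1p(-p) - 0.5 * (x - mu[1]) ** 2])
    return logsumexp(comp, axis=0).sum() - 0.5 * lam * np.sum((mu - delta) ** 2)

def rw_mh(x, zeta=0.5, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)
    cur = log_post(mu, x)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        prop = mu + rng.normal(0.0, zeta, size=2)   # Gaussian random walk
        new = log_post(prop, x)
        if np.log(rng.uniform()) < new - cur:       # accept with prob min(1, r)
            mu, cur = prop, new
        chain[t] = mu
    return chain

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2.5, 1, 100)])
    print(rw_mh(x)[-3:])
```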

22 MCMC approaches: random walk extensions. Difficulties with constrained parameters, like $p$ such that $\sum_{i=1}^{k} p_i=1$. Resolution by overparameterisation: $p_j=w_j\big/\sum_{l=1}^{k} w_l$, $w_j>0$, with the proposed move on the $w_j$'s given by $\log(\tilde{w}_j)=\log\big(w_j^{(t-1)}\big)+u_j$ where $u_j\sim\mathcal{N}(0,\zeta^2)$. Watch out for the Jacobian in the log transform
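A hedged illustration of that Jacobian term: the target below is a Dirichlet on $p$, represented through independent Gamma(alpha_j, 1) variables $w_j$ (my own stand-in, not the mixture posterior), and the log random-walk on the $w_j$'s multiplies the acceptance ratio by the product of the new $w_j$'s over the product of the old ones.

```python
# Sketch of the reparameterised random-walk move on unnormalised weights w_j,
# with the log-transform Jacobian included in the Metropolis-Hastings ratio.
import numpy as np

ALPHA = np.array([2.0, 3.0, 4.0])

def log_target_w(w):
    # independent Gamma(alpha_j, 1) densities, so p = w / sum(w) ~ Dirichlet(alpha)
    return np.sum((ALPHA - 1.0) * np.log(w) - w)

def weight_sampler(n_iter=20000, zeta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.ones(ALPHA.size)
    p_draws = np.empty((n_iter, ALPHA.size))
    for t in range(n_iter):
        w_new = np.exp(np.log(w) + rng.normal(0.0, zeta, size=w.size))
        # log acceptance ratio: target ratio plus Jacobian of the log transform
        log_r = (log_target_w(w_new) - log_target_w(w)
                 + np.sum(np.log(w_new)) - np.sum(np.log(w)))
        if np.log(rng.uniform()) < log_r:
            w = w_new
        p_draws[t] = w / w.sum()
    return p_draws

if __name__ == "__main__":
    print(weight_sampler().mean(axis=0))   # roughly alpha / sum(alpha)
```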

23 Label switching: identifiability. A mixture model is invariant under permutations of the indices of the components. E.g., the mixtures $0.3\,\mathcal{N}(0,1)+0.7\,\mathcal{N}(2.3,1)$ and $0.7\,\mathcal{N}(2.3,1)+0.3\,\mathcal{N}(0,1)$ are exactly the same! Hence the component parameters $\theta_i$ are not identifiable marginally since they are exchangeable

24 Label switching: connected difficulties. 1. The number of modes of the likelihood is of order $O(k!)$: maximization and even [MCMC] exploration of the posterior surface are harder. 2. Under exchangeable priors on $(\theta,p)$ [priors invariant under permutation of the indices], all posterior marginals are identical: the posterior expectation of $\theta_1$ is equal to the posterior expectation of $\theta_2$

25 Label switching: License. Since the Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks the energy to switch simultaneously enough component allocations at once. [Figure: traces and histograms of the $\mu_i$, $p_i$ and $\sigma_i$]

26 Label switching paradox. We should observe the exchangeability of the components [label switching] to conclude about the convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!

27 Label switching: constraints. Usual reply to the lack of identifiability: impose constraints like $\mu_1\le\dots\le\mu_k$ in the prior. Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints. Computational detail: the constraint does not need to be imposed during the simulation but can instead be imposed after simulation, by reordering the MCMC output according to the constraint. This avoids possible negative effects on convergence

28 Label switching: relabelling towards the mode. Selection of one of the $k!$ modal regions of the posterior once the simulation is over, by computing the approximate MAP $(\theta,p)^{(i^*)}$ with $i^*=\arg\max_{i=1,\dots,M}\pi\big((\theta,p)^{(i)}\mid x\big)$.
Pivotal reordering. At iteration $i\in\{1,\dots,M\}$,
1. Compute the optimal permutation $\tau_i=\arg\min_{\tau\in\mathfrak{S}_k} d\big(\tau\{(\theta^{(i)},p^{(i)})\},(\theta^{(i^*)},p^{(i^*)})\big)$, where $d(\cdot,\cdot)$ is a distance in the parameter space.
2. Set $(\theta^{(i)},p^{(i)})=\tau_i\big((\theta^{(i)},p^{(i)})\big)$.
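A brute-force Python sketch of pivotal reordering; searching over all $k!$ permutations with itertools, the Euclidean distance, and the toy label-switching data are my own implementation choices, practical only for small $k$.

```python
# Sketch of pivotal reordering: permute each stored draw to minimise its distance
# to the approximate MAP draw (the pivot).
import itertools
import numpy as np

def relabel(draws, log_posts):
    """draws: array of shape (M, k, d) stacking per-component parameters."""
    pivot = draws[np.argmax(log_posts)]            # approximate MAP draw
    k = draws.shape[1]
    out = np.empty_like(draws)
    for i, d in enumerate(draws):
        best = min(itertools.permutations(range(k)),
                   key=lambda perm: np.sum((d[list(perm)] - pivot) ** 2))
        out[i] = d[list(best)]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two-component mean draws whose labels randomly switch between iterations
    mus = np.where(rng.uniform(size=(100, 1)) < 0.5, [0.0, 2.5], [2.5, 0.0])
    draws = mus[:, :, None] + rng.normal(0, 0.1, size=(100, 2, 1))
    lp = -np.sum((draws - draws.mean(0)) ** 2, axis=(1, 2))   # dummy "log posterior"
    print(relabel(draws, lp).mean(axis=0).ravel())            # labels now aligned
```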

29 Label switching: re-ban on improper priors. It is difficult to use improper priors in the setting of mixtures because independent improper priors, $\pi(\theta)=\prod_{i=1}^{k}\pi_i(\theta_i)$ with $\int\pi_i(\theta_i)\,d\theta_i=\infty$, end up, for all $n$'s, with the property $\int\pi(\theta,p\mid x)\,d\theta\,dp=\infty$. Reason: there are $(k-1)^n$ terms among the $k^n$ terms in the expansion that allocate no observation at all to the $i$-th component

30 Label switching: tempering. Facilitate exploration of $\pi$ by flattening the target: simulate from $\pi_\alpha(x)\propto\pi(x)^\alpha$ for $\alpha>0$ small enough. Determine where the modal regions of $\pi$ are (possibly with parallel versions using different $\alpha$'s). Recycle simulations from $\pi(x)^\alpha$ into simulations from $\pi$ by importance sampling. Simple modification of the Metropolis-Hastings algorithm, with new acceptance probability $\Big\{\dfrac{\pi(\theta',p'\mid x)}{\pi(\theta,p\mid x)}\Big\}^{\alpha}\,\dfrac{q(\theta,p\mid\theta',p')}{q(\theta',p'\mid\theta,p)}\wedge 1$
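A minimal sketch of the tempered random-walk step: on the log scale the acceptance exponent is simply multiplied by $\alpha$. The toy bimodal target in the demo stands in for the symmetric mixture posterior and is my own choice.

```python
# Sketch of tempering: random-walk MH targeting pi^alpha to flatten the modes.
import numpy as np

def tempered_rw_mh(log_post, alpha=0.2, zeta=1.0, n_iter=5000, seed=0):
    """Random-walk MH on pi^alpha: only the exponent of the ratio changes."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    cur = log_post(theta)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        prop = theta + rng.normal(0.0, zeta, size=2)
        new = log_post(prop)
        # accept with probability min(1, {pi(prop)/pi(theta)}^alpha)
        if np.log(rng.uniform()) < alpha * (new - cur):
            theta, cur = prop, new
        chain[t] = theta
    return chain

if __name__ == "__main__":
    # toy bimodal target standing in for the label-switching mixture posterior
    def log_post(m):
        return np.logaddexp(-0.5 * np.sum((m - [0.0, 2.5]) ** 2),
                            -0.5 * np.sum((m - [2.5, 0.0]) ** 2))
    flat = tempered_rw_mh(log_post, alpha=0.2)
    print(np.mean(flat[:, 0] > flat[:, 1]))   # both modes visited roughly half the time
```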

31 Label switching: tempering with the mean mixture. [Figure]

32 MCMC for variable dimension models. "One of the things we do not know is the number of things we do not know." (P. Green, 1996). Examples: the number of components in a mixture; the number of covariates in a regression model; the number of different capture probabilities in a capture-recapture model; the number of lags in a time-series model

33 Variable dimension models. A variable dimension model is defined as a collection of models ($k=1,\dots,K$), $\mathfrak{M}_k=\{f(\cdot\mid\theta_k);\ \theta_k\in\Theta_k\}$, associated with a collection of priors on the parameters of these models, $\pi_k(\theta_k)$, and a prior distribution on the indices of these models, $\{\varrho(k),\ k=1,\dots,K\}$. Global notation: $\pi(\mathfrak{M}_k,\theta_k)=\varrho(k)\,\pi_k(\theta_k)$

34 Bayesian inference for variable dimension models. Two perspectives:
1. Consider the variable dimension model as a whole and estimate quantities meaningful for the whole, like the predictive $\sum_k \Pr(\mathfrak{M}_k\mid x_1,\dots,x_n)\int f_k(x\mid\theta_k)\,\pi_k(\theta_k\mid x_1,\dots,x_n)\,d\theta_k$, and quantities only meaningful for submodels (like moments of $\theta_k$), computed from $\pi_k(\theta_k\mid x_1,\dots,x_n)$. [Usual setup]
2. Resort to testing by choosing the best submodel via $p(\mathfrak{M}_i\mid x)=\dfrac{p_i\int_{\Theta_i} f_i(x\mid\theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j\int_{\Theta_j} f_j(x\mid\theta_j)\,\pi_j(\theta_j)\,d\theta_j}$

35 Green's reversible jumps. Computational burden in exploring a [possibly infinite] complex parameter space: Green's method sets up a proper measure-theoretic framework for designing moves between models/spaces $\mathfrak{M}_k/\Theta_k$ of varying dimensions [no one-to-one correspondence]. Create a reversible kernel $K$ on $\mathfrak{H}=\bigcup_k\{k\}\times\Theta_k$ such that $\int_A\int_B K(x,dy)\,\pi(x)\,dx=\int_B\int_A K(y,dx)\,\pi(y)\,dy$ for the invariant density $\pi$ [$x$ is of the form $(k,\theta^{(k)})$] and for all sets $A$, $B$ [un-detailed balance]

36 Green's reversible kernel. Since a Markov kernel $K$ is necessarily of the form [either stay at the same value or move to one of the other states] $K(x,B)=\sum_{m=1}^{\infty}\int_B\rho_m(x,y)\,q_m(x,dy)+\omega(x)\,\mathbb{I}_B(x)$, where $q_m(x,dy)$ is a transition measure to model $\mathfrak{M}_m$ and $\rho_m(x,y)$ the corresponding acceptance probability, one only needs to consider proposals between two models, $\mathfrak{M}_1$ and $\mathfrak{M}_2$, say

37 Green's reversibility constraint. If the transition kernels between those models are $K_{1\to2}(\theta_1,d\theta)$ and $K_{2\to1}(\theta_2,d\theta)$, formal use of the detailed balance condition gives $\pi(d\theta_1)\,K_{1\to2}(\theta_1,d\theta_2)=\pi(d\theta_2)\,K_{2\to1}(\theta_2,d\theta_1)$. To preserve stationarity, a symmetry between moves/proposals from $\mathfrak{M}_1$ to $\mathfrak{M}_2$ and from $\mathfrak{M}_2$ to $\mathfrak{M}_1$ is necessary

38 Two-model transitions. How to move from model $\mathfrak{M}_1$ to $\mathfrak{M}_2$, the Markov chain being in state $\theta_1\in\mathfrak{M}_1$ [i.e. $k=1$]? Most often $\mathfrak{M}_1$ and $\mathfrak{M}_2$ are of different dimensions, e.g. $\dim(\mathfrak{M}_2)>\dim(\mathfrak{M}_1)$. In that case, one needs to supplement both spaces $\Theta_1$ and $\Theta_2$ with adequate artificial spaces to create a one-to-one mapping between them, most often by augmenting the space of the smaller model

39 Two-model completions. E.g., the move from $\theta_2\in\Theta_2$ to $\Theta_1$ can be chosen as a deterministic transform of $\theta_2$, $\theta_1=\Psi_{2\to1}(\theta_2)$, with the reverse proposal expressed as $\theta_2=\Psi_{1\to2}(\theta_1,v_{1\to2})$, where $v_{1\to2}$ is an r.v. of dimension $\dim(\mathfrak{M}_2)-\dim(\mathfrak{M}_1)$, generated as $v_{1\to2}\sim\varphi_{1\to2}(v_{1\to2})$

40 Two-model acceptance probability. In this case, $\theta_2$ has density [under stationarity] $q_{1\to2}(\theta_2)=\pi_1(\theta_1)\,\varphi_{1\to2}(v_{1\to2})\,\Big|\dfrac{\partial\Psi_{1\to2}(\theta_1,v_{1\to2})}{\partial(\theta_1,v_{1\to2})}\Big|^{-1}$, by the Jacobian rule. To make it $\pi_2(\theta_2)$, we thus need to accept this value with probability $\alpha(\theta_1,v_{1\to2})=1\wedge\dfrac{\pi(\mathfrak{M}_2,\theta_2)}{\pi(\mathfrak{M}_1,\theta_1)\,\varphi_{1\to2}(v_{1\to2})}\,\Big|\dfrac{\partial\Psi_{1\to2}(\theta_1,v_{1\to2})}{\partial(\theta_1,v_{1\to2})}\Big|$. This is restricted to the case when only moves between $\mathfrak{M}_1$ and $\mathfrak{M}_2$ are considered

41 Interpretation. The representation puts us back in a fixed dimension setting: $\mathfrak{M}_1\times V_{1\to2}$ and $\mathfrak{M}_2$ are in one-to-one relation; reversibility imposes that $\theta_1$ is derived as $(\theta_1,v_{1\to2})=\Psi_{1\to2}^{-1}(\theta_2)$; this appears like a regular Metropolis-Hastings move from the couple $(\theta_1,v_{1\to2})$ to $\theta_2$ when the stationary distributions are $\pi(\mathfrak{M}_1,\theta_1)\,\varphi_{1\to2}(v_{1\to2})$ and $\pi(\mathfrak{M}_2,\theta_2)$, and when the proposal distribution is deterministic (??)

42 Pseudo-deterministic reasoning. Consider the proposals $\theta_2\sim\mathcal{N}(\Psi_{1\to2}(\theta_1,v_{1\to2}),\varepsilon)$ and $\Psi_{1\to2}(\theta_1,v_{1\to2})\sim\mathcal{N}(\theta_2,\varepsilon)$. The reciprocal proposal has density $\dfrac{\exp\{-(\theta_2-\Psi_{1\to2}(\theta_1,v_{1\to2}))^2/2\varepsilon\}}{\sqrt{2\pi\varepsilon}}\times\Big|\dfrac{\partial\Psi_{1\to2}(\theta_1,v_{1\to2})}{\partial(\theta_1,v_{1\to2})}\Big|$ by the Jacobian rule. Thus the Metropolis-Hastings acceptance probability is $1\wedge\dfrac{\pi(\mathfrak{M}_2,\theta_2)}{\pi(\mathfrak{M}_1,\theta_1)\,\varphi_{1\to2}(v_{1\to2})}\,\Big|\dfrac{\partial\Psi_{1\to2}(\theta_1,v_{1\to2})}{\partial(\theta_1,v_{1\to2})}\Big|$. It does not depend on $\varepsilon$: let $\varepsilon$ go to 0

43 Generic reversible jump acceptance probability. If several models are considered simultaneously, with probability $\pi_{1\to2}$ of choosing a move to $\mathfrak{M}_2$ while in $\mathfrak{M}_1$, as in $K(x,B)=\sum_{m=1}^{\infty}\int_B\rho_m(x,y)\,q_m(x,dy)+\omega(x)\,\mathbb{I}_B(x)$, the acceptance probability of $\theta_2=\Psi_{1\to2}(\theta_1,v_{1\to2})$ is $\alpha(\theta_1,v_{1\to2})=1\wedge\dfrac{\pi(\mathfrak{M}_2,\theta_2)\,\pi_{2\to1}}{\pi(\mathfrak{M}_1,\theta_1)\,\pi_{1\to2}\,\varphi_{1\to2}(v_{1\to2})}\,\Big|\dfrac{\partial\Psi_{1\to2}(\theta_1,v_{1\to2})}{\partial(\theta_1,v_{1\to2})}\Big|$, while the acceptance probability of $\theta_1$ with $(\theta_1,v_{1\to2})=\Psi_{1\to2}^{-1}(\theta_2)$ is $\alpha(\theta_1,v_{1\to2})=1\wedge\dfrac{\pi(\mathfrak{M}_1,\theta_1)\,\pi_{1\to2}\,\varphi_{1\to2}(v_{1\to2})}{\pi(\mathfrak{M}_2,\theta_2)\,\pi_{2\to1}}\,\Big|\dfrac{\partial\Psi_{1\to2}(\theta_1,v_{1\to2})}{\partial(\theta_1,v_{1\to2})}\Big|^{-1}$

44 Green's sampler.
Algorithm. Iteration $t$ ($t\ge1$): if $x^{(t)}=(m,\theta^{(m)})$,
1. Select model $\mathfrak{M}_n$ with probability $\pi_{mn}$,
2. Generate $u_{mn}\sim\varphi_{mn}(u)$ and set $(\theta^{(n)},v_{nm})=\Psi_{m\to n}(\theta^{(m)},u_{mn})$,
3. Take $x^{(t+1)}=(n,\theta^{(n)})$ with probability $\min\Big(\dfrac{\pi(n,\theta^{(n)})\,\pi_{nm}\,\varphi_{nm}(v_{nm})}{\pi(m,\theta^{(m)})\,\pi_{mn}\,\varphi_{mn}(u_{mn})}\,\Big|\dfrac{\partial\Psi_{m\to n}(\theta^{(m)},u_{mn})}{\partial(\theta^{(m)},u_{mn})}\Big|,\,1\Big)$, and take $x^{(t+1)}=x^{(t)}$ otherwise.

45 Mixture of normal distributions. $\mathfrak{M}_k=\Big\{(p_{jk},\mu_{jk},\sigma_{jk}):\ \sum_{j=1}^{k} p_{jk}\,\mathcal{N}(\mu_{jk},\sigma_{jk}^2)\Big\}$. Restrict moves from $\mathfrak{M}_k$ to adjacent models, like $\mathfrak{M}_{k+1}$ and $\mathfrak{M}_{k-1}$, with probabilities $\pi_{k(k+1)}$ and $\pi_{k(k-1)}$

46 Mixture birth. Take $\Psi_{k\to k+1}$ as a birth step: i.e. add a new normal component to the mixture, by generating the parameters of the new component from the prior distribution, $(\mu_{k+1},\sigma_{k+1})\sim\pi(\mu,\sigma)$ and $p_{k+1}\sim\mathcal{B}e(a_{k+1},a_1+\cdots+a_k)$ if $(p_1,\dots,p_k)\sim\mathcal{D}_k(a_1,\dots,a_k)$. The Jacobian is $(1-p_{k+1})^{k-1}$. The death step is then derived from the reversibility constraint by removing one of the $k$ components at random

47 Mixture acceptance probability. Birth acceptance probability: $\min\Big(\dfrac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\,\dfrac{(k+1)!}{(k+1)\,k!}\,\dfrac{\pi(k+1,\theta_{k+1})}{\pi(k,\theta_k)\,(k+1)\,\varphi_{k(k+1)}(u_{k(k+1)})},\,1\Big)=\min\Big(\dfrac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\,\dfrac{\varrho(k+1)}{\varrho(k)}\,\dfrac{\ell_{k+1}(\theta_{k+1})\,(1-p_{k+1})^{k-1}}{\ell_k(\theta_k)},\,1\Big)$, where $\ell_k$ is the likelihood of the $k$-component mixture model $\mathfrak{M}_k$ and $\varrho(k)$ the prior probability of model $\mathfrak{M}_k$. Combinatorial terms: there are $(k+1)!$ ways of defining a $(k+1)$-component mixture by adding one component while, given a $(k+1)$-component mixture, there are $(k+1)$ choices for a component to die and then $k!$ associated mixtures for the remaining components
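A hedged numerical sketch of the simplified birth acceptance probability above, assuming unit component variances, a N(0, 10) prior on the means, equal Dirichlet weights $a_j=a$ (so the new weight is Be(a, ka)), a uniform prior on $k$ and fixed birth/death move probabilities; all of these modelling choices and helper names are mine, not from the slides.

```python
# Sketch: log acceptance probability of a birth move, new component drawn from its prior.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_lik(x, p, mu):
    logs = np.log(p)[None, :] + norm.logpdf(x[:, None], loc=mu[None, :])
    return logsumexp(logs, axis=1).sum()

def birth_log_accept(x, p, mu, rng, a=1.0, pi_birth=0.5, pi_death=0.5):
    """Simplified birth acceptance (log scale) under the stated assumptions."""
    k = len(p)
    mu_new = rng.normal(0.0, np.sqrt(10.0))          # new mean from its prior
    p_new = rng.beta(a, k * a)                       # new weight, Be(a, k a)
    p_prop = np.append(p * (1.0 - p_new), p_new)     # old weights rescaled
    mu_prop = np.append(mu, mu_new)
    log_r = (np.log(pi_death) - np.log(pi_birth)     # rho(k+1)/rho(k) = 1 (uniform prior on k)
             + log_lik(x, p_prop, mu_prop) - log_lik(x, p, mu)
             + (k - 1) * np.log1p(-p_new))           # Jacobian (1 - p_{k+1})^{k-1}
    return min(0.0, log_r), p_prop, mu_prop

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 80), rng.normal(3, 1, 40)])
    la, p2, mu2 = birth_log_accept(x, np.array([0.6, 0.4]), np.array([0.0, 3.0]), rng)
    print(np.exp(la), p2, mu2)
```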

48 MCMC for variable dimension models: License. [Figure: reversible jump MCMC output for the License dataset, with traces of $k$, $\mu_i$, $\sigma_i$ and $p_i$ against iterations (up to $10^6$)]

49 More coordinated moves. Use of local moves that preserve the structure of the original model. Split move from $\mathfrak{M}_k$ to $\mathfrak{M}_{k+1}$: replaces a random component, say the $j$th, with two new components, say the $j$th and the $(j+1)$th, that are centred at the earlier $j$th component. The opposite merge move is obtained by joining two components together

50 Splitting with moment preservation. The split parameters are for instance created under a moment preservation condition:
$p_{jk}=p_{j(k+1)}+p_{(j+1)(k+1)}$,
$p_{jk}\,\mu_{jk}=p_{j(k+1)}\,\mu_{j(k+1)}+p_{(j+1)(k+1)}\,\mu_{(j+1)(k+1)}$,
$p_{jk}\,\sigma_{jk}^2=p_{j(k+1)}\,\sigma_{j(k+1)}^2+p_{(j+1)(k+1)}\,\sigma_{(j+1)(k+1)}^2$.
The opposite merge move is obtained by the reversibility constraint

51 Splitting details. Generate the auxiliary variable $u_{k(k+1)}$ as $u_1,u_3\sim\mathcal{U}(0,1)$, $u_2\sim\mathcal{N}(0,\tau^2)$, and take
$p_{j(k+1)}=u_1\,p_{jk}$, $p_{(j+1)(k+1)}=(1-u_1)\,p_{jk}$,
$\mu_{j(k+1)}=\mu_{jk}+u_2$, $\mu_{(j+1)(k+1)}=\mu_{jk}-\dfrac{p_{j(k+1)}\,u_2}{p_{jk}-p_{j(k+1)}}$,
$\sigma_{j(k+1)}^2=u_3\,\sigma_{jk}^2$, $\sigma_{(j+1)(k+1)}^2=\dfrac{p_{jk}-p_{j(k+1)}\,u_3}{p_{jk}-p_{j(k+1)}}\,\sigma_{jk}^2$.
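A short Python check of this split transform (the value of tau and the function name are my own choices): it draws $(u_1,u_2,u_3)$, builds the two new components, and verifies numerically that the weight, mean and variance quantities of the split component are preserved.

```python
# Sketch of the split move and a numerical check of moment preservation.
import numpy as np

def split_component(p, mu, var, tau=1.0, rng=np.random.default_rng(0)):
    u1, u3 = rng.uniform(size=2)
    u2 = rng.normal(0.0, tau)
    p1, p2 = u1 * p, (1.0 - u1) * p
    mu1, mu2 = mu + u2, mu - p1 * u2 / (p - p1)
    var1 = u3 * var
    var2 = (p - p1 * u3) / (p - p1) * var
    return (p1, mu1, var1), (p2, mu2, var2)

if __name__ == "__main__":
    (p1, m1, v1), (p2, m2, v2) = split_component(p=0.4, mu=1.0, var=2.0)
    print(np.isclose(p1 + p2, 0.4))                   # p_jk preserved
    print(np.isclose(p1 * m1 + p2 * m2, 0.4 * 1.0))   # p_jk mu_jk preserved
    print(np.isclose(p1 * v1 + p2 * v2, 0.4 * 2.0))   # p_jk sigma_jk^2 preserved
```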

52 Jacobian. The corresponding Jacobian of the transform $(p_{jk},\mu_{jk},\sigma_{jk}^2,u_1,u_2,u_3)\mapsto(p_{j(k+1)},p_{(j+1)(k+1)},\mu_{j(k+1)},\mu_{(j+1)(k+1)},\sigma_{j(k+1)}^2,\sigma_{(j+1)(k+1)}^2)$ is $\Big|\det\dfrac{\partial(p_{j(k+1)},\dots,\sigma_{(j+1)(k+1)}^2)}{\partial(p_{jk},\dots,u_3)}\Big|=\dfrac{p_{jk}\,\sigma_{jk}^2}{(1-u_1)^2}$

53 Acceptance probability. The corresponding split acceptance probability is $\min\Big(\dfrac{\pi_{(k+1)k}}{\pi_{k(k+1)}}\,\dfrac{\varrho(k+1)}{\varrho(k)}\,\dfrac{\pi_{k+1}(\theta_{k+1})\,\ell_{k+1}(\theta_{k+1})}{\pi_k(\theta_k)\,\ell_k(\theta_k)}\,\dfrac{p_{jk}\,\sigma_{jk}^2}{(1-u_1)^2},\,1\Big)$, where $\pi_{k(k+1)}$ and $\pi_{(k+1)k}$ denote the split and merge probabilities when in models $\mathfrak{M}_k$ and $\mathfrak{M}_{k+1}$, respectively. The factorial terms vanish: for a split move there are $k$ possible choices of the split component and then $(k+1)!$ possible orderings of the $\theta_{k+1}$ vector while, for a merge, there are $(k+1)k$ possible choices for the components to be merged and then $k!$ ways of ordering the resulting $\theta_k$
