Approximating mixture distributions using finite numbers of components
Christian Röver and Tim Friede
Department of Medical Statistics, University Medical Center Göttingen
March 17, 2016

This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement number FP HEALTH 2013-602144.
Overview

- meta-analysis example
- the general problem: mixture distributions
- discrete grid approximations
- design strategy & algorithm
- application to the meta-analysis problem
Meta analysis
Context: random-effects meta-analysis

[figure: forest plot of studies #1 to #7 with combined estimate]

- have: estimates y_i and standard errors σ_i
- want: a combined estimate Θ̂
- nuisance parameter: the between-trial heterogeneity τ
Meta analysis
Context: random-effects meta-analysis

[figure: forest plot and the marginal posterior density p(θ | y)]

estimation: via the marginal posterior distribution of the parameter θ
Meta analysis
Context: random-effects meta-analysis

[figure: joint posterior of (τ, θ) with 50% / 90% / 95% / 99% credible regions, and the two marginal posteriors p(τ | y) and p(θ | y)]

two parameters: parameter estimation with two unknowns, via joint & marginal posterior distributions
Meta analysis
Context: random-effects meta-analysis

here: one of the marginals, p(τ | y), and the conditional posteriors p(θ | τ, y) are easy to derive:

p(τ | y) = ... (a function of the y_i, σ_i, ...)
p(θ | τ, y) = Normal(µ = f_1(τ), σ = f_2(τ))

but the main interest is in the other marginal, p(θ | y):

p(θ | y) = ∫ p(θ, τ | y) dτ = ∫ p(θ | τ, y) p(τ | y) dτ
             (joint)             (conditional)  (marginal)

→ p(θ | y) is a mixture distribution
Meta analysis
Context: random-effects meta-analysis

[figure: marginal posterior p(τ | y) with reference points τ_1, ..., τ_4; the corresponding conditional posteriors p(θ | τ_i, y) (conditional mean and mean ± sd shown as functions of τ); and the resulting marginal posterior p(θ | y)]
Mixture distributions
The general problem

a mixture distribution is
- a convex combination of component distributions
- a distribution whose parameters are random variables

a (conditional) distribution with density p(y | x), whose parameter x follows a distribution p(x); the marginal / mixture is

p(y) = ∫_X p(y | x) dp(x)

for discrete x: p(y) = Σ_i p(y | x_i) p(x_i)

ubiquitous in many applications:
- Student-t distribution
- negative binomial distribution
- marginal distributions
- convolutions
- ...
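The discrete form can be evaluated directly. A minimal sketch in Python (not from the slides; the two normal components and their weights are made-up numbers for illustration):

```python
import math

def normal_pdf(y, mu, sd):
    """Density of a Normal(mu, sd) distribution at y."""
    z = (y - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(y, components, weights):
    """Discrete mixture density: p(y) = sum_i p(y | x_i) * p(x_i)."""
    return sum(w * normal_pdf(y, mu, sd)
               for (mu, sd), w in zip(components, weights))

# hypothetical two-component example
components = [(0.0, 1.0), (3.0, 2.0)]   # (mu_i, sd_i) of p(y | x_i)
weights    = [0.7, 0.3]                 # p(x_i), summing to one
density_at_zero = mixture_pdf(0.0, components, weights)
```

For a continuous mixing distribution p(x), the sum becomes an integral; the grid approximation discussed next replaces that integral by exactly such a finite sum.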
Mixture distributions
How to approximate?

approximating the continuous mixture through a discrete set of points in τ:

actual marginal:  p(θ) = ∫ p(θ | τ) p(τ) dτ
approximation:    p(θ) ≈ Σ_i p(θ | τ_i) π_i

Questions:
- how to set up the discrete grid of points?
- how well can we approximate?
- do we have a handle on the accuracy?
Mixture distributions
Motivation: discretizing a mixture

[figure: conditional posteriors p(θ | τ_i, y) for τ_1, ..., τ_4]

Note: the conditional distributions p(θ | τ, y) are very different for τ_1 and τ_2, and rather similar for τ_3 and τ_4.

idea: may need fewer bins for larger τ values?
→ bin spacing based on the similarity / dissimilarity of the conditionals?
Discretizing mixture distributions
Setting up a binning

need: a discretization of the mixing distribution p(x); domain of X: ℝ (or a subset)

define bin margins x_(1) < x_(2) < ... < x_(k-1) and bins

X_i = {x : x ≤ x_(1)}            if i = 1
X_i = {x : x_(i-1) < x ≤ x_(i)}  if 1 < i < k
X_i = {x : x_(k-1) < x}          if i = k

reference points: x̃_1, ..., x̃_k, where x̃_i ∈ X_i
bin probabilities: π_i = P(x_(i-1) < x ≤ x_(i)) = P(x ∈ X_i)
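A small sketch of this binning bookkeeping, assuming Python (function names are hypothetical): bins are right-closed intervals between the margins, and the bin probabilities π_i follow from the mixing distribution's CDF; a standard normal mixing distribution serves as the example.

```python
import bisect
import math

def bin_index(x, margins):
    """0-based index of the bin X_i containing x, for sorted margins
    x_(1) < ... < x_(k-1); bins are right-closed, X_i = (x_(i-1), x_(i)]."""
    return bisect.bisect_left(margins, x)

def bin_probs(margins, cdf):
    """Bin probabilities pi_i = P(x_(i-1) < X <= x_(i)) from the mixing CDF."""
    edges = [float("-inf")] + list(margins) + [float("inf")]
    return [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]

def std_normal_cdf(x):
    """Standard normal CDF (example mixing distribution)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

margins = [-1.0, 0.0, 1.0]          # k - 1 = 3 margins -> k = 4 bins
probs = bin_probs(margins, std_normal_cdf)
```

`bisect_left` matches the right-closed bin convention above: a point exactly on a margin belongs to the bin to its left.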
Discretizing mixture distributions
Setting up a binned mixture

actual distribution: p(x, y); discrete approximation: q(x, y)
- same marginal (mixing distribution): q(x) = p(x)
- but binned conditionals: q(y | x) = p(y | x̃_i) for x ∈ X_i

q is similar to p: instead of conditioning on the exact x, it conditions on the corresponding bin's reference point x̃_i

marginal: q(y) = ∫ q(y | x) q(x) dx = Σ_i π_i p(y | x̃_i)
Discretizing mixture distributions
Setting up a binned mixture

in the previous example:
- bin margins: τ_(1) = 10, τ_(2) = 20, τ_(3) = 30
- reference points: τ̃_1 = 5, τ̃_2 = 15, τ̃_3 = 25, τ̃_4 = 35
- probabilities: π_1 = 0.34, π_2 = 0.44, π_3 = 0.15, π_4 = 0.07

[figure: binning of p(τ | y) and the resulting approximate marginal posterior p(θ | y)]
Similarity / dissimilarity of distributions
Kullback-Leibler divergence

The Kullback-Leibler divergence of two distributions with density functions p and q is defined as

D_KL(p(θ) ‖ q(θ)) = ∫_Θ log(p(θ) / q(θ)) p(θ) dθ = E_p[ log(p(θ) / q(θ)) ]

The symmetrized KL divergence of two distributions is defined as

D_s(p(θ) ‖ q(θ)) = D_KL(p(θ) ‖ q(θ)) + D_KL(q(θ) ‖ p(θ))

the symmetrized KL divergence
- is symmetric: D_s(p(θ) ‖ q(θ)) = D_s(q(θ) ‖ p(θ))
- is non-negative: D_s(p(θ) ‖ q(θ)) ≥ 0
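In the meta-analysis setting the conditionals p(θ | τ, y) are normal, and for normal distributions D_KL, and hence D_s, have a closed form (a standard result, not specific to these slides). A sketch in Python:

```python
import math

def kl_normal(mu1, sd1, mu2, sd2):
    """Closed-form D_KL( N(mu1, sd1^2) || N(mu2, sd2^2) )."""
    return (math.log(sd2 / sd1)
            + (sd1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sd2 ** 2)
            - 0.5)

def symmetrized_kl(mu1, sd1, mu2, sd2):
    """Symmetrized divergence D_s = D_KL(p || q) + D_KL(q || p)."""
    return (kl_normal(mu1, sd1, mu2, sd2)
            + kl_normal(mu2, sd2, mu1, sd1))
```

In the symmetrized version the log terms cancel, leaving D_s = (σ_1² + Δ²)/(2σ_2²) + (σ_2² + Δ²)/(2σ_1²) − 1 with Δ = µ_1 − µ_2.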
Divergence
Interpretation

How to interpret divergences?
- a measure of discrepancy between two distributions
- heuristically: the expected log ratio of the densities
- relevant case here: p(x) ≈ q(x);
  D_KL(p(x) ‖ q(x)) = 0 for p = q, and
  D_KL(p(x) ‖ q(x)) = 0.01 corresponds to an (expected) 1% difference in densities
Divergence
Bin-wise maximum divergence: definition

consider the divergence between the reference point and the other points within each bin; define

d_i = max_{x ∈ X_i} { D_s(p(y | x) ‖ p(y | x̃_i)) } = max_{x ∈ X_i} { D_s(p(y | x) ‖ q(y | x)) }

the bin-wise maximum divergence: the worst-case discrepancy introduced within each bin
Divergence
Bin-wise maximum divergence: example

[figure: conditional parameters as functions of τ; recall: the actual parameters of the conditionals p(y | x) (in black) vs. the parameters of q(y | x) assumed through the binning (in red)]
Divergence
Bin-wise maximum divergence: example

[figure: conditional densities p(θ | τ) across the second bin, and the divergence D_s(p(θ | τ) ‖ p(θ | τ̃_2)) as a function of τ]

determine the maximum d_i for each bin i (usually attained at a bin margin)
Bounding divergence
Idea

now consider the divergence of the true and approximate marginals p(y) and q(y) (not the conditionals!): what about D_s(p(y) ‖ q(y))?

given the individual bin-wise divergences d_i, we can show:

D_s(p(y) ‖ q(y)) ≤ Σ_i π_i d_i ≤ max_i d_i

in other words: by bounding the bin-wise divergences (of the conditionals) we can bound the overall divergence (of the marginals)

→ the DIRECT (Divergence Restricted Conditional Tesselation) method
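A numerical sanity check of the bound, under made-up assumptions (a uniform mixing distribution on [0, 2], zero-mean normal conditionals with standard deviation 1 + x, two equal-weight bins, trapezoid quadrature); all numbers here are illustrative only, not from the slides:

```python
import math

def npdf(y, mu, sd):
    z = (y - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def d_s_normal(sd1, sd2):
    """Symmetrized KL divergence of two zero-mean normals (closed form)."""
    r2 = (sd1 / sd2) ** 2
    return 0.5 * (r2 + 1.0 / r2) - 1.0

# hypothetical setup: x ~ Uniform(0, 2),  y | x ~ Normal(0, 1 + x)
def sd_of(x):
    return 1.0 + x

margins = [1.0]                   # two bins: [0, 1] and (1, 2]
refs    = [0.5, 1.5]              # reference points (bin midpoints)
pi      = [0.5, 0.5]              # uniform mixing -> equal bin weights

# bin-wise maximum divergences d_i (scan a grid within each bin)
bins = [(0.0, 1.0), (1.0, 2.0)]
d = [max(d_s_normal(sd_of(a + (b - a) * j / 200.0), sd_of(r))
         for j in range(201))
     for (a, b), r in zip(bins, refs)]
bound = sum(w * di for w, di in zip(pi, d))

# true marginal p(y) (quadrature over x) vs. binned approximation q(y)
xs = [2.0 * j / 200.0 for j in range(201)]
def p_marg(y):
    vals = [npdf(y, 0.0, sd_of(x)) * 0.5 for x in xs]   # mixing density = 1/2
    return (sum(vals) - 0.5 * (vals[0] + vals[-1])) * (2.0 / 200.0)
def q_marg(y):
    return sum(w * npdf(y, 0.0, sd_of(r)) for w, r in zip(pi, refs))

# symmetrized divergence of the marginals: integral of (p - q) log(p / q)
ys = [-15.0 + 30.0 * j / 600.0 for j in range(601)]
integ = [(p_marg(y) - q_marg(y)) * math.log(p_marg(y) / q_marg(y)) for y in ys]
ds_marginal = (sum(integ) - 0.5 * (integ[0] + integ[-1])) * (30.0 / 600.0)
```

As the bound promises, the marginal divergence `ds_marginal` comes out well below `bound`, even with this deliberately coarse two-bin grid.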
Discretizing mixtures
Sequential DIRECT algorithm

[figure: divergence D_s(p(θ | τ) ‖ p(θ | τ̃_i)) as a function of τ, built up bin by bin (1st bin, 2nd bin, 3rd bin, 4th bin, ...)]

1st reference point τ̃_1 at zero; first bin margin τ_(1) at 0.904; then alternate between setting the next reference point and the next bin margin (...)

result: a binning with bounded divergence (≤ δ) per bin (when to stop?)
Discretizing mixtures
Sequential DIRECT algorithm (variations possible)

1. Specify δ > 0, 0 ≤ ε ≤ 1, and a starting reference point x̃_1 (e.g. the minimum possible value, or the ε/2 quantile). Define ε_1 ≥ 0 as ε_1 := P(X ≤ x̃_1). Set i = 1.
2. Set x = x̃_1. Obviously, D_s(p(y | x̃_1) ‖ p(y | x)) = 0. Now increase x as far as possible while ensuring that D_s(p(y | x̃_1) ‖ p(y | x)) ≤ δ. Use this point as the first bin margin: x_(1) = x. Compute π_1 = P(x ≤ x_(1)). Set i = i + 1.
3. Increase x until D_s(p(y | x_(i-1)) ‖ p(y | x)) = δ. Use this point as the next reference point: x̃_i = x.
4. Increase x again until D_s(p(y | x̃_i) ‖ p(y | x)) = δ. Use this point as the next bin margin: x_(i) = x.
5. Compute the bin weight π_i = P(x_(i-1) < X ≤ x_(i)).
6. If P(X > x_(i)) > (ε − ε_1), set i = i + 1 and proceed at step 3. Otherwise stop.
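The steps above can be sketched in Python, with a simple step-wise search in place of exact root-finding and a hypothetical example (exponential mixing distribution, normal conditionals as in the meta-analysis setting). Function names and tuning constants are assumptions, not the authors' implementation; for that, see the bayesmeta package.

```python
import math

def kl_normal(mu1, sd1, mu2, sd2):
    """Closed-form D_KL( N(mu1, sd1^2) || N(mu2, sd2^2) )."""
    return (math.log(sd2 / sd1)
            + (sd1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sd2 ** 2) - 0.5)

def direct_grid(mu, sd, cdf, delta=0.01, eps=0.001,
                x_start=0.0, step=1e-3, x_max=100.0):
    """Sequential DIRECT sketch: returns reference points and bin weights.
    mu(x), sd(x): parameters of the normal conditional p(y | x);
    cdf: CDF of the mixing distribution (domain assumed [x_start, inf))."""

    def d_s(x, x_ref):   # symmetrized divergence between two conditionals
        return (kl_normal(mu(x), sd(x), mu(x_ref), sd(x_ref))
                + kl_normal(mu(x_ref), sd(x_ref), mu(x), sd(x)))

    def advance(x_ref):  # move x upward while divergence from x_ref stays <= delta
        x = x_ref
        while x + step <= x_max and d_s(x + step, x_ref) <= delta:
            x += step
        return x

    refs, margins = [x_start], [advance(x_start)]          # steps 1 and 2
    while 1.0 - cdf(margins[-1]) > eps:                    # step 6 (stop rule)
        refs.append(advance(margins[-1]))                  # step 3: next reference
        margins.append(advance(refs[-1]))                  # step 4: next margin
    weights, prev = [], cdf(x_start)
    for m in margins:                                      # step 5: bin weights
        weights.append(cdf(m) - prev)
        prev = cdf(m)
    weights[-1] += 1.0 - prev   # fold the ignored (<= eps) tail into the last bin
    return refs, weights

# hypothetical example: x ~ Exponential(1), y | x ~ Normal(0, sqrt(1 + x^2))
refs, weights = direct_grid(mu=lambda x: 0.0,
                            sd=lambda x: math.sqrt(1.0 + x * x),
                            cdf=lambda x: 1.0 - math.exp(-x))
```

Folding the tail into the last bin is one of the possible variations; alternatively the weights can simply be renormalized.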
Discretizing mixtures
General algorithm

- remaining issue: the ignored ε > 0 tail probability (usually: problems at the domain's margins)
- only need to keep track of the reference points x̃_i and the probabilities π_i
- meta-analysis example: 35 reference ("support") points required (δ = 0.01, ε = 0.001)

[figure: joint posterior with 50% / 90% / 95% / 99% credible regions and the resulting support points]
Conclusions

- the approximation allows one to compute densities, quantiles, moments, ...
- the algorithm yields a quick-and-easy solution
- need to specify the error budget in terms of the divergence δ and the tail probability ε
- also works for discrete distributions (p(x) or p(y | x))
- meta-analysis application: implemented in the bayesmeta R package, http://cran.r-project.org/package=bayesmeta
- general procedure: C. Röver, T. Friede. Discrete approximation of a mixture distribution via restricted divergence. Submitted for publication. http://arxiv.org/abs/1602.04060