A Kullback-Leibler Divergence for Bayesian Model Comparison with Applications to Diabetes Studies. Chen-Pin Wang, UTHSCSA; Malay Ghosh, U. Florida.
1 A Kullback-Leibler Divergence for Bayesian Model Comparison with Applications to Diabetes Studies. Chen-Pin Wang, UTHSCSA; Malay Ghosh, U. Florida. Lehmann Symposium, May 9.
2 Background. KLD: the expected (with respect to the reference model) logarithm of the ratio of the probability density functions (p.d.f.s) of two models,

$$\int \log\left(\frac{r(t_n \mid \theta)}{f(t_n \mid \theta)}\right) r(t_n \mid \theta)\, dt_n.$$

KLD: a measure of the discrepancy of information about $\theta$ contained in the data as revealed by two competing models (Kullback and Leibler; Lindley; Bernardo; Akaike; Schwarz; Goutis and Robert). Challenge in the Bayesian framework: identifying priors that are compatible under the competing models so that the resulting integrated likelihoods are proper.
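As a sanity check on this definition (not part of the original slides), the KLD between two fully specified densities can be computed by numerical integration and compared against the known closed form for a pair of normals. A minimal sketch, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import integrate, stats

def kld(r_pdf, f_pdf, lo=-30.0, hi=30.0):
    # KLD(r || f) = integral of log(r/f) * r over the (effective) support
    integrand = lambda t: r_pdf(t) * np.log(r_pdf(t) / f_pdf(t))
    val, _ = integrate.quad(integrand, lo, hi)
    return val

# Reference r = N(0, 1), competing f = N(1, 2^2). Closed form for two normals:
# KLD = log(s_f/s_r) + (s_r^2 + (m_r - m_f)^2) / (2 s_f^2) - 1/2
num = kld(stats.norm(0.0, 1.0).pdf, stats.norm(1.0, 2.0).pdf)
closed = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5
```

The numerical and closed-form values agree to quadrature accuracy, and the divergence is strictly positive because the two models differ.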
3 G-R KLD. Remedy: the Kullback-Leibler projection of Goutis and Robert (1998), or G-R KLD: the infimum of the KLD between the likelihood under the reference model and all possible likelihoods arising from the competing model. The G-R KLD equals the KLD between the reference model and the competing model evaluated at its MLE when the reference model is correctly specified (cf. Akaike 1974). The G-R KLD overcomes the challenges associated with prior elicitation in calculating the KLD under the Bayesian framework.
4 G-R KLD. The Bayesian estimate of the G-R KLD: integrate the G-R KLD with respect to the posterior distribution of the model parameters under the reference model. The Bayesian estimate of the G-R KLD is not subject to impropriety of the prior as long as the posterior under the reference model is proper. The G-R KLD is suitable for comparing the predictivity of the competing models. The G-R KLD was originally developed for comparing nested GLMs with a known true model, and its extension to general model comparison remains limited.
5 Proposed KLD.

$$\mathrm{KLD}_t(r, f \mid \theta) = \int \log\left(\frac{r(t_n \mid \theta)}{f(t_n \mid \hat\theta_f)}\right) r(t_n \mid \theta)\, dt_n. \qquad (1)$$

Bayes estimate of (1):

$$\mathrm{KLD}_t(r, f \mid U_n) = \int \left\{\int \log\left(\frac{r(t_n \mid \theta)}{f(t_n \mid \hat\theta_f)}\right) r(t_n \mid \theta)\, dt_n\right\} \pi(\theta \mid U_n)\, d\theta. \qquad (2)$$

Objective: to study the properties of the KLD estimate given in (2).
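A minimal sketch of how an estimate like (2) can be computed in practice: draw from the posterior of $\theta$ under the reference model and average the $\theta$-level KLD over the draws. The Gamma "posterior" and the equal-means normal pair below are illustrative stand-ins, not models from the talk:

```python
import numpy as np

rng = np.random.default_rng(3)

def two_kld_equal_means(theta2, kappa=1.0):
    # 2*KLD for a N(mu, theta2) reference vs. a N(mu, kappa) fitted model
    y = theta2 / kappa
    return y - np.log(y) - 1.0

# Stand-in posterior for theta2: a hypothetical Gamma posterior centered at 2
post_draws = rng.gamma(shape=50.0, scale=2.0 / 50.0, size=10_000)

# Monte Carlo analogue of (2): posterior-average the theta-level KLD
bayes_est = 0.5 * two_kld_equal_means(post_draws).mean()
```

With the posterior centered at $\theta_2 = 2$ and $\kappa = 1$, the average sits near the plug-in value $\tfrac{1}{2}(2 - \log 2 - 1) \approx 0.153$, slightly inflated by posterior spread.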
6 Notations. $X_i$'s are i.i.d., originating from model $g$ governed by $\theta \in \Theta$. $T_n = T(X_1, \ldots, X_n)$: the statistic for model diagnostics. Two competing models: $r$ for the reference model and $f$ for the fitted model. Assume that the prior $\pi_r(\theta)$ leads to a proper posterior under $r$.
7 Our proposed KLD. $\mathrm{KLD}_t(r, f \mid \theta)$ quantifies the relative fit of models $r$ and $f$ to the statistic $T_n$. $\mathrm{KLD}_t(r, f \mid \theta)$ is identical to the G-R KLD when the reference model $r$ is the correct model, but is not necessarily the same as the G-R KLD in general. $\mathrm{KLD}_t(r, f \mid \theta)$ needs no additional adjustment for non-nested situations, which makes it more practical than the G-R KLD.
8 Regularity Conditions I.

(A1) For each $x$, both $\log r(x \mid \theta)$ and $\log f(x \mid \theta)$ are three times continuously differentiable in $\theta$. Further, there exist neighborhoods $N_r(\delta) = (\theta - \delta_r, \theta + \delta_r)$ and $N_f(\delta) = (\theta - \delta_f, \theta + \delta_f)$ of $\theta$ and integrable functions $H_{\theta,\delta_r}(x)$ and $H_{\theta,\delta_f}(x)$ such that

$$\sup_{\theta' \in N_r(\delta)} \left|\frac{\partial^k \log r(x \mid \theta')}{\partial \theta'^k}\right| \le H_{\theta,\delta_r}(x) \quad \text{and} \quad \sup_{\theta' \in N_f(\delta)} \left|\frac{\partial^k \log f(x \mid \theta')}{\partial \theta'^k}\right| \le H_{\theta,\delta_f}(x)$$

for $k = 1, 2, 3$.

(A2) For all sufficiently large $\lambda > 0$,

$$E_r\left[\sup_{|\theta' - \theta| > \lambda} \log \frac{r(x \mid \theta')}{r(x \mid \theta)}\right] < 0; \qquad E_f\left[\sup_{|\theta' - \theta| > \lambda} \log \frac{f(x \mid \theta')}{f(x \mid \theta)}\right] < 0.$$
9 Regularity Conditions II.

(A3) $E_r\left[\sup_{\theta' \in (\theta-\delta,\, \theta+\delta)} \log r(x \mid \theta')\right] \to E_r[\log r(x \mid \theta)]$ as $\delta \to 0$; $E_f\left[\sup_{\theta' \in (\theta-\delta,\, \theta+\delta)} \log f(x \mid \theta')\right] \to E_f[\log f(x \mid \theta)]$ as $\delta \to 0$.

(A4) The prior density $\pi(\theta)$ is continuously differentiable in a neighborhood of $\theta$ and $\pi(\theta) > 0$.

(A5) $T_n$ is asymptotically normally distributed under both models, in the sense that

$$r(t_n \mid \theta) = \sigma_r^{-1}(\theta)\,\phi(\sqrt{n}\{t_n - \mu_r(\theta)\}/\sigma_r(\theta)) + O(n^{-1/2});$$
$$f(t_n \mid \theta) = \sigma_f^{-1}(\theta)\,\phi(\sqrt{n}\{t_n - \mu_f(\theta)\}/\sigma_f(\theta)) + O(n^{-1/2}).$$
10 Theorem 1. Assume the regularity conditions (A1)-(A5). Then

$$\frac{2\,\mathrm{KLD}_t(r, f \mid U_n)}{n} - \frac{\{\hat\mu_f(U_n) - \hat\mu_r(U_n)\}^2}{\hat\sigma_f^2(U_n)} = o_p(1) \qquad (3)$$

when $\mu_f(\theta) \ne \mu_r(\theta)$, and

$$2\,\mathrm{KLD}_t(r, f \mid U_n) - Q\left(\frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)}\right) = o_p(1) \qquad (4)$$

when $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_r^2(\theta) \ne \sigma_f^2(\theta)$, where $Q(y) = y - \log(y) - 1$.
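The equal-mean case (4) can be checked directly: for two normal densities with a common mean, twice the closed-form KLD equals $Q$ of the variance ratio exactly. A small sketch (the specific variances are arbitrary):

```python
import numpy as np

def Q(y):
    # Q(y) = y - log(y) - 1, nonnegative with Q(1) = 0
    return y - np.log(y) - 1.0

def kld_normal(m_r, s2_r, m_f, s2_f):
    # Closed-form KLD( N(m_r, s2_r) || N(m_f, s2_f) )
    return 0.5 * (np.log(s2_f / s2_r) + (s2_r + (m_r - m_f) ** 2) / s2_f - 1.0)

s2_r, s2_f = 2.5, 1.0
lhs = 2.0 * kld_normal(0.0, s2_r, 0.0, s2_f)   # equal means
rhs = Q(s2_r / s2_f)
```

The identity holds exactly here (not just up to $o_p(1)$) because the two densities are exactly normal with equal means.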
11 Remarks on Theorem 1. $\mathrm{KLD}_t(r, f \mid \theta)$ is also a divergence of model-parameter estimates. Model comparison in real applications may rely on the fit to a multi-dimensional statistic; the results in Theorem 1 are applicable to the multivariate case with a fixed dimension. $\mathrm{KLD}_t(r, f \mid \theta)$ can be viewed as the discrepancy between $r$ and $f$ in terms of their posterior predictivity of $T_n$. We study how $\mathrm{KLD}_t(r, f \mid \theta)$ is connected to a weighted posterior predictive p-value, a typical Bayesian technique for assessing model discrepancy (see Rubin 1984; Gelman et al. 1996).
12 Weighted Posterior Predictive P-value.

$$\mathrm{WPPP}_r(U_n) = \int \left\{\int \left(\int_{-\infty}^{t_n} f^*(y_n \mid \hat\theta_f)\, dy_n\right) r^*(t_n \mid \theta)\, dt_n\right\} \pi_r(\theta \mid U_n)\, d\theta, \qquad (5)$$

where $r^*$ and $f^*$ are the predictive density functions of $T_n$ under $r$ and $f$, respectively. WPPP is equivalent to the posterior predictive p-value of $T_n$ under $f$, weighted with respect to the posterior predictive distribution of $T_n$ under $r$.
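A Monte Carlo sketch of (5) under simplifying assumptions: the posterior for $\theta$ is taken as degenerate (so the outer integral drops out) and both predictive densities of $T_n$ are normal, in which case the double integral reduces to $\Pr(T_f \le T_r)$ and has a closed form. The means and scales below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical normal predictive densities for T_n under r and f
mu_r, s_r = 1.0, 1.5
mu_f, s_f = 0.0, 1.0

# Inner integral of (5) is the f-predictive CDF evaluated at T_n;
# averaging it over draws of T_n from the r-predictive gives Pr(T_f <= T_r).
t_r = rng.normal(mu_r, s_r, size=200_000)
wppp_mc = stats.norm.cdf(t_r, loc=mu_f, scale=s_f).mean()

# Closed form when both predictives are normal
wppp_exact = stats.norm.cdf((mu_r - mu_f) / np.hypot(s_r, s_f))
```

With equal predictive means the same computation returns 0.5, matching the degenerate behavior WPPP exhibits in Theorem 2.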
13 Theorem 2. When $\mu_f(\theta) \ne \mu_r(\theta)$,

$$\frac{2\,\mathrm{KLD}_t(r, f \mid U_n)}{n} = \frac{\{\Phi^{-1}(\mathrm{WPPP}_r(U_n))\}^2}{n} \cdot \frac{\hat\sigma_f^2(U_n) + \hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)} + o_p(1), \qquad (6)$$

where $\{\Phi^{-1}(\mathrm{WPPP}_r(U_n))\}^2/n = (\hat\mu_r(U_n) - \hat\mu_f(U_n))^2/\{\hat\sigma_r^2(U_n) + \hat\sigma_f^2(U_n)\} + o_p(1)$. Let $Q(y) = y - \log(y) - 1$. Then

$$2\,\mathrm{KLD}_t(r, f \mid U_n) - Q\left(\frac{\hat\sigma_r^2(U_n)}{\hat\sigma_f^2(U_n)}\right) = o_p(1) \qquad (7)$$

and

$$\mathrm{WPPP}_r(U_n) - 0.5 = o_p(1) \qquad (8)$$

when $\mu_r(\theta) = \mu_f(\theta)$ but $\sigma_r^2(\theta) \ne \sigma_f^2(\theta)$.
14 Remarks on Theorem 2. It gives the asymptotic relationship between $\mathrm{KLD}_t(r, f \mid U_n)$ and WPPP. Suppose that $\mu_f(\theta) \ne \mu_r(\theta)$. Then both $2\,\mathrm{KLD}_t(r, f \mid U_n)$ and $\{\Phi^{-1}(\mathrm{WPPP}_r(U_n))\}^2$ are of order $O_p(n)$, and $2\,\mathrm{KLD}_t(r, f \mid U_n)$ exceeds $\{\Phi^{-1}(\mathrm{WPPP}_r(U_n))\}^2$ by an $O_p(n)$ term that assumes positive values with probability 1. When $\mu_r(\theta) = \mu_f(\theta)$ (i.e., both $r$ and $f$ assume the same mean of $T_n$) but $\sigma_f^2(\theta) \ne \sigma_r^2(\theta)$: $\Phi^{-1}(\mathrm{WPPP}_r(U_n))$ converges to 0 (equivalently, $\mathrm{WPPP}_r(U_n)$ converges to 0.5), while $\mathrm{KLD}_t(r, f \mid U_n)$ converges to a positive $O_p(1)$ quantity.
15 Example 1. Let $X_i$ be i.i.d. with $g_\theta(x_i) = \phi((x_i - \theta_1)/\sqrt{\theta_2})/\sqrt{\theta_2}$, where $\theta_2 > 0$. Let $T_n = \sqrt{n}\,[(\sum_i X_i)/n - \theta_1]/\sqrt{\kappa}$, $r = g$, and $f_\theta(x_i) = \phi((x_i - \theta_1)/\sqrt{\kappa})/\sqrt{\kappa}$. Then $\mu_r(\theta) = E_r(T_n) = \mu_f(\theta) = E_f(T_n) = \theta_1$, $\sigma_r^2(\theta) = \theta_2$, $\sigma_f^2(\theta) = \kappa$, and

$$2 \lim_{n\to\infty} \mathrm{KLD}_t(r, f \mid u_n) = -\log\left(\frac{\hat\theta_2(u_n)}{\kappa}\right) + \frac{\hat\theta_2(u_n)}{\kappa} - 1 \;\begin{cases} > 0 & \text{if } \kappa \ne \theta_2, \\ = 0 & \text{if } \kappa = \theta_2. \end{cases}$$

$T_n$ is the MLE for $\theta_1$ under both $r$ and $f$, and $\lim_{n\to\infty} \mathrm{WPPP}(U_n) = 0.5$, as given by (8) of Theorem 2: WPPP cannot distinguish $r$ from $f$ asymptotically, whereas $\mathrm{KLD}_t$ can when $\kappa \ne \theta_2$.
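Example 1's limit can be checked numerically: estimate $\theta_2$ from simulated normal data and plug it into $Q(\hat\theta_2/\kappa)$. The parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2, kappa = 0.5, 2.0, 1.0

# Simulate from the reference model g and form the MLE of theta2
x = rng.normal(theta1, np.sqrt(theta2), size=200_000)
theta2_hat = x.var()

def Q(y):
    return y - np.log(y) - 1.0

limit_hat = Q(theta2_hat / kappa)    # plug-in value of 2 * lim KLD_t
limit_true = Q(theta2 / kappa)
```

The plug-in value converges to the true limit as the sample grows, and the limit vanishes exactly when $\kappa = \theta_2$ (i.e., $Q(1) = 0$).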
16 Example 2. Assume $X_i$ i.i.d. with $g_\theta(x_i) = \exp\{-\theta/(1-\theta)\}\{\theta/(1-\theta)\}^{x_i}/x_i!$, where $0 < \theta < 1$ (a Poisson model with mean $\theta/(1-\theta)$). Let $T_n = \bar X_n/(1 + \bar X_n)$, $r = g$, and $f_\theta(x_i) = \theta^{x_i}(1-\theta)$ (a geometric model). Then $\mu_r(\theta) = \mu_f(\theta) = \theta$, $\sigma_r^2(\theta) = \theta(1-\theta)^3$, and $\sigma_f^2(\theta) = \theta(1-\theta)^2$, where $\theta = E(X_i)/(1 + E(X_i))$. $T_n$ is the MLE for $\theta$ under both $r$ and $f$,

$$2 \lim_{n\to\infty} \mathrm{KLD}_t(r, f \mid u_n) = -\log(1 - \hat\theta(u_n)) + (1 - \hat\theta(u_n)) - 1 > 0 \quad \text{for } 0 < \theta < 1,$$

and $\lim_{n\to\infty} \mathrm{WPPP}(U_n) = 0.5$.
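The variance formula $\sigma_r^2(\theta) = \theta(1-\theta)^3$ for $T_n$ under the Poisson reference model can be verified by simulation (the delta method in action); the sample sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.4
lam = theta / (1.0 - theta)    # Poisson mean, so E(X)/(1 + E(X)) = theta

# Many replicate samples of size n; form T_n = xbar/(1 + xbar) for each
n, reps = 400, 20_000
xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
t_n = xbar / (1.0 + xbar)

mean_sim = t_n.mean()            # should be near theta
var_sim = n * t_n.var()          # should be near theta * (1 - theta)^3
var_theory = theta * (1.0 - theta) ** 3
```

The simulated $n \cdot \mathrm{Var}(T_n)$ matches $\theta(1-\theta)^3$ up to Monte Carlo and delta-method error, confirming the slide's $\sigma_r^2(\theta)$.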
17 Example 3. Assume $X_i$ i.i.d. with

$$g_\theta(x_i) = \frac{\Gamma((\theta_2 + 1)/2)}{\Gamma(\theta_2/2)\sqrt{\pi\theta_2}}\left(1 + (x_i - \theta_1)^2/\theta_2\right)^{-(1+\theta_2)/2},$$

where $\theta_2 > 2$ (a t-distribution with location $\theta_1$ and $\theta_2$ degrees of freedom). Let $T_n = \bar X$, $r = g$, and $f_\theta(x_i) = \phi(x_i - \theta_1)$. Then $\mu_f(\theta) = \mu_r(\theta) = \theta_1$, $\sigma_r^2(\theta) = \theta_2/(\theta_2 - 2)$, $\sigma_f^2(\theta) = 1$, and

$$2 \lim_{n\to\infty} \mathrm{KLD}_t(r, f \mid u_n) = -\log\left(\frac{\hat\theta_2(u_n)}{\hat\theta_2(u_n) - 2}\right) + \frac{\hat\theta_2(u_n)}{\hat\theta_2(u_n) - 2} - 1 \ge 0$$

for all $\theta_2$, with equality if and only if $\theta_2 = \infty$.
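Example 3's limit $Q(\theta_2/(\theta_2 - 2))$ can be tabulated for a few degrees of freedom to see it shrink toward zero as the t reference approaches the normal fit:

```python
import numpy as np

def Q(y):
    return y - np.log(y) - 1.0

# 2 * lim KLD_t = Q(theta2 / (theta2 - 2)) for a t_{theta2} reference
# vs. a N(theta1, 1) fitted model; illustrative degrees of freedom
limits = {df: Q(df / (df - 2.0)) for df in (3.0, 10.0, 100.0, 1e6)}
```

The values decrease monotonically in the degrees of freedom and are essentially zero for very large $\theta_2$, consistent with equality only at $\theta_2 = \infty$.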
18 Example 4. Assume $X_i$ i.i.d. with $g_\theta(x_i) = \exp(-x_i/\theta)/\theta$. Let $r = g$, $f(x_i) = \exp(-x_i)$, and $T_n = \min\{X_1, \ldots, X_n\}$. Then $r_\theta(t_n) = n\exp(-n t_n/\theta)/\theta$ and $f(t_n) = n\exp(-n t_n)$,

$$\mathrm{WPPP}_f(\bar x_n) = E_f(\Pr(T_n^* < T_n) \mid \bar x_n) \approx \frac{\bar x_n}{\bar x_n + 1}, \qquad \mathrm{KLD}_t(r, f \mid \bar x_n) \approx -\log(\bar x_n) + (\bar x_n - 1).$$

The asymptotic equivalence between $\mathrm{KLD}_t(r, f \mid u_n)$ and $\mathrm{WPPP}_f(u_n)$ does not hold in the sense of Theorem 2, due to the violation of the asymptotic normality assumption (A5): the minimum of i.i.d. exponentials is itself exactly exponentially distributed for every $n$.
19 A Study of Glucose Change in Veterans with Type 2 Diabetes. A clinical cohort of 507 veterans with type 2 diabetes who had poor glucose control at baseline and were then treated with metformin as the sole oral glucose-lowering agent. Goal: to compare models that assessed whether obesity was associated with the net change in glucose level between baseline and the end of 5-year follow-up. The empirical mean of the net change in HbA1c over time was similar between the obese and nonobese groups. The empirical variance was greater in the obese group (1.207, vs. a smaller value in the nonobese group). The distribution of HbA1c was reasonably symmetric. Two candidate models were considered for fitting the HbA1c change: a mixture of normals vs. a t-distribution.
20 $\mathrm{KLD}_t(r, f \mid u_n) = 10.75$, suggesting that $r$ was superior to $f$. The $\mathrm{KLD}_t(r, f \mid u_n)$ result was consistent with Figures 1 and 2, which contrasted the empirical quantiles with the predicted quantiles under $r$ and $f$. Note that both $r$ and $f$ yielded unbiased estimators of $E(X_i)$; thus the model discrepancy between $r$ and $f$ assessed by $\mathrm{KLD}_t(r, f \mid u_n)$ is primarily attributable to the difference in the variance assumption between $r$ and $f$ (as evident in Figure 1). WPPP = 0.522 suggested that the overall fit was similar between the two models (the estimated net change in HbA1c was similar between the two models).
21 [Figures 1 and 2: empirical vs. predicted quantiles of the HbA1c change under $r$ and $f$.]
22 A Study of Functioning in the Elderly with Diabetes. The study cohort arose from the subset of 119 participants with diabetes in the San Antonio Longitudinal Study of Aging, a community-based study of the disablement process in Mexican American and European American older adults. Goal: to compare models that assessed whether glucose-control trajectory class (poorer vs. better) was associated with the lower-extremity physical functional limitation score (measured by SPPB) during the first follow-up period. The SPPB score is discrete in nature with a bounded range. Two candidate models were considered for fitting SPPB: a negative binomial vs. a Poisson. The empirical variance of SPPB (15.60 vs. ) was greater than the mean (7.23 vs. 8.02) in both glucose control classes.
23 $\mathrm{KLD}_t(r, f \mid u_n) = 32.63$ suggested that $r$ was a better fit than $f$. Both $r$ and $f$ yielded similar estimates of $E(X_i)$; the model discrepancy assessed by $\mathrm{KLD}_t(r, f \mid u_n)$ could primarily be attributed to the difference in variance estimation between $r$ and $f$ (as evident in Figures 3 and 4). $\mathrm{WPPP}(U_n) = 0.539$ suggested similar fit between $r$ and $f$.
24 [Figures 3 and 4: empirical vs. predicted fit of SPPB under $r$ and $f$.]
25 Summary. This paper considers a Bayesian estimate of the G-R-A KLD as given in (2). The G-R-A KLD is appropriate for quantifying the information discrepancy between the competing models $r$ and $f$. We derive the asymptotic properties of the G-R-A KLD in Theorem 1, and its relationship to a weighted posterior predictive p-value (WPPP) in Theorem 2. Our results need further refinement when the MLE of the mean of $T_n$ differs between $r$ and $f$, or when the normality assumption given in (A5) is not suitable. Model comparison in medical research may rely on the fit to a multidimensional statistic; Theorem 1 holds for a multivariate statistic $T_n$ with a fixed dimension. Further investigation is needed to assess the properties of our proposed KLD when the dimension of $T_n$ increases with $n$. The G-R-A KLD provides the relative fit between competing models. For the purpose of assessing absolute model adequacy, a KLD should be used in conjunction with absolute model-departure indices such as posterior predictive p-values or residuals. Nevertheless, a KLD is also a measure of the absolute fit of model $f$ when the reference model $r$ is the true model.
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang
More informationSelection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models
Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Junfeng Shang Bowling Green State University, USA Abstract In the mixed modeling framework, Monte Carlo simulation
More informationLecture 1: Introduction
Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More informationImproper mixtures and Bayes s theorem
and Bayes s theorem and Han Han Department of Statistics University of Chicago DASF-III conference Toronto, March 2010 Outline Bayes s theorem 1 Bayes s theorem 2 Bayes s theorem Non-Bayesian model: Domain
More informationBayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units
Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional
More informationBayes spaces: use of improper priors and distances between densities
Bayes spaces: use of improper priors and distances between densities J. J. Egozcue 1, V. Pawlowsky-Glahn 2, R. Tolosana-Delgado 1, M. I. Ortego 1 and G. van den Boogaart 3 1 Universidad Politécnica de
More informationBayesian Inference. Chapter 9. Linear models and regression
Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering
More informationSection 8.2. Asymptotic normality
30 Section 8.2. Asymptotic normality We assume that X n =(X 1,...,X n ), where the X i s are i.i.d. with common density p(x; θ 0 ) P= {p(x; θ) :θ Θ}. We assume that θ 0 is identified in the sense that
More informationBAYESIAN ANALYSIS OF BIVARIATE COMPETING RISKS MODELS
Sankhyā : The Indian Journal of Statistics 2000, Volume 62, Series B, Pt. 3, pp. 388 401 BAYESIAN ANALYSIS OF BIVARIATE COMPETING RISKS MODELS By CHEN-PIN WANG University of South Florida, Tampa, U.S.A.
More informationStatistics 135 Fall 2007 Midterm Exam
Name: Student ID Number: Statistics 135 Fall 007 Midterm Exam Ignore the finite population correction in all relevant problems. The exam is closed book, but some possibly useful facts about probability
More informationPROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability
More informationGeneralized confidence intervals for the ratio or difference of two means for lognormal populations with zeros
UW Biostatistics Working Paper Series 9-7-2006 Generalized confidence intervals for the ratio or difference of two means for lognormal populations with zeros Yea-Hung Chen University of Washington, yeahung@u.washington.edu
More informationStatistics GIDP Ph.D. Qualifying Exam Theory Jan 11, 2016, 9:00am-1:00pm
Statistics GIDP Ph.D. Qualifying Exam Theory Jan, 06, 9:00am-:00pm Instructions: Provide answers on the supplied pads of paper; write on only one side of each sheet. Complete exactly 5 of the 6 problems.
More informationMethods of Estimation
Methods of Estimation MIT 18.655 Dr. Kempthorne Spring 2016 1 Outline Methods of Estimation I 1 Methods of Estimation I 2 X X, X P P = {P θ, θ Θ}. Problem: Finding a function θˆ(x ) which is close to θ.
More information1. Fisher Information
1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score
More informationChapter 3 : Likelihood function and inference
Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM
More informationMML Invariant Linear Regression
MML Invariant Linear Regression Daniel F. Schmidt and Enes Makalic The University of Melbourne Centre for MEGA Epidemiology Carlton VIC 3053, Australia {dschmidt,emakalic}@unimelb.edu.au Abstract. This
More information