A Kullback-Leibler Divergence for Bayesian Model Comparison with Applications to Diabetes Studies. Chen-Pin Wang, UTHSCSA Malay Ghosh, U.

Similar documents
Model comparison and selection

Bayesian Model Diagnostics and Checking

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY

Statistical Inference

Parameter Estimation

Econometrics I, Estimation

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Model comparison. Christopher A. Sims Princeton University October 18, 2016

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

THE FORMAL DEFINITION OF REFERENCE PRIORS

Integrated Objective Bayesian Estimation and Hypothesis Testing

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Minimum Message Length Analysis of the Behrens Fisher Problem

Bayesian Regularization

Beta statistics. Keywords. Bayes theorem. Bayes rule

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

A Very Brief Summary of Statistical Inference, and Examples

Statistical Inference

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

Invariant HPD credible sets and MAP estimators

Lecture 16: Sample quantiles and their asymptotic properties

CSC 2541: Bayesian Methods for Machine Learning

F & B Approaches to a simple model

9 Bayesian inference. 9.1 Subjective probability

Chapter 3. Point Estimation. 3.1 Introduction

Part III. A Decision-Theoretic Approach and Bayesian testing

Brief Review on Estimation Theory

ECE531 Lecture 10b: Maximum Likelihood Estimation

Principles of Statistics

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

1 Hypothesis Testing and Model Selection

Motivational Example

Default priors and model parametrization

Module 22: Bayesian Methods Lecture 9 A: Default prior selection

Overall Objective Priors

Objective Bayesian Hypothesis Testing

Conjugate Predictive Distributions and Generalized Entropies

Penalized Loss functions for Bayesian Model Choice

STAT215: Solutions for Homework 2

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

A Very Brief Summary of Statistical Inference, and Examples

Extended Bayesian Information Criteria for Model Selection with Large Model Spaces

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Bayesian estimation of the discrepancy with misspecified parametric models

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Graduate Econometrics I: Maximum Likelihood I

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Foundations of Nonparametric Bayesian Methods

A BAYESIAN MATHEMATICAL STATISTICS PRIMER. José M. Bernardo Universitat de València, Spain

Maximum Likelihood Estimation

Statistics: Learning models from data

STAT 730 Chapter 4: Estimation

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

MIT Spring 2016

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

6.1 Variational representation of f-divergences

EM Algorithm II. September 11, 2018

Statistical Theory MT 2007 Problems 4: Solution sketches

Model comparison: Deviance-based approaches

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

Approximating mixture distributions using finite numbers of components

λ(x + 1)f g (x) > θ 0

Midterm Examination. STA 215: Statistical Inference. Due Wednesday, 2006 Mar 8, 1:15 pm

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

A NEW INFORMATION THEORETIC APPROACH TO ORDER ESTIMATION PROBLEM. Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Fisher Information & Efficiency

Maximum Likelihood Estimation

Two hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45

Foundations of Statistical Inference

Seminar über Statistik FS2008: Model Selection

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

Information geometry for bivariate distribution control

Qualifying Exam in Probability and Statistics.

A MODIFIED LIKELIHOOD RATIO TEST FOR HOMOGENEITY IN FINITE MIXTURE MODELS

Lecture 3. Truncation, length-bias and prevalence sampling

M. J. Bayarri and M. E. Castellanos. University of Valencia and Rey Juan Carlos University

Lecture notes on statistical decision theory Econ 2110, fall 2013

Lecture 2: Statistical Decision Theory (Part I)

5.3 METABOLIC NETWORKS 193. P (x i P a (x i )) (5.30) i=1

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

The information complexity of sequential resource allocation

Lecture 8: Information Theory and Statistics

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models

Lecture 1: Introduction

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Improper mixtures and Bayes s theorem

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayes spaces: use of improper priors and distances between densities

Bayesian Inference. Chapter 9. Linear models and regression

Section 8.2. Asymptotic normality

BAYESIAN ANALYSIS OF BIVARIATE COMPETING RISKS MODELS

Statistics 135 Fall 2007 Midterm Exam

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Generalized confidence intervals for the ratio or difference of two means for lognormal populations with zeros

Statistics GIDP Ph.D. Qualifying Exam Theory Jan 11, 2016, 9:00am-1:00pm

Methods of Estimation

1. Fisher Information

Chapter 3 : Likelihood function and inference

MML Invariant Linear Regression

Transcription:

A Kullback-Leibler Divergence for Bayesian Model Comparison with Applications to Diabetes Studies Chen-Pin Wang, UTHSCSA Malay Ghosh, U. Florida Lehmann Symposium, May 9, 2011 1

Background KLD: the expected (with respect to the reference model) logarithm of the ratio of the probability density functions (p.d.f. s) of two models. log ( r(tn θ) f(t n θ) ) r(t n θ) dt n KLD: a measure of the discrepancy of information about θ contained in the data revealed by two competing models (K-L; Lindley; Bernardo; Akaike; Schwarz; Goutis and Robert). Challenge in the Bayesian framework: identify priors that are compatible under the competing models the resulting integrated likelihoods are proper.

G-R KLD Remedy: The Kullback-Leibler projection by Goutis and Robert (1998), or G-R KLD: the inf. KLD between the likelihood under the reference model and all possible likelihoods arising from the competing model. G-R KLD is the KLD between the reference model and the competing model evaluated at its MLE if the reference model is correctly specified (ref. Akaike 1974). G-R KLD overcomes the challenges associated with prior elicitation in calculating KLD under the Bayesian framework.

G-R KLD The Bayesian estimate of G-R KLD: integrating the G-R KLD with respect to the posterior distribution of model parameters under the reference model. Bayesian estimate of G-R KLD is not subject to impropriety of the prior as long as the posterior under the reference model is proper. G-R KLD is suitable for comparing the predictivity of the competing models. G-R KLD was originally developed for comparing nested GLM with a known true model, and its extension to general model comparison remains limited.

Proposed KLD log ( r(tn θ) f(t n ˆθ f ) ) r(t n θ) dt n. (1) Bayes estimate of (1): { ( ) } r(tn θ) log r(t n θ) dt n π(θ U n ). (2) f(t n ˆθ f ) Objective: To study the property of KLD estimate given in (2).

Notations originating from model g gov- X i s are i.i.d. erned by θ Θ. T n = T (X 1,,X n ): diagnostics. the statistic for model Two competing models: r for the reference model and f for the fitted model. Assume that prior π r (θ) leads to proper posterior under r.

Our proposed KLD KLD t (r, f θ) quantifies the relative model fit for statistic T n between models r and f. KLD t (r, f θ) is identical to G-R KLD when the reference model r is the correct model. KLD t (r, f θ) is not necessarily the same as the G-R KLD. KLD t (r, f θ) needs no additional adjustment for non-nested situations. KLD t (r, f θ) is more practical than G-R KLD.

Regularity Conditions I (A1) For each x, both log r(x θ) and log f(x θ) are 3 times continuously differentiable in θ. Further, there exist neighborhoods N r (δ) =(θ δ r,θ+ δ r ) and N f (δ) =(θ δ f,θ+ δ f ) of θ and integrable functions H θ,δr (x) and H θ,δf (x) such that sup k log r(x θ) θ N(δ r ) θk H θ,δr (x) θ=θ and sup k log f(x θ) θ N(δ f ) θk H θ,δf (x) θ=θ for k=1, 2, 3. (A2) For all sufficiently large λ>0, [ ] E r sup log r(x θ ) θ θ >λ r(x θ) < 0; [ ] E f sup log f(x θ ) < 0. θ θ >λ f(x θ)

Regularity Conditions II (A3) E r [ E f [ sup θ (θ δ,θ+δ) sup θ (θ δ,θ+δ) ] log r(x θ ) θ E r [log r(x θ)] as δ 0; ] log f(x θ ) θ E f [log f(x θ)] as δ 0. (A4) The prior density π(θ) is continuously differentiable in a neighborhood of θ and π(θ) > 0. (A5) Suppose that T n is asymptotically normally distributed under both models such that r(t n θ) =σ 1 r (θ)φ( n{t n µ r (θ)}/σ r (θ)) + O(n 1/2 ); f(t n θ) =σ 1 f (θ)φ( n{t n µ f (θ)}/σ f (θ)) + O(n 1/2 ).

Theorem 1. Assume the regularity conditions (A1)-(A5). Then 2KLD t (r, f U n ) n when µ f (θ) µ r (θ), and {ˆµ f(u n ) ˆµ r (U n )} 2 ˆσ 2 f (U n) = o p (1)(3) 2KLD t (r, f U n ) Q ˆσ2 r (U n) ˆσ f 2(U = o p (1) (4) n) when µ r (θ) =µ f (θ) but σ 2 r (θ) σ2 f (θ).

Remarks for Theorem 1 KLD t (r, f θ) is also a divergence of model parameter estimates Model comparison in real applications may rely on the fit to a multi-dimensional statistic. The results in Theorem 1 are applicable to the multivariate case with a fixed dimension. KLD t (r, f θ) can be viewed as the discrepancy between r and f in terms of their posterior predictivity of T n. We study how KLD t (r, f θ) is connected to a weighted posterior predictive p-value, a typical Bayesian technique to assess model discrepancy (see Rubin 1984; Gelman et al. 1996).

Weighted Posterior Predictive P-value WPPP r (U n ) { tn f (y n ˆθ f )dy n r (t n θ)dt n } π r (θ U n ) dθ, (5) where r and f are the predictive density functions of T n under r and f, respectively. WPPP is equivalent to the weighted posterior predictive p-value of T n under f with respect to the posterior predictive distribution of T n under r.

Theorem 2. 2KLD t (r, f U n ) n = {Φ 1 (WPPP r (U n ))} 2 { n } (ˆµ r (U n ) ˆµ f (U n )) 2 + ˆσ f 2(U n)+ˆσ r 2 (U n ) when µ f (θ) µ r (θ). Let Q(y) =y log(y) 1. Then ˆσ 2 r (U n ) ˆσ 2 f (U n) + o p(1) (6) ( ) ˆσ r 2 (U n ) 2KLD t (r, f U n ) Q ˆσ f 2(U n) = o p (1) (7) and WPPP r (U n ) 0.5 =o p (1) (8) when µ r (θ) =µ f (θ) but σ 2 r (θ) σ 2 f (θ).

Remarks of Theorem 2. It shows the asymptotic relationship between KLD t (r, f u n ) and WPPP. Suppose that µ f (θ) µ r (θ). Both KLD t (r, f U n ) and Φ 1 (WPPP r (U n )) are of order O p (n). KLD t (r, f U n ) is greater than Φ 1 (WPPP r (U n )) by an O p (n) term that assumes positive values with probability 1. When µ r (θ) =µ f (θ) (i.e., both r and f assume the same mean of T n ) but σf 2(θ) σ2 r (θ), Φ 1 (WPPP r (U n )) converges to 0; WPPP r (U n ) converges to 0.5 KLD t (r, f U n ) converges to a positive quantity order O p (1)

Example 1. g θ (x i )=φ((x i θ 1 )/ θ 2 )/ θ 2, where θ 2 > 0. Let T n = n[( i X i)/n θ 1 ]/ κ. Let r = g and f θ (x i )=φ((x i θ 1 )/ κ)/ κ. Then X i i.i.d. µ r (θ) = E h (T n ) = µ f (θ) = E f (T n ) = θ 1, σr 2 (θ) = θ 2, σf 2 (θ) =κ, 2 lim KLDt (r, f u n ) n ) (ˆθ 2 (u n ) = log + ˆθ 2 (u n ) κ κ 1 { > 0 if κ θ2 =0 if κ = θ 2. T n is the MLE for θ 1 under both h and f. lim n WPPP(U n )=0.5 WPPP(U n ) is asymptotically equivalent to the KLD approaches.

i.i.d. Example 2 Assume X i g θ (x i ) = exp{ θ/(1 θ)}{θ/(1 θ)} x i /x i!, where 0 <θ<1. Let T n = X n /(1 + X n ), r = g, and f θ (x i )=θ x i (1 θ). Then µ r (θ) = µ f (θ) = θ, σ 2 r (θ) = θ(1 θ)3, and σ 2 f (θ) = θ(1 θ) 2. θ = E(X i )/(1 + E(X i )). T n is the MLE for θ under both r and f 2 lim n KLD t (r, f u n )= log(1 ˆθ(u n )) + (1 ˆθ(u n )) 1 > 0for0<θ<1. lim n WPPP(U n )=0.5

i.i.d. Example 3 Assume X i g θ (x i ) = Γ((θ 2+1)/2) Γ(θ 2 /2) (1 + (x πθ 2 θ 1 ) 2 /θ 2 ) (1+θ2)/2, where θ 2 > 2. Let T n = X. Let r = g and f θ (x i )=φ(x i θ 1 ). Then µ f (θ) =µ r (θ) =θ 1, σr 2 (θ) =θ 2 /(θ 2 2), and σf 2 (θ) =1 2 lim n KLD t (r, f u n )= log(θ 2 (u n )/(θ 2 (u n ) 2))+θ 2 /(θ 2 (u n ) 2) 1 0 for all θ 2 with equality if and only if θ 2 =.

i.i.d. Example 4 Assume X i g θ (x i ) = exp( x i /θ)/θ. Let r = g and f θ (x i ) = exp( x i ), T n = min{x 1,,X n }. Then r θ (t n )=n exp( nt n /θ)/θ and f θ (t n )=n exp( nt n ) WPPP f ( x n )=E f (Pr(T n <T n ) x n ) x n x n +1 KLDt (r, f x n ) log( x n )+n( x n 1) The asymptotic equivalence between KLD t (r, f u n ) and WPPP f (u n ) does not hold in the sense of Thm. 2 due to the violation of the asym. normality assumption.

A Study of Glucose Change in Veterans with Type 2 Diabetes A clinical cohort of 507 veterans with type 2 diabetes who had poor glucose control at the baseline and were then treated by metformin as the mono oral glucose-lowering agent. Goal: to compare models that assessed whether obesity was associated with the net change in glucose level between baseline and the end of 5- year follow-up. The empirical mean of the net change in HbA1c over time was similar between the obese vs. nonobese groups (-0.498 vs. -0.379). The empirical variance was greater in the obese group (1.207 vs. 0.865). Distribution of HbA1c was reasonably symmetric. Considered two candidate models for fitting the HbA1c change: a mixture of normals vs. a t-distribution.

KLD t (r, f u n )=10.75 suggesting that r was superior to f. KLD t (r, f u n ) result was consistent with Figures 1 & 2 which contrasted the empirical quantiles with predicted quantiles under r and f. Note that both r and f yielded unbiased estimators of E(X i ). Thus the model discrepancy between r and f assessed by KLD t (r, f u n ) is primarily attributed to the difference in the variance assumption between r and f (as evident in Figure 1 which contrasted the empirical quantiles with predicted quantiles under r and f). WPPP=0.522 suggested that the overall fit were similar between the two models (the estimated net change in HbA1c was similar between these two models).

A Study of Functioning in the Elderly with Diabetes The study cohort arisen from the subset of 119 participants with diabetes in the San Antonio Longitudinal Study of Aging, a communitybased study of the disablement process in Mexican American and European American older adults. Goal: to compare models that assessed whether glucose control trajectory class (poorer vs. better) was associated with the lower-extremity physical functional limitation score (measured by SPPB) during the first follow up period. SPPB score is discrete in nature with a range of 0-12. Considered two candidate models for fitting SPPB: a negative binomial vs. a poisson. The empirical variance of SPPB (15.60 vs. 14.33) was greater than the mean (7.23 vs. 8.02) in both glucose control classes.

KLD t (r, f u n )=32.63 suggested that r was a better fit than f. Both r and f yielded similar estimates of E(X i ). The model discrepancy assessed by KLD t (r, f u n ) could primarily be attributed to the difference in variance estimation between r and f (as evident in Figures 3 & 4). WPPP(U n )=0.539 suggested similar fit between r and f.

Summary This paper considers a Bayesian estimate of the G-R-A KLD as given in (2). G-R-A KLD is appropriate for quantifying information discrepancy between the competing models r and f. We derive the asymptotic property of the G-R- A KLD in Theorem 1, and its relationship to a weighted posterior predictive p-value (WPPP) in Theorem 2. Our results need further refinement when the MLE of the mean of T n differs between r and f, or the normality assumption given in (A5) is not suitable. Model comparison in medical research may rely on the fit to a multidimensional statistic. Theorem 1 holds for a multivariate statistic T n with a

fixed dimension. Further investigation is needed to assess the property of our proposed KLD for situation when the dimension of T n increases with n. G-R-A KLD provides the relative fit between competing models. For the purpose of assessing absolute model adequacy, a KLD should be used in conjuction with absolute model departure indices such as posterior predictive p-values or residuals. Nevertheless, a KLD is also a measure of the absolute fit of model f when the reference model r is the true model.