5: Biostatistical Applications of Bayesian Decision Theory


1 Introduction to Bayesian Data Analysis

5: Biostatistical Applications of Bayesian Decision Theory

David Draper
Department of Applied Mathematics and Statistics
University of California, Santa Cruz, USA
www.ams.ucsc.edu/~draper

Centers for Disease Control and Prevention (Atlanta GA), June 2008

© 2008 David Draper (all rights reserved)

2 The Big Picture

One possible definition: statistics is the study of uncertainty: how to measure it, and what to do about it.

How to measure uncertainty: probability; two main probability paradigms: frequentist and Bayesian.

What to do about uncertainty: two main activities.

Inference: generalizing outward from a given set of information (sample) to a larger universe (population), and attaching well-calibrated measures of uncertainty to the generalizations (e.g., "Nonwhites in the population of people at substantial risk of HIV-1 infection are 88% more likely to get infected if they don't receive this rgp120 vaccine than if they do receive it (relative risk of infection 1.88, 95% interval estimate 1.14 to 3.13)").

Decision-making: taking or recommending an action on the basis of available data, in spite of remaining uncertainties (e.g., "Based on this trial, in which nonwhites were a secondary subgroup, it's recommended that the vaccine be studied further with nonwhites as the primary study group").

3 Use of Frequentist and Bayesian Probability in Statistics

Frequentist probability: restrict attention to phenomena that are inherently repeatable under (essentially) identical conditions; then, for an event A of interest, P_F(A) = the limiting relative frequency with which A occurs in the (hypothetical) repetitions, as the number of repetitions n → ∞.

+ Math easier; focuses attention on calibration issues (how often do I get the right answer?).
− Only applies to inherently repeatable phenomena: can't speak directly about many uncertain things of interest (e.g., P_F(this patient is HIV+) is undefined); predictive interval estimates often not so easy to create; small-sample inferential calibration not so easy to achieve.

Bayesian probability: numerical weight of evidence in favor of an uncertain proposition, obeying a series of reasonable axioms to ensure that Bayesian probabilities are coherent (internally logically consistent).

+ Applies to any uncertain situation; predictive intervals easy; Wald (1950; a frequentist!): all good decisions are Bayes rules.
− Math harder; coherence doesn't guarantee good calibration.

4 Frequentist Inference, Prediction and Decision-Making

Frequentist inference: (1) I think of my data as like a random sample from some population (challenge: often difficult with observational data to identify what this population really is). (2) I identify some numerical summary θ of the population of interest (e.g., a relative risk), and I find a reasonable estimate θ̂ of θ based on the sample (challenge: how to define reasonable?). (3) I imagine repeating the random sampling, and I use the random behavior of θ̂ across these repetitions to make probability statements involving (but not about!) θ (e.g., confidence intervals for θ [e.g., "I'm 95% confident that θ_RR is between 1.14 and 3.13"] or hypothesis tests about θ [e.g., "the P-value for testing H_0: θ_RR < 1 against H_A: θ_RR ≥ 1 is 0.012, so I reject H_0"]).

Frequentist point prediction (e.g., in regression) is easy; constructing predictive intervals to check the calibration of the prediction process on new data is less easy (one solution: the bootstrap); nobody does real-world frequentist decision-making since Wald's famous theorem.

5 Bayesian Statistical Paradigm

Three basic ingredients of the Bayesian statistical paradigm:

θ, something of interest which is unknown (or only partially known) to me (e.g., θ_RR). Often θ is a parameter vector (of finite length k, say) or a matrix, but it can literally be almost anything: e.g., a function (a cumulative distribution function (CDF) or density, a regression surface, ...), a phylogenetic tree, an image of the (true) surface of Mars, ....

y, an information source which is relevant to decreasing my uncertainty about θ. Often y is a vector of real numbers (of length n, say), but it can also literally be almost anything: e.g., a time series, a movie, the text in a book, ....

A desire to learn about θ from y in a way that is both coherent (internally consistent, i.e., free of internal logical contradictions) and well-calibrated (externally consistent, e.g., capable of making accurate predictions of future data y*).

6 All Uncertainty Quantified With Probability Distributions

It turns out (e.g., de Finetti 1990, Jaynes 2003) that I'm compelled in this situation to reason within the standard rules of probability as the basis of my inferences about θ, predictions of future data y*, and decisions in the face of uncertainty, and to quantify my uncertainty about any unknown quantities through conditional probability distributions, as follows:

    p(θ | y, B) = c · p(θ | B) · l(θ | y, B)
    p(y* | y, B) = ∫ p(y* | θ, B) p(θ | y, B) dθ          (1)
    a* = argmax_{a ∈ A} E_{(θ | y, B)}[U(a, θ)]

B stands for my background (often not fully stated) assumptions and judgments about how the world works, as these assumptions relate to learning about θ from y. B is often omitted from the basic equations (sometimes with unfortunate consequences), yielding the simpler-looking forms

    p(θ | y) = c · p(θ) · l(θ | y)
    p(y* | y) = ∫ p(y* | θ) p(θ | y) dθ          (2)
    a* = argmax_{a ∈ A} E_{(θ | y)}[U(a, θ)]
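
A minimal numerical sketch (in Python, not from the slides) of the three equations in display (1), using a discrete grid for a single unknown proportion θ and an invented two-action utility table; all data values and utilities below are made up purely to show the mechanics.

    # Sketch of display (1) on a discrete grid for theta, with a toy two-action decision.
    import numpy as np

    theta = np.linspace(0.01, 0.99, 99)          # grid for an unknown proportion theta
    prior = np.ones_like(theta)                  # flat prior p(theta | B), up to a constant
    y_successes, n_trials = 7, 20                # observed data y
    lik = theta**y_successes * (1 - theta)**(n_trials - y_successes)   # l(theta | y, B)

    post = prior * lik
    post /= post.sum()                           # c makes the posterior sum to 1

    # Posterior predictive P(y* = 1 | y, B) = sum over the grid of p(y* | theta) p(theta | y)
    p_next_success = np.sum(theta * post)

    # MEU: choose the action a* in A maximizing E_(theta | y)[U(a, theta)]
    # (the utilities below are invented purely for illustration)
    actions = {"treat": lambda th: 10 * th - 3, "wait": lambda th: 2 * th}
    expected_U = {a: np.sum(U(theta) * post) for a, U in actions.items()}
    a_star = max(expected_U, key=expected_U.get)
    print(p_next_success, expected_U, a_star)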

7 Prior and Posterior Distributions

    p(θ | y, B) = c · p(θ | B) · l(θ | y, B)
    p(y* | y, B) = ∫ p(y* | θ, B) p(θ | y, B) dθ
    a* = argmax_{a ∈ A} E_{(θ | y, B)}[U(a, θ)]

p(θ | B) is my (so-called) prior information about θ given B, in the form of a probability density function (PDF) or probability mass function (PMF) if θ lives continuously or discretely on R^k (let's just agree to call this my prior distribution), and p(θ | y, B) is my (so-called) posterior distribution about θ given y and B, which summarizes my current total information about θ and solves the inference problem.

These are actually not very good names for p(θ | B) and p(θ | y, B), because (e.g.) p(θ | B) really stands for all (relevant) information about θ (given B) external to y, whether that information was obtained before (or after) y arrives, but (a) they do emphasize the sequential nature of learning and (b) through long usage we're stuck with them.

c (here and throughout) is a generic positive normalizing constant, inserted into the top equation above to make the left-hand side integrate (or sum) to 1 (as any coherent distribution must).

8 Sampling Distributions, Likelihood Functions and Utility

    p(θ | y, B) = c · p(θ | B) · l(θ | y, B)
    p(y* | y, B) = ∫ p(y* | θ, B) p(θ | y, B) dθ
    a* = argmax_{a ∈ A} E_{(θ | y, B)}[U(a, θ)]

p(y* | θ, B) is my sampling distribution for future data values y* given θ and B (and presumably I would use the same sampling distribution p(y | θ, B) for (past) data values y, thinking before the data arrive about what values of y I might see). This assumes that I'm willing to regard my data as like random draws from a population of possible data values (an heroic assumption in some cases, e.g., with observational rather than randomized data).

l(θ | y, B) is my likelihood function for θ given y and B, which is defined to be any positive constant multiple of the sampling distribution p(y | θ, B) but re-interpreted as a function of θ for fixed y:

    l(θ | y, B) = c · p(y | θ, B).          (3)

A is my set of possible actions, U(a, θ) is the numerical value (utility) I attach to taking action a if the unknown is really θ, and the third equation says I should find the action a* that maximizes expected utility (MEU).

9 Predictive Distributions and MCMC

    p(θ | y, B) = c · p(θ | B) · l(θ | y, B)
    p(y* | y, B) = ∫ p(y* | θ, B) p(θ | y, B) dθ
    a* = argmax_{a ∈ A} E_{(θ | y, B)}[U(a, θ)]

And p(y* | y, B), my (posterior) predictive distribution for future data y* given (past) data y and B, must be a weighted average of my sampling distribution p(y* | θ, B), weighted by my current best information p(θ | y, B) about θ given y and B.

That's the paradigm, and in the past (say) 30 years it's been highly successful, in fields as far-ranging as bioinformatics, econometrics, environmetrics, and medicine, at quantifying uncertainty in a coherent and well-calibrated way and helping people find satisfying answers to hard scientific questions.

Evaluating (potentially high-dimensional) integrals (like the one in the second equation above, and many others that arise in the Bayesian approach) is a technical challenge, often addressed these days with sampling-based Markov chain Monte Carlo (MCMC) methods (e.g., Gilks, Richardson and Spiegelhalter 1996).

10 An Example of Poorly-Calibrated Frequentist Inference

Quality of hospital care is often studied with cluster samples: I take a random sample of J hospitals (indexed by j) and a random sample of N total patients (indexed by i) nested in the chosen hospitals, and I measure quality of care for the chosen patients and various hospital- and patient-level predictors.

With y_ij as the quality of care score for patient i in hospital j, a first step would often be to fit a variance-components model with random effects at both the hospital and patient levels:

    y_ij = β_0 + u_j + e_ij,   i = 1, ..., n_j,  j = 1, ..., J;   Σ_{j=1}^J n_j = N,
    (u_j | σ_u²) ~ IID N(0, σ_u²),   (e_ij | σ_e²) ~ IID N(0, σ_e²).          (4)

Browne and Draper (2006) used a simulation study to show that, with a variety of maximum-likelihood-based methods for creating confidence intervals for σ_u², the actual coverage of nominal 95% intervals ranged from 72% to 94% across realistic sample sizes and true parameter values, versus 89–94% for Bayesian methods.
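
The following Python sketch shows the structure of this kind of calibration simulation: simulate data from model (4) with σ_u² known, build an interval for σ_u² (here by a simple cluster bootstrap of the ANOVA moment estimator, not the ML or Bayesian procedures that Browne and Draper actually compared), and record how often the nominal 95% interval covers the truth. Sample sizes and true values are illustrative only.

    # Coverage (calibration) simulation sketch for sigma2_u in model (4).
    import numpy as np

    rng = np.random.default_rng(1)
    J, n, beta0, sigma2_u, sigma2_e = 30, 10, 0.0, 1.0, 1.0

    def moment_sigma2u(y):                        # y has shape (J, n)
        msb = n * y.mean(axis=1).var(ddof=1)      # between-hospital mean square
        mse = ((y - y.mean(axis=1, keepdims=True))**2).sum() / (y.shape[0] * (n - 1))
        return max((msb - mse) / n, 0.0)          # ANOVA moment estimator, truncated at 0

    def bootstrap_interval(y, B=500, alpha=0.05):
        # resample whole hospitals (clusters) with replacement, percentile interval
        ests = [moment_sigma2u(y[rng.integers(0, y.shape[0], y.shape[0])]) for _ in range(B)]
        return np.quantile(ests, [alpha / 2, 1 - alpha / 2])

    cover, n_sim = 0, 200
    for _ in range(n_sim):
        u = rng.normal(0, np.sqrt(sigma2_u), size=(J, 1))
        y = beta0 + u + rng.normal(0, np.sqrt(sigma2_e), size=(J, n))
        lo, hi = bootstrap_interval(y)
        cover += (lo <= sigma2_u <= hi)
    print("estimated actual coverage of nominal 95% interval:", cover / n_sim)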

11 Poorly-Calibrated Frequentist Inference (continued)

In a re-analysis of a Guatemalan National Survey of Maternal and Child Health, with three-level data (births within mothers within communities), working with the random-effects logistic regression model

    (y_ijk | p_ijk) ~ independent Bernoulli(p_ijk),
    logit(p_ijk) = β_0 + β_1 x_1ijk + β_2 x_2jk + β_3 x_3k + u_jk + v_k,          (5)

where y_ijk is a binary indicator of modern prenatal care or not and where u_jk ~ N(0, σ_u²) and v_k ~ N(0, σ_v²) were random effects at the mother and community levels (respectively), Browne and Draper (2006) showed that things can be even worse for likelihood-based methods, with actual coverages (at nominal 95%) as low as 0–2% for intervals for σ_u² and σ_v², whereas Bayesian methods again produce actual coverages from 89–96%.

The technical problem is that the marginal likelihood functions for random-effects variances are often heavily skewed, with maxima at or near 0 even when the true variance is positive; Bayesian methods, which integrate over the likelihood function rather than maximizing it, can have (much) better small-sample calibration performance.

12 Where I'm Headed in This Part of the Short Course

I've argued that the frequentist and Bayesian paradigms both have strengths and weaknesses, so (unlike the position taken by many people in the 20th century) my job is not to choose one paradigm and defend it against attacks from people who prefer the other paradigm, but instead to (a) understand both paradigms thoroughly and (b) find a fusion of the two that emphasizes the strengths and plays down the weaknesses.

My personal fusion has two steps: (1) to reason in a Bayesian way when formulating my inferences, predictions and decisions, because the Bayesian paradigm is the most flexible approach so far invented for quantifying all relevant sources of uncertainty, and (2) to reason in a frequentist way when evaluating the quality of my answers, by paying attention to calibration issues (e.g., creating a simulation environment similar to the problem I'm studying but in which truth is known, and seeing how often my Bayesian methods recover known truth).

13 Where I'm Headed (continued)

The 20th century was dominated by the frequentist point of view, which was good because of the emphasis on calibration but bad because this produced an over-emphasis on inference at the expense of prediction and decision-making.

In particular, problems that at first look inferential (because that was the only way the last century's dominant paradigm could handle them) may profitably be reformulated as decisions, and people sometimes use inferential tools to suggest optimal behaviors that are not as optimal as they initially seem.

Here I'll describe two case studies in biostatistics in which Bayesian decision theory gives new insight in settings that seem inferential: variable selection in generalized linear models (with application to the construction of a cost-effective scale for measuring sickness at admission to hospital), and determining the efficacy of a vaccine against HIV.

14 Measuring Sickness at Admission

Variable selection (choosing the best subset of predictors) in generalized linear models is an old problem, dating back at least to the 1960s, and many methods have been proposed to try to solve it; but virtually all of them ignore an aspect of the problem that can be important: the cost of data collection of the predictors.

Case study 1 (Fouskakis and Draper, JASA, 2008; Fouskakis, Ntzoufras and Draper (FND), submitted, 2007a, 2007b). In the field of quality of health care measurement, patient sickness at admission is often assessed by using logistic regression of mortality within 30 days of admission on a fairly large number of sickness indicators (on the order of 100) to construct a sickness scale, employing standard variable selection methods (e.g., backward selection from a model with all predictors) to find an optimal subset of indicators.

Such benefit-only methods ignore the considerable differences among the sickness indicators in cost of data collection, an issue that's crucial when admission sickness is used to drive programs (now implemented or

15 Choosing Utility Function (continued)

under consideration in several countries, including the U.S. and U.K.) that attempt to identify substandard hospitals by comparing observed and expected mortality rates (given admission sickness).

When both data-collection cost and accuracy of prediction of 30-day mortality are considered, a large variable-selection problem arises in which costly variables that do not predict well enough should be omitted from the final scale.

There are two main ways to solve this problem (you can (a) put cost and predictive accuracy on the same scale and optimize, or (b) maximize the latter subject to a bound on the former), leading to three methods: (1) a decision-theoretic cost-benefit approach based on maximizing expected utility (Fouskakis and Draper, 2008), (2) an alternative cost-benefit approach based on posterior model odds (FND, 2007a), and (3) a cost-restriction-benefit analysis that maximizes predictive accuracy subject to a bound on cost (FND, 2007b).

16 The Data

Data (Kahn et al., JAMA, 1990): p = 83 sickness indicators gathered on a representative sample of n = 2,532 elderly American patients hospitalized with pneumonia during the study period; the original RAND benefit-only scale was based on a subset of 14 predictors.

[Table: for each of the 14 RAND-scale variables, the per-patient data-collection cost in U.S.$, the correlation with 30-day mortality, and a Good? indicator. The 14 variables were: Total APACHE II score (36-point scale); Age; Systolic blood pressure score (2-point scale); Chest X-ray congestive heart failure score (3-point scale); Blood urea nitrogen; APACHE II coma score (3-point scale); Serum albumin (3-point scale); Shortness of breath (yes, no); Respiratory distress (yes, no); Septic complications (yes, no); Prior respiratory failure (yes, no); Recently hospitalized (yes, no); Ambulatory score (3-point scale); Temperature.]

17 Decision-Theoretic Cost-Benefit Approach

Approach (1) (decision-theoretic cost-benefit). Problem formulation: suppose (a) the 30-day mortality outcome y_i and data on p sickness indicators (X_i1, ..., X_ip) have been collected on n individuals sampled exchangeably from a population P of patients with a given disease, and (b) the goal is to predict the death outcomes for n new patients who will in the future be sampled exchangeably from P, (c) on the basis of some or all of the predictors X_j, when (d) the marginal costs of data collection per patient, c_1, ..., c_p, for the X_j vary considerably.

What is the best subset of the X_j to choose, if a fixed amount of money is available for this task and you're rewarded based on the quality of your predictions?

Since data on future patients are not available, we use a cross-validation approach in which (i) a random subset of n_M observations is drawn for creation of the mortality predictions (the modeling subsample) and (ii) the quality of those predictions is assessed on the remaining n_V = (n − n_M) observations (the validation subsample, which serves as a proxy for future patients).

18 Utility Elicitation

Here utility is quantified in monetary terms, so that the data-collection part of the utility function is simply the negative of the total amount of money required to gather data on the specified predictor subset (manual data abstraction from hardcopy patient charts will gradually be replaced by electronic medical records, but is still widely used in quality of care studies).

Letting I_j = 1 if X_j is included in a given model (and 0 otherwise), the data-collection utility associated with subset I = (I_1, ..., I_p) for the patients in the validation subsample is

    U_D(I) = −n_V Σ_{j=1}^p c_j I_j,          (6)

where c_j is the marginal cost per patient of data abstraction for variable j (the second column in the table above gave examples of these marginal costs).

To measure the accuracy of a model's predictions, a metric is needed that quantifies the discrepancy between the actual and predicted values, and in this problem the metric must come out in monetary terms on a scale comparable to that employed with the data-collection utility.
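
A minimal Python sketch of equation (6); the costs below are placeholders, not the RAND values.

    # Data-collection utility of equation (6): the (negative) cost of abstracting the
    # chosen predictors for every patient in the validation subsample.
    import numpy as np

    def data_collection_utility(I, costs, n_V):
        """U_D(I) = -n_V * sum_j c_j I_j, with I a 0/1 inclusion vector."""
        I, costs = np.asarray(I), np.asarray(costs)
        return -n_V * float(np.dot(costs, I))

    costs = np.array([0.50, 0.25, 1.75, 3.00])    # hypothetical marginal costs per patient, US$
    print(data_collection_utility([1, 0, 1, 1], costs, n_V=1000))   # -> -5250.0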

19 Utility Elicitation (continued)

In the setting of this problem the outcomes Y_i are binary death indicators and the predicted values p̂_i, based on statistical modeling, take the form of estimated death probabilities.

We use an approach to the comparison of actual and predicted values that involves dichotomizing the p̂_i with respect to a cutoff, to mimic the decision-making reality that actions taken on the basis of observed-versus-expected quality assessment will have an all-or-nothing character at the hospital level (for example, regulators must decide either to subject or not to subject a given hospital to a more detailed, more expensive quality audit based on process criteria).

In the first step of our approach, given a particular predictor subset I, we fit a logistic regression model to the modeling subsample M and apply this model to the validation subsample V to create predicted death probabilities p̂_i^I.

In more detail, letting Y_i = 1 if patient i dies and 0 otherwise, and taking X_i1, ..., X_ik to be the k sickness predictors for this patient under model I, the usual sampling model which underlies logistic regression in this case is

20 Utility Elicitation (continued)

    (Y_i | p_i^I) ~ independent Bernoulli(p_i^I),   log[ p_i^I / (1 − p_i^I) ] = β_0 + β_1 X_i1 + ... + β_k X_ik.          (7)

We use maximum likelihood to fit this model (as a computationally efficient approximation to Bayesian fitting with relatively diffuse priors), obtaining a vector β̂ of estimated logistic regression coefficients, from which the predicted death probabilities for the patients in subsample V are as usual given by

    p̂_i^I = [ 1 + exp( −Σ_{j=0}^k β̂_j X_ij ) ]^{−1},          (8)

where X_i0 = 1 (p̂_i^I may be thought of as the sickness score for patient i under model I).

In the second step of our approach we classify patient i in the validation subsample as predicted dead or alive according to whether p̂_i^I exceeds or falls short of a cutoff p*, which is chosen by searching on a discrete grid from 0.01 to 0.99 by steps of 0.01 to maximize the predictive accuracy of model I.
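
A sketch of equations (7)-(8) and the cutoff search, on simulated data, using scikit-learn's logistic regression with a very large C as a stand-in for plain maximum likelihood; raw classification accuracy is used here as a simple proxy for the predictive criterion, which in the full approach is the monetary predictive utility described next.

    # Fit on the modeling subsample M, predict on the validation subsample V, pick p*.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, k = 2532, 5
    X = rng.normal(size=(n, k))
    true_beta = np.array([-2.0, 0.8, 0.5, 0.0, -0.4, 0.3])          # intercept + k slopes
    p = 1 / (1 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
    y = rng.binomial(1, p)

    M, V = np.arange(n) < 2000, np.arange(n) >= 2000                # modeling / validation split
    fit = LogisticRegression(C=1e6, max_iter=1000).fit(X[M], y[M])  # large C ~ unpenalized ML
    p_hat = fit.predict_proba(X[V])[:, 1]                           # equation (8)

    grid = np.arange(0.01, 1.00, 0.01)                              # 0.01, 0.02, ..., 0.99
    acc = [((p_hat >= c).astype(int) == y[V]).mean() for c in grid]
    p_star = grid[int(np.argmax(acc))]
    print("chosen cutoff p* =", round(p_star, 2))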

21 Utility Elicitation (continued)

We then cross-tabulate actual versus predicted death status in a 2 × 2 contingency table, rewarding and penalizing model I according to the numbers of patients in the validation sample which fall into the cells of the right-hand part of the following table.

                    Rewards and Penalties        Counts
                    Predicted                    Predicted
                    Died      Lived              Died     Lived
    Actual  Died    C_11      C_12               n_11     n_12
            Lived   C_21      C_22               n_21     n_22

The left-hand part of this table records the rewards and penalties in US$. The predictive utility of model I is then

    U_P(I) = Σ_{l=1}^2 Σ_{m=1}^2 C_lm n_lm.          (9)

To elicit the utility values C_lm we reason as follows.
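
A short Python sketch of equation (9): build the 2 × 2 actual-versus-predicted table and score it with the C_lm. The numerical C_lm used below are the elicited values quoted a few slides later, with the penalties entered as negative numbers; the tiny outcome vectors are invented.

    # Predictive utility U_P(I) = sum_lm C_lm * n_lm, rows/columns ordered (died, lived).
    import numpy as np

    def predictive_utility(y_actual, y_pred, C):
        n = np.zeros((2, 2))
        for a, p in zip(y_actual, y_pred):        # a, p in {1 = died, 0 = lived}
            n[1 - a, 1 - p] += 1                  # row 0 / col 0 correspond to "died"
        return float(np.sum(C * n))

    C = np.array([[34.8, -139.2],                 # actual died:  predicted died, predicted lived
                  [-69.6, 8.7]])                  # actual lived: predicted died, predicted lived
    y_act = np.array([1, 1, 0, 0, 0, 1])
    y_prd = np.array([1, 0, 0, 1, 0, 1])
    print(predictive_utility(y_act, y_prd, C))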

22 Utility Elicitation (continued)

(1) Clearly C_11 (the reward for correctly predicting death at 30 days) and C_22 (the reward for correctly predicting living at 30 days) should be positive, and C_12 (the penalty for a false prediction of living) and C_21 (the penalty for a false prediction of death) should be negative.

(2) Since it's easier to correctly predict that a person lives than dies with these data (the overall pneumonia 30-day death rate in the RAND sample was 16%, so a prediction that every patient lives would be right about 84% of the time), it's natural to specify that C_11 > C_22.

(3) Since it's arguably worse to label a bad hospital as good than the other way around, one should take |C_12| > |C_21|, and furthermore it's natural that the magnitudes of the penalties should exceed those of the rewards.

(4) We completed the utility specification by eliciting information from health experts in the U.S. and U.K., first to anchor C_21 to the cost of subjecting a good hospital to an unnecessary process audit and then to obtain ratios relating the other C_lm to C_21.

23 Utility Elicitation (continued)

Since the utility structure we use is based on the idea that hospitals have to be treated in an all-or-nothing way in acting on the basis of their apparent quality, the approach taken was (i) to quantify the monetary loss L of incorrectly subjecting a good hospital to a detailed but unnecessary process audit and then (ii) to translate this from the hospital to the patient level.

A rough correspondence may be made between the left-hand part of the contingency table above, at the patient level, and a hospital-level table with rows representing truth (bad in row 1, good in row 2) and columns representing the decision taken (process audit in column 1, no process audit in column 2). Unnecessary process audits then correspond to cell (2, 1) in these tables (hospitals where a process audit is not needed will typically have an excess of patients who are predicted to die but actually live).

Discussions with health experts in the U.S. and U.K. suggested that detailed process audits cost on the order of L = $5,000 per hospital (in late-1980s U.S. dollars), and RAND data indicated the mean number of pneumonia patients per hospital per year in the U.S. at the time of the RAND quality of care study.

24 Utility Elicitation (continued)

Dividing L = $5,000 by that mean annual patient count fixed C_21 at approximately $69.6 in magnitude. Our health experts judged that C_12 should be the largest in absolute value of the C_lm, and averaging across the expert opinions, expressed as orders of magnitude base 2, the elicitation results were C_12/C_21 = 2, C_11/C_21 = 1/2, and C_22/C_21 = 1/8, finally yielding (C_11, C_12, C_21, C_22) = $(34.8, −139.2, −69.6, 8.7).

The results in Fouskakis and Draper (2008) use these values; Draper and Fouskakis (2000) present a sensitivity analysis on the choice of the C_lm which demonstrates broad stability of the findings when the utility values mentioned above are perturbed in reasonable ways.

With the C_lm in hand, the overall expected utility function to be maximized over I is then simply

    E[U(I)] = E[U_D(I) + U_P(I)],          (10)

where this expectation is over all possible cross-validation splits of the data.

25 Results

The number of possible cross-validation splits is far too large to evaluate the expectation in (10) directly; in practice we therefore use Monte Carlo methods to evaluate it, averaging over N random modeling and validation splits.

Results. We explored this approach in two settings: a Small World created by focusing only on the p = 14 variables in the original RAND scale (2^14 = 16,384 is a small enough number of possible models to do brute-force enumeration of the estimated expected utility of all models), and the Big World defined by all p = 83 available predictors (2^83 ≈ 10^25 is far too large for brute-force enumeration; we compared a variety of stochastic optimization methods, including simulated annealing, genetic algorithms, and tabu search, on their ability to find good variable subsets).
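
A sketch of the Monte Carlo evaluation of E[U(I)] in (10) for one candidate subset I, averaging U_D(I) + U_P(I) over random modeling/validation splits; the data, costs and the fixed 0.5 cutoff are simplified placeholders, and scikit-learn again stands in for the ML fit.

    # Monte Carlo estimate of E[U(I)] over random cross-validation splits.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n, p = 1000, 6
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 1.5))))
    costs = np.array([0.5, 0.25, 1.75, 3.0, 0.4, 2.0])       # hypothetical costs, US$
    C = np.array([[34.8, -139.2], [-69.6, 8.7]])             # elicited rewards/penalties
    I = np.array([1, 1, 0, 0, 1, 0], dtype=bool)             # candidate predictor subset

    def split_utility(I, split_seed):
        r = np.random.default_rng(split_seed)
        V = r.permutation(n) < n // 2                        # random half as validation
        fit = LogisticRegression(C=1e6, max_iter=1000).fit(X[~V][:, I], y[~V])
        pred = (fit.predict_proba(X[V][:, I])[:, 1] >= 0.5).astype(int)
        tab = np.array([[np.sum((y[V] == a) & (pred == b)) for b in (1, 0)] for a in (1, 0)])
        U_P = np.sum(C * tab)                                # predictive utility, eq. (9)
        U_D = -V.sum() * costs[I].sum()                      # data-collection utility, eq. (6)
        return U_D + U_P

    N = 50
    print("estimated E[U(I)]:", np.mean([split_utility(I, s) for s in range(N)]))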

26 Results: Small World

[Plot: estimated expected utility versus number of variables in the model.]

The 20 best models included the same three variables 18 or more times out of 20, and never included six other variables; the five best models were minor variations on each other, and included 4–6 variables (last column in the table on page 16).

27 Approach (2)

The best models save almost $8 per patient over the full 14-variable model; this would amount to significant savings if the observed-versus-expected assessment method were applied widely.

Approach (2) (alternative cost-benefit). Maximizing expected utility, as in Approach (1) above, is a natural Bayesian way forward in this problem, but (a) the elicitation process was complicated and (b) the utility structure we examine is only one of a number of plausible alternatives, with utility framed from only one point of view; the broader question for a decision-theoretic approach is whose utility should drive the problem formulation.

It's well known (e.g., Arrow, 1963; Weerahandi and Zidek, 1981) that Bayesian decision theory can be problematic when used normatively for group decision-making, because of conflicts in preferences among members of the group; in the context of the problem addressed here, it can be difficult to identify a utility structure acceptable to all stakeholders (including patients, doctors, hospitals, citizen watchdog groups, and state and federal regulatory agencies) in the quality-of-care-assessment process.

28 Approach (2) (continued)

As an alternative, in Approach (2) we propose a prior distribution that accounts for the cost of each variable and results in a set of posterior model probabilities which correspond to a generalized cost-adjusted version of the Bayesian information criterion (BIC). This provides a principled approach to performing a cost-benefit trade-off that avoids ambiguities in the identification of an appropriate utility structure.

Details. Bayesian parametric model comparison and variable selection are based on specifying a model m, its likelihood f(y | θ_m, m), the prior distribution of the model parameters f(θ_m | m) and the corresponding prior model weight (or probability) f(m), where θ_m is the parameter vector under model m and y is the data vector.

Parametric inference is based on the posterior distribution f(θ_m | y, m), and quantifying model uncertainty by estimating the posterior model probability f(m | y) is also an important issue.

29 Parametric Model Comparison

Hence, when we consider a set of competing models M = {m_1, m_2, ..., m_|M|}, we focus on the posterior probability of model m ∈ M, defined as

    f(m | y) = f(y | m) f(m) / Σ_{m_l ∈ M} f(y | m_l) f(m_l)
             = [ Σ_{m_l ∈ M} PO_{m_l, m} ]^{−1}
             = [ Σ_{m_l ∈ M} B_{m_l, m} · f(m_l) / f(m) ]^{−1},          (11)

where PO_{m_i, m_j} = f(m_i | y) / f(m_j | y) is the posterior model odds and B_{m_i, m_j} is the Bayes factor for comparing models m_i and m_j.

When we limit ourselves to the comparison of only two models we typically focus on PO_{m_i, m_j} and B_{m_i, m_j}, which have the desirable property of insensitivity to the selection of the model space M. By definition the Bayes factor is the ratio of the posterior model odds to the prior model odds; thus large values of B_{m_i, m_j} (usually greater than 12, say) indicate strong posterior support of model m_i against model m_j.
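
A small numerical sketch of equation (11): converting Bayes factors (against a reference model) and prior model probabilities into posterior model probabilities. The Bayes factors and priors below are invented for illustration.

    # Posterior model probabilities from Bayes factors relative to model m1.
    import numpy as np

    models = ["m1", "m2", "m3"]
    B_vs_m1 = np.array([1.0, 4.0, 0.5])           # B_{m_l, m_1} for each m_l in M
    prior = np.array([1 / 3, 1 / 3, 1 / 3])       # prior model weights f(m_l)

    # f(m_l | y) is proportional to B_{m_l, m_1} * f(m_l); normalizing gives (11)
    post = B_vs_m1 * prior
    post /= post.sum()
    print(dict(zip(models, np.round(post, 3))))   # m2 gets 4x the posterior weight of m1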

30 Variable Selection in Logistic Regression

The posterior model probabilities and integrated likelihoods f(y | m_i) in (11) are rarely analytically tractable; we use a combination of Laplace approximations and Markov chain Monte Carlo (MCMC) methodology to approximate posterior odds and Bayes factors.

In the sickness-at-admission problem at issue here, we use a simple logistic regression model with response Y_i = 1 if patient i dies and 0 otherwise. We further denote by X_ij the sickness predictor variable j for patient i and by γ_j an indicator, often used in Bayesian variable selection problems, taking the value 1 if variable j is included in the model and 0 otherwise; thus in this case M = {0, 1}^p, where p is the total number of variables. In order to map the set of binary model indicators γ onto a model m we can use a representation of the form m(γ) = Σ_{i=1}^p 2^{i−1} γ_i.

Hence the model formulation can be summarized as

    (Y_i | γ) ~ independent Bernoulli[p_i(γ)],
    η_i(γ) = log[ p_i(γ) / (1 − p_i(γ)) ] = Σ_{j=0}^p β_j γ_j X_ij,          (12)

31 Prior on Model Parameters

or, in matrix form,

    η(γ) = X diag(γ) β = X_γ β_γ,

defining X_i0 = 1 for all i = 1, ..., n and γ_0 = 1 with prior probability one, since here the intercept is always included in all models.

Here p_i(γ) is the death probability (which may be thought of as the sickness score) for patient i under model γ, η(γ) = [η_1(γ), ..., η_n(γ)]^T, γ = (γ_0, γ_1, ..., γ_p)^T, β = (β_0, β_1, ..., β_p)^T, and X = (X_ij, i = 1, ..., n; j = 0, 1, ..., p); the vector β_γ stands for the subvector of β which is included in the model specified by γ, i.e., β_γ = (β_i : γ_i = 1, i = 0, 1, ..., p), and is equivalent to the θ_m vector defined above; similarly X_γ is the submatrix of X with columns corresponding to variables included in the model specified by γ.

Prior on model parameters. We proceed in two steps: (1) first we build a prior on β that is a modified version of the unit information prior for this problem (to avoid Lindley's paradox); then (2) we adjust this prior for differences in the marginal costs of the variables.

32 Sensitivity to Prior Variance

Step (1). One important problem in Bayesian model evaluation using posterior model probabilities is their sensitivity to the prior variance of the model parameters: large variance of the β_γ (used to represent prior ignorance) will increase the posterior probabilities of the simpler models considered in the model space M (Lindley's paradox).

We address this issue by using ideas proposed by Ntzoufras et al. (2003): we use a prior distribution of the form

    f(β_γ | γ) = N(μ_γ, Σ_γ)          (13)

with prior covariance matrix given by Σ_γ = n [I(β_γ)]^{−1}, where n is the total sample size and I(β_γ) is the information matrix I(β_γ) = X_γ^T W_γ X_γ; here W_γ is a diagonal matrix which in the Bernoulli case takes the form W_γ = diag{ p_i(γ) [1 − p_i(γ)] }.

33 Unit Information Prior

This is the unit information prior of Kass and Wasserman (1996), which corresponds to adding one data point to the data. Here we use this prior as a base, but we specify p_i(γ) in the information matrix according to our prior information; in this manner we avoid (even minimal) reuse of the data in the prior.

When little prior information is available, a reasonable prior mean for β_γ is μ_γ = 0. This corresponds to a prior mean on the log-odds scale of zero, from which a sensible prior estimate for all model probabilities is p_i(γ) = 1/2; with this choice (13) becomes

    f(β_γ | γ) = N( 0, 4n (X_γ^T X_γ)^{−1} ).          (14)

This prior distribution can also be motivated by combining the idea of imaginary data with the power prior approach of Chen et al. (2000); it turns out that (14) introduces additional information to the posterior equivalent to adding one data point to the likelihood, and therefore we support a priori the simplest model with a weight of one data point.
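
A small Python sketch of constructing the prior (14) for a given inclusion vector γ, with a simulated design matrix standing in for the sickness predictors.

    # Build the N(0, 4n (X_gamma^T X_gamma)^{-1}) prior for the included coefficients.
    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 500, 4
    X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # X_{i0} = 1 (intercept)
    gamma = np.array([1, 1, 0, 1, 0], dtype=bool)                     # gamma_0 = 1 always

    X_g = X_full[:, gamma]                                            # columns included by gamma
    prior_mean = np.zeros(X_g.shape[1])                               # mu_gamma = 0
    prior_cov = 4 * n * np.linalg.inv(X_g.T @ X_g)                    # unit-information scale
    print(prior_cov.shape, np.round(np.diag(prior_cov), 2))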

34 Laplace Approximation

Step (2). To introduce costs we again proceed in two sub-steps: (2a) first we specify a Laplace approximation (and the BIC approximation that corresponds to it) for the posterior model odds in our problem, using the prior in Step (1), and (2b) then we see how to adjust the approximations in Step (2a) to account for cost differences among the variables.

Step (2a). We denote by PO_kl the posterior odds of model γ^(k) versus model γ^(l); then we have

    −2 log PO_kl = −2 [ log f(γ^(k) | y) − log f(γ^(l) | y) ].          (15)

Following the approach of Raftery (1996), we can approximate the posterior distribution of a model γ using the following Laplace approximation:

    −2 log f(γ | y) = −2 log f(y | β̃_γ, γ) − 2 log f(β̃_γ | γ) − d_γ log(2π) − log |Ψ_γ| − 2 log f(γ) + O(n^{−1}),          (16)

35 Details

where β̃_γ is the posterior mode of f(β_γ | y, γ), d_γ = Σ_{j=0}^p γ_j is the dimension of the model γ, and Ψ_γ is minus the inverse of the Hessian matrix of h(β_γ) = log f(y | β_γ, γ) + log f(β_γ | γ), evaluated at the posterior mode β̃_γ.

Under the model formulation given by equation (12) and the prior distribution (14) we have that

    Ψ_γ^{−1} = − ∂² log f(y | β_γ, γ) / ∂β_γ ∂β_γ^T |_{β_γ = β̃_γ} − ∂² log f(β_γ | γ) / ∂β_γ ∂β_γ^T |_{β_γ = β̃_γ}
             = X_γ^T diag{ exp(X_{γ,i} β̃_γ) / [1 + exp(X_{γ,i} β̃_γ)]² + 1/(4n) } X_γ,          (17)

where X_{γ,i} is row i of the matrix X_γ, for i = 1, ..., n.

By substituting the prior (14) into expression (16) we get

    −2 log f(γ | y) = −2 log f(y | β̃_γ, γ) + φ(γ) − 2 log f(γ) + O(n^{−1}),          (18)

36 Penalized Log Likelihood Ratio

where

    φ(γ) = (1/(4n)) β̃_γ^T X_γ^T X_γ β̃_γ + d_γ log(4n) + log( |Ψ_γ^{−1}| / |X_γ^T X_γ| ).          (19)

From the above expression it's clear that the logarithm of a posterior model probability can be regarded as a penalized log-likelihood evaluated at the posterior mode of the model, in which the term φ(γ) − 2 log f(γ) can be interpreted as the penalty imposed upon the log-likelihood.

In pairwise model comparisons, we can directly use the posterior model odds (15), which can now be written as

    −2 log PO_kl = −2 log{ f(y | β̃_{γ^(k)}, γ^(k)) / f(y | β̃_{γ^(l)}, γ^(l)) } + φ(γ^(k)) − φ(γ^(l)) − 2 log[ f(γ^(k)) / f(γ^(l)) ] + O(n^{−1}).          (20)

Therefore the comparison of the two models is based on a penalized log-likelihood ratio, where the penalty is now given by

    ψ(γ^(k), γ^(l)) = φ(γ^(k)) − φ(γ^(l)) − 2 log[ f(γ^(k)) / f(γ^(l)) ].

37 Decomposing the Penalty Term

Each penalty term is divided into two parts: φ(γ) and −2 log f(γ). The first term, φ(γ), has its source in the marginal likelihood f(y | γ) of model γ and can be thought of as a measure of discrepancy between the data and the prior information for the model parameters; the second part comes from the prior model probabilities f(γ).

Indifference on the space of all models, usually expressed by the uniform distribution (i.e., f(γ) ∝ 1), eliminates the second term from the model comparison procedure, since the penalty term in (20) will then be based only on the difference of the first penalty terms, φ(γ^(k)) − φ(γ^(l)). For this reason the penalty term φ(γ) is the imposed penalty which appears in the penalized log-likelihood expression of the Bayes factor BF_kl with a uniform prior on model space.

A simpler but less accurate approximation of log PO_kl can be obtained following the arguments of Schwarz (1978):

38 BIC Approximation

    −2 log PO_kl = −2 log[ f(y | β̂_{γ^(k)}, γ^(k)) / f(y | β̂_{γ^(l)}, γ^(l)) ] + ( d_{γ^(k)} − d_{γ^(l)} ) log n − 2 log[ f(γ^(k)) / f(γ^(l)) ] + O(1)
                 = BIC_kl − 2 log[ f(γ^(k)) / f(γ^(l)) ] + O(1),          (21)

where BIC_kl is the Bayesian Information Criterion for choosing between models γ^(k) and γ^(l) and β̂_γ is the vector of maximum likelihood estimates of β_γ.

Since BIC_kl is an O(1) approximation, it might diverge from the exact value of the logarithm of the Bayes factor even for large samples; even so, it has often been shown to provide a reasonable measure of evidence (for finite n), and its straightforward calculation has encouraged its widespread use in practice.

Step (2b). From the above argument and equations (18) and (20), it's clear that an additional penalty can be directly imposed on the posterior model probabilities and odds via the prior model probabilities f(γ).
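
A sketch of the BIC comparison in (21) for two nested logistic models under a uniform prior on model space (so that −2 log PO_kl ≈ BIC_kl). Data are simulated, and scikit-learn's logistic regression with a very large C stands in for maximum likelihood.

    # BIC_kl = -2 * (maximized log-likelihood difference) + (d_k - d_l) * log n.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    n = 800
    X = rng.normal(size=(n, 3))
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.9 * X[:, 0] - 1.0))))

    def max_loglik(cols):
        """Maximized Bernoulli log-likelihood for the logistic model using columns `cols`."""
        fit = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, cols], y)
        p = fit.predict_proba(X[:, cols])[:, 1]
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)), len(cols) + 1   # +1 for intercept

    ll_k, d_k = max_loglik([0, 1, 2])     # model gamma^(k): all three predictors
    ll_l, d_l = max_loglik([0])           # model gamma^(l): predictor 1 only
    BIC_kl = -2 * (ll_k - ll_l) + (d_k - d_l) * np.log(n)
    print("BIC_kl =", round(BIC_kl, 2), "(negative values favor the larger model)")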

39 Cost Adjustment

Therefore we may use prior model probabilities to induce prior preferences for specific variables depending on their costs. For this reason we propose to use prior model probabilities of the form

    f(γ_j) ∝ exp[ −(γ_j / 2) ((c_j − c_0) / c_0) log n ]   for j = 1, ..., p,          (22)

where c_j is the marginal cost per observation for variable X_j and (as will be seen below) the desire for our approach to yield a cost-adjusted generalization of BIC compels the definition c_0 = min{c_j, j = 1, ..., p}.

We further assume that the constant term is included in all models, by specifying f(γ_0 = 1) = 1, resulting in

    −2 log f(γ) = Σ_{j=1}^p γ_j (c_j / c_0) log n − d_γ log n + 2 Σ_{j=1}^p log[ 1 + n^{−(c_j − c_0)/(2 c_0)} ].          (23)

If all variables have the same cost, or we're indifferent concerning cost, then we can set c_j = c_0 for j = 1, ..., p, which reduces to the uniform prior on model space (f(γ) ∝ 1) and posterior odds equal to the usual Bayes factor.

40 Cost Adjustment (continued)

When comparing two models γ^(k) and γ^(l), the additional penalty imposed on the log-likelihood ratio due to the cost-adjusted prior model probabilities is given by

    −2 log[ f(γ^(k)) / f(γ^(l)) ] = Σ_{j=1}^p ( γ_j^(k) − γ_j^(l) ) (c_j / c_0) log n − ( d_{γ^(k)} − d_{γ^(l)} ) log n
                                  = [ (C_{γ^(k)} − C_{γ^(l)}) / c_0 − ( d_{γ^(k)} − d_{γ^(l)} ) ] log n,          (24)

where C_γ = Σ_{j=1}^p γ_j c_j is the total cost of model γ; thus two models of the same dimension and cost will have the same prior weight.

In the simpler case where we compare two nested models that differ only in the status of variable j, the prior model ratio simplifies to

    −2 log[ f(γ_j = 1, γ_\j) / f(γ_j = 0, γ_\j) ] = ( c_j / c_0 − 1 ) log n,          (25)

where γ_\j is the vector γ excluding element γ_j.

41 Cost-Adjusted Laplace Approximation

The above expression can be viewed as a prior penalty for including variable j in the model, while the term ( c_j / c_0 − 1 ) can be interpreted as the proportional additional penalty imposed upon (−2 log BF) if the variable X_j is included in the model, due to its increased cost.

Using the prior model odds (24) in the approximate posterior model odds (20) we obtain

    −2 log PO_kl = −2 log[ f(y | β̃_{γ^(k)}, γ^(k)) / f(y | β̃_{γ^(l)}, γ^(l)) ] + ψ(γ^(k), γ^(l)) + O(n^{−1}),          (26)

where the penalty term is given by

    ψ(γ^(k), γ^(l)) = (1/(4n)) ( β̃_{γ^(k)}^T X_{γ^(k)}^T X_{γ^(k)} β̃_{γ^(k)} − β̃_{γ^(l)}^T X_{γ^(l)}^T X_{γ^(l)} β̃_{γ^(l)} )
                      + ( d_{γ^(k)} − d_{γ^(l)} ) log 4
                      + log[ ( |Ψ_{γ^(k)}^{−1}| / |X_{γ^(k)}^T X_{γ^(k)}| ) / ( |Ψ_{γ^(l)}^{−1}| / |X_{γ^(l)}^T X_{γ^(l)}| ) ]
                      + [ ( C_{γ^(k)} − C_{γ^(l)} ) / c_0 ] log n.          (27)

42 Cost-Adjusted BIC

Finally we consider the BIC-based approximation (21) to the logarithm of the posterior model odds with the prior model odds (24), yielding

    −2 log PO_kl = −2 log[ f(y | β̂_{γ^(k)}, γ^(k)) / f(y | β̂_{γ^(l)}, γ^(l)) ] + [ ( C_{γ^(k)} − C_{γ^(l)} ) / c_0 ] log n + O(1).          (28)

The penalty term d_γ log n of model γ used in (21) has been replaced in the above expression by the cost-dependent penalty c_0^{−1} C_γ log n; ignoring costs is equivalent to taking c_j = c_0 for all j, yielding c_0^{−1} C_γ = d_γ, the original BIC expression.

Therefore we may interpret the quantity log n as the imposed penalty for each variable included in the model γ when no costs are considered (or when costs are equal). Moreover, this baseline penalty term is inflated proportionally to the cost ratio c_j / c_0 for each variable X_j; for example, if the cost of a variable X_j is twice the minimum cost (c_j = 2 c_0), then the imposed penalty is equivalent to adding two variables with the minimum cost.
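
A small Python sketch of the cost-adjusted penalty in (28), showing how the usual d_γ log n charge grows when expensive variables are included; the costs are placeholders, not the case-study values.

    # Cost-adjusted BIC penalty: (C_gamma / c_0) * log n instead of d_gamma * log n.
    import numpy as np

    def cost_penalty(gamma, costs, n):
        """(C_gamma / c_0) * log n for the 0/1 inclusion vector gamma (intercept excluded)."""
        gamma, costs = np.asarray(gamma, bool), np.asarray(costs, float)
        c0 = costs.min()                               # c_0 = minimum cost over all p variables
        return costs[gamma].sum() / c0 * np.log(n)

    costs = np.array([0.5, 0.25, 1.75, 3.0])           # hypothetical marginal costs per patient
    n = 2532
    g_cheap = [1, 1, 0, 0]                             # two cheap variables
    g_rich = [1, 1, 1, 1]                              # all four variables
    print("penalty, cheap model:", round(cost_penalty(g_cheap, costs, n), 1))
    print("penalty, full model: ", round(cost_penalty(g_rich, costs, n), 1))
    # a model must improve -2 log-likelihood by at least the penalty difference to be preferred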

43 MCMC Implementation

For this reason, (28) can be considered as a cost-adjusted generalization of BIC when prior model probabilities of type (22) are adopted.

MCMC implementation. As noted earlier, in our quality of care study with p = 83 predictors there are on the order of 10^25 possible models. In such situations, sampling algorithms will not be able to estimate posterior model probabilities with high accuracy in a reasonable amount of CPU time, due to the large model space. For this reason, we implemented the following two-step method:

(1) First we use a model search tool to identify variables with high marginal posterior inclusion probabilities f(γ_j | y), and we create a reduced model space consisting only of those variables whose marginal probabilities are above a threshold value. According to Barbieri and Berger (2004), this method of selecting variables based on their marginal probabilities may lead to the identification of models with better predictive abilities than approaches based on maximizing posterior model probabilities.
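
A small Python sketch of the bookkeeping in step (1): estimate marginal inclusion probabilities from a matrix of visited models and keep the variables above the 0.3 threshold used in the case study; the "MCMC draws" below are fabricated Bernoulli noise, purely to show the mechanics.

    # Marginal posterior inclusion probabilities f(gamma_j = 1 | y) from sampled models.
    import numpy as np

    rng = np.random.default_rng(5)
    visited = rng.binomial(1, [0.9, 0.05, 0.6, 0.2, 0.45], size=(5000, 5))   # fake model draws
    incl_prob = visited.mean(axis=0)                 # proportion of draws including each variable
    keep = np.where(incl_prob > 0.3)[0]              # variables defining the reduced model space
    print("marginal inclusion probabilities:", np.round(incl_prob, 2))
    print("variables kept for the reduced model space:", keep)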

44 MCMC Implementation (continued)

Although Barbieri and Berger proposed 0.5 as a threshold value for f(γ_j = 1 | y), we used the lower value of 0.3, since our aim was only to identify and eliminate variables not contributing to models with high posterior probabilities.

(2) Then we use a model search tool in the reduced model space to estimate posterior model probabilities (and the corresponding odds).

To ensure stability of our findings we explored the use of two model search tools in step (1): a reversible-jump MCMC algorithm (RJMCMC), as implemented for variable selection in generalized linear models by Dellaportas et al. (2002) and Ntzoufras et al. (2003), and the MCMC model composition (MC³) algorithm (Madigan and York, 1995).

More specifically, we implemented reversible-jump moves within Gibbs for the model indicators γ_j, by proposing the new model to differ from the current one in each step by a single term j with probability one.

45 MCMC Implementation (continued)

The algorithm can be summarized as follows:

(1) For j = 1, ..., p, use RJMCMC to compare the current model γ with the proposed one γ′, with components γ′_j = 1 − γ_j and γ′_k = γ_k for k ≠ j with probability one; the updating sequence of the γ_j is randomly determined in each step.

(2) For j = 0, ..., p, if γ_j = 1 then generate the model parameter β_j from the corresponding posterior distribution f(β_j | β_\j, γ, y); otherwise set β_j = 0.

In our context the MC³ algorithm may be summarized by the following steps:

(1) For j = 1, ..., p, propose a move from the current model γ to a new one γ′, with components γ′_j = 1 − γ_j and γ′_k = γ_k for k ≠ j with probability one; the updating sequence of the γ_j is randomly determined in each step.

(2) Accept the proposed model γ′ with probability

    α = min[ 1, f(γ′ | y) / f(γ | y) ] = min( 1, PO_{γ′, γ} ).
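
A Python sketch of an MC³ scan of this form, with the acceptance probability driven by a model score meant to stand in for the cost-adjusted BIC value (−2 log f(y | β̂_γ, γ) + (C_γ/c_0) log n, up to a constant); a toy deterministic score is used so the sketch runs on its own.

    # MC^3 sweep over the inclusion indicators gamma, flipping one term at a time.
    import numpy as np

    def mc3_scan(gamma, score, rng, n_sweeps=100):
        """score(gamma) plays the role of -2 log f(gamma | y) up to a constant (smaller is better)."""
        gamma = np.asarray(gamma, dtype=int).copy()
        current = score(gamma)
        for _ in range(n_sweeps):
            for j in rng.permutation(len(gamma)):      # random updating sequence each sweep
                prop = gamma.copy()
                prop[j] = 1 - prop[j]                  # proposed model differs in a single term
                new = score(prop)
                # alpha = min(1, PO_{gamma', gamma}), with -2 log PO approximated by new - current
                if rng.random() < min(1.0, np.exp(-0.5 * (new - current))):
                    gamma, current = prop, new
        return gamma, current

    # toy score: pretend variables 0 and 2 are worth their cost, the others are not
    rng = np.random.default_rng(6)
    toy_score = lambda g: 10.0 * (g[1] + g[3]) - 8.0 * (g[0] + g[2])
    print(mc3_scan([0, 0, 0, 0], toy_score, rng))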

46 MCMC Implementation (continued)

Since the posterior model odds PO_{γ′,γ} used in MC³ are not analytically available here, we also explored two methods for calculating them, approximating the acceptance probabilities with the cost-adjusted Laplace approximation (equation 26) and with cost-adjusted BIC (equation 28); in addition we explored one further form of sensitivity analysis: initializing the MCMC runs at the null model (with no predictors) and at the full model (with all predictors). All of this was done both for the benefit-only analysis (specified by setting all variable costs equal) and for the cost-benefit approach.

In moving from the full to the reduced model space to implement step (1) of our two-step method, for both the benefit-only and cost-benefit analyses we found a striking level of agreement, across (a) the two model search tools, (b) the two methods for approximating the acceptance probabilities in MC³, and (c) the two choices for initializing the MCMC runs, in the subset of variables defining the reduced model space; this made it unnecessary to perform a similar sensitivity analysis in step (2).

47 Results

Results are therefore presented below only for RJMCMC (starting from the full model). Convergence of the RJMCMC algorithm was checked using ergodic mean plots of the marginal inclusion probabilities for the full model space and of the posterior model probabilities for the reduced space.

In what follows we refer to the cost-benefit results as RJMCMC, but we could equally well have used the term MC³ with cost-adjusted BIC (or just cost-adjusted BIC for short), because the results from the two methods were in such close agreement.

Results. The table below presents the marginal posterior probabilities of the variables that exceeded the threshold value of 0.30, in each of the benefit-only and cost-benefit analyses, together with their data-collection costs (in minutes of abstraction time rather than US$), in the Big World of all 83 predictors. In both the benefit-only and cost-benefit situations our methods reduced the initial list of p = 83 available candidates down to 13 predictors.

48 Results (continued)

[Table: for each variable (index and name), its data-collection cost and its marginal posterior inclusion probability under the benefit-only and cost-benefit analyses. The variables listed were: SBP Score, Age, Blood Urea Nitrogen, Apache II Coma Score, Shortness of Breath Day 1?, Septic Complications?, Initial Temperature, Heart Rate Day 1, Chest Pain Day 1?, Cardiomegaly Score, Hematologic History Score, Apache Respiratory Rate Score, Admission SBP, Respiratory Rate Day 1, Confusion Day 1?, Apache pH Score, Morbid + Comorbid Score, and Musculoskeletal Score.]

Note that the most expensive variables with high marginal posterior probabilities in the benefit-only analysis were absent from the set of promising variables in the cost-benefit analysis (e.g., Apache II Coma Score).

49 Results (continued)

Common variables in both analyses: X_1 + X_2 + X_3 + X_5 + X_12 + X_70.

[Table: the top models (k = 1, ..., 5) in the benefit-only analysis and the top models (k = 1, ..., 9) in the cost-benefit analysis, each described by the variables common to the best models within that analysis plus a small set of additional variables, together with each model's cost, posterior probability and posterior odds PO_1k relative to the best model.]


AMS 206: Bayesian Statistics. 2: Exchangeability and Conjugate Modeling AMS 206: Bayesian Statistics 2: Exchangeability and Conjugate Modeling David Draper Department of Applied Mathematics and Statistics University of California, Santa Cruz draper@ams.ucsc.edu www.ams.ucsc.edu/

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box 90251 Durham, NC 27708, USA Summary: Pre-experimental Frequentist error probabilities do not summarize

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

(1) Introduction to Bayesian statistics

(1) Introduction to Bayesian statistics Spring, 2018 A motivating example Student 1 will write down a number and then flip a coin If the flip is heads, they will honestly tell student 2 if the number is even or odd If the flip is tails, they

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter

More information

New Bayesian methods for model comparison

New Bayesian methods for model comparison Back to the future New Bayesian methods for model comparison Murray Aitkin murray.aitkin@unimelb.edu.au Department of Mathematics and Statistics The University of Melbourne Australia Bayesian Model Comparison

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Predicting AGI: What can we say when we know so little?

Predicting AGI: What can we say when we know so little? Predicting AGI: What can we say when we know so little? Fallenstein, Benja Mennen, Alex December 2, 2013 (Working Paper) 1 Time to taxi Our situation now looks fairly similar to our situation 20 years

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,

More information

Probability, Entropy, and Inference / More About Inference

Probability, Entropy, and Inference / More About Inference Probability, Entropy, and Inference / More About Inference Mário S. Alvim (msalvim@dcc.ufmg.br) Information Theory DCC-UFMG (2018/02) Mário S. Alvim (msalvim@dcc.ufmg.br) Probability, Entropy, and Inference

More information

Illustrating the Implicit BIC Prior. Richard Startz * revised June Abstract

Illustrating the Implicit BIC Prior. Richard Startz * revised June Abstract Illustrating the Implicit BIC Prior Richard Startz * revised June 2013 Abstract I show how to find the uniform prior implicit in using the Bayesian Information Criterion to consider a hypothesis about

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

Two-sample Categorical data: Testing

Two-sample Categorical data: Testing Two-sample Categorical data: Testing Patrick Breheny April 1 Patrick Breheny Introduction to Biostatistics (171:161) 1/28 Separate vs. paired samples Despite the fact that paired samples usually offer

More information

Two-sample Categorical data: Testing

Two-sample Categorical data: Testing Two-sample Categorical data: Testing Patrick Breheny October 29 Patrick Breheny Biostatistical Methods I (BIOS 5710) 1/22 Lister s experiment Introduction In the 1860s, Joseph Lister conducted a landmark

More information

CS 540: Machine Learning Lecture 2: Review of Probability & Statistics

CS 540: Machine Learning Lecture 2: Review of Probability & Statistics CS 540: Machine Learning Lecture 2: Review of Probability & Statistics AD January 2008 AD () January 2008 1 / 35 Outline Probability theory (PRML, Section 1.2) Statistics (PRML, Sections 2.1-2.4) AD ()

More information

Uni- and Bivariate Power

Uni- and Bivariate Power Uni- and Bivariate Power Copyright 2002, 2014, J. Toby Mordkoff Note that the relationship between risk and power is unidirectional. Power depends on risk, but risk is completely independent of power.

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

1 Hypothesis Testing and Model Selection

1 Hypothesis Testing and Model Selection A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 CS students: don t forget to re-register in CS-535D. Even if you just audit this course, please do register.

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Contents. Decision Making under Uncertainty 1. Meanings of uncertainty. Classical interpretation

Contents. Decision Making under Uncertainty 1. Meanings of uncertainty. Classical interpretation Contents Decision Making under Uncertainty 1 elearning resources Prof. Ahti Salo Helsinki University of Technology http://www.dm.hut.fi Meanings of uncertainty Interpretations of probability Biases in

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

Group Sequential Designs: Theory, Computation and Optimisation

Group Sequential Designs: Theory, Computation and Optimisation Group Sequential Designs: Theory, Computation and Optimisation Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj 8th International Conference

More information

Principles of Statistical Inference

Principles of Statistical Inference Principles of Statistical Inference Nancy Reid and David Cox August 30, 2013 Introduction Statistics needs a healthy interplay between theory and applications theory meaning Foundations, rather than theoretical

More information

Bayesian Statistics. State University of New York at Buffalo. From the SelectedWorks of Joseph Lucke. Joseph F. Lucke

Bayesian Statistics. State University of New York at Buffalo. From the SelectedWorks of Joseph Lucke. Joseph F. Lucke State University of New York at Buffalo From the SelectedWorks of Joseph Lucke 2009 Bayesian Statistics Joseph F. Lucke Available at: https://works.bepress.com/joseph_lucke/6/ Bayesian Statistics Joseph

More information

STAT 740: Testing & Model Selection

STAT 740: Testing & Model Selection STAT 740: Testing & Model Selection Timothy Hanson Department of Statistics, University of South Carolina Stat 740: Statistical Computing 1 / 34 Testing & model choice, likelihood-based A common way to

More information

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The

More information

Principles of Statistical Inference

Principles of Statistical Inference Principles of Statistical Inference Nancy Reid and David Cox August 30, 2013 Introduction Statistics needs a healthy interplay between theory and applications theory meaning Foundations, rather than theoretical

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Causal Inference with Big Data Sets

Causal Inference with Big Data Sets Causal Inference with Big Data Sets Marcelo Coca Perraillon University of Colorado AMC November 2016 1 / 1 Outlone Outline Big data Causal inference in economics and statistics Regression discontinuity

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru

More information

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from Topics in Data Analysis Steven N. Durlauf University of Wisconsin Lecture Notes : Decisions and Data In these notes, I describe some basic ideas in decision theory. theory is constructed from The Data:

More information

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007 MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Equivalence of random-effects and conditional likelihoods for matched case-control studies

Equivalence of random-effects and conditional likelihoods for matched case-control studies Equivalence of random-effects and conditional likelihoods for matched case-control studies Ken Rice MRC Biostatistics Unit, Cambridge, UK January 8 th 4 Motivation Study of genetic c-erbb- exposure and

More information

Whaddya know? Bayesian and Frequentist approaches to inverse problems

Whaddya know? Bayesian and Frequentist approaches to inverse problems Whaddya know? Bayesian and Frequentist approaches to inverse problems Philip B. Stark Department of Statistics University of California, Berkeley Inverse Problems: Practical Applications and Advanced Analysis

More information

arxiv: v1 [stat.ap] 27 Mar 2015

arxiv: v1 [stat.ap] 27 Mar 2015 Submitted to the Annals of Applied Statistics A NOTE ON THE SPECIFIC SOURCE IDENTIFICATION PROBLEM IN FORENSIC SCIENCE IN THE PRESENCE OF UNCERTAINTY ABOUT THE BACKGROUND POPULATION By Danica M. Ommen,

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Lecture 1: Bayesian Framework Basics

Lecture 1: Bayesian Framework Basics Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of

More information

Case Studies in Bayesian Data Science

Case Studies in Bayesian Data Science Case Studies in Bayesian Data Science 4: The Bootstrap as an Approximate BNP Method David Draper Department of Applied Mathematics and Statistics University of California, Santa Cruz draper@ucsc.edu Short

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

More information

Frequentist Statistics and Hypothesis Testing Spring

Frequentist Statistics and Hypothesis Testing Spring Frequentist Statistics and Hypothesis Testing 18.05 Spring 2018 http://xkcd.com/539/ Agenda Introduction to the frequentist way of life. What is a statistic? NHST ingredients; rejection regions Simple

More information

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )

More information

18.05 Practice Final Exam

18.05 Practice Final Exam No calculators. 18.05 Practice Final Exam Number of problems 16 concept questions, 16 problems. Simplifying expressions Unless asked to explicitly, you don t need to simplify complicated expressions. For

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

David Giles Bayesian Econometrics

David Giles Bayesian Econometrics David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

Just Enough Likelihood

Just Enough Likelihood Just Enough Likelihood Alan R. Rogers September 2, 2013 1. Introduction Statisticians have developed several methods for comparing hypotheses and for estimating parameters from data. Of these, the method

More information

Why Try Bayesian Methods? (Lecture 5)

Why Try Bayesian Methods? (Lecture 5) Why Try Bayesian Methods? (Lecture 5) Tom Loredo Dept. of Astronomy, Cornell University http://www.astro.cornell.edu/staff/loredo/bayes/ p.1/28 Today s Lecture Problems you avoid Ambiguity in what is random

More information