Incorporating cost in Bayesian variable selection, with application to cost-effective measurement of quality of health care


Incorporating cost in Bayesian variable selection, with application to cost-effective measurement of quality of health care

Dimitris Fouskakis, Department of Mathematics, School of Applied Mathematical and Physical Sciences, National Technical University of Athens, Athens, Greece; e-mail: fouskakis@math.ntua.gr.

Joint work with:
Ioannis Ntzoufras, Department of Statistics, Athens University of Economics and Business, Athens, Greece; e-mail: ntzoufras@aueb.gr
David Draper, Department of Applied Mathematics and Statistics, University of California, Santa Cruz, USA; e-mail: draper@ams.ucsc.edu

Presentation available at: www.math.ntua.gr/ fouskakis/conferences/hwu/hwu.pdf.

23rd September 2009: School of Mathematical and Computer Sciences, Heriot-Watt University

Synopsis
1. Motivation
2. Model Specification
3. Decision Theoretic Cost-Benefit Analysis
4. Bayesian Cost-Benefit Analysis
5. Utility versus Cost-Adjusted BIC
6. Discussion

1 Motivation

Health care quality measurements. Indirect method: input-output approach.
Construct a model on hospital outcomes (e.g., mortality within 30 days of admission) after adjusting for differences in inputs (sickness at admission).
Compare observed and expected outcomes to make inferences about health care quality.
Data collection costs are available for each variable (measured in minutes or monetary units).
We wish to incorporate cost into our analysis in order to reduce data collection costs while still obtaining a well-fitting model.

Available data

Data come from a major U.S. study conducted by the RAND Corporation, with n = 2,532 pneumonia patients (Keeler et al., 1990).
Response variable: mortality within 30 days of admission.
Covariates: p = 83 sickness indicators.
Goal: construct a sickness scale using a logistic regression model.
Benefit-only analysis (no costs): classical variable selection techniques to find an optimal subset of 10-20 indicators. The initial list of p = 83 sickness indicators was reduced to 14 significant predictors (Keeler et al., 1990).

The 14-Variable RAND Pneumonia Scale

The RAND admission sickness scale for pneumonia (p = 14 variables), with the marginal data collection cost per patient for each variable (in minutes of abstraction time):

Variable                                                      Cost (minutes)
 1  Systolic Blood Pressure Score (2-point scale)              0.5
 2  Age                                                        0.5
 3  Blood Urea Nitrogen                                        1.5
 4  APACHE II Coma Score (3-point scale)                       2.5
 5  Shortness of Breath Day 1 (yes, no)                        1.0
 6  Serum Albumin Score (3-point scale)                        1.5
 7  Respiratory Distress (yes, no)                             1.0
 8  Septic Complications (yes, no)                             3.0
 9  Prior Respiratory Failure (yes, no)                        2.0
10  Recently Hospitalized (yes, no)                            2.0
12  Initial Temperature                                        0.5
17  Chest X-ray Congestive Heart Failure Score (3-point scale) 2.5
18  Ambulatory Score (3-point scale)                           2.5
48  Total APACHE II Score (36-point scale)                    10.0

Two different approaches

The RAND benefit-only approach is sub-optimal: it does not consider differences in the cost of data collection among the available predictors.
We propose a cost-benefit analysis, in which variables are chosen only when they predict well enough given how much they cost to collect.
In problems such as this, where two desirable but competing criteria must be optimised jointly, there are two main ways to proceed:
(a) place both criteria on a common scale and optimise on that scale; or
(b) optimise one criterion subject to a bound on the other.

Three methods for solving this problem

(1) (strategy (a)) Draper and Fouskakis (2000) and Fouskakis and Draper (2002, 2008) proposed an approach to this problem based on Bayesian decision theory. They used stochastic optimisation methods to find (near-)optimal subsets of predictor variables that maximise an expected utility function which trades off data collection cost against predictive accuracy.
(2) (strategy (a)) In this work, as an alternative to (1), we propose a prior distribution that accounts for the cost of each variable and yields a set of posterior model probabilities corresponding to a generalized cost-adjusted version of the Bayesian Information Criterion (Fouskakis, Ntzoufras and Draper, 2009a).
(3) (strategy (b)) We also implement a cost-restriction-benefit analysis, in which the search is conducted only among models whose cost does not exceed a budgetary restriction (Fouskakis, Ntzoufras and Draper, 2009b), using a population-based trans-dimensional RJMCMC method.
Here we present results from methods (1) (decision-theoretic cost-benefit analysis) and (2) (Bayesian cost-benefit analysis).

2 Model Specification

Logistic regression model with Y_i = 1 if patient i dies within 30 days of admission.
X_ij: j-th sickness predictor variable for the i-th patient.
γ = (γ_1, ..., γ_p)^T; γ_j: binary indicator of the inclusion of variable X_j in the model.
Model space M = {0, 1}^p; p = total number of variables considered.
Hence the model formulation can be summarized as

(Y_i | γ) ~ Bernoulli(p_i(γ)), independently,

η_i(γ) = log[ p_i(γ) / (1 - p_i(γ)) ] = Σ_{j=0}^{p} β_j γ_j X_ij,

η(γ) = X diag(γ) β = X_γ β_γ.
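A minimal numpy sketch of this parameterisation (all names are hypothetical, and the intercept term from the j = 0 sum is omitted for simplicity):

```python
import numpy as np

def linear_predictor(X, beta, gamma):
    """eta(gamma) = X diag(gamma) beta: variables with gamma_j = 0
    contribute nothing to the linear predictor."""
    return X @ (gamma * beta)

def death_prob(X, beta, gamma):
    """Inverse-logit of the linear predictor: p_i(gamma)."""
    eta = linear_predictor(X, beta, gamma)
    return 1.0 / (1.0 + np.exp(-eta))

# Toy example: 3 patients, 2 candidate predictors, only the first included.
X = np.array([[1.0, 5.0], [2.0, 1.0], [0.0, 3.0]])
beta = np.array([0.7, -1.2])
gamma = np.array([1.0, 0.0])
p = death_prob(X, beta, gamma)
```

Excluding a variable via gamma is equivalent to fitting the sub-model in which its coefficient is fixed at zero.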

3 Decision Theoretic Cost-Benefit Analysis

Utility Elicitation (1)

We take a Bayesian decision-theoretic approach (based on maximisation of expected utility). The utility function has two components, quantifying data collection costs and predictive successes and failures.
Data-collection utility: there are p available sickness indicators X_j; γ_j = 1 if X_j is included in the subset (0 otherwise). Dividing the n patients at random into modeling and validation subsamples of sizes n_M and n_V, respectively, the data-collection utility associated with subset γ = (γ_1, ..., γ_p) for the patients in the validation subsample is

U_D(γ) = -n_V Σ_{j=1}^{p} c_j γ_j,   (1)

where c_j is the marginal cost per patient of data abstraction for variable j.

Utility Elicitation (2)

Predictive utility:
(1) Apply the logistic regression model, fitted on the modeling subsample, to the validation subsample to create predicted death probabilities p̂_i^γ for the patients, using the given predictor subset γ.
(2) Classify patient i in the validation subsample as predicted dead or alive according to whether p̂_i^γ exceeds or falls short of a cutoff p*, which is chosen (by searching on a discrete grid from 0.01 to 0.99 in steps of 0.01) to maximize the predictive accuracy of model γ.
We then cross-tabulate actual versus predicted death status in a 2x2 contingency table, rewarding and penalizing model γ according to the numbers of patients in the validation sample falling into the cells of the right-hand part of the following table.

Utility Elicitation (3)

                 Rewards and Penalties       Counts
                 Predicted   Predicted       Predicted   Predicted
                 Died        Lived           Died        Lived
Actual   Died    C_11        C_12            n_11        n_12
         Lived   C_21        C_22            n_21        n_22

The predictive utility of model γ is then

U_P(γ) = Σ_{l=1}^{2} Σ_{m=1}^{2} C_lm n_lm.   (2)

See Fouskakis and Draper (2008) for details on eliciting the utility values C_lm.
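The cutoff search in step (2) and the scoring in (2) can be sketched as follows. The reward/penalty values below are hypothetical placeholders, not the elicited C_lm of the paper:

```python
import numpy as np

def predictive_utility(p_hat, died, C):
    """Search cutoffs 0.01, ..., 0.99 in steps of 0.01; at each cutoff,
    cross-tabulate actual vs predicted death status and keep the utility
    sum(C * n) at the accuracy-maximising cutoff.
    Rows of C and n: actual died / lived; columns: predicted died / lived."""
    best_acc, best_util = -1.0, None
    for cut in np.arange(0.01, 1.00, 0.01):
        pred_dead = p_hat >= cut
        n = np.array([
            [np.sum(died & pred_dead),  np.sum(died & ~pred_dead)],
            [np.sum(~died & pred_dead), np.sum(~died & ~pred_dead)],
        ])
        acc = (n[0, 0] + n[1, 1]) / len(died)
        if acc > best_acc:
            best_acc, best_util = acc, float(np.sum(C * n))
    return best_util

# Hypothetical rewards/penalties: +1 for each correct call, -1 for each error.
C = np.array([[1.0, -1.0], [-1.0, 1.0]])
p_hat = np.array([0.9, 0.8, 0.2, 0.1])
died = np.array([True, True, False, False])
u = predictive_utility(p_hat, died, C)
```

With the perfectly separated toy data above, some cutoff classifies all four patients correctly, so the utility is the full reward 4.0.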

Utility Elicitation (4)

The overall expected utility function to be maximised over γ is then simply

E[U(γ)] = E[U_D(γ) + U_P(γ)],   (3)

where the expectation is over all possible cross-validation splits of the data.
The number of possible cross-validation splits is far too large to evaluate the expectation in (3) directly; in practice we therefore use Monte Carlo methods, averaging over N random modeling and validation splits.
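The Monte Carlo averaging over splits can be sketched as below. This is a toy stand-in: the predictive-utility function is a stub, all costs are hypothetical, and the data-collection cost enters the utility with a negative sign:

```python
import numpy as np

rng = np.random.default_rng(0)

def data_collection_utility(gamma, costs, n_V):
    """U_D(gamma) = -n_V * sum_j c_j gamma_j: every validation patient
    pays the data-abstraction cost of each included variable."""
    return -n_V * float(np.sum(costs * gamma))

def expected_utility(gamma, costs, n, n_M, N, predictive_utility):
    """Monte Carlo estimate of E[U_D(gamma) + U_P(gamma)], averaging over
    N random modeling/validation splits of the n patients."""
    n_V = n - n_M
    total = 0.0
    for _ in range(N):
        idx = rng.permutation(n)
        model_idx, valid_idx = idx[:n_M], idx[n_M:]
        total += (data_collection_utility(gamma, costs, n_V)
                  + predictive_utility(gamma, model_idx, valid_idx))
    return total / N

# Exercise the loop with a constant stub in place of the real U_P:
costs = np.array([0.5, 2.0, 1.5])
gamma = np.array([1.0, 0.0, 1.0])
eu = expected_utility(gamma, costs, n=20, n_M=10, N=5,
                      predictive_utility=lambda g, m, v: 3.0)
```

In a real run the stub would be replaced by a function that refits the logistic regression on each modeling split and scores it on the corresponding validation split.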

Results

We explored this approach in two settings. The first is a Small World created by focusing only on the p = 14 variables in the original RAND scale (2^14 = 16,384 is a small enough number of possible models for brute-force enumeration of the estimated expected utility of all models).
The RAND scale is nowhere near optimal when data collection costs are considered along with predictive accuracy.

[Figure: estimated expected utility (roughly -16 to -8) plotted against number of variables (0 to 14).]

The best model in this case contains the following 4 variables: 1. Systolic Blood Pressure (X_1), 2. Blood Urea Nitrogen (X_3), 3. APACHE II Coma Score (X_4) and 4. Shortness of Breath Day 1 Score (X_5).
The 20 best models include the same 3 variables 18 or more times out of 20, and never include 6 other variables; the 5 best models are minor variations on each other, and include 4-6 variables.
The best models save almost $8 per patient over the full 14-variable model (significant savings if the input-output approach is applied widely).
The second setting is the Big World defined by all p = 83 available predictors (2^83 ≈ 10^25 is far too large for brute-force enumeration; we compared a variety of stochastic optimisation methods, including simulated annealing, genetic algorithms, and tabu search, on their ability to find good variable subsets).

Drawback of the decision-theoretic approach

Maximising expected utility, as above, is a natural Bayesian way forward in this problem, but (a) the elicitation process was complicated and difficult, and (b) the utility structure we examine is only one of a number of plausible alternatives, with utility framed from only one point of view; the broader question for a decision-theoretic approach is whose utility should drive the problem formulation.
It is well known (e.g., Arrow, 1963; Weerahandi and Zidek, 1983) that Bayesian decision theory can be problematic when used normatively for group decision-making, because of conflicts in preferences among members of the group; in the context of the problem addressed here, it can be difficult to identify a utility structure acceptable to all stakeholders (including patients, doctors, hospitals, citizen watchdog groups, and state and federal regulatory agencies) in the quality-of-care-assessment process.

4 Bayesian Cost-Benefit Analysis

The aim is to identify well-fitting models after taking into account the cost of each variable. Therefore we need to estimate the posterior model probabilities

f(γ | y) = f(γ) ∫ f(y | β_γ, γ) f(β_γ | γ) dβ_γ / [ Σ_{γ' ∈ {0,1}^p} f(γ') ∫ f(y | β_γ', γ') f(β_γ' | γ') dβ_γ' ]

after introducing a prior f(γ) on the model space that depends on the cost.

4.1 Preliminaries: Posterior model odds and penalty functions

Information criteria (1)

Information criterion for model γ:

IC(γ) = -2 log f(y | β̂_γ, γ) + d_γ F,

where f(y | β̂_γ, γ) is the maximised likelihood, d_γ is the dimension of the model (number of parameters), and F is the penalty for each model parameter used/estimated; d_γ F is the total penalty applied to the maximised likelihood for using a model with d_γ parameters.
The model with minimum IC is indicated as the best. The above criterion is a penalised likelihood measure.

Information criteria (2)

When comparing two models γ^(k) and γ^(l),

IC_kl = IC(γ^(k)) - IC(γ^(l))
      = -2 log [ f(y | β̂_γ^(k), γ^(k)) / f(y | β̂_γ^(l), γ^(l)) ] + ( d_γ^(k) - d_γ^(l) ) F
      = Deviance_kl + ( d_γ^(k) - d_γ^(l) ) F.

We select model γ^(k) if IC_kl < 0, and model γ^(l) if IC_kl > 0.

Posterior model probabilities and information criteria

The posterior model probability of a model γ is given by

f(γ | y) ∝ f(y | γ) f(γ),

where f(y | γ) = ∫ f(y | β_γ, γ) f(β_γ | γ) dβ_γ is the marginal likelihood of model γ and f(γ) is the prior probability of model γ. This can be rewritten as

-2 log f(γ | y) = -2 log f(y | γ) + [ -2 log f(γ) ] + constant,

which parallels IC(γ) = -2 log f(y | β̂_γ, γ) + d_γ F.

Posterior model odds and information criteria

Similarly, consider the posterior odds of model γ^(k) versus model γ^(l):

PO_kl = f(γ^(k) | y) / f(γ^(l) | y) = [ f(y | γ^(k)) / f(y | γ^(l)) ] × [ f(γ^(k)) / f(γ^(l)) ] = B_kl × PrO_kl,

where B_kl is the Bayes factor of model γ^(k) versus model γ^(l) (the ratio of marginal likelihoods) and PrO_kl is the prior odds of model γ^(k) versus model γ^(l). This can be rewritten as

-2 log PO_kl = -2 log B_kl + [ -2 log PrO_kl ],

which parallels IC_kl = -2 log [ f(y | β̂_γ^(k), γ^(k)) / f(y | β̂_γ^(l), γ^(l)) ] + ( d_γ^(k) - d_γ^(l) ) F.

Uniform prior on model space

If the prior model probabilities are defined via a decreasing function of the model dimension, then the prior model odds term

ξ_kl = -2 log PrO_kl = -2 log [ f(γ^(k)) / f(γ^(l)) ]

can also be interpreted as the extra penalty imposed on the Bayes factor.
If the (usual) uniform prior distribution is used, then ξ_kl = 0 and PO_kl = B_kl for all models γ^(k), γ^(l) ∈ M, where M is the set of all models under consideration (the model space).
A Bayesian benefit-only analysis can thus be performed by using the uniform prior on the model space and basing the variable selection procedure on Bayes factors.

Prior model odds interpretation

A well-known rough approximation of -2 log B_kl (Schwarz, 1978) gives

-2 log B_kl = BIC_kl + O(1), so that

-2 log PO_kl = BIC_kl + ξ_kl + O(1)
             = Deviance_kl + ( d_γ^(k) - d_γ^(l) ) log n + ξ_kl + O(1),   (4)

where BIC_kl is the Bayesian Information Criterion difference (e.g., Kass and Wasserman, 1996; Raftery, 1995, 1996) for choosing between models γ^(k) and γ^(l).
The BIC penalty equals F = log n for each parameter used. The overall (posterior) penalty imposed on the deviance measure is therefore ( d_γ^(k) - d_γ^(l) ) log n + ξ_kl.
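A sketch of approximation (4) as a one-line comparison function. The deviances plugged in below are hypothetical values for two nested models, and the function name is mine:

```python
import math

def neg2_log_po(deviance_k, deviance_l, d_k, d_l, n, xi_kl=0.0):
    """Approximation (4): -2 log PO_kl ≈ Deviance_kl + (d_k - d_l) log n + xi_kl.
    Negative values favour model k, positive values favour model l."""
    return (deviance_k - deviance_l) + (d_k - d_l) * math.log(n) + xi_kl

# With a uniform model prior (xi_kl = 0) this is just a BIC comparison:
# here model k has a lower deviance but one extra parameter.
val = neg2_log_po(deviance_k=1553.2, deviance_l=1564.5, d_k=13, d_l=12, n=2532)
```

With n = 2,532 the per-parameter penalty log n ≈ 7.8, so the deviance drop of 11.3 outweighs one extra parameter and model k is preferred.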

4.2 Prior distributions

Prior on model parameters:

β_γ | γ ~ Normal( 0, 4n (X_γ^T X_γ)^(-1) ).

This is the low-information prior defined by Ntzoufras, Dellaportas and Forster (2003). It can be derived using the power prior of Chen et al. (2000) with imaginary data supporting the simplest model included in our model space; it gives the prior weight equal to one data point. It is equivalent to Zellner's g-prior (with g = 4n) used for normal regression models.

A cost-penalised prior on model space (1): Preliminaries

We propose to specify our prior model probabilities via cost-dependent penalties for each variable.
We denote by c_j the cost of covariate X_j and by c = (c_1, c_2, ..., c_p) the vector of costs of all variables under consideration.
To specify this prior we define a baseline cost c_0, assumed to be a low acceptable cost for collecting the data of a covariate. The cost of each variable can then be written as c_j = k_j c_0.
For the Bayesian benefit-only analysis we use a uniform prior on the model space.

A cost-penalised prior on model space (2): The five criteria

We specify our prior distribution on γ to satisfy the following five criteria:
(a) the prior must be unaffected by transformations c → α c with α > 0, so that conversion between time and money, or between different monetary units (e.g., dollars and euros), leaves the prior unchanged;
(b) the extra penalty ξ_1 for adding a variable X_j with baseline cost c_0 is zero;
(c) the extra penalty ξ_2 for adding a variable X_j with cost c_j = κ c_0 for some κ > 1 equals the BIC penalty of (κ - 1) variables with cost c_0;
(d) the extra penalty ξ_3 for adding any variable X_j is greater than or equal to zero; and
(e) if all the variables have the same cost, then the prior must reduce to the uniform prior on γ.

A cost-penalised prior on model space (3): The five criteria, interpreted

(a) ensures that the prior is invariant with respect to the manner in which cost is measured.
(b) ensures that the penalty for adding a variable X_j with baseline cost c_0 is the same as in the benefit-only analysis.
(c) ensures that the posterior model odds will still have BIC-like behavior; the induced extra penalty equals the relative difference between the cost of X_j and that of a variable with cost c_0.
(d) ensures that the cost-benefit analysis will support more parsimonious models than the corresponding ones supported by the benefit-only analysis.
(e) requires that our prior reproduce the benefit-only analysis if all costs are equal.

A cost-penalised prior on model space (4): The prior

The following theorem provides the only prior that meets the above five requirements, and defines the choice of c_0.

Theorem 1. If a prior distribution f(γ) satisfies requirements (a)-(e) above, then it must be of the form

f(γ_j) ∝ exp[ -(γ_j / 2) ( c_j / c_0 - 1 ) log n ], for j = 1, ..., p,   (5)

where c_j is the marginal cost per observation for variable X_j and c_0 = min{ c_j, j = 1, ..., p }.

For the proof see Fouskakis, Ntzoufras and Draper (2009a).
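A small sketch of the prior in (5), normalised into a per-variable prior inclusion probability (function names are mine):

```python
import math

def cost_prior_log_weight(gamma_j, c_j, c_0, n):
    """Log of the unnormalised prior (5):
    f(gamma_j) ∝ exp[-(gamma_j / 2)(c_j / c_0 - 1) log n]."""
    return -(gamma_j / 2.0) * (c_j / c_0 - 1.0) * math.log(n)

def cost_prior_prob(c_j, c_0, n):
    """Normalised prior inclusion probability f(gamma_j = 1)."""
    w1 = math.exp(cost_prior_log_weight(1, c_j, c_0, n))
    w0 = math.exp(cost_prior_log_weight(0, c_j, c_0, n))
    return w1 / (w0 + w1)

# A baseline-cost variable (c_j = c_0) gets prior inclusion probability 1/2,
# as in the uniform benefit-only prior; dearer variables are penalised a priori.
p_base = cost_prior_prob(c_j=0.5, c_0=0.5, n=2532)
p_dear = cost_prior_prob(c_j=2.5, c_0=0.5, n=2532)
```

Note that the implied extra penalty -2 log[f(γ_j = 1)/f(γ_j = 0)] = (c_j/c_0 - 1) log n reproduces criterion (c) exactly.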

4.3 Posterior model odds: Cost-adjusted generalisation of BIC

Under the above prior, if we consider the BIC-based approximation (4), then

-2 log PO_kl = -2 log [ f(y | β̂_γ^(k), γ^(k)) / f(y | β̂_γ^(l), γ^(l)) ] + [ ( C_γ^(k) - C_γ^(l) ) / c_0 ] log n + O(1),   (6)

where C_γ = Σ_{j=1}^{p} γ_j c_j is the cost of model γ.

The penalty term d_γ log n of model γ used in (4) has been replaced in the above expression by the cost-dependent penalty c_0^(-1) C_γ log n. Ignoring costs is equivalent to setting c_j = c_0 for all j, yielding c_0^(-1) C_γ = d_γ, the original BIC expression.
We may interpret log n as the penalty imposed for each variable included in the model when no costs are considered. This baseline penalty term is inflated proportionally to the cost ratio c_j / c_0 for each X_j; for example, if the cost of a variable X_j is twice the minimum cost (c_j = 2 c_0), then the imposed penalty is equivalent to adding two variables with the minimum cost.
For all these reasons, (6) can be considered a cost-adjusted generalisation of BIC when prior model probabilities of type (5) are adopted.
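The cost-adjusted penalty in (6) can be sketched per model as follows (hypothetical deviance and costs, just to show the reduction to ordinary BIC):

```python
import math

def cost_adjusted_bic(deviance, gamma, costs, c_0, n):
    """Deviance plus the cost-dependent penalty (C_gamma / c_0) * log n of (6),
    where C_gamma = sum_j gamma_j c_j is the model's data-collection cost."""
    C_gamma = sum(c * g for c, g in zip(costs, gamma))
    return deviance + (C_gamma / c_0) * math.log(n)

# With equal costs c_j = c_0 the penalty collapses to the ordinary BIC
# penalty d_gamma * log n (here d_gamma = 2 included variables):
bic_like = cost_adjusted_bic(1600.0, gamma=[1, 1, 0], costs=[0.5, 0.5, 0.5],
                             c_0=0.5, n=2532)
```

Comparing two models by differencing this quantity reproduces the right-hand side of (6) up to the O(1) term.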

4.4 Implementation and results

Implementation details: the procedure
1. Run RJMCMC (Green, 1995) for 100K iterations in the full model space.
2. Eliminate non-important variables (those with marginal inclusion probabilities < 0.30), forming a new, reduced model space.
3. Run RJMCMC for 100K iterations in the reduced model space to estimate posterior model odds and the best models.

Two setups:
1. Benefit-only analysis (uniform prior on model space).
2. Cost-benefit analysis (cost-penalised prior on model space).
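The flavour of such a model-space search can be illustrated with a toy single-flip Metropolis sampler over the indicator vector. This is purely illustrative, not the RJMCMC of Green (1995) used on the slides, and the score function below is hypothetical:

```python
import math
import random

random.seed(1)

def toy_model_search(neg2_log_post, p, iters=2000):
    """Toy add/delete Metropolis sampler over gamma in {0,1}^p: propose
    flipping one randomly chosen inclusion indicator, and accept with
    probability exp((current - candidate)/2), since the supplied score
    plays the role of -2 log posterior. Returns estimated marginal
    inclusion probabilities."""
    gamma = [0] * p
    current = neg2_log_post(gamma)
    counts = [0] * p
    for _ in range(iters):
        j = random.randrange(p)
        proposal = gamma.copy()
        proposal[j] = 1 - proposal[j]
        candidate = neg2_log_post(proposal)
        if random.random() < math.exp(min(0.0, (current - candidate) / 2.0)):
            gamma, current = proposal, candidate
        for k in range(p):
            counts[k] += gamma[k]
    return [c / iters for c in counts]

# Hypothetical -2 log posterior: including variable 0 improves the score
# by 20, including variable 1 worsens it by 20.
def score(gamma):
    return -20.0 * gamma[0] + 20.0 * gamma[1]

probs = toy_model_search(score, p=2)
```

In practice the score would be, e.g., the cost-adjusted BIC of (6), and trans-dimensional moves would be handled by RJMCMC rather than simple flips.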

Preliminary Results: Marginal Probabilities f(γ_j = 1 | y)

Index  Variable Name                        Cost   Benefit Analysis   Cost-Benefit Analysis
 1     Systolic Blood Pressure (SBP) Score  0.50   0.99               0.99
 2     Age                                  0.50   0.99               0.99
 3     Blood Urea Nitrogen                  1.50   1.00               0.99
 4     Apache II Coma Score                 2.50   1.00               -
 5     Shortness of Breath Day 1            1.00   0.97               0.79
 8     Septic Complications                 3.00   0.88               -
12     Initial Temperature                  0.50   0.98               0.96
13     Heart Rate Day 1                     0.50   -                  0.34
14     Chest Pain Day 1                     0.50   -                  0.39
15     Cardiomegaly Score                   1.50   0.71               -
27     Hematologic History Score            1.50   0.45               -
37     Apache Respiratory Rate Score        1.00   0.95               0.32
46     Admission SBP                        0.50   0.68               0.90
49     Respiratory Rate Day 1               0.50   -                  0.81
51     Confusion Day 1                      0.50   -                  0.95
70     Apache pH Score                      1.00   0.98               0.98
73     Morbid + Comorbid Score              7.50   0.96               -
78     Musculoskeletal Score                1.00   -                  0.54

Number of variables: 13 (benefit), 13 (cost-benefit).

Reduced Model Space: Posterior Model Probabilities/Odds

Common variables in both analyses: X_1 + X_2 + X_3 + X_5 + X_12 + X_70

Benefit-Only Analysis (common variables within this analysis: X_4 + X_15 + X_37 + X_73)
k   Additional Variables       Model Cost   Posterior Probability   PO_1k
1   +X_8 +X_27 +X_46           22.5         0.3066                  1.00
2   +X_8 +X_27                 22.0         0.1969                  1.56
3   +X_8                       20.5         0.1833                  1.67
4   +X_27 +X_46                19.5         0.0763                  4.02
5   (none)                     17.5         0.0383                  8.00

Cost-Benefit Analysis (common variables within this analysis: X_46 + X_51)
k   Additional Variables       Model Cost   Posterior Probability   PO_1k
1   +X_49 +X_78                7.5          0.1460                  1.00
2   +X_14 +X_49 +X_78          7.5          0.1168                  1.27
3   +X_13 +X_49 +X_78          7.5          0.0866                  1.69
4   +X_13 +X_14 +X_49 +X_78    8.0          0.0665                  2.20
5   +X_14 +X_49                7.0          0.0461                  3.17
6   +X_49                      6.5          0.0409                  3.57
7   +X_37 +X_78                7.5          0.0382                  3.82
8   +X_13 +X_14 +X_49          7.5          0.0369                  3.96
9   +X_13                      6.5          0.0344                  4.25

Only models with posterior probability above 3% are shown. PO_1k is the posterior odds of the best model within each analysis versus the current model k.

Reduced Model Space: Comparisons

Comparison of measures of fit, cost and dimensionality between the best models in the reduced model space of the benefit-only and cost-benefit analyses; percentage difference is relative to benefit-only.

                    Benefit-Only   Cost-Benefit   Difference (%)
Minimum Deviance    1553.2         1635.8         +5.3
Median Deviance     1564.5         1644.8         +5.1
Cost                22.5           7.5            -66.7
Dimension           13             10             -23.1

5 Utility versus Cost-Adjusted BIC

Index  Variable Name                                Cost (minutes)   RJMCMC Posterior Probability
 1     Systolic Blood Pressure Score                 0.5              0.99
 2     Age                                           0.5              0.99
 3     Blood Urea Nitrogen                           1.5              1.00
 4     APACHE II Coma Score                          2.5              1.00
 5     Shortness of Breath Day 1 (yes, no)           1.0              0.99
 6     Serum Albumin Score (3-point scale)           1.5              0.55
 7     Respiratory Distress (yes, no)                1.0              0.92
 8     Septic Complications (yes, no)                3.0              0.00
 9     Prior Respiratory Failure (yes, no)           2.0              0.00
10     Recently Hospitalized (yes, no)               2.0              0.00
12     Initial Temperature                           0.5              0.95
17     Chest X-ray Congestive Heart Failure Score    2.5              0.00
18     Ambulatory Score                              2.5              0.00
48     Total APACHE II Score                        10.0              0.00

[The original table also flags, per variable, whether the Utility and RJMCMC methods each rated it as good; those marks are not recoverable from the transcription.]

It is clear that the Utility and Cost-Adjusted BIC approaches reach nearly identical conclusions in the Small World of p = 14 predictors.

With p = 83 the agreement between the two methods is also strong (although not as strong as with p = 14): using a star system for variable importance given in Fouskakis, Ntzoufras and Draper (2009a), 60 variables were ignored by both methods, 8 variables had identical star patterns, 3 variables were chosen as important by both methods but with different star patterns, 10 variables were marked as important by the utility approach and not by RJMCMC, and 2 variables were singled out by RJMCMC and not by utility; thus the two methods substantially agreed on the importance of 71 (86%) of the 83 variables.

p    Method    Model                                             Cost   Median Deviance   LS CV
14             X_1+X_2+X_3+X_4+X_5+X_6+X_7+X_12                  9.0    1654              0.329
14   RJMCMC    X_1+X_2+X_3+X_4+X_5+X_7+X_12                      7.5    1676              0.333
14   Utility   X_1+X_3+X_4+X_5                                   5.5    1726              0.342
83   RJMCMC    X_1+X_2+X_3+X_5+X_12+X_46+X_49+X_51+X_70+X_78     7.5    1645              0.327
83   Utility   X_1+X_3+X_4+X_12+X_46+X_49+X_57                   6.5    1693              0.336

To the extent that the two methods differ, the utility method favors models that cost somewhat less but also predict somewhat less well.

6 Discussion

The fact that the two methods may yield somewhat different results in high-dimensional problems does not mean that either is wrong; they are both valid solutions to similar but not identical problems.
Both methods lead to noticeably better models (in a cost-benefit sense) than frequentist or Bayesian benefit-only approaches when, as is often the case, cost is an issue that must be included in the problem formulation to arrive at a policy-relevant solution.
In comparing two or more models, to say whether one is better than another I have to face the question: better for what purpose? This makes model specification a decision problem: I need either (a) to elicit a utility structure that is specific to the goals of the current study and maximise expected utility to find the best models, or (b) (if (a) is too hard, e.g., because the problem has a group-decision character) to look for a principled alternative (like the cost-adjusted BIC method described here) that approximates the utility approach while avoiding the ambiguities of utility specification.

Authors' related work

Draper D, Fouskakis D (2000). A case study of stochastic optimization in health policy: problem formulation and preliminary results. Journal of Global Optimization, 18, 399-416.
Fouskakis D, Draper D (2002). Stochastic optimization: a review. International Statistical Review, 70, 315-349.
Fouskakis D, Draper D (2008). Comparing stochastic optimization methods for variable selection in binary outcome prediction, with application to health policy. Journal of the American Statistical Association, 103, 1367-1381.
Fouskakis D, Ntzoufras I, Draper D (2009a). Bayesian variable selection using cost-adjusted BIC, with application to cost-effective measurement of quality of health care. Annals of Applied Statistics, 3, 663-690.
Fouskakis D, Ntzoufras I, Draper D (2009b). Population-based reversible jump MCMC for Bayesian variable selection and evaluation under cost limit restrictions. Journal of the Royal Statistical Society C (Applied Statistics), 58, 383-403.

Additional References

Arrow KJ (1963). Social Choice and Individual Values. Wiley, New York.
Chen MH, Ibrahim JG, Shao QM (2000). Power prior distributions for generalized linear models. Journal of Statistical Planning and Inference, 84, 121-137.
Green P (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732.
Keeler E, Kahn K, Draper D, Sherwood M, Rubenstein L, Reinisch E, Kosecoff J, Brook R (1990). Changes in sickness at admission following the introduction of the Prospective Payment System. Journal of the American Medical Association, 264, 1962-1968.
Ntzoufras I, Dellaportas P, Forster JJ (2003). Bayesian variable and link determination for generalized linear models. Journal of Statistical Planning and Inference, 111, 165-180.
Weerahandi S, Zidek JV (1983). Elements of multi-Bayesian decision theory. Annals of Statistics, 11, 1032-1046.