A MODERN STATISTICAL APPROACH TO QUALITY IMPROVEMENT IN HEALTH CARE USING QUANTILE REGRESSION
JARROD E. DALTON


A MODERN STATISTICAL APPROACH TO QUALITY IMPROVEMENT IN HEALTH CARE USING QUANTILE REGRESSION

by

JARROD E. DALTON

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Ralph O'Brien

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

January, 2013

We hereby approve the thesis/dissertation of Jarrod E. Dalton, candidate for the Doctor of Philosophy degree*.

Ralph O'Brien, PhD (chair, PhD thesis advisor)
Mark Schluchter, PhD
Tomas Radivoyevitch, PhD
John Barnard, PhD
Abdus Sattar, PhD
Robert Elston, PhD

July 24, 2012

*We also certify that written approval has been obtained for any proprietary material contained therein.

To my parents William and Connie Dalton, whose gifts to me over the years are too many to list, and to my wife Jamie, whose encouragement and love kept this project on track. Without these people in my life, this work would not be possible.

Contents

1 Introduction
  1.1 Baseline Risk Adjustment
  1.2 Linking Processes and Outcomes
  1.3 Quantile Regression
2 Flexible recalibration of binary clinical prediction models
  2.1 Introduction
  2.2 Recalibration Methodology
  2.3 Comparing Models on Calibration
  2.4 An Example
  2.5 Discussion
3 Baseline Risk Adjustment for In-Hospital Mortality using ICD-9-CM Codes
  3.1 Introduction
  3.2 Methods
    3.2.1 Model Development
    3.2.2 Calibration
    3.2.3 Comparator Model
    3.2.4 Model Performance and Reliability
  3.3 Results
  3.4 Discussion
4 Quantile Regression Fundamentals
  4.1 Quantiles
  4.2 Quantile Estimation as an Optimization Problem
    4.2.1 Median as a Minimizer of Absolute Deviations
    4.2.2 Estimating a Median via Linear Programming
    4.2.3 Extension to Other Quantiles
  4.3 Quantile Regression
    4.3.1 Model Formulation and Interpretation
    4.3.2 Estimation of Conditional Quantiles
  4.4 Variability Estimates
    4.4.1 Covariance Matrix Approximations
    4.4.2 Wald Tests and Confidence Intervals
    4.4.3 Testing the Location-Shift Hypothesis
    4.4.4 Example
  4.5 R Software
5 Sparse Multiquantile Regression via Second-Order Fused Lasso
  5.1 Introduction
  5.2 Methodology
    5.2.1 Model Fit Criterion
    5.2.2 Optimization
    5.2.3 Selection of Regularization Parameters
    5.2.4 Model Degrees of Freedom
  5.3 Simulation Analysis
  5.4 Application to Real Dataset
  5.5 Conclusion
6 Conclusion
  6.1 Future Work
Bibliography

List of Figures

2.1 Calibration curves as a function of the estimated risk score, among the 20% validation cohort. A histogram of the risk score is overlaid below; the curves are plotted over the middle 99% of the distribution.
2.2 Harrell calibration plot of the raw risk scores in the 20% validation cohort. A histogram of the raw risk scores is overlaid below the plot.
2.3 Calibration of two sets of updated risk scores in independent test data, based respectively on a cubic spline recalibration model and a traditional linear-logistic recalibration model. Histograms of the two sets of updated risk scores are overlaid below; calibration curves are shown for the middle 99% of the distributions of the risk scores (respectively). The displayed calibration curves were fit to the sets of updated risk scores using natural cubic splines.
3.1 Study flow diagram.
3.2 Calibration curves displaying the relationship between observed outcomes and model predictions. On the x-axis is the risk score (given as a predicted probability), and on the y-axis is an observed-to-expected (O/E) odds ratio. Perfect calibration implies an O/E odds ratio of 1 across the risk spectrum. An O/E odds ratio of 0.1 among patients with predicted mortality risk of 10^-3, for example, implies that observed mortality was 10 times less likely for these patients than predicted by the model (thus the risk score was too high). Histograms of the risk scores underlie each panel. Calibration curves are truncated to the middle 99% of the data. Panel (a) displays the calibration of raw scores from the logistic model, within the random 20% calibration cohort. Correcting predictions based on these curves was insufficient to ensure complete calibration in the prospective 2009 data (b). However, re-calibration within the 2009 data based on the curve in (b) yielded favorable calibration for both models (c).
3.3 Scatterplot of 50,000 randomly sampled discharges from the 2009 data, displaying the ratio of predicted odds of mortality (calculated as the predicted odds under the AllCodeRisk model divided by the predicted odds under the POARisk model) as a function of the POARisk score. Each of the risk scores had been re-calibrated to the 2009 data. Fidelity of the AllCodeRisk model to the POARisk model in terms of characterizing individual patient risk is represented by the horizontal line at an RPO of 1.0 (dashed line). Quantile regression curves displaying the median, first and third quartiles, and middle 95% of the data as a function of POARisk (fit using the entire 2009 sample) are overlaid. The plot shows that the AllCodeRisk model produces risk estimates that are too low for the majority of the patients, though risk estimates among high-risk patients were more consistent.
3.4 Kernel density plot of the percent difference in hospital performance under the two risk adjustment models (defined as the percent change in hospital observed-to-expected mortality ratio, AllCodeRisk vs. POARisk). Hospitals with fewer than 30 mortalities were excluded from the analysis. Performance under the AllCodeRisk model was within ±20% of that under the POARisk model for 89.6% of the hospitals, which was less than the pre-specified criterion of at least 95% of hospitals.
4.1 ECDF of randomly-generated Poisson(3) data.
4.2 (top panel) Functions r_i(u) defining the absolute deviations of each unique observed value of the random Poisson(3) data, y_i, i = 1, 2, ..., 200; (bottom panel) the objective function \psi(u) = (1/n) \sum_{i=1}^n r_i(u), minimized at \hat{M}.
4.3 Check functions \rho_\tau(y - u) for \tau \in {0.10, 0.50, 0.75}.
4.4 Density plot of total charges associated with heart valve replacement surgery.
4.5 (top panel) Predictions \hat{Q}_{Y|X_1}(\tau) and quantile treatment effects \hat{\beta}_{\tau,1} from univariable quantile regression models for an outcome Y based on a dichotomous predictor X_1, for \tau \in {0.1, 0.5, 0.9}. (bottom panel) Quantile treatment effects \hat{\beta}_{\tau,1} as a function of \tau. The OLS estimate of the treatment effect \hat{\beta}_1 (dashed line) is overlaid. The fact that the quantile treatment effects vary implies a violation of the location-shift assumption; thus the OLS model is inadequate.
4.6 Deciles of the distribution of hospital charges associated with valve replacement surgery, by ischemic heart disease status.
4.7 Quantile regression coefficient profiles as a function of \tau.
5.1 Search procedure for c_1 and c_2. Starting at the unconstrained or full model (triangle), find the test-data fit criterion for a series of c_2 values (blue arrow), saving the combination of constraint values that minimizes the criterion (green dots). Decrement c_1 and repeat the search along the series of c_2 values. If the minimum test-data fit criterion is less than the prior minimum, proceed; otherwise stop (green X) and perform a fine grid search around the neighborhood of the initial minimum (blue diamond).
5.2 Histograms of \hat{\nu} for simulated analyses under varying sample sizes.
5.3 Histograms of estimated degrees of freedom under the SMR model, from simulated analyses under varying sample sizes.
5.4 Quantile regression coefficient estimates for the Hospital Compare data modeling the risk-adjusted hospital 30-day mortality rate associated with patients treated for pneumonia. Estimates and pointwise 95% confidence intervals (standard errors estimated using non-parametric bootstrap resampling) are presented as white diamonds and red shaded areas, respectively, while estimates from the SMR model are given by the connected black points.
5.5 Adjusted quantiles of post-diagnosis survival among Non-Hodgkin's Lymphoma patients versus age at diagnosis.

A Modern Statistical Approach to Quality Improvement in Health Care using Quantile Regression

Abstract

by JARROD E. DALTON

Quality is difficult to measure and compare among medical providers. First, an appropriate metric must be chosen among many potential process-based and patient outcome measures. Further, when comparing providers, one must take into account the fact that patients are not homogeneous across providers. Accounting for the latter with clinical risk adjustment models is a complicated and controversial topic, as providers are increasingly paid based on their performance with respect to risk-adjusted quality measures. A modern approach to hospital quality improvement is developed by: expanding on D. R. Cox's 1958 methodology for calibrating binary outcomes; using this calibration methodology and recent efficient generalized linear modeling algorithms to develop a risk-adjustment model for in-hospital mortality with data on 20 million inpatient discharges from the State of California through 2008; applying this model to obtain 2009 observed-to-expected mortality ratios for the hospitals across California; and evaluating whether or not performance-dependent relationships between hospital process changes and observed-to-expected mortality ratios exist, based on a novel sparse multiquantile regression technique that incorporates a fused-lasso-type penalty.

Chapter 1

Introduction

Fee-for-service systems in many current health care reimbursement programs (such as Medicare) may actually produce disincentives for ensuring the quality of services rendered [1]. The Patient Protection and Affordable Care Act, passed by Congress and signed into law by President Obama in 2010, stipulated that, effective January 1, 2015, payment to physicians is to be based on the quality, and not on the volume, of care provided. Through this mandate, supporters of the bill argued, greater economic focus on quality might result in better health outcomes while at the same time reducing the costs associated with care.

1 H.R. 3590, 111th Congress: Patient Protection and Affordable Care Act. (2009). In GovTrack.us (database of federal legislation). Retrieved March 25, 2012, from GovTrack.us.

However, the health care industry is relatively unique in that those responsible for selecting providers are largely independent of those responsible for paying for services rendered. Therefore, changing incentives with respect to only the reimbursement processes might have a lesser economic impact than additionally affecting demand for services. Along these lines, performance comparison tools such as the U.S. Department of Health and Human Services' Hospital Compare website have enabled patients to make informed decisions among potential providers based on various process- and outcomes-based performance measures. Quality reporting efforts such as this have been facilitated by a rise in the availability of various large and high-quality clinical and administrative data registries. While controversy exists among various health care stakeholders regarding which processes to measure and how to measure them, the statistical issues involved with measuring process adherence among various providers are relatively benign. In contrast, the choice of health outcomes to monitor for quality of care is relatively straightforward from a clinical standpoint, but objectively comparing outcomes is fraught with statistical difficulties. The primary challenge involved with comparing outcomes is that providers vary with respect to their patients' severity of disease and the severity of the procedures they use to treat their patients. Thus, some form of baseline

2 Accessed March 24, 2012.

risk adjustment is necessary in order to better associate differences among providers in outcome rates with differences in quality of care [2].

1.1 Baseline Risk Adjustment

A multitude of techniques for making risk-adjusted comparisons among providers exists. For binary outcomes such as in-hospital mortality, a common technique is to estimate, for each provider, an observed-to-expected outcome ratio. The expected number of events used in the denominator of this calculation is assigned by the risk adjustment model being employed; specifically, the expected number of events is the sum of the individual patients' predicted event probabilities. Given this framework, hospital performance is thus closely tied to the underlying risk adjustment model. Correspondingly, precise and unbiased estimation of baseline risk is a crucial component of any useful risk adjustment model.

Often, the relative merits of risk adjustment models are established by their ability to discriminate between patients who experience the outcome(s) in question and patients who do not, for instance, by using the concordance index (or C-statistic) [3]. While it is indeed important for useful risk adjustment models to maximize the predictive ability of baseline patient characteristics and planned procedures, these models are often developed for

patient populations that exhibit a broad spectrum of risk (such as the US inpatient population or the population of Medicare patients) and as a consequence can achieve consistently high C-statistics. In addition to discrimination, estimates from optimal risk adjustment models should calibrate well with observed outcomes. In the context of binary outcomes, this means that patients with a predicted probability of p should exhibit an outcome incidence of p, for all p between 0 and 1. Compared to discrimination, unfortunately, model calibration is less extensively documented in the literature. For example, there is at least a perception that currently available outcomes-based measures might not adequately adjust for risk, especially in sicker patients [4]. Providers may tend to avoid treating patients who have risk-adjusted estimates of adverse event rates that providers perceive are too low. Likewise, they may cherry-pick those patients who have risk-adjusted outcome rates that seem to be too high [5]. While calibration among high-risk patients is clearly important due to the fact that these patients are most likely to influence observed and expected outcomes, calibration among low-risk patients is also important since hospital outcome risk is typically low for the vast majority of patients. Due to these large numbers, miscalibrations among low-risk patients therefore also have the potential to influence aggregate measures of risk-adjusted outcomes. In circumstances where calibration is actually considered, standardly

implemented calibration techniques are inadequate for identifying departures from model calibration among low-risk patients. For example, among the 1.3 million cases participating in the National Surgical Quality Improvement Program between 2005 and 2011 (representing over 400 hospitals), the overall incidence of 30-day mortality was 1.6%. With this low overall incidence, it is reasonable to expect that a highly predictive risk adjustment model would yield predicted probabilities of less than 2% for a large proportion of patients. A smooth plot of the observed incidence on the y-axis versus the expected incidence on the x-axis [6] would limit the assessment of model calibration among these patients to the extreme lower left corner of the figure. The first objective of this dissertation was therefore to develop a flexible method for calibrating binary outcomes on the log-odds scale. This method extends the work of D. R. Cox in 1958 [7], which considered a linear-logistic calibration equation and tested whether or not the intercept and slope of this equation are 0 and 1, respectively.

Another issue with existing risk adjustment models is the loose distinction between which patient characteristics are baseline, or present-on-admission (POA), versus which are hospital-acquired. Several highly discriminative risk adjustment models [8, 9, 10, 11] incorporate administrative data on diagnoses and procedures, such as the Current Procedural Terminology [12] and

the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) [13]. While administrative datasets are rich (due in large part to the fact that they underlie reimbursement procedures), there has not traditionally been a mechanism in place for documenting the timing of diagnoses, which is a crucial aspect of determining whether hospitals are treating or causing patient morbidity. The Deficit Reduction Act of 2005 mandated that by 2008 all diagnoses be coded with POA indicators. Unfortunately, health care systems have been slow to comply; consequently, large and representative data registries that can be used to construct risk adjustment models incorporating POA indicators have only recently become available.

A third issue arising with many existing clinical risk adjustment models is the lack of utilization of modern predictive modeling techniques. It is well known in the statistics community that purposely biasing regression parameter estimates toward zero, a technique known as shrinkage, often leads to gains in predictive accuracy among external observations (i.e., observations that were not used for model development). Hoerl in 1970 [14] introduced one of the earliest shrinkage methods. His ridge regression technique penalized the model fit criterion by an amount

3 S. 1932, 109th Congress: Deficit Reduction Act of 2005. (2005). In GovTrack.us (database of federal legislation). Retrieved March 25, 2012, from GovTrack.us.

proportional to the L2-norm of the coefficient vector \beta. Ridge regression tends to average correlated predictors in order to produce increased prediction accuracy. A related shrinkage technique, named the lasso and proposed by Tibshirani in 1996 [15], penalized the L1-norm of \beta. Lasso regression tends to shrink coefficients for predictors unrelated to the response all the way to zero, effectively performing a form of smooth variable selection while increasing predictive accuracy for external observations. Park and Hastie [16, 17] extended the lasso penalty to the general exponential family and to the proportional hazards regression modeling frameworks. In 2005, Zou and Hastie [18] combined the ridge and lasso shrinkage estimators by incorporating a penalty of \alpha times the ridge penalty plus (1 - \alpha) times the lasso penalty, for some \alpha between 0 and 1. They showed that this elastic net estimator often resulted in superior prediction performance relative to the ridge and lasso estimators. Finally, Friedman et al. in 2010 [19] developed an extremely efficient algorithm for obtaining the entire solution path for generalized linear models incorporating elastic net penalties, allowing practical implementation on a large scale.

The second objective of this dissertation was to develop a baseline risk index for in-hospital mortality, using administrative data from the State of California. These data are relatively unique, since the State of California has been collecting POA indicators associated

with patients' diagnoses longer than most other states; for this risk index, 20.0 million discharge records from the California State Inpatient Database (CA-SID) are used to develop the model, and an additional 4.0 million discharge records from the 2009 CA-SID are used to prospectively validate the model. Elastic net shrinkage is used to maximize prediction accuracy for independent observations. The newly developed calibration technique (Objective #1) is employed to ensure that risk estimates are unbiased across all predicted probabilities.

1.2 Linking Processes and Outcomes

Measuring, reporting, and comparing outcomes may be the most important step toward unlocking rapid outcome improvement and making good choices about reducing costs [20]. Without objective (risk-adjusted) outcome measurement, it is impossible to know if the care clinicians deliver is good or bad [21]. Despite ongoing efforts to improve outcome measurement, the assessment of quality of care has largely focused on defining and measuring adherence to process-based performance criteria, with the presumption that these processes are associated with improved clinical outcomes. Along these lines, there already has been skepticism regarding the effectiveness of the quality measurement industry [22, 23]. Without proof

that improvement in certain processes of care actually helps patients have better outcomes, what motivation do caregivers have to undertake costly and complex efforts to improve processes, other than to score highly on public report cards? It may be that improvement in outcomes depends on both localized and system-wide process initiatives [21]. Localized quality improvement is affected by the decisions of health care professionals during the course of care as well as by the policy decisions of hospital leadership. The responsibility of defining best practices rests on providers at all levels of care; effective quality improvement research can support these efforts by revealing differences in processes which separate the providers with the best outcomes from the providers with the worst outcomes. This type of research is a crucial link in the ongoing plan-do-study-act cycle of quality improvement, first conceived by Walter A. Shewhart and popularized in industrial settings by W. Edwards Deming [24, 25]. Though a practice-measure-improve analogue has been proposed for the medical setting [26, 27], the main idea of research completing the cycle of quality improvement remains.

Unfortunately, scientific evidence linking process improvement to outcomes is rare [28], despite an increase in the incentives and/or requirements for providers to monitor adherence to various types of process indicators (e.g., the U.S. Centers for Medicare and Medicaid Services maintains nearly 300 quality indicators as part of its voluntary Physician Quality Reporting System). An example of such evidence is the work of Peterson et al. in 2006, who found that a 10% increase in adherence to a composite of nine practice guidelines established by the American College of Cardiology Foundation and the American Heart Association was associated with a risk-adjusted odds ratio [95% confidence interval] for in-hospital mortality of 0.90 [0.84, 0.97] among patients with acute coronary syndromes [29].

The relationship between processes and outcomes may be complex. It may not suffice to assume, for instance, that a strategic initiative aimed at increasing the quality and quantity of hand-washing would have the same effect on postoperative infectious outcomes across all providers. One might hypothesize that this initiative would have the greatest impact among hospitals with the worst risk-adjusted infection rates and little impact among hospitals with already low rates, since the latter group may already be practicing appropriate hand-washing protocols.

4 Accessed March 26, 2012.

1.3 Quantile Regression

Quantile regression [30] is a technique for estimating selected quantiles of a response variable's distribution, conditional on one or more covariates. For instance, regressing the 90th percentile of the distribution of hospital observed-to-expected infection ratios on a set of process measurements might reveal which processes are most important for avoiding unnecessarily high infection rates (compared to other hospitals). Likewise, performing the same analysis with respect to the 10th percentile of this distribution might reveal which processes are most important for maintaining a relatively low infection rate. These sets of processes, as alluded to above, may not be similar.

Though quantile regression has existed for several decades, its application has been limited to roughly the last 10 years. One reason for this delay is that the method relies on linear programming techniques in order to find optimal estimates for model coefficients; linear programming and general convex optimization are not standard components of graduate curricula in statistics and biostatistics. Another reason is that quantile regression modules for standard statistical analysis software packages had not been developed until the last 10 years. The third objective of this dissertation was to provide a review of quantile regression: starting from fundamental theory which restates the problem of univariate quantile estimation as a linear programming problem, utilizing basic linear programming modules in R to arrive at optimal quantile estimates, and extending these concepts to the general regression setting. The R package quantreg [31], which incorporates modern and more efficient linear programming algorithms and allows for easier implementation of the methodology, is introduced.

Traditional quantile regression treats each desired conditional quantile as its own optimization problem. However, the fitted quantile regression parameter estimates for the 0.50 quantile are likely similar to the fitted parameter estimates for the 0.51 quantile of the response distribution (for example). Restructuring the optimization problem so that sets of regression coefficients for all quantiles of interest are estimated simultaneously, and restricting these regression coefficients in such a way that penalizes excess quantile-to-quantile fluctuations in fitted coefficients for a given predictor, might yield more consistent results in small samples. The fourth objective of this dissertation was to develop a sparse multiquantile regression technique which restricts parameter estimates according to a second-order fused lasso penalty applied to the joint model fitting criterion. The fused lasso penalty shrinks parameters all the way to zero for unrelated predictors, while at the same time shrinking the quantile-to-quantile differences in parameter estimates towards a constant for predictors that are related to the outcome (specifically, for regions of quantiles in which

there is a relationship between a given predictor and the outcome). The new multiquantile regression modeling technique is then applied in a real-world setting relating hospital process improvement metrics to risk-adjusted outcomes. Data from the U.S. Department of Health and Human Services' Hospital Compare registry and from the 2010 U.S. Census are analyzed to compare risk-adjusted rates of 30-day mortality associated with pneumonia care among various hospitals. Predictors include various process indicators defining best practices for treating pneumonia, population density in the area of the hospital, minority population percentage, and a relative measure of Medicare spending per patient.
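The optimization view of quantiles that underlies these objectives (developed in detail in Chapter 4) can be illustrated with a small numerical sketch. The following is an illustrative Python rendering, not the dissertation's R code; the data and function names are invented for the example. For a sample y and a target quantile level tau, minimizing the total check-function loss rho_tau(y - u) over candidate values u recovers a sample tau-quantile, and the minimizer can always be found among the observed data points:

```python
import numpy as np

def check_loss(u, y, tau):
    """Total check-function (pinball) loss of candidate quantile value u."""
    r = y - u
    return np.sum(np.where(r >= 0, tau * r, (tau - 1) * r))

def quantile_by_optimization(y, tau):
    """Minimize the check loss by exhaustive search over the unique observed
    values of y (the minimizer of the piecewise-linear objective is always
    attained at a data point)."""
    candidates = np.unique(y)
    losses = [check_loss(u, y, tau) for u in candidates]
    return candidates[int(np.argmin(losses))]

# Random Poisson(3) data, echoing the example pictured in Chapter 4's figures.
rng = np.random.default_rng(0)
y = rng.poisson(lam=3, size=200).astype(float)

for tau in (0.10, 0.50, 0.75):
    print(tau, quantile_by_optimization(y, tau))
```

A grid search is used here only for transparency; Chapter 4 instead recasts the same minimization as a linear program, which is how quantreg solves it at scale.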

Chapter 2

Flexible recalibration of binary clinical prediction models

2.1 Introduction

In addition to the relative risks of outcome imposed by various types of exposures, physicians and patients seek measures of absolute risk of outcome to inform care decisions [32, 33]. Likewise, researchers and policy makers are interested in classifying patients with respect to risk of outcome and in evaluating the diagnostic utility of novel biomarkers [34]. It is therefore not a surprise that, with advances in statistical methodology and computational capacity, and with the establishment of many large and often disease-specific

observational data registries in medicine, we have seen clinical prediction modeling emerge as a prominent area of research over recent decades, with applications to various types of outcomes [35]. Binary outcomes are perhaps the most prevalent type of outcome studied in medicine, and methods specifically designed for predicting binary outcomes and evaluating model performance in external data abound [33, 6, 36, 37, 38, 39]. Beyond evaluation, adjustments of prediction rules to better characterize future data should be considered [40].

In judging the quality of binary clinical prediction models for independent data, two important considerations are discrimination (the model's ability to separate events from non-events) and calibration (the agreement between model predictions and observed outcome incidences) [6]. Diamond illustrated how these two aspects do not necessarily go hand in hand [41]. Discrimination is important in both prognostic and diagnostic models, since those applying the models are typically interested in predicting who among a population is most likely to experience the outcome in question or to have a disease in question. Likewise, calibration is important in both settings. When the goal of the model is to diagnose an existing condition, well-calibrated probabilities can better inform treatment decisions (for example, an invasive procedure to treat a certain disease might only be warranted if there is reasonable certainty that the patient truly has the disease in question).
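Discrimination is typically quantified with the concordance index (C-statistic) mentioned in Chapter 1: the probability that a randomly chosen event receives a higher prediction than a randomly chosen non-event. A minimal Python sketch (illustrative only; the predictions and outcomes below are invented):

```python
import numpy as np

def c_statistic(pred, y):
    """Concordance index: among all event/non-event pairs, the fraction in
    which the event has the higher prediction (ties count one half)."""
    pred, y = np.asarray(pred, float), np.asarray(y, int)
    events, nonevents = pred[y == 1], pred[y == 0]
    # Compare every event prediction against every non-event prediction.
    diff = events[:, None] - nonevents[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy example: four patients, two of whom experienced the outcome.
print(c_statistic([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))  # -> 0.75 (3 of 4 pairs concordant)
```

Note that the C-statistic depends only on the ordering of the predictions, which is precisely why a model can discriminate well while calibrating poorly, as the surrounding text emphasizes.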

Ensuring well-calibrated probabilities in prognostic models enables practitioners to give patients accurate assessments of their risk of future outcomes, for example. Curiously, however, calibration in clinical prediction models is repeatedly not given proper attention. This fact has been a primary source of concern, for example, in the interpretation and implementation of pay-for-performance programs in health care. Despite the existence of several highly discriminative risk adjustment models for monitoring quality of care [8, 9, 42, 11, 10, 43], providers tend to avoid treating high-risk patients for fear of the underlying risk adjustment models' inability to adequately describe risk among high-risk patients [4]. In other words, they suspect that the models do not calibrate well with actual outcomes. It is crucial, therefore, that models are evaluated for calibration and that any departures from calibration are corrected, in order to accurately represent risk in external populations. Only when prognostic models are well-calibrated should their discriminative ability be of concern.

Hosmer and Lemeshow [44] developed a goodness-of-fit test for overall model calibration. The test is based on grouping patients into a certain number of categories (typically ten) with respect to their predicted probability and comparing the observed number of events within each group to the expected number of events (defined as the sum of the individual predicted

probabilities within a given group). While this test can indicate the presence of miscalibration, it is dependent on both the sample size and the choice of cutpoints; further, the test alone does not prescribe a solution for the miscalibration.

Cox [7] proposed a method for testing calibration and simultaneously estimating a recalibration equation, using external data. Briefly, a new logistic model is developed to relate the expected log-odds of the outcome (based on the prognostic model) to the observed log-odds using a linear equation. Details are given in Section 2.2 below. The model is deemed to calibrate well if the intercept and slope of the linear equation are 0 and 1, respectively. If the model systematically over- or under-estimates the probability of outcome (known as calibration-in-the-large), the intercept will be different from zero. Likewise, there should be a one-to-one relationship between expected and observed probabilities; departures from this requirement are reflected as a slope different from one [45]. In practice, however, such a linear-logistic assumption may be inadequate to appropriately describe the nature of miscalibration. In this chapter, Cox's method is extended to allow for a more flexible family of recalibration functions which, optionally, are covariate-dependent.
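The Hosmer-Lemeshow construction described above is straightforward to sketch. The following is an illustrative Python rendering with simulated (invented) data and function names; in the usual formulation the resulting statistic is referred to a chi-squared distribution:

```python
import numpy as np

def hosmer_lemeshow_stat(p, y, groups=10):
    """Hosmer-Lemeshow statistic: partition subjects into `groups` bins by
    predicted probability, then compare observed vs. expected event counts."""
    order = np.argsort(p)
    p, y = np.asarray(p)[order], np.asarray(y)[order]
    stat = 0.0
    for bin_idx in np.array_split(np.arange(len(p)), groups):
        obs = y[bin_idx].sum()        # observed events in the bin
        exp = p[bin_idx].sum()        # expected events = sum of predictions
        n, pbar = len(bin_idx), p[bin_idx].mean()
        stat += (obs - exp) ** 2 / (n * pbar * (1 - pbar))
    return stat

# Simulated data in which the predictions are calibrated by construction.
rng = np.random.default_rng(1)
p = rng.uniform(0.01, 0.30, size=2000)
y = rng.binomial(1, p)               # outcomes generated from p itself
print(hosmer_lemeshow_stat(p, y))    # should be modest when calibration holds
```

The sketch also makes the text's criticisms concrete: the statistic changes with the number of bins (`groups`) and with the sample size, and nothing in it indicates how to repair a detected miscalibration.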

2.2 Recalibration Methodology

We partition the clinical prediction modeling process into three phases, each with its own dataset (the datasets being independent of one another, e.g., as given by random partitioning of the overall dataset prior to analysis). The first dataset, called the training dataset, is used to fit the prediction model. The second, called the validation dataset, is used to recalibrate the model. The third, called the test dataset, is used to externally evaluate model performance among data that had no influence on the finalized model itself. This type of partitioning is consistent with the framework described by Hastie et al. for building and evaluating prediction models [46] (though, technically speaking, one may wish to further partition the second dataset in order to select model tuning parameters such as penalization factors).

Now let us assume that, in our validation dataset, we have obtained a collection of n expected probabilities \tilde{p}_i \in (0, 1) for a binary event Y_i \in \{0, 1\}, where i = 1, 2, ..., n. We note that there are no other restrictions on the \tilde{p}_i; regardless of how they were derived from the training cohort (e.g., using logistic regression, random forests, or any of the many other methods for binary prediction), we only require that they are positive numbers less than one. We are concerned with the agreement between the expected probabilities \tilde{p}_i and the true probabilities p_i = Pr(Y_i = 1).

Define a risk score on the log-odds scale as a monotone transformation of the predicted probabilities \hat{p}_i:

RS_i = \log\left[ \frac{\hat{p}_i}{1 - \hat{p}_i} \right]   (2.1)

Cox's linear-logistic calibration model is given by:

\log\left[ \frac{p_i}{1 - p_i} \right] = \beta_0 + \beta_1^* RS_i + \mathrm{error}_i.   (2.2)

In this model, standard Wald chi-squared tests [47] of model coefficients can be used to test whether or not \beta_0 = 0 and \beta_1^* = 1. Steyerberg (2009) suggested restating Cox's model by introducing an offset term for the risk score [35]:

\log\left[ \frac{p_i}{1 - p_i} \right] = \beta_0 + (\beta_1 + 1) RS_i + \mathrm{error}_i = RS_i + \beta_0 + \beta_1 RS_i + \mathrm{error}_i.   (2.3)

That is, we introduce a term for RS_i whose coefficient is fixed at 1 (thus, \beta_1^* = \beta_1 + 1). Rearranging terms, then, we obtain

\log(\gamma_i) = \beta_0 + \beta_1 RS_i + \mathrm{error}_i,   (2.4)
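The reparameterization in (2.3) can be verified numerically. The sketch below uses invented coefficient values (it is not from the dissertation) to confirm that adding log(γ_i) = β_0 + (β_1* − 1)RS_i to the risk score recovers the calibrated log-odds of Cox's model:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# hypothetical predicted probabilities from some prognostic model
p_hat = [0.02, 0.10, 0.30, 0.70]
rs = [logit(p) for p in p_hat]

# illustrative Cox coefficients for a miscalibrated model: intercept 0.25, slope 0.80
b0, b1_star = 0.25, 0.80
calibrated = [b0 + b1_star * r for r in rs]       # right-hand side of (2.2)

# offset form: log(gamma_i) = beta_0 + beta_1 * RS_i, with beta_1 = beta_1* - 1
log_gamma = [b0 + (b1_star - 1.0) * r for r in rs]

# RS_i + log(gamma_i) equals the calibrated log-odds, as in (2.3)-(2.4)
assert all(abs(r + g - c) < 1e-12 for r, g, c in zip(rs, log_gamma, calibrated))
```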

where \gamma_i is defined as the observed-to-expected odds ratio, i.e.,

\gamma_i = \frac{p_i / (1 - p_i)}{\hat{p}_i / (1 - \hat{p}_i)}.   (2.5)

This framework enables generalizing the calibration equation to more complex functions of the risk score, for instance, by including in the calibration model a factor representing sequential risk strata, polynomial terms, or spline smoothers. In general, let H be an n \times k matrix representing a k-dimensional basis expansion of the risk score. That is,

H_i = \left[ h_1(RS_i) \; h_2(RS_i) \; \cdots \; h_k(RS_i) \right].   (2.6)

In Cox's linear-logistic model, H_i = [h_1(RS_i)] = [RS_i]. If we are considering discrete risk strata represented by cutpoints \{\xi_1, \xi_2, \ldots, \xi_k\}, then we might have h_1(RS_i) = I(\xi_1 \le RS_i < \xi_2), h_2(RS_i) = I(\xi_2 \le RS_i < \xi_3), and so forth. Various smoothers can similarly be represented under this framework; see Hastie et al. [46] for details. Using H, we estimate the observed-to-expected odds ratio \gamma_i for a given risk score by the offset logistic regression model

\log(\hat{\gamma}_i) = \hat{\alpha} + H_i \hat{\beta}.   (2.7)
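A minimal numerical sketch of fitting the offset model (2.7) in the special case H_i = [RS_i] (Cox's model): this is a hypothetical Newton-Raphson illustration on simulated data, not the dissertation's code, which used standard statistical software.

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
n = 5000

# simulate a miscalibrated model: true log-odds = -0.5 + 1.4 * RS
rs = [random.gauss(-1.0, 1.0) for _ in range(n)]
y = [1 if random.random() < inv_logit(-0.5 + 1.4 * r) else 0 for r in rs]

# fit log(gamma_i) = alpha + beta * RS_i by offset logistic regression,
# i.e. logit Pr(Y = 1) = RS_i + alpha + beta * RS_i, via Newton-Raphson
a = b = 0.0
for _ in range(30):
    mu = [inv_logit(r + a + b * r) for r in rs]              # RS enters as an offset
    u0 = sum(yi - mi for yi, mi in zip(y, mu))               # score for alpha
    u1 = sum((yi - mi) * r for yi, mi, r in zip(y, mu, rs))  # score for beta
    w = [mi * (1.0 - mi) for mi in mu]
    h00 = sum(w)
    h01 = sum(wi * r for wi, r in zip(w, rs))
    h11 = sum(wi * r * r for wi, r in zip(w, rs))
    det = h00 * h11 - h01 * h01
    a += (h11 * u0 - h01 * u1) / det                         # 2x2 Newton update
    b += (h00 * u1 - h01 * u0) / det

# alpha should recover roughly -0.5 and beta roughly 0.4 (= 1.4 - 1)
```

In practice one would fit this with a logistic regression routine that accepts an offset term; the hand-rolled update above only makes the estimating equations explicit.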

A global likelihood ratio test [44] of model (2.7) against a null model with only the offset term (and no intercept) can be used to test for overall miscalibration. In the presence of general miscalibration, we can test whether or not the form of this miscalibration is more complex than a simple location shift. This is equivalent to testing H_0: \beta = 0. Let \hat{\Sigma}_\beta be the estimated variance-covariance matrix of \hat{\beta}. The Wald statistic

W = \hat{\beta}^T \hat{\Sigma}_\beta^{-1} \hat{\beta}   (2.8)

has an asymptotic \chi^2_k distribution under the null hypothesis. Graphical assessments of calibration can be made based on model (2.7) by plotting \log(\hat{\gamma}) against the risk score; perfect calibration in this plot would be represented by a horizontal line at \log(\hat{\gamma}_i) = 0.

To recalibrate the model, we obtain the updated risk scores RS_i^*, defined as

RS_i^* = \log\left[ \frac{\hat{p}_i^*}{1 - \hat{p}_i^*} \right] = RS_i + \log(\hat{\gamma}_i),   (2.9)

and thus the updated predicted probability \hat{p}_i^* is the inverse logit

\hat{p}_i^* = \frac{\exp\{RS_i + \log(\hat{\gamma}_i)\}}{1 + \exp\{RS_i + \log(\hat{\gamma}_i)\}}.   (2.10)

Finally, note that model (2.7) can be re-used in order to assess the adequacy

of recalibration in the test dataset, and to further recalibrate predictions if departures from overall calibration remain.

2.3 Comparing Models on Calibration

A natural measure of overall miscalibration for a model, which will be called the miscalibration index hereafter, is the total size of the \log(\hat{\gamma}_i):

Q = \|\log(\hat{\gamma})\|^2 = \sum_{i=1}^{n} (\log(\hat{\gamma}_i))^2.   (2.11)

The asymptotic behavior of Q is dependent on the structural form of the calibration model; however, given that this structural form is pre-specified and fixed, Q can be useful as a relative quantity for comparing two models. Let RS_{1i} and RS_{2i} be any two risk scores which estimate the log-odds of Y_i. These could be from models fit to the training data, from externally-developed risk scoring algorithms, or from applying various recalibration methods. Fit respective calibration curves based on Equation 2.7 in order to estimate \log(\hat{\gamma}_{1i}) and \log(\hat{\gamma}_{2i}). Based on these quantities, we obtain risk-score-specific miscalibration indexes using Equation 2.11. Thus, we have two miscalibration indexes, Q_1 and Q_2. To compare the models, we define a

miscalibration ratio as

R = \frac{Q_1}{Q_2}.   (2.12)

A bootstrap resampling routine [48] can be used to approximate the sampling distribution of R and obtain an empirical confidence interval.

2.4 An Example

To demonstrate the methodology, data from one million hospital inpatient stays were obtained from the National Inpatient Sample¹ (250,000 per year). The overall dataset was randomly divided into a training dataset (60%), a validation dataset (20%), and a test dataset (20%). Using the training data, a logistic regression model for in-hospital mortality was estimated based on the predictors age, gender, and a factor defined by combining each patient's principal diagnosis with their principal procedure. The U.S. Agency for Healthcare Research and Quality's Clinical Classifications Software² was used to characterize the diagnoses and procedures. Procedures for a given diagnosis were combined if there were fewer than 1,000 observations in the overall dataset represented by that combination of diagnosis and procedure. Diagnoses were then combined using the same minimum cell size. R software for 64-bit Microsoft Windows (The R Foundation for Statistical Computing, Vienna, Austria) was used to perform the analysis. The size of all tests was fixed at 0.05.

¹ HCUP Databases. Healthcare Cost and Utilization Project (HCUP). Agency for Healthcare Research and Quality, Rockville, MD.
² HCUP CCS. Healthcare Cost and Utilization Project (HCUP). Agency for Healthcare Research and Quality, Rockville, MD.

Figure 2.1: Calibration curves as a function of the estimated risk score, among the 20% validation cohort. A histogram of the risk score is overlaid below; the curves are plotted over the middle 99% of the distribution.

Figure 2.1 displays estimated calibration curves, based on both a traditional linear-logistic fit and a six-degree-of-freedom natural cubic regression spline fit of the form (2.7). Predictions were in general bimodal,

with a relatively large group of inpatient stays clustered around a log-odds of −20, i.e., a predicted probability on the order of one in a billion; these stays were mainly associated with obstetric care, psychiatric evaluation, and other minor procedures. However, the spline fit indicates that the model significantly underestimated risk among these stays, with \log(\hat{\gamma}) as high as 10 to 12. If the true log-odds is closer to −9, as the calibration model indicates, then predicted mortality is more on the order of one in ten thousand. In comparison, a Harrell calibration plot [6] (i.e., a smooth plot of the predicted probabilities against the observed incidences) indicates generally good calibration of the raw risk scores (Figure 2.2). Overall, the likelihood ratio test revealed significant miscalibration (P < 0.0001), and the multiparameter Wald test for miscalibration more complex than a simple location shift (Equation 2.8) was also significant (P < 0.0001). The linear-logistic fit, on the other hand, failed to indicate departures from calibration; the 95% confidence interval for the estimated intercept extended to 0.020 (P = 0.17, Wald test) and the 95% confidence interval for the estimated slope was [−0.041, 0.005] (P = 0.12).

Using the independent test dataset, recalibrated risk scores RS_i^* were estimated based on the cubic spline and the linear-logistic calibration models, respectively. Another calibration curve was fit to each set of updated risk scores to graphically evaluate whether or not the recalibration models

Figure 2.2: Harrell calibration plot of the raw risk scores in the 20% validation cohort. A histogram of the raw risk scores is overlaid below the plot.

successfully achieved their goal. These new calibration curves were fit using six-degree-of-freedom natural cubic regression splines. Results are shown in Figure 2.3. The recalibration based on smoothing resulted in updated risk scores that more closely represented risk than the scores did prior to recalibration, as evidenced by the fact that the test-data calibration curve was generally closer to the horizontal line given by \log(\hat{\gamma}) = 0. The global likelihood ratio test did not identify significant miscalibration (P = 0.09). In contrast, both the global test and the multiparameter Wald test evaluating the shape of miscalibration were significant when applied to the updated risk scores obtained from the linear-logistic recalibration model.

To compare the two models on calibration in the test dataset, we pre-specified cutpoints ξ = {.00001, .00003, .0001, .0003, .001, .003, .01, .03, .1, .3} on the probability scale and re-estimated \log(\hat{\gamma}_{1i}) (cubic spline recalibration model) and \log(\hat{\gamma}_{2i}) (linear-logistic recalibration model) for sequential risk strata as characterized by ξ. These cutpoints were chosen because they are roughly equidistant on the logit scale. The miscalibration index for the risk scores that were recalibrated using cubic splines (Q_1) was smaller than that for the risk scores that were recalibrated using the linear-logistic model (Q_2), corresponding to an estimated miscalibration ratio of R = 0.654. In

other words, the risk scores that were recalibrated using cubic splines were 34.6% better calibrated in the test dataset than the risk scores that were recalibrated using the linear-logistic model. An empirical estimate of the 95% confidence interval for R based on bootstrap resampling was [0.651, 0.657].

2.5 Discussion

A simple but flexible recalibration method for binary prediction models has been presented, which extends the results of Cox [7] by allowing for a broader family of functions to characterize calibration. This method works by introducing an offset representing the predicted log-odds of the outcome into the model, thus establishing a regression model for the logarithm of the observed-to-expected odds ratio for the event as a function of the risk score. This representation is also appealing because it facilitates more straightforward graphical assessments of calibration; perfect calibration is given by a horizontal line at zero, instead of a diagonal line through zero. When perfect calibration is represented by a diagonal line, it can be more difficult to identify potentially important deviations [49]. Analogously, when the predicted probabilities are used to assess calibration, as in the method described by Harrell (2001) [6], potentially important deviations for low or high predicted probabilities can be difficult to ascertain graphically.

Figure 2.3: Calibration of two sets of updated risk scores in independent test data, based respectively on a cubic spline recalibration model and a traditional linear-logistic recalibration model. Histograms of the two sets of updated risk scores are overlaid below; calibration curves are shown for the middle 99% of the distributions of the respective risk scores. The displayed calibration curves were fit to the sets of updated risk scores using natural cubic splines.
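The natural cubic regression splines used for these calibration curves are themselves a basis expansion of the risk score, as in (2.6). A minimal sketch of such a basis in the truncated-power form given by Hastie et al. [46] follows; the function name and knot values are illustrative, not from the dissertation:

```python
def natural_cubic_basis(x, knots):
    """Natural cubic spline basis (truncated-power form): one linear column
    plus K - 2 curvature columns, each linear beyond the boundary knots."""
    K = len(knots)

    def d(k, t):
        num = max(t - knots[k], 0.0) ** 3 - max(t - knots[K - 1], 0.0) ** 3
        return num / (knots[K - 1] - knots[k])

    return [[t] + [d(k, t) - d(K - 2, t) for k in range(K - 2)] for t in x]

# with 4 knots this yields 3 columns; more interior knots raise the
# degrees of freedom of the spline fit
basis = natural_cubic_basis([0.5, 1.5, 2.5], knots=[0.0, 1.0, 2.0, 3.0])
```

The defining property of the "natural" construction is linearity beyond the boundary knots, which stabilizes the calibration curve in the sparse tails of the risk-score distribution.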

Steyerberg et al. (2004) [50] describe an "unreliability index" for the linear-logistic calibration model as the difference in −2 log-likelihood between a linear-logistic model with the intercept and slope estimated as free parameters and a model with the intercept and slope fixed at 0 and 1, respectively. This concept was extended to the general calibration framework presented in this article by comparing the likelihood for the fitted calibration model against the likelihood for a model which included only an offset term for the risk score (and no intercept). This difference in likelihoods is useful for testing overall calibration, but a better measure of general calibration is perhaps based on the total size of the estimated observed-to-expected log odds ratios. While the latter measure is dependent on the choice of parameterization for the calibration model, it is nonetheless useful for comparing two risk scores on calibration performance, given a pre-specified parameterization that is common between the two risk scores (such as discrete risk strata).

A test for assessing whether or not observed miscalibration is more complex than overall calibration-in-the-large was also proposed. This amounts to a standard Wald test for a collection of model coefficients. If this test is not significant, then the miscalibration might be easily correctable by estimating an intercept-only calibration model (which includes an offset for the risk score). However, the possibility of calibration model misspecification cannot be excluded from consideration.
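The comparison of two risk scores via the miscalibration index and ratio can be sketched as follows. The stratum-level log(γ) values are invented for illustration, and the bootstrap here resamples strata for brevity; in the dissertation's application one would resample patients and refit the calibration models on each replicate:

```python
import random

def miscal_index(log_gammas):
    """Q: squared norm of the estimated log O/E odds ratios (Equation 2.11)."""
    return sum(g * g for g in log_gammas)

random.seed(2)
# hypothetical stratum-level log(gamma) estimates for two competing risk scores
lg1 = [random.gauss(0.0, 0.1) for _ in range(10)]   # better-calibrated score
lg2 = [random.gauss(0.0, 0.3) for _ in range(10)]   # worse-calibrated score

r_hat = miscal_index(lg1) / miscal_index(lg2)       # Equation 2.12

# percentile bootstrap for an empirical 95% confidence interval
boots = []
for _ in range(2000):
    idx = [random.randrange(10) for _ in range(10)]
    boots.append(miscal_index([lg1[i] for i in idx]) /
                 miscal_index([lg2[i] for i in idx]))
boots.sort()
lo, hi = boots[49], boots[1949]                     # empirical 95% interval
```

A ratio below 1 favors the first risk score, mirroring the R = Q_1/Q_2 comparison in the example.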

After the calibration model is fit to the validation dataset, recalibrating the risk score amounts to adding the calibration model prediction to the original risk score. Calibration of the corrected scores can then be evaluated by re-fitting the calibration model in an independent set of data.

In summary, a more flexible method for assessing calibration of binary prediction models and recalibrating model predictions, as well as a measure of relative calibration performance between competing models, were presented.
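The update step just described (adding the calibration model's prediction to the original risk score, per Equations 2.9 and 2.10) can be sketched in a few lines. The values below are illustrative, echoing the chapter's example of a prediction near one in a billion shifted upward by a log O/E odds ratio near 11:

```python
import math

def recalibrate(p_hat, log_gamma):
    """Apply Equations 2.9-2.10: shift the risk score by the estimated
    log observed-to-expected odds ratio, then invert the logit."""
    rs = math.log(p_hat / (1.0 - p_hat))
    rs_star = rs + log_gamma
    return 1.0 / (1.0 + math.exp(-rs_star))

# a raw prediction of about one in a billion, with log(gamma) estimated
# near 11, moves to roughly the order of one in ten thousand
p_updated = recalibrate(1e-9, 11.0)
```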

Chapter 3

Baseline Risk Adjustment for In-Hospital Mortality using ICD-9-CM Codes

3.1 Introduction

All stakeholders in health care (patients, insurers, governments, and providers) are now demanding that we rigorously assess quality. Further, these quality measurements have come to influence both patients' selection of providers and payment for services [51, 52]. As such, ensuring that comparisons among physicians and among institutions are fair is critical. Consequently, risk adjustment models abound [11, 9, 8, 53, 54, 55]. Alarmingly, there is evidence that many of these models are inconsistent in terms of their characterizations of risk [56, 57]. Which model is best for risk adjustment is a difficult question to answer.

Often, the relative merits of risk adjustment models are established by their ability to discriminate between patients who experience the outcome(s) in question and patients who do not, for instance, by using the concordance index (or C-statistic) [3]. Generally speaking, models which are developed for patient populations that exhibit a broad spectrum of risk (such as the US inpatient population or the population of Medicare patients) can discriminate outcomes very well; it is thus rather unsurprising that these models achieve very high concordance indices. Sessler and colleagues developed a risk adjustment methodology which incorporated all diagnoses and procedures associated with each stay, and showed that their models discriminated outcomes comparably to or better than both patient demographic characteristics and the Charlson Comorbidity Index [53]. Their risk stratification indices (RSIs) are based on International Classification of Disease, 9th Revision, Clinical Modification (ICD-9-CM) codes and were estimated and validated using the Medicare Provider Analysis and Review (MEDPAR) database. Since elderly patients are generally sicker than younger patients, probability estimates obtained from these models might not accurately represent actual risk in external populations. (Though, concern regarding this type of model calibration is hardly unique to these models.)

Furthermore, the ICD-9-CM diagnosis and procedure codes in the MEDPAR database did not incorporate present-on-admission (POA) indicators. To the extent that new and problematic diagnoses (such as hospital-acquired infection) occur after admission, these risk adjustment models may over-estimate the baseline risk associated with each stay. This results in artificially lower (better) quality of care metrics such as risk-adjusted mortality rate. Furthermore, the degree by which quality-of-care metrics are under-estimated may differ if certain hospitals have a greater proportion of in-hospital complications (i.e., are of poorer quality) than others [58, 59, 60]. On the other hand, hospital-acquired diagnoses could also be beneficial in terms of risk of outcomes. Ensuring that expected risk is characterized using only pre-existing conditions may therefore be important. Along these lines, POA indicators could inform risk adjustment of hospital outcomes by eliminating diagnoses representing hospital-acquired complications from risk-adjustment algorithms [61] (as well as protective diagnoses whose absence prior to admission would indicate higher baseline risk than that predicted by ignoring their POA status). However, with POA coding yet to be implemented in some locations (despite the fact that the practice was mandated by Congress in 2005), risk adjustment models that do not rely on POA indicators might

still be useful, provided they do not yield biased or inaccurate estimates. The primary goal of this research is thus to develop and prospectively validate a baseline risk index for in-hospital mortality (which we denote "POARisk") using only POA diagnoses, principal procedures, and secondary procedures occurring exclusively prior to the date of the principal procedure; secondarily, we will assess the reliability of a similarly-derived risk index which ignores the POA status of diagnoses and the timing of procedures (thus including all diagnoses and procedures), in terms of accurately representing the primary baseline risk index.

3.2 Methods

Under authorization by the US Agency for Healthcare Research and Quality, we obtained data on 24 million inpatient discharges from the California State Inpatient Database (CA-SID). This registry represents a census of discharges occurring within the state. POA indicators are captured for all ICD-9-CM diagnosis codes; likewise, date of procedure (relative to the admission date) is captured for all ICD-9-CM procedure codes. The earlier years of data were used to derive our models, while the data from 2009 were used to prospectively test them (see Model Performance and Reliability below). The data were further split randomly to facilitate a

two-step modeling procedure. Eighty percent of those discharges were used for initial model development, and the remaining 20% were used to perform an initial calibration, or bias-correction, of risk estimates produced by the initial model (we say an "initial" calibration because we feel that calibration should constantly be addressed whenever the model is applied in external populations; see the Model Development and Calibration subsections below for details). The only exclusion we made was for discharges in which the patient did not undergo a procedure; thus our models sought to characterize in-hospital mortality risk for all inpatients undergoing at least one procedure. A summary of discharges included and excluded, as well as how the included discharges were partitioned for the purposes of our study, is provided in Figure 3.1.

3.2.1 Model Development

To develop the initial POARisk model, we used logistic regression with in-hospital mortality as the response and a collection of predictors derived from the ICD-9-CM diagnosis and procedure codes. Considered as inputs to our model were the POA diagnosis codes, the principal procedure code, and any secondary procedure codes for which the date of procedure was prior (but not equal) to the date of the principal procedure. We also used patients' age

Figure 3.1: Study flow diagram.

and gender (age was represented by two predictors: one which estimated risk for infants less than 1 year old, and a linear term for the rest of the patients). The ICD-9-CM codes are hierarchical in nature. For example, acute myocardial infarctions are coded with diagnosis code 410.XX; the fourth digit further classifies these diagnoses based on the location (e.g., 410.2X refers to the inferolateral wall), and the fifth digit specifies the episode of care. As such, many of the five-digit codes lacked sufficient representation for inclusion in our logistic model: an aggregation routine was needed in order to have predictors with adequate cell sizes.

In similar fashion to the original RSI [11], we aggregated these sparsely-represented diagnoses by truncating the fifth digit from the corresponding ICD-9-CM diagnosis code. Codes with fewer than 1,000 discharges per year on average in the 80% model development cohort were truncated to four digits (for this average calculation we excluded the year 2004, as there were a number of new codes introduced the next year). The process was repeated, truncating sparsely-represented four-digit codes to three digits. Three-digit codes represented by fewer than 1,000 discharges per year were not included in the model. A comparable aggregation algorithm was implemented for the procedure codes, though we note that procedure codes are represented by a maximum of four digits and the base codes are only two digits; thus we

aggregated procedure codes from four to three to two digits based on the 1,000-discharges-per-year criterion. We used an elastic net approach to fit logistic models based on the aggregated predictors [18]. The elastic net is a shrinkage methodology devised to protect against over-fitting a model to the development cohort. The term "shrinkage" comes from the fact that regression coefficients are purposely biased toward zero; this has been shown to improve prediction accuracy in external cohorts (specifically, the elastic net encourages highly-correlated predictors to be averaged while at the same time encouraging irrelevant predictors to be removed from the model altogether) [46, 15]. Removing variables in this manner has been shown to have favorable statistical properties over traditional methods such as stepwise variable selection [15]. To fit these models, we used the R statistical software package glmnet, developed by Friedman et al. (2010) [19] (on R for 64-bit Linux, The R Project for Statistical Computing, Vienna, Austria).

3.2.2 Calibration

With pay-for-performance pressures, providers tend to avoid high-risk patients for fear of the underlying risk adjustment model's inability to adequately describe their risk of outcome [4]. In other words, there is either a

perceived or real lack of agreement between the predicted probability of an outcome produced by the model in question and the actual probability of the outcome in a new set of patients, i.e., a lack of model calibration. Alas, calibration is often overlooked in risk adjustment modeling [62]; even when calibration is considered, it is often as a model diagnostic [63] instead of as a prescription for adjusting the model estimates to remove any biases introduced by the lack of calibration. Furthermore, calibration in the patient population for which a model is developed is no guarantee of calibration in prospective and/or external patient populations.

Therefore, we initially calibrated our model using the randomly-reserved 20% calibration cohort, with the intention that calibration be again assessed and corrected whenever the model is used on new data. This calibration was done by fitting a logistic regression model with in-hospital mortality as the response and the risk score (i.e., the model prediction on the log-odds scale) as the only predictor [7, 64]. An offset term (that is, a predictor for which the coefficient is fixed at a value of one) was used in the model in order to represent miscalibrations as deviations from a horizontal line at zero (see Figure 3.2). Restricted cubic splines were used to allow for nonlinearities, resulting in a robust calibration curve which corrected initial mortality risk estimates based on the actual incidences observed in the calibration cohort. Predicted log-odds of mortality were then adjusted based on

this calibration curve to yield the final POARisk score.

3.2.3 Comparator Model

Our research objective was to evaluate whether or not the absence of POA indicators precludes accurate and unbiased estimation of patients' baseline risk of mortality. To study this hypothesis, we developed a second model (which we denote "AllCodeRisk"). For this model, we employed the same strategy as in the primary POARisk model (including the initial calibration step). The only difference between the two models was the inclusion in the AllCodeRisk model of all diagnosis codes, regardless of whether or not they were POA, and all procedure codes, regardless of when they were performed during the stay.

3.2.4 Model Performance and Reliability

We used the 2009 CA-SID to prospectively evaluate the performance of the POARisk and AllCodeRisk models. Each risk score underwent a second calibration step whereby the risk scores were modified to calibrate specifically with the 2009 data, using the same calibration methodology as described above. (Hereafter, reference to the POARisk and AllCodeRisk scores in the context of the 2009 data refers to these recalibrated scores.) Discriminative ability of the corrected scores was evaluated by estimating respective concordance indices.

Figure 3.2: Calibration curves displaying the relationship between observed outcomes and model predictions. On the x-axis is the risk score (given as a predicted probability), and on the y-axis is an observed-to-expected (O/E) odds ratio. Perfect calibration implies an O/E odds ratio of 1 across the risk spectrum. An O/E odds ratio of 0.1 among patients with predicted mortality risk of 10⁻³, for example, implies that observed mortality was 10 times less likely for these patients than predicted by the model (thus the risk score was too high). Histograms of the risk scores underlie each panel. Calibration curves are truncated to the middle 99% of the data. Panel (a) displays the calibration of raw scores from the logistic model, within the random 20% calibration cohort. Correcting predictions based on these curves was insufficient to ensure complete calibration in the prospective 2009 data (b). However, re-calibration within the 2009 data based on the curve in (b) yielded favorable calibration for both models (c).

To evaluate the ability of the AllCodeRisk model to accurately approximate the POARisk model in terms of individual patient risk, we calculated a ratio of predicted odds (RPO) for each patient as the odds of mortality under the AllCodeRisk model divided by the odds under the POARisk model. If the AllCodeRisk model accurately approximates the POARisk model, then the RPOs would be tightly clustered around a value of 1.0. We thus pre-supposed that adequate approximation would be demonstrated if risk under the AllCodeRisk model was within ±50% of risk under the POARisk model for at least 95% of patients (i.e., an RPO between 0.5 and 1.5). We also evaluated this accuracy at different levels of patient risk (as defined by the POARisk model) by making a Bland-Altman-type plot (i.e., a scatterplot of the RPO vs. POARisk) [49]. On this plot we overlaid quantile regression curves approximating the median, inter-quartile range, and middle 95% of the data as a function of POARisk.

Finally, we performed an analysis whereby hospital performance was compared under each model. Hospital performance was defined as an observed-to-expected mortality ratio, with the expected number of mortalities differentially defined for the two models. For a given model, the expected number of mortalities was calculated as the sum of individual patients' predicted probabilities of mortality. The percent difference in O/E ratio was then calculated for each hospital and analyzed using a histogram (for this analysis, we excluded hospitals for which there were fewer than 30 inpatient mortalities over the year 2009). Adequate approximation of hospital performance via the AllCodeRisk model was pre-defined as having at least 95% of hospitals with an O/E ratio under the AllCodeRisk model that was within ±20% of that defined by the POARisk model.

3.3 Results

Of the 20 million discharges in the CA-SID, 7.3 million were associated with inpatient stays for which no procedures were performed. Removing these discharges and randomly partitioning the data, we used 10.1 million discharges (80%) for fitting the logistic models and 2.5 million (20%) for estimating the calibration curves. Aggregation of the ICD-9-CM codes based on the cell size criterion of 1,000 patients per year on average resulted in 2,476 predictors for the POARisk model (1,807 diagnosis-related predictors, 666 procedure-related predictors, and 3 demographic-related predictors) and 2,584 predictors for the AllCodeRisk model (1,870, 711, and 3 predictors, respectively). The elastic net logistic regression modeling algorithm removed 501/2,476 (20.2%) and 494/2,584 (19.1%)

irrelevant predictors, respectively. Calibration of the raw risk scores among the randomly-reserved 20% initial calibration cohort was poor (Figure 3.2a). Both risk scores overestimated risk for patients with predicted probabilities roughly between 10⁻⁴ and 10⁻² and for patients with predicted probabilities roughly greater than 0.6. Using these calibration curves to correct the model estimates and applying the corrected models to the 2009 discharges, we found that calibration generally improved for both models but was still less than ideal (Figure 3.2b). However, re-calibration based on these curves resulted in risk estimates that unbiasedly represented true risk among the 2009 data (Figure 3.2c). C-statistics for the POARisk and AllCodeRisk models (as re-calibrated to the 2009 data) were both high (0.981 for the latter), indicating a high degree of discriminative ability.

In the 2009 data, individual patient risk estimates based on the AllCodeRisk model tended to depart from those obtained from the POARisk model. Adequate approximation of baseline risk by the AllCodeRisk model, defined as a risk estimate within 50% of that obtained from the POARisk model (i.e., an RPO between 0.5 and 1.5), occurred in only 15.8% of patients, far below the pre-specified criterion of at least 95% of patients. Risk estimates were lower under the AllCodeRisk model than under the POARisk model for 92.5% of patients. The median RPO was 0.25,

i.e., the predicted odds of mortality under the AllCodeRisk model was 0.25 times the predicted odds under the POARisk model or lower for 50% of the 2009 discharges. This inconsistency in risk estimates generally held across the entire risk spectrum (Figure 3.3).

In the analysis of hospital performance under the two models, we excluded 125/424 hospitals (29.5%) because they had fewer than 30 deaths in 2009; thus we analyzed 299 hospitals. The median percent change in observed-to-expected mortality ratio (using the AllCodeRisk model vs. using the POARisk model) was -0.1% (Figure 3.4). Ninety-five percent of the hospitals had a percent change between -20.8% and +30.6%, implying that variability in performance was not adequately low (based on our pre-defined criterion of within ±20% for at least 95% of hospitals); 89.6% of hospitals had percent differences smaller than ±20%.

3.4 Discussion

Risk adjustment is fraught with difficulties. Choosing the right risk adjustment model to gauge quality has financial implications. To a large extent, statistical performance should guide this decision. We developed two highly-accurate models for in-hospital mortality, based on differing sets of ICD-9-CM codes. The POARisk model used only codes that would resemble our best

Figure 3.3: Scatterplot of 50,000 randomly sampled discharges from the 2009 data, displaying the ratio of predicted odds of mortality (calculated as the predicted odds under the AllCodeRisk model divided by the predicted odds under the POARisk model) as a function of the POARisk score. Each of the risk scores had been re-calibrated to the 2009 data. Fidelity of the AllCodeRisk model to the POARisk model in terms of characterizing individual patient risk is represented by the horizontal line at an RPO of 1.0 (dashed line). Quantile regression curves displaying the median, first and third quartiles, and middle 95% of the data as a function of POARisk (fit using the entire 2009 sample) are overlaid. The plot shows that the AllCodeRisk model produces risk estimates that are too low for the majority of the patients, though risk estimates among high-risk patients were more consistent.
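The two agreement measures used in this chapter, the patient-level ratio of predicted odds and the hospital-level percent change in O/E mortality ratio, reduce to a few lines of arithmetic. The probabilities and outcomes below are invented for illustration:

```python
def odds(p):
    return p / (1.0 - p)

def rpo(p_allcode, p_poa):
    """Ratio of predicted odds: AllCodeRisk odds over POARisk odds."""
    return odds(p_allcode) / odds(p_poa)

# hypothetical (AllCodeRisk, POARisk) predicted probabilities for 3 patients
pairs = [(0.010, 0.030), (0.050, 0.060), (0.200, 0.210)]
rpos = [rpo(a, p) for a, p in pairs]
frac_adequate = sum(0.5 <= r <= 1.5 for r in rpos) / len(rpos)

def oe_ratio(deaths, expected_probs):
    """Observed-to-expected mortality ratio for one hospital."""
    return sum(deaths) / sum(expected_probs)

# hypothetical hospital: 3 deaths among 5 stays, with risk under each model
died = [1, 0, 1, 0, 1]
p_poa = [0.40, 0.10, 0.50, 0.20, 0.80]
p_all = [0.30, 0.05, 0.45, 0.15, 0.75]
pct_change = 100.0 * (oe_ratio(died, p_all) / oe_ratio(died, p_poa) - 1.0)
```

Because the AllCodeRisk probabilities here are uniformly lower, the expected death count shrinks and the O/E ratio rises, which is the direction of bias the chapter attributes to ignoring POA status.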


More information

TECHNICAL APPENDIX WITH ADDITIONAL INFORMATION ON METHODS AND APPENDIX EXHIBITS. Ten health risks in this and the previous study were

TECHNICAL APPENDIX WITH ADDITIONAL INFORMATION ON METHODS AND APPENDIX EXHIBITS. Ten health risks in this and the previous study were Goetzel RZ, Pei X, Tabrizi MJ, Henke RM, Kowlessar N, Nelson CF, Metz RD. Ten modifiable health risk factors are linked to more than one-fifth of employer-employee health care spending. Health Aff (Millwood).

More information

Using Instrumental Variables to Find Causal Effects in Public Health

Using Instrumental Variables to Find Causal Effects in Public Health 1 Using Instrumental Variables to Find Causal Effects in Public Health Antonio Trujillo, PhD John Hopkins Bloomberg School of Public Health Department of International Health Health Systems Program October

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Chapter 4. Parametric Approach. 4.1 Introduction

Chapter 4. Parametric Approach. 4.1 Introduction Chapter 4 Parametric Approach 4.1 Introduction The missing data problem is already a classical problem that has not been yet solved satisfactorily. This problem includes those situations where the dependent

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts A Study of Statistical Power and Type I Errors in Testing a Factor Analytic Model for Group Differences in Regression Intercepts by Margarita Olivera Aguilar A Thesis Presented in Partial Fulfillment of

More information