Robust Bayesian Variable Selection for Modeling Mean Medical Costs


1 Robust Bayesian Variable Selection for Modeling Mean Medical Costs
Grace Yoon (1), Wenxin Jiang (2), Lei Liu (3) and Ya-Chen T. Shih (4)
(1) Department of Statistics, Texas A&M University
(2) Department of Statistics, Northwestern University
(3) Division of Biostatistics, Washington University in St. Louis
(4) Department of Health Services Research, The University of Texas MD Anderson Cancer Center

2 Rising Medical Costs
Medical costs in the U.S. have risen faster than general inflation for decades.
Healthcare spending per capita exceeded $10,000 for the first time in 2016 (Keehan et al. 2016).
Rapid growth in U.S. medical costs has been a major policy concern: consider the Affordable Care Act (Obamacare) and the effort to repeal it under President Trump.
Analyzing medical cost data has become increasingly important, because understanding the impact of any cost containment initiative requires accurate and reliable statistical inference on medical costs.

3 Healthcare spending in 2016, % of GDP (China: 6.2%)

4 Medical Cost Data
Medical cost data are collected routinely by hospitals, government agencies, and health insurance companies.
Modeling medical costs is of great interest in health economics studies.
Goal: to identify the risk factors of medical costs and ascertain the most cost-effective treatment, which in turn can assist policy makers in maximizing health benefits for individuals and society.

5 Medical Expenditure Panel Survey (MEPS)
The Medical Expenditure Panel Survey (MEPS), funded by AHRQ, is the most complete source of data on the cost and use of health care and health insurance coverage. It has three components: Household, Insurance/Employer, and Medical Provider.
Survey design: a probability sample designed to provide nationally representative estimates of health care use, medical expenditures, health attitudes, and health status.
Data structure: an overlapping panel, in which each individual has at most 2 years of data.
More information available at …
It is publicly available, FREE!

6 Challenges in Modeling Medical Costs
Medical costs are often right-skewed and heteroscedastic. (Robust)
Chen et al. (2013) and Chen et al. (2016) developed models that make assumptions on the mean cost only and avoid restrictive assumptions on higher-order moments. However, they did not take variable selection into account in their predictive models. (Parsimonious)
A Bayesian approach can provide posterior probability ratios for the selected models. (Informative)

7 Model
Consider a log-linear mean model without any additional distributional assumption for the response variable:
$$E(Y_i \mid X_i) = \mu(X_i, \beta) = e^{X_i^T \beta}, \qquad i = 1, \dots, n, \tag{1}$$
where $X_i = (1, x_{i1}, x_{i2}, \dots, x_{ip})^T$ and $\beta = (\beta_0, \beta_1, \beta_2, \dots, \beta_p)^T$.

8 Asymptotically normal estimate $\hat\beta$ based on the full model
Estimating equation (also the score equation) for the Poisson distribution:
$$S(\beta) = \sum_{i=1}^{n} X_i \left( Y_i - e^{X_i^T \beta} \right) = 0.$$
$\hat\beta$ is asymptotically normal for the true parameter $\beta$ of interest, even if the true model of $Y_i$ is NOT Poisson.
The naive Poisson log-likelihood is typically globally concave, so the maximizer $\hat\beta$ is unique and easy to compute.
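As a rough illustration of why $\hat\beta$ is easy to compute, the score equation can be solved by Newton-Raphson. The sketch below is not the authors' code: the function name `fit_log_linear_mean`, the starting value, and the toy data (Gamma rather than Poisson responses, with made-up coefficients) are all assumptions chosen for the example.

```python
import numpy as np

def fit_log_linear_mean(X, y, n_iter=50, tol=1e-10):
    """Solve S(beta) = sum_i X_i (y_i - exp(X_i^T beta)) = 0 by Newton-Raphson.

    Assumes the first column of X is the intercept and mean(y) > 0.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # start at the intercept-only solution
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)              # S(beta)
        info = X.T @ (X * mu[:, None])      # minus the derivative of S(beta)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data (hypothetical): the responses are Gamma, not Poisson,
# but the mean is still exp(X beta).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.binomial(1, 0.5, size=(n, 2))])
beta_true = np.array([1.0, 0.5, -0.3])
y = rng.gamma(shape=1.0, scale=np.exp(X @ beta_true))

beta_hat = fit_log_linear_mean(X, y)
print(beta_hat)
```

Only the mean structure enters the estimating equation, which is why the fit does not need the responses to be counts.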

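9 Sandwich formula
Since the Poisson model is a naive probability model, the asymptotic variance $V$ of $\hat\beta$ should be estimated by a robust sandwich formula:
$$\hat V = \left( \sum_{i=1}^{n} X_i X_i^T e^{X_i^T \hat\beta} \right)^{-1} \left( \sum_{i=1}^{n} X_i \left( Y_i - e^{X_i^T \hat\beta} \right) \left( Y_i - e^{X_i^T \hat\beta} \right)^T X_i^T \right) \left( \sum_{i=1}^{n} X_i X_i^T e^{X_i^T \hat\beta} \right)^{-1}.$$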
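The sandwich formula translates almost term by term into matrix code. The following is a minimal sketch, with the function name `sandwich_variance` and the four-point toy data chosen purely for illustration. In the intercept-only case the score equation gives $\hat\beta_0 = \log \bar Y$, and the formula reduces to $\sum_i (Y_i - \bar Y)^2 / (n \bar Y)^2$, which makes the result easy to check by hand.

```python
import numpy as np

def sandwich_variance(X, y, beta_hat):
    """Robust variance estimate: bread^{-1} @ meat @ bread^{-1}."""
    mu = np.exp(X @ beta_hat)
    resid = y - mu
    bread = X.T @ (X * mu[:, None])           # sum_i X_i X_i^T exp(X_i^T beta_hat)
    meat = X.T @ (X * (resid ** 2)[:, None])  # sum_i X_i (Y_i - mu_i)^2 X_i^T
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

# Intercept-only toy example (hypothetical data): beta_hat = log(mean(y)).
y = np.array([1.0, 2.0, 4.0, 9.0])
X = np.ones((4, 1))
beta_hat = np.array([np.log(y.mean())])

V_hat = sandwich_variance(X, y, beta_hat)
closed_form = np.sum((y - y.mean()) ** 2) / (len(y) * y.mean()) ** 2
print(V_hat[0, 0], closed_form)
```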

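10 Posterior probability
Suppose we have an asymptotically normal estimate $\hat\beta$ of the parameter $\beta$, so that
$$\hat\beta \mid \beta, U \sim N(\beta, V).$$
Together with a normal prior $\beta \mid U \sim N(0, U)$, we can integrate out $\beta$ exactly: $\hat\beta \mid U \sim N(0, U + V)$, that is,
$$p(\hat\beta \mid U) = \frac{1}{\sqrt{\det(2\pi (U + V))}} \exp\left\{ -\frac{1}{2} \hat\beta^T (U + V)^{-1} \hat\beta \right\}.$$
Then the posterior distribution for model $U$ can be obtained from
$$p(U \mid \hat\beta) \propto p(\hat\beta \mid U)\, p(U).$$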
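Because the marginal density is an ordinary multivariate normal, its logarithm can be evaluated directly. The sketch below is illustrative only: `log_marginal` is a hypothetical helper name, and the one-dimensional numbers are invented to compare a large prior variance against a small one for the same estimate.

```python
import numpy as np

def log_marginal(beta_hat, U, V):
    """log p(beta_hat | U) for beta_hat | U ~ N(0, U + V)."""
    S = U + V
    _, logdet = np.linalg.slogdet(2.0 * np.pi * S)
    quad = beta_hat @ np.linalg.solve(S, beta_hat)
    return -0.5 * logdet - 0.5 * quad

# One-dimensional check: U = V = 1 and beta_hat = 1 gives N(0, 2) evaluated at 1.
lm = log_marginal(np.array([1.0]), np.array([[1.0]]), np.array([[1.0]]))

# An estimate three standard errors from zero (made-up numbers):
# the large prior variance explains it better than the small one.
V = np.array([[1.0]])
lm_slab = log_marginal(np.array([3.0]), np.array([[100.0]]), V)
lm_spike = log_marginal(np.array([3.0]), np.array([[0.1]]), V)
print(lm, lm_slab - lm_spike)
```

With equal prior weights $p(U)$, the difference of two log marginals is the log posterior ratio of the corresponding models.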

11 Spike-or-Slab (SorS) priors
Spike-or-Slab (SorS) priors: $\beta \sim N(0, U)$, where $U = \mathrm{diag}(u_0, u_1, \dots, u_p)$ and $u_j$ is the prior variance of $\beta_j$, taking either a prespecified small value (spike variance) or a prespecified large value (slab variance).
This strategy approximately selects component $\beta_j$ if $u_j$ is large, and neglects $\beta_j$ if $u_j$ is small.
Let $u_0$ be large so that the intercept is always in the model.

12 How to choose spike and slab variance?
$u_j = \mathrm{var}(\beta_j)$, with $u_j / V_{jj} \in \{a \text{ (spike variance)},\ A \text{ (slab variance)}\}$, where $V_{jj}$ is the $j$th diagonal element of $V = \mathrm{Var}(\hat\beta)$.
$a$ and $A$ play the role of tuning parameters.
Fix $A$ to be relatively large, to avoid unnecessary penalization of the selected coefficients.
Set $a$ to a small but positive value, rather than zero, to absorb negligible nonzero coefficients into the spike distribution.
In practice, choose $a$ by cross-validation and set $A = n$.
In the simulations and the real data application, we performed 10-fold cross-validation to select an optimal tuning parameter $a$ based on the RMSE (root mean square error).

13 Model selection procedure
Step 1. Calculate the Z-statistic for each variable in the full model using the sandwich variance estimator ($p < n$):
$$Z_j = \hat\beta_j \big/ \sqrt{\hat V_{jj}}, \qquad j = 1, \dots, p.$$
Step 2. Rank all $p$ variables by the absolute value of their Z-statistics.

14 Step 3. Compare the posterior probabilities of the $p$ candidate models in the Z-scope
$$\Phi_Z = \{ M_{(1)}, M_{(2)}, \dots, M_{(p)} \}, \qquad M_{(j)} = \{ k : k \in \{1, \dots, p\},\ |Z_k| \ge |Z|_{(j)} \}, \quad j = 1, \dots, p,$$
where $|Z|_{(j)}$ denotes the $j$th largest absolute Z-statistic.
In other words, each candidate model contains the $d$ variables with the largest $|Z_j| = |\hat\beta_j| / \sqrt{\hat V_{jj}}$; these receive large $u_j$ ($u_j = A \hat V_{jj}$ for large $A$), and the other $p - d$ variables receive small $u_j$ ($u_j = a \hat V_{jj}$ for small $a$).
The posterior $p(U \mid \hat\beta)$ is then computed for each of the $p$ candidate models, and the best model is the one with the highest posterior probability.
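Steps 1 to 3 can be sketched in a few lines once $\hat\beta$ and $\hat V$ are available. This is a hedged illustration, not the authors' implementation: the names `log_marginal` and `sors_select` are hypothetical, the prior $p(U)$ is assumed equal across candidate models, the intercept (index 0) always receives the slab variance, and the toy $\hat\beta$, $\hat V$, $a = 1$ and $A = 1000$ are made-up values.

```python
import numpy as np

def log_marginal(beta_hat, U, V):
    """log p(beta_hat | U) for beta_hat | U ~ N(0, U + V)."""
    S = U + V
    _, logdet = np.linalg.slogdet(2.0 * np.pi * S)
    return -0.5 * logdet - 0.5 * beta_hat @ np.linalg.solve(S, beta_hat)

def sors_select(beta_hat, V_hat, a, A):
    """Rank covariates by |Z| and pick the nested model with the highest posterior."""
    v = np.diag(V_hat)
    z = beta_hat / np.sqrt(v)                      # Step 1: Z-statistics
    p = len(beta_hat) - 1                          # index 0 is the intercept
    order = 1 + np.argsort(-np.abs(z[1:]))         # Step 2: decreasing |Z|
    log_post = np.empty(p)
    for d in range(1, p + 1):                      # Step 3: models M_(1), ..., M_(p)
        ratio = np.full(p + 1, a, dtype=float)     # spike: u_j = a * V_jj
        ratio[0] = A                               # intercept always in the model
        ratio[order[:d]] = A                       # slab:  u_j = A * V_jj
        log_post[d - 1] = log_marginal(beta_hat, np.diag(ratio * v), V_hat)
    best_d = int(np.argmax(log_post)) + 1
    return sorted(order[:best_d].tolist()), log_post

# Toy inputs (hypothetical): covariates 1 and 3 have large |Z|, 2 and 4 do not.
beta_hat = np.array([2.0, 1.0, 0.01, -0.9, 0.02])
V_hat = 0.01 * np.eye(5)
selected, log_post = sors_select(beta_hat, V_hat, a=1.0, A=1000.0)
print(selected)
print(np.exp(log_post - log_post.max()))  # each model's posterior relative to the best
```

The last line prints ratios of the kind used to say that one model is a given percentage as likely as the best one.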

15 Simulation setting
$R = 100$ simulated data sets, each with $n = 1000$ observations.
4 true variables among $p = 50$ (including intercept), where $\mu_i = e^{X_i^T \beta}$ and $\beta = (2, 2, 2, 2, 2, 0, \dots, 0)$.
2 by 2 design: 2 heteroscedasticity levels and 2 correlation structures for the covariates.
Heteroscedasticity level:
(Case 1) $Y_i \sim \mathrm{Gamma}(\mu_i, 1)$, that is, $E(Y_i) = \mu_i$, $V(Y_i) = \mu_i$.
(Case 2) $Y_i \sim \mathrm{Gamma}(1, 1/\mu_i)$, that is, $E(Y_i) = \mu_i$, $V(Y_i) = \mu_i^2$.
Correlation structure for predictors:
(Independent) $x_1, \dots, x_p$ are iid Bernoulli(0.5).
(Correlated) $x_1, \dots, x_p \sim$ Bernoulli(0.5) with $\mathrm{corr}(x_j, x_k) = 0.5^{|j-k|}$ for $1 \le j, k \le p$.
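One data set from the independent-predictor design could be generated roughly as follows. This is a sketch under stated assumptions: the function name `simulate` is hypothetical, the Gamma distributions are read as (shape, rate) so that the stated means and variances hold, the design matrix is taken to have 50 columns including the intercept, and the correlated Bernoulli design is not covered.

```python
import numpy as np

def simulate(n=1000, n_cols=50, case=1, seed=0):
    """Draw one data set with independent Bernoulli(0.5) predictors."""
    rng = np.random.default_rng(seed)
    covariates = rng.binomial(1, 0.5, size=(n, n_cols - 1))
    X = np.column_stack([np.ones(n), covariates])
    beta = np.zeros(n_cols)
    beta[:5] = 2.0                                # intercept plus 4 true variables
    mu = np.exp(X @ beta)
    if case == 1:
        y = rng.gamma(shape=mu, scale=1.0)        # E(Y) = mu, Var(Y) = mu
    else:
        y = rng.gamma(shape=1.0, scale=mu)        # E(Y) = mu, Var(Y) = mu^2
    return X, y, beta

X1, y1, beta = simulate(case=1)
X2, y2, _ = simulate(case=2)
print(X1.shape, y1.mean(), y2.mean())
```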

16 Model Comparison
Full model (without variable selection)
LASSO
sslasso (spike-and-slab lasso generalized linear models; Tang et al. (2017), BhGLM R package)

17 Model Comparison Criteria
RMSE of coefficient estimates: $\sqrt{ \sum_{j=0}^{p} ( \beta_j - \hat\beta_j )^2 / (p + 1) }$.
Selected model size.
Coverage probability: $\mathrm{Cov} = \sum_{r=1}^{R} I(M \subseteq \hat M^{(r)}) / R$.
Percentage of correct zeros: $\mathrm{Cor0} = \sum_{r=1}^{R} \sum_{j=1}^{p} I(\hat\beta_j^{(r)} = 0,\ \beta_j = 0) / [R(p - p_0)]$.
Percentage of incorrect zeros: $\mathrm{Inc0} = \sum_{r=1}^{R} \sum_{j=1}^{p} I(\hat\beta_j^{(r)} = 0,\ \beta_j \ne 0) / [R p_0]$.
Exact selection probability: $\mathrm{Ext} = \sum_{r=1}^{R} I(M = \hat M^{(r)}) / R$,
where $M$ and $\hat M^{(r)}$ denote the true model and the model selected on the $r$th generated data set, respectively.
Average accuracy rate of variable selection: $\mathrm{Acc} = \sum_{j=1}^{p} I(\hat\gamma_j = \gamma_j) / p$, where $\gamma_j = I(\beta_j \ne 0)$ and $\hat\gamma_j = I(\hat\beta_j \ne 0)$.
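The selection criteria are simple counts over the replicated fits. The sketch below is one possible reading: the function names are hypothetical, $p_0$ is assumed to be the number of truly nonzero covariate coefficients (as the denominators suggest), Acc is averaged over the $R$ data sets, and the two toy replicates are invented.

```python
import numpy as np

def coef_rmse(beta_true, beta_hat):
    """RMSE over all p + 1 coefficients, intercept included."""
    return float(np.sqrt(np.mean((beta_true - beta_hat) ** 2)))

def selection_metrics(beta_true, beta_hats):
    """beta_hats has one row per replicate; column 0 is the intercept."""
    truth = beta_true[1:] != 0                    # gamma_j
    sel = beta_hats[:, 1:] != 0                   # gamma_hat_j for each replicate
    R, p = sel.shape
    p0 = int(truth.sum())                         # assumed: number of true nonzeros
    return {
        "Cov": float(np.mean(np.all(sel[:, truth], axis=1))),
        "Cor0": float(np.sum(~sel[:, ~truth]) / (R * (p - p0))),
        "Inc0": float(np.sum(~sel[:, truth]) / (R * p0)),
        "Ext": float(np.mean(np.all(sel == truth, axis=1))),
        "Acc": float(np.mean(sel == truth)),
    }

# Toy example (hypothetical): one true covariate, two null covariates, R = 2.
beta_true = np.array([1.0, 2.0, 0.0, 0.0])
beta_hats = np.array([
    [1.1, 1.9, 0.0, 0.0],   # exact selection
    [0.9, 2.1, 0.3, 0.0],   # covers the truth but adds one false positive
])
metrics = selection_metrics(beta_true, beta_hats)
rmse_first = coef_rmse(beta_true, beta_hats[0])
print(metrics, rmse_first)
```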

18 RMSE of cost estimates (prediction of Y )

19 RMSE of coefficient estimates (estimation of β)

20 Model size (selection performance)

21 Selection performance (table; numeric entries not preserved)
Columns: Cov, Cor0, Inc0, Ext, Acc.
Rows: Oracle, followed by Full, LASSO, sslasso and SorS in each of four settings:
Case 1, $\mathrm{var}(y_i) = \mu_i$: independent predictors; correlated predictors.
Case 2, $\mathrm{var}(y_i) = \mu_i^2$: independent predictors; correlated predictors.

22 Summary of Simulation Studies
In summary, SorS performs better than the competing methods (Full, LASSO, sslasso) in terms of selection, prediction, and estimation.

23 Our Subset of MEPS data
n = 3,376 and p = 33, from the 2014 MEPS full-year consolidated data.
The mean medical cost is $10,321 and the median is $4,…
The standard deviation is $17,966.
Correlations between covariates lie in (−0.55, 0.6).

24 Application to MEPS data
To assess the performance of the variable selection methods, randomly sample half of the data for training and use the remaining half for testing (repeated 100 times).
Table (numeric entries not preserved): RMSE and model size for SorS, sslasso, LASSO and Full.

25 Variable Selection: Informativeness The model with 10 variables has the highest posterior probability. The second best model with 11 variables is 21% as likely as the best model. The third best model with 12 variables is 5.7% as likely as the best model.

26 Application to MEPS data
The model with the largest posterior probability (table; numeric entries not preserved).
Columns: Estimate, s.e., p-value.
Rows: (Intercept), HOSPEXP, INSCOV, DIABETES, PCS, ANYLMT, EMERG, CANCER, STRK, CORHRT, EDUCAT.

27 Interpretation of Results
Hospitalization (HOSPEXP) and emergency room visits (EMERG) increase the medical cost by a large percentage (3.4 times and 1.3 times, respectively).
Heart and blood vessel disease (CORHRT and STRK), body movement disorder (ANYLMT), cancer, and diabetes are all significantly associated with annual medical costs.
Gender and race variables are not selected in our model.
Expected medical costs are higher for the more highly educated: more educated individuals tend to care more about their health, and are probably more likely to have regular checkups and to spend more for better treatment.
The Physical Composite Score (PCS), a quality-of-life measure, has a negative association with medical costs.

28 Discussion
Simultaneously accounts for the severe skewness and heteroscedasticity of medical cost data.
Fits the data on the original scale, without any transformation of the response variable.
Goals: Robust, Parsimonious and Informative.
Limitation: applicable to relatively low-dimensional, large-sample data.
"Analysis of Medical Cost Data: Statistical and Econometric Tools" (with Tina Shih), under contract with Cambridge University Press.

29 Thank you!

30 References
Chen, J., Liu, L., Zhang, D. and Shih, Y.-C. T. (2013) A flexible model for the mean and variance functions, with application to medical cost data. Statistics in Medicine 32(24).
Chen, J., Liu, L., Zhang, D., Shih, Y.-C. T. and Severini, T. (2016) A flexible model for correlated medical costs, with application to medical expenditure panel survey data. Statistics in Medicine 35.
Jiang, W. and Li, X. (2004) Consistent model selection based on parameter estimates. Journal of Statistical Planning and Inference 121.
Tang, Z., Shen, Y., Zhang, X. and Yi, N. (2017) The spike-and-slab lasso generalized linear models for prediction and associated genes detection. Genetics 205(1).
Zheng, X. and Loh, W.-Y. (1995) Consistent variable selection in linear models. Journal of the American Statistical Association 90(429).

31 Performance of SorS for each a
Figure: exact selection probability versus a. Panels: Case 1 indep, Case 1 corr, Case 2 indep, Case 2 corr.

32 Performance of SorS for each a
Figure: model size versus a. Panels: Case 1 indep, Case 1 corr, Case 2 indep, Case 2 corr.

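33 Prior variance and shrinkage
Simple example:
$$Y \mid \beta \sim N(X\beta, \sigma^2 I_n), \qquad \beta \sim N(0, c^2 \sigma^2 I_p).$$
Then the posterior distribution is
$$\beta \mid y \sim N\left( \left( X^T X + \frac{1}{c^2} I_p \right)^{-1} X^T Y,\ \sigma^2 \left( X^T X + \frac{1}{c^2} I_p \right)^{-1} \right),$$
$$E(\beta \mid \hat\beta) = \left( X^T X + \frac{1}{c^2} I_p \right)^{-1} X^T Y = \left( I_p + \frac{1}{c^2} \left( X^T X \right)^{-1} \right)^{-1} \hat\beta.$$
As $c^2 \to 0$, the shrinkage is larger.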
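A small numerical check can make the shrinkage statement concrete. The sketch below uses invented data and hypothetical helper names (`posterior_mean`, `shrunk_ols`); it confirms that the two expressions for the posterior mean agree and that the posterior mean moves toward zero as $c^2$ decreases.

```python
import numpy as np

# Made-up regression data for illustration only.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX = X.T @ X
beta_ols = np.linalg.solve(XtX, X.T @ y)          # beta_hat (least squares)

def posterior_mean(c2):
    """(X^T X + I / c^2)^{-1} X^T Y"""
    return np.linalg.solve(XtX + np.eye(p) / c2, X.T @ y)

def shrunk_ols(c2):
    """(I + (X^T X)^{-1} / c^2)^{-1} beta_hat"""
    return np.linalg.solve(np.eye(p) + np.linalg.inv(XtX) / c2, beta_ols)

c2_grid = [1e-4, 1e-2, 1.0, 100.0]
norms = [float(np.linalg.norm(posterior_mean(c2))) for c2 in c2_grid]
print(norms, float(np.linalg.norm(beta_ols)))
```

The printed norms increase with $c^2$ and approach the norm of the least squares estimate, so a small prior variance (spike) pulls a coefficient toward zero while a large one (slab) leaves it nearly untouched.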
