Controlling for latent confounding by confirmatory factor analysis (CFA) Blinded Blinded


Background
Latent confounders are common in the social and behavioral sciences, where in most cases the selection mechanism is neither fully known nor perfectly measured. Measuring a latent confounder requires multiple indicators in order to enhance both the reliability and the validity of the measurement. However, conditioning on many covariates may cause estimation problems or inefficiency. In this paper, we investigate how confirmatory factor analysis (CFA) can be used to control for selection bias caused by a latent confounding variable.

Confirmatory factor analysis (CFA) Methods for Latent Confounding Control
Confirmatory factor analysis (CFA) incorporates subjective knowledge of the latent structure, the observed measurements, and the measurement errors to estimate latent variables (Bollen, 1989). CFA derives each latent variable as a linear combination of independent observed variables. The factor loadings indicate the relative importance of each observed indicator to the latent variable. Factor scores are composite scores that provide information about each unit's predicted placement on the latent factors. In this paper, we use Bartlett scores, which have the advantage of producing unbiased estimates of the true factor scores (Hershberger, 2005).

Data-Based Covariate Selection
We define covariate selection as the process by which a subset of the measured covariates X is identified with the aim of satisfying ignorability. Within the theoretical framework proposed by de Luna et al. (2011), we target the covariate subset that retains outcome-only predictors, which increases efficiency (Brookhart et al., 2006). To implement conditional independence testing, we used the max-min parents and children algorithm for Gaussian Bayesian networks (GBN) to select target covariates (Scutari, 2010).

Hybrid Approach and Kitchen Sink Approach
The hybrid approach combines covariate selection and factor analysis: it conditions on the covariate set chosen by covariate selection together with the estimated factor score. The rationale for this approach is that it avoids the information loss of factor analysis and simultaneously prevents bias caused by omitting important covariates missed by covariate selection, while still allowing for dimension reduction of the covariate space. The kitchen sink approach includes all measured pre-treatment covariates without any pre-processing; it serves as the benchmark for comparison in our study.

Research Questions
The three motivating questions for this simulation study are as follows.

1. Is it possible for confirmatory factor analysis (CFA) to successfully control for selection bias caused by latent confounders? Furthermore, how do sample size, factor structure, and factor loadings contribute?
2. How does CFA compare to other methods with respect to bias and efficiency?
3. What are sufficient conditions for successfully implementing CFA for latent confounding control?

Monte Carlo Simulation Design
The data-generating process was motivated by the model specification for CFA. Data were generated as follows:

\[
\begin{aligned}
F_1 &\sim N(0, 1) \\
X_1 &= a_1 F_1 + \varepsilon_1 \\
X_2 &= a_2 F_1 + \varepsilon_2 \\
&\;\;\vdots \\
X_{10} &= a_{10} F_1 + \varepsilon_{10} \\
\log\!\left(\frac{PS}{1 - PS}\right) &= b_0 + b_1 F_1 \\
Z &\sim \mathrm{Bernoulli}(PS) \\
Y_1 &= \beta_I + \beta_1 F_1 + \varepsilon_{Y_1} \\
Y_0 &= \beta_I + \beta_1 F_1 + \varepsilon_{Y_0}
\end{aligned}
\]

where $\varepsilon_i \sim N(0, \sqrt{1 - a_i^2})$ for $i = 1, \dots, 10$ and $\varepsilon_{Y_1}, \varepsilon_{Y_0} \sim N(0, 1)$. See Figure 1 for a graphical representation. The confounding coefficients are set as follows: $b_0 = 0$, $b_1 = 2$ and $\beta_0 = 0$, $\beta_1 = 2$.

The simulation design factors are:
Sample sizes: 100, 500, and 1000.
Strengths of factor loadings:
(1) Strong vs. weak: [a_1 = 0.9, a_2 = 0.1, ..., a_10 = 0.1], [a_1 = 0.9, a_2 = 0.9, ..., a_10 = 0.1], ..., [a_1 = 0.9, a_2 = 0.9, ..., a_10 = 0.9].
(2) Medium vs. weak: [a_1 = 0.5, a_2 = 0.1, ..., a_10 = 0.1], [a_1 = 0.5, a_2 = 0.5, ..., a_10 = 0.1], ..., [a_1 = 0.5, a_2 = 0.5, ..., a_10 = 0.5].
(3) Small vs. weak: [a_1 = 0.3, a_2 = 0.1, ..., a_10 = 0.1], [a_1 = 0.3, a_2 = 0.3, ..., a_10 = 0.1], ..., [a_1 = 0.3, a_2 = 0.3, ..., a_10 = 0.3].

The analysis factor consists of the four approaches to dealing with a latent confounder: covariate selection, confirmatory factor analysis, the hybrid method, and the kitchen sink approach. The primary simulation outcomes are the bias and MSE of the regression estimate of the treatment effect. We also quantify the number of covariates selected by the covariate selection method. We ran 100 replications for each cell of the study.

Simulation Results
Results of the simulation study are reported in Tables 1 to 9. The true value of the treatment effect was 2 units (cf. Figure 1). The overall sample size, which was varied from 100 to 500 to 1000, did not have an appreciable influence on bias reduction. There is a clear interaction between the strength of the strong factor loadings and the number of strong factor loadings. When the strong loadings are all set to 0.3 (Table 4), the bias ranges from about 2.3 (115% of the treatment effect) when none of the indicators is strong to about 1.5 (75%) when all of them are. It is worth noting that a standardized loading of 0.3 is a commonly used cutoff for retaining an indicator as important in exploratory factor analysis. When the strong loadings are set to 0.9 (Table 6), the bias ranges from about 2.3 (115%) to about 0.1 (5%). Thus, regardless of method, full (or nearly full) bias reduction is only possible when (a) all indicators are strongly related to the latent factor and (b) those relationships are very strong.

Dimension reduction results for covariate selection are summarized graphically in Figures 2 to 4. As expected, the covariate selection method tended to select fewer indicators when fewer had large loadings and more indicators when more were generated with large loadings.
Any differences among the four methods were relatively minor in comparison to the differences due to the magnitudes of the factor loadings. Finally, we note that the hybrid approach performed well across all conditions.

Conclusions
Our first conclusion is that factor scores from a confirmatory factor analysis may be used to reduce the dimension of a set of manifest indicators without a detrimental loss in capacity for bias reduction. Here we underscore the point that factor analysis reduced the dimension of the indicator space from ten down to one, whereas the other methods used either all ten indicators or some number selected by GBN, typically between five and ten. In practice, to reduce the dimension of indicators, researchers often take sum scores of inventories that are not meant to be summed. Our results show that it may be acceptable to use factor scores instead.
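To make the contrast between factor scores and sum scores concrete, the following is a minimal sketch of Bartlett scoring for a one-factor model. It assumes the loadings and unique variances have already been estimated by a fitted CFA; the function name `bartlett_scores` and its arguments are hypothetical and are not taken from the study's code. It uses the standard Bartlett weights, in which each indicator is weighted by its loading divided by its unique variance, so that weak indicators contribute little to the score.

```python
import numpy as np

def bartlett_scores(X, loadings, uniquenesses):
    """Bartlett factor scores for a single-factor model.

    X            : (n, p) matrix of observed indicators
    loadings     : length-p vector of factor loadings (assumed already estimated)
    uniquenesses : length-p vector of unique (error) variances
    """
    X = np.asarray(X, dtype=float)
    lam = np.asarray(loadings, dtype=float)
    psi = np.asarray(uniquenesses, dtype=float)
    Xc = X - X.mean(axis=0)        # center each indicator
    w = lam / psi                  # weight: loading over unique variance
    # One-factor case of (L' Psi^-1 L)^-1 L' Psi^-1 x
    return Xc @ w / (lam @ w)

def sum_scores(X):
    """Unweighted sum score, shown only for comparison."""
    return np.asarray(X, dtype=float).sum(axis=1)
```

Unlike the sum score, which weights all ten indicators equally, the Bartlett score down-weights indicators with small loadings, which is one way to read the recommendation above.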

The second conclusion is the importance of working with valid indicators that truly measure the latent construct they purport to measure. These results may serve as a warning for researchers considering the use of untested and unvalidated items as proxies for latent constructs in observational study settings.
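The data-generating process and the bias and MSE outcomes described under Monte Carlo Simulation Design can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it treats $\sqrt{1 - a_i^2}$ as the error standard deviation (so each indicator has unit variance), it assumes the 2-unit treatment effect enters the observed outcome as a constant additive term `tau`, and it implements only the kitchen sink estimator (ordinary least squares on the treatment and all ten indicators). All function names are hypothetical.

```python
import numpy as np

def simulate(n, loadings, tau=2.0, b0=0.0, b1=2.0, beta0=0.0, beta1=2.0, seed=None):
    """Draw one data set from the one-factor design."""
    rng = np.random.default_rng(seed)
    a = np.asarray(loadings, dtype=float)
    F = rng.standard_normal(n)                         # latent confounder F_1
    # Assumption: sqrt(1 - a_i^2) is the error SD of indicator i.
    noise = rng.standard_normal((n, a.size)) * np.sqrt(1.0 - a ** 2)
    X = F[:, None] * a + noise                         # X_i = a_i F_1 + eps_i
    ps = 1.0 / (1.0 + np.exp(-(b0 + b1 * F)))          # logit(PS) = b0 + b1 F_1
    Z = rng.binomial(1, ps)                            # treatment indicator
    # Assumption: constant additive treatment effect tau.
    Y = beta0 + tau * Z + beta1 * F + rng.standard_normal(n)
    return X, Z, Y

def ols_effect(Z, Y, covariates):
    """Coefficient on Z from an OLS regression of Y on Z and the covariates."""
    design = np.column_stack([np.ones(len(Y)), Z, covariates])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[1]

def bias_and_mse(estimates, truth):
    """Bias and mean squared error of replicate estimates around the truth."""
    est = np.asarray(estimates, dtype=float)
    return est.mean() - truth, np.mean((est - truth) ** 2)

def run_cell(n, loadings, reps=100, tau=2.0, seed=0):
    """Kitchen sink estimator over `reps` replications of one design cell."""
    estimates = []
    for r in range(reps):
        X, Z, Y = simulate(n, loadings, tau=tau, seed=seed + r)
        estimates.append(ols_effect(Z, Y, X))
    return bias_and_mse(estimates, tau)

if __name__ == "__main__":
    print("all loadings 0.1:", run_cell(500, [0.1] * 10))
    print("all loadings 0.9:", run_cell(500, [0.9] * 10))
```

Swapping the covariate matrix passed to `ols_effect` for a single factor score, a selected subset, or both would give rough analogues of the factor, selection, and hybrid approaches.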

References
Angrist, J. D., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton: Princeton University Press.
Beck, A. T., Steer, R. A., & Carbin, M. G. (1988). Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8(1).
Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., & Stürmer, T. (2006). Variable selection for propensity score models. American Journal of Epidemiology, 163(12).
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
DiStefano, C., Zhu, M., & Mindrila, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20).
De Luna, X., Waernbaum, I., & Richardson, T. S. (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, 98(4).
Hershberger, S. L. (2005). Factor scores. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of Statistics in Behavioral Science. New York: John Wiley.
Kaplan, D. (1999). An extension of the propensity score adjustment method for the analysis of group differences in MIMIC models. Multivariate Behavioral Research, 34(4).
Kupek, E. (2013). Detection of unknown confounders by Bayesian confirmatory factor analysis. Advanced Studies in Medical Sciences, 1(3).
Pearl, J., & Verma, T. (1991). A theory of inferred causation. KR, 91.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6.
Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75.
Rubin, D. B. (1990). Formal models of statistical inference for causal effects. Journal of Statistical Planning and Inference, 25.
Rubin, D. B., & Thomas, N. (1996). Matching using estimated propensity scores: Relating theory to practice. Biometrics.
Steiner, P. M., Cook, T. D., & Shadish, W. R. (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, 36(2).
Schafer, J., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13.
Scutari, M. (2010). bnlearn: Bayesian network structure learning. R package.
Schneeweiss, S., Rassen, J. A., Glynn, R. J., Avorn, J., Mogun, H., & Brookhart, M. A. (2009). High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology (Cambridge, Mass.), 20(4), 512.
Shortreed, S. M., & Ertefaie, A. (2017). Outcome-adaptive lasso: Variable selection for causal inference. Biometrics, 73(4).

Figure 1. Data-generating process for the simulation study; F1 is the latent confounder; the a's are the factor loadings for each construct.

Figure 2. Median number of covariates selected when sample size n = 100; Good Index is the number of high-loading items included. (Axis labels: Median number of covariates selected; Good Index.)

Figure 3. Median number of covariates selected by the covariate selection method when sample size n = 500; Good Index is the number of high-loading items included. (Axis labels: Median number of covariates selected; Good Index.)

Figure 4. Median number of covariates selected by the covariate selection method when sample size n = 1000; Good Index is the number of high-loading items included. (Axis labels: Median number of covariates selected; Good Index.)

Tables 1 to 9 share the columns Methods, Good Index, Bias, S.D., and MSE, with one row per method (All, Factor, Select, Hybrid) at each Good Index level.

Small Sample Size n=100

Table 1: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 100. Good Index is the number of measures with factor loading = 0.3 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Table 2: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 100. Good Index is the number of measures with factor loading = 0.5 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Table 3: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 100. Good Index is the number of measures with factor loading=0. included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Medium Sample Size n=500

Table 4: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 500. Good Index is the number of measures with factor loading = 0.3 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Table 5: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 500. Good Index is the number of measures with factor loading = 0.5 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Table 6: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 500. Good Index is the number of measures with factor loading = 0.9 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Large Sample Size n=1000

Table 7: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 1000. Good Index is the number of measures with factor loading = 0.3 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Table 8: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 1000. Good Index is the number of measures with factor loading = 0.5 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.

Table 9: Bias, standard deviation (S.D.), and MSE of the estimated treatment effect with sample size = 1000. Good Index is the number of measures with factor loading = 0.9 included; Good Index = 0 means factor loading = 0.1. Methods are the different approaches to dealing with latent confounding: All indicates the kitchen sink approach, Hybrid is the hybrid approach, Select is the covariate selection approach, and Factor is the factor analysis approach.
