Using Mplus individual residual plots for. diagnostics and model evaluation in SEM

Similar documents
Plausible Values for Latent Variables Using Mplus

SRMR in Mplus. Tihomir Asparouhov and Bengt Muthén. May 2, 2018

Auxiliary Variables in Mixture Modeling: Using the BCH Method in Mplus to Estimate a Distal Outcome Model and an Arbitrary Secondary Model

Bayesian Analysis of Latent Variable Models using Mplus

Latent variable interactions

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

An Introduction to Path Analysis

Strati cation in Multivariate Modeling

An Introduction to Mplus and Path Analysis

Scatter plot of data from the study. Linear Regression

Nesting and Equivalence Testing

Specifying Latent Curve and Other Growth Models Using Mplus. (Revised )

Multiple Group CFA Invariance Example (data from Brown Chapter 7) using MLR Mplus 7.4: Major Depression Criteria across Men and Women (n = 345 each)

Application of Plausible Values of Latent Variables to Analyzing BSI-18 Factors. Jichuan Wang, Ph.D

Centering Predictor and Mediator Variables in Multilevel and Time-Series Models

Scatter plot of data from the study. Linear Regression

Variable-Specific Entropy Contribution

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis

Misspecification in Nonrecursive SEMs 1. Nonrecursive Latent Variable Models under Misspecification

Using Bayesian Priors for More Flexible Latent Class Analysis

Ruth E. Mathiowetz. Chapel Hill 2010

Mixture Modeling. Identifying the Correct Number of Classes in a Growth Mixture Model. Davood Tofighi Craig Enders Arizona State University

Introduction to Confirmatory Factor Analysis

Multilevel Structural Equation Modeling

Supplementary materials for: Exploratory Structural Equation Modeling

Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models

Title. Description. Remarks and examples. stata.com. stata.com. Variable notation. methods and formulas for sem Methods and formulas for sem

Single and multiple linear regression analysis

Dynamic Structural Equation Models

Mplus Short Courses Day 2. Growth Modeling With Latent Variables Using Mplus

AMS 7 Correlation and Regression Lecture 8

Mplus Short Courses Topic 3. Growth Modeling With Latent Variables Using Mplus: Introductory And Intermediate Growth Models

Data Analysis and Statistical Methods Statistics 651

Statistical View of Least Squares

Equivalent Models in the Context of Latent Curve Analysis

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y

Multilevel Regression Mixture Analysis

Continuous Time Survival in Latent Variable Models

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts

Simple Linear Regression

Chapter 19. Latent Variable Analysis. Growth Mixture Modeling and Related Techniques for Longitudinal Data. Bengt Muthén

Simple Linear Regression

22S39: Class Notes / November 14, 2000 back to start 1

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

Measuring the fit of the model - SSR

Impact of serial correlation structures on random effect misspecification with the linear mixed model.

Multilevel regression mixture analysis

Inferences for Regression

INTRODUCTION TO STRUCTURAL EQUATION MODELS

The Model Building Process Part I: Checking Model Assumptions Best Practice

Bivariate data analysis

Advances in Mixture Modeling And More

Introduction to LMER. Andrew Zieffler

MATH11400 Statistics Homepage

Supplemental material for Autoregressive Latent Trajectory 1

Lecture 14 Simple Linear Regression

Structural Equation Modeling

Model Assumptions; Predicting Heterogeneity of Variance

Inference with Simple Regression

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

y response variable x 1, x 2,, x k -- a set of explanatory variables

Evaluation of structural equation models. Hans Baumgartner Penn State University

Model Checking and Improvement

Analysis of Bivariate Data

A discussion on multiple regression models

Student Guide: Chapter 1

Confirmatory Factor Analysis: Model comparison, respecification, and more. Psychology 588: Covariance structure and factor models

Remedial Measures, Brown-Forsythe test, F test

Measurement Invariance (MI) in CFA and Differential Item Functioning (DIF) in IRT/IFA

AN INVESTIGATION OF THE ALIGNMENT METHOD FOR DETECTING MEASUREMENT NON- INVARIANCE ACROSS MANY GROUPS WITH DICHOTOMOUS INDICATORS

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Bayesian SEM: A more flexible representation of substantive theory

Business Statistics. Lecture 10: Correlation and Linear Regression

NELS 88. Latent Response Variable Formulation Versus Probability Curve Formulation

How well do Fit Indices Distinguish Between the Two?

Topic 12 Overview of Estimation

Advanced Structural Equations Models I

4 Multiple Linear Regression

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Growth Mixture Model

Important note: Transcripts are not substitutes for textbook assignments. 1

Growth Mixture Modeling and Causal Inference. Booil Jo Stanford University

EVALUATION OF STRUCTURAL EQUATION MODELS

Dynamic Structural Equation Modeling of Intensive Longitudinal Data Using Multilevel Time Series Analysis in Mplus Version 8 Part 7

STA 216, GLM, Lecture 16. October 29, 2007

Latent class analysis for intensive longitudinal data, Hidden Markov processes, Regime switching models and Dynamic Structural Equations in Mplus

Announcements: You can turn in homework until 6pm, slot on wall across from 2202 Bren. Make sure you use the correct slot! (Stats 8, closest to wall)

Empirical Validation of the Critical Thinking Assessment Test: A Bayesian CFA Approach

Exam Applied Statistical Regression. Good Luck!

Unit 6 - Introduction to linear regression

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

STAT 7030: Categorical Data Analysis

CHAPTER 3. SPECIALIZED EXTENSIONS

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Richard N. Jones, Sc.D. HSPH Kresge G2 October 5, 2011

PIRLS 2016 Achievement Scaling Methodology 1

Sociology 6Z03 Review II

bmuthen posted on Tuesday, August 23, :06 am The following question appeared on SEMNET Aug 19, 2005.

Regression-Discontinuity Analysis

Transcription:

Using Mplus individual residual plots for diagnostics and model evaluation in SEM Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 20 October 31, 2017 1 Introduction A variety of plots are available in Mplus 7.2 that can be used to visually evaluate the fit of the model. We can inspect the fit on a more detailed level where individual observations are plotted. Such level of detail usually can not be obtained when standard SEM fit statistics are used such as chi-square statistics and CFI/TLI indices. These standard SEM fit statistics are usually based on a global model fit, i.e., a fit for the entire population as a whole rather than fit for individual observations. In this paper we provide several examples where the plots can be used to quickly discover patterns in the data that are not accounted for in the model. 1

Consider for example a SEM model that contains the following relationship between the dependent variables Y, latent variables η and covariates X Y = ν + λη + βx + ε η = α + γx + ζ where ν, α, λ, γ, and β are model parameters where ν and α are vectors and λ, γ and β are matrices. Mplus can be used to compute the factor scores ˆη = E(η Y, X) and the model estimated/predicted values for Y, which we denote by Ŷ. Mplus computes two versions for the predicted values of Y. The first one uses both the factor score and the covariates, i.e., Ŷ = ˆα + ˆλˆη + ˆβX. The second method for evaluating the predicted Y values is Ŷ = E(Y X). The second version of predicted value is commonly used in statistical practice while the first version is more useful in a SEM setting because it is more directly related to the actual estimated model and uses also the factor scores to predict the Y values. The first method however has a slight caveat. Because the factor score ˆη already incorporates the information contained in the dependent variable Y, then Ŷ will also contain information from Y. Therefore we will be using Y to predict Y which can in itself be problematic 2

as it lacks the rigor of pure statistical conditional expectation. However, when the latent variables η are measured well with a sufficient number of accurate measurements, the dependence between ˆη and a single Y variable, i.e., a particular indicator, will be sufficiently small so that it can be ignored. On a more technical level the purity of the Ŷ definition guarantees that Cov(Ŷ, ε) = 0 while Cov(Ŷ, ε) 0. In most cases we assume that the systematic/predicted part of the variable is independent of its residual. Note however that if the measurement instrument is good Cov(Ŷ, ε) 0. Thus any violation of the above equation could be interpreted as a model misspecification. Estimates for the individual level residuals ε can also be formed using either of the estimated Y values Y res = Y Ŷ Y res = Y Ŷ. Thus a common check that can be done using graphing utilities is to inspect that Y res and Ŷ are independent. 3

In this paper we provide some simple examples where inspecting visually the scatter plots between the dependent variables Y, the estimated values Ŷ, the residual values Y res, the estimated factor scores ˆη, and the covariates X can lead to useful discoveries and model modifications that could be overlooked otherwise. For our examples we use simulated data so that the illustration plots are somewhat clearer than those that might be found in real practical examples. The methodology, however, applies to real examples as well. The plots do not infer statistical significance. This should come as usual from formal statistical testing. The plots however can be used to more meaningfully adjust the model. Model modification indices can sometimes produce large values for model parameters without clear justification. Usually model modifications that originate from a scatter plot are quite easy to interpret and explain. The power to detect model misspecifications with plots naturally will depend on the person analyzing the scatter plot and thus a formal statement regarding plot power is not possible. Presumably, however, the power will be lower than with other purely numerical statistical tools. Usually the sample size does little to improve the effect of a particular scatter plot. For example doubling the number of plotted points, i.e., replicating each plotted point, will not increase or make any difference in the plot, while any statistical test will increase its power. Our ability to detect visually a missing effect depends on the size of the effect rather than its statistical significance. The examples provided in this paper are by no means exhaustive. These are just a few simple illustrations. Other scatter plots can be examined to 4

cross-validate the estimated model. 2 Non-linear trend In this example we generate data according to a factor analysis model with one factor with 10 indicator variables and one covariate. All loadings λ are set to 1, all intercept parameters are set to 0, all residual variances are set to 1, all direct effect coefficients β are set to 0. We generate the data with a non-linear effect on the factor η = γx + F (X) + ζ, where X is a standard normal variable, γ = 1 and F (X) uses of the these 3 forms: 0, X 2, exp(x). As we vary the function F we get 3 different samples. We generate the data sets with 500 observations. Next we estimate a standard factor analysis model where the non-linear trend is not accounted for, i.e., we estimate the same model as the data generating model but without the predictor F (X). We plot the estimated factor score ˆη against the predictor X. The plots for the 3 samples are given in Figures 1-3. The results are very clear. In the first sample the model is adequate, i.e., the linear trend between η and X is correct and the plot reflects the linear trend that is in the model. In the second sample Figure 2 shows a quadratic relationship between η and X. Thus the model is insufficient and there is a need to add X 2 as a predictor of η. In the third sample Figure 3 shows an exponential relationship between η and X. Thus the model is insufficient and there is a need 5

Figure 1: Sample 1 - linear trend, F (X) = 0 to add exp(x) as a predictor of η. None of the standard fit statistics, such as the chi-square test, CFI, and TLI detected the misfit/inadequanecy of the model in the second and the third samples. In addition the plots can actually provide guidance regarding the type of non-linearity that should be explored. The quadratic v.s. the exponential trend are clearly distinguishable. 3 Using residual values to detect direct effects In this section we generate a sample as in the previous section with 10 indicator variables, one factor and 5 standard normal covariates X 1,..., X 5. All 6

Figure 2: Sample 2 - quadratic trend, F (X) = X 2 Figure 3: Sample 3 - exponential trend, F (X) = Exp(X) 7

γ i coefficients are set to 1. We also add one direct effect to the model generation from X 1 to Y 1, i.e., β 11 = 1. We estimate the MIMIC model without the direct effect, i.e., assuming that β 11 = 0. We consider 4 different scatter plots: Y 1 v.s. X 1 given in Figure 4, Y 2 v.s. X 1 given in Figure 5, Y 1,res v.s. X 1 given in Figure 6, Y 2,res v.s. X 1 given in Figure 7. Both Figure 4 and 5 show a linear trend between Y i and X 1. These trends do not indicate any misspecifcation because they are implied by the estimated model. However, according to the estimated model which didn t account for the direct effect from X 1 to Y 1 the residual variables should be independent of X 1. Figure 6 shows a linear dependence between Y 1,res and X 1. This implies an unaccounted direct effect from X 1 to Y. Figure 7 shows that Y 2,res is independent of X 2 as expected and implies no misspecification. Therefore it is important to examine scatter plots for the residual variables and not just the observed variables. This type of misspecification is easily detectable however with other methods such as modification indices as well as by examining the model estimated variance covariance matrix for Y i and X i and the observed variance covariance matrix for these variables. This comparison is obtained in Mplus with the RESIDUAL option of the OUTPUT command. 8

Figure 4: Y 1 v.s. X 1 Figure 5: Y 2 v.s. X 1 9

Figure 6: Y 1,res v.s. X 1 Figure 7: Y 2,res v.s. X 1 10

4 Using residual scatter plots to detect residual covariances In this section we generate a sample as in the previous section with 10 indicator variables, one factor and without any covariates. We introduce to the data generation model a residual correlation of 0.5 between Y 1 and Y 2. The data however is analyzed without the residual correlation. We consider the scatter plot between Y 1,res and Y 2,res to see if the misspecification will be detected. This scatter plot is presented in Figure 8, while in Figures 9 we present the scatter plots between Y 1,res and Y 3,res for comparative purposes. Figure 8 shows that Y 1,res and Y 2,res are positively correlated and there is a linear trend between the two with a positive slope. This clearly suggest that the residual correlation between Y 1 and Y 2 should be added to the model. On the other hand Figure 9 shows that Y 1,res and Y 3,res appear to be independent which is correctly reflected in the model. In this example the misspecification was detected by using scatter plots between the residual variables. A scatter plot between the observed variables would not be able to detect this misspecification because both the model and the scatter plot indicate a positive correlation between any pair of indicator variables. This type of misspecification can also be detected using modification indices or by comparing the estimated and observed variance covariance matrices given in the residual output. 11

Figure 8: Y 1,res v.s. Y 2,res Figure 9: Y 1,res v.s. Y 3,res 12

5 Using scatter plots to detect heterogeneity in the population For this example we generate the data set as in the previous section using a factor analysis model with 10 indicators and one factor. The parameters are as in the previous sections except for the residual variances of the indicator variables which we set now at 0.1. In addition we generate the data from a two-class model with two equal classes where the factor loadings for Y 2 in the second class is 0. All other loadings are 1. We estimate a model where the latent class variable is ignored, i.e., we estimate a simple factor analysis model with the 10 indicators and a single factor. We consider the scatter plot of Y i v.s. Ŷ i. Figure 10 contains the scatter plot Y 1 v.s. Ŷ 1 and Figure 11 contains the scatter plot Y 2 v.s. Ŷ 2. In Figure 10 we see a good match between the observed and the estimated values. In Figure 11 however that is not the case. We can clearly see the two patterns in the data, indicating an underlying heterogeneity in the population and the potential for Mixture modeling to improve the model and provide a better fit for the data. In this example as well, the usual fit statistics such as the chi-square test, CFI and TLI are misleading because they indicate a nearly perfect fit and overlook the drastic discrepancy illustrated in Figure 11. 13

Figure 10: Y 1 v.s. Ŷ 1 Figure 11: Y 2 v.s. Ŷ 2 14

6 Discussion Residual analysis in regression is a well established statistical tool, however, in structural equation modeling this is not so. Because the latent variables in the model are not observed, to construct the residuals Y res we need to use the factor score estimates for the latent variables. In this note we show that despite this complication, the residuals can successfully be used to evaluate model fit and discover needed model modifications. Bollen and Arminger (1991) explore other applications of SEM residual analysis such as outlier detection and individual observation test of fit using standardized individual residuals. Wang, Brown and Bandeen-Roche (2005) explore residual analysis in a Mixture context for model diagnostics including determining the number of latent classes. Raykov (2005) uses residual analysis for evaluating local goodness of fit in SEM. Residual analysis can also benefit from the use of plausible values for the latent variables, see Asparouhov and Muthén (2010), based on Bayesian estimation. Plausible values would allow us to use the residuals for further modeling that goes beyond standard SEM techniques and expands the utility of this methodology. References [1] Asparouhov, T. & Muthén, B. (2010). Plausible values for latent variables using Mplus. Technical Report. http://statmodel.com/download/plausible.pdf [2] Bollen, K.A. & Arminger, G. (1991). Observational residuals in factor 15

analysis and structural equation models, Sociological methodology, 235-262 [3] Raykov, T. (2005). Residuals in Structural Equation, Factor Analysis, and Path Analysis Models. Encyclopedia of Statistics in Behavioral Science. [4] Wang, C.P., Brown, H. & Bandeen-Roche, K. (2005). Residual Diagnostics for Growth Mixture Models, Journal of the American Statistical Association, 100, 1054-1076. 16