Improving Model Selection in Structural Equation Models with New Bayes Factor Approximations


Kenneth A. Bollen, Surajit Ray, Jane Zavisca, Jeffrey J. Harden

April 11, 2011

Abstract

Often researchers must choose among two or more structural equation models for a given set of data. Typically, selection is based on having the highest chi-square p-value or the highest value of a fit index. Though this is a common situation, there is little evidence on the performance of these fit indices in choosing between models. In other statistical applications, Bayes Factor approximations such as the BIC are commonly used to select between models, but these are rarely used in SEMs. This paper examines several new and old Bayes Factor approximations along with some commonly used fit indices to assess their accuracy in choosing the true model among a broad set of false models. The results show that the Bayes Factor approximations outperform the other fit indices. Among these approximations, one of the new ones, the SPBIC, is particularly promising. The commonly used chi-square p-value and the conventional fit indices do much worse.

Keywords: Bayes Factor; Structural Equation Models; Empirical Bayes; Information Criterion; Model Selection

Kenneth A. Bollen: Department of Sociology, University of North Carolina, Chapel Hill, NC 27599, bollen@unc.edu. Surajit Ray: Department of Mathematics and Statistics, Boston University, Boston, MA 02446, sray@bu.edu. Jane Zavisca: Department of Sociology, University of Arizona, Tucson, AZ 85721, janez@u.arizona.edu. Jeffrey J. Harden: Department of Political Science, University of North Carolina, Chapel Hill, NC 27599, jjharden@unc.edu.

1 Introduction

Structural equation models (SEMs) refer to a general statistical model that incorporates ANOVA, multiple regression, simultaneous equations, path analysis, factor analysis, growth curve analysis, and many other common statistical models as special cases. SEMs are well-established in sociology and the social and behavioral sciences and are beginning to spread to the health and natural sciences (e.g., Bentler and Stein 1992; Shipley 2000; Schnoll, Fang, and Manne 2004; Beran and Violato 2010). A key issue in the use of SEMs is determining the fit of the model to the data and comparing the relative fit of two or more models to the same data. An asymptotic chi-square distributed test statistic that tests whether a SEM exactly reproduces the covariance matrix of the observed variables was the first test and index of model fit (Jöreskog 1969, 1973). However, given the approximate nature of all models and the enhanced statistical power that generally accompanies a large sample size, the chi-square test statistic routinely rejects models in large samples regardless of their merit as approximations to the underlying process. Starting with Tucker and Lewis (1973), Bentler and Bonett (1980), and Jöreskog and Sörbom (1979), a variety of fit indices have been developed to supplement the chi-square test statistic (e.g., see Bollen and Long 1993). Controversy surrounds these alternative fit indices in that often the sampling distribution of the fit index is not known, sometimes the fit index tends to increase as sample size increases, and there is disagreement about the optimal cutoff values for a good-fitting model.

Bayes Factors play a role in comparing the fit of models in multiple regression, generalized linear models, and several other areas of statistical modeling. Yet despite this common use, Bayes Factors have received little attention in the SEM literature. Cudeck and Browne (1983) and Bollen (1989) give only brief mention to Schwarz's (1978) Bayesian Information Criterion (BIC) as an approximation to the Bayes Factor. Raftery (1993, 1995) provides more discussion of the BIC in SEMs, as do Haughton, Oud, and Jansen (1997). However, it is rare to see the BIC or Bayes Factors in applications of SEMs, despite their potential value in comparing the fit of competing models.

The primary purpose of this paper is to compare several widely used measures of SEM model fit mentioned above with several different approximations to Bayes Factors, including two new approximations: the Information Matrix-Based Bayesian Information Criterion (IBIC) and the Scaled Unit Information Prior Bayesian Information Criterion (SPBIC). In the course of presenting the IBIC and SPBIC, we explain the potential use of Bayes Factors in model selection of SEMs. As evidence of this usefulness, we demonstrate in a simulation study that Bayes Factor approximations perform much better in SEM model selection than more commonly used methods, and that the IBIC and SPBIC are among the best-performing fit statistics under several conditions.

The next section introduces the SEM equations and assumptions and reviews the current methods of assessing SEM model fit. Following this is a section in which we present the Bayes Factor, the BIC, and the IBIC and SPBIC. We then describe our Monte Carlo simulation study that examines the behavior of various model selection criteria for a series of models that are either correct, underspecified, or overspecified. Finally, we conclude with a discussion of the implications of the results and recommendations for researchers.

2 Structural Equation Models and Fit Indices

SEMs have two primary components: a latent variable model and a measurement model. We write the latent variable model as:

\[
\eta = \alpha_\eta + B\eta + \Gamma\xi + \zeta \tag{1}
\]

The η vector is m × 1 and contains the m latent endogenous variables. The intercept terms are in the m × 1 vector α_η. The m × m coefficient matrix B gives the effect of the ηs on each other. The N latent exogenous variables are in the N × 1 vector ξ. The m × N coefficient matrix Γ contains the coefficients for ξ's impact on the ηs. An m × 1 vector ζ contains the disturbances for each latent endogenous variable. We assume that E(ζ) = 0 and COV(ζ, ξ) = 0, and we assume that the disturbance for each equation is homoscedastic and nonautocorrelated across cases, although the variances of the ζs from different equations can differ and these ζs can correlate across equations.

The m × m covariance matrix Σ_ζζ has the variances of the ζs down the main diagonal and the across-equation covariances of the ζs on the off-diagonal. The N × N covariance matrix of ξ is Σ_ξξ. The measurement model is:

\[
y = \alpha_y + \Lambda_y \eta + \varepsilon \tag{2}
\]
\[
x = \alpha_x + \Lambda_x \xi + \delta \tag{3}
\]

The p × 1 vector y contains the indicators of the ηs. The p × m coefficient matrix Λ_y (the "factor loadings") gives the impact of the ηs on the ys. The unique factors or errors are in the p × 1 vector ε. We assume that E(ε) = 0 and COV(η, ε) = 0. The covariance matrix for ε is Σ_εε. There are analogous definitions and assumptions for the measurement Equation 3 for the q × 1 vector x. We assume that ζ, ε, and δ are uncorrelated with ξ, and in most models these disturbances and errors are assumed to be uncorrelated among themselves, though the latter assumption is not essential.¹ We also assume that the errors are homoscedastic and nonautocorrelated across cases. Define z = [y′ x′]′; then the covariance matrix and mean of z as a function of the model parameters are

\[
\Sigma(\theta) = E[(z - \mu_z)(z - \mu_z)'] \tag{4}
\]
\[
= \begin{bmatrix}
\Lambda_y (I - B)^{-1} (\Gamma \Sigma_{\xi\xi} \Gamma' + \Sigma_{\zeta\zeta}) (I - B)'^{-1} \Lambda_y' + \Sigma_{\varepsilon\varepsilon} &
\Lambda_y (I - B)^{-1} \Gamma \Sigma_{\xi\xi} \Lambda_x' \\
\Lambda_x \Sigma_{\xi\xi} \Gamma' (I - B)'^{-1} \Lambda_y' &
\Lambda_x \Sigma_{\xi\xi} \Lambda_x' + \Sigma_{\delta\delta}
\end{bmatrix}
\]
\[
\mu(\theta) = E[z] \tag{5}
\]
\[
= \begin{bmatrix}
\alpha_y + \Lambda_y (I - B)^{-1} (\alpha_\eta + \Gamma \kappa) \\
\alpha_x + \Lambda_x \kappa
\end{bmatrix}
\]

¹ We can modify the model to permit such correlations among the errors, but do not do so here.

where θ is a vector that contains all of the parameters (i.e., coefficients, variances, and covariances) that are estimated in a model and κ is the vector of means of ξ. Σ(θ) and µ(θ) are the implied covariance matrix and the implied mean vector of the observed variables, z. These implied matrices are written as functions of the parameters of the SEM. For a valid model, the covariance matrix (Σ) and the mean vector (µ) of the observed variables, z, are exactly reproduced by the corresponding implied matrices, so that

\[
\Sigma = \Sigma(\theta) \tag{6}
\]
and
\[
\mu = \mu(\theta). \tag{7}
\]

The log likelihood of the maximum likelihood estimator, derived under the assumption that z comes from a multinormal distribution, is

\[
\ln L(\theta) = K - \tfrac{1}{2} \ln|\Sigma(\theta)| - \tfrac{1}{2} [z - \mu(\theta)]' \Sigma^{-1}(\theta) [z - \mu(\theta)], \tag{8}
\]

where K is a constant that has no effect on the values of θ that maximize the log likelihood. The chi-square test statistic derives from a likelihood ratio test of the hypothesized model against a saturated model where the covariance matrix and mean vector are exactly reproduced. The null hypothesis of the chi-square test is that Equations 6 and 7 hold exactly. In large samples, the likelihood ratio (LR) test statistic follows a chi-square distribution when the null hypothesis is true. Excess kurtosis can affect the test statistic, though there are various corrections for it (Satorra and Bentler 1988; Bollen and Stine 1992). In practice, the typical outcome of this test in large samples is rejection of the null hypothesis that represents the model. This often occurs even when corrections for excess kurtosis are made. At least part of the explanation is that the null hypothesis assumes exact fit, whereas in practice we anticipate at least some inaccuracies in our modeling. The significant LR test simply reflects what we know: our model is at best an approximation.
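To make the LR test concrete, the computation can be sketched in a few lines. This is an illustrative sketch with our own function names, not code from any SEM package; for brevity the tail probability uses the closed-form chi-square survival function, which is exact for even degrees of freedom only.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability P(X > x) for even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which avoids any dependence beyond the standard library."""
    assert df > 0 and df % 2 == 0, "closed form shown here requires even df"
    half = df // 2
    term = 1.0
    total = 1.0
    for k in range(1, half):
        term *= (x / 2.0) / k
        total += term
    return math.exp(-x / 2.0) * total

def lr_test(loglik_model, loglik_saturated, df):
    """LR test of a hypothesized SEM against the saturated model.

    Inputs are the maximized log likelihoods; df is the difference in the
    number of free parameters. The statistic is asymptotically chi-square
    under the null hypothesis of exact fit (Equations 6 and 7)."""
    stat = 2.0 * (loglik_saturated - loglik_model)
    return stat, chi2_sf_even_df(stat, df)
```

For example, `lr_test(-1010.0, -1000.0, 2)` yields the statistic 20.0 and its tail probability exp(-10), illustrating how a modest log-likelihood gap produces a decisive rejection.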

In response to the common rejection of the null hypothesis of perfect fit, researchers have proposed and used a variety of fit indices. The Incremental Fit Index (IFI; Bollen 1989), Comparative Fit Index (CFI), Tucker-Lewis Index (TLI; Tucker and Lewis 1973), Relative Noncentrality Index (RNI; Bentler and Bonett 1980), and Root Mean Squared Error of Approximation (RMSEA; Steiger and Lind 1980) are just a few of the many indices of fit that have been proposed. Though the popularity of each has waxed and waned, there are problems characterizing all or most of them. One is that there is ambiguity as to the proper cutoff value to signify that a model has an acceptable fit. Different authors recommend different standards of good fit, and in nearly all cases the justifications for cutoff values are not well founded. Second, the distributions of most of these fit indices are unknown. Thus, with the exception of the RMSEA, confidence intervals for the fit indices are not provided. Third, the fit indices that are based on comparisons to a baseline model (e.g., the IFI, CFI, TLI, and RNI) can behave somewhat differently depending on the fitting function used to estimate a model. This suggests that standards of fit should be adjusted depending on the fitting function employed. Another issue is that the means of the sampling distributions of some measures tend to be larger in bigger samples than in smaller ones even if the identical model is fitted. Thus, cutoff values for such indices might need to differ depending on the size of the sample.

Another set of issues arises when comparing competing models for the same data. If the competing models are nested, then a LR test that is asymptotically chi-square distributed is available. However, in such a case, the issue of statistical power in small samples remains. Furthermore, the LR test is not available for nonnested models. Finally, other factors such as excess kurtosis can impact the LR test. The situation is not made better by employing other fit indices. Though we can take the difference in the fit indices across two or more models, there is little guidance on what to consider a sufficiently large difference to conclude that one model's fit is better than another's. In addition, the standards might need to vary depending on the estimator employed and the size of the sample for some of the fit indices. Nonnested models further complicate these comparisons, since the rationale for direct comparisons of nonnested models is not well-developed for most of these fit indices.

This brief overview of the LR test and the other fit indices in use in SEMs reveals that there are drawbacks to the current methods of assessing the fit of models. It would be desirable to have a method to assess fit that works for nested and nonnested models and that has a firm rationale in statistical theory. From a practical point of view, it is desirable to have a measure that is calculable using standard SEM software. In the next section, we examine the Bayes Factor and methods to approximate it that address these criteria.

3 Approximating Bayes Factors in SEMs

The Bayes Factor expresses the odds of observing a given set of data under one model versus an alternative model. For any two models 1 and 2, the Bayes Factor, B₁₂, is the ratio of P(Y | M_k) for the two models, where P(Y | M_k) is the probability of the data Y given the model M_k. More explicitly,

\[
B_{12} = \frac{P(Y \mid M_1)}{P(Y \mid M_2)} \tag{9}
\]

Conceptually, the Bayes Factor is a relatively straightforward way to compare models, but it can be complex to compute. The marginal likelihoods must be calculated for the competing models. Following Bayes' Theorem, the marginal likelihood (also known as the integrated likelihood) can be expressed as a weighted average of the likelihood over all possible parameter values under a given model:

\[
P(Y \mid M_k) = \int P(Y \mid \theta_k, M_k) P(\theta_k \mid M_k) \, d\theta_k, \tag{10}
\]

where θ_k is the vector of parameters of M_k, P(Y | θ_k, M_k) is the likelihood under M_k, and P(θ_k | M_k) is the prior distribution of θ_k. Closed-form analytic expressions for the marginal likelihoods are difficult to obtain, and numeric integration is computationally intensive with high-dimensional models. Approximation of this integral via the Laplace method is a common approach to solving this problem. Technical

details of the Laplace method can be found in many sources, including Kass and Raftery (1995). In brief, a second-order Taylor expansion of log P(Y | M_k) around the Maximum Likelihood (ML) estimator θ̂_k leads to the following expression, where l(θ̂_k) is the log likelihood function (i.e., log P(Y | θ̂_k, M_k)), Ī_E(θ̂_k) is the expected information matrix, d_k is the dimension (number of parameters) of the model, N is the sample size, and O(N^{-1/2}) is the approximation error:

\[
\log P(Y \mid M_k) = l(\hat{\theta}_k) + \log P(\hat{\theta}_k \mid M_k) + \frac{d_k}{2} \log(2\pi) - \frac{d_k}{2} \log(N) - \frac{1}{2} \log |\bar{I}_E(\hat{\theta}_k)| + O(N^{-1/2}). \tag{11}
\]

This equation forms the basis for several approximations to the Bayes Factor. Here we briefly present the equations for several approximations, including Schwarz's BIC, Haughton's, and our two new approximations, the IBIC and the SPBIC. Derivations and justifications for these equations can be found in a companion technical paper (reproduced in the appendix). Here our goal is simply to present the final equations, show their relationships to each other, and explain how they can be calculated for SEM models using standard software.

The Standard Bayesian Information Criterion (BIC). Schwarz's (1978) BIC drops the second, third, and fifth terms of Equation 11, on the basis that in large samples the first and fourth terms will dominate in the order of error. Ignoring these terms gives

\[
\log P(Y \mid M_k) = l(\hat{\theta}_k) - \frac{d_k}{2} \log(N) + O(1). \tag{12}
\]

If we multiply this by 2, the BIC for, say, M₁ and M₂ is calculated as

\[
\mathrm{BIC} = 2[l(\hat{\theta}_2) - l(\hat{\theta}_1)] - (d_2 - d_1) \log(N). \tag{13}
\]

An advantage of the BIC is that it does not require specification of priors, and it is readily calculable from the standard output of most statistical packages. However, the approximation does not always work well. Problematic conditions include small samples, extremely collinear explanatory

variables, or when the number of parameters increases with the sample size. For other critiques see Winship (1999) and Weakliem (1999).

Haughton's BIC (HBIC). A variant on the BIC retains the third term of Equation 11 (Haughton 1988). We call this enhanced approximation the HBIC:

\[
\log P(Y \mid M_k) = l(\hat{\theta}_k) + \frac{d_k}{2} \log(2\pi) - \frac{d_k}{2} \log(N) + O(1) \tag{14}
\]
\[
= l(\hat{\theta}_k) - \frac{d_k}{2} \log\!\left(\frac{N}{2\pi}\right) + O(1).
\]

In comparing models M₁ and M₂ and multiplying by 2, this leads to

\[
\mathrm{HBIC} = 2[l(\hat{\theta}_2) - l(\hat{\theta}_1)] - (d_2 - d_1) \log\!\left(\frac{N}{2\pi}\right). \tag{15}
\]

The HBIC performed better in model selection in a simulation study for SEMs (Haughton, Oud, and Jansen 1997).

The Information Matrix-Based BIC (IBIC). Following Haughton's lead, we see no reason to throw away information that is easily calculable. The estimated expected information matrix Ī_E(θ̂_k) is output by many SEM packages (it can be computed as the inverse of the covariance matrix of the parameter estimates). Taking advantage of this, we preserve the fifth term in Equation 11:

\[
\log P(Y \mid M_k) = l(\hat{\theta}_k) - \frac{d_k}{2} \log\!\left(\frac{N}{2\pi}\right) - \frac{1}{2} \log |\bar{I}_E(\hat{\theta}_k)| + O(1).
\]

For models M₁ and M₂ and multiplying by 2, the Information Matrix-Based Bayesian Information Criterion (IBIC) is given by

\[
\mathrm{IBIC} = 2[l(\hat{\theta}_2) - l(\hat{\theta}_1)] - (d_2 - d_1) \log\!\left(\frac{N}{2\pi}\right) - \log |\bar{I}_E(\hat{\theta}_2)| + \log |\bar{I}_E(\hat{\theta}_1)|. \tag{16}
\]

² Note that this is very similar to Kashyap's (1982) approximation, which uses log(N) rather than log(N/2π).

The Scaled Unit Information Prior BIC (SPBIC). An alternative derivation of Schwarz's BIC²

makes use of the prior distribution of θ_k. It is possible to arrive at Equation 12 by utilizing a unit information prior, which assumes that the prior has approximately the same information as that contained in a single data point. Critics have argued that the unit information prior is too flat to reflect a realistic prior. This prior also penalizes complex models too heavily, making the BIC overly conservative toward the null hypothesis (Weakliem 1999; Kuha 2004; Berger et al. 2006). Some researchers have proposed centering priors around the MLE (θ̂) of M_k, which heavily favors more complex models. We propose to use a scaled information prior that is also centered around the MLE, but scaled so that the prior extends out to the likelihood and does not place all the mass around the complex model. The final analytical expression is as follows, where I_O(θ̂_k) is the observed information matrix evaluated at θ̂_k:

\[
\log P(Y \mid M_k) = l(\hat{\theta}_k) - \frac{d_k}{2} + \frac{d_k}{2}\left( \log d_k - \log\left( \hat{\theta}_k' I_O(\hat{\theta}_k) \hat{\theta}_k \right) \right) \tag{17}
\]
\[
\Rightarrow \quad \mathrm{SPBIC} = 2[l(\hat{\theta}_1) - l(\hat{\theta}_2)] - d_1\left(1 - \log\!\left[\frac{d_1}{\hat{\theta}_1' I_O(\hat{\theta}_1) \hat{\theta}_1}\right]\right) + d_2\left(1 - \log\!\left[\frac{d_2}{\hat{\theta}_2' I_O(\hat{\theta}_2) \hat{\theta}_2}\right]\right). \tag{18}
\]

For more details on the calculation of the scaling factor (which is model dependent and thus leads to an empirical Bayes prior) and the derivation of this analytical expression, see the appendix. The essential point for the practitioner is that the prior is implicit in the information matrix and does not need to be explicitly calculated. All elements of the SPBIC formula are output by standard SEM software.

Theoretically, the IBIC and the SPBIC should perform at least as well as the BIC or the HBIC, and better under certain circumstances. The IBIC incorporates an additional term that makes it closer to the original Laplace approximation. To the degree that this term contributes to distinguishing between models, it holds the potential to improve accuracy. This might occur in smaller samples where this term is not dominated by the other terms in the approximation.
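For concreteness, the per-model terms of the four approximations can be sketched as functions of quantities that standard SEM software outputs: the maximized log likelihood, the parameter count, the sample size, and the estimated information matrices. This is our own illustrative sketch, not the authors' code; the function and argument names (`expected_info` for Ī_E, `observed_info` for I_O) are ours.

```python
import math
import numpy as np

def bic_term(loglik, d, n):
    """Per-model BIC term (Equation 12): l - (d/2) log N."""
    return loglik - 0.5 * d * math.log(n)

def hbic_term(loglik, d, n):
    """Per-model HBIC term (Equation 14): l - (d/2) log(N / 2*pi)."""
    return loglik - 0.5 * d * math.log(n / (2.0 * math.pi))

def ibic_term(loglik, d, n, expected_info):
    """Per-model IBIC term: the HBIC term minus 0.5 * log|expected information|."""
    _, logdet = np.linalg.slogdet(expected_info)  # stable log-determinant
    return hbic_term(loglik, d, n) - 0.5 * logdet

def spbic_term(loglik, theta_hat, observed_info):
    """Per-model SPBIC term (Equation 17): l - d/2 + (d/2)(log d - log theta' I_O theta)."""
    theta = np.asarray(theta_hat, dtype=float)
    d = theta.size
    quad = float(theta @ observed_info @ theta)  # theta' I_O(theta) theta, must be > 0
    return loglik - 0.5 * d + 0.5 * d * (math.log(d) - math.log(quad))

def compare(term_m1, term_m2):
    """2 * [term(M1) - term(M2)]; positive values favor model M1."""
    return 2.0 * (term_m1 - term_m2)
```

With equal log likelihoods, `compare(bic_term(l, d1, n), bic_term(l, d2, n))` reduces to the pure complexity penalty (d2 - d1) log(n), which is why these criteria reward parsimony.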
The SPBIC employs a more flexible prior with more realistic assumptions than the unit information

prior on which the standard BIC is based. However, despite this theoretical justification for using Bayes Factor approximations with SEMs, there is little empirical evidence as to their effectiveness. The only study of Bayes Factor approximations in selecting SEMs is the Haughton, Oud, and Jansen (1997) research mentioned above. These authors simulate a three-factor model with six indicators and compare minor variants on the true structure to determine how successful the Bayes Factor approximations and several fit indices are in selecting the true model. Their simulation results find that the Bayes Factor approximations as a group are superior to the p-value of the chi-square test and the other fit indices in correctly selecting the true generating model. They find that the HBIC does the best among the Bayes Factor approximations. Among the other fit indices and the chi-square p-value, the RFI has among the best correct model selection percentages, though its success is lower than that of the Bayes Factor approximations.

The Haughton, Oud, and Jansen (1997) study is valuable in several respects, not least in its demonstration of the value of the Bayes Factor approximations. As they acknowledge, it is unclear whether their results extend beyond the relatively simple model that they examined. In addition, the structure of most of the models that they considered was quite similar to each other. The performance of the Bayes Factor approximations for models with very different structures is unknown. Also, their study examined only the single model that was best fitting according to their fit measures. The distance of the best fitting model from the second best is not discussed. Our simulation analysis seeks to test the generality of their results by looking at larger models with alternative models that range from mildly different to very different structures.
In addition, we provide the first evidence on the performance of the IBIC and SPBIC in the selection of SEMs. We also include several popular fit indices, including the p-value of the chi-square test statistic.

4 Simulation Study

To test the performance of these various approximations to the Bayes Factor, we fit a variety of models to two sets of simulated data. We use models designed by Paxton et al. (2001), which have been the basis for several other simulation studies. This enables comparisons to results for other fit indices that use the same design. Standard SEM assumptions hold in these models: exogenous variables and disturbances come from normal distributions and the disturbances follow the assumptions presented in Section 2 above. Graphical depictions of each model and the specific parameters used to generate the simulated data can be found in Paxton et al. (2001).

The first simulation (hereafter designated SIM1) was based on the following underlying population model. The structural model consists of three latent variables linked in a chain:

\[
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} =
\begin{bmatrix} 0 & 0 & 0 \\ \beta_{21} & 0 & 0 \\ 0 & \beta_{32} & 0 \end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix},
\qquad
\Sigma_{\zeta\zeta} = \mathrm{diag}(\psi_{11}, \psi_{22}, \psi_{33})
\]

The measurement model contains 9 observed variables, 3 of which crossload on more than one latent variable. Scaling is accomplished by fixing λ = 1 for a single indicator loading on each latent variable:

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 \\
\lambda_{21} & 0 & 0 \\
\lambda_{31} & 0 & 0 \\
\lambda_{41} & 1 & 0 \\
0 & \lambda_{52} & 0 \\
0 & \lambda_{62} & \lambda_{63} \\
0 & \lambda_{72} & 1 \\
0 & 0 & \lambda_{83} \\
0 & 0 & \lambda_{93}
\end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \\ \delta_5 \\ \delta_6 \\ \delta_7 \\ \delta_8 \\ \delta_9 \end{bmatrix}
\]
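As an illustration of this kind of data-generating process, the latent chain can be simulated directly. The sketch below is our own; the parameter values are placeholders rather than those of Paxton et al. (2001), and the loading pattern is simplified to three indicators per factor with no crossloadings.

```python
import numpy as np

def simulate_chain_sem(n, beta21=0.6, beta32=0.6, psi=(1.0, 1.0, 1.0),
                       loading=1.0, error_var=0.5, seed=0):
    """Simulate n cases from a three-latent-variable chain eta1 -> eta2 -> eta3,
    each factor measured by three indicators (a simplified stand-in for SIM1)."""
    rng = np.random.default_rng(seed)
    eta1 = rng.normal(0.0, np.sqrt(psi[0]), n)
    eta2 = beta21 * eta1 + rng.normal(0.0, np.sqrt(psi[1]), n)
    eta3 = beta32 * eta2 + rng.normal(0.0, np.sqrt(psi[2]), n)
    eta = np.column_stack([eta1, eta2, eta3])
    # 9 x 3 loading matrix: y1-y3 on eta1, y4-y6 on eta2, y7-y9 on eta3
    Lambda = np.kron(np.eye(3), np.full((3, 1), loading))
    errors = rng.normal(0.0, np.sqrt(error_var), (n, 9))
    return eta @ Lambda.T + errors
```

Indicators of the same factor correlate strongly, while indicators of adjacent factors correlate through the structural path β₂₁, mirroring the covariance structure implied by Equation 4.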

For each simulated data set, we fit a series of misspecified models, listed in Table 1. This range of models tests the ability of the different fit statistics to detect a variety of common errors. Parameters were misspecified so as to introduce error into both the measurement and latent variable components of the models. For example, factor loadings for indicators of the latent variables are incorrectly added or dropped in models 2–4, correlated errors between indicators are introduced in models 5 and 7, and latent variables are artificially added or dropped in the remaining models (with corresponding changes in the measurement model).

[Insert Table 1 here]

The second simulation (SIM2) is based on a generating model that is identical to SIM1 in the structural model, but contains an additional two observed indicators for each latent variable:

\[
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} =
\begin{bmatrix} 0 & 0 & 0 \\ \beta_{21} & 0 & 0 \\ 0 & \beta_{32} & 0 \end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix},
\qquad
\Sigma_{\zeta\zeta} = \mathrm{diag}(\psi_1, \psi_2, \psi_3)
\]

\[
\begin{bmatrix} y_1 \\ \vdots \\ y_{15} \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 \\
\lambda_{21} & 0 & 0 \\
\lambda_{31} & 0 & 0 \\
\lambda_{41} & 0 & 0 \\
\lambda_{51} & 0 & 0 \\
\lambda_{61} & 1 & 0 \\
0 & \lambda_{72} & 0 \\
0 & \lambda_{82} & 0 \\
0 & \lambda_{92} & 0 \\
0 & \lambda_{10,2} & \lambda_{10,3} \\
0 & \lambda_{11,2} & \lambda_{11,3} \\
0 & 0 & 1 \\
0 & 0 & \lambda_{13,3} \\
0 & 0 & \lambda_{14,3} \\
0 & 0 & \lambda_{15,3}
\end{bmatrix}
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \end{bmatrix} +
\begin{bmatrix} \delta_1 \\ \vdots \\ \delta_{15} \end{bmatrix}
\]

The misspecified models used in SIM2 are given in Table 2.

[Insert Table 2 here]

For each simulation we generated 1000 data sets with each of the following sample sizes: N = 100, N = 200, N = 500, N = 1000, and N = 2000. Note that, following standard practice, samples were discarded if they led to nonconverging or improper solutions, and new samples were drawn to replace the bad samples until a total of 1000 samples were drawn that had convergent solutions for each of the 12 fitted models.³

In these simulations we tested the relative performance of various information criteria as well as several traditional tests of overall model fit by calculating the percentage of samples (at

³ This was most problematic at small sample sizes: for SIM1 at N = 100, 715 samples had to be discarded for nonconvergence and 64 for improper solutions. The respective SIM1 numbers at the other sample sizes are N = 200: 236 and 7; N = 500: 96 and 0; N = 1000: and 0; N = 2000: 27 and 0. The numbers for SIM2 are N = 100: 85 and 56; N = 200: 9 and 4; N = 500: 5 and 0; N = 1000: 0 and 0; N = 2000: 0 and 0.

each sample size) for which the correct model (M1) was unambiguously selected, ambiguously contained within a set of well-fitting models, ambiguously rejected, or unambiguously rejected. The fit statistics included in the study are the BIC, IBIC, SPBIC, HBIC, AIC, chi-square p-value, IFI, CFI, TLI, RNI, and RMSEA. To assess the information criteria measures, we used the following categories:

True/Clear: The true model has the smallest value for the information criterion, and is at least 2 points smaller than the next best model.

True/Close: The true model has the smallest value for the information criterion, but it is within 2 points of the next best model.

False/Close: An incorrect model has the smallest value of the information criterion, but it is within 2 points of the true model.

False/Clear: An incorrect model has the smallest value of the information criterion, and is at least 2 points smaller than the true model.

We use the value of 2 as a measure of close fit based on the Jeffreys-Raftery recommendations for Bayes Factor approximations (Raftery 1995). In essence, this suggests that measures that differ by 2 or less do not draw sharp distinctions between two models. For the chi-square p-value, IFI, CFI, TLI, and RNI, the best fitting model is chosen as that with the highest value. For the RMSEA, the model with the smallest value is the best fitting one. For the IFI, CFI, TLI, RNI, and RMSEA, the fit of two models is considered close if it is within ±.01.⁴ This cutoff is based on the common practice of only considering models that fall between 0.9 and 1 for the IFI, CFI, TLI, and RNI, and between 0.10 and 0 for the RMSEA. A difference of .01 represents 10% of the typical range of acceptable model fit; fit indices that differ by less than this amount differ by less than this 10%. The chi-square p-value difference that we use to separate close fit is also .01. Instances in which the p-value differs by less than 0.01 seem too close to justify the conclusion that the models fit the data differently.

⁴ Regardless, results remain unchanged if another value is used for all of these cutoffs.
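The four selection categories above can be operationalized in a few lines. This is our own sketch of the scoring rule, assuming a smaller-is-better criterion and the 2-point closeness threshold:

```python
def classify_selection(ic_values, true_index, threshold=2.0):
    """Classify one simulation draw for a smaller-is-better information criterion.

    ic_values: criterion values, one per candidate model.
    true_index: position of the true model in ic_values.
    Returns 'True/Clear', 'True/Close', 'False/Close', or 'False/Clear'."""
    best = min(range(len(ic_values)), key=lambda i: ic_values[i])
    if best == true_index:
        # margin between the true model and its nearest competitor
        runner_up = min(v for i, v in enumerate(ic_values) if i != true_index)
        margin = runner_up - ic_values[true_index]
        return "True/Clear" if margin >= threshold else "True/Close"
    # an incorrect model won; measure how far it beat the true model
    margin = ic_values[true_index] - ic_values[best]
    return "False/Clear" if margin >= threshold else "False/Close"
```

For example, `classify_selection([10.0, 15.0, 14.0], true_index=0)` falls in True/Clear because the true model beats the runner-up by 4 points, while `classify_selection([10.0, 11.0, 14.0], true_index=0)` is only True/Close.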

4.1 Results

Figures 1 and 2 present the model selection results. The y-axes in these graphs plot the percentage of the 1000 simulations in which a given fit statistic clearly or ambiguously selected the true model (positive bars) and clearly or ambiguously selected an incorrect model (negative bars). Each graph plots results at a different sample size, increasing from N = 100 (top-middle graph) to N = 2000 (bottom-right panel). A perfect fit statistic would produce a medium gray positive bar that extends up to 100 on the y-axis of every graph, and no bar extending downward. This would indicate that the statistic always clearly selects the true model and never selects an incorrect model.

Two main points to note in these figures are that the Bayes Factor approximations are notably superior to the conventional methods with respect to selecting the correct model and that their performance improves as sample size increases. The graphs show that all of the measures perform poorly at N = 100 and improve as the sample size increases. However, this improvement in performance is larger for the various BIC measures compared to the chi-square-based indices. At N = 2000 in SIM1, three of the four BIC measures clearly select the true model over % of the time. In contrast, the best performing chi-square-based method only selects the correct model ambiguously, and in only about 65% of the simulations. In fact, even at N = 2000 (SIM1), the IFI, CFI, TLI, RNI, and RMSEA still ambiguously select an incorrect model more than % of the time, and the chi-square p-value clearly selects the wrong model in almost % of the simulations. Both SIM1 and SIM2 show similar patterns, though the performance of all of the measures increases slightly in SIM2, which is to be expected given that its model contains an additional two observed indicators for each latent variable. In short, the Bayes Factor approximations substantially outperform the more common fit measures in selecting the true model. These results hold for several different cutoff values for the IFI, CFI, TLI, RNI, and RMSEA (see note 4).

[Insert Figure 1 here] [Insert Figure 2 here]

There are also noticeable differences between the various information criteria we examine

in Figures 1 and 2. At small sample sizes, the BIC and IBIC have the highest rate of clearly selecting an incorrect model (both over % at N = 100 for SIM1), while the SPBIC and HBIC perform better, with the SPBIC showing the best performance (over % True/Clear at N = 200 in SIM1 and % in SIM2). All of the information criteria improve as N increases, with the two new measures described here, the IBIC and SPBIC, performing slightly better than the others in several instances. The BIC, IBIC, and SPBIC perform the best at the larger sample sizes in both SIM1 and SIM2, with the HBIC performing better than the conventional methods, but not quite at the level of the other three. For instance, in SIM2 at the largest sample sizes, the BIC, IBIC, and SPBIC all clearly select the true model more than % of the time (with the IBIC at 100%), while the HBIC falls into the True/Clear category about 85% of the time. Finally, AIC performance is between that of the various Bayesian information criteria and the conventional methods, most often falling in one of the two ambiguous categories (True/Close or False/Close).

4.2 Discussion

This simulation study suggests that there is much to gain from using Bayesian information criteria, including the two new measures shown here, in SEM model selection. The BIC, IBIC, SPBIC, and HBIC consistently outperform the conventional methods of conducting model selection. This pattern holds at several sample sizes and in SEM data with relatively less information (SIM1) and relatively more information (SIM2). While all of the measures examined here perform poorly in selecting the correct model in small samples, the Bayes Factor approximations drastically improve as N increases while the conventional methods improve only moderately or not at all. There is also variance among the various Bayes Factor approximations. In model selection, the new IBIC and SPBIC shown here, as well as the BIC, are the top performers, with the HBIC not performing as well. In several cases, either the IBIC or the SPBIC clearly selects the true model at the highest frequency of all the fit statistics.
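The overall simulation logic, which generates data from a true model, scores each candidate with an information criterion, and tabulates how often the true model wins, can be illustrated outside the SEM setting with a toy nested linear-model comparison. This is our own example, not the paper's design:

```python
import math
import numpy as np

def gaussian_loglik(sigma2, n):
    """Maximized Gaussian log likelihood given the ML error-variance estimate."""
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def fit_ols(X, y):
    """OLS fit; returns (maximized log likelihood, parameter count incl. error variance)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / y.size
    return gaussian_loglik(sigma2, y.size), X.shape[1] + 1

def bic_value(loglik, d, n):
    """-2*loglik + d*log(n); smaller values indicate a better model."""
    return -2.0 * loglik + d * math.log(n)

rng = np.random.default_rng(1)
trials, n, wins = 200, 200, 0
for _ in range(trials):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)              # irrelevant predictor
    y = 1.0 + 0.5 * x1 + rng.normal(size=n)
    l_true, d_true = fit_ols(np.column_stack([np.ones(n), x1]), y)
    l_over, d_over = fit_ols(np.column_stack([np.ones(n), x1, x2]), y)
    if bic_value(l_true, d_true, n) < bic_value(l_over, d_over, n):
        wins += 1
# the complexity penalty should make the BIC pick the true (smaller) model most of the time
print(f"BIC chose the true model in {wins} of {trials} trials")
```

The overspecified model always achieves a slightly higher log likelihood, but the log(n) penalty usually outweighs that chance improvement, which mirrors the behavior of the BIC-type criteria in the SEM simulations.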

18 5 Conclusions As in any statistical analysis, a key issue in the use of SEMs is determining the fit of the model to the data and comparing the relative fit of two or more models to the same data. However, current practice with SEMs commonly employs methods that are problematic for several different reasons. For example, the asymptotic chi-square distributed test statistic that tests whether a SEM exactly reproduces the covariance matrix of observed variables routinely rejects models in large samples regardless of their merit as approximations to the underlying process. Several other fit indices have been developed in response, but those measures also have important problems: their sampling distributions are not commonly known, they tend to increase as sample size increases, and there is disagreement about the optimal cutoff values for a good fitting model. In this paper, we have shown that Bayes Factor approximations can serve as a useful alternative to these conventional methods. Though others have noted this usefulness (e.g., Raftery 1993, 1995; Haughton, Oud, and Jansen 1997), it is rare to see the or Bayes Factors in applications of SEMs. We introduce two new information criteria the I and SP to the SEM literature and show through an extensive set of comparisons that, along with some established approximations, these fit statistics outperform several conventional measures in selecting the correct model. In sum, Bayes Factor approximations are a useful tool for the applied researcher in comparing SEM models. Their significant improvement in model selection compared to the conventional methods can substantially improve the current practice of SEM analysis. In addition, among the various options for approximating a Bayes Factor, we show that our new fit indices the I and SP perform well when compared to other information criteria. Both of these measures are easily calculated from standard SEM software, and thus are easy for the applied researcher to use. 
Thus, we recommend the IBIC and SPBIC to researchers seeking to compare SEMs in their work.
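As a concrete illustration of that last point, the pairwise criteria derived in the Appendix (equations (28), (32), and (38)) can be assembled from quantities most SEM packages report: log likelihoods, parameter counts, parameter estimates, and information matrices. The sketch below is ours, with made-up inputs; it assumes the Case 1 branch of the SPBIC applies to both models.

```python
import numpy as np

def bic(l2, l1, d2, d1, n):
    # Equation (28): positive values favor model 2
    return 2.0 * (l2 - l1) - (d2 - d1) * np.log(n)

def spbic_case1(l2, l1, theta2, theta1, io2, io1):
    # Equation (32), valid when d_k < theta_k' I_O theta_k for both models
    def penalty(theta, io):
        d = len(theta)
        a = float(theta @ io @ theta)
        return d * (1.0 - np.log(d / a))
    return 2.0 * (l2 - l1) - penalty(theta2, io2) + penalty(theta1, io1)

def ibic(l2, l1, d2, d1, ie_bar2, ie_bar1, n):
    # Equation (38): BIC-style penalty plus the retained log-determinant terms
    pen = (d2 - d1) * np.log(n / (2.0 * np.pi))
    logdet2 = np.linalg.slogdet(ie_bar2)[1]
    logdet1 = np.linalg.slogdet(ie_bar1)[1]
    return 2.0 * (l2 - l1) - pen - logdet2 + logdet1
```

With identity average information matrices the determinant terms vanish, and the IBIC then differs from the BIC only through the 2π in the penalty.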

A Appendix: Bayes Factor Technical Details

This paper is based on research done in an unpublished companion technical paper. Below we reproduce key details from that paper on the Bayes Factor, methods of approximation, and the specifics of the BIC, SPBIC, and IBIC.

A.1 Bayes Factor

In this section we briefly describe the Bayes Factor and the Laplace approximation. Readers familiar with this derivation may skip this section. For a strict definition of the Bayes Factor we have to appeal to the Bayesian model selection literature. Thorough discussions of Bayesian statistics in general, and model selection in particular, are in Gelman et al. (1995), Carlin and Louis (1996), and Kuha (2004). Here we provide a very simple description. We start with data Y and hypothesize that it was generated by one of two competing models, M_1 and M_2.⁵ Moreover, let us assume that we have prior beliefs about the data-generating models M_1 and M_2, so that we can specify the probability of the data coming from model M_1, P(M_1) = 1 − P(M_2). Now, using Bayes' theorem, the ratio of the posterior probabilities is given by

P(M_1 | Y) / P(M_2 | Y) = [P(Y | M_1) / P(Y | M_2)] × [P(M_1) / P(M_2)].

The left side of the equation can be interpreted as the posterior odds of M_1 vs. M_2, and it represents how the prior odds P(M_1)/P(M_2) are updated by the observed data Y. The factor P(Y | M_1)/P(Y | M_2) that multiplies the prior odds to obtain the posterior odds is known as the Bayes Factor, which we denote as

B_12 = P(Y | M_1) / P(Y | M_2).    (19)

Typically, we will deal with parametric models M_k which are described through model

⁵ Generalization to more than two competing models is achieved by comparing the Bayes Factor of each model to a base model and choosing the model with the highest Bayes Factor.

parameters, say θ_k. The marginal likelihoods P(Y | M_k) are then evaluated as

P(Y | M_k) = ∫ P(Y | θ_k, M_k) P(θ_k | M_k) dθ_k,    (20)

where P(Y | θ_k, M_k) is the likelihood under M_k and P(θ_k | M_k) is the prior distribution of θ_k. A closed-form analytical expression for the marginal likelihood is difficult to obtain, even with completely specified priors (unless they are conjugate priors). Moreover, in most cases θ_k will be high-dimensional, and direct numerical integration will be computationally intensive if not impossible. For this reason we turn to a way to approximate the integral in (20).

A.1.1 Laplace Method of Approximation

The Laplace method of approximating Bayes Factors is a common device (see Raftery 1993; Weakliem 1999; Kuha 2004). Since we also make use of the Laplace method for our proposed approximations, we briefly review it here. The Laplace method of approximating ∫ P(Y | θ_k, M_k) P(θ_k | M_k) dθ_k = P(Y | M_k) uses a second-order Taylor series expansion of log P(Y | M_k) around θ̃_k, the posterior mode. It can be shown that

log P(Y | M_k) = log P(Y | θ̃_k, M_k) + log P(θ̃_k | M_k) + (d_k/2) log(2π) − (1/2) log |H(θ̃_k)| + O(N⁻¹),    (21)

where d_k is the number of distinct parameters estimated in model M_k, π is the usual mathematical constant, and H(θ̃_k) is the negative Hessian matrix

H(θ̃_k) = −∂² log[P(Y | θ_k, M_k) P(θ_k | M_k)] / ∂θ_k ∂θ_k′,

evaluated at θ_k = θ̃_k. O(N⁻¹) is big-O notation indicating that the last term is bounded in probability from above by a constant times N⁻¹ (see, for example, Tierney and Kadane 1986; Kass and Vaidyanathan 1992; Kass and Raftery 1995).

Instead of the Hessian matrix we will work with different forms of information matrices. Specifically, the observed information matrix is denoted by

I_O(θ) = −∂² log[P(Y | θ_k, M_k)] / ∂θ_k ∂θ_k′,

whereas the expected information matrix I_E is given by

I_E(θ) = −E[∂² log[P(Y | θ_k, M_k)] / ∂θ_k ∂θ_k′].

Let us also define the average information matrices Ī_E = I_E/N and Ī_O = I_O/N. If the observations come from an i.i.d. distribution, we have the identity

Ī_E = −E[∂² log[P(Y_i | θ_k, M_k)] / ∂θ_k ∂θ_k′],

the expected information for a single observation. We will make use of this property later. In large samples the maximum likelihood (ML) estimator, θ̂_k, will generally be a reasonable approximation to the posterior mode, θ̃_k. If the prior is not very informative relative to the likelihood, we can write

log P(Y | M_k) = log P(Y | θ̂_k, M_k) + log P(θ̂_k | M_k) + (d_k/2) log(2π) − (1/2) log |I_O(θ̂_k)| + O(N⁻¹),    (22)

where log P(Y | θ̂_k, M_k) is the log likelihood for M_k evaluated at θ̂_k and I_O(θ̂_k) is the observed information matrix evaluated at θ̂_k (Kass and Raftery 1995). If the expected information matrix, I_E(θ̂_k), is used in place of I_O(θ̂_k), then the approximation error is O(N^(−1/2))
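The i.i.d. identity above — that the average expected information equals the expected information of a single observation — is easy to check numerically. In this toy sketch (ours, not from the paper), the observed information for the mean of a normal sample with known σ is obtained by a finite-difference second derivative, and its average recovers the single-observation expected information 1/σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
y = rng.normal(5.0, sigma, size=500)
n = len(y)

def loglik(mu):
    # log likelihood of N(mu, sigma^2), additive constant dropped
    return -0.5 * np.sum((y - mu) ** 2) / sigma**2

mu_hat = y.mean()
h = 1e-4
# observed information: minus the second derivative of loglik at the MLE
io = -(loglik(mu_hat + h) - 2.0 * loglik(mu_hat) + loglik(mu_hat - h)) / h**2
io_bar = io / n              # average observed information, I_O / N
ie_single = 1.0 / sigma**2   # expected information of one observation
# io_bar and ie_single are both close to 0.25 here
```

For this normal-mean model the observed and expected information coincide at the MLE; in general they differ, which is why the text distinguishes the O(N⁻¹) and O(N^(−1/2)) error rates.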

and we can write

log P(Y | M_k) = log P(Y | θ̂_k, M_k) + log P(θ̂_k | M_k) + (d_k/2) log(2π) − (1/2) log |I_E(θ̂_k)| + O(N^(−1/2)).    (23)

If we now make use of the definition Ī_E = I_E/N and substitute it in (23), we have

log P(Y | M_k) = log P(Y | θ̂_k, M_k) + log P(θ̂_k | M_k) + (d_k/2) log(2π) − (d_k/2) log(N) − (1/2) log |Ī_E(θ̂_k)| + O(N^(−1/2)).    (24)

This equation forms the basis of several approximations to Bayes Factors, which we discuss below.

A.2 Approximation through Special Prior Distributions

The BIC has been justified by making use of a unit information prior distribution for θ_k. Our first new approximation, the SPBIC, builds on this rationale using a more flexible prior. We first derive the BIC, then present our proposed modification.

A.2.1 BIC: The Unit Information Prior Distribution

We assume that for model M_K (the full model, or the largest model under consideration) the prior of θ_K is given by a multivariate normal density

P(θ_K | M_K) ~ N(θ̄_K, V_K),    (25)

where θ̄_K and V_K are the prior mean and variance of θ_K. In the case of the BIC we take V_K = Ī_O⁻¹, the inverse of the average information defined in Section 3. This choice of variance can be interpreted as setting the prior to have approximately the same information as the information contained in a single data

point.⁶ This prior is thus referred to as the unit information prior and is given by

P(θ_K | M_K) ~ N(θ̄_K, [Ī_O(θ̂_K)]⁻¹).    (26)

For any model M_k nested in the full model M_K, the corresponding prior P(θ_k | M_k) is simply the marginal distribution obtained by integrating out the parameters that do not appear in M_k. Using this prior in (22), we can show that

log P(Y | M_k) ≈ log P(Y | θ̂_k, M_k) − (d_k/2) log(N) − (1/2N)(θ̂_k − θ̄)′ I_O(θ̂_k)(θ̂_k − θ̄)
≈ log P(Y | θ̂_k, M_k) − (d_k/2) log(N).    (27)

Note that the error of this approximation is still of order O(N⁻¹). Comparing models M_1 and M_2 and multiplying by 2 gives the usual

BIC = 2(l(θ̂_2) − l(θ̂_1)) − (d_2 − d_1) log(N),    (28)

where we use the more compact notation l(θ̂_k) ≡ log P(Y | θ̂_k, M_k) for the log likelihood function at θ̂_k. An advantage of this unit information prior rationale for the BIC is that the error is of order O(N⁻¹) instead of O(1) when we make no explicit assumption about the prior distribution. However, some critics argue that the unit information prior is too flat to reflect a realistic prior (e.g., Weakliem 1999).

⁶ Note that, unlike the expected information, even if the observations are i.i.d., the average observed information Ī_O(θ̂_k) may not equal the observed information of any single observation.

A.2.2 SPBIC: The Scaled Unit Information Prior Distribution

The SPBIC, our alternative approximation to the Bayes Factor, is similar in derivation to the BIC except that instead of the unit information prior we use a scaled unit information prior, which allows the variance to differ and permits more flexibility in the prior probability specification. The scale is chosen by maximizing the marginal likelihood over the class of normal priors. Berger (1994), Fougere (1990), and others have used related concepts of partially informative priors obtained by maximizing entropy. In general, these maximizations involve tedious numerical calculations. We ease the calculation by employing components of the approximation that are available from standard computer output. We first provide a Bayesian interpretation of our choice of the scaled unit information prior and then provide an analytical expression for calculating this empirical Bayes prior. We have already shown how the BIC can be interpreted as using a unit information prior for calculating the marginal probability in (22). This very prior leads to the common criticism that the BIC is too conservative towards the null, as the unit information prior penalizes complex models too heavily. The prior of the BIC is centered at θ̄ (the parameter value under the null model) and has a variance scaled at the level of unit information. As a result, this prior may put extremely low probability around complex alternative models (for discussion, see Weakliem 1999; Kuha 2004; Berger et al. 2006). One way to overcome this conservative nature of the BIC is to use the ML estimate, θ̂, of the complex model M_k as the center of the prior P(θ_k | M_k). But this prior specification puts all the mass around the complex model and thus favors the complex model heavily. A scaled unit information prior (henceforth denoted SUIP) is a less aggressive way to be more favorable to a complex model. The prior is still centered at θ̄, so that the highest density is at θ̄, but we choose its scale, c_k, so that the prior flattens out to put more mass on the alternative models. This places higher prior probability than the unit information prior on the space of complex models, but at the same time prevents complete concentration of probability on the complex model.
This effectively leads to choosing the prior in our class that is most favorable to the model under consideration, placing the SUIP in the class of empirical Bayes priors. To obtain the analytical expression for the appropriate scale, we start from the marginal probability given in (20). Instead of using the more common form of the Laplace approximation applied jointly to the likelihood and the prior, we use the approximate normality of the likelihood, but choose the scale of the variance of the multivariate normal prior so as to maximize the marginal

likelihood. As a result, our prior distribution is not fully specified by the information matrix; rather, its variance is a scaled version of the variance of the unit information prior defined in (26). This prior can adjust itself through a scaling factor, which may differ substantially from unity in several situations (e.g., highly collinear covariates, dependent observations). The main steps in constructing the SPBIC are as follows. First we define the scaled unit information prior as

P_k(θ_k) ~ N(θ̄_k, [c_k Ī_O(θ̂_k)]⁻¹),    (29)

where Ī_O = I_O/N and c_k is a scale factor that we will estimate from the data. Note that c_k may vary across models. Using the prior in (29) and the generalized Laplace integral approximation described in Appendix A, we have

log P(Y | M_k) ≈ l(θ̂_k) − (1/2) d_k log(1 + N/c_k) − (1/2)[c_k/(N + c_k)](θ̂_k − θ̄)′ I_O(θ̂_k)(θ̂_k − θ̄).    (30)

Now we estimate c_k. Note that we can view (30) as a likelihood function in c_k, and we can write

L(c_k | Y, M_k) = ∫ L(θ_k) π(θ_k | c_k) dθ_k = L(Y | M_k).

So the optimum c_k is given by its ML estimate, which can be obtained as follows (taking θ̄ = 0):

2l(c_k) = 2l(θ̂_k) − d_k log(1 + N/c_k) − [c_k/(N + c_k)] θ̂_k′ I_O(θ̂_k) θ̂_k.

Setting the derivative with respect to c_k to zero,

2l′(c_k) = d_k N / [c_k(N + c_k)] − θ̂_k′ I_O(θ̂_k) θ̂_k · N/(N + c_k)² = 0,

which gives

ĉ_k = d_k N / (θ̂_k′ I_O(θ̂_k) θ̂_k − d_k)  if d_k < θ̂_k′ I_O(θ̂_k) θ̂_k,
ĉ_k = ∞  if d_k ≥ θ̂_k′ I_O(θ̂_k) θ̂_k.

Using this ĉ_k we arrive at the SPBIC measures under the two cases.
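Before taking up the two cases, the closed form for the scale estimate can be verified numerically. In this sketch (ours; the value standing in for θ̂′I_O(θ̂)θ̂ is made up), a grid maximizer of the profile log marginal likelihood in c_k matches the closed-form ML estimate:

```python
import numpy as np

n, d = 200, 4
a = 30.0   # stand-in value for theta_hat' I_O(theta_hat) theta_hat, with d < a (Case 1)

def profile(c):
    # 2*log P(Y | M_k) as a function of the scale c, dropping the constant 2*l(theta_hat)
    return -d * np.log(1.0 + n / c) - (c / (n + c)) * a

grid = np.linspace(1.0, 100.0, 200_000)
c_grid = grid[np.argmax(profile(grid))]
c_closed = d * n / (a - d)   # closed-form ML estimate of the scale
```

The two agree to grid precision, confirming that the interior stationary point of the profile is its maximum.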

Case 1: When d_k < θ̂_k′ I_O(θ̂_k) θ̂_k, we have

log P(Y | M_k) = l(θ̂_k) − d_k/2 + (d_k/2)(log d_k − log(θ̂_k′ I_O(θ̂_k) θ̂_k)),    (31)

which gives

SPBIC = 2(l(θ̂_2) − l(θ̂_1)) − d_2(1 − log[d_2 / (θ̂_2′ I_O(θ̂_2) θ̂_2)]) + d_1(1 − log[d_1 / (θ̂_1′ I_O(θ̂_1) θ̂_1)]).    (32)

Case 2: Alternatively, when d_k ≥ θ̂_k′ I_O(θ̂_k) θ̂_k, we have ĉ_k = ∞, so the prior variance goes to 0 and the prior distribution is a point mass at the mean, θ̄. Thus we have

log P(Y | M_k) = l(θ̂_k) − (1/2) θ̂_k′ I_O(θ̂_k) θ̂_k,

which gives

SPBIC = 2(l(θ̂_2) − l(θ̂_1)) − θ̂_2′ I_O(θ̂_2) θ̂_2 + θ̂_1′ I_O(θ̂_1) θ̂_1.    (33)

A.3 Approximation through Eliminating O(1) Terms

An alternative derivation of the BIC does not require the specification of priors. Rather, it is justified by eliminating smaller-order terms in the expansion of the marginal likelihood with arbitrary priors (see (24)). After providing the mathematical steps of deriving the BIC from (24), we show how approximations of the marginal likelihood with fewer assumptions can be obtained using readily available software output.

A.3.1 BIC: Elimination of Smaller Order Terms

One justification of the Schwarz (1978) BIC is based on the observation that the first and fourth terms in (24) have orders of O(N) and O(log(N)), respectively, while the second, third, and fifth terms are O(1) or less. This suggests that the former terms will dominate

the latter terms when N is large. Ignoring the latter terms, we have

log P(Y | M_k) = log P(Y | θ̂_k, M_k) − (d_k/2) log(N) + O(1).    (34)

Multiplying by 2, the BIC for, say, M_1 and M_2 is calculated as

BIC = 2[log P(Y | θ̂_2, M_2) − log P(Y | θ̂_1, M_1)] − (d_2 − d_1) log(N).    (35)

Equivalently, using the log-likelihood notation l(θ̂_k) in place of the log P(Y | θ̂_k, M_k) terms in (35), we have

BIC = 2[l(θ̂_2) − l(θ̂_1)] − (d_2 − d_1) log(N).    (36)

The relative error of the BIC in approximating a Bayes Factor is O(1). This approximation will not always work well: when N is small, but also when the sample size does not accurately summarize the amount of available information. The latter occurs, for example, when the explanatory variables are extremely collinear or have little variance, or when the number of parameters increases with sample size (Winship 1999; Weakliem 1999).

A.3.2 IBIC and Related Variants: Retaining Smaller Order Terms

Some of the terms in the approximation of the marginal likelihood that are dropped by the BIC can be easily calculated and retained. One variant, called the HBIC, retains the third term in equation (23) (Haughton 1988; Haughton, Oud, and Jansen 1997). A simulation study by Haughton, Oud, and Jansen (1997) found that this approximation performs better in model selection for structural equation models than does the usual BIC. The information matrix term can also be retained by calculating the estimated expected information matrix, I_E(θ̂_k), which is often available in statistical software for a variety of statistical models. Taking advantage of this and building on Haughton, Oud, and Jansen (1997), we propose a new

approximation for the marginal likelihood, omitting only the second term in (24):

log P(Y | M_k) = log P(Y | θ̂_k, M_k) − (d_k/2) log(N/2π) − (1/2) log |Ī_E(θ̂_k)| + O(1).

For models M_1 and M_2, multiplying by 2, this leads to a new approximation of a Bayes Factor, the Information matrix-based Bayesian Information Criterion (IBIC).⁷ This is given by

IBIC = 2[l(θ̂_2) − l(θ̂_1)] − (d_2 − d_1) log(N/2π) − log |Ī_E(θ̂_2)| + log |Ī_E(θ̂_1)|.    (38)

Of the three approximations that we have discussed, the IBIC includes the most terms from equation (23), leaving out just the prior distribution term log P(θ̂_k | M_k).

⁷ Another approximation, due to Kashyap (1982), given by

2[l(θ̂_2) − l(θ̂_1)] − (d_2 − d_1) log(N) − log |Ī_E(θ̂_2)| + log |Ī_E(θ̂_1)|,    (37)

is very similar to our proposed IBIC. The IBIC incorporates the estimated expected information matrix at the parameter estimates for the two models being compared.
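The practical difference between (38) and the plain BIC penalty comes from the retained log-determinant terms. The sketch below is our own construction: a linear regression design with σ² = 1, so the average expected information for the coefficients is X′X/N. It shows how collinearity drives the determinant term far from zero — exactly the situation, noted above, in which N alone misstates the information in the data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)

def logdet_ie_bar(x2):
    # log |I_E_bar| for the coefficients in y = Xb + e with sigma^2 = 1
    X = np.column_stack([np.ones(n), x1, x2])
    ie_bar = X.T @ X / n
    return np.linalg.slogdet(ie_bar)[1]

orthogonal = logdet_ie_bar(rng.normal(size=n))             # independent second predictor
collinear = logdet_ie_bar(x1 + 0.01 * rng.normal(size=n))  # nearly a copy of x1
# collinear is far below orthogonal: the determinant term the BIC drops is not negligible
```

With an orthogonal design the term is near zero and the IBIC behaves like the BIC (up to the 2π); under near-collinearity the term is strongly negative and the two criteria diverge.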

References

Bentler, P.M. and Douglas G. Bonett. 1980. Significance Tests and Goodness of Fit in the Analysis of Covariance Structures. Psychological Bulletin 88.
Bentler, P.M. and J.A. Stein. 1992. Structural Equation Models in Medical Research. Statistical Methods in Medical Research 1(2).
Beran, Tanya N. and Claudio Violato. 2010. Structural Equation Modeling in Medical Research: A Primer. BMC Research Notes 3(1).
Berger, James O. 1994. An Overview of Robust Bayesian Analysis. Test (Madrid) 3(1):5-59.
Berger, James O., Surajit Ray, Ingmar Visser, M.J. Bayarri and W. Jang. 2006. Generalization of BIC. Technical report, University of North Carolina, Duke University, and SAMSI.
Bollen, Kenneth A. 1989. Structural Equations with Latent Variables. New York: John Wiley & Sons.
Bollen, Kenneth A. and J. Scott Long. 1993. Testing Structural Equation Models. Sage.
Bollen, Kenneth A. and Robert A. Stine. 1992. Bootstrapping Goodness of Fit Measures in Structural Equation Models. Sociological Methods & Research 21.
Carlin, Bradley P. and Thomas A. Louis. 1996. Bayes and Empirical Bayes Methods for Data Analysis. New York: Chapman and Hall.
Cudeck, Robert and Michael W. Browne. 1983. Cross-validation of Covariance Structures. Multivariate Behavioral Research 18(2).
Fougere, P. 1990. Maximum Entropy and Bayesian Methods. In Maximum Entropy and Bayesian Methods, ed. P. Fougere. Dordrecht, NL: Kluwer Academic Publishers.
Gelman, Andrew, John B. Carlin, Hal S. Stern and Donald B. Rubin. 1995. Bayesian Data Analysis. London: Chapman & Hall.
Haughton, Dominique M. A. 1988. On the Choice of a Model to Fit Data From an Exponential Family. The Annals of Statistics 16(1).
Haughton, Dominique M. A., Johan H. L. Oud and Robert A. R. G. Jansen. 1997. Information


More information

1 Introduction. 2 AIC versus SBIC. Erik Swanson Cori Saviano Li Zha Final Project

1 Introduction. 2 AIC versus SBIC. Erik Swanson Cori Saviano Li Zha Final Project Erik Swanson Cori Saviano Li Zha Final Project 1 Introduction In analyzing time series data, we are posed with the question of how past events influences the current situation. In order to determine this,

More information

Tools for Parameter Estimation and Propagation of Uncertainty

Tools for Parameter Estimation and Propagation of Uncertainty Tools for Parameter Estimation and Propagation of Uncertainty Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline Models, parameters, parameter estimation,

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use:

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use: This article was downloaded by: [Howell, Roy][Texas Tech University] On: 15 December 2009 Access details: Access Details: [subscription number 907003254] Publisher Psychology Press Informa Ltd Registered

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

Bayesian Assessment of Hypotheses and Models

Bayesian Assessment of Hypotheses and Models 8 Bayesian Assessment of Hypotheses and Models This is page 399 Printer: Opaque this 8. Introduction The three preceding chapters gave an overview of how Bayesian probability models are constructed. Once

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

BAYESIAN ESTIMATION IN DAM MONITORING NETWORKS

BAYESIAN ESTIMATION IN DAM MONITORING NETWORKS BAYESIAN ESIMAION IN DAM MONIORING NEWORKS João CASACA, Pedro MAEUS, and João COELHO National Laboratory for Civil Engineering, Portugal Abstract: A Bayesian estimator with informative prior distributions

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Moderation 調節 = 交互作用

Moderation 調節 = 交互作用 Moderation 調節 = 交互作用 Kit-Tai Hau 侯傑泰 JianFang Chang 常建芳 The Chinese University of Hong Kong Based on Marsh, H. W., Hau, K. T., Wen, Z., Nagengast, B., & Morin, A. J. S. (in press). Moderation. In Little,

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Nesting and Equivalence Testing

Nesting and Equivalence Testing Nesting and Equivalence Testing Tihomir Asparouhov and Bengt Muthén August 13, 2018 Abstract In this note, we discuss the nesting and equivalence testing (NET) methodology developed in Bentler and Satorra

More information

Carl N. Morris. University of Texas

Carl N. Morris. University of Texas EMPIRICAL BAYES: A FREQUENCY-BAYES COMPROMISE Carl N. Morris University of Texas Empirical Bayes research has expanded significantly since the ground-breaking paper (1956) of Herbert Robbins, and its province

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

October 1, Keywords: Conditional Testing Procedures, Non-normal Data, Nonparametric Statistics, Simulation study

October 1, Keywords: Conditional Testing Procedures, Non-normal Data, Nonparametric Statistics, Simulation study A comparison of efficient permutation tests for unbalanced ANOVA in two by two designs and their behavior under heteroscedasticity arxiv:1309.7781v1 [stat.me] 30 Sep 2013 Sonja Hahn Department of Psychology,

More information

Confirmatory Factor Analysis. Psych 818 DeShon

Confirmatory Factor Analysis. Psych 818 DeShon Confirmatory Factor Analysis Psych 818 DeShon Purpose Takes factor analysis a few steps further. Impose theoretically interesting constraints on the model and examine the resulting fit of the model with

More information

STRUCTURAL EQUATION MODELING. Khaled Bedair Statistics Department Virginia Tech LISA, Summer 2013

STRUCTURAL EQUATION MODELING. Khaled Bedair Statistics Department Virginia Tech LISA, Summer 2013 STRUCTURAL EQUATION MODELING Khaled Bedair Statistics Department Virginia Tech LISA, Summer 2013 Introduction: Path analysis Path Analysis is used to estimate a system of equations in which all of the

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models Chapter 5 Introduction to Path Analysis Put simply, the basic dilemma in all sciences is that of how much to oversimplify reality. Overview H. M. Blalock Correlation and causation Specification of path

More information

Online Appendix: Bayesian versus maximum likelihood estimation of treatment effects in bivariate probit instrumental variable models

Online Appendix: Bayesian versus maximum likelihood estimation of treatment effects in bivariate probit instrumental variable models Online Appendix: Bayesian versus maximum likelihood estimation of treatment effects in bivariate probit instrumental variable models A. STAN CODE // STAN code for Bayesian bivariate model // based on code

More information

Uniformly and Restricted Most Powerful Bayesian Tests

Uniformly and Restricted Most Powerful Bayesian Tests Uniformly and Restricted Most Powerful Bayesian Tests Valen E. Johnson and Scott Goddard Texas A&M University June 6, 2014 Valen E. Johnson and Scott Goddard Texas A&MUniformly University Most Powerful

More information

Psychology 454: Latent Variable Modeling How do you know if a model works?

Psychology 454: Latent Variable Modeling How do you know if a model works? Psychology 454: Latent Variable Modeling How do you know if a model works? William Revelle Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 18 Outline 1 Goodness

More information

sempower Manual Morten Moshagen

sempower Manual Morten Moshagen sempower Manual Morten Moshagen 2018-03-22 Power Analysis for Structural Equation Models Contact: morten.moshagen@uni-ulm.de Introduction sempower provides a collection of functions to perform power analyses

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

New Bayesian methods for model comparison

New Bayesian methods for model comparison Back to the future New Bayesian methods for model comparison Murray Aitkin murray.aitkin@unimelb.edu.au Department of Mathematics and Statistics The University of Melbourne Australia Bayesian Model Comparison

More information

Introduction to Structural Equation Modeling

Introduction to Structural Equation Modeling Introduction to Structural Equation Modeling Notes Prepared by: Lisa Lix, PhD Manitoba Centre for Health Policy Topics Section I: Introduction Section II: Review of Statistical Concepts and Regression

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

arxiv: v1 [stat.me] 30 Aug 2018

arxiv: v1 [stat.me] 30 Aug 2018 BAYESIAN MODEL AVERAGING FOR MODEL IMPLIED INSTRUMENTAL VARIABLE TWO STAGE LEAST SQUARES ESTIMATORS arxiv:1808.10522v1 [stat.me] 30 Aug 2018 Teague R. Henry Zachary F. Fisher Kenneth A. Bollen department

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models

Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models Latent Variable Centering of Predictors and Mediators in Multilevel and Time-Series Models Tihomir Asparouhov and Bengt Muthén August 5, 2018 Abstract We discuss different methods for centering a predictor

More information

Condition 9 and 10 Tests of Model Confirmation with SEM Techniques

Condition 9 and 10 Tests of Model Confirmation with SEM Techniques Condition 9 and 10 Tests of Model Confirmation with SEM Techniques Dr. Larry J. Williams CARMA Director Donald and Shirley Clifton Chair of Survey Science Professor of Management University of Nebraska

More information

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY

AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Econometrics Working Paper EWP0401 ISSN 1485-6441 Department of Economics AN EMPIRICAL LIKELIHOOD RATIO TEST FOR NORMALITY Lauren Bin Dong & David E. A. Giles Department of Economics, University of Victoria

More information

What s New in Econometrics. Lecture 13

What s New in Econometrics. Lecture 13 What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments

More information

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I. Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series

More information

Chapter 8. Models with Structural and Measurement Components. Overview. Characteristics of SR models. Analysis of SR models. Estimation of SR models

Chapter 8. Models with Structural and Measurement Components. Overview. Characteristics of SR models. Analysis of SR models. Estimation of SR models Chapter 8 Models with Structural and Measurement Components Good people are good because they've come to wisdom through failure. Overview William Saroyan Characteristics of SR models Estimation of SR models

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Estimation and Hypothesis Testing in LAV Regression with Autocorrelated Errors: Is Correction for Autocorrelation Helpful?

Estimation and Hypothesis Testing in LAV Regression with Autocorrelated Errors: Is Correction for Autocorrelation Helpful? Journal of Modern Applied Statistical Methods Volume 10 Issue Article 13 11-1-011 Estimation and Hypothesis Testing in LAV Regression with Autocorrelated Errors: Is Correction for Autocorrelation Helpful?

More information

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors Journal of Data Science 6(008), 75-87 A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors Scott Berry 1 and Kert Viele 1 Berry Consultants and University of Kentucky

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS 1 W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS An Liu University of Groningen Henk Folmer University of Groningen Wageningen University Han Oud Radboud

More information

Mixture Modeling. Identifying the Correct Number of Classes in a Growth Mixture Model. Davood Tofighi Craig Enders Arizona State University

Mixture Modeling. Identifying the Correct Number of Classes in a Growth Mixture Model. Davood Tofighi Craig Enders Arizona State University Identifying the Correct Number of Classes in a Growth Mixture Model Davood Tofighi Craig Enders Arizona State University Mixture Modeling Heterogeneity exists such that the data are comprised of two or

More information

Choosing among models

Choosing among models Eco 515 Fall 2014 Chris Sims Choosing among models September 18, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported

More information

The Bayesian Approach to Multi-equation Econometric Model Estimation

The Bayesian Approach to Multi-equation Econometric Model Estimation Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation

More information

Bayesian Statistics Adrian Raftery and Jeff Gill One-day course for the American Sociological Association August 15, 2002

Bayesian Statistics Adrian Raftery and Jeff Gill One-day course for the American Sociological Association August 15, 2002 Bayesian Statistics Adrian Raftery and Jeff Gill One-day course for the American Sociological Association August 15, 2002 Bayes Course, ASA Meeting, August 2002 c Adrian E. Raftery 2002 1 Outline 1. Bayes

More information

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Olubusoye, O. E., J. O. Olaomi, and O. O. Odetunde Abstract A bootstrap simulation approach

More information

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information