An Introduction to Structural Equation Modelling John Fox


Lecture Notes (with corrections)

An Introduction to Structural Equation Modelling

John Fox
McMaster University, Canada
Copyright 2006 by John Fox

1. Introduction

Structural-equation models (SEMs) are multiple-equation regression models in which the response variable in one regression equation can appear as an explanatory variable in another equation. Indeed, two variables in an SEM can even affect one another reciprocally, either directly, or indirectly through a feedback loop.

Structural-equation models can include variables that are not measured directly, but rather indirectly through their effects (called indicators) or, sometimes, through their observable causes. Unmeasured variables are variously termed latent variables, constructs, or factors.

Modern structural-equation methods represent a confluence of work in many disciplines, including biostatistics, econometrics, psychometrics, and social statistics. The general synthesis of these various traditions dates to the late 1960s and early 1970s.

This introduction to SEMs takes up several topics:
- The form and specification of observed-variable SEMs.
- Instrumental-variables estimation.
- The identification problem: determining whether or not an SEM, once specified, can be estimated.
- Estimation of observed-variable SEMs.
- Structural-equation models with latent variables, measurement errors, and multiple indicators.
- The LISREL model: a general structural-equation model with latent variables.

We will estimate SEMs using the sem package in R.

2. Specification of Structural-Equation Models

Structural-equation models are multiple-equation regression models representing putative causal (and hence structural) relationships among a number of variables, some of which may affect one another mutually. Claiming that a relationship is causal based on observational data is no less problematic in an SEM than it is in a single-equation regression model.
Such a claim is intrinsically problematic and requires support beyond the data at hand.

Several classes of variables appear in SEMs:

Endogenous variables are the response variables of the model. There is one structural equation (regression equation) for each endogenous variable. An endogenous variable may, however, also appear as an explanatory variable in other structural equations. For the kinds of models that we will consider, the endogenous variables are (as in the single-equation linear model) quantitative continuous variables.

Exogenous variables appear only as explanatory variables in the structural equations. The values of the exogenous variables are therefore determined outside of the model (hence the term). Like the explanatory variables in a linear model, exogenous variables are assumed to be measured without error (but see the later discussion of latent-variable models).

Exogenous variables can be categorical (represented, as in a linear model, by dummy regressors or other sorts of contrasts).

Structural errors (or disturbances) represent the aggregated omitted causes of the endogenous variables, along with measurement error (and possibly intrinsic randomness) in the endogenous variables. There is one error variable for each endogenous variable (and hence for each structural equation). The errors are assumed to have zero expectations and to be independent of (or at least uncorrelated with) the exogenous variables. The errors for different observations are assumed to be independent of one another, but (depending upon the form of the model) different errors for the same observation may be related. Each error variable is assumed to have constant variance across observations, although different error variables generally will have different variances (and indeed different units of measurement: the squared units of the corresponding endogenous variables). As in linear models, we will sometimes assume that the errors are normally distributed.

I will use the following notation for writing down SEMs:
- Endogenous variables: y_k, y_k′
- Exogenous variables: x_j, x_j′
- Errors: ε_k, ε_k′
- Structural coefficients (i.e., regression coefficients) representing the direct (partial) effect
  - of an exogenous variable on an endogenous variable, x_j on y_k: γ_kj (gamma). Note that the subscript of the response variable comes first.
  - of an endogenous variable on another endogenous variable, y_k′ on y_k: β_kk′ (beta).
- Covariances between two exogenous variables, x_j and x_j′: σ_jj′; between two error variables, ε_k and ε_k′: σ_kk′. When we require them, other covariances are represented similarly. Variances will be written either as σ²_j or as σ_jj (i.e., the covariance of a variable with itself), as is convenient.
Path Diagrams

An intuitively appealing way of representing an SEM is in the form of a causal graph, called a path diagram. An example, from Duncan, Haller, and Portes's (1968) study of peer influences on the aspirations of high-school students, appears in Figure 1.

The following conventions are used in the path diagram:
- A directed (single-headed) arrow represents a direct effect of one variable on another; each such arrow is labelled with a structural coefficient.
- A bidirectional (two-headed) arrow represents a covariance, between exogenous variables or between errors, that is not given causal interpretation.
- I give each variable in the model (x, y, and ε) a unique subscript; I find that this helps to keep track of variables and coefficients.

[Figure 1 here: path diagram with directed arrows γ_51, γ_52, γ_63, γ_64, β_56, β_65, errors ε_7 and ε_8, and covariances σ_14 and σ_78.]

Figure 1. Duncan, Haller, and Portes's (nonrecursive) peer-influences model: x_1, respondent's IQ; x_2, respondent's family SES; x_3, best friend's family SES; x_4, best friend's IQ; y_5, respondent's occupational aspiration; y_6, best friend's occupational aspiration. So as not to clutter the diagram, only one exogenous covariance, σ_14, is shown.

When two variables are not linked by a directed arrow, it does not necessarily mean that one does not affect the other: for example, in the Duncan, Haller, and Portes model, respondent's IQ (x_1) can affect best friend's occupational aspiration (y_6), but only indirectly, through respondent's aspiration (y_5). The absence of a directed arrow between respondent's IQ and best friend's aspiration means that there is no partial relationship between the two variables when the direct causes of best friend's aspiration are held constant. In general, indirect effects can be identified with compound paths through the path diagram.

Structural Equations

The structural equations of a model can be read straightforwardly from the path diagram. For example, for the Duncan, Haller, and Portes peer-influences model:

y_5i = γ_50 + γ_51 x_1i + γ_52 x_2i + β_56 y_6i + ε_7i
y_6i = γ_60 + γ_63 x_3i + γ_64 x_4i + β_65 y_5i + ε_8i

I'll usually simplify the structural equations by (i) suppressing the subscript i for observation, and (ii) expressing all x's and y's as deviations from their population means (and, later, from their means in the sample). Putting variables in mean-deviation form removes the constant terms (here, γ_50 and γ_60) from the structural equations (these constants are rarely of interest), and will simplify some algebra later on.
Applying these simplifications to the peer-influences model:

y_5 = γ_51 x_1 + γ_52 x_2 + β_56 y_6 + ε_7
y_6 = γ_63 x_3 + γ_64 x_4 + β_65 y_5 + ε_8
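Since the two equations determine y_5 and y_6 jointly, computing the endogenous variables for given coefficients means solving a small linear system. Here is a minimal sketch with made-up coefficient values (illustrative assumptions, not estimates from the study):

```python
import numpy as np

# Illustrative (assumed) coefficient values for the two structural equations:
g51, g52, b56 = 0.5, 0.2, 0.4
g63, g64, b65 = 0.25, 0.45, 0.3

x1, x2, x3, x4 = 1.0, -0.5, 0.3, 2.0   # exogenous values (mean-deviation form)
e7, e8 = 0.1, -0.2                      # structural errors

# Move the endogenous variables to the left:
#   y5 - b56*y6        = g51*x1 + g52*x2 + e7
#   -b65*y5 + y6       = g63*x3 + g64*x4 + e8
A = np.array([[1.0, -b56],
              [-b65, 1.0]])
rhs = np.array([g51 * x1 + g52 * x2 + e7,
                g63 * x3 + g64 * x4 + e8])
y5, y6 = np.linalg.solve(A, rhs)

# Both structural equations now hold exactly:
assert np.isclose(y5, g51 * x1 + g52 * x2 + b56 * y6 + e7)
assert np.isclose(y6, g63 * x3 + g64 * x4 + b65 * y5 + e8)
```

The 2×2 matrix on the left is exactly the B matrix introduced in the next section.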

Matrix Form of the Model

It is sometimes helpful (e.g., for generality) to cast a structural-equation model in matrix form. To illustrate, I'll begin by rewriting the Duncan, Haller, and Portes model, shifting all observed variables (i.e., all variables with the exception of the errors) to the left-hand side of the model, and showing all variables explicitly; variables missing from an equation therefore get 0 coefficients, while the response variable in each equation is shown with a coefficient of 1:

1·y_5 − β_56·y_6 − γ_51·x_1 − γ_52·x_2 + 0·x_3 + 0·x_4 = ε_7
−β_65·y_5 + 1·y_6 + 0·x_1 + 0·x_2 − γ_63·x_3 − γ_64·x_4 = ε_8

Collecting the endogenous variables, exogenous variables, errors, and coefficients into vectors and matrices, we can write

[ 1      −β_56 ] [y_5]   [ −γ_51  −γ_52    0       0    ] [x_1]   [ε_7]
[ −β_65   1    ] [y_6] + [  0      0      −γ_63   −γ_64 ] [x_2] = [ε_8]
                                                           [x_3]
                                                           [x_4]

More generally, where there are q endogenous variables (and hence q errors) and m exogenous variables, the model for an individual observation is

B y_i + Γ x_i = ε_i
(q×q)(q×1) (q×m)(m×1) (q×1)

The B and Γ matrices of structural coefficients typically contain some 0 elements, and the diagonal entries of the B matrix are 1's.

We can also write the model for all n observations in the sample:

Y B′ + X Γ′ = E
(n×q)(q×q) (n×m)(m×q) (n×q)

I have transposed the structural-coefficient matrices B and Γ, writing each structural equation as a column (rather than as a row), so that each observation comprises a row of the matrices Y, X, and E of endogenous variables, exogenous variables, and errors.

Recursive, Block-Recursive, and Nonrecursive Structural-Equation Models

An important type of SEM, called a recursive model, has two defining characteristics:

(a) Different error variables are independent (or, at least, uncorrelated).
(b) Causation in the model is unidirectional: there are no reciprocal paths or feedback loops, as shown in Figure 2.

Put another way, the B matrix for a recursive SEM is lower-triangular, while the error-covariance matrix Σ_εε is diagonal.

An illustrative recursive model, from Blau and Duncan's seminal monograph, The American Occupational Structure (1967), appears in Figure 3. For the Blau and Duncan model:

[Figure 2 here: examples of reciprocal paths (y_k ⇄ y_k′) and a feedback loop among endogenous variables.]

Figure 2. Reciprocal paths and feedback loops cannot appear in a recursive model.

[Figure 3 here: path diagram with directed arrows γ_31, γ_32, γ_42, γ_52, β_43, β_53, β_54, exogenous covariance σ_12, and errors ε_6, ε_7, ε_8.]

Figure 3. Blau and Duncan's basic stratification model: x_1, father's education; x_2, father's occupational status; y_3, respondent's (son's) education; y_4, respondent's first-job status; y_5, respondent's present (1962) occupational status.

For the Blau and Duncan model:

B = [ 1      0      0 ]        Σ_εε = [ σ²_6  0     0    ]
    [ −β_43  1      0 ]               [ 0     σ²_7  0    ]
    [ −β_53  −β_54  1 ]               [ 0     0     σ²_8 ]

Sometimes the requirements for unidirectional causation and independent errors are met by subsets ("blocks") of endogenous variables and their associated errors rather than by the individual variables. Such a model is called block-recursive. An illustrative block-recursive model for the Duncan, Haller, and Portes peer-influences data is shown in Figure 4.

[Figure 4 here: block 1 contains y_5 and y_6 (errors ε_9, ε_10); block 2 contains y_7 and y_8 (errors ε_11, ε_12).]

Figure 4. An extended, block-recursive model for Duncan, Haller, and Portes's peer-influences data: x_1, respondent's IQ; x_2, respondent's family SES; x_3, best friend's family SES; x_4, best friend's IQ; y_5, respondent's occupational aspiration; y_6, best friend's occupational aspiration; y_7, respondent's educational aspiration; y_8, best friend's educational aspiration.

Here

B = [ B_11  0    ]        Σ_εε = [ Σ_11  0    ]
    [ B_21  B_22 ]               [ 0     Σ_22 ]

where B_11 = [[1, −β_56], [−β_65, 1]] and B_22 = [[1, −β_78], [−β_87, 1]] capture the reciprocal paths within blocks 1 and 2; B_21 contains the coefficients (such as −β_86) of the block-1 endogenous variables in the block-2 equations; and

Σ_11 = [ σ²_9    σ_9,10 ]        Σ_22 = [ σ²_11    σ_11,12 ]
       [ σ_10,9  σ²_10  ]               [ σ_12,11  σ²_12   ]

That is, B is block lower-triangular and Σ_εε is block-diagonal.

A model that is neither recursive nor block-recursive (such as the model for Duncan, Haller, and Portes's data in Figure 1) is termed nonrecursive.

Instrumental-Variables Estimation

Instrumental-variables (IV) estimation is a method of deriving estimators that is useful for understanding whether estimation of a structural-equation model is possible (the "identification problem") and for obtaining estimates of structural parameters when it is.

Expectations, Variances, and Covariances of Random Variables

It is helpful first to review expectations, variances, and covariances of random variables.

The expectation (or mean) of a random variable X is the average value of the variable that would be obtained over a very large number of repetitions of the process that generates the variable. For a discrete random variable X,

E(X) = µ_X = Σ_{all x} x·p(x)

where x is a value of the random variable and p(x) = Pr(X = x). I'm careful here (as usually I am not) to distinguish the random variable (X) from a value of the random variable (x).

For a continuous random variable X,

E(X) = µ_X = ∫_{−∞}^{+∞} x·p(x) dx

where now p(x) is the probability-density function of X evaluated at x.

Likewise, the variance of a random variable is, in the discrete case,

Var(X) = σ²_X = Σ_{all x} (x − µ_X)²·p(x)

and in the continuous case

Var(X) = σ²_X = ∫_{−∞}^{+∞} (x − µ_X)²·p(x) dx

In both cases,

Var(X) = E[(X − µ_X)²]

The standard deviation of a random variable is the (positive) square root of the variance: SD(X) = σ_X = +√Var(X).

The covariance of two random variables X and Y is a measure of their linear relationship; again, a single formula suffices for the discrete and continuous cases:

Cov(X, Y) = σ_XY = E[(X − µ_X)(Y − µ_Y)]

The correlation of two random variables is defined in terms of their covariance:

Cor(X, Y) = ρ_XY = σ_XY / (σ_X σ_Y)

Simple Regression

To understand the IV approach to estimation, consider first the following route to the ordinary-least-squares (OLS) estimator of the simple-regression model,

y = βx + ε

where the variables x and y are in mean-deviation form, eliminating the regression constant from the model; that is, E(y) = E(x) = 0. By the usual assumptions of this model, E(ε) = 0; Var(ε) = σ²_ε; and x and ε are independent.

Now multiply both sides of the model by x and take expectations:

xy = βx² + xε
E(xy) = βE(x²) + E(xε)
Cov(x, y) = βVar(x) + Cov(x, ε)
σ_xy = βσ²_x + 0

where Cov(x, ε) = 0 because x and ε are independent.

Solving for the regression coefficient β,

β = σ_xy / σ²_x

Of course, we don't know the population covariance of x and y, nor do we know the population variance of x, but we can estimate both of these parameters consistently:

s²_x = Σ(x_i − x̄)² / (n − 1)
s_xy = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)

In these formulas, the variables are expressed in raw-score form, and so I show the subtraction of the sample means explicitly. A consistent estimator of β is then

b = s_xy / s²_x = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²

which we recognize as the OLS estimator.

Imagine, alternatively, that x and ε are not independent, but that ε is independent of some other variable z. Suppose further that z and x are correlated, that is, Cov(x, z) ≠ 0.
Then, proceeding as before, but multiplying through by z rather than by x (with all variables expressed as deviations from their expectations):

zy = βzx + zε
E(zy) = βE(zx) + E(zε)
Cov(z, y) = βCov(z, x) + Cov(z, ε)
σ_zy = βσ_zx + 0
β = σ_zy / σ_zx

where Cov(z, ε) = 0 because z and ε are independent. Substituting sample for population covariances gives the instrumental-variables estimator of β:

b_IV = s_zy / s_zx = Σ(z_i − z̄)(y_i − ȳ) / Σ(z_i − z̄)(x_i − x̄)
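A quick simulation illustrates the contrast (the data-generating process and numbers are my own, not from the notes): when x is correlated with the error, OLS is inconsistent, while the IV estimator recovers β.

```python
import numpy as np

# x is correlated with the error through a shared shock u; z is a valid
# instrument (correlated with x, independent of the error).
rng = np.random.default_rng(0)
n, beta = 100_000, 1.0

z = rng.normal(size=n)            # instrument: independent of eps
u = rng.normal(size=n)            # common shock making x endogenous
eps = u + rng.normal(size=n)      # error, correlated with x through u
x = 0.8 * z + u                   # x correlated with both z and eps
y = beta * x + eps

def cov(a, b):
    """Sample covariance with the n - 1 divisor, as in the notes."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

b_ols = cov(x, y) / cov(x, x)     # inconsistent: picks up Cov(x, eps)
b_iv = cov(z, y) / cov(z, x)      # consistent: Cov(z, eps) = 0

print(round(b_ols, 2), round(b_iv, 2))
```

With this design, b_ols is biased well above 1, while b_iv lands close to the true β = 1.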

The variable z is called an instrumental variable (or, simply, an instrument). b_IV is a consistent estimator of the population slope β, because the sample covariances s_zy and s_zx are consistent estimators of the corresponding population covariances σ_zy and σ_zx.

Multiple Regression

The generalization to multiple-regression models is straightforward. Consider, for example, a model with two explanatory variables,

y = β_1 x_1 + β_2 x_2 + ε

(with x_1, x_2, and y all expressed as deviations from their expectations). If we can assume that the error ε is independent of x_1 and x_2, then we can derive the population analog of the estimating equations by multiplying through by the two explanatory variables in turn, obtaining

E(x_1 y) = β_1 E(x²_1) + β_2 E(x_1 x_2) + E(x_1 ε)
E(x_2 y) = β_1 E(x_1 x_2) + β_2 E(x²_2) + E(x_2 ε)

σ_x1y = β_1 σ²_x1 + β_2 σ_x1x2 + 0
σ_x2y = β_1 σ_x1x2 + β_2 σ²_x2 + 0

Substituting sample for population variances and covariances produces the OLS estimating equations:

s_x1y = b_1 s²_x1 + b_2 s_x1x2
s_x2y = b_1 s_x1x2 + b_2 s²_x2

Alternatively, if we cannot assume that ε is independent of the x's, but can assume that ε is independent of two other variables, z_1 and z_2, then

E(z_1 y) = β_1 E(z_1 x_1) + β_2 E(z_1 x_2) + E(z_1 ε)
E(z_2 y) = β_1 E(z_2 x_1) + β_2 E(z_2 x_2) + E(z_2 ε)

σ_z1y = β_1 σ_z1x1 + β_2 σ_z1x2 + 0
σ_z2y = β_1 σ_z2x1 + β_2 σ_z2x2 + 0

The IV estimating equations are obtained by the now-familiar step of substituting consistent sample estimators for the population covariances:

s_z1y = b_1 s_z1x1 + b_2 s_z1x2
s_z2y = b_1 s_z2x1 + b_2 s_z2x2

For the IV estimating equations to have a unique solution, it's necessary that there not be an analog of perfect collinearity. For example, neither x_1 nor x_2 can be uncorrelated with both z_1 and z_2.
Good instrumental variables, while remaining uncorrelated with the error, should be as correlated as possible with the explanatory variables. In this context, good means yielding relatively small coefficient standard errors (i.e., producing efficient estimates).
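To see how the IV estimating equations are solved in practice, here is a small sketch (my own simulated example, not from the notes) that builds the 2×2 system from sample covariances and solves it for b_1 and b_2:

```python
import numpy as np

# Two endogenous regressors, two valid instruments (illustrative design):
rng = np.random.default_rng(1)
n, b1_true, b2_true = 50_000, 2.0, -1.0

z1, z2 = rng.normal(size=(2, n))
u = rng.normal(size=n)                 # shock shared by x1, x2, and eps
x1 = z1 + 0.5 * z2 + u
x2 = 0.5 * z1 + z2 + u
eps = u + rng.normal(size=n)
y = b1_true * x1 + b2_true * x2 + eps

def cov(a, b):
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

# The IV estimating equations:
#   s_z1y = b1 s_z1x1 + b2 s_z1x2
#   s_z2y = b1 s_z2x1 + b2 s_z2x2
A = np.array([[cov(z1, x1), cov(z1, x2)],
              [cov(z2, x1), cov(z2, x2)]])
rhs = np.array([cov(z1, y), cov(z2, y)])
b1, b2 = np.linalg.solve(A, rhs)       # IV estimates of beta1, beta2
print(round(b1, 2), round(b2, 2))
```

Because each instrument is correlated with both x's, the covariance matrix A is nonsingular and the system has a unique solution near the true parameters.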

OLS is a special case of IV estimation, in which the instruments and the explanatory variables are one and the same. When the explanatory variables are uncorrelated with the error, the explanatory variables are their own best instruments, since they are perfectly correlated with themselves. Indeed, the Gauss-Markov theorem ensures that, when it is applicable, the OLS estimator is the best (i.e., minimum-variance or most efficient) linear unbiased estimator (BLUE).

Instrumental-Variables Estimation in Matrix Form

Our object is to estimate the model

y = X β + ε
(n×1) (n×(k+1)) ((k+1)×1) (n×1)

where ε ~ N_n(0, σ²_ε I_n). Of course, if X and ε are independent, then we can use the OLS estimator

b_OLS = (X′X)⁻¹X′y

with estimated covariance matrix

V̂(b_OLS) = s²_OLS (X′X)⁻¹

where

s²_OLS = e′_OLS e_OLS / (n − k − 1),  for  e_OLS = y − X b_OLS

Suppose, however, that we cannot assume that X and ε are independent, but that we have observations on k + 1 instrumental variables, Z (n×(k+1)), that are independent of ε. For greater generality, I have not put the variables in mean-deviation form, and so the model includes a constant; the matrices X and Z therefore each include an initial column of ones.

A development that parallels the previous scalar treatment leads to the IV estimator

b_IV = (Z′X)⁻¹Z′y

with estimated covariance matrix

V̂(b_IV) = s²_IV (Z′X)⁻¹Z′Z(X′Z)⁻¹

where

s²_IV = e′_IV e_IV / (n − k − 1),  for  e_IV = y − X b_IV

Since the results for IV estimation are asymptotic, we could also estimate the error variance with n rather than n − k − 1 in the denominator, but dividing by degrees of freedom produces a larger variance estimate and hence is conservative. For b_IV to be unique, Z′X must be nonsingular (just as X′X must be nonsingular for the OLS estimator).
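The matrix formulas translate directly into code. The following is an illustrative sketch with simulated data (the data-generating process and variable names are my own assumptions):

```python
import numpy as np

# Matrix IV: b_IV = (Z'X)^{-1} Z'y, with estimated covariance matrix
# V = s^2 (Z'X)^{-1} Z'Z (X'Z)^{-1}.
rng = np.random.default_rng(2)
n, k = 20_000, 2

z_raw = rng.normal(size=(n, k))            # instruments
u = rng.normal(size=n)
x_raw = z_raw + u[:, None]                 # regressors, endogenous through u
eps = u + rng.normal(size=n)
beta = np.array([1.0, 2.0, -0.5])          # intercept plus two slopes

X = np.column_stack([np.ones(n), x_raw])   # initial column of ones
Z = np.column_stack([np.ones(n), z_raw])
y = X @ beta + eps

b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # (Z'X)^{-1} Z'y
e = y - X @ b_iv
s2 = e @ e / (n - k - 1)                   # conservative df divisor
ZX_inv = np.linalg.inv(Z.T @ X)
V = s2 * ZX_inv @ (Z.T @ Z) @ ZX_inv.T     # note (X'Z)^{-1} = ((Z'X)^{-1})'

print(np.round(b_iv, 2))
```

With valid instruments, b_iv lands close to the true β even though OLS on this X would be biased.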

The Identification Problem

If a parameter in a structural-equation model can be estimated, then the parameter is said to be identified; otherwise, it is underidentified (or unidentified).
- If all of the parameters in a structural equation are identified, then so is the equation. If all of the equations in an SEM are identified, then so is the model. Structural equations and models that are not identified are also termed underidentified.
- If only one estimate of a parameter is available, then the parameter is just-identified (or exactly identified). If more than one estimate is available, then the parameter is overidentified.
- The same terminology extends to structural equations and to models: an identified structural equation or SEM with one or more overidentified parameters is itself overidentified.

Establishing whether an SEM is identified is called the identification problem. Identification is usually established one structural equation at a time.

Identification of Nonrecursive Models: The Order Condition

Using instrumental variables, we can derive a necessary (but, as it turns out, not sufficient) condition for the identification of nonrecursive models, called the order condition. Because the order condition is not sufficient to establish identification, it is possible (though rarely the case) that a model can meet the order condition but not be identified. There is a necessary and sufficient condition for identification, called the rank condition, which I will not develop here; the rank condition is described in the reading.

The terms "order condition" and "rank condition" derive from the order (number of rows and columns) and rank (number of linearly independent rows and columns) of a matrix that can be formulated during the process of identifying a structural equation.
We will not pursue this approach.

Both the order and rank conditions apply to nonrecursive models without restrictions on disturbance covariances. Such restrictions can sometimes serve to identify a model that would not otherwise be identified. More general approaches are required to establish the identification of models with disturbance-covariance restrictions; again, these are taken up in the reading. We will, however, use the IV approach to consider the identification of two classes of models with restrictions on disturbance covariances: recursive and block-recursive models.

The order condition is best developed from an example. Recall the Duncan, Haller, and Portes peer-influences model, reproduced in Figure 5. Let us focus on the first of the two structural equations of the model,

y_5 = γ_51 x_1 + γ_52 x_2 + β_56 y_6 + ε_7

where all variables are expressed as deviations from their expectations. There are three structural parameters to estimate in this equation: γ_51, γ_52, and β_56.

It would be inappropriate to perform an OLS regression of y_5 on x_1, x_2, and y_6 to estimate this equation, because we cannot reasonably assume that the endogenous explanatory variable y_6 is uncorrelated with the error ε_7:
- ε_7 may be correlated with ε_8, which is one of the components of y_6.
- ε_7 is a component of y_5, which is a cause (as well as an effect) of y_6.

[Figure 5 here: the nonrecursive peer-influences path diagram, repeated from Figure 1.]

Figure 5. Duncan, Haller, and Portes's nonrecursive peer-influences model (repeated).

This conclusion is more general: we cannot assume that endogenous explanatory variables are uncorrelated with the error of a structural equation. As we will see, however, we will be able to make this assumption in recursive models.

Nevertheless, we can use the four exogenous variables x_1, x_2, x_3, and x_4 as instrumental variables to obtain estimating equations for the structural equation. For example, multiplying through the structural equation by x_1 and taking expectations produces

x_1 y_5 = γ_51 x²_1 + γ_52 x_1 x_2 + β_56 x_1 y_6 + x_1 ε_7
E(x_1 y_5) = γ_51 E(x²_1) + γ_52 E(x_1 x_2) + β_56 E(x_1 y_6) + E(x_1 ε_7)
σ_15 = γ_51 σ²_1 + γ_52 σ_12 + β_56 σ_16

since σ_17 = E(x_1 ε_7) = 0.
Applying all four exogenous variables:

IV    Equation
x_1   σ_15 = γ_51 σ²_1 + γ_52 σ_12 + β_56 σ_16
x_2   σ_25 = γ_51 σ_12 + γ_52 σ²_2 + β_56 σ_26
x_3   σ_35 = γ_51 σ_13 + γ_52 σ_23 + β_56 σ_36
x_4   σ_45 = γ_51 σ_14 + γ_52 σ_24 + β_56 σ_46

If the model is correct, then all of these equations, involving population variances, covariances, and structural parameters, hold simultaneously and exactly. If we had access to the population variances and covariances, then, we could solve for the structural coefficients γ_51, γ_52, and β_56 even though there are four equations and only three parameters. Since the four equations hold simultaneously, we could obtain the solution by eliminating any one of them and solving the remaining three.

Translating from population to sample produces four IV estimating equations for the three structural parameters:

s_15 = γ̂_51 s²_1 + γ̂_52 s_12 + β̂_56 s_16
s_25 = γ̂_51 s_12 + γ̂_52 s²_2 + β̂_56 s_26
s_35 = γ̂_51 s_13 + γ̂_52 s_23 + β̂_56 s_36
s_45 = γ̂_51 s_14 + γ̂_52 s_24 + β̂_56 s_46

The s²_j's and s_jj′'s are sample variances and covariances that can be calculated directly from sample data, while γ̂_51, γ̂_52, and β̂_56 are estimates of the structural parameters, for which we want to solve the estimating equations.

There is a problem, however: the four estimating equations in the three unknown parameter estimates will not hold precisely. Because of sampling variation, there will be no set of estimates that simultaneously satisfies the four estimating equations. That is, the four estimating equations in three unknown parameters are overdetermined. Under these circumstances, the three parameters and the structural equation are said to be overidentified.

It is important to appreciate the nature of the problem here: we have too much rather than too little information. We could simply throw away one of the four estimating equations and solve the remaining three for consistent estimates of the structural parameters. The estimates that we would obtain would depend, however, on which estimating equation was discarded. Moreover, throwing away an estimating equation, while yielding consistent estimates, discards information that could be used to improve the efficiency of estimation.

To illuminate the nature of overidentification, consider the following, even simpler, example: we want to estimate the structural equation

y_5 = γ_51 x_1 + β_54 y_4 + ε_6

and have available as instruments the exogenous variables x_1, x_2, and x_3.
Then, in the population, the following three equations hold simultaneously:

IV    Equation
x_1   σ_15 = γ_51 σ²_1 + β_54 σ_14
x_2   σ_25 = γ_51 σ_12 + β_54 σ_24
x_3   σ_35 = γ_51 σ_13 + β_54 σ_34

These linear equations in the parameters γ_51 and β_54 are illustrated in Figure 6(a), which is constructed assuming particular values for the population variances and covariances in the equations. The important aspect of this illustration is that the three equations intersect at a single point, determining the structural parameters, which are the solution to the equations.

The three estimating equations are

s_15 = γ̂_51 s²_1 + β̂_54 s_14
s_25 = γ̂_51 s_12 + β̂_54 s_24
s_35 = γ̂_51 s_13 + β̂_54 s_34

As illustrated in Figure 6(b), because the sample variances and covariances are not exactly equal to the corresponding population values, the estimating equations do not in general intersect at a common point, and therefore have no solution. Discarding an estimating equation, however, produces a solution, since each pair of lines intersects at a point.
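The geometry of Figure 6 can be mimicked numerically. In this sketch (simulated data; the coefficients and data-generating process are my own assumptions), each pair of the three estimating equations yields a solution near the true parameters, but the three solutions differ because of sampling variation:

```python
import numpy as np

# Three instruments, two parameters: an overidentified equation.
rng = np.random.default_rng(3)
n, g51_true, b54_true = 5_000, 1.0, 0.5

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)     # instruments correlated with x1
x3 = -0.5 * x1 + rng.normal(size=n)
u = rng.normal(size=n)
y4 = x1 + x2 + 2 * x3 + u              # endogenous regressor
eps = u + rng.normal(size=n)           # correlated with y4 through u
y5 = g51_true * x1 + b54_true * y4 + eps

def cov(a, b):
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

# Each row is one estimating equation  s_{z,y5} = g51 s_{z,x1} + b54 s_{z,y4}:
A = np.array([[cov(z, x1), cov(z, y4)] for z in (x1, x2, x3)])
rhs = np.array([cov(z, y5) for z in (x1, x2, x3)])

# Solve each pair of equations: three similar but unequal solutions.
for rows in [(0, 1), (0, 2), (1, 2)]:
    sol = np.linalg.solve(A[list(rows)], rhs[list(rows)])
    print(rows, np.round(sol, 3))
```

No single (γ̂_51, β̂_54) satisfies all three equations at once, which is exactly the overdetermination described above.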

[Figure 6 here: panels (a) and (b), plotting possible values of β_54 against possible values of γ_51, with three lines in each panel.]

Figure 6. Population equations (a) and corresponding estimating equations (b) for an overidentified structural equation with two parameters and three estimating equations. The population equations have a solution for the parameters, but the estimating equations do not.

Let us return to the Duncan, Haller, and Portes model, and add a path from x_3 to y_5, so that the first structural equation becomes

y_5 = γ_51 x_1 + γ_52 x_2 + γ_53 x_3 + β_56 y_6 + ε_7

There are now four parameters to estimate (γ_51, γ_52, γ_53, and β_56), and four IVs (x_1, x_2, x_3, and x_4), which produce four estimating equations. With as many estimating equations as unknown structural parameters, there is only one way of estimating the parameters, which are therefore just-identified.

We can think of this situation as a kind of balance sheet, with IVs as credits and structural parameters as debits. For a just-identified structural equation, the numbers of credits and debits are the same:

Credits (IVs)   Debits (parameters)
x_1             γ_51
x_2             γ_52
x_3             γ_53
x_4             β_56
4               4

In the original specification of the Duncan, Haller, and Portes model, there were only three parameters in the first structural equation, producing a surplus of IVs, and an overidentified structural equation:

Credits (IVs)   Debits (parameters)
x_1             γ_51
x_2             γ_52
x_3             β_56
x_4
4               3

Now let us add still another path to the model, from x_4 to y_5, so that the first structural equation becomes

y_5 = γ_51 x_1 + γ_52 x_2 + γ_53 x_3 + γ_54 x_4 + β_56 y_6 + ε_7

Now there are fewer IVs available than parameters to estimate in the structural equation, and so the equation is underidentified:

Credits (IVs)   Debits (parameters)
x_1             γ_51
x_2             γ_52
x_3             γ_53
x_4             γ_54
                β_56
4               5

That is, we have only four estimating equations for five unknown parameters, producing an underdetermined system of estimating equations.

From these examples, we can abstract the order condition for the identification of a structural equation: for the structural equation to be identified, we need at least as many exogenous variables (instrumental variables) as there are parameters to estimate in the equation.

Since structural-equation models have more than one endogenous variable, the order condition implies that some potential explanatory variables must be excluded a priori from each structural equation of the model for the model to be identified. Put another way, for each endogenous explanatory variable in a structural equation, at least one exogenous variable must be excluded from the equation.

Suppose that there are m exogenous variables in the model:
- A structural equation with fewer than m structural parameters is overidentified.
- A structural equation with exactly m structural parameters is just-identified.
- A structural equation with more than m structural parameters is underidentified, and cannot be estimated.

Identification of Recursive and Block-Recursive Models

The pool of IVs for estimating a structural equation in a recursive model includes not only the exogenous variables but prior endogenous variables as well.
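The balance-sheet bookkeeping can be captured in a few lines. The helper below is my own illustration of the order condition, not code from the notes:

```python
# Compare the number of available instruments (the m exogenous variables)
# with the number of structural parameters in an equation.
def order_condition(n_instruments, n_parameters):
    """Classify a structural equation by the order condition."""
    if n_parameters < n_instruments:
        return "overidentified"
    if n_parameters == n_instruments:
        return "just-identified"
    return "underidentified"

# The three versions of the first Duncan-Haller-Portes equation,
# with m = 4 exogenous variables serving as instruments:
print(order_condition(4, 3))  # original equation: gamma51, gamma52, beta56
print(order_condition(4, 4))  # with the added path from x3
print(order_condition(4, 5))  # with paths from both x3 and x4
```

Remember that this counting rule is only necessary, not sufficient: an equation can pass the order condition and still fail the rank condition.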
Because the explanatory variables in a structural equation are drawn from among the exogenous and prior endogenous variables, there will always be at least as many IVs as there are explanatory variables (i.e., structural parameters to estimate). Consequently, structural equations in a recursive model are necessarily identified. To understand this result, consider the Blau and Duncan basic-stratification model, reproduced in Figure 7.

Figure 7. Blau and Duncan's recursive basic-stratification model (repeated). [Path diagram omitted in transcription.]

The first structural equation of the model is

y3 = γ31 x1 + γ32 x2 + ε6

with balance sheet

Credits (IVs)    Debits (parameters)
x1               γ31
x2               γ32
2                2

Because there are equal numbers of IVs and structural parameters, the first structural equation is just-identified.

More generally, the first structural equation in a recursive model can have only exogenous explanatory variables (or it wouldn't be the first equation). If all the exogenous variables appear as explanatory variables (as in the Blau and Duncan model), then the first structural equation is just-identified. If any exogenous variables are excluded as explanatory variables from the first structural equation, then the equation is overidentified.

The second structural equation in the Blau and Duncan model is

y4 = γ42 x2 + β43 y3 + ε7

As before, the exogenous variables x1 and x2 can serve as IVs. The prior endogenous variable y3 can also serve as an IV, because (according to the first structural equation) y3 is a linear combination of variables (x1, x2, and ε6) that are all uncorrelated with the error ε7 (x1 and x2 because they are exogenous, ε6 because it is another error variable).

The balance sheet is therefore

Credits (IVs)    Debits (parameters)
x1               γ42
x2               β43
y3
3                2

Because there is a surplus of IVs, the second structural equation is overidentified.

More generally, the second structural equation in a recursive model can have only the exogenous variables and the first (i.e., prior) endogenous variable as explanatory variables.

All of these predetermined variables are also eligible to serve as IVs. If all of the predetermined variables appear as explanatory variables, then the second structural equation is just-identified; if any are excluded, the equation is overidentified.

The situation with respect to the third structural equation is similar:

y5 = γ52 x2 + β53 y3 + β54 y4 + ε8

Here, the eligible instrumental variables include (as always) the exogenous variables (x1, x2) and the two prior endogenous variables: y3, because it is a linear combination of exogenous variables (x1 and x2) and an error variable (ε6), all of which are uncorrelated with the error from the third equation, ε8; and y4, because it is a linear combination of variables (x2, y3, and ε7, as specified in the second structural equation) that are also all uncorrelated with ε8.

The balance sheet for the third structural equation indicates that the equation is overidentified:

Credits (IVs)    Debits (parameters)
x1               γ52
x2               β53
y3               β54
y4
4                3

More generally: All prior variables, including exogenous and prior endogenous variables, are eligible as IVs for estimating a structural equation in a recursive model. If all of these prior variables also appear as explanatory variables in the structural equation, then the equation is just-identified. If, alternatively, one or more prior variables are excluded, then the equation is overidentified. A structural equation in a recursive model cannot be underidentified.

A slight complication: There may only be a partial ordering of the endogenous variables. Consider, for example, the model in Figure 8. This is a version of Blau and Duncan's model in which the path from y3 to y4 has been removed.
As a consequence, y3 is no longer prior to y4 in the model; indeed, the two variables are unordered. Because the errors associated with these endogenous variables, ε6 and ε7, are uncorrelated with each other, however, y3 is still available for use as an IV in estimating the equation for y4. Moreover, y4 is now also available for use as an IV in estimating the equation for y3, so the situation with respect to identification has, if anything, improved.

Figure 8. A recursive model (a modification of Blau and Duncan's model) in which there are two endogenous variables, y3 and y4, that are not ordered. [Path diagram omitted in transcription.]

In a block-recursive model, all exogenous variables and endogenous variables in prior blocks are available for use as IVs in estimating the structural equations in a particular block. A structural equation in a block-recursive model may therefore be under-, just-, or overidentified, depending upon whether there are fewer, the same number as, or more IVs than parameters.

For example, recall the block-recursive model for Duncan, Haller, and Portes's peer-influences data, reproduced in Figure 9.

Figure 9. Block-recursive model for Duncan, Haller, and Portes's peer-influences data (repeated). [Path diagram omitted in transcription.]

There are four IVs available to estimate the structural equations in the first block (for endogenous variables y5 and y6): the exogenous variables (x1, x2, x3, and x4). Because each of these structural equations has four parameters to estimate, each equation is just-identified.

There are six IVs available to estimate the structural equations in the second block (for endogenous variables y7 and y8): the four exogenous variables plus the two endogenous variables (y5 and y6) from the first block. Because each structural equation in the second block has five structural parameters to estimate, each equation is overidentified.

In the absence of the block-recursive restrictions on the disturbance covariances, only the exogenous variables would be available as IVs to estimate the structural equations in the second block, and these equations would consequently be underidentified.
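Checking identification by the order condition reduces to the balance-sheet count of available IVs against structural parameters. A minimal sketch of that bookkeeping (a hypothetical helper written in Python for illustration, not part of any package):

```python
def order_condition(n_ivs: int, n_params: int) -> str:
    """Classify a structural equation by comparing the number of
    available instrumental variables to the number of parameters."""
    if n_ivs < n_params:
        return "underidentified"
    if n_ivs == n_params:
        return "just-identified"
    return "overidentified"

# Blau and Duncan's recursive model: the IV pool for each equation is the
# exogenous variables plus all prior endogenous variables.
print(order_condition(2, 2))  # y3 equation: x1, x2 vs. gamma31, gamma32
print(order_condition(3, 2))  # y4 equation: x1, x2, y3 vs. gamma42, beta43
print(order_condition(4, 3))  # y5 equation: x1, x2, y3, y4 vs. gamma52, beta53, beta54
```

As the text notes, the third case can never come out "underidentified" in a recursive model, because every explanatory variable is also in the IV pool.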

5. Estimation of Structural-Equation Models

5.1 Estimating Nonrecursive Models

There are two general and many specific approaches to estimating SEMs:

(a) Single-equation or limited-information methods estimate each structural equation individually. I will describe a single-equation method called two-stage least squares (2SLS). Unlike OLS, which is also a limited-information method, 2SLS produces consistent estimates in nonrecursive SEMs. Unlike direct IV estimation, 2SLS handles overidentified structural equations in a non-arbitrary manner. 2SLS also has a reasonable intuitive basis and appears to perform well; it is generally considered the best of the limited-information methods.

(b) Systems or full-information methods estimate all of the parameters in the structural-equation model simultaneously, including error variances and covariances. I will briefly describe a method called full-information maximum likelihood (FIML). Full-information methods are asymptotically more efficient than single-equation methods, although in a model with a misspecified equation, they tend to propagate the specification error throughout the model. FIML appears to be the best of the full-information methods.

Both 2SLS and FIML are implemented in the sem package for R.

Two-Stage Least Squares

Underidentified structural equations cannot be estimated. Just-identified equations can be estimated by direct application of the available IVs: we have as many estimating equations as unknown parameters. For an overidentified structural equation, we have more than enough IVs; there is a surplus of estimating equations which, in general, are not satisfied by a common solution. 2SLS is a method for reducing the IVs to just the right number, but by combining IVs rather than discarding some altogether.
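To see concretely why a surplus of IVs calls for combining rather than discarding, here is a small simulated illustration in Python (all variable names and coefficient values invented): with two valid instruments for one endogenous regressor, each instrument alone yields a consistent but different estimate, while 2SLS blends them through the first-stage fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# One endogenous regressor w with two valid instruments z1, z2.
z1, z2, u = rng.normal(size=(3, n))
w = 0.6 * z1 + 0.5 * z2 + u
e = rng.normal(size=n) + 0.7 * u        # error correlated with w
y = 0.8 * w + e

# Each instrument alone gives a consistent estimate of 0.8, but the two
# estimates differ in any finite sample: a surplus of estimating equations.
b1 = (z1 @ y) / (z1 @ w)
b2 = (z2 @ y) / (z2 @ w)

# 2SLS combines the instruments through the first-stage fitted values.
Z = np.column_stack([z1, z2])
w_hat = Z @ np.linalg.lstsq(Z, w, rcond=None)[0]
b_2sls = (w_hat @ y) / (w_hat @ w)
```

All three estimates are close to the true coefficient 0.8 here, but only b_2sls uses the overidentifying information in a non-arbitrary way.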
Recall the first structural equation from Duncan, Haller, and Portes's peer-influences model:

y5 = γ51 x1 + γ52 x2 + β56 y6 + ε7

This equation is overidentified because there are four IVs available (x1, x2, x3, and x4) but only three structural parameters to estimate (γ51, γ52, and β56).

An IV must be correlated with the explanatory variables but uncorrelated with the error. A good IV must be as correlated as possible with the explanatory variables, to produce estimated structural coefficients with small standard errors. 2SLS chooses IVs by examining each explanatory variable in turn: The exogenous explanatory variables x1 and x2 are their own best instruments, because each is perfectly correlated with itself.

To get a best IV for the endogenous explanatory variable y6, we first regress this variable on all of the exogenous variables (by OLS), according to the reduced-form model

y6 = π61 x1 + π62 x2 + π63 x3 + π64 x4 + δ6

producing fitted values

ŷ6 = π̂61 x1 + π̂62 x2 + π̂63 x3 + π̂64 x4

Because ŷ6 is a linear combination of the x's (indeed, the linear combination most highly correlated with y6), it is (asymptotically) uncorrelated with the structural error ε7. This is the first stage of 2SLS.

Now we have just the right number of IVs (x1, x2, and ŷ6), producing three estimating equations for the three unknown structural parameters:

IV    2SLS Estimating Equation
x1    s15 = γ̂51 s11 + γ̂52 s12 + β̂56 s16
x2    s25 = γ̂51 s12 + γ̂52 s22 + β̂56 s26
ŷ6    s5ŷ6 = γ̂51 s1ŷ6 + γ̂52 s2ŷ6 + β̂56 s6ŷ6

where, e.g., s5ŷ6 is the sample covariance between y5 and ŷ6.

The generalization of 2SLS from this example is straightforward:

Stage 1: Regress each of the endogenous explanatory variables in a structural equation on all of the exogenous variables in the model, obtaining fitted values.

Stage 2: Use the fitted endogenous explanatory variables from stage 1, along with the exogenous explanatory variables, as IVs to estimate the structural equation.

If a structural equation is just-identified, then the 2SLS estimates are identical to those produced by direct application of the exogenous variables as IVs.

There is an alternative route to the 2SLS estimator which, in the second stage, replaces each endogenous explanatory variable in the structural equation with the fitted values from the first-stage regression, and then performs an OLS regression. The second-stage OLS regression produces the same estimates as the IV approach. The name "two-stage least squares" originates from this alternative approach.
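The two stages can be carried out directly on simulated data patterned loosely on this equation (the coefficients and data-generating process are invented for illustration); stage 2 solves the three estimating equations in sample covariances, one per instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Four exogenous variables, and y6 generated from its reduced form.
x1, x2, x3, x4 = rng.normal(size=(4, n))
X = np.column_stack([x1, x2, x3, x4])
d6 = rng.normal(size=n)
y6 = 0.5 * x1 + 0.3 * x2 + 0.6 * x3 + 0.4 * x4 + d6   # reduced form for y6
e7 = rng.normal(size=n) + 0.5 * d6                    # error correlated with y6
y5 = 1.0 * x1 - 0.5 * x2 + 0.8 * y6 + e7

# Stage 1: regress y6 on all exogenous variables; keep the fitted values.
pi_hat, *_ = np.linalg.lstsq(X, y6, rcond=None)
y6_hat = X @ pi_hat

# Stage 2: one estimating equation in sample covariances per IV
# (x1, x2, y6_hat), solved for (gamma51, gamma52, beta56).
Z = np.column_stack([x1, x2, y6_hat])   # instruments
W = np.column_stack([x1, x2, y6])       # explanatory variables
Zc, Wc = Z - Z.mean(0), W - W.mean(0)
A = Zc.T @ Wc / (n - 1)                 # covariances of IVs with regressors
b = Zc.T @ (y5 - y5.mean()) / (n - 1)   # covariances of IVs with y5
est = np.linalg.solve(A, b)             # approx. (1.0, -0.5, 0.8) for large n
```

Because y6 is correlated with the structural error through d6, OLS on this equation would be inconsistent, while the covariance system above recovers the structural coefficients.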
The 2SLS estimator for the jth structural equation in a nonrecursive model can be formulated in matrix form as follows. Write the jth structural equation as

y_j = Y_j β_j + X_j γ_j + ε_j = [Y_j, X_j] [β_j; γ_j] + ε_j

where

y_j (n × 1) is the response-variable vector in structural equation j;
Y_j (n × q_j) is the matrix of q_j endogenous explanatory variables in equation j;
β_j (q_j × 1) is the vector of structural parameters for the endogenous explanatory variables;
X_j (n × m_j) is the matrix of m_j exogenous explanatory variables in equation j, normally including a column of 1's;
γ_j (m_j × 1) is the vector of structural parameters for the exogenous explanatory variables;
ε_j (n × 1) is the error vector for structural equation j.

In the first stage of 2SLS, the endogenous explanatory variables are regressed on all m exogenous variables in the model, obtaining the OLS estimates of the reduced-form regression coefficients

P_j = (X'X)^(-1) X'Y_j

and fitted values

Ŷ_j = X P_j = X (X'X)^(-1) X'Y_j

In the second stage of 2SLS, we apply X_j and Ŷ_j as instruments to the structural equation to obtain (after quite a bit of manipulation)

[β̂_j; γ̂_j] = [Y_j'X(X'X)^(-1)X'Y_j, Y_j'X_j; X_j'Y_j, X_j'X_j]^(-1) [Y_j'X(X'X)^(-1)X'y_j; X_j'y_j]

The estimated variance-covariance matrix of the 2SLS estimates is

V̂[β̂_j; γ̂_j] = s_e^2 [Y_j'X(X'X)^(-1)X'Y_j, Y_j'X_j; X_j'Y_j, X_j'X_j]^(-1)

where

s_e^2 = e_j'e_j / (n − q_j − m_j)
e_j = y_j − Y_j β̂_j − X_j γ̂_j

Full-Information Maximum Likelihood

Along with the other standard assumptions of SEMs, FIML estimates are calculated under the assumption that the structural errors are multivariately normally distributed. Under this assumption, the log-likelihood for the model is

log_e L(B, Γ, Σ_εε) = n log_e det(B) − (nq/2) log_e 2π − (n/2) log_e det(Σ_εε) − (1/2) Σ_{i=1}^{n} (B y_i + Γ x_i)' Σ_εε^(-1) (B y_i + Γ x_i)

where det represents the determinant.

The FIML estimates are the values of the parameters that maximize the likelihood under the constraints placed on the model, for example, that certain entries of B, Γ, and (possibly) Σ_εε are 0. Estimated variances and covariances for the parameters are obtained from the inverse of the information matrix, the negative of the Hessian matrix of second-order partial derivatives of the log-likelihood, evaluated at the parameter estimates. The full general machinery of maximum-likelihood estimation is available; for example, alternative nested models can be compared by a likelihood-ratio test.
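The partitioned closed form for the 2SLS estimator can be verified numerically on simulated data (dimensions and coefficient values invented for illustration); it agrees exactly with the replace-and-regress route, in which y_j is regressed by OLS on the fitted Ŷ_j and X_j.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# All m = 4 exogenous variables in the model (including the constant).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
delta = rng.normal(size=n)
# One endogenous explanatory variable Y_j, generated from its reduced form.
Yj = (X @ np.array([0.2, 0.5, 0.6, 0.4]) + delta).reshape(n, 1)
eps = rng.normal(size=n) + 0.6 * delta      # error correlated with Y_j
Xj = X[:, :3]                               # exogenous variables in this equation
yj = 0.8 * Yj[:, 0] + Xj @ np.array([1.0, -0.5, 0.3]) + eps

# First stage: reduced-form fitted values.
XtX_inv = np.linalg.inv(X.T @ X)
Yj_hat = X @ (XtX_inv @ (X.T @ Yj))

# Second stage: the partitioned closed-form estimator for (beta_j, gamma_j).
M = np.block([[Yj_hat.T @ Yj, Yj.T @ Xj],
              [Xj.T @ Yj,     Xj.T @ Xj]])
rhs = np.concatenate([Yj_hat.T @ yj, Xj.T @ yj])
est = np.linalg.solve(M, rhs)

# Estimated variance-covariance matrix, with q_j = 1 and m_j = 3.
e = yj - est[0] * Yj[:, 0] - Xj @ est[1:]
s2 = (e @ e) / (n - 1 - 3)
V = s2 * np.linalg.inv(M)

# Equivalent second-stage OLS: regress y_j on [Y_j-hat, X_j].
Z = np.hstack([Yj_hat, Xj])
est_ols, *_ = np.linalg.lstsq(Z, yj, rcond=None)
```

The equality of est and est_ols holds because X_j lies in the column space of X, so X_j'Ŷ_j = X_j'Y_j and the two sets of normal equations coincide.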

Estimation Using the sem Package in R

The tsls function in the sem package is used to estimate structural equations by 2SLS. The function works much like the lm function for fitting linear models by OLS, except that instrumental variables are specified in the instruments argument as a one-sided formula. For example, to fit the first equation in the Duncan, Haller, and Portes model, we would specify something like

eqn.1 <- tsls(ROccAsp ~ RIQ + RSES + FOccAsp,
    instruments = ~ RIQ + RSES + FSES + FIQ, data=DHP)
summary(eqn.1)

This assumes that we have Duncan, Haller, and Portes's data in the data frame DHP, which is not the case.

The sem function may be used to fit a wide variety of models, including observed-variable nonrecursive models, by FIML. This function takes three required arguments:

ram: A coded formulation of the model, described below.
S: The covariance matrix among the observed variables in the model; may be in upper- or lower-triangular form as well as the full, symmetric matrix.
N: The number of observations on which the covariance matrix is based.

In addition, for an observed-variable model, the argument fixed.x should be set to the names of the exogenous variables in the model.

ram stands for "reticular action model," and stems from a particular approach, due originally to McArdle, to specifying and estimating SEMs. Each structural coefficient of the model is represented as a directed arrow, ->, in the ram argument to sem. Each error variance and covariance is represented as a bidirectional arrow, <->.

To write out the model in this form, it helps to redraw the path diagram, as in Figure 10 for the Duncan, Haller, and Portes model.
Then the model can be encoded as follows, specifying each arrow, and giving a name to and start-value for the corresponding parameter (NA = let the program compute the start-value):

model.dhp.1 <- specify.model()
    RIQ     ->  ROccAsp, gamma51, NA
    RSES    ->  ROccAsp, gamma52, NA
    FSES    ->  FOccAsp, gamma63, NA
    FIQ     ->  FOccAsp, gamma64, NA
    FOccAsp ->  ROccAsp, beta56,  NA
    ROccAsp ->  FOccAsp, beta65,  NA
    ROccAsp <-> ROccAsp, sigma77, NA
    FOccAsp <-> FOccAsp, sigma88, NA
    ROccAsp <-> FOccAsp, sigma78, NA

Figure 10. Modified path diagram for the Duncan, Haller, and Portes model, omitting covariances among exogenous variables, and showing error variances and covariances as double arrows attached to the endogenous variables. [Path diagram omitted in transcription.]

As was common when SEMs were first introduced to sociologists, Duncan, Haller, and Portes estimated their model for standardized variables. That is, the covariance matrix among the observed variables is a correlation matrix. The arguments for using standardized variables in an SEM are no more compelling than in a regression model. In particular, it makes no sense to standardize dummy regressors, for example.

FIML estimates and standard errors for the Duncan, Haller, and Portes model were computed for the parameters γ51, γ52, β56, γ63, γ64, β65, σ77, σ88, and σ78. [Table of estimates and standard errors lost in transcription.] The ratio of each estimate to its standard error is a Wald statistic for testing the null hypothesis that the corresponding parameter is 0, distributed asymptotically as a standard normal variable under the hypothesis.

Note the large (and highly statistically significant) negative estimated error covariance, corresponding to an error correlation of r56 = [value lost in transcription].

Estimation of Recursive and Block-Recursive Models

Because all of the explanatory variables in a structural equation of a recursive model are uncorrelated with the error, the equation can be consistently estimated by OLS. For a recursive model, the OLS, 2SLS, and FIML estimates coincide. Estimation of a block-recursive model is essentially the same as of a nonrecursive model: All variables in prior blocks are available for use as IVs in formulating 2SLS estimates. FIML estimates reflect the restrictions placed on the disturbance covariances.
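The coincidence of OLS and instrumental-variables estimates in a recursive model can be illustrated with simulated data (coefficients invented): because every explanatory variable is uncorrelated with the equation's error, using the explanatory variables as their own instruments reproduces OLS exactly, and both recover the structural parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# A small recursive system with mutually uncorrelated errors.
x1, x2 = rng.normal(size=(2, n))
e6, e7 = rng.normal(size=(2, n))
y3 = 0.4 * x1 + 0.3 * x2 + e6
y4 = 0.5 * x2 + 0.6 * y3 + e7        # y3 is prior, uncorrelated with e7

# OLS for the y4 equation ...
W = np.column_stack([x2, y3])
ols, *_ = np.linalg.lstsq(W, y4, rcond=None)

# ... equals direct IV estimation with the explanatory variables as
# their own instruments: solve W'W b = W'y4.
iv = np.linalg.solve(W.T @ W, W.T @ y4)
```

Both estimates are close to the true values (0.5, 0.6); in a nonrecursive system, by contrast, the OLS estimates would be inconsistent.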


More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Lecture 4: Testing Stuff

Lecture 4: Testing Stuff Lecture 4: esting Stuff. esting Hypotheses usually has three steps a. First specify a Null Hypothesis, usually denoted, which describes a model of H 0 interest. Usually, we express H 0 as a restricted

More information

Elementary maths for GMT

Elementary maths for GMT Elementary maths for GMT Linear Algebra Part 2: Matrices, Elimination and Determinant m n matrices The system of m linear equations in n variables x 1, x 2,, x n a 11 x 1 + a 12 x 2 + + a 1n x n = b 1

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis Developed by Sewall Wright, path analysis is a method employed to determine whether or not a multivariate set of nonexperimental data fits well with a particular (a priori)

More information

ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II

ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II Ragnar Nymoen Department of Economics University of Oslo 9 October 2018 The reference to this lecture is:

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Notes on Mathematics Groups

Notes on Mathematics Groups EPGY Singapore Quantum Mechanics: 2007 Notes on Mathematics Groups A group, G, is defined is a set of elements G and a binary operation on G; one of the elements of G has particularly special properties

More information

Econometrics Master in Business and Quantitative Methods

Econometrics Master in Business and Quantitative Methods Econometrics Master in Business and Quantitative Methods Helena Veiga Universidad Carlos III de Madrid Models with discrete dependent variables and applications of panel data methods in all fields of economics

More information

Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA

Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA March 6, 2017 KC Border Linear Regression II March 6, 2017 1 / 44 1 OLS estimator 2 Restricted regression 3 Errors in variables 4

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

The Matrix Algebra of Sample Statistics

The Matrix Algebra of Sample Statistics The Matrix Algebra of Sample Statistics James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) The Matrix Algebra of Sample Statistics

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

Linear Regression. Junhui Qian. October 27, 2014

Linear Regression. Junhui Qian. October 27, 2014 Linear Regression Junhui Qian October 27, 2014 Outline The Model Estimation Ordinary Least Square Method of Moments Maximum Likelihood Estimation Properties of OLS Estimator Unbiasedness Consistency Efficiency

More information

Multivariate Regression (Chapter 10)

Multivariate Regression (Chapter 10) Multivariate Regression (Chapter 10) This week we ll cover multivariate regression and maybe a bit of canonical correlation. Today we ll mostly review univariate multivariate regression. With multivariate

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

We begin by thinking about population relationships.

We begin by thinking about population relationships. Conditional Expectation Function (CEF) We begin by thinking about population relationships. CEF Decomposition Theorem: Given some outcome Y i and some covariates X i there is always a decomposition where

More information

11. Simultaneous-Equation Models

11. Simultaneous-Equation Models 11. Simultaneous-Equation Models Up to now: Estimation and inference in single-equation models Now: Modeling and estimation of a system of equations 328 Example: [I] Analysis of the impact of advertisement

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Simultaneous Equation Models Learning Objectives Introduction Introduction (2) Introduction (3) Solving the Model structural equations

Simultaneous Equation Models Learning Objectives Introduction Introduction (2) Introduction (3) Solving the Model structural equations Simultaneous Equation Models. Introduction: basic definitions 2. Consequences of ignoring simultaneity 3. The identification problem 4. Estimation of simultaneous equation models 5. Example: IS LM model

More information

Introduction to Matrices

Introduction to Matrices POLS 704 Introduction to Matrices Introduction to Matrices. The Cast of Characters A matrix is a rectangular array (i.e., a table) of numbers. For example, 2 3 X 4 5 6 (4 3) 7 8 9 0 0 0 Thismatrix,with4rowsand3columns,isoforder

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations. POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems

More information

INTRODUCTION TO STRUCTURAL EQUATION MODELS

INTRODUCTION TO STRUCTURAL EQUATION MODELS I. Description of the course. INTRODUCTION TO STRUCTURAL EQUATION MODELS A. Objectives and scope of the course. B. Logistics of enrollment, auditing, requirements, distribution of notes, access to programs.

More information

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

A Introduction to Matrix Algebra and the Multivariate Normal Distribution A Introduction to Matrix Algebra and the Multivariate Normal Distribution PRE 905: Multivariate Analysis Spring 2014 Lecture 6 PRE 905: Lecture 7 Matrix Algebra and the MVN Distribution Today s Class An

More information

Linear Algebra, Summer 2011, pt. 2

Linear Algebra, Summer 2011, pt. 2 Linear Algebra, Summer 2, pt. 2 June 8, 2 Contents Inverses. 2 Vector Spaces. 3 2. Examples of vector spaces..................... 3 2.2 The column space......................... 6 2.3 The null space...........................

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

1 Counting spanning trees: A determinantal formula

1 Counting spanning trees: A determinantal formula Math 374 Matrix Tree Theorem Counting spanning trees: A determinantal formula Recall that a spanning tree of a graph G is a subgraph T so that T is a tree and V (G) = V (T ) Question How many distinct

More information

Multiple Linear Regression CIVL 7012/8012

Multiple Linear Regression CIVL 7012/8012 Multiple Linear Regression CIVL 7012/8012 2 Multiple Regression Analysis (MLR) Allows us to explicitly control for many factors those simultaneously affect the dependent variable This is important for

More information

Inference using structural equations with latent variables

Inference using structural equations with latent variables This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

The Standard Linear Model: Hypothesis Testing

The Standard Linear Model: Hypothesis Testing Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 25: The Standard Linear Model: Hypothesis Testing Relevant textbook passages: Larsen Marx [4]:

More information

Regression. ECO 312 Fall 2013 Chris Sims. January 12, 2014

Regression. ECO 312 Fall 2013 Chris Sims. January 12, 2014 ECO 312 Fall 2013 Chris Sims Regression January 12, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License What

More information

Review of Basic Probability Theory

Review of Basic Probability Theory Review of Basic Probability Theory James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 35 Review of Basic Probability Theory

More information

Final Exam. Economics 835: Econometrics. Fall 2010

Final Exam. Economics 835: Econometrics. Fall 2010 Final Exam Economics 835: Econometrics Fall 2010 Please answer the question I ask - no more and no less - and remember that the correct answer is often short and simple. 1 Some short questions a) For each

More information

Dealing With Endogeneity

Dealing With Endogeneity Dealing With Endogeneity Junhui Qian December 22, 2014 Outline Introduction Instrumental Variable Instrumental Variable Estimation Two-Stage Least Square Estimation Panel Data Endogeneity in Econometrics

More information

Correlation analysis. Contents

Correlation analysis. Contents Correlation analysis Contents 1 Correlation analysis 2 1.1 Distribution function and independence of random variables.......... 2 1.2 Measures of statistical links between two random variables...........

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Regression Analysis. Ordinary Least Squares. The Linear Model

Regression Analysis. Ordinary Least Squares. The Linear Model Regression Analysis Linear regression is one of the most widely used tools in statistics. Suppose we were jobless college students interested in finding out how big (or small) our salaries would be 20

More information

Regression Models - Introduction

Regression Models - Introduction Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent

More information

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra

SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to Linear Algebra SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 1 Introduction to 1.1. Introduction Linear algebra is a specific branch of mathematics dealing with the study of vectors, vector spaces with functions that

More information

Simultaneous Equation Models (Book Chapter 5)

Simultaneous Equation Models (Book Chapter 5) Simultaneous Equation Models (Book Chapter 5) Interrelated equations with continuous dependent variables: Utilization of individual vehicles (measured in kilometers driven) in multivehicle households Interrelation

More information

Econometrics. 8) Instrumental variables

Econometrics. 8) Instrumental variables 30C00200 Econometrics 8) Instrumental variables Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen Today s topics Thery of IV regression Overidentification Two-stage least squates

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Instrumental Variables

Instrumental Variables Università di Pavia 2010 Instrumental Variables Eduardo Rossi Exogeneity Exogeneity Assumption: the explanatory variables which form the columns of X are exogenous. It implies that any randomness in the

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

An Introduction to Parameter Estimation

An Introduction to Parameter Estimation Introduction Introduction to Econometrics An Introduction to Parameter Estimation This document combines several important econometric foundations and corresponds to other documents such as the Introduction

More information

Multivariate Regression: Part I

Multivariate Regression: Part I Topic 1 Multivariate Regression: Part I ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà Outline of this topic Statement of the objective: we want to explain the behavior of one variable as a

More information

Some Notes on Linear Algebra

Some Notes on Linear Algebra Some Notes on Linear Algebra prepared for a first course in differential equations Thomas L Scofield Department of Mathematics and Statistics Calvin College 1998 1 The purpose of these notes is to present

More information

Linear Algebra: Lecture Notes. Dr Rachel Quinlan School of Mathematics, Statistics and Applied Mathematics NUI Galway

Linear Algebra: Lecture Notes. Dr Rachel Quinlan School of Mathematics, Statistics and Applied Mathematics NUI Galway Linear Algebra: Lecture Notes Dr Rachel Quinlan School of Mathematics, Statistics and Applied Mathematics NUI Galway November 6, 23 Contents Systems of Linear Equations 2 Introduction 2 2 Elementary Row

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Eco517 Fall 2014 C. Sims FINAL EXAM

Eco517 Fall 2014 C. Sims FINAL EXAM Eco517 Fall 2014 C. Sims FINAL EXAM This is a three hour exam. You may refer to books, notes, or computer equipment during the exam. You may not communicate, either electronically or in any other way,

More information

Lecture 3: Multiple Regression

Lecture 3: Multiple Regression Lecture 3: Multiple Regression R.G. Pierse 1 The General Linear Model Suppose that we have k explanatory variables Y i = β 1 + β X i + β 3 X 3i + + β k X ki + u i, i = 1,, n (1.1) or Y i = β j X ji + u

More information