On the relationships among latent variables and residuals in PLS path modeling: The formative-reflective scheme

Computational Statistics & Data Analysis 51 (2007) 5828–5846
www.elsevier.com/locate/csda

Giorgio Vittadini, Simona C. Minotti, Marco Fattore, Pietro G. Lovaglio

Dipartimento di Statistica, Università degli Studi di Milano-Bicocca, Via Bicocca degli Arcimboldi 8, Milano, Italy

Received 4 December 2005; received in revised form 26 October 2006; accepted 26 October 2006. Available online 27 November 2006.

Abstract

A new approach for the estimation and the validation of a structural equation model with a formative-reflective scheme is presented. The basis of the paper is a proposal for overcoming a potential deficiency of PLS path modeling. In the PLS approach the reflective scheme assumed for the endogenous latent variables (LVs) is inverted; moreover, the model errors are not explicitly taken into account for the estimation of the endogenous LVs. The proposed approach utilizes all the relevant information in the formative manifest variables (MVs), providing solutions which respect the causal structure of the model. The estimation procedure is based on the optimization of the redundancy criterion. The new approach, entitled redundancy analysis approach to path modeling (RA-PM), is compared with both traditional PLS path modeling and LISREL methodology, on the basis of real and simulated data.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Latent variables; Partial least squares; PLS path modeling; Redundancy analysis; LISREL model

1. Introduction

This paper proposes a new approach for the estimation and the validation of a structural equation model characterized by a formative scheme for exogenous latent variables (LVs) and a reflective scheme for endogenous LVs. The proposal consists of an alternative to the PLS path modeling approach (PLS-PM) introduced by Lohmöller (1989).

In order to examine the implications involved in the adoption of a formative-reflective scheme, let us examine an example regarding the causal links between the investment in human capital (HC), i.e. educational attainment and work experience, and the ability to generate earned and property income, as indicated in Fig. 1. The HC model, which is described in detail in Section 7, consists of two exogenous LVs, educational HC and work experience HC, and two endogenous LVs which can be defined as the ability to generate earned income and the ability to generate property income. These two endogenous LVs represent the innate capability to transform educational attainment and work experience into the process of accumulation of earned and property income.

The results of the application reveal some critical points in PLS-PM. First of all, the goodness of fit is quite low (GoF = 0.3863) and, as Goldstein–Dillon's ρ shows, the formative block containing the Job indicators is not unidimensional (ρ = 0.1630). We expect that the extraction of further information from the formative manifest variables (MVs) will improve the model's goodness of fit. Hence we suggest the explicit introduction of the model errors (estimated

by means of the residuals) into PLS-PM. Secondly, we observe that the redundancy of the reflective MVs accounted for by the estimated endogenous LVs (40.62%) is greater than the redundancy accounted for by the formative MVs (34.51%). This anomalous result depends on Lohmöller's algorithm, which estimates the endogenous LVs as linear combinations of their reflective MVs.

[Fig. 1. The structural model for human capital: the Educational indicators and the Job indicators form, respectively, Educational human capital and Work experience human capital; these affect the Ability to generate earned income (reflected by the Earned income indicators) and the Ability to generate property income (reflected by the Property income indicators).]

In order to resolve the problems mentioned above, we propose a new approach based on the optimization of the redundancy criterion (Stewart and Love, 1968). The new approach, entitled RA-PM (redundancy analysis approach to path modeling), utilizes all the relevant information in the formative MVs and provides solutions which respect the causal structure of the model.

The paper is organized as follows: in Section 2, the formative-reflective model is introduced; in Section 3, the traditional methodologies utilized for its estimation are discussed; in Section 4, the RA-PM algorithm is presented; in Section 5, an appropriate measure of goodness of fit is proposed; in Section 6, a discussion of the basic choices and properties of RA-PM is provided; in Sections 7 and 8, RA-PM is compared with traditional PLS-PM and the LISREL methodology, on the basis of real and simulated data; in Section 9, some conclusions and further research are proposed.

2. The formative-reflective model

The non-recursive structural equation model considered in this paper is characterized by a formative scheme for the exogenous LVs and a reflective scheme for the endogenous LVs.

The measurement model of the exogenous LVs imposes a formative relationship between each LV and its corresponding MVs, i.e. each exogenous LV summarizes its MVs:

\xi_{(j)} = \omega_{(j)}' X_{(j)} + u_{(j)}, \quad j = 1, \ldots, r,  (1)

where \xi_{(j)}, j = 1, \ldots, r, is a random variable denoting the jth exogenous LV; X_{(j)} = (X_{(j)1}, \ldots, X_{(j)p_j})' is a p_j-dimensional vector of centred observable random variables; \omega_{(j)} = (\omega_{(j)1}, \ldots, \omega_{(j)p_j})' is a p_j-dimensional vector of unknown outer weights; u_{(j)} is a random disturbance, with null expected value and uncorrelated with the MVs X_{(j)}. It can be observed that, in the formative scheme, the blocks of MVs can be multidimensional (Tenenhaus et al., 2005).

The measurement model of the endogenous LVs imposes a reflective relationship between each LV and its corresponding MVs, i.e. each endogenous LV is described by its MVs:

Y_{(k)} = \lambda_{(k)} \eta_{(k)} + \varepsilon_{(k)}, \quad k = 1, \ldots, s,  (2)

where \eta_{(k)}, k = 1, \ldots, s, is a random variable which denotes the kth endogenous LV; Y_{(k)} = (Y_{(k)1}, \ldots, Y_{(k)q_k})' is a q_k-dimensional vector of centred observable random variables; \lambda_{(k)} = (\lambda_{(k)1}, \ldots, \lambda_{(k)q_k})' is a q_k-dimensional vector of unknown loadings; \varepsilon_{(k)} = (\varepsilon_{(k)1}, \ldots, \varepsilon_{(k)q_k})' is a q_k-dimensional vector of errors of measurement, with null expected value and uncorrelated with \eta_{(k)}. It can be observed that, in the reflective scheme, the blocks of MVs are unidimensional, in accordance with factor analysis (Tenenhaus et al., 2005).

The model is completed by the structural equation which describes the relationship between the LVs:

\eta_{(k)} = \beta_{(k)}' \eta + \gamma_{(k)}' \xi + \zeta_{(k)}, \quad k = 1, \ldots, s,  (3)

[Fig. 2. A formative-reflective structural equation model: X_{(1)} → ξ_{(1)} → η_{(1)} → Y_{(1)} and X_{(2)} → ξ_{(2)} → η_{(2)} → Y_{(2)}, with errors in equations ζ_{(1)} and ζ_{(2)}.]

where \eta is a random vector which contains the endogenous LVs; \xi is a random vector which contains the exogenous LVs; \beta_{(k)} and \gamma_{(k)} are vectors of s and r unknown path coefficients, respectively; \zeta_{(k)} is a random variable indicating the error in equation associated with the kth endogenous LV \eta_{(k)}. Errors in equations may represent the effect of unknown variables or the effect of known but omitted variables (Saris and Stronkhorst, 1984).

In a compact form, model (1)–(3) can be rewritten as

\xi = \Omega_X' X + u,  (4)

Y = \Lambda_Y \eta + \varepsilon  (5)

and

\eta = B' \eta + \Gamma' \xi + \zeta = (I - B')^{-1} \Gamma' \xi + (I - B')^{-1} \zeta.  (6)

Given the observations of n independent drawings of the random vectors X_{(j)} and Y_{(k)}, by arranging the observations in n-dimensional column vectors, model (4)–(6) becomes

\Xi = X \Omega_X + U,  (7)

Y = H \Lambda_Y' + E  (8)

and

H = H B + \Xi \Gamma + Z = \Xi \Gamma (I - B)^{-1} + Z (I - B)^{-1},  (9)

where \Xi = (\xi_{(1)}, \ldots, \xi_{(r)}) is a (n × r) matrix of latent scores on \xi; H = (\eta_{(1)}, \ldots, \eta_{(s)}) is a (n × s) matrix of latent scores on \eta; X = (x_{(1)}, \ldots, x_{(p)}) is a (n × p) matrix of observations on X; Y = (y_{(1)}, \ldots, y_{(q)}) is a (n × q) matrix of observations on Y; U is a (n × r) matrix of random disturbances; E = (\varepsilon_{(1)}, \ldots, \varepsilon_{(q)}) is a (n × q) matrix of errors of measurement; Z = (\zeta_{(1)}, \ldots, \zeta_{(s)}) is a (n × s) matrix of errors in equations. In the special case of r = 2 and s = 2, model (7)–(9) corresponds to the HC model, completed by the errors in equations (see Fig. 2).

A suitable estimation method for model (7)–(9) should have the following properties:

(i) The parameter identifiability.
(ii) The non-correlation between exogenous LVs and errors in equations.
(iii) The uniqueness of the latent scores.
(iv) The coherence of the solutions with the causal structure of the model (i.e. the path directions).
(v) The utilization of all the relevant information in the formative MVs, which implies the modelization of the residuals.

3. Problems in PLS path modeling and LISREL model

PLS-PM (Wold, 1982) satisfies properties (i) and (iii) (given that the rank of the components model is equal to the rank of the MVs, Lohmöller, 1989), but does not respect properties (ii), (iv) and (v), as will now be discussed. For the sake of simplicity, the discussion will regard the case of a formative-reflective model characterized by r = 1 and s = 1.

In traditional PLS-PM, LVs are estimated as linear combinations of their MVs, with weights obtained by means of an iterative procedure which takes into account the inner relations of the model, as will now be briefly described (for an up-to-date review, see Tenenhaus et al., 2005).

Outer estimation: The standardized outer estimates of the latent score vectors \xi and \eta, indicated by \hat{\xi}_{outer} and \hat{\eta}_{outer}, respectively, are obtained as linear combinations of their observed MVs, given the arbitrary vectors of outer weights \omega_0 and \upsilon_0:

\hat{\xi}_{outer} \propto X \omega_0  (10)

and

\hat{\eta}_{outer} \propto Y \upsilon_0.  (11)

Inner estimation: The standardized inner estimates of \xi and \eta, indicated by \hat{\xi}_{inner} and \hat{\eta}_{inner}, respectively, are obtained as linear combinations of the standardized outer estimates of their adjacent LVs:

\hat{\xi}_{inner} \propto \hat{\gamma} \hat{\eta}_{outer}  (12)

and

\hat{\eta}_{inner} \propto \hat{\beta} \hat{\xi}_{outer},  (13)

where the inner weights \hat{\gamma} and \hat{\beta} can be alternatively defined by means of the centroid scheme, the factorial scheme or the path weighting scheme.

Estimation modes for the outer weights: The outer weights \omega_0 are updated by means of Mode B (i.e. a multiple regression of the inner estimate \hat{\xi}_{inner} on matrix X), while the weights \upsilon_0 are updated by means of Mode A (simple regressions of the vectors y_h, h = 1, \ldots, q, on the inner estimate \hat{\eta}_{inner}):

\hat{\omega} = (X'X)^{-1} X' \hat{\xi}_{inner}  (14)

and

\hat{\upsilon}' = (\hat{\eta}_{inner}' \hat{\eta}_{inner})^{-1} \hat{\eta}_{inner}' Y = \hat{\eta}_{inner}' Y.  (15)

The procedure is iterated until the algorithm converges, producing

\hat{\xi} \propto X (X'X)^{-1} X' \hat{\eta}  (16)

and

\hat{\eta} \propto Y Y' \hat{\xi}.  (17)

From (16) and (17) it follows that PLS-PM does not respect properties (iv) and (v), given that:

- The final estimate \hat{\xi} in (16) is proportional to the final estimate \hat{\eta}, and this does not respect the causal relation implied in (3). Moreover, the final estimate \hat{\eta} in (17) is a linear combination of the columns of matrix Y, and this does not respect the reflective scheme assumed in (5), although the outer weights, by means of Mode A, preserve this assumption.
- The inner estimate \hat{\eta}_{inner} in (13) does not explicitly take into account the error in equation, i.e. the further components underlying the block of formative MVs, which can be multidimensional (Tenenhaus et al., 2005).
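To make the r = s = 1 iteration above concrete, here is a minimal Python sketch of Eqs. (10)–(15); the function names, starting weights, tolerance and the use of the centroid scheme for the inner weights are our illustrative choices, not prescriptions of the original algorithm.

```python
import numpy as np

def standardize(v):
    """Centre and scale a score vector to unit variance."""
    return (v - v.mean()) / v.std()

def pls_pm_r1_s1(X, Y, tol=1e-10, max_iter=500):
    """PLS-PM for one formative block X and one reflective block Y:
    Mode B outer weights for X, Mode A for Y (Eqs. (10)-(15))."""
    n = X.shape[0]
    omega = np.ones(X.shape[1])          # arbitrary starting weights
    upsilon = np.ones(Y.shape[1])
    xi = standardize(X @ omega)          # outer estimates, Eqs. (10)-(11)
    eta = standardize(Y @ upsilon)
    for _ in range(max_iter):
        xi_old = xi
        c = np.sign(np.corrcoef(xi, eta)[0, 1])  # centroid inner weight
        xi_inner = standardize(c * eta)           # Eq. (12)
        eta_inner = standardize(c * xi)           # Eq. (13)
        omega = np.linalg.solve(X.T @ X, X.T @ xi_inner)  # Mode B, Eq. (14)
        upsilon = Y.T @ eta_inner / n                      # Mode A, Eq. (15)
        xi = standardize(X @ omega)
        eta = standardize(Y @ upsilon)
        if np.max(np.abs(xi - xi_old)) < tol:
            break
    return xi, eta, omega, upsilon
```

With a single pair of adjacent LVs the centroid, factorial and path weighting schemes coincide after standardization, and at convergence the estimates reproduce the proportionality relations (16) and (17).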

It can be observed that the LISREL model (Jöreskog, 1973, 1981, 2000) satisfies properties (ii), (iv) and (v), but the formative scheme assumed in (1) is substituted by a reflective scheme, which implies that properties (i) and (iii) are not respected. In fact, in the LISREL model, instead of (7) we have

X = \Xi \Lambda_X' + \Delta  (18)

and consequently the reduced form of model (7)–(9) is

J = M L_M' + T,  (19)

with a covariance structure equal to

\Sigma_J = L_M \Sigma_M L_M' + \Sigma_T,  (20)

where J = [X, Y], M = [\Xi, Z],

L_M' = \begin{bmatrix} \Lambda_X' & \Gamma (I - B)^{-1} \Lambda_Y' \\ 0 & (I - B)^{-1} \Lambda_Y' \end{bmatrix}

and T = [\Delta, E].

Since no necessary and sufficient conditions for identification are available, Jöreskog (1981) suggests that the identification problem be studied on a case by case basis, examining the equations and choosing the restrictions, not only in number but also in position, in order to obtain unique parameters. Moreover, it has been demonstrated that even if the parameters of the LISREL model are perfectly identified, the scores of the LVs are not unique (Guttman, 1955; Schönemann, 1971; Vittadini, 1989). Given that the infinite scores of each LV in M express the same concept, it would be opportune if these scores were highly and positively correlated. In practice, the correlations between the two maximally different solutions M and M*, given by the diagonal elements of the matrix

2 D_M^{-1/2} \Sigma_M L_M' \Sigma_J^{-1} L_M \Sigma_M D_M^{-1/2} - D_M^{-1/2} \Sigma_M D_M^{-1/2}  (21)

(where D_M is the diagonal matrix of \Sigma_M), can be weak or even negative.

4. The redundancy analysis approach to path modeling

In order to obtain solutions which respect properties (i)–(v), we propose a new approach based on the optimization of the redundancy criterion (Stewart and Love, 1968), which is entitled RA-PM. Traditional redundancy analysis (Van den Wollenberg, 1977) does not calculate errors in equations and endogenous LVs, because it does not refer to structural equation models. RA-PM is an extension of redundancy analysis to a formative-reflective structural equation model with r formative blocks and s reflective blocks. It consists of an iterative procedure which is characterized by the following steps:

1. Estimation of exogenous LVs and initial estimation of endogenous LVs: The initialization of the RA-PM algorithm is given by estimating each exogenous LV as the first redundancy component of the Y variables on the corresponding X block. The choice of the redundancy criterion ensures that the maximum amount of information concerning Y is extracted from each X block and that the causal structure of the model is respected. Next, the initial estimates of the endogenous LVs are obtained by means of a redundancy analysis of each single Y block on the estimates of the exogenous LVs.

2. Estimation of further components and updated estimation of endogenous LVs: The variables within each X block are projected onto the orthogonal complement of the linear space spanned by all the redundancy components extracted previously. Thus, an orthogonal subspace is identified for each X block. These subspaces are referred to as residual X blocks. A second set of components is estimated by means of a redundancy analysis of the Y variables on each residual X block. Next, the endogenous LVs are re-estimated by means of a redundancy analysis of each single Y block on the linear space spanned by the first and the second set of extracted components. This scheme is iterated according to a forward selection procedure, which considers new residual blocks, extracts the corresponding redundancy components and updates the estimates of the endogenous LVs, until the accounted redundancy added at each iteration becomes marginal.

3. Refinement of the solutions: A backward selection procedure on the extracted components is performed in order to identify the minimal subset of components which accounts for a given Y redundancy level. Note that although backward selection requires some amount of computational work, it guarantees that we obtain a more parsimonious model. On the basis of the components which have been selected by backward selection, the final estimates of the endogenous LVs are computed.

4. Estimation of structural parameters and loadings: Finally, the structural parameters of the model are estimated, together with the errors in equations. Moreover, the estimation of the loadings of each y variable on the corresponding endogenous LV is provided.

In the following, we present each step analytically, assuming that the MVs are centred.

4.1. Estimation of exogenous LVs and initial estimation of endogenous LVs

Estimation of exogenous LVs: Each exogenous LV \xi_{(j)}, j = 1, \ldots, r, is estimated as the first redundancy component

\hat{\xi}_{(j)} = X_{(j)} \hat{\omega}_{(j)},  (22)

i.e. the unit variance linear combination of block X_{(j)} which maximizes the redundancy index R^{(\xi)}_{(j)}, defined as

R^{(\xi)}_{(j)} = \sum_{k=1}^{q} \mathrm{cor}(X_{(j)} \omega_{(j)}, y_k)^2 / q.  (23)

In this way, \hat{\omega}_{(j)} is computed as the eigenvector corresponding to the highest eigenvalue of matrix \Sigma_{X_{(j)} X_{(j)}}^{-1} \Sigma_{X_{(j)} Y} \Sigma_{Y X_{(j)}}, under the constraint

\hat{\omega}_{(j)}' \Sigma_{X_{(j)} X_{(j)}} \hat{\omega}_{(j)} = 1,  (24)

in such a way that \hat{\xi}_{(j)} is standardized. We have thus established the following association:

X_{(1)} → \hat{\xi}_{(1)}, \ldots, X_{(r)} → \hat{\xi}_{(r)}.

The estimates of the exogenous LVs are arranged in matrix \hat{\Xi}, defined as

\hat{\Xi} = [\hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)}].  (25)

Initial estimation of endogenous LVs: For each block Y_{(k)}, k = 1, \ldots, s, the endogenous LV \eta_{(k)} is estimated as the first redundancy component \hat{\eta}^{(in)}_{(k)} on the linear space V^{(\Xi)} spanned by the columns of \hat{\Xi}, i.e. the unit variance linear combination of the estimated exogenous LVs which maximizes the redundancy index R^{(\eta_{in})}_{(k)}, defined as

R^{(\eta_{in})}_{(k)} = \sum_{i=1}^{q_k} \mathrm{cor}(\hat{\Xi} \rho_{(k)}, y_{(k)i})^2 / q_k.  (26)

In this way, \hat{\rho}_{(k)} is the eigenvector corresponding to the highest eigenvalue of matrix \Sigma_{\hat{\Xi} \hat{\Xi}}^{-1} \Sigma_{\hat{\Xi} Y_{(k)}} \Sigma_{Y_{(k)} \hat{\Xi}}, under the constraint

\hat{\rho}_{(k)}' \Sigma_{\hat{\Xi} \hat{\Xi}} \hat{\rho}_{(k)} = 1,  (27)

in such a way that \hat{\eta}^{(in)}_{(k)} is standardized. We have thus established the following association:

Y_{(1)} → \hat{\eta}^{(in)}_{(1)}, \ldots, Y_{(s)} → \hat{\eta}^{(in)}_{(s)}.
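As a concrete counterpart of Eqs. (22)–(24), the following sketch (ours) computes a first redundancy component from sample covariances; it assumes centred X columns and standardized Y columns, so that correlations with a unit-variance component reduce to covariances. A pseudo-inverse is used so that the same routine also applies to the rank-deficient residual blocks of Section 4.2.

```python
import numpy as np

def first_redundancy_component(X, Y):
    """First redundancy component of Y on block X (Eqs. (22)-(24)):
    the weight vector is the leading eigenvector of Sxx^{-1} Sxy Syx,
    normalized so that the component has unit variance."""
    n = X.shape[0]
    Sxx = X.T @ X / n                    # Sigma_XX
    Sxy = X.T @ Y / n                    # Sigma_XY
    M = np.linalg.pinv(Sxx) @ Sxy @ Sxy.T
    vals, vecs = np.linalg.eig(M)
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    w = w / np.sqrt(w @ Sxx @ w)         # constraint w' Sxx w = 1, Eq. (24)
    return X @ w, w                      # standardized component and weights
```

The initial endogenous estimates of Eqs. (26)–(27) are obtained by the same routine, called with \hat{\Xi} in place of the X block and the single block Y_{(k)} in place of Y.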

4.2. Estimation of further components and updated estimation of endogenous LVs

Estimation of further components: We define V^{(X)} as the linear space spanned by X_{(1)}, \ldots, X_{(r)}, V^{(\Xi)} as the linear space spanned by \hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)}, and P_V and Q_V = I - P_V as the orthogonal projectors onto V^{(\Xi)} and its orthogonal complement, respectively (here the orthogonal complement is defined in V^{(X)}). For j = 1, \ldots, r, the residual blocks Q_V X_{(j)} are taken into consideration. The second component of block X_{(j)} is estimated as the first redundancy component, i.e. the unit variance linear combination

\hat{\phi}_{(j)} = Q_V X_{(j)} \hat{\psi}_{(j)}, \quad j = 1, \ldots, r,  (28)

which maximizes the redundancy index R^{(\phi)}_{(j)}, defined as

R^{(\phi)}_{(j)} = \sum_{k=1}^{q} \mathrm{cor}(Q_V X_{(j)} \psi_{(j)}, y_k)^2 / q.  (29)

In this way, \hat{\psi}_{(j)} is the eigenvector corresponding to the highest eigenvalue of matrix \Sigma_{Q_V X_{(j)} Q_V X_{(j)}}^{-1} \Sigma_{Q_V X_{(j)} Y} \Sigma_{Y Q_V X_{(j)}}, under the constraint

\hat{\psi}_{(j)}' \Sigma_{Q_V X_{(j)} Q_V X_{(j)}} \hat{\psi}_{(j)} = 1,  (30)

in such a way that \hat{\phi}_{(j)} is standardized. We have thus established the following association:

X_{(1)} → \hat{\xi}_{(1)}, \hat{\phi}_{(1)}, \ldots, X_{(r)} → \hat{\xi}_{(r)}, \hat{\phi}_{(r)}.

It can be observed that the estimates \hat{\phi}_{(1)}, \ldots, \hat{\phi}_{(r)} of the second set of components are not correlated with any of the estimates \hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)} of the exogenous LVs (i.e. the first set of components), being linear combinations of vectors orthogonal to V^{(\Xi)}. In this way, we ensure that the new components share no part of the already extracted redundancy. Next, we arrange the estimates \hat{\phi}_{(1)}, \ldots, \hat{\phi}_{(r)} in matrix \hat{\Phi}, defined as

\hat{\Phi} = [\hat{\phi}_{(1)}, \ldots, \hat{\phi}_{(r)}],  (31)

and introduce matrix \hat{\Pi}, where the first r columns are those of matrix \hat{\Xi} while the remaining columns are those of matrix \hat{\Phi}:

\hat{\Pi} = [\hat{\Xi}, \hat{\Phi}].  (32)

Updated estimation of endogenous LVs: For each block Y_{(k)}, k = 1, \ldots, s, the estimate \hat{\eta}_{(k)} is updated by computing the first redundancy component \hat{\eta}^{(upd)}_{(k)} on the linear space V^{(1)} spanned by the columns of \hat{\Pi}, i.e. the unit variance linear combination of the extracted components which maximizes the redundancy index R^{(\eta_{upd})}_{(k)}, defined as

R^{(\eta_{upd})}_{(k)} = \sum_{i=1}^{q_k} \mathrm{cor}(\hat{\Pi} \theta_{(k)}, y_{(k)i})^2 / q_k.  (33)

In this way, \hat{\theta}_{(k)} is the eigenvector corresponding to the highest eigenvalue of matrix \Sigma_{\hat{\Pi} \hat{\Pi}}^{-1} \Sigma_{\hat{\Pi} Y_{(k)}} \Sigma_{Y_{(k)} \hat{\Pi}}, under the constraint

\hat{\theta}_{(k)}' \Sigma_{\hat{\Pi} \hat{\Pi}} \hat{\theta}_{(k)} = 1,  (34)

in such a way that \hat{\eta}^{(upd)}_{(k)} is standardized. We have thus established the association:

Y_{(1)} → \hat{\eta}^{(upd)}_{(1)}, \ldots, Y_{(s)} → \hat{\eta}^{(upd)}_{(s)}.
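The projection step of Eqs. (28)–(31) then reduces to deflating each X block and reapplying the routine of Section 4.1; the sketch below (ours) reuses first_redundancy_component from the earlier block, whose pseudo-inverse makes it applicable to the rank-deficient residual blocks.

```python
import numpy as np

def second_set_of_components(X_blocks, Y, Xi_hat):
    """Second-iteration components (Eqs. (28)-(31)): deflate each X block
    by the projector onto V = span(Xi_hat), then extract the first
    redundancy component of Y on each residual block."""
    P = Xi_hat @ np.linalg.solve(Xi_hat.T @ Xi_hat, Xi_hat.T)  # P_V
    phis = []
    for Xj in X_blocks:
        Xj_res = Xj - P @ Xj                 # residual block Q_V X_(j)
        phi, _ = first_redundancy_component(Xj_res, Y)
        phis.append(phi)
    return np.column_stack(phis)             # matrix Phi_hat, Eq. (31)
```

Stacking [Xi_hat, Phi_hat] gives \hat{\Pi} of Eq. (32), on which the updated endogenous estimates are computed exactly as in the initial step.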

The above scheme can be iterated in order to provide the estimation of more sets of components \hat{\Phi}^{(2)}, \ldots, \hat{\Phi}^{(p)} and, consequently, to update the estimates \hat{\eta}^{(upd)}_{(1)}, \ldots, \hat{\eta}^{(upd)}_{(s)} of the endogenous LVs. In order to establish the number of components to be extracted and a suitable stopping rule, we adopt a forward selection procedure. This means that the newly extracted components will be utilized for updating the estimates of the endogenous LVs in the following two situations: first, if these new components add a relevant marginal proportion of Y redundancy, or second, if the Y block redundancy accounted for by the updated \hat{\eta}^{(upd)} variables increases in a significant way (see Section 5 for a discussion of the indicators to be used to perform these evaluations).

4.3. Refinement of the solutions

Backward selection of the extracted components: In order to achieve the most parsimonious model, we apply a backward selection procedure on the set of the extracted components: for each block X_{(j)} we eliminate the components \hat{\phi}_{(j)} whose removal has a minor effect on the accounted Y redundancies (a sketch of this step is given at the end of this subsection). This makes it possible to obtain a different number of components for each block X_{(j)}:

X_{(1)} → \hat{\xi}_{(1)}, \hat{\phi}^{(1)}_{(1)}, \ldots, \hat{\phi}^{(v_1)}_{(1)}, \quad \ldots, \quad X_{(r)} → \hat{\xi}_{(r)}, \hat{\phi}^{(1)}_{(r)}, \ldots, \hat{\phi}^{(v_r)}_{(r)},

where v_j, j = 1, \ldots, r, is the number of selected components for block X_{(j)}. The estimates of the selected components are arranged in matrix \hat{\Phi}:

\hat{\Phi} = [\hat{\phi}^{(1)}_{(1)}, \ldots, \hat{\phi}^{(1)}_{(r)}, \ldots, \hat{\phi}^{(v_1)}_{(1)}, \ldots, \hat{\phi}^{(v_r)}_{(r)}].  (35)

We then introduce matrix \hat{\Pi}, whose first r columns are those of matrix \hat{\Xi} while the remaining columns are those of matrix \hat{\Phi}:

\hat{\Pi} = [\hat{\Xi}, \hat{\Phi}];  (36)

thus the columns of \hat{\Pi} contain all the components which have been selected by means of backward selection. Finally, we indicate by V^{(\Pi)} the linear space spanned by the columns of \hat{\Pi}.

Final estimation of endogenous LVs: For each block Y_{(k)}, k = 1, \ldots, s, the final estimate \hat{\eta}_{(k)} is obtained as the first redundancy component on the linear subspace V^{(\Pi)}, i.e. the unit variance linear combination of the selected components which maximizes the redundancy index R^{(\eta)}_{(k)}, defined as

R^{(\eta)}_{(k)} = \sum_{i=1}^{q_k} \mathrm{cor}(\hat{\Pi} \theta_{(k)}, y_{(k)i})^2 / q_k.  (37)

In this way, \hat{\theta}_{(k)} is the eigenvector corresponding to the highest eigenvalue of matrix \Sigma_{\hat{\Pi} \hat{\Pi}}^{-1} \Sigma_{\hat{\Pi} Y_{(k)}} \Sigma_{Y_{(k)} \hat{\Pi}}, under the constraint

\hat{\theta}_{(k)}' \Sigma_{\hat{\Pi} \hat{\Pi}} \hat{\theta}_{(k)} = 1,  (38)

in such a way that \hat{\eta}_{(k)} is standardized. We have thus established the final association:

Y_{(1)} → \hat{\eta}_{(1)}, \ldots, Y_{(s)} → \hat{\eta}_{(s)}.
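A possible sketch of the backward step, under two simplifying assumptions of ours: components are selected globally rather than block by block, and the accounted Y redundancy is measured by the trace criterion introduced later in Section 5.

```python
import numpy as np

def backward_select(Pi_hat, Y, target):
    """Greedy backward selection (Section 4.3): repeatedly drop the
    column of Pi_hat whose removal costs the least accounted Y
    redundancy, as long as the target redundancy level is still met."""
    n = Pi_hat.shape[0]

    def accounted_redundancy(cols):
        P = Pi_hat[:, cols]
        Spp = P.T @ P / n
        Spy = P.T @ Y / n
        return np.trace(np.linalg.solve(Spp, Spy @ Spy.T)) / np.trace(Y.T @ Y / n)

    cols = list(range(Pi_hat.shape[1]))
    while len(cols) > 1:
        # best single-column removal, ranked by the remaining redundancy
        red, j = max((accounted_redundancy([c for c in cols if c != j]), j)
                     for j in cols)
        if red < target:
            break                 # dropping anything more loses too much
        cols.remove(j)
    return cols                   # indices of the retained components
```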

If we arrange the final estimates \hat{\eta}_{(1)}, \ldots, \hat{\eta}_{(s)} in matrix \hat{H} = [\hat{\eta}_{(1)}, \ldots, \hat{\eta}_{(s)}] and the corresponding eigenvectors \hat{\theta}_{(1)}, \ldots, \hat{\theta}_{(s)} in matrix \hat{\Theta} = [\hat{\theta}_{(1)}, \ldots, \hat{\theta}_{(s)}], we can express the final estimates of the endogenous LVs as

\hat{H} = \hat{\Pi} \hat{\Theta} = \hat{\Xi} \hat{\Theta}_1 + \hat{\Phi} \hat{\Theta}_2,  (39)

where \hat{\Theta}_1 is the matrix obtained from \hat{\Theta} by retaining only the first r rows and \hat{\Theta}_2 is the matrix obtained from \hat{\Theta} by retaining only the rows after the rth one.

4.4. Structural parameters and loadings

Structural parameters and errors in equations: The estimate \hat{\gamma}_{(1)} of the structural parameter \gamma_{(1)} and the estimate \hat{\zeta}_{(1)} of the error in equation \zeta_{(1)} are obtained by means of a regression of \hat{\eta}_{(1)} on \hat{\Xi}:

\hat{\eta}_{(1)} = \hat{\Xi} \hat{\gamma}_{(1)} + \hat{\zeta}_{(1)},  (40)

where \hat{\gamma}_{(1)} is the first column of matrix \hat{\Theta}_1 and \hat{\zeta}_{(1)} can be decomposed as \hat{\zeta}_{(1)} = \hat{\Phi} \hat{\alpha}_{(1)}, with \hat{\alpha}_{(1)} the first column of \hat{\Theta}_2.

Under the assumption that \hat{\eta}_{(1)} and the columns \hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)} of \hat{\Xi} form a linearly independent set of regressors, the estimates \hat{\gamma}_{(2)} and \hat{\zeta}_{(2)} are obtained by means of a regression of \hat{\eta}_{(2)} on \hat{\eta}_{(1)} and \hat{\Xi}:

\hat{\eta}_{(2)} = \hat{\eta}_{(1)} \hat{b}_{(2)} + \hat{\Xi} \hat{\gamma}_{(2)} + \hat{\zeta}_{(2)},  (41)

where \hat{\zeta}_{(2)} can be decomposed as \hat{\zeta}_{(2)} = \hat{\Phi} \hat{\alpha}_{(2)}. For k = 3, \ldots, s, under the assumption that the columns \hat{\eta}_{(1)}, \ldots, \hat{\eta}_{(k-1)} of matrix \hat{H}_{(k-1)} and the columns of \hat{\Xi} form a linearly independent set of regressors, the estimates \hat{\gamma}_{(k)} and \hat{\zeta}_{(k)} are obtained by means of a regression of \hat{\eta}_{(k)} on \hat{H}_{(k-1)} and \hat{\Xi}:

\hat{\eta}_{(k)} = \hat{H}_{(k-1)} \hat{b}_{(k)} + \hat{\Xi} \hat{\gamma}_{(k)} + \hat{\zeta}_{(k)}, \quad k = 3, \ldots, s,  (42)

where \hat{\zeta}_{(k)} can be decomposed as \hat{\zeta}_{(k)} = \hat{\Phi} \hat{\alpha}_{(k)}. We now define \hat{\beta}_{(1)} as a zero vector and \hat{\beta}_{(k)}, k = 2, \ldots, s, by means of the following association:

\hat{\beta}_{(k)i} = \hat{b}_{(k)i}, \quad 1 \le i \le k - 1, \qquad \hat{\beta}_{(k)i} = 0, \quad k \le i \le s,  (43)

where \hat{b}_{(k)i} is the ith component of vector \hat{b}_{(k)}. If we consider the vectors \hat{\beta}_{(k)}, \hat{\gamma}_{(k)} and \hat{\zeta}_{(k)}, k = 1, \ldots, s, as the kth columns of matrices \hat{B}, \hat{\Gamma} and \hat{Z}, respectively, we can express the final estimates of the endogenous LVs in the form of Eq. (9):

\hat{H} = \hat{H} \hat{B} + \hat{\Xi} \hat{\Gamma} + \hat{Z}.  (44)

Loadings: For each variable y_{(k)i}, k = 1, \ldots, s, i = 1, \ldots, q_k, the loading on the corresponding estimate \hat{\eta}_{(k)} is estimated as

\hat{\lambda}_{(k)i} = (\hat{\eta}_{(k)}' \hat{\eta}_{(k)})^{-1} \hat{\eta}_{(k)}' y_{(k)i}.  (45)
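Given the estimates \hat{\Xi} and \hat{H}, Eqs. (40)–(45) amount to a cascade of ordinary least squares regressions with an upper-triangular path matrix; a sketch, with function names of ours and assuming the linearly independent regressors stated above:

```python
import numpy as np

def structural_parameters(Xi_hat, H_hat):
    """Path coefficients and errors in equations (Eqs. (40)-(44)):
    each eta_hat_(k) is regressed on the previously estimated endogenous
    LVs and on Xi_hat, so B_hat comes out upper triangular (Eq. (43))."""
    n, s = H_hat.shape
    r = Xi_hat.shape[1]
    B, Gamma, Z = np.zeros((s, s)), np.zeros((r, s)), np.zeros((n, s))
    for k in range(s):
        R = np.hstack([H_hat[:, :k], Xi_hat])      # regressors H_(k-1), Xi
        coef, *_ = np.linalg.lstsq(R, H_hat[:, k], rcond=None)
        B[:k, k] = coef[:k]                        # b_(k) -> beta_(k)
        Gamma[:, k] = coef[k:]                     # gamma_(k)
        Z[:, k] = H_hat[:, k] - R @ coef           # zeta_(k)
    return B, Gamma, Z

def loadings(Y_blocks, H_hat):
    """Loading of each reflective variable on its endogenous LV (Eq. (45))."""
    return [Yk.T @ H_hat[:, k] / (H_hat[:, k] @ H_hat[:, k])
            for k, Yk in enumerate(Y_blocks)]
```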

5. Model validation

At this point, an important question is raised: does the introduction of the errors in equations improve the model? And if so, to what degree? In order to evaluate the effectiveness of RA-PM as compared to traditional PLS-PM, we utilize the redundancy criterion for both the exogenous and the endogenous LVs.

An absolute measure of goodness of fit for the estimates of the exogenous LVs is the accounted redundancy associated with the linear subspace V^{(\Xi)} with respect to Y:

Red(Y \mid V^{(\Xi)}) = Tr(\Sigma_{\hat{\Xi} \hat{\Xi}}^{-1} \Sigma_{\hat{\Xi} Y} \Sigma_{Y \hat{\Xi}}) / Tr(\Sigma_{YY}).  (46)

By dividing Red(Y | V^{(\Xi)}) by its theoretical maximum, we obtain a relative measure of goodness of fit:

Red_{REL}(Y \mid V^{(\Xi)}) = Tr(\Sigma_{\hat{\Xi} \hat{\Xi}}^{-1} \Sigma_{\hat{\Xi} Y} \Sigma_{Y \hat{\Xi}}) / Tr(\Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YX}).  (47)

If

Red(Y \mid X) = Tr(\Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YX}) / Tr(\Sigma_{YY})  (48)

is the redundancy accounted for by X, from (46) and (47) we have

Red(Y \mid V^{(\Xi)}) = Red_{REL}(Y \mid V^{(\Xi)}) \cdot Red(Y \mid X).  (49)

Note that an analogous measure of goodness of fit may be calculated for the extracted components, thus representing the accounted redundancy associated with the linear subspace V^{(\Pi)} with respect to Y.

A measure of goodness of fit for the estimates of the endogenous LVs is the redundancy accounted for by \hat{\eta}_{(k)} with respect to the single block Y_{(k)}:

Red(Y_{(k)} \mid \hat{\eta}_{(k)}) = \hat{\eta}_{(k)}' Y_{(k)} Y_{(k)}' \hat{\eta}_{(k)} / Tr(Y_{(k)}' Y_{(k)}), \quad k = 1, \ldots, s.  (50)
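The indices (46)–(49) can be computed directly from sample covariance matrices; in the following sketch (names ours, centred data assumed), the identity (49) holds by construction.

```python
import numpy as np

def validation_indices(X, Y, Xi_hat):
    """Absolute and relative redundancy measures of Eqs. (46)-(49)."""
    n = X.shape[0]
    Syy = Y.T @ Y / n
    Sxx, Sxy = X.T @ X / n, X.T @ Y / n
    See = Xi_hat.T @ Xi_hat / n
    Sey = Xi_hat.T @ Y / n
    acc = np.trace(np.linalg.solve(See, Sey @ Sey.T))  # redundancy of V_Xi
    tot = np.trace(np.linalg.solve(Sxx, Sxy @ Sxy.T))  # theoretical maximum
    red_abs = acc / np.trace(Syy)     # Red(Y | V_Xi), Eq. (46)
    red_rel = acc / tot               # Red_REL(Y | V_Xi), Eq. (47)
    red_X = tot / np.trace(Syy)       # Red(Y | X), Eq. (48)
    return red_abs, red_rel, red_X    # red_abs == red_rel * red_X, Eq. (49)
```

Up to the normalization of \hat{\eta}_{(k)}, the block-wise measure (50) is the average squared correlation between \hat{\eta}_{(k)} and the MVs of its block.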

6. RA-PM properties

6.1. The utilization of residual information in the formative model

Tenenhaus et al. (2005) state that in a formative model the block of MVs can be multidimensional. In practice, multidimensionality can be verified by means of various indicators (e.g. Goldstein–Dillon's ρ, Cronbach's α or Principal Component Analysis). In these circumstances, in order to extract the residual information in the data, more redundancy components need to be considered. In RA-PM, this is achieved by an iterative algorithm that extracts the most informative components and generates a suitable orthogonal decomposition of the linear spaces spanned by the MVs in each X block. In practice, on the basis of the redundancy criterion, the algorithm starts by extracting the most informative component within each X block. By means of suitable orthogonal projections, it then identifies the residual subspaces (one for each X block), i.e. those containing information which has not yet been extracted. At each successive iteration, the most informative component within each residual subspace is extracted and new residual subspaces are obtained. The algorithm stops when all the relevant residual information about Y contained in the X blocks has been extracted.

6.2. Algebraic description and optimality

In order to justify the backward selection procedure and demonstrate the optimality of the RA-PM algorithm, we adopt an algebraic perspective based on the orthogonal decomposition of the linear space V^{(X)} induced by the iterative extraction of the redundancy components.

By performing the estimation of the exogenous LVs, we identify the linear subspace V^{(\Xi)} \subset V^{(X)}, defined as the linear span of \hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)}. Similarly, when a second set of components is extracted, the linear subspace V^{(1)} \subset V^{(X)} is identified as the linear span of \hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)}, \hat{\phi}_{(1)}, \ldots, \hat{\phi}_{(r)}. The components which are extracted at different iterations are uncorrelated; thus the linear subspace W^{(1)} = span(\hat{\phi}_{(1)}, \ldots, \hat{\phi}_{(r)}) is orthogonal to V^{(\Xi)} and the linear subspace V^{(1)} can be expressed as

V^{(1)} = V^{(\Xi)} \oplus W^{(1)}.  (51)

As new sets of redundancy components are extracted, a chain of subspaces V^{(\Xi)} \subset V^{(1)} \subset \cdots \subset V^{(l)} \subseteq V^{(X)}, l = \max(v_1, \ldots, v_r), is generated. From the above discussion it can be deduced that

V^{(k)} = V^{(k-1)} \oplus W^{(k)}, \quad k = 1, \ldots, l,  (52)

where W^{(k)} is the linear space spanned by the redundancy components extracted during iteration k, and that V^{(X)} can be decomposed as

V^{(X)} = V^{(\Xi)} \oplus W^{(1)} \oplus \cdots \oplus W^{(l)}.  (53)

At this point, if we halt the extractions at iteration \hat{l} \le l according to forward selection (i.e. when the marginal contribution to the accounted redundancies due to W^{(\hat{l}+1)} is negligible), V^{(X)} can be decomposed as

V^{(X)} = V^{(\hat{l})} \oplus V^{(\hat{l})\perp}.  (54)

If we then establish a target level of Y redundancy to be accounted for by the extracted components, backward selection provides the removal of the less contributing components, reducing V^{(\hat{l})} to a subspace V^{(\Pi)} of minimum dimension accounting for the target redundancy level. In this sense, the procedure of estimation and selection of the components (i.e. the procedure of identification of the subspace V^{(\Pi)}) is optimal. Based on this selection, the computation of the final estimates of the endogenous LVs fulfills the optimality criterion of redundancy analysis. Hence, we can affirm that the proposed estimation procedure for the LVs is optimal.

6.3. Features of RA-PM estimates

As a consequence of the adoption of the RA-PM algorithm, the following algebraic and geometrical features hold:

(i) Once the exogenous and endogenous LVs have been estimated, the estimates \hat{B} and \hat{\Gamma} of the structural parameters and the estimates \hat{\Lambda}_Y of the loadings are univocally determined.

(ii) The estimates \hat{Z} of the errors in equations are linear combinations of the estimates \hat{\Phi} of the further components, which are uncorrelated with the estimates \hat{\Xi} of the exogenous LVs; it follows that

cov(\hat{\xi}_{(j)}, \hat{\zeta}_{(k)}) = 0, \quad j = 1, \ldots, r, \quad k = 1, \ldots, s,  (55)

meaning that the estimates \hat{Z} share no information with the estimates \hat{\Xi}.

(iii) The estimates \hat{\Xi} of the exogenous LVs are univocally defined as linear combinations of the X variables; the estimates \hat{H} of the endogenous LVs and the estimates \hat{Z} of the errors in equations are univocally defined as linear combinations of the estimates \hat{\Xi} and \hat{\Phi}. Moreover, the rank of the set \{\hat{\xi}_{(1)}, \ldots, \hat{\xi}_{(r)}, \hat{\zeta}_{(1)}, \ldots, \hat{\zeta}_{(s)}\} does not exceed the rank of X.

(iv) The solutions respect the causal structure of the model. In fact, in the RA-PM algorithm the estimates \hat{\Xi} are linear combinations of the X variables, the estimates \hat{Z} are linear combinations of the estimates \hat{\Phi}, and the estimates \hat{H} are linear combinations of the estimates \hat{\Xi} and \hat{Z}.

(v) The errors in equations are taken into consideration for the estimation of the endogenous LVs, improving their redundancy accounting capability.

7. The HC model

7.1. The variables

We will now briefly describe the MVs involved in the HC model introduced in Section 1. The data regard n = 1936 Italian families and are drawn from the survey on Italian Household Budgets performed by Banca d'Italia (2002). As indicated in Table 1, the formative MVs associated with the two exogenous LVs regard variables pertaining to education and work experience. Similarly, the reflective MVs associated with the two endogenous LVs regard the indicators of earned and property income. For each family in the survey, some of the above variables are measured for both household head (H) and spouse (S).

Table 1. LVs and MVs in the HC model

Exogenous LVs: Educational human capital (ξ_1); Work experience human capital (ξ_2)
Endogenous LVs: Ability to generate earned income (η_1); Ability to generate property income (η_2)
Formative MVs, Educational indicators: H years of schooling (z_1); S years of schooling (z_2); H father's years of schooling (z_3); H mother's years of schooling (z_4); S father's years of schooling (z_5); S mother's years of schooling (z_6); Household real assets (z_7); Household financial assets (z_8); Household total debt (z_9)
Formative MVs, Job indicators: H age of entrance in the labour market (x_1); S age of entrance in the labour market (x_2); H number of years of full time job (x_3); S number of years of full time job (x_4); H number of years of part time job (x_5); S number of years of part time job (x_6)
Reflective MVs, Earned income: Household net disposable earned income (wages and salaries) (y_1); Household pensions and net transfers (y_2); Household net income from self-employment (y_3)
Reflective MVs, Property income: Household income from property (y_4); Household financial income (y_5)

7.2. The model estimation

Before estimating the exogenous and endogenous LVs involved in the HC model, we evaluated the redundancy accounting capability of X with respect to Y, Y_{(1)} and Y_{(2)}. The results are reported in Table 2.

Table 2. Percentage of Y, Y_{(1)} and Y_{(2)} redundancy accounted for by X: Red(Y | X), Red(Y_{(1)} | X) and Red(Y_{(2)} | X).

We then analysed the redundancy structure of the data. It is known that each eigenvalue of matrix \Sigma_{P_{X_{(j)}} Y, P_{X_{(j)}} Y} is associated with a redundancy component of X_{(j)} with respect to Y (D'Ambra and Lauro, 1982); the analysis of these eigenvalues makes it possible to formulate some hypotheses on the optimal number of redundancy components to be extracted (see Table 3).
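This diagnostic is simple to reproduce; in the sketch below (function name ours), the eigenvalues of the covariance matrix of Y projected onto the span of a block X_{(j)} are returned in decreasing order. A sharp drop after the first eigenvalue suggests that one redundancy component suffices for that block, while several non-negligible eigenvalues call for further components.

```python
import numpy as np

def redundancy_eigenvalues(Xj, Y, k=5):
    """First k eigenvalues of the covariance of P_X(j) Y, i.e. of Y
    projected onto span(X_(j)) (D'Ambra and Lauro, 1982)."""
    n = Xj.shape[0]
    P = Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T)  # projector onto span(X_(j))
    PY = P @ Y
    vals = np.linalg.eigvalsh(PY.T @ PY / n)   # ascending order
    return vals[::-1][:k]                      # k largest eigenvalues
```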

Table 3. First five eigenvalues for the X_{(1)} and X_{(2)} blocks.

It is evident that it is possible to extract only one redundancy component from block X_{(1)} and three redundancy components from block X_{(2)}. Note that if we do not use the components beyond the first extraction, we lose a relevant amount of information, as is confirmed in Table 4, which reports the accounted Y redundancy corresponding to different iterations of the algorithm. As the number of iterations grows, the accounted redundancy increases; however, the rate of increase slows down after five iterations. (Note that we have indicated by V^{(\Psi)} the linear space spanned by the extracted components.)

Table 4. Percentage of Y redundancy accounted for by the extracted components at different iterations: Red(Y | V^{(\Psi)}) and Red_{REL}(Y | V^{(\Psi)}).

At each iteration we computed the estimates of the endogenous LVs and the corresponding accounted Y block redundancy (see Table 5). The relevance of the residuals (taken into account by means of the components extracted after the first iteration) in improving the accounted redundancy is highly evident, particularly for the estimate of η_{(2)}.

Table 5. Percentage of Y_{(1)} and Y_{(2)} accounted redundancy by \hat{\eta}_{(1)} and \hat{\eta}_{(2)} at different iterations

Iteration   Estimate                  Red(Y_{(1)} | \hat{\eta}_{(1)})   Estimate                  Red(Y_{(2)} | \hat{\eta}_{(2)})
1           \hat{\eta}^{(in)}_{(1)}    18.50                             \hat{\eta}^{(in)}_{(2)}
2           \hat{\eta}^{(upd)}_{(1)}   18.96                             \hat{\eta}^{(upd)}_{(2)}
3           \hat{\eta}^{(upd)}_{(1)}   19.30                             \hat{\eta}^{(upd)}_{(2)}   38.49

By performing a backward selection procedure, and removing the less contributing components, we ultimately selected \hat{\xi}_{(1)}, the first component from X_{(1)}, together with \hat{\xi}_{(2)}, \hat{\phi}^{(1)}_{(2)} and \hat{\phi}^{(2)}_{(2)} (i.e. the first three components from X_{(2)}). Table 6 reports the redundancy accounted for by the selected components and the final estimates of the endogenous LVs. (We remind the reader that V^{(\Pi)} indicates the linear space spanned by the selected components.)

Table 6. Percentage of Y, Y_{(1)} and Y_{(2)} redundancy accounted for by the RA-PM estimates: Red(Y | V^{(\Pi)}), Red(Y_{(1)} | \hat{\eta}_{(1)}) and Red(Y_{(2)} | \hat{\eta}_{(2)}).

7.3. Comparison with PLS path modeling and LISREL model

The HC data set has also been analysed by means of PLS-PM. The redundancy of Y, Y_{(1)} and Y_{(2)} which is accounted for by the PLS-PM estimates is reported in Table 7.

Table 7. Percentage of Y, Y_{(1)} and Y_{(2)} redundancy accounted for by the PLS-PM estimates: Red(Y | \hat{\xi}^{PLS}_{(1)}, \hat{\xi}^{PLS}_{(2)}), Red(Y_{(1)} | \hat{\eta}^{PLS}_{(1)}) and Red(Y_{(2)} | \hat{\eta}^{PLS}_{(2)}).

From the above, it is evident that, when only the first extraction is performed, \hat{\xi}^{PLS}_{(1)} and \hat{\xi}^{PLS}_{(2)} account for approximately the same Y redundancy as RA-PM (21.02%, see Table 4). As more components are estimated, the accounted Y redundancy in RA-PM overcomes that accounted for by PLS-PM; this is particularly true for the selected components. It should be noted that the total Y redundancy accounted for by \hat{\eta}^{PLS}_{(1)} and \hat{\eta}^{PLS}_{(2)} can be directly computed as 40.62%, while the total Y redundancy accounted for by the X variables is 34.51% (see Table 2). This demonstrates that a part of the accounting capability of PLS-PM is derived from the Y variables themselves and not from the X variables.

Before interpreting the results, we verified the consequences of the non-uniqueness of the scores of the LVs and of the errors in equations in the LISREL model. In Table 8, it can be seen that the goodness of fit of the LISREL model is high, but the correlations between the two maximally different solutions regarding the same identified model for HC are far from 1 and also negative.

Table 8. Jöreskog goodness of fit index (GFI) and minimum correlations for \hat{\xi}^{Lsr}_{(1)}, \hat{\xi}^{Lsr}_{(2)}, \hat{\zeta}^{Lsr}_{(1)} and \hat{\zeta}^{Lsr}_{(2)}.

7.4. Interpretation of model parameters

The outer weights of the exogenous LV estimates (see Table 9) are in agreement with the knowledge of HC (Vittadini and Lovaglio, 2004); in fact, \hat{\xi}_{(1)} is correlated with both the years of schooling and the household real assets (it thus represents the educational HC), and \hat{\xi}_{(2)} is correlated with the years of work (it thus represents the work experience HC).

Table 9. RA-PM outer weights of z_1, \ldots, z_9 on \hat{\xi}_{(1)} and of x_1, \ldots, x_6 on \hat{\xi}_{(2)}.

The structural parameters are reported in Table 10; \hat{\eta}_{(1)} expresses a contrast between educational HC and work experience HC, according to the Italian situation; \hat{\eta}_{(2)} presents structural parameters of different signs, but it is positively correlated with \hat{\xi}_{(1)}. The loadings of each y variable are reported in Table 11 and can be seen to be in agreement with a model for HC.

Table 10. RA-PM structural parameters (columns \hat{\eta}_{(1)} and \hat{\eta}_{(2)}; rows \hat{\xi}_{(1)}, \hat{\xi}_{(2)} and \hat{\eta}_{(1)}).

Table 11. RA-PM loadings of y_{(1)1}, y_{(1)2}, y_{(1)3} on \hat{\eta}_{(1)} and of y_{(2)1}, y_{(2)2} on \hat{\eta}_{(2)}.

8. RA-PM simulation study

8.1. The simulation experiment

In order to illustrate the effectiveness of RA-PM, we have performed a simulation study based on model (7)–(9), with r = 2, s = 2, p = 8, q = 8 and n = 1000. The purpose of the simulation is to show how taking into account redundancy components beyond the first extraction can improve the redundancy accounted for by the endogenous LV estimates, particularly when the covariance structure within and between the formative and the reflective MVs does not allow for a unidimensional description of the variables.

To this aim, we have first defined two reference situations, indicated by A and K. In situation A, the variables in the Y_{(1)} block are correlated with those in the X_{(1)} block alone and, similarly, the variables in the Y_{(2)} block are only correlated with those in the X_{(2)} block; moreover, the variables in the X_{(1)} block are not correlated with the variables in the X_{(2)} block. On the contrary, in situation K there are cross correlations between the variables in the X and the Y blocks, and there are correlations among some of the variables in the X_{(1)} block and some of the variables in the X_{(2)} block.

Situation A is transformed into situation K through nine intermediate situations (indicated by B–J). The transformation is driven by a parameter which progressively transforms \Sigma^{(A)}_{XX} and \Sigma^{(A)}_{XY} into \Sigma^{(K)}_{XX} and \Sigma^{(K)}_{XY}. More specifically, we first defined two lower triangular matrices U^{(A)} and U^{(K)} such that the matrices

\Sigma^{(A)}_{XX} = U^{(A)} U^{(A)\prime} \quad \text{and} \quad \Sigma^{(K)}_{XX} = U^{(K)} U^{(K)\prime}  (56)

reproduced the desired covariance structure for the two reference situations. Starting from U^{(A)} and U^{(K)}, we introduced a sequence of nine intermediate lower triangular matrices defined as

U^{(t)} = (1 - t) U^{(A)} + t U^{(K)}, \quad t = 0.1, 0.2, \ldots, 0.9.  (57)

After obtaining the sequence \{U^{(t)}\}, the corresponding sequence \{\Sigma^{(t)}_{XX}\} was computed and nine data sets of X variables, distributed as N(0, \Sigma^{(t)}_{XX}), were generated. In this paper we report the results for only the nine intermediate situations (from B to J), given that the reference situations were used only as a starting point for the generation of the other data sets.
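The interpolation scheme of Eqs. (56)–(57) can be reproduced as follows (function name and random seed ours); each intermediate value of t yields one X data set drawn from N(0, \Sigma^{(t)}_{XX}).

```python
import numpy as np

def simulate_X_datasets(U_A, U_K, n=1000, seed=0):
    """Generate the nine intermediate X data sets of Section 8.1:
    U(t) = (1 - t) U(A) + t U(K) (Eq. (57)), Sigma_XX(t) = U(t) U(t)'."""
    rng = np.random.default_rng(seed)
    datasets = {}
    for i in range(9):                          # situations B, ..., J
        t = 0.1 * (i + 1)
        U_t = (1 - t) * U_A + t * U_K           # Eq. (57)
        Sigma_t = U_t @ U_t.T                   # cf. Eq. (56)
        X = rng.multivariate_normal(np.zeros(len(Sigma_t)), Sigma_t, size=n)
        datasets[chr(ord("B") + i)] = X
    return datasets
```

Note that \Sigma^{(t)}_{XX} = U^{(t)} U^{(t)\prime} is positive semidefinite by construction, so every intermediate situation has a valid covariance matrix.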

8.2. Discussion of the results

For each simulation case, the eigenvalues of the covariance matrices \Sigma_{P_{X_{(1)}} Y, P_{X_{(1)}} Y} and \Sigma_{P_{X_{(2)}} Y, P_{X_{(2)}} Y} indicate the number of relevant redundancy components within the X variables (see Table 12). In both cases, it is evident that, moving from situation B to situation J, a second component is progressively revealed. From an analytical point of view, this is due to the fact that the eigenvalues of a matrix are a continuous function of the matrix elements. In the simulation, the eigenvalues are a continuous function of the parameter t, which drives the computation of the covariance matrices from situation A, where there is only one relevant eigenvalue, to situation K, where there are two relevant eigenvalues.

Table 12. First five eigenvalues/trace (percentage) for the X_{(1)} and X_{(2)} blocks (situations B–J).

Table 13 reports the results of a redundancy analysis performed on the variables involved in the simulation. It can be seen that the redundancy accounting capability of the X variables concerning Y, Y_{(1)} and Y_{(2)} decreases as we leave situation B, reaches a minimum in the middle cases, and then increases when approaching situation J; on the contrary, the Y redundancy accounting capability of X_{(1)} and X_{(2)} increases progressively from situation B to J.

Table 13. Percentage of accounted redundancy (situations B–J): Red(Y | X), Red(Y | X_{(1)}), Red(Y | X_{(2)}), Red(Y_{(1)} | X) and Red(Y_{(2)} | X).

We ran RA-PM on the nine data sets, extracting a pair of components (i.e. one component from each X block) as the number of iterations grows. Table 14 reports the Y redundancy accounted for by the pairs of extracted components at each iteration. We note that the closer situation J is approached, the less the first extraction (i.e. the estimation of the two exogenous LVs alone) accounts for the Y redundancy. This justifies the extraction of more redundancy components. In order to show the relevance of the components beyond the first extraction, Table 15 reports the Y_{(1)} and Y_{(2)} redundancy accounted for by both the initial estimates \hat{\eta}^{(in)}_{(1)}, \hat{\eta}^{(in)}_{(2)} and the final estimates \hat{\eta}_{(1)}, \hat{\eta}_{(2)}. In the last three situations the improvement is remarkable, as expected by considering the redundancy structure of the simulation cases.

8.3. Comparison with PLS path modeling and LISREL model

The simulation data sets have also been analysed by means of PLS-PM. The redundancy of Y, Y_{(1)} and Y_{(2)} which is accounted for by the PLS-PM estimates in situations B–J is reported in Table 16. If in RA-PM only the first iteration is performed, the amount of Y redundancy accounted for by the estimates of the two exogenous LVs is similar in the two approaches. However, if a second iteration is performed, the Y redundancy

accounted for by the RA-PM estimates becomes higher than in PLS-PM in all simulation cases.

Table 14. Percentage of Y redundancy accounted for by the pairs of extracted components (situations B–J).

Table 15. Percentage of Y_{(1)} and Y_{(2)} redundancy accounted for by \hat{\eta}^{(in)}_{(1)}, \hat{\eta}^{(in)}_{(2)} and by \hat{\eta}_{(1)}, \hat{\eta}_{(2)} (situations B–J).

Table 16. Percentage of Y, Y_{(1)} and Y_{(2)} redundancy accounted for by the PLS-PM estimates (situations B–J): Red(Y | \hat{\xi}^{PLS}_{(1)}, \hat{\xi}^{PLS}_{(2)}), Red(Y_{(1)} | \hat{\eta}^{PLS}_{(1)}) and Red(Y_{(2)} | \hat{\eta}^{PLS}_{(2)}).

Concerning the estimates of the endogenous LVs, a direct comparison between RA-PM and PLS-PM is more problematic, because these two approaches estimate the endogenous LVs in extremely different ways. In fact, while \hat{\eta}^{PLS}_{(1)} and \hat{\eta}^{PLS}_{(2)} are expressed as linear combinations of the Y variables, the corresponding RA-PM estimates are expressed as linear combinations of the X variables. Computing the redundancies of the Y blocks in PLS-PM, we see that \hat{\eta}^{PLS}_{(1)} accounts for a redundancy comparable with that accounted for by \hat{\eta}_{(1)} in RA-PM. However, \hat{\eta}^{PLS}_{(2)} reduces its accounting capability as it approaches situation J, where it accounts for 7.65%.

We conclude by noting that when the Y redundancy structure of the formative blocks is not unidimensional, the addition of more redundancy components really improves the redundancy accounting capability of the RA-PM estimates of the endogenous LVs. We maintain that the PLS-PM approach could also benefit from considering more

components. This would increase the redundancy accounting capability of the PLS-PM estimates of the endogenous LVs and also match the formative scheme of the models underlying the data, as in the examples we have discussed in this paper.

Finally, we have performed a LISREL model analysis on each simulation case. Referring to Table 17, it can be seen that the goodness of fit is high in every situation; however, the correlations between the two maximally different solutions of the same identified models are in many cases far from 1.

Table 17. Jöreskog goodness of fit index (GFI) and minimum correlations for \hat{\xi}^{Lsr}_{(1)}, \hat{\xi}^{Lsr}_{(2)}, \hat{\zeta}^{Lsr}_{(1)} and \hat{\zeta}^{Lsr}_{(2)} (situations B–J).

9. Conclusions and further research

The paper proposes a new approach for the estimation of a structural equation model with a formative-reflective scheme. The proposal is entitled RA-PM (redundancy analysis approach to path modeling) and consists of an alternative to the PLS-PM approach introduced by Lohmöller (1989). The new RA-PM algorithm provides unique solutions for both the LVs and the parameters, preserving the causal structure of the model and utilizing all the relevant information in the formative MVs. The solutions are obtained by means of an iterative procedure which optimizes the redundancy criterion of Stewart and Love (1968).

Future developments will address the extension of the proposed methodology to the case of categorical MVs, based on the non-symmetric correspondence analysis quantification. In fact, it has been demonstrated that a relationship exists between redundancy analysis and non-symmetric correspondence analysis (Lauro and D'Ambra, 1984). As a consequence, it is possible to consider in the model both simple and interaction effects among categorical variables. Moreover, given that redundancy analysis represents a special case of PLS-PM with two sets of variables in the formative-reflective scheme (Tenenhaus et al., 2005), the iterative steps of PLS regression can be successfully introduced in the newly proposed algorithm in order to overcome the problems related to missing data, multicollinearity and flat tables.

Acknowledgements

The authors would like to express their thanks to the referees for their valuable remarks and comments.

References

Banca d'Italia, 2002. I bilanci delle famiglie italiane nell'anno 2000. Supplemento al Bollettino Statistico, Banca d'Italia, Roma.
D'Ambra, L., Lauro, C.N., 1982. A principal component analysis with respect to a reference subspace. Statist. Appl. 15.
Guttman, L., 1955. The determinacy of factor score matrices with implications for five other basic problems of common factor theory. British J. Math. Statist. Psych. 8 (II).
Jöreskog, K.G., 1973. A general method for estimating a linear structural equation system. In: Goldberger, A.S., Duncan, O.D. (Eds.), Structural Equation Models in the Social Sciences. Seminar Press, New York.
Jöreskog, K.G., 1981. Analysis of covariance structures. Scand. J. Statist. 8.
Jöreskog, K.G., 2000. Latent Variable Scores and Their Uses.

Lauro, C.N., D'Ambra, L., 1984. Non-symmetrical correspondence analysis. In: Diday, E. et al. (Eds.), Data Analysis and Informatics III. North-Holland, Amsterdam.
Lohmöller, J.B., 1989. Latent Variable Path Modeling with Partial Least Squares. Physica-Verlag, Heidelberg.
Saris, W.E., Stronkhorst, L.H., 1984. Causal Modeling in Non Experimental Research. Sociometric Research Foundation, Amsterdam.
Schönemann, P.H., 1971. The minimum average correlation between equivalent sets of uncorrelated factors. Psychometrika 35.
Stewart, D.K., Love, W.A., 1968. A general canonical correlation index. Psych. Bullet. 70.
Tenenhaus, M., Esposito Vinzi, V., Chatelin, Y.M., Lauro, C., 2005. PLS path modeling. Comput. Statist. Data Anal. 48.
Van den Wollenberg, A.L., 1977. Redundancy analysis: an alternative for canonical correlation analysis. Psychometrika 42.
Vittadini, G., 1989. Indeterminacy problem in the LISREL model. Multivariate Behavioral Res. 24 (4).
Vittadini, G., Lovaglio, P.G., 2004. The estimate of human capital from two sets of observed indicators: formative and reflective. In: Proceedings of the ASA Joint Statistical Meeting, Section on Business and Economic Statistics. CD-ROM, Toronto.
Wold, H., 1982. Soft modeling: the basic design and some extensions. In: Jöreskog, K.G., Wold, H. (Eds.), Systems Under Indirect Observation: Causality, Structure, Prediction. North-Holland, Amsterdam, pp. 1–54.


More information

3. For a given dataset and linear model, what do you think is true about least squares estimates? Is Ŷ always unique? Yes. Is ˆβ always unique? No.

3. For a given dataset and linear model, what do you think is true about least squares estimates? Is Ŷ always unique? Yes. Is ˆβ always unique? No. 7. LEAST SQUARES ESTIMATION 1 EXERCISE: Least-Squares Estimation and Uniqueness of Estimates 1. For n real numbers a 1,...,a n, what value of a minimizes the sum of squared distances from a to each of

More information

AN INTRODUCTION TO STRUCTURAL EQUATION MODELING

AN INTRODUCTION TO STRUCTURAL EQUATION MODELING AN INTRODUCTION TO STRUCTURAL EQUATION MODELING Giorgio Russolillo CNAM, Paris giorgio.russolillo@cnam.fr Structural Equation Modeling (SEM) Structural Equation Models (SEM) are complex models allowing

More information

Lecture 4 The Centralized Economy: Extensions

Lecture 4 The Centralized Economy: Extensions Lecture 4 The Centralized Economy: Extensions Leopold von Thadden University of Mainz and ECB (on leave) Advanced Macroeconomics, Winter Term 2013 1 / 36 I Motivation This Lecture considers some applications

More information

FACTOR ANALYSIS AS MATRIX DECOMPOSITION 1. INTRODUCTION

FACTOR ANALYSIS AS MATRIX DECOMPOSITION 1. INTRODUCTION FACTOR ANALYSIS AS MATRIX DECOMPOSITION JAN DE LEEUW ABSTRACT. Meet the abstract. This is the abstract. 1. INTRODUCTION Suppose we have n measurements on each of taking m variables. Collect these measurements

More information

Title. Description. var intro Introduction to vector autoregressive models

Title. Description. var intro Introduction to vector autoregressive models Title var intro Introduction to vector autoregressive models Description Stata has a suite of commands for fitting, forecasting, interpreting, and performing inference on vector autoregressive (VAR) models

More information

Structural Equation Modeling and simultaneous clustering through the Partial Least Squares algorithm

Structural Equation Modeling and simultaneous clustering through the Partial Least Squares algorithm Structural Equation Modeling and simultaneous clustering through the Partial Least Squares algorithm Mario Fordellone a, Maurizio Vichi a a Sapienza, University of Rome Abstract The identification of different

More information

Math 20F Final Exam(ver. c)

Math 20F Final Exam(ver. c) Name: Solutions Student ID No.: Discussion Section: Math F Final Exam(ver. c) Winter 6 Problem Score /48 /6 /7 4 /4 5 /4 6 /4 7 /7 otal / . (48 Points.) he following are rue/false questions. For this problem

More information

Principal component analysis

Principal component analysis Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and

More information

2/26/2017. This is similar to canonical correlation in some ways. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

2/26/2017. This is similar to canonical correlation in some ways. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 What is factor analysis? What are factors? Representing factors Graphs and equations Extracting factors Methods and criteria Interpreting

More information

MATH 315 Linear Algebra Homework #1 Assigned: August 20, 2018

MATH 315 Linear Algebra Homework #1 Assigned: August 20, 2018 Homework #1 Assigned: August 20, 2018 Review the following subjects involving systems of equations and matrices from Calculus II. Linear systems of equations Converting systems to matrix form Pivot entry

More information

Factor Analysis. Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA

Factor Analysis. Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA Factor Analysis Robert L. Wolpert Department of Statistical Science Duke University, Durham, NC, USA 1 Factor Models The multivariate regression model Y = XB +U expresses each row Y i R p as a linear combination

More information

Selection endogenous dummy ordered probit, and selection endogenous dummy dynamic ordered probit models

Selection endogenous dummy ordered probit, and selection endogenous dummy dynamic ordered probit models Selection endogenous dummy ordered probit, and selection endogenous dummy dynamic ordered probit models Massimiliano Bratti & Alfonso Miranda In many fields of applied work researchers need to model an

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

LINEAR ALGEBRA 1, 2012-I PARTIAL EXAM 3 SOLUTIONS TO PRACTICE PROBLEMS

LINEAR ALGEBRA 1, 2012-I PARTIAL EXAM 3 SOLUTIONS TO PRACTICE PROBLEMS LINEAR ALGEBRA, -I PARTIAL EXAM SOLUTIONS TO PRACTICE PROBLEMS Problem (a) For each of the two matrices below, (i) determine whether it is diagonalizable, (ii) determine whether it is orthogonally diagonalizable,

More information

Ch 7: Dummy (binary, indicator) variables

Ch 7: Dummy (binary, indicator) variables Ch 7: Dummy (binary, indicator) variables :Examples Dummy variable are used to indicate the presence or absence of a characteristic. For example, define female i 1 if obs i is female 0 otherwise or male

More information

ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II

ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II Ragnar Nymoen Department of Economics University of Oslo 9 October 2018 The reference to this lecture is:

More information

ECON 4160, Spring term Lecture 12

ECON 4160, Spring term Lecture 12 ECON 4160, Spring term 2013. Lecture 12 Non-stationarity and co-integration 2/2 Ragnar Nymoen Department of Economics 13 Nov 2013 1 / 53 Introduction I So far we have considered: Stationary VAR, with deterministic

More information

SUMMARY OF MATH 1600

SUMMARY OF MATH 1600 SUMMARY OF MATH 1600 Note: The following list is intended as a study guide for the final exam. It is a continuation of the study guide for the midterm. It does not claim to be a comprehensive list. You

More information

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 8: Canonical Correlation Analysis

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 8: Canonical Correlation Analysis MS-E2112 Multivariate Statistical (5cr) Lecture 8: Contents Canonical correlation analysis involves partition of variables into two vectors x and y. The aim is to find linear combinations α T x and β

More information

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables

Structural Equation Modeling and Confirmatory Factor Analysis. Types of Variables /4/04 Structural Equation Modeling and Confirmatory Factor Analysis Advanced Statistics for Researchers Session 3 Dr. Chris Rakes Website: http://csrakes.yolasite.com Email: Rakes@umbc.edu Twitter: @RakesChris

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

Forecasting 1 to h steps ahead using partial least squares

Forecasting 1 to h steps ahead using partial least squares Forecasting 1 to h steps ahead using partial least squares Philip Hans Franses Econometric Institute, Erasmus University Rotterdam November 10, 2006 Econometric Institute Report 2006-47 I thank Dick van

More information

A dynamic perspective to evaluate multiple treatments through a causal latent Markov model

A dynamic perspective to evaluate multiple treatments through a causal latent Markov model A dynamic perspective to evaluate multiple treatments through a causal latent Markov model Fulvia Pennoni Department of Statistics and Quantitative Methods University of Milano-Bicocca http://www.statistica.unimib.it/utenti/pennoni/

More information

Factor Analysis. Qian-Li Xue

Factor Analysis. Qian-Li Xue Factor Analysis Qian-Li Xue Biostatistics Program Harvard Catalyst The Harvard Clinical & Translational Science Center Short course, October 7, 06 Well-used latent variable models Latent variable scale

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Introduction to Confirmatory Factor Analysis

Introduction to Confirmatory Factor Analysis Introduction to Confirmatory Factor Analysis Multivariate Methods in Education ERSH 8350 Lecture #12 November 16, 2011 ERSH 8350: Lecture 12 Today s Class An Introduction to: Confirmatory Factor Analysis

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Y t = ΦD t + Π 1 Y t Π p Y t p + ε t, D t = deterministic terms

Y t = ΦD t + Π 1 Y t Π p Y t p + ε t, D t = deterministic terms VAR Models and Cointegration The Granger representation theorem links cointegration to error correction models. In a series of important papers and in a marvelous textbook, Soren Johansen firmly roots

More information

Lecture 4: Principal Component Analysis and Linear Dimension Reduction

Lecture 4: Principal Component Analysis and Linear Dimension Reduction Lecture 4: Principal Component Analysis and Linear Dimension Reduction Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics University of Pittsburgh E-mail:

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

FORECASTING THE RESIDUAL DEMAND FUNCTION IN

FORECASTING THE RESIDUAL DEMAND FUNCTION IN FORECASTING THE RESIDUAL DEMAND FUNCTION IN ELECTRICITY AUCTIONS Dipartimento di Statistica Università degli Studi di Milano-Bicocca Via Bicocca degli Arcimboldi 8, Milano, Italy (e-mail: matteo.pelagatti@unimib.it)

More information

Canonical Correlation Analysis of Longitudinal Data

Canonical Correlation Analysis of Longitudinal Data Biometrics Section JSM 2008 Canonical Correlation Analysis of Longitudinal Data Jayesh Srivastava Dayanand N Naik Abstract Studying the relationship between two sets of variables is an important multivariate

More information

Projection. Ping Yu. School of Economics and Finance The University of Hong Kong. Ping Yu (HKU) Projection 1 / 42

Projection. Ping Yu. School of Economics and Finance The University of Hong Kong. Ping Yu (HKU) Projection 1 / 42 Projection Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Projection 1 / 42 1 Hilbert Space and Projection Theorem 2 Projection in the L 2 Space 3 Projection in R n Projection

More information

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16)

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) 1 2 Model Consider a system of two regressions y 1 = β 1 y 2 + u 1 (1) y 2 = β 2 y 1 + u 2 (2) This is a simultaneous equation model

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 2017-2018 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Panel Threshold Regression Models with Endogenous Threshold Variables

Panel Threshold Regression Models with Endogenous Threshold Variables Panel Threshold Regression Models with Endogenous Threshold Variables Chien-Ho Wang National Taipei University Eric S. Lin National Tsing Hua University This Version: June 29, 2010 Abstract This paper

More information

ECON 4160, Lecture 11 and 12

ECON 4160, Lecture 11 and 12 ECON 4160, 2016. Lecture 11 and 12 Co-integration Ragnar Nymoen Department of Economics 9 November 2017 1 / 43 Introduction I So far we have considered: Stationary VAR ( no unit roots ) Standard inference

More information

Simultaneous Equation Models (Book Chapter 5)

Simultaneous Equation Models (Book Chapter 5) Simultaneous Equation Models (Book Chapter 5) Interrelated equations with continuous dependent variables: Utilization of individual vehicles (measured in kilometers driven) in multivehicle households Interrelation

More information

R = µ + Bf Arbitrage Pricing Model, APM

R = µ + Bf Arbitrage Pricing Model, APM 4.2 Arbitrage Pricing Model, APM Empirical evidence indicates that the CAPM beta does not completely explain the cross section of expected asset returns. This suggests that additional factors may be required.

More information

6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses.

6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses. 6348 Final, Fall 14. Closed book, closed notes, no electronic devices. Points (out of 200) in parentheses. 0 11 1 1.(5) Give the result of the following matrix multiplication: 1 10 1 Solution: 0 1 1 2

More information

Cheat Sheet for MATH461

Cheat Sheet for MATH461 Cheat Sheet for MATH46 Here is the stuff you really need to remember for the exams Linear systems Ax = b Problem: We consider a linear system of m equations for n unknowns x,,x n : For a given matrix A

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

REDUNDANCY ANALYSIS AN ALTERNATIVE FOR CANONICAL CORRELATION ANALYSIS ARNOLD L. VAN DEN WOLLENBERG UNIVERSITY OF NIJMEGEN

REDUNDANCY ANALYSIS AN ALTERNATIVE FOR CANONICAL CORRELATION ANALYSIS ARNOLD L. VAN DEN WOLLENBERG UNIVERSITY OF NIJMEGEN PSYCHOMETRIKA-VOL. 42, NO, 2 JUNE, 1977 REDUNDANCY ANALYSIS AN ALTERNATIVE FOR CANONICAL CORRELATION ANALYSIS ARNOLD L. VAN DEN WOLLENBERG UNIVERSITY OF NIJMEGEN A component method is presented maximizing

More information

New Notes on the Solow Growth Model

New Notes on the Solow Growth Model New Notes on the Solow Growth Model Roberto Chang September 2009 1 The Model The firstingredientofadynamicmodelisthedescriptionofthetimehorizon. In the original Solow model, time is continuous and the

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3

FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3 FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3 The required files for all problems can be found in: http://www.stat.uchicago.edu/~lekheng/courses/331/hw3/ The file name indicates which problem

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

ON D-OPTIMAL DESIGNS FOR ESTIMATING SLOPE

ON D-OPTIMAL DESIGNS FOR ESTIMATING SLOPE Sankhyā : The Indian Journal of Statistics 999, Volume 6, Series B, Pt. 3, pp. 488 495 ON D-OPTIMAL DESIGNS FOR ESTIMATING SLOPE By S. HUDA and A.A. AL-SHIHA King Saud University, Riyadh, Saudi Arabia

More information

Ross (1976) introduced the Arbitrage Pricing Theory (APT) as an alternative to the CAPM.

Ross (1976) introduced the Arbitrage Pricing Theory (APT) as an alternative to the CAPM. 4.2 Arbitrage Pricing Model, APM Empirical evidence indicates that the CAPM beta does not completely explain the cross section of expected asset returns. This suggests that additional factors may be required.

More information

Combinatorial semigroups and induced/deduced operators

Combinatorial semigroups and induced/deduced operators Combinatorial semigroups and induced/deduced operators G. Stacey Staples Department of Mathematics and Statistics Southern Illinois University Edwardsville Modified Hypercubes Particular groups & semigroups

More information

Appendix to The Life-Cycle and the Business-Cycle of Wage Risk - Cross-Country Comparisons

Appendix to The Life-Cycle and the Business-Cycle of Wage Risk - Cross-Country Comparisons Appendix to The Life-Cycle and the Business-Cycle of Wage Risk - Cross-Country Comparisons Christian Bayer Falko Juessen Universität Bonn and IGIER Technische Universität Dortmund and IZA This version:

More information

1. The OLS Estimator. 1.1 Population model and notation

1. The OLS Estimator. 1.1 Population model and notation 1. The OLS Estimator OLS stands for Ordinary Least Squares. There are 6 assumptions ordinarily made, and the method of fitting a line through data is by least-squares. OLS is a common estimation methodology

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

Chapter 8. Models with Structural and Measurement Components. Overview. Characteristics of SR models. Analysis of SR models. Estimation of SR models

Chapter 8. Models with Structural and Measurement Components. Overview. Characteristics of SR models. Analysis of SR models. Estimation of SR models Chapter 8 Models with Structural and Measurement Components Good people are good because they've come to wisdom through failure. Overview William Saroyan Characteristics of SR models Estimation of SR models

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

DYNAMIC AND COMPROMISE FACTOR ANALYSIS

DYNAMIC AND COMPROMISE FACTOR ANALYSIS DYNAMIC AND COMPROMISE FACTOR ANALYSIS Marianna Bolla Budapest University of Technology and Economics marib@math.bme.hu Many parts are joint work with Gy. Michaletzky, Loránd Eötvös University and G. Tusnády,

More information

Vector autoregressions, VAR

Vector autoregressions, VAR 1 / 45 Vector autoregressions, VAR Chapter 2 Financial Econometrics Michael Hauser WS17/18 2 / 45 Content Cross-correlations VAR model in standard/reduced form Properties of VAR(1), VAR(p) Structural VAR,

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53 State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State

More information

The Linear Regression Model

The Linear Regression Model The Linear Regression Model Carlo Favero Favero () The Linear Regression Model 1 / 67 OLS To illustrate how estimation can be performed to derive conditional expectations, consider the following general

More information

COMBINING PLS PATH MODELING AND QUANTILE REGRESSION FOR THE EVALUATION OF CUSTOMER SATISFACTION

COMBINING PLS PATH MODELING AND QUANTILE REGRESSION FOR THE EVALUATION OF CUSTOMER SATISFACTION Statistica Applicata - Italian Journal of Applied Statistics Vol. 26 (2) 93 COMBINING PLS PATH MODELING AND QUANTILE REGRESSION FOR THE EVALUATION OF CUSTOMER SATISFACTION Cristina Davino 1 Department

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Olubusoye, O. E., J. O. Olaomi, and O. O. Odetunde Abstract A bootstrap simulation approach

More information

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x = Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle elecomunicazioni A.A. 015-016 Pietro Guccione, PhD DEI - DIPARIMENO DI INGEGNERIA ELERICA E DELL INFORMAZIONE POLIECNICO DI BARI Pietro

More information