Old and new approaches for the analysis of categorical data in a SEM framework


1 Old and new approaches for the analysis of categorical data in a SEM framework
Yves Rosseel, Department of Data Analysis, Belgium
Myrsini Katsikatsou, Department of Statistics, London School of Economics, UK
Meeting of the Working Group Structural Equation Modeling, 26 February 2015, Freie Universität Berlin
Yves Rosseel Old and new approaches for the analysis of categorical data in a SEM framework 1 / 32

2 two approaches for handling categorical data in a SEM framework
- limited information approach
  - only univariate and bivariate information is used
  - mainly developed in the SEM literature
  - perhaps the best known method: three-stage least squares (in Mplus: estimator WLSMV)
  - new approach: pairwise likelihood estimation
- full information approach
  - all information is used
  - frequentist approach: marginal maximum likelihood estimation
    - requires numerical integration (number of dimensions = number of latent variables)
    - mainly developed in the IRT literature (and GLMM literature)
    - only recently incorporated in modern SEM software
  - Bayesian approach

3 example SEM framework: u = binary, o = ordered, y = numeric
[Figure: path diagram with latent variables f1, f2, f3; binary indicators u1-u4; ordered indicators o1-o4; numeric indicators y1-y3; and observed variables x1, x2]

4 full information approach
three approaches:
1. marginal maximum likelihood (MML)
2. latent response approach
3. (Bayesian estimation)

5 full information approach: marginal maximum likelihood
- origins: IRT models (e.g. Bock & Lieberman, 1970) and GLMMs
- the marginal likelihood for the response vector $y_i$ can be written as

$$L_i(\theta) = f(y_i \mid x_i; \theta) = \int_{D(\eta)} f(y_i \mid \eta, x_i; \theta)\, f(\eta \mid x_i; \theta)\, d\eta$$

  where $y_i$ are observed endogenous variables, $x_i$ are observed exogenous covariates, and $\eta$ are latent variables; $D(\eta)$ is the domain of integration; $\theta$ is the parameter vector
- numerical integration
  - Gauss-Hermite quadrature
  - adaptive quadrature
  - Laplace approximation
  - Monte Carlo integration
  - some clever dimension reduction techniques exist for special cases
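To make the integral concrete, here is a minimal R sketch of Gauss-Hermite quadrature for a model with a single latent variable. It is an illustration under stated assumptions, not the implementation used by any of the programs mentioned in these slides: the model is a hypothetical one-factor probit model for binary items with $P(y_j = 1 \mid \eta) = \Phi(\lambda_j \eta - \tau_j)$ and $\eta \sim N(0, 1)$, and the function names `gauss_hermite_normal()` and `marginal_lik()` are made up for this example.

```r
# Gauss-Hermite rule rescaled so that sum(weights * g(nodes)) approximates
# E[g(Z)] for Z ~ N(0, 1). Nodes and weights are obtained from the
# eigen-decomposition of the Jacobi matrix (Golub-Welsch algorithm).
gauss_hermite_normal <- function(n) {
  off <- sqrt(seq_len(n - 1) / 2)
  J <- matrix(0, n, n)
  J[cbind(1:(n - 1), 2:n)] <- off
  J[cbind(2:n, 1:(n - 1))] <- off
  e <- eigen(J, symmetric = TRUE)
  list(nodes = sqrt(2) * e$values, weights = e$vectors[1, ]^2)
}

# Marginal likelihood of one binary response pattern y under the assumed
# one-factor probit model (lambda = loadings, tau = thresholds).
marginal_lik <- function(y, lambda, tau, n_nodes = 21) {
  gh <- gauss_hermite_normal(n_nodes)
  cond <- vapply(gh$nodes, function(eta) {
    p <- pnorm(lambda * eta - tau)       # P(y_j = 1 | eta) for every item
    prod(p^y * (1 - p)^(1 - y))          # conditional likelihood of the pattern
  }, numeric(1))
  sum(gh$weights * cond)                 # integrate eta out numerically
}

# Hypothetical parameter values for three items
lambda <- c(0.8, 1.0, 1.2)
tau <- c(-0.5, 0.0, 0.5)
patterns <- as.matrix(expand.grid(0:1, 0:1, 0:1))
probs <- apply(patterns, 1, marginal_lik, lambda = lambda, tau = tau)
```

The probabilities of all eight response patterns should sum to one, which is a quick sanity check on the quadrature rule. With several latent variables the same idea needs a grid over all dimensions, which is why the cost grows so quickly with the number of latent variables.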

6 available software for the marginal maximum likelihood approach
- commercial software:
  - SEM software: Mplus, ...
  - IRT software: BILOG-MG, MULTILOG, PARSCALE, TESTFACT, EQSIRT, IRTPRO, flexMIRT, ...
- non-commercial, open-source software
  - the Stata module gllamm
  - R packages for IRT: TAM, mirt, ... (see the CRAN Task View: Psychometric Models and Methods) and lme4
  - R packages for SEM: OpenMx, lavaan (available, but slow)

7 full information approach: latent response approach (1)
- an observed variable $y$ can often be viewed as a partial observation of a latent continuous response $y^*$; e.g. an ordinal variable with K = 4 response categories:
[Figure: density of the latent continuous response y*, cut by the thresholds t1, t2, t3 into the regions y = 1, y = 2, y = 3, y = 4]
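The link between the latent response and the observed categories can be sketched in a few lines of R. This is only an illustration: the threshold values are hypothetical, and the latent response is assumed to be standard normal.

```r
set.seed(1)
ystar <- rnorm(1000)                 # latent continuous response y*
th <- c(-1, 0, 1)                    # hypothetical thresholds t1 < t2 < t3

# Observed ordinal variable: y = k when y* falls between t_(k-1) and t_k
y <- cut(ystar, breaks = c(-Inf, th, Inf), labels = FALSE)

# Category probabilities implied by the thresholds under normality
expected <- diff(pnorm(c(-Inf, th, Inf)))
observed <- as.numeric(table(y)) / length(y)
```

Comparing `observed` with `expected` shows that the thresholds fully determine the marginal distribution of the ordinal variable.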

8 full information approach: latent response approach (2)
- assumption: both the latent continuous responses ($y^*$) and the latent variables ($\eta$) are multivariate normal
- the likelihood contribution for observation $i$ is given by

$$L_i(\theta) = f(y_i \mid x_i; \theta) = \int_{T(y_i)} N\left[y_i^* \mid \mu_i(\theta), \Sigma_i(\theta)\right] dy_i^*$$

  where $y_i$ are observed endogenous variables, $x_i$ are observed exogenous covariates; $T(y_i)$ is the integration region (defined by the thresholds)
- the order of integration equals the number of (non-continuous) observed variables
- some examples in the literature exist, up to 4 variables

9 available software for the full information latent response approach
- commercial software: none?
- non-commercial, open-source software
  - R package lavaan: estimator="fml"
  - integration is done by the sadmvn() function in the R package mnormt
  - no analytical gradient (for now)
  - just for fun

10 limited information approaches
1. three-stage least squares (Mplus WLSMV)
2. pairwise likelihood estimation

11 the three-stage least squares estimator
- developed by Bengt Muthén in a series of papers; the seminal paper is
  Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49.
- this approach has been the gold standard in the SEM literature for almost three decades
- first available in LISCOMP (Linear Structural Equations using a Comprehensive Measurement Model), distributed by SSI
- follow-up program: Mplus (version 1: 1998), currently version 7.3
- other authors (Jöreskog 1994; Lee, Poon, Bentler 1992) have proposed similar approaches (implemented in LISREL and EQS respectively)
- another great program: MECOSA (Arminger, G., Wittenberg, J., Schepers, A.), written in the GAUSS language (mid 90s)

12 stage 1: estimating the thresholds
- estimating the thresholds: maximum likelihood using univariate data
- if there are no exogenous variables, this is just

    # generate ordered data with 4 categories
    y <- sample(1:4, size = 100, replace = TRUE)
    prop <- table(y) / sum(table(y))   # observed category proportions
    cprop <- c(0, cumsum(prop))        # cumulative proportions, from 0 to 1
    th <- qnorm(cprop)                 # thresholds (first is -Inf, last is Inf)

- in the presence of exogenous covariates, this is just ordered probit regression

    library(MASS)
    X1 <- rnorm(100); X2 <- rnorm(100); X3 <- rnorm(100)
    fit <- polr(ordered(y) ~ X1 + X2 + X3, method = "probit")
    fit$zeta                           # estimated thresholds

13 stage 2: estimating tetrachoric, polychoric, ..., correlations
- estimate tetrachoric/polychoric/... correlations from bivariate data:
  - tetrachoric (binary x binary)
  - polychoric (ordered x ordered)
  - polyserial (ordered x numeric)
  - biserial (binary x numeric)
  - pearson (numeric x numeric)
- ML estimation is available (see e.g. Olsson 1979 and 1982)
  - two-step: first estimate the thresholds using univariate information only; then, keeping the thresholds fixed, estimate the correlation
  - one-step: estimate thresholds and correlation simultaneously
- if exogenous covariates are involved, the correlations are based on the residual values of $y^*$ (e.g. bivariate probit regression)
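The two-step idea can be sketched in base R as follows. This is a rough illustration, not the code used by lavaan or Mplus: the helper names `rect_prob()` and `polychoric_two_step()` are hypothetical, there are no covariates, and the bivariate normal rectangle probabilities are obtained by one-dimensional numerical integration with `integrate()`.

```r
# P(a1 < Y1* <= b1, a2 < Y2* <= b2) for a standard bivariate normal with
# correlation rho: integrate the conditional probability of Y2* over Y1*.
rect_prob <- function(a1, b1, a2, b2, rho) {
  s <- sqrt(1 - rho^2)
  integrand <- function(u) {
    dnorm(u) * (pnorm((b2 - rho * u) / s) - pnorm((a2 - rho * u) / s))
  }
  integrate(integrand, lower = a1, upper = b1)$value
}

polychoric_two_step <- function(x, y) {
  tab <- table(x, y)
  # step 1: thresholds from the univariate cumulative proportions
  th_x <- c(-Inf, qnorm(cumsum(rowSums(tab)) / sum(tab)))
  th_y <- c(-Inf, qnorm(cumsum(colSums(tab)) / sum(tab)))
  # step 2: maximize the bivariate log-likelihood over rho, thresholds fixed
  negloglik <- function(rho) {
    ll <- 0
    for (a in seq_len(nrow(tab))) {
      for (b in seq_len(ncol(tab))) {
        if (tab[a, b] > 0) {
          p <- rect_prob(th_x[a], th_x[a + 1], th_y[b], th_y[b + 1], rho)
          ll <- ll + tab[a, b] * log(max(p, 1e-12))
        }
      }
    }
    -ll
  }
  optimize(negloglik, interval = c(-0.99, 0.99))$minimum
}

# Simulated example: latent correlation 0.5, four categories per variable
set.seed(123)
n <- 2000
z1 <- rnorm(n)
z2 <- 0.5 * z1 + sqrt(1 - 0.5^2) * rnorm(n)
x <- cut(z1, breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE)
y <- cut(z2, breaks = c(-Inf, -0.5, 0.5, 1.2, Inf), labels = FALSE)
rho_hat <- polychoric_two_step(x, y)
```

With a sample of this size, `rho_hat` should land close to the generating value of 0.5. A one-step estimator would instead optimize over the thresholds and the correlation jointly.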

14 stage 3: estimating the SEM model
- the third stage uses weighted least squares:

$$F_{WLS} = (s - \hat{\sigma})' W^{-1} (s - \hat{\sigma})$$

  where $s$ and $\hat{\sigma}$ are vectors containing all relevant sample-based and model-based statistics respectively
- $s$ contains: thresholds, correlations, optionally regression slopes of exogenous covariates, optionally variances and means of continuous variables
- the weight matrix $W$ is (a consistent estimator of) the asymptotic covariance matrix of the sample statistics ($s$)
- robust version: WLSMV
  - use the diagonal of $W$ only for estimation (DWLS)
  - use the full matrix for inference (standard errors and test statistic)
  - MV stands for Satterthwaite's mean and variance corrected test statistic
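As a small illustration of the difference between the full and the diagonal weighting, the two discrepancy functions can be written in R as below. The function names and the numeric values are hypothetical; in practice $\hat{\sigma}$ is a function of the model parameters and the discrepancy is minimized over those parameters.

```r
# Full WLS discrepancy: (s - sigma_hat)' W^{-1} (s - sigma_hat)
wls_discrepancy <- function(s, sigma_hat, W) {
  r <- s - sigma_hat
  drop(crossprod(r, solve(W, r)))
}

# DWLS discrepancy: only the diagonal of W is used as weights
dwls_discrepancy <- function(s, sigma_hat, W) {
  r <- s - sigma_hat
  sum(r^2 / diag(W))
}

# Hypothetical sample statistics, model-implied values, and weight matrix
s <- c(0.20, -0.10, 0.45)
sigma_hat <- c(0.25, -0.05, 0.40)
W <- matrix(c(0.010, 0.002, 0.000,
              0.002, 0.020, 0.004,
              0.000, 0.004, 0.040), nrow = 3, byrow = TRUE)

f_wls <- wls_discrepancy(s, sigma_hat, W)
f_dwls <- dwls_discrepancy(s, sigma_hat, W)
```

The two functions agree whenever $W$ is diagonal. DWLS avoids inverting the full matrix, which is large and often unstable when there are many indicators.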

15 available software for the WLSMV estimator
- commercial software:
  - gold standard: Mplus (since 1998)
  - LISREL and EQS have similar capabilities (but less general)
  - MECOSA (mid 90s, not available anymore)
- non-commercial, open-source software
  - R package lavaan (since version 0.5)
  - estimator="wlsmv" is the default estimator if some of the observed (endogenous) variables are categorical
  - full implementation, including delta and theta parameterization for multiple groups and/or longitudinal data

16 pairwise likelihood (PL) estimation
- special case of the broader framework of composite likelihood estimation
- key idea: the complex likelihood is broken down into a (weighted) product of component likelihoods which are easier to handle (computationally)
- composite ML estimators are asymptotically unbiased, consistent, and normally distributed
- key references:
  Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics, 80.
  Varin, C. (2008). On composite marginal likelihoods. Advances in Statistical Analysis, 92(1).
- introduced in the SEM literature by Jöreskog & Moustaki (2001), De Leon (2005), Liu (2007), Xi (2011), Katsikatsou et al. (2012)
- computational complexity can be kept low regardless of the number of observed and latent variables

17 pairwise likelihood (PL) estimation in SEM (1)
- in PL estimation, all model parameters are estimated in a single step
- for a random sample of N observations, the pairwise log-likelihood (pl) is defined as the sum of all bivariate log-likelihood functions:

$$pl(\theta; y) = \sum_{n=1}^{N} \sum_{i<i'} \ln L\left(\theta; (y_{in}, y_{i'n})\right) = \sum_{i<i'} \sum_{a=1}^{c_i} \sum_{b=1}^{c_{i'}} n(y_i = a, y_{i'} = b)\, \ln \pi(y_i = a, y_{i'} = b; \theta),$$

  where $n(y_i = a, y_{i'} = b)$ is the observed frequency of the category pair $(a, b)$ and

$$\pi(y_i = a, y_{i'} = b; \theta) = \int_{\tau_{i,a-1}}^{\tau_{i,a}} \int_{\tau_{i',b-1}}^{\tau_{i',b}} f(y_i^*, y_{i'}^*)\, dy_i^*\, dy_{i'}^*.$$

- robust standard errors are based on the Godambe/sandwich information
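The pairwise log-likelihood can be evaluated directly from the bivariate frequency tables. The R sketch below is a simplified illustration and not lavaan's PML code: it assumes standardized bivariate normal latent responses, takes the model-implied latent correlation matrix `rho` and the thresholds `tau` as given, and uses hypothetical helper names (`rect_prob()`, `pairwise_loglik()`). In a real analysis `rho` and `tau` would be functions of $\theta$ and the function would be maximized over $\theta$.

```r
# Bivariate normal rectangle probability via one-dimensional integration
rect_prob <- function(a1, b1, a2, b2, rho) {
  s <- sqrt(1 - rho^2)
  integrand <- function(u) {
    dnorm(u) * (pnorm((b2 - rho * u) / s) - pnorm((a2 - rho * u) / s))
  }
  integrate(integrand, lower = a1, upper = b1)$value
}

# Y: N x p matrix of category codes 1..c_i; rho: p x p latent correlation
# matrix; tau: list with one vector of thresholds per variable
pairwise_loglik <- function(Y, rho, tau) {
  pl <- 0
  p <- ncol(Y)
  for (i in 1:(p - 1)) {
    for (k in (i + 1):p) {
      ti <- c(-Inf, tau[[i]], Inf)
      tk <- c(-Inf, tau[[k]], Inf)
      freq <- table(factor(Y[, i], levels = seq_len(length(ti) - 1)),
                    factor(Y[, k], levels = seq_len(length(tk) - 1)))
      for (a in seq_len(nrow(freq))) {
        for (b in seq_len(ncol(freq))) {
          if (freq[a, b] > 0) {
            pr <- rect_prob(ti[a], ti[a + 1], tk[b], tk[b + 1], rho[i, k])
            pl <- pl + freq[a, b] * log(max(pr, 1e-12))
          }
        }
      }
    }
  }
  pl
}

# Simulated one-factor example with three 3-category items (hypothetical values)
set.seed(42)
N <- 500
lambda <- c(0.8, 0.8, 0.8)
eta <- rnorm(N)
ystar <- sapply(lambda, function(l) l * eta + sqrt(1 - l^2) * rnorm(N))
tau <- list(c(-0.5, 0.5), c(-0.5, 0.5), c(-0.5, 0.5))
Y <- apply(ystar, 2, function(z) cut(z, c(-Inf, -0.5, 0.5, Inf), labels = FALSE))

rho_true <- tcrossprod(lambda); diag(rho_true) <- 1
pl_true <- pairwise_loglik(Y, rho_true, tau)
pl_indep <- pairwise_loglik(Y, diag(3), tau)
```

Here `pl_true` should clearly exceed `pl_indep`, since the data were generated with strongly correlated latent responses. Only two-dimensional integrals are involved, whatever the number of items or latent variables.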

18 pairwise likelihood (PL) estimation in SEM (2)
- a recent simulation study illustrates the many pleasant properties of PL: bias and MSE of PL estimators and their (sandwich type) standard errors are found to be small in all experimental conditions, and decreasing with the sample size
  Katsikatsou, M., Moustaki, I., Yang-Wallentin, F., & Jöreskog, K. G. (2012). Pairwise likelihood estimation for factor analysis models with ordinal data. Computational Statistics & Data Analysis, 56(12).
- a follow-up study illustrates how PL can be used in a large SEM setting (7 latent variables, many indicators)
- available in lavaan since December 2012
  - estimation and robust standard errors only
  - no support for mixed item types
  - no support for exogenous covariates

19 inference under PL in a SEM framework
- reference:
  Katsikatsou, M., Moustaki, I. (under revision). Inference under pairwise likelihood for structural equation models with ordinal variables.
- inference:
  - Wald test
  - Pairwise Likelihood Ratio Test (PLRT) for overall fit
  - PL-AIC and PL-BIC
  - PLRT for comparing nested models
- available in the development version of lavaan (0.5-18)

20 lavaan example
- simulated dataset (N = 500)
- 7 latent variables (2 endogenous, 5 exogenous)
- 45 ordinal indicators (4 response categories)
- structural part:
    ξ6 ~ ξ1 + ξ2 + ξ3 + ξ4 + ξ5
    ξ7 ~ ξ5 + ξ6
- timings:
  - lavaan, estimator = "PML": currently takes about 36 minutes (3 min estimation, 17 min standard errors, 16 min test statistic)
  - lavaan, estimator = "WLSMV": about 3 minutes
  - Mplus, estimator = ML, integration = montecarlo (700), default settings: 1 h 17 min, but failed with THE MODEL ESTIMATION DID NOT TERMINATE NORMALLY

21 lavaan input

    library(lavaan)
    Data <- read.csv("rx.ord")
    # declare all indicators as ordered factors
    Data[,] <- lapply(Data[,], ordered)
    simmodel <- '
      # exogenous lv
      ksi1 =~ V1 + V2 + V3 + V4 + V5
      ksi2 =~ V6 + V7 + V8
      ksi3 =~ V9 + V10 + V11 + V12 + V13
      ksi4 =~ V14 + V15 + V16
      ksi5 =~ V17 + V18 + V19 + V20 + V21
      # endogenous lv
      ksi6 =~ V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 +
              V31 + V32 + V33 + V34 + V35 + V36 + V37 + V38
      ksi7 =~ V39 + V40 + V41 + V42 + V43 + V44 + V45
      # structural model
      ksi6 ~ ksi1 + ksi2 + ksi3 + ksi4 + ksi5
      ksi7 ~ ksi5 + ksi6
    '
    fit <- sem(model = simmodel, data = Data, estimator = "PML")

22 lavaan output (header only)

    lavaan converged normally after 105 iterations

      Number of observations                           500

      Estimator                                        PML      Robust
      Minimum Function Test Statistic
      Degrees of freedom
      P-value                                           NA
      Scaling correction factor
        for the mean and variance adjusted correction

    Parameter estimates:

      Information                                 Observed
      Standard Errors                   Robust.huber.white

                       Estimate  Std.err  Z-value  P(>|z|)
    Latent variables:
      ksi1 =~
        V1 ...
      ksi2 =~
        ...

23 last slide
- PL estimation is a (relatively) new approach for handling categorical data in a SEM framework
  - PL can handle a large number of observed and latent variables
  - PL has many attractive statistical properties
  - support for the full SEM framework
- the R package lavaan:
  - still catching up in the full information ML area (but wait for 0.6!)
  - full support for WLSMV (and friends)
  - PL is currently only implemented in lavaan
    - no support for inference in multiple groups yet!
    - complete data only (PL with missing values is ongoing research)

24 Thank you! (questions?)

25 PL estimation: likelihood
- basic assumption:

$$\begin{pmatrix} y_i^* \\ y_{i'}^* \end{pmatrix} \sim N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho_{ii'} \\ \rho_{ii'} & 1 \end{pmatrix} \right)$$

- the pl for N independent observations:

$$pl(\theta; y) = \sum_{n=1}^{N} \sum_{i<i'} \ln L\left(\theta; (y_{in}, y_{i'n})\right) = \sum_{i<i'} \sum_{a=1}^{c_i} \sum_{b=1}^{c_{i'}} n(y_i = a, y_{i'} = b)\, \ln \pi(y_i = a, y_{i'} = b; \theta),$$

  where

$$\pi(y_i = a, y_{i'} = b; \theta) = \int_{\tau_{i,a-1}}^{\tau_{i,a}} \int_{\tau_{i',b-1}}^{\tau_{i',b}} f(y_i^*, y_{i'}^*)\, dy_i^*\, dy_{i'}^*.$$

26 properties of the pairwise likelihood estimator

$$\hat{\theta}_{PL} = \arg\max_{\theta}\; pl(\theta; y)$$

- under regularity conditions on the component likelihoods,

$$\sqrt{N}\left(\hat{\theta}_{PL} - \theta\right) \xrightarrow{d} N\left(0, G^{-1}(\theta)\right)$$

  where

$$G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta)$$

  and

$$H(\theta) = E\left\{ -\frac{\partial^2 pl(\theta; y)}{\partial\theta\, \partial\theta'} \right\}, \qquad J(\theta) = Var\left\{ \frac{\partial pl(\theta; y)}{\partial\theta} \right\}$$
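Given estimates of $H$ and $J$, the sandwich covariance $G^{-1} = H^{-1} J H^{-1}$ takes only a few lines of R. The sketch below assumes that `H` and `J` are already available on a per-observation scale, so that the covariance of $\hat{\theta}_{PL}$ is $G^{-1}/N$. The function name and the numeric values are hypothetical.

```r
# Inverse Godambe information: G^{-1} = H^{-1} J H^{-1}
godambe_vcov <- function(H, J) {
  H_inv <- solve(H)
  H_inv %*% J %*% H_inv
}

# Hypothetical 2 x 2 sensitivity (H) and variability (J) matrices
H <- matrix(c(2.0, 0.3,
              0.3, 1.5), nrow = 2, byrow = TRUE)
J <- matrix(c(2.6, 0.5,
              0.5, 2.1), nrow = 2, byrow = TRUE)
N <- 500

G_inv <- godambe_vcov(H, J)
robust_se <- sqrt(diag(G_inv) / N)   # sandwich-type standard errors
naive_se <- sqrt(diag(solve(H)) / N) # would only be valid if J equalled H
```

For a genuine likelihood $J = H$ and the sandwich collapses to $H^{-1}$. For a pairwise likelihood the two matrices differ, which is why the robust version is needed.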

27 PLRT for nested models (1)
- let $\theta$ be partitioned as $\theta = (\psi', \omega')'$, where
  - $\psi$: vector of parameters of interest, of dimension $d$,
  - $\omega$: vector of nuisance parameters.
- consider the hypothesis $H_0: \psi = \psi_0$ vs $H_1: \psi \neq \psi_0$
- following Pace et al. (2011),

$$PLRT(\psi_0) = 2\left( pl(\hat{\theta}) - pl(\tilde{\theta}) \right),$$

  where $\hat{\theta} = (\hat{\psi}', \hat{\omega}')'$ and $\tilde{\theta} = (\psi_0', \tilde{\omega}_{\psi_0}')'$ are the PL estimators under $H_1$ and $H_0$, respectively.
- the standard asymptotic result that, under $H_0$, $PLRT \sim \chi^2_{diff}$ cannot be used.

28 PLRT for nested models (2)
- instead, we use a Satterthwaite approximation; under $H_0$,

$$\frac{E\left(PLRT(\psi_0)\right)}{\frac{1}{2} Var\left(PLRT(\psi_0)\right)}\; PLRT(\psi_0) \approx \chi^2_v, \qquad v = \frac{\left[E\left(PLRT(\psi_0)\right)\right]^2}{\frac{1}{2} Var\left(PLRT(\psi_0)\right)},$$

  where

$$E\left(PLRT(\psi_0)\right) \approx tr\left( G^{\psi\psi} \left[H^{\psi\psi}\right]^{-1} \right),$$

$$Var\left(PLRT(\psi_0)\right) \approx 2\, tr\left( G^{\psi\psi} \left[H^{\psi\psi}\right]^{-1} G^{\psi\psi} \left[H^{\psi\psi}\right]^{-1} \right),$$

  and $G^{\psi\psi}$ and $H^{\psi\psi}$ are the parts of $G^{-1}$ and $H^{-1}$ that refer to $\psi$.
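The Satterthwaite adjustment amounts to two traces. The R sketch below is a hedged illustration rather than lavaan's implementation: the function name is hypothetical, and `G_psi` and `H_psi` are assumed to be the blocks of $G^{-1}$ and $H^{-1}$ that refer to $\psi$, supplied by the user.

```r
# Scale the PLRT statistic and compute the approximating chi-square df
satterthwaite_plrt <- function(plrt, G_psi, H_psi) {
  A <- G_psi %*% solve(H_psi)
  m <- sum(diag(A))                  # approximate mean of the PLRT
  v2 <- 2 * sum(diag(A %*% A))       # approximate variance of the PLRT
  scale <- m / (0.5 * v2)
  df <- m^2 / (0.5 * v2)
  stat <- scale * plrt
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Hypothetical blocks for d = 2 parameters of interest
G_psi <- matrix(c(0.9, 0.1,
                  0.1, 0.7), nrow = 2, byrow = TRUE)
H_psi <- matrix(c(0.5, 0.05,
                  0.05, 0.4), nrow = 2, byrow = TRUE)
res <- satterthwaite_plrt(plrt = 7.3, G_psi = G_psi, H_psi = H_psi)
```

When `G_psi` equals `H_psi`, as for a genuine likelihood, the scale factor is 1 and the degrees of freedom equal $d$, so the usual likelihood ratio test is recovered.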

29 PL version of AIC and BIC
- $AIC_{PL}$, based on Varin & Vidoni (2005):

$$AIC_{PL} = -2\, pl(\hat{\theta}_{PL}; y) + 2\, tr\left( \hat{J}(\hat{\theta}_{PL})\, \hat{H}^{-1}(\hat{\theta}_{PL}) \right)$$

- $BIC_{PL}$, based on Gao and Song (2009):

$$BIC_{PL} = -2\, pl(\hat{\theta}_{PL}; y) + \log N \cdot tr\left( \hat{J}(\hat{\theta}_{PL})\, \hat{H}^{-1}(\hat{\theta}_{PL}) \right)$$

30 PLRT for overall fit (1)
- let $\theta$ be partitioned as $\theta = (\phi', \tau')'$, where
  - $\tau$: the vector of thresholds,
  - $\phi$: the vector of the remaining SEM parameters, of dimension $d$.
- recall $P \equiv Corr(y^*)$; let $\rho = vech(P)$, of dimension $p$
- consider the hypothesis

$$H_0: \rho = g(\phi) \quad \text{versus} \quad H_1: \rho \text{ unconstrained}$$

  where $g: R^d \to R^p$
- under $H_1$, the parameter vector $\vartheta$ is partitioned as $\vartheta = (\rho', \tau')'$
- $\tau$: nuisance parameter

31 PLRT for overall fit (2)
- let

$$PLRT_{SEM} = 2\left( pl(\hat{\vartheta}) - pl(\hat{\theta}) \right),$$

  where $\hat{\vartheta} = (\hat{\rho}', \hat{\tau}')'$ and $\hat{\theta} = (\hat{\phi}', \hat{\tau}')'$ are the PL estimates under $H_1$ and $H_0$, respectively
- under $H_0$,

$$\frac{E\left(PLRT_{SEM}\right)}{\frac{1}{2} Var\left(PLRT_{SEM}\right)}\; PLRT_{SEM} \approx \chi^2_v$$

  where

$$v = \frac{\left[E\left(PLRT_{SEM}\right)\right]^2}{\frac{1}{2} Var\left(PLRT_{SEM}\right)}$$

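32 PLRT for overall fit (3)
- mean:

$$E\left(PLRT_{SEM}\right) \approx tr\left( G^{\rho\rho} \left[H^{\rho\rho}\right]^{-1} \right) - tr\left( G^{\phi\phi} \left[H^{\phi\phi}\right]^{-1} \right)$$

- variance:

$$Var\left(PLRT_{SEM}\right) \approx 2\, tr\left( G^{\rho\rho} \left[H^{\rho\rho}\right]^{-1} G^{\rho\rho} \left[H^{\rho\rho}\right]^{-1} \right) + 2\, tr\left( G^{\phi\phi} \left[H^{\phi\phi}\right]^{-1} G^{\phi\phi} \left[H^{\phi\phi}\right]^{-1} \right) - 4\, tr\left( M' \left[H^{\rho\rho}\right]^{-1} M\, G^{\phi\phi} \left[H^{\phi\phi}\right]^{-1} G^{\phi\phi} \right)$$

  where

$$M = \left. \frac{\partial g(\phi)}{\partial \phi'} \right|_{\phi = \phi_0}$$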
