Growth mixture modeling: Analysis with non-Gaussian random effects


CHAPTER 6
Growth mixture modeling: Analysis with non-Gaussian random effects
Bengt Muthén and Tihomir Asparouhov

Contents
6.1 Introduction
6.2 Examples
    Example 1: Clinical trials with placebo response
    Example 2: Randomized interventions with treatment effects varying across latent classes
    Example 3: High school dropout predicted by failing math achievement development
    Example 4: Age–crime curves
    Example 5: Classification of schools based on achievement development
    Other applications
6.3 Growth mixture modeling
    Specification of a simple growth model
    A general multilevel mixture model
6.4 Estimation
    Model assessment
6.5 Examples
    Analysis of Example 4: Age–crime curves
    Analysis of Example 2: Varying intervention effects on classroom aggressive behavior
    Analysis of Example 5: Classification of schools based on achievement development
6.6 Parametric versus non-parametric random-effect models
    Parametric random-effect model
    Non-parametric random-effect model
    Simulation study
6.7 Conclusions
Acknowledgments
References

6.1 Introduction

This chapter gives an overview of non-Gaussian random-effects modeling in the context of finite-mixture growth modeling developed in Muthén and Shedden (1999), Muthén (2001a, 2001b, 2004), and Muthén et al. (2002), and extended to cluster samples and cluster-level mixtures in Asparouhov and Muthén (2008). Growth mixture modeling represents

unobserved heterogeneity between the subjects in their development using both random effects (e.g., Laird and Ware, 1982) and finite mixtures (e.g., McLachlan and Peel, 2000). This allows different sets of parameter values for mixture components corresponding to different unobserved subgroups of individuals, capturing latent trajectory classes with different growth curve shapes. This chapter discusses examples motivating modeling with such trajectory classes. A general latent-variable modeling framework is presented together with its maximum likelihood estimation. Examples from criminology, mental health, and education are analyzed. The choice of a normal or a non-parametric distribution for the random effects is discussed and investigated using a simulation study. The discussion refers to growth mixture modeling techniques as implemented in the Mplus program (Muthén and Muthén), and input scripts for the analyses are available online.

The outline of this chapter is as follows. Section 6.2 presents examples with substantive questions that motivate growth mixture analysis. Section 6.3 describes the general model. Section 6.4 discusses estimation and model assessment. Section 6.5 illustrates the modeling with a series of examples. Section 6.6 compares the parametric and non-parametric versions of the random-effect model. Section 6.7 concludes.

6.2 Examples

The following examples show the breadth of longitudinal studies that may be approached by growth mixture modeling.

Example 1: Clinical trials with placebo response

The first example concerns analysis of data from a double-blind 8-week randomized trial on depression medication (Leuchter et al., 2002). Of particular interest is how to assess medication effects in the presence of placebo response. Placebo response is an improvement in depression ratings that is unrelated to medication. The improvement is often seen as an early steep drop in depression, often followed by a later upswing.
Figure 6.1 shows results for a two-class growth mixture model for the sample of 45 placebo group subjects using the Hamilton depression scale (Ham-D). The first two time points are before randomization and the next nine time points are after randomization. The responder class is shown in the left panel and the non-responder class in the right panel. The solid curve is the estimated mean curve, whereas the broken curves are observed individual trajectories for individuals classified as most likely belonging to this class. Placebo response confounds the estimation of the true effect of medication and is an important phenomenon, given its high prevalence of 25–60%. Because placebo response is pervasive, the statistical modeling should account for it when estimating medication effects. This can be done by acknowledging the qualitative heterogeneity in trajectory shapes for responders and non-responders using growth mixture modeling. The estimation of medication effects using growth mixture modeling is described in Muthén et al. (2007). The medication effect is estimated in line with the approach of the next example.

Example 2: Randomized interventions with treatment effects varying across latent classes

The second example concerns a randomized preventive field trial conducted in Baltimore public schools (Dolan et al., 1993; Ialongo et al., 1999). The study applied a universal intervention aimed at reducing aggressive-disruptive behavior during first and second grade to improve reading and reduce aggression, with outcomes assessed through middle school and beyond (Kellam et al., 1994). Children were followed from first to seventh grade with

respect to the course of aggressive behavior, and a follow-up to age 18 also allowed for the assessment of intervention impact on more distal events, such as the probability of juvenile delinquency as indicated by juvenile court records. The intervention was administered after one pre-intervention time point in fall of first grade. Key scientific questions addressed whether the intervention reduced the slope of the aggression trajectory across the grades, whether the intervention was different in impact for children who initially display higher levels of aggression, and whether the intervention impacted distal outcomes. Analyses of these hypotheses were presented in Muthén et al. (2002). Allowing for multiple trajectory classes in the growth model gave a flexible way to assess differential effects of the intervention. The analyses focused on boys and intervention status as defined by classroom assignment in fall of first grade, resulting in a sample of 119 boys in the intervention group and 80 boys in the control group. Figure 6.2 shows results from a four-class growth mixture model for the 119 boys. For each combination of latent-trajectory class and intervention condition, the estimated mean growth curve is shown together with observed individual trajectories for individuals estimated to be most likely a member of the class. An intervention effect in terms of reducing aggressive behavior is seen for the high class and perhaps also for the low starting (LS) class, whereas the other two classes show no effects.

Figure 6.1 Two-class growth mixture model for depression in a placebo group. (Two panels; vertical axis: Ham-D; horizontal axis: baseline, week 1, week 4, week 8.)

Example 3: High school dropout predicted by failing math achievement development

The third example concerns growth mixture modeling of mathematics achievement development in US schools.
Muthén (2004) analyzed longitudinal math scores for students in grades 7–10 from the Longitudinal Study of American Youth (LSAY) and found a problematic trajectory class with an exceptionally low starting point in grade 7 as well as a low growth rate; see Figure 6.3. The class membership was strongly related to covariates such as grade 7 measures of having low schooling expectations and dropout thoughts. Taken together with the poor math development, this suggests that the class consists of students who are disengaged from school. Class membership was also highly predictive of dropping out by grade 12, a binary distal outcome. In a further analysis, Muthén (2004) carried out a growth mixture analysis where the clustering of students within schools was taken

Figure 6.2 Four-class growth mixture model for aggressive behavior in control and intervention groups. (Eight panels: High, Medium, Low, and LS classes, each for the control group and the intervention group; vertical axis: TOCA-R; horizontal axis: grades 1–7, fall and spring assessments.)


into account by allowing random-effect variation across schools. The school variation was represented in the random effects for the growth, the random intercept in the logistic regression for dropping out, and the random intercept in the multinomial regression predicting latent-class membership as a function of student-level covariates. Furthermore, school-level covariates corresponding to poverty of the school neighborhood and teaching quality in the school were used to predict across-school variation in the random coefficients.

Figure 6.3 Three-class growth mixture model for math achievement related to high school dropout. (Three panels: Poor Development, 20% of students, dropout 69%; Moderate Development, 28%, dropout 8%; Good Development, 52%, dropout 1%; vertical axis: math achievement; horizontal axis: grades 7–10.)

Example 4: Age–crime curves

The fourth example concerns criminal activity of 13,160 males born in Philadelphia, Pennsylvania in 1958 (D'Unger et al., 1998; D'Unger, Land, and McCall, 2002; Loughran and Nagin, 2006). Annual counts of police contacts are available from age 4 to 26 for this birth cohort. The aggregate age–crime curve follows the well-known pattern of increasing annual convictions throughout the subjects' teenage years and decreasing annual convictions thereafter. The criminology literature has focused extensively on identifying groups of individuals with similar patterns or careers of delinquent and criminal offending. To quote D'Unger et al. (1998, p. 1595): "This question of how many latent classes of criminal careers are optimal, and why the number of categories itself is important, has gained salience for criminological theory in light of recent theoretical debates." The authors go on to mention Moffitt (1993) as a key contributor to the notion of different trajectory classes, proposing a distinction between the trajectories of life-course persistents and adolescence limiteds, depending on whether the behaviors persist over the life course or are seen only during adolescence.
The debate continues, as seen in Sampson and Laub (2005) discussing the group-based analysis approach of Nagin (1999, 2005), Nagin and Land (1993), and Roeder, Lynch, and Nagin (1999). The Philadelphia crime data will be analyzed in a new way in this chapter. The analyses to be presented have two special features. First, the outcome variable is a count variable that is very skewed, with a large number of zeros at each point in time. Second, it is of interest to contrast the group-based approach with random-effects models.

Example 5: Classification of schools based on achievement development

The fifth example extends the achievement analyses discussed in Example 3 by using a school-level latent-class variable, enabling a classification of schools as more or less successful. The LSAY data discussed in Example 3 are from a limited number of schools and

analyses are instead performed on data from grades 8, 10, and 12 of the National Education Longitudinal Study (NELS). NELS surveyed 913 schools and a total of 14,217 students. In the analyses to be presented, student growth rate is regressed on the growth intercept in grade 8, and this relationship is allowed to vary across the school-level latent classes. It has been argued in the education literature that a weak relationship is an indicator of a school being egalitarian (e.g., Choi and Seltzer, 2006). The means of the random intercept and the intercept of the random growth rate are also allowed to vary across the school-level latent classes. Both types of school-level latent-class features are useful for determining school quality.

Other applications

Other applications of growth mixture modeling found in the literature include Verbeke and Lesaffre (1996), see also Pearson et al. (1994), who considered different groups of males with linear or exponential growth in prostate-specific antigen (PSA); Muthén and Shedden (1999) and Muthén and Muthén (2000), with application to the development of heavy drinking and alcohol dependence; Lin et al. (2002), with application to PSA and prostate cancer, combining growth mixture modeling with survival analysis; Croudace et al. (2003), with application to bladder control; Muthén et al. (2003), with application to reading failure, including the modeling of a kindergarten process for phonemic awareness linked to a later process of word recognition; and Muthén and Masyn (2005), with application to aggressive behavior and juvenile delinquency, combining growth mixture modeling and discrete-time survival analysis.
Related applications to latent-class membership representing non-participation (non-compliance) and complier-average causal effect estimation in intervention studies (Angrist, Imbens, and Rubin, 1996) are given in Jo (2002) and Jo and Muthén (2003), and are also generalizable to longitudinal studies (see Yau and Little, 2001; Dunn et al., 2003; Muthén, Jo, and Brown, 2003), including time-varying compliance (Lin, Ten Have, and Elliott, 2006).

6.3 Growth mixture modeling

This section describes the general growth mixture modeling framework (see also Asparouhov and Muthén, 2008). The description is closely related to the implementation in the Mplus software version 4.2 and higher (Muthén and Muthén). To familiarize readers with the general Mplus modeling framework, the section starts with a simple growth example put into the conventional linear mixed-effects model as well as the Mplus modeling framework.

Specification of a simple growth model

Consider a single growth process with no latent-trajectory classes, no clustering, linear growth for a continuous outcome $Y$, a time-invariant covariate $x$, and a time-varying covariate $w$,

$$Y_{ij} = \eta_{0i} + \eta_{1i} a_{ij} + \kappa_i w_{ij} + \epsilon_{ij}, \tag{6.1}$$

where $a_{ij}$ are time scores ($j = 1, 2, \ldots, T$), the random intercept $\eta_{0i}$ and the random slope $\eta_{1i}$ represent the growth process, $\kappa_i$ is a random slope, and $\epsilon_{ij}$ is a normally distributed residual. The random intercepts and slopes are expressed as

$$\eta_{0i} = \alpha_0 + \gamma_0 x_i + \zeta_{0i}, \tag{6.2}$$
$$\eta_{1i} = \alpha_1 + \gamma_1 x_i + \zeta_{1i}, \tag{6.3}$$
$$\kappa_i = \alpha_2 + \gamma_2 x_i + \zeta_{2i}, \tag{6.4}$$

where the $\alpha$s and $\gamma$s are parameters and the $\zeta$s are normally distributed residuals. In multilevel terms, equation (6.1) represents level-1 variation across time and (6.2)–(6.4) represent level-2 variation across individuals. Consider the mixed linear model formulation for $Y_i = (Y_{i1}, \ldots, Y_{iT})'$,

$$Y_i = X_i \beta + Z_i b_i + e_i, \tag{6.5}$$

where some individuals may not be observed at all $T$ occasions, leading to missing data. In this example, let

$$\Lambda_i = \begin{pmatrix} 1 & a_{i1} \\ 1 & a_{i2} \\ 1 & a_{i3} \\ \vdots & \vdots \\ 1 & a_{iT} \end{pmatrix},$$

so that in (6.5) we have

$$X_i = (\Lambda_i \;\; w_i \;\; \Lambda_i x_i \;\; w_i x_i), \quad \beta = (\alpha_0, \alpha_1, \alpha_2, \gamma_0, \gamma_1, \gamma_2)',$$
$$Z_i = (\Lambda_i \;\; w_i), \quad b_i = (\zeta_{0i}, \zeta_{1i}, \zeta_{2i})', \quad e_i = (\epsilon_{i1}, \ldots, \epsilon_{iT})'.$$

The Mplus framework uses the general model expression for observed vectors $Y_i$ and $X_i$,

$$Y_i = \nu + \Lambda \eta_i + K X_i + \epsilon_i, \tag{6.6}$$
$$\eta_i = \alpha + B \eta_i + \Gamma X_i + \zeta_i, \tag{6.7}$$

implying

$$Y_i = \nu + \Lambda (I - B)^{-1} \alpha + \Lambda (I - B)^{-1} \Gamma X_i + K X_i$$
$$\qquad + \Lambda (I - B)^{-1} \zeta_i + \epsilon_i,$$

where the first row refers to fixed effects and the second row to random effects. The regression parameter arrays $\Lambda$, $K$, $B$, and $\Gamma$ are allowed to vary across $i$ as a function of observed variables, or they can be unobserved random slopes. The model equations (6.6) and (6.7) capture the level-1 and level-2 expressions for the linear growth example in (6.1) and (6.2)–(6.4). The notation of (6.6) and (6.7) follows that of the linear growth example with $B = 0$ and with the vector $X_i$ containing both the time-varying covariate $w_{ij}$ and the time-invariant covariate $x_i$. The model of (6.6) and (6.7) includes the mixed linear model of (6.5) as a special case.

In latent-variable modeling terms, (6.6) is referred to as the measurement part of the model, where the latent-variable vector $\eta_i$ is measured by the indicators $Y_i$. Here, $\Lambda$ may contain parameters. A frequent example is when $a_{it} = a_t$, so that the $a_t$ can be treated as parameters, for example capturing deviations from a linear growth shape (fixing two $a_t$ values for identification, typically $a_1 = 0$, $a_2 = 1$).
Another example is where multiple indicators of a factor are available at each time point, where different indicators have different factor loadings $\lambda$. With $\Lambda_i = \Lambda$, (6.6) also covers factor analysis with covariates. Furthermore, (6.7) is referred to as the structural part, containing regressions among the latent variables. The regression matrix $B$ has zero diagonal elements, but the off-diagonal elements may be used to regress random effects on each other. For example, the growth slope (growth rate) $\eta_{1i}$, or the random slope $\kappa_i$, may be expressed as a function of the intercept (initial status) $\eta_{0i}$. More generally, (6.7) also covers structural equation modeling. In this way, the extensions of (6.6) and (6.7) to finite mixtures and cluster samples presented in this chapter pertain not only to growth models but also to factor analysis and structural equation models, as well as combinations of such models and growth models (Muthén, 2002).
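The correspondence between the two-level growth model (6.1)–(6.4) and the mixed linear model (6.5) can be checked numerically. The following is a minimal sketch for one individual, not part of the original analyses: all numeric values are hypothetical, and it assumes that the covariate entering (6.4) is the time-invariant $x_i$, as the column $w_i x_i$ of the design matrix $X_i$ implies.

```python
# Sketch: the two-level growth model (6.1)-(6.4) and the mixed linear
# model (6.5) should give the same outcome vector for one individual.
# All numeric values are hypothetical illustrations.

def matvec(rows, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(r * v for r, v in zip(row, vec)) for row in rows]

T = 4
a = [0.0, 1.0, 2.0, 3.0]          # time scores a_ij
w = [0.5, -0.2, 1.0, 0.3]         # time-varying covariate w_ij
x = 1.0                           # time-invariant covariate x_i
alpha = [10.0, 2.0, 0.5]          # alpha_0, alpha_1, alpha_2
gamma = [1.5, -0.4, 0.3]          # gamma_0, gamma_1, gamma_2
zeta = [0.7, -0.1, 0.2]           # zeta_0i, zeta_1i, zeta_2i
eps = [0.05, -0.03, 0.02, 0.01]   # level-1 residuals

# Level-2 equations (6.2)-(6.4), with x_i assumed in (6.4).
eta0 = alpha[0] + gamma[0] * x + zeta[0]
eta1 = alpha[1] + gamma[1] * x + zeta[1]
kappa = alpha[2] + gamma[2] * x + zeta[2]

# Level-1 equation (6.1).
y_two_level = [eta0 + eta1 * a[j] + kappa * w[j] + eps[j] for j in range(T)]

# Mixed-model form (6.5): X_i = (Lambda_i, w_i, Lambda_i x_i, w_i x_i),
# Z_i = (Lambda_i, w_i), beta = (alpha_0, alpha_1, alpha_2, gamma_0, gamma_1, gamma_2)'.
X = [[1.0, a[j], w[j], x, a[j] * x, w[j] * x] for j in range(T)]
Z = [[1.0, a[j], w[j]] for j in range(T)]
beta = alpha + gamma
fixed_part = matvec(X, beta)
random_part = matvec(Z, zeta)
y_mixed = [fixed_part[j] + random_part[j] + eps[j] for j in range(T)]

# Largest absolute discrepancy between the two formulations.
max_gap = max(abs(u - v) for u, v in zip(y_two_level, y_mixed))
```

The two formulations agree up to floating-point rounding, which is simply the algebra of substituting (6.2)–(6.4) into (6.1).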

A general multilevel mixture model

Let $Y_{kij}$ be the $j$th observed dependent variable for individual $i$ in cluster $k$. Three types of variables are considered in the analyses to be presented: binary and ordered categorical variables, continuous normally distributed variables, and counts following the Poisson or zero-inflated Poisson distribution. Let $C_{ki}$ be a latent categorical variable for individual $i$ in cluster $k$ which takes values $1, \ldots, L$. Let $D_k$ be a cluster-level latent categorical variable for cluster $k$ which takes values $1, \ldots, M$. The choice of $L$ and $M$ will be discussed in Section 6.4.

To construct a model for observed binary and ordered categorical variables we proceed in line with Muthén (1984) by defining an underlying continuous, normally distributed latent variable $Y^*_{kij}$ such that, for a set of threshold parameters $\tau_{cdsj}$,

$$Y_{kij} = s \mid C_{ki} = c, D_k = d \iff \tau_{cdsj} < Y^*_{kij} < \tau_{cd,s+1,j}.$$

For continuous normally distributed variables we define $Y^*_{kij} = Y_{kij}$. For counts, $Y^*_{kij} = \log(\lambda_{kij})$, where $\lambda_{kij}$ is the rate of the Poisson distribution. Let $Y^*_{ki}$ be the $J$-dimensional vector of all dependent variables and $X_{ki}$ be the $Q$-dimensional vector of all individual-level covariates. Using latent-variable terms, the measurement part of the model is defined by

$$Y^*_{ki} \mid C_{ki} = c, D_k = d \;=\; \nu_{cdk} + \Lambda_{cdk} \eta_{ki} + K_{cdk} X_{ki} + \epsilon_{ki}, \tag{6.8}$$

where $\nu_{cdk}$ is a $J$-dimensional vector of intercepts, $\Lambda_{cdk}$ is a $J \times m$ slope matrix for the $m$-dimensional random-effect vector $\eta_{ki}$, $K_{cdk}$ is a $J \times Q$ slope matrix for the covariates, and $\epsilon_{ki}$ is a $J$-dimensional vector of residuals with mean zero and covariance matrix $\Theta_{cd}$. For a categorical variable $Y_{kij}$, a normality assumption for $\epsilon_{kij}$ is thus equivalent to a probit regression for $Y_{kij}$ on $\eta_{ki}$ and $X_{ki}$. Alternatively, $\epsilon_{kij}$ can have a logistic distribution, resulting in a logistic regression. For a count variable $Y_{kij}$ the residual $\epsilon_{kij}$ is assumed to be zero.
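The threshold formulation for ordered categorical outcomes can be illustrated with a short calculation. The sketch below is not Mplus code: it assumes a single outcome within one class, a probit link with residual variance 1, and hypothetical threshold values, and it returns the probability of each category given the conditional mean of the underlying $Y^*$.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_probs(mean, thresholds):
    """P(Y = s), s = 0..len(thresholds), where Y = s when
    tau_s < Y* < tau_{s+1} and Y* is Normal(mean, 1).
    The outermost thresholds are taken as -inf and +inf."""
    cuts = [-math.inf] + sorted(thresholds) + [math.inf]
    cdf = [normal_cdf(c - mean) for c in cuts]
    return [cdf[s + 1] - cdf[s] for s in range(len(cuts) - 1)]

# Hypothetical example: three ordered categories, thresholds at -0.5 and 1.0,
# evaluated at a conditional mean of 0.3 for Y*.
probs = category_probs(0.3, [-0.5, 1.0])
```

Raising the conditional mean shifts probability toward the higher categories, which is how the random effects and covariates in (6.8) act on a categorical outcome under this formulation.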
For normally distributed continuous variables $Y_{kij}$ the residual variable $\epsilon_{kij}$ is assumed normally distributed. The structural part of the model is defined by

$$\eta_{ki} \mid C_{ki} = c, D_k = d \;=\; \alpha_{cdk} + B_{cdk} \eta_{ki} + \Gamma_{cdk} X_{ki} + \zeta_{ki}, \tag{6.9}$$

where $\alpha_{cdk}$ is an $m$-dimensional vector of intercepts, $B_{cdk}$ is an $m \times m$ structural regression parameter matrix, $\Gamma_{cdk}$ is an $m \times Q$ slope parameter matrix, and $\zeta_{ki}$ is an $m$-dimensional vector of normally distributed residuals with covariance matrix $\Psi_{cd}$. The model for the latent categorical variable $C_{ki}$ is a multinomial logit model

$$\Pr(C_{ki} = c \mid D_k = d) = \frac{\exp(a_{cdk} + b'_{cdk} X_{ki})}{\sum_s \exp(a_{sdk} + b'_{sdk} X_{ki})}. \tag{6.10}$$

Some parameters have to be restricted for identification purposes. For example, the variance of $\epsilon_{kij}$ should be 1 for categorical variables $Y_{kij}$ under probit and $\pi^2/3$ under logit. Also, $a_{Ldk} = b_{Ldk} = 0$.

The multilevel part of the model is introduced as follows. Each of the intercepts, slopes, or loading parameters in equations (6.8)–(6.10) can be either a fixed coefficient or a random effect that varies across clusters $k$. Let $\eta_k$ be the vector of all such random effects and let $X_k$ be the vector of all cluster-level covariates. The between-level model for $\eta_k$ is then

$$\eta_k \mid D_k = d \;=\; \mu_d + B_d \eta_k + \Gamma_d X_k + \zeta_k, \tag{6.11}$$

where $\mu_d$, $B_d$, and $\Gamma_d$ are fixed parameters and $\zeta_k$ is a normally distributed residual with covariance $\Psi_d$. The model for the between-level categorical variable $D$ is also a multinomial logit regression

$$\Pr(D_k = d) = \frac{\exp(a_d + b'_d X_k)}{\sum_s \exp(a_s + b'_s X_k)}. \tag{6.12}$$
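The multinomial logit forms (6.10) and (6.12) can be evaluated directly. The following sketch is an illustration under stated assumptions rather than the Mplus implementation: the function name and the coefficient values are hypothetical, and the last class is taken as the reference class with intercept and slopes fixed at zero, matching the identification restriction $a_L = b_L = 0$.

```python
import math

def class_probabilities(intercepts, slopes, x):
    """Multinomial logit class probabilities.

    intercepts: a_c for the first L-1 classes
    slopes:     one coefficient vector b_c per non-reference class
    x:          covariate vector
    The last class is the reference class (a_L = b_L = 0)."""
    logits = [a_c + sum(b * v for b, v in zip(b_c, x))
              for a_c, b_c in zip(intercepts, slopes)]
    logits.append(0.0)                       # reference class
    top = max(logits)                        # subtract max for numerical stability
    expd = [math.exp(v - top) for v in logits]
    total = sum(expd)
    return [e / total for e in expd]

# Hypothetical three-class example with two covariates.
probs = class_probabilities(
    intercepts=[0.4, -1.0],
    slopes=[[0.8, -0.2], [0.1, 0.5]],
    x=[1.0, 2.0],
)
```

With all intercepts and slopes set to zero, every class receives probability $1/L$, which is a convenient sanity check.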

Equations (6.8)–(6.12) comprise the definition of a multilevel latent-variable mixture model. There are many extensions of this model that are possible in the Mplus framework. For example, observed dependent variables can be incorporated on the between level. Other extensions arise from the fact that a regression equation can be constructed between any two variables in the model. Such equations can be fixed- or random-effect regressions. The model can also accommodate multiple latent-class variables on the within and the between level. Other types of dependent variables can also be incorporated in this model, such as censored, nominal, semi-continuous, and time-to-event survival variables; see Olsen and Schafer (2001) and Asparouhov, Masyn, and Muthén (2006).

6.4 Estimation

The above model is estimated by the maximum likelihood estimator using the EM algorithm, where the latent variables $C_{ki}$, $\eta_{ki}$, $D_k$, and $\eta_k$ are treated as missing data. The observed-data likelihood is given by

$$\prod_k \sum_d \Pr(D_k = d) \int \psi_k(\eta_k) \prod_i \Big( \sum_c \Pr(C_{ki} = c) \int f_{ki}(Y_{ki}) \psi_{ki}(\eta_{ki}) \, d\eta_{ki} \Big) d\eta_k, \tag{6.13}$$

where $f_{ki}$, $\psi_{ki}$, and $\psi_k$ are the likelihood functions for $Y_{ki}$, $\eta_{ki}$, and $\eta_k$, respectively. Numerical integration is utilized in the evaluation of the above likelihood using both adaptive and non-adaptive quadrature (see Schilling and Bock, 2005). The method can be described as follows. Suppose that $\eta$ is a continuously distributed random-effect variable with density function $\psi$. Then

$$\int f(\eta) \psi(\eta) \, d\eta \approx \sum_{q=1}^{Q} w_q f(n_q), \tag{6.14}$$

where $n_q$ are the nodes of the numerical integration and $w_q$ are the weights. The weights are computed as $w_q = \psi(n_q) / \sum_{i=1}^{Q} \psi(n_i)$. The numerical integration method approximates the continuous distribution for $\eta$ with a categorical distribution; that is, we can assume that the variable $\eta$ takes the values $n_q$ with probabilities $w_q$.
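The approximation in (6.14) can be tried on a case with a known answer. The sketch below assumes equally spaced (rectangular) nodes and a standard normal density for $\eta$; the node range, the number of nodes, and the function names are hypothetical choices made for illustration. It approximates $E(\eta^2)$, whose exact value is 1.

```python
import math

def normal_pdf(v, mean=0.0, sd=1.0):
    """Normal density psi evaluated at v."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def rectangular_rule(lo, hi, n_nodes):
    """Equally spaced nodes n_q with weights w_q = psi(n_q) / sum_i psi(n_i)."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + q * step for q in range(n_nodes)]
    dens = [normal_pdf(n) for n in nodes]
    total = sum(dens)
    weights = [d / total for d in dens]
    return nodes, weights

def integrate(f, nodes, weights):
    """Approximate the integral of f(eta) * psi(eta) as sum_q w_q f(n_q)."""
    return sum(w * f(n) for n, w in zip(nodes, weights))

nodes, weights = rectangular_rule(-5.0, 5.0, 101)
second_moment = integrate(lambda eta: eta * eta, nodes, weights)
```

Because the weights sum to one, the nodes and weights together define the categorical distribution that stands in for the continuous distribution of $\eta$.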
Using this method the likelihood (6.13) is approximated by

$$\prod_k \sum_d \Pr(D_k = d) \sum_q \Pr(\eta_k = n_{qk}) \prod_i \Big( \sum_c \Pr(C_{ki} = c) \sum_r \Pr(\eta_{ki} = n_{rki}) f_{ki}(Y_{ki}) \Big)$$
$$= \prod_k \sum_{d,q} \Pr(D_k = d, \eta_k = n_{qk}) \prod_i \Big( \sum_{c,r} \Pr(C_{ki} = c, \eta_{ki} = n_{rki}) f_{ki}(Y_{ki}) \Big), \tag{6.15}$$

where $n_{qk}$ and $n_{rki}$ are the nodes of the numerical integration. The EM algorithm is as follows. First compute the posterior distribution for the latent variables. The posterior joint distribution for $D_k$ and $\eta_k$ is computed as follows:

$$p_{dqk} = \Pr(D_k = d, \eta_k = n_{qk} \mid \text{data}) = \frac{\Pr(D_k = d, \eta_k = n_{qk}) \prod_i \Big( \sum_{c,r} \Pr(C_{ki} = c, \eta_{ki} = n_{rki}) f_{ki}(Y_{ki}) \Big)}{\sum_{d,q} \Pr(D_k = d, \eta_k = n_{qk}) \prod_i \Big( \sum_{c,r} \Pr(C_{ki} = c, \eta_{ki} = n_{rki}) f_{ki}(Y_{ki}) \Big)}.$$

The posterior conditional joint distribution for $C_{ki}$ and $\eta_{ki}$ is computed as follows:

$$p_{crki \mid dq} = \Pr(C_{ki} = c, \eta_{ki} = n_{rki} \mid \text{data}, D_k = d, \eta_k = n_{qk}) = \frac{\Pr(C_{ki} = c, \eta_{ki} = n_{rki}) f_{ki}(Y_{ki})}{\sum_{c,r} \Pr(C_{ki} = c, \eta_{ki} = n_{rki}) f_{ki}(Y_{ki})}.$$
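The within-level posterior above has a simple computational form: a prior over (class, node) pairs multiplied by the likelihood and renormalized. The sketch below works this out for one individual under simplifying assumptions that are not from the chapter: a single-level model with a random intercept only, normal outcomes with a known residual standard deviation, and hypothetical parameter values and node grid.

```python
import math

def normal_pdf(v, mean, sd):
    """Normal density evaluated at v."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def joint_posterior(y, class_probs, class_means, psi_sd, theta_sd, nodes):
    """Posterior over (class c, node r) for one individual.

    Prior: Pr(C = c, eta = n_r) = pi_c * w_cr, with w_cr the normalized
    density of Normal(alpha_c, psi_sd) at the nodes.
    Likelihood: f(Y | eta = n_r) = product over j of Normal(y_j; n_r, theta_sd)."""
    joint = {}
    for c, (pi_c, alpha_c) in enumerate(zip(class_probs, class_means)):
        dens = [normal_pdf(n, alpha_c, psi_sd) for n in nodes]
        total = sum(dens)
        for r, node in enumerate(nodes):
            prior = pi_c * dens[r] / total
            lik = 1.0
            for y_j in y:
                lik *= normal_pdf(y_j, node, theta_sd)
            joint[(c, r)] = prior * lik
    norm = sum(joint.values())
    return {key: value / norm for key, value in joint.items()}

# Hypothetical two-class example: class means 0 and 5 for the random intercept.
nodes = [-4.0 + 0.25 * q for q in range(57)]          # grid from -4 to 10
post = joint_posterior([5.1, 4.8, 5.4], [0.5, 0.5], [0.0, 5.0], 1.0, 1.0, nodes)

# Summing over nodes gives the posterior class probabilities.
class_post = [sum(p for (c, r), p in post.items() if c == k) for k in range(2)]
```

For these three observations near 5, nearly all posterior mass falls on the class whose random-intercept mean is 5, as expected.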

The expected complete-data log-likelihood is now given by

$$\sum_{d,q,k} p_{dqk} \log\big(\Pr(D_k = d, \eta_k = n_{qk})\big) + \sum_{d,c,q,r,k,i} p_{dqk} \, p_{crki \mid dq} \log\big(\Pr(C_{ki} = c, \eta_{ki} = n_{rki})\big) + \sum_{d,c,q,r,k,i} p_{dqk} \, p_{crki \mid dq} \log\big(f_{ki}(Y_{ki})\big),$$

which is maximized with respect to the model parameters. An alternative algorithm for obtaining the maximum likelihood estimates can be constructed by directly optimizing (6.15) with standard maximization algorithms such as Fisher scoring and quasi-Newton algorithms. Such alternative algorithms can be used in combination with the EM algorithm to achieve faster convergence, an approach known as the accelerated EM algorithm (AEM). The AEM algorithm is implemented in Mplus.

A number of different integration methods can be used in (6.14). Mplus implements three different integration methods: rectangular, Gauss–Hermite, and Monte Carlo integration. In addition, adaptive integration can be used. With this method, the integration nodes are concentrated in the area where the posterior distribution of the random effects is non-zero.

The estimation implemented in Mplus allows missing-at-random data for all dependent variables (Little and Rubin, 2002). Non-ignorable missing data are discussed in Muthén et al. (2003).

It should be noted that mixture models in general are prone to have multiple local maxima of the likelihood, and the use of many different sets of starting values in the iterative maximization procedure is strongly recommended. An automatic random starts procedure is implemented in the Mplus program, where starting values given by the user or produced automatically by the program are randomly perturbed.

Model assessment

For comparison of fit of models that have the same number of classes and are nested, the usual likelihood ratio chi-square difference test can be used, as long as the more restricted model does not have parameters on the border of the admissible parameter space.
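The recommendation to use many sets of starting values can be illustrated with a deliberately small mixture. The sketch below is not the Mplus random starts procedure: it assumes a two-class normal mixture with known unit variances, simulated data, and uniformly drawn starting values, and it runs a plain EM algorithm from ten random starts, keeping the log-likelihood trace of each run.

```python
import math
import random

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def log_lik(y, pi, mu0, mu1):
    """Log-likelihood of a two-class normal mixture with unit variances."""
    return sum(math.log(pi * phi(v - mu0) + (1.0 - pi) * phi(v - mu1)) for v in y)

def em(y, pi, mu0, mu1, iters=50):
    """Run EM and return the log-likelihood after each iteration."""
    trace = []
    for _ in range(iters):
        # E-step: posterior probability of the first class for each observation.
        post = []
        for v in y:
            p0 = pi * phi(v - mu0)
            p1 = (1.0 - pi) * phi(v - mu1)
            post.append(p0 / (p0 + p1))
        # M-step: update the class means and the mixing proportion.
        n0 = sum(post)
        n1 = len(y) - n0
        mu0 = sum(p * v for p, v in zip(post, y)) / n0
        mu1 = sum((1.0 - p) * v for p, v in zip(post, y)) / n1
        pi = n0 / len(y)
        trace.append(log_lik(y, pi, mu0, mu1))
    return trace

# Simulated data (hypothetical): two groups of 100 with means 0 and 4.
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(4.0, 1.0) for _ in range(100)]

# Ten random starts; starting means are drawn within the range of the data.
traces = []
for _ in range(10):
    start = (rng.uniform(0.2, 0.8), rng.uniform(min(y), max(y)), rng.uniform(min(y), max(y)))
    traces.append(em(y, *start))

best_log_lik = max(trace[-1] for trace in traces)
```

Each trace is non-decreasing, as EM guarantees, and the solution with the highest final log-likelihood across starts is the one retained. In models with more classes and free variances, different starts are more likely to end at different local maxima.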
Comparison of models with different numbers of classes violates this requirement because of zero-probability parameters. Deciding on the number of classes is instead typically accomplished by a Bayesian information criterion (BIC: Schwarz, 1978; Kass and Raftery, 1993),

$$\text{BIC} = -2 \log L + r \log n,$$

where $r$ is the number of free parameters in the model and $n$ is the sample size. The lower the BIC value, the better the model. The number of classes is increased until a BIC minimum is found. Although not chi-square distributed, the usual likelihood ratio statistic for comparing models with different numbers of classes can still be used, assessing the distribution of the statistic by bootstrap techniques. McLachlan and Peel (2000, Chapter 6) discuss a parametric bootstrapped likelihood ratio approach proposed by Aitkin, Anderson, and Hinde (1981). Although computationally intensive, it has been found to perform well in simulation studies using latent-class and growth mixture models, outperforming BIC in some instances (Nylund, Asparouhov, and Muthén, 2007).

The fit of the model to data for continuous variables can be studied by comparing, for each class, estimated moments with moments created by weighting the individual data by the estimated conditional probabilities (Roeder, Lynch, and Nagin, 1999). To check how closely the estimated average curve within each class matches the data, it is also useful to randomly assign individuals to classes based on individual estimated conditional class

probabilities. Plots of the observed individual trajectories together with the model-estimated average trajectory can be used to check assumptions using class membership determined by pseudo-class draws (Bandeen-Roche et al., 1997). Wang, Brown, and Bandeen-Roche (2005) present methods for residual checking based on these ideas. With categorical and count outcomes, model fit may be investigated with respect to univariate and bivariate frequency tables, as well as frequencies for response patterns that do not have too small expected counts.

Finally, it is important to note that the need for latent classes may be due to non-normality of the outcomes rather than substantively meaningful subgroups (see McLachlan and Peel, 2000; Bauer and Curran, 2003). To support a substantive interpretation of the latent classes, the researcher should consider not only the outcome variable in question, but also antecedents (covariates predicting latent-class membership), concurrent outcomes, and distal outcomes (predictive validity); see also related arguments in Muthén (2004).

6.5 Examples

This section presents analyses of the crime data of Example 4, the aggressive behavior data of Example 2, and the math achievement data of Example 5. The Example 4 analysis uses a growth mixture model for crime counts. Examples 2 and 5 consider multilevel growth mixture modeling of clustered data. Example 2 examines intervention effects that vary across both student-level and classroom-level latent-class variables. Example 5 considers students within schools, where student growth characteristics vary across a school-level latent-class variable.

Analysis of Example 4: Age–crime curves

The analysis of the Philadelphia data with counts of criminal activity for 13,160 males aged 4–26 will compare two different approaches, a group-based approach and growth mixture analysis (for more extended comparisons, see Kreuter and Muthén, 2007, 2008).
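Two of the model assessment tools described in Section 6.4, the BIC and pseudo-class draws, are easy to express in code. The sketch below is illustrative only: the function names are hypothetical, the inputs are made-up values rather than results from this chapter, and the BIC follows the definition $-2 \log L + r \log n$.

```python
import math
import random

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: -2 log L + r log n (lower is better)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

def pseudo_class_draw(posteriors, rng):
    """Draw one class label per individual from that individual's
    estimated posterior class probabilities."""
    labels = []
    for probs in posteriors:
        label = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        labels.append(label)
    return labels

# Hypothetical comparison of two fitted models on the same n = 500 sample.
bic_two_class = bic(log_lik=-1250.0, n_params=8, n_obs=500)
bic_three_class = bic(log_lik=-1245.0, n_params=12, n_obs=500)

# Hypothetical posterior class probabilities for three individuals.
rng = random.Random(1)
labels = pseudo_class_draw([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]], rng)
```

In this made-up comparison the three-class model gains 5 log-likelihood points at the cost of 4 extra parameters, which is not enough to lower the BIC at $n = 500$. Repeating the pseudo-class draw several times gives a sense of how much class-specific plots vary with classification uncertainty.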
The group-based analysis is associated with the work of Nagin and Land (1993), Nagin (1999, 2005), Roeder, Lynch, and Nagin (1999), and Jones, Nagin, and Roeder (2001). This approach is commonly seen in the criminology literature and was used by D'Unger et al. (1998), D'Unger, Land, and McCall (2002), and Loughran and Nagin (2006) for these data. The group-based analysis does not cover cluster sampling and has the further restrictions of zero within-class variances, $\Psi_c = 0$, as well as $\Theta_c = \theta I$. The group-based approach is further discussed in Muthén (2004), where it is referred to as latent-class growth analysis (LCGA), given its similarity to latent-class analysis (LCA). Both LCGA and LCA search for classes of individuals defined by conditional independence of the repeated measures given class. In contrast, a growth mixture model (GMM) allows for within-class correlations between repeated measures. Such correlation may, for example, be due to omitted time-varying covariates. If within-class correlation is ignored, a distorted class formation is obtained. Within-class correlation is obtained in GMMs by allowing for random effects with non-zero within-class variances.

Both LCGA and GMMs use a zero-inflated Poisson model in line with Roeder, Lynch, and Nagin (1999). For time point $j$, individual $i$, and cluster $k$,

$$Y_{kij} \mid C_{ki} = c \;=\; \begin{cases} 0 & \text{with probability } \pi_{kij}, \\ \text{Poisson}(\lambda_{ckij}) & \text{with probability } 1 - \pi_{kij}, \end{cases}$$

where $\lambda$ is the Poisson rate. In line with previous modeling of the Philadelphia data, a quadratic growth curve is used. Drawing on (6.8) and (6.9) of the general model in Section 6.3.2, the growth mixture zero-inflated Poisson model for these data is expressed in

terms of the log rate as

$$\log \lambda_{ij \mid C_i = c} = \eta_{0i} + \eta_{1i} a_{ij} + \eta_{2i} a_{ij}^2,$$
$$\eta_{0i \mid C_i = c} = \alpha_{0c} + \zeta_{0i},$$
$$\eta_{1i \mid C_i = c} = \alpha_{1c} + \zeta_{1i},$$
$$\eta_{2i \mid C_i = c} = \alpha_{2c} + \zeta_{2i}.$$

To make analysis results comparable to the LCGA of Loughran and Nagin (2006), a minority of individuals with more than 10 criminal offenses in any given year are deleted, reducing the sample size only from 13,160 to 13,126, and the data are combined into two-year intervals.

Loughran and Nagin (2006) settled on a four-class solution: non-offenders, adolescent-limited, and high and low chronic (persisting criminal activity at age 26). D'Unger et al. (1998) and D'Unger, Land, and McCall (2002) used a random subset (n = 1000) of the data and concluded based on BIC that a five-class LCGA solution was preferred. Their five classes were labeled: non-offenders, high and low adolescent-peaked, and high and low chronic.

Table 6.1 gives results for 1–4 classes of GMM and 4–8 classes for LCGA. In addition to log-likelihood values, number of parameters, and BIC, the table shows fit to the data in terms of the number of standardized residuals that are significant at the 5% level for the 10 most frequent response patterns across time (comprising 78% of the data and eliminating only patterns with observed frequency less than 100). The one-class GMM is the conventional random-effects model. Here, 5 of the 10 residuals show significant misfit, illustrating the need for a more flexible model. The two- and the three-class GMMs obtain considerably improved BIC values. The three-class GMM reduces the number of significant residuals from 5 to 1, indicating the appropriateness of the mixture modeling. The four-class GMM adds relatively little improvement. The three-class GMM displays the three themes of non-offenders, adolescent-limited, and chronic. Figure 6.4 shows the mean trajectories for the three-class GMM.
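The zero-inflated Poisson model with a quadratic log rate, as specified above, can be evaluated for a single individual at a single time point. The sketch below is illustrative only: the growth-factor values, the time score, and the inflation probability are hypothetical and are not estimates from the Philadelphia data.

```python
import math

def log_rate(eta0, eta1, eta2, a):
    """Quadratic growth curve for the log Poisson rate at time score a."""
    return eta0 + eta1 * a + eta2 * a * a

def zip_pmf(count, rate, pi_zero):
    """Zero-inflated Poisson probability: a structural zero with
    probability pi_zero, otherwise a Poisson(rate) count."""
    poisson = math.exp(-rate) * rate ** count / math.factorial(count)
    if count == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson

# Hypothetical growth factors for one individual; time score a = 1.5.
rate = math.exp(log_rate(eta0=-1.0, eta1=0.8, eta2=-0.3, a=1.5))
pi_zero = 0.4

prob_zero = zip_pmf(0, rate, pi_zero)
total_prob = sum(zip_pmf(count, rate, pi_zero) for count in range(60))
mean_count = sum(count * zip_pmf(count, rate, pi_zero) for count in range(60))
```

The implied mean count is $(1 - \pi)\lambda$, so zero inflation lowers the expected count relative to an ordinary Poisson model with the same rate while adding mass at zero, which suits outcomes with many zeros at each time point.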
The four-class GMM splits the adolescent-limited class into two, where the total percentage for those two classes is about the same as for the adolescent-limited class of the three-class GMM. The four-class LCGA is the same as presented in Loughran and Nagin (2006) and the five-class LCGA shows the same types of trajectory classes as in D'Unger et al.'s analysis. Neither of these two models fits the data well. An eight-class LCGA is needed to get a reduction to one significant residual. In contrast, the three-class GMM has only one significant residual and the four-class GMM has none. With three classes the GMM gives a better BIC value than any of the LCGA models shown in Table 6.1.

Table 6.1 Age crime curves: Log-likelihood and BIC comparisons for GMM and LCGA. Columns: Model, Log-likelihood, # Parameters, BIC, # Significant residuals. Rows: 1- to 4-class GMM and 4- to 8-class LCGA.

The BIC values for the four-class LCGA

used in Loughran and Nagin (2006) and the five-class model used in D'Unger et al. (1998) and D'Unger, Land, and McCall (2002) are considerably worse than the BIC value for the three-class GMM. Furthermore, the three-class GMM uses two fewer parameters than the five-class LCGA, but has a better log-likelihood by 200 points. This illustrates the importance of using random effects to allow for variations on the themes of the trajectory shapes of the classes. The LCGA approach leads to a proliferation of classes, not all of which may have substantive salience.

Figure 6.4 Estimated mean trajectories from a three-class growth mixture model for criminal activities (Class 1, 64.9%; Class 2, 15.6%; Class 3, 19.6%; horizontal axis: age; vertical axis: mean).

Analysis of Example 2: Varying intervention effects on classroom aggressive behavior

The Baltimore randomized field trial introduced as Example 2 was repeated for several cohorts of students. The earlier discussion considered cohort 1 data, whereas data from cohort 3 (Ialongo et al., 1999) are analyzed here. A total of 362 boys in 27 classrooms are considered over four time points: fall of first grade, spring of first grade, spring of second grade, and spring of third grade. The average number of boys per classroom is about 13.4 (362/27). It is of interest to study whether teachers in classrooms with higher aggressiveness levels have a more difficult time successfully implementing the intervention aimed at reducing aggressive-disruptive behavior. For the first grade, there is substantial variation across classrooms in the aggressiveness scores as evidenced by the intraclass correlations at the four time points: 0.11, 0.16, 0.04, […]. In addition to student-level trajectory classes, the use of latent classes on the classroom level makes it possible to more fully explore variation in intervention effects. Drawing on the general model of Section 6.3.2, the two-level GMM is expressed as follows using a quadratic curve shape,

$$
Y_{kij \mid C_{ki}=c,\, D_k=d} = \eta_{0ki} + \eta_{1ki} a_{ij} + \eta_{2ki} a_{ij}^2 + \epsilon_{kij},
$$

for j = 1, 2, 3, 4, with variation across students within classrooms expressed as

$$
\begin{aligned}
\eta_{0ki \mid C_{ki}=c,\, D_k=d} &= \alpha_{cdk0} + \zeta_{0ki}, \\
\eta_{1ki \mid C_{ki}=c,\, D_k=d} &= \alpha_{cdk1} + \zeta_{1ki}, \\
\eta_{2ki \mid C_{ki}=c,\, D_k=d} &= \alpha_{cdk2} + \zeta_{2ki},
\end{aligned}
$$

and variation across classrooms expressed as

$$
\begin{aligned}
\alpha_{cdk0 \mid C_{ki}=c,\, D_k=d} &= \alpha_{cd0} + \zeta_{30k}, \\
\alpha_{cdk1 \mid C_{ki}=c,\, D_k=d} &= \alpha_{cd1} + \gamma_{cd1} Z_k + \zeta_{31k}, \\
\alpha_{cdk2 \mid C_{ki}=c,\, D_k=d} &= \alpha_{cd2} + \gamma_{cd2} Z_k + \zeta_{32k}.
\end{aligned}
$$

Here, $a_{i1} = 0$ to center the intercept $\eta_0$ at the pre-intervention time point. $Z$ is a treatment-control dummy variable on the classroom level. For reasons of parsimony, the student-level latent-class variable C and the classroom-level latent-class variable D are taken to have an additive effect on the means $\alpha_{cd0}$, $\alpha_{cd1}$, and $\alpha_{cd2}$. The $\gamma$ intervention effects are, however, allowed to vary across combinations of C and D classes. The linear and quadratic slopes were found to have zero variance across classrooms. The intraclass correlation is captured by the classroom variation in the random intercept of the growth model, $\alpha_{cdk0}$. The latent categorical variable $C_{ki}$ follows the multinomial logistic regression

$$
\Pr(C_{ki} = c \mid D_k = d) = \frac{\exp(a_{cdk})}{\sum_s \exp(a_{sdk})},
$$

where in this application

$$
a_{cdk \mid D_k=d} = a_c + \zeta_{ck}. \qquad (6.16)
$$

The analyses indicate that $V(\zeta_{ck}) = 0$, that is, the random intercepts for the latent-class variable C do not vary across classrooms. In other applications, however, this variance can be substantial.

As a first step, a model without the classroom-level latent-class variable D was explored. As judged by BIC, the conventional single-class random-effects growth model is clearly outperformed by growth mixture modeling, with a three-class model giving the lowest BIC. The log-likelihood for the conventional model is […] with 14 parameters and a BIC of 8398, while the three-class GMM has a log-likelihood of […] with 26 parameters and a BIC of […]. The three-class model has a significant classroom variance for the random intercept.
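The BIC comparisons used throughout these analyses follow the standard definition BIC = -2 log L + p ln n, where smaller values are preferred. The sketch below is only a convenience for checking reported values; the function name is hypothetical, and the choice of n (taken here as the number of individuals) is an assumption, since the text does not state which sample size enters the penalty.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: -2 log L + p * ln(n).

    n_observations is assumed to be the number of individuals; this is an
    assumption of the sketch, not something stated in the chapter.
    """
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)
```

For instance, the single-class NELS analysis reported later in this section has a log-likelihood of magnitude 31,791 (which must be negative for the reported BIC to hold) with 10 parameters; assuming n = 14,217 students, the function returns a value that agrees with the reported BIC of 63,678 to within rounding.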
Second, two latent classes for D were added to the model, resulting in latent classes with low versus high classroom-level aggression (51% versus 49%). The log-likelihood is […] with 34 parameters and a BIC of […]. This BIC is not as good as for the previous model with no classroom-level latent-class variable, but it is not known how BIC performs in settings with multilevel latent-class variables. The three student-level latent-trajectory classes show a low-increasing class of 68%, a medium-increasing class of 19%, and a high-decreasing class of 12%. The mean curves for these three latent classes are shown in Figure 6.5 as pairs of control and intervention curves. Results for the latent class consisting of classrooms with low aggression level are given in the left plot and results for the latent class consisting of classrooms with high aggression level are given in the right plot. The plots suggest that in classrooms with a low level of aggression, students who are in the two highest trajectory classes benefit from the intervention. In classrooms with a high level of aggression, however, only students who are in the lowest trajectory class benefit from the intervention. This suggests that the intervention may be harder for teachers to implement well in high-aggressive classrooms. The results should be interpreted with caution, however, given the sample of only 27 classrooms and other competing models. An alternative model lets the C and D latent-class variables have an interactive effect on the

random-effect growth means and lets the random-effect means of $a_{cdk}$ in (6.16) be influenced by the latent classes of D. This significantly improves the log-likelihood, but the increased number of parameters on the classroom level results in a less stable solution. The resulting split into 7 and 20 classrooms for the latent classes of D causes estimated outcome mean differences with high variability.

Figure 6.5 Estimated mean trajectories from a growth mixture model of classroom aggressive behavior (left panel: low-aggression classrooms; right panel: high-aggression classrooms; control and treatment curves in each; horizontal axis: Fall 1st, Spring 1st, Spring 2nd, Spring 3rd; vertical axis: means).

Analysis of Example 5: Classification of schools based on achievement development

The NELS math achievement data from grades 8, 10, and 12, introduced as Example 5, are analyzed here. NELS surveyed 913 schools and a total of 14,217 students. The NELS analysis illustrates two features of the general model: taking into account the school clusters and using a school-level latent-class variable. In the NELS analysis, student growth rate is regressed on the growth intercept in grade 8 using a random slope that varies across schools. This random slope and the means of the random intercept and the intercept of the random growth rate are allowed to vary across the school-level latent classes. Letting school-level latent classes influence student-level relations helps identify the school-level latent classes. Extending the earlier example to clusters k and a cluster-level latent-class variable $D_k$, variation across grades is expressed as

$$
Y_{kij \mid D_k=d} = \eta_{0ki} + \eta_{1ki} a_{ij} + \epsilon_{kij},
$$

for j = 1, 2, 3, with variation across students expressed as

$$
\begin{aligned}
\eta_{0ki \mid D_k=d} &= \alpha_{d0} + \zeta_{0ki}, \\
\eta_{1ki \mid D_k=d} &= \alpha_{d1} + \beta_{dk}\, \eta_{0ki} + \zeta_{1ki},
\end{aligned}
$$

where the variation across schools is accomplished by the variation of $\alpha_{d0}$, $\alpha_{d1}$, and $\beta_{dk}$ across the classes of D.
A single-class growth model, that is, a conventional three-level analysis, obtains a log-likelihood of -31,791 with 10 parameters and a BIC of 63,678. A two-class GMM obtains a log-likelihood of -31,545 with 16 parameters and a BIC of 63,243. A three-class GMM obtains a log-likelihood of -31,434 with 22 parameters and a BIC of 63,079. A four-class model does not improve the log-likelihood further. The three-class model shows that the growth rate is significantly positively related to the growth intercept defined at grade 8 only for a class of 52% of the schools that have average growth over grades […]. A higher developing class of 25% and a lower developing class of 23% have small and

insignificant relationships. This illustrates the possibility of finding clusters of schools with different achievement profiles. School-level covariates predicting school class membership can give further understanding of the school classes.

6.6 Parametric versus non-parametric random-effect models

Titterington, Smith, and Makov (1985) make a distinction in the use of finite-mixture modeling in terms of direct and indirect applications. A direct application uses mixtures to represent the underlying physical phenomenon, whereas with the indirect application the mixture components do not necessarily have a direct physical interpretation. The examples discussed so far can be seen as attempts at direct application, where trajectory classes are given substantive interpretation and results are presented for each mixture component rather than mixing over the classes. Examples of indirect applications include outlier detection and representation of non-normal distributions. Mixture modeling of non-normal distributions is the focus of this section. A growth model with a non-parametric representation of the random-effects distribution is presented and a simulation study compares the use of such a model to the conventional random-effect growth model assuming normality.

It has been argued that with categorical and count outcomes, the typical normality assumption for random effects in repeated-measurement modeling may be less well supported by data (see also Aitkin, 1999). Deviations from normality may strongly affect the results. With categorical and count outcomes, maximum likelihood leads to the use of numerical integration, which is computationally heavy and intractable when the number of random effects is large. Numerical integration uses fixed quadrature points and weights according to a normal distribution.
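To make the idea of fixed quadrature points and weights concrete, the sketch below approximates the marginal probability of a binary response pattern under a logistic model with a single normal random intercept. It is an illustration under stated assumptions rather than the chapter's estimation routine: it uses the standard three-point Gauss-Hermite rule for a standard normal variable (points 0 and ±√3 with weights 2/3 and 1/6), whereas practical software uses more points, and all function names are hypothetical.

```python
import math

# Three-point Gauss-Hermite rule for a standard normal variable:
# fixed points and weights, not estimated from the data.
GH_POINTS = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
GH_WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def pattern_probability(responses, intercept_mean, intercept_sd,
                        points=GH_POINTS, weights=GH_WEIGHTS):
    """Approximate marginal probability of a 0/1 response pattern.

    The random intercept is mean + sd * z with z standard normal; the
    integral over z is replaced by a weighted sum over the fixed points.
    """
    total = 0.0
    for z, w in zip(points, weights):
        p = expit(intercept_mean + intercept_sd * z)
        conditional = 1.0
        for u in responses:
            conditional *= p if u == 1 else 1.0 - p
        total += w * conditional
    return total
```

Passing estimated class means as `points` and estimated class probabilities as `weights` turns the same sum into the non-parametric version, which is the sense in which the points and weights are estimated rather than fixed.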
A non-parametric approach instead considers a discretized distribution, estimating the points and the weights using a finite-mixture model. The latent class means are the points and the class probabilities are the weights. In this way, the non-parametric approach both avoids the normality specification and is computationally less demanding.

In this section we describe and compare the general parametric and non-parametric random-effect models. Both of these models are special cases of the general model described in equations (6.8)–(6.12). Both of these modeling alternatives attempt to capture cluster-specific effects. The difference between the two models is the underlying assumption for the distributions of the cluster-specific random effects. In the parametric model the random effects are assumed to have a conditionally normal distribution, that is, the conditional distribution of the random effects, given all covariates, is assumed to be normal. In the non-parametric model the random effects are assumed to have a non-parametric conditional distribution. The parametric random-effect model is well established and frequently used in practice. Butler and Louis (1992) show that the normality assumption in the parametric model does not affect the fixed slopes in the model. Verbeke and Lesaffre (1996) show that more accurate estimates can be obtained for the random effects if a non-normal distribution is estimated. Aitkin (1999) gives the general modeling approach to the non-parametric random-effect models that we follow here. First we give the complete description of the two modeling alternatives and show how they fit in the general modeling framework (6.8)–(6.12).

Parametric random-effect model

This model is a special case of model (6.8)–(6.12) for the case of no categorical latent variables. The within-level model is given by

$$
Y_{ki} = \nu_k + \Lambda_k \eta_{ki} + K_k X_{ki} + \epsilon_{ki}, \qquad (6.17)
$$

$$
\eta_{ki} = \alpha + B_k \eta_{ki} + \Gamma_k X_{ki} + \zeta_{ki}. \qquad (6.18)
$$

The coefficients $\nu_k$, $\Lambda_k$, $K_k$, $B_k$, and $\Gamma_k$ can be either fixed coefficients that are the same across clusters or random effects that vary across clusters. Let $\eta_k$ represent the vector of all such random effects. The between-level model is described by

$$
\eta_k = \mu + B \eta_k + \Gamma X_k + \zeta_k. \qquad (6.19)
$$

The random-effect residuals $\zeta_{ki}$ and $\zeta_k$ are assumed normally distributed. This assumption is the difference between the parametric and the non-parametric model. Note that the distributional assumption for $\epsilon_{ki}$ is determined by the type of observed variable we model.

Non-parametric random-effect model

This model is a special case of model (6.8)–(6.12) where the random effects $\eta_{ki}$ and $\eta_k$ do not have normally distributed residuals $\zeta_{ki}$ and $\zeta_k$. The within-level model is given by

$$
Y_{ki} = \nu_k + \Lambda_k \eta_{ki} + K_k X_{ki} + \epsilon_{ki},
$$

$$
\eta_{ki \mid C_{ki}=c} = \alpha_c + B_k \eta_{ki} + \Gamma_k X_{ki}, \qquad (6.20)
$$

$$
\Pr(C_{ki} = c) = p_c, \qquad (6.21)
$$

where $p_c$ are parameters to be estimated. The coefficients $\nu_k$, $\Lambda_k$, $K_k$, $B_k$, and $\Gamma_k$ can again be either fixed coefficients or random effects. Let $\eta_k$ represent the vector of all random effects. The between-level model is given by

$$
\eta_{k \mid D_k=d} = \mu_d + B \eta_k + \Gamma X_k, \qquad (6.22)
$$

$$
\Pr(D_k = d) = q_d, \qquad (6.23)
$$

where $q_d$ are parameters to be estimated. The random-effect model (6.20)–(6.23) can alternatively be presented as in equations (6.18)–(6.19), considering the mixture across classes,

$$
\eta_{ki} = \alpha + B_k \eta_{ki} + \Gamma_k X_{ki} + \zeta_{ki},
\qquad
\eta_k = \mu + B \eta_k + \Gamma X_k + \zeta_k,
$$

where $\alpha = \sum_c \alpha_c p_c$, $\mu = \sum_d \mu_d q_d$, and $\zeta_{ki}$ and $\zeta_k$ are non-parametric zero-mean residuals that are freely estimated. The residual $\zeta_{ki}$ takes the values $\alpha_c - \alpha$ with probability $p_c$ and the residual $\zeta_k$ takes the values $\mu_d - \mu$ with probability $q_d$.
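The moments of such a discrete random-effect distribution follow directly from the class points and probabilities. The sketch below computes the mixture mean $\sum_c p_c \alpha_c$ and the covariance matrix $\sum_c p_c (\alpha_c - \alpha)(\alpha_c - \alpha)^\top$ in plain Python; the function names are hypothetical and the inputs are assumed to be already-estimated class means and class probabilities.

```python
def mixture_mean(points, probabilities):
    """Mean of a discrete distribution: sum over classes of p_c * alpha_c.

    points is a list of equal-length vectors (one per class) and
    probabilities is the matching list of class probabilities.
    """
    dim = len(points[0])
    return [sum(p * pt[r] for pt, p in zip(points, probabilities))
            for r in range(dim)]


def mixture_covariance(points, probabilities):
    """Covariance matrix: sum over classes of p_c * (alpha_c - alpha)(alpha_c - alpha)'."""
    mean = mixture_mean(points, probabilities)
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for pt, p in zip(points, probabilities):
        dev = [pt[r] - mean[r] for r in range(dim)]
        for r in range(dim):
            for s in range(dim):
                cov[r][s] += p * dev[r] * dev[s]
    return cov
```

With two random effects per class, for example an intercept and a slope, the diagonal of the returned matrix gives the two variances and the off-diagonal entry their covariance, computed over the mixture rather than within any class.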
The variance and covariance for the non-parametric effects can also be computed; for example, the variance of $\zeta_{ki}$ is $\sum_c p_c (\alpha_c - \alpha)(\alpha_c - \alpha)^\top$.

Simulation study

A simulation study is conducted to compare the performance of the parametric and non-parametric random-effect models for data generated with non-normal random effects. Consider a logistic growth model with 10 binary items $U_1, \ldots, U_{10}$,

$$
\log\left[\frac{\Pr(U_{ij} = 1)}{\Pr(U_{ij} = 0)}\right] = \eta_{0i} + \eta_{1i} a_{ij}, \qquad (6.24)
$$

where the time scores are $a_{ij} = (j - 1)/2$, and $\eta_0$ and $\eta_1$ are non-normal random effects. Generation of $\eta_0$ and $\eta_1$ used the following finite mixture of normal distributions:

$$
0.67\, N(\mu_1, \sigma^2) + 0.09\, N(\mu_2, \sigma^2) + 0.24\, N(\mu_3, \sigma^2).
$$

To generate $\eta_0$, the following parameters were used: $\mu_1 = 2$, $\mu_2 = 1$, $\mu_3 = 0$, and $\sigma = 0.4$. To generate $\eta_1$, the following parameters were used: $\mu_1 = 0.3$, $\mu_2 = 0.4$, $\mu_3 = 1$, and $\sigma = 0.1$. From these values, 100 samples of size 2000 were generated according to the model (6.24). The data were analyzed using the parametric linear model (PM) and

the non-parametric linear model (NPM). PM is a conventional single-class growth model with random normal effects as in (6.17)–(6.18). Drawing on (6.20) and (6.22), the NPM is expressed as

$$
\begin{aligned}
Y_{ij} &= \eta_{0i} + \eta_{1i} a_{ij} + \epsilon_{ij}, \\
\eta_{0i \mid C_i=c} &= \alpha_{0c}, \\
\eta_{1i \mid C_i=c} &= \alpha_{1c},
\end{aligned}
$$

so that the random effects are represented by a mixture distribution. A more general form would allow within-class variation for residuals $\zeta$ as in (6.9).

The parameter estimates are summarized in Table 6.2. The means of $\eta_0$ and $\eta_1$ are denoted by $m_0$ and $m_1$ and the variances by $v_0$ and $v_1$. The covariance of $\eta_0$ and $\eta_1$ is denoted by $\rho$. The results are presented for the non-parametric model with three nodes, since three nodes were determined to be sufficient for most replications using the McLachlan and Peel (2000) parametric likelihood ratio test. The estimates on which Table 6.2 is based are computed for the mixture over the three classes in line with (6.20) and (6.22). The results in Table 6.2 clearly indicate the advantages of the NPM method. The NPM parameter estimates have substantially smaller bias and smaller mean squared error (MSE) for several parameters.

Table 6.2 Comparing the parametric (PM) and non-parametric (NPM) random-effect models. Columns: Parameter, True value, PM bias, NPM bias, PM MSE, NPM MSE. Rows: $m_0$, $m_1$, $v_0$, $v_1$, $\rho$.

In general it is difficult to evaluate model fit for random-effect models. There is no general unrestricted model which can be used for comparison. In this simulated example, however, there is such a model, namely, the completely unrestricted contingency table for the binary items. In addition, the Pearson chi-square test can be used to test the fit of the model. The data were generated according to a linear growth model with non-normal random effects; that is, the true model is a linear random-effect growth model.
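The data-generating process described for the simulation study can be sketched as follows. Several details are assumptions of this sketch rather than statements from the chapter: a single mixture component is drawn per subject and shared by $\eta_0$ and $\eta_1$, the quoted σ values are treated as standard deviations, the component means are taken exactly as printed, and all names are hypothetical.

```python
import math
import random

# Mixture proportions and component parameters as printed in the text.
WEIGHTS = (0.67, 0.09, 0.24)
MU_ETA0, SD_ETA0 = (2.0, 1.0, 0.0), 0.4
MU_ETA1, SD_ETA1 = (0.3, 0.4, 1.0), 0.1


def draw_random_effects(rng):
    """Draw (eta0, eta1) from the three-component normal mixture.

    Assumption: both random effects share the same component per subject.
    """
    c = rng.choices(range(3), weights=WEIGHTS)[0]
    return rng.gauss(MU_ETA0[c], SD_ETA0), rng.gauss(MU_ETA1[c], SD_ETA1)


def simulate_subject(rng, n_items=10):
    """Generate binary items from the logistic growth model (6.24)."""
    eta0, eta1 = draw_random_effects(rng)
    responses = []
    for j in range(1, n_items + 1):
        a = (j - 1) / 2.0                      # time score a_ij
        p = 1.0 / (1.0 + math.exp(-(eta0 + eta1 * a)))
        responses.append(1 if rng.random() < p else 0)
    return responses


def simulate_sample(n_subjects=2000, seed=0):
    """One replication: a list of n_subjects response vectors."""
    rng = random.Random(seed)
    return [simulate_subject(rng) for _ in range(n_subjects)]
```

Calling `simulate_sample` with 100 different seeds would mimic the 100 replications of size 2000; fitting the PM and NPM models to each replication is outside the scope of this sketch.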
Both the PM and NPM models are linear random-effect growth models but are based on different assumptions on the distribution of the random effects; neither assumption specifies the true random-effect distribution. This situation is typical in practical applications, where the true random-effect distribution is unknown and the modeling assumptions are likely to deviate from the true distribution to some extent. It is assumed that the distributional misspecification will not interfere with the basic structure of the model and that the estimated model will provide a good fit for the data despite the distributional misspecifications. The Pearson test of fit can be used to directly compare the sensitivity of the PM and NPM models. If the Pearson test rejects the model, one concludes that the model fit is poor. In practical applications the lack of model fit could incorrectly be interpreted as evidence for deficiency in the linear growth structure of the model rather than as possible misspecification in the random-effect distribution. In the current simulated example one wants the Pearson test to reject the model no more than the nominal 5% of the time. Table 6.3 contains the Pearson test of fit results for the PM linear growth model as well as the NPM linear growth model using 3, 4, 5, and 6 nodes. The rejection rate in Table 6.3 is the percentage of times the linear growth model was rejected incorrectly. Also presented are the average test statistic value and the degrees of freedom. These two values should generally be close because the expected value of the chi-square distribution is equal to the degrees of freedom. The parametric approach
