Latent class analysis and finite mixture models with Stata


Isabel Canette, Principal Mathematician and Statistician, StataCorp LLC
2017 Stata Users Group Meeting, Madrid, October 19th, 2017

Introduction

Latent class analysis (LCA) comprises a set of techniques used to model situations where there are different subgroups of individuals and group membership is not directly observed. For example:

- Social sciences: a population where different subgroups have different motivations to drink.
- Medical sciences: using available data to identify subgroups of risk for diabetes.
- Survival analysis: subgroups that are vulnerable to different types of risks (competing risks).
- Education: identifying groups of students with different learning skills.
- Market research: identifying different kinds of consumers.

The scope of the term latent class analysis varies widely from source to source. Collins and Lanza (2010) discuss some of the models that are usually considered LCA. They also point out: "In this book, when we refer to latent class models we mean models in which the latent variable is categorical and the indicators are treated as categorical."

In Stata, we use LCA to refer to a wide array of models where there are two or more unobserved classes:

- Dependent variables might follow any of the distributions supported by gsem, such as logistic, Gaussian, Poisson, multinomial, negative binomial, Weibull, etc. (see help gsem family and link options).
- There might be covariates (categorical or continuous) to explain the dependent variables.
- There might be covariates to explain class membership.

Stata adopts a model-based approach to LCA. In this context, we can see LCA as group analysis where the groups are unknown. Let's see an example, first with groups and then with classes.

Below we use the group() option to fit regressions of weight on age to the childweight data, with a different regression per sex:

    . gsem (weight <- age), group(girl) ginvariant(none) ///
    >      vsquish nodvheader noheader nolog

    Group : boy                              Number of obs = 100

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    weight
               age   3.481124   .1987508    17.52   0.000     3.09158    3.870669
             _cons   5.438747   .2646575    20.55   0.000    4.920028    5.957466
    var(e.weight)      2.4316   .3438802                     1.842952    3.208265

    Group : girl                             Number of obs = 98

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    weight
               age   3.250378   .1606456    20.23   0.000    2.935518    3.565237
             _cons   4.955374   .2152251    23.02   0.000    4.533541    5.377207
    var(e.weight)    1.560709   .2229585                     1.179565     2.06501

Group analysis allows us to make comparisons between these equations and to easily set some parameters to be common across groups (see help gsem group options).
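As a hedged illustration of the kind of comparison mentioned above, the sketch below refits the group model with the age coefficient constrained to be equal across groups and compares it with the unconstrained fit through a likelihood-ratio test. It assumes the childweight data can be loaded with webuse; the stored-estimate names free and common are arbitrary choices for this example.

```stata
* Assumption: the childweight data are available through webuse
webuse childweight, clear

* Unconstrained model: every parameter differs between boys and girls
quietly gsem (weight <- age), group(girl) ginvariant(none)
estimates store free

* Same model with the age coefficient held equal across groups;
* intercepts and error variances remain group specific
quietly gsem (weight <- age), group(girl) ginvariant(coef)
estimates store common

* Likelihood-ratio test of the equality constraint
lrtest free common
```

A small p-value would suggest that the slopes differ between boys and girls; a large one would suggest that a common slope is adequate.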

Now let's assume that we have the same data, but we don't have the variable girl. We suspect that there are two groups that behave differently.

    . gsem (weight <- age), lclass(C 2) lcinvariant(none) ///
    >      vsquish nodvheader noheader nolog

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    1.C              (base outcome)
    2.C
             _cons   .5070054   .2725872     1.86   0.063   -.0272557    1.041267

    Class : 1

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    weight
               age   5.938576   .2172374    27.34   0.000    5.512798    6.364353
             _cons     3.8304   .2198091    17.43   0.000    3.399582    4.261218
    var(e.weight)    .6766618   .1817454                     .3997112    1.145505

    Class : 2

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    weight
               age    2.90492   .2375441    12.23   0.000    2.439342    3.370498
             _cons   5.551337   .4567506    12.15   0.000    4.656122    6.446551
    var(e.weight)     1.52708   .2679605                     1.082678    2.153893

The second table of the LCA output has the same structure as the output from the group model. In addition, the LCA output starts with a table corresponding to the class estimation. This is a binary (logit) model used to find the two classes.

In the latent class model, all the equations are estimated jointly and all parameters affect each other, even when we estimate different parameters per class.

How do we interpret these classes? We need to analyze our classes and see how they relate to other variables in the data. We might also interpret our classes in terms of a previous theory, provided that our analysis is in agreement with the theory. We will see postestimation commands that implement the usual tools used for this task.
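One way to see how the estimated classes relate to other variables can be sketched as follows: assign each observation to its most likely class and cross-tabulate that assignment with the observed variable girl. This is only an illustration; it assumes the childweight data are available through webuse, and the names cpost and predclass are made up for this example.

```stata
* Assumption: the childweight data are available through webuse
webuse childweight, clear

* Two-class model, all parameters free across classes
quietly gsem (weight <- age), lclass(C 2) lcinvariant(none)

* Posterior class membership probabilities: creates cpost1 and cpost2
predict cpost*, classposteriorpr

* Assign each observation to the class with the larger posterior probability
generate byte predclass = cond(cpost2 > cpost1, 2, 1)

* Compare the estimated classes with the observed grouping
tabulate girl predclass
```

Nothing forces the latent classes to line up with sex, so the cross-tabulation is a diagnostic rather than a validation.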

Latent class analysis in Stata is an extension of classic latent class analysis. The Stata documentation and formulas refer to the general model, so they don't match the notation and approach you will see in the classic LCA literature (though the results match). We'll introduce the classic approach to LCA and discuss how the Stata approach generalizes it.

Example: Role conflict dataset

    . use gsem_lca1
    (Latent class analysis)

    . notes in 1/4

    _dta:
      1. Data from Samuel A. Stouffer and Jackson Toby, March 1951, "Role conflict and personality", _The American Journal of Sociology_, vol. 56 no. 5, 395-406.
      2. Variables represent responses of students from Harvard and Radcliffe who were asked how they would respond to four situations. Respondents selected either a particularistic response (based on obligations to a friend) or universalistic response (based on obligations to society).
      3. Each variable is coded with 0 indicating a particularistic response and 1 indicating a universalistic response.
      4. For a full description of the questions, type "notes in 5/8".

    . describe

    Contains data from gsem_lca1.dta
      obs:           216                          Latent class analysis
     vars:             4                          10 Oct 2017 12:46
     size:           864                          (_dta has notes)

                   storage   display    value
    variable name   type     format     label      variable label
    accident        byte     %9.0g                 would testify against friend in accident case
    play            byte     %9.0g                 would give negative review of friend's play
    insurance       byte     %9.0g                 would disclose health concerns to friend's insurance company
    stock           byte     %9.0g                 would keep company secret from friend

    Sorted by: accident play insurance stock

    . list in 120/121

           accident   play   insura~e   stock
    120.          1      0          1       1
    121.          1      1          0       0

For each observation, we have a vector of responses $Y = (Y_1, Y_2, Y_3, Y_4)$ (I am omitting an observation index).

Classic approach

Let's assume that we have two classes, $C_1$ and $C_2$. The probability of $Y$ taking a value $y$ can be expressed as:

$$P(Y = y) = P(Y = y \mid C_1)\, P(C_1) + P(Y = y \mid C_2)\, P(C_2)$$

which, under the assumption of conditional independence, is:

$$P(Y = y) = \prod_{j=1}^{4} P(Y_j = y_j \mid C_1)\, P(C_1) + \prod_{j=1}^{4} P(Y_j = y_j \mid C_2)\, P(C_2)$$

In short, the likelihood contribution for an observation would be:

$$L = \sum_{k=1}^{2} \prod_{j=1}^{4} P(Y_j = 1 \mid C_k)^{y_j} \left(1 - P(Y_j = 1 \mid C_k)\right)^{1 - y_j} P(C_k)$$

Maximizing the sum of the log-likelihood contributions from all observations, we obtain estimates of $P(Y_j = r_j \mid C_k)$ and $P(C_k)$. In the literature, you will see generalizations of this formula, like

$$L = \sum_{k=1}^{m} \prod_{j=1}^{4} \prod_{r_j=1}^{R_j} P(Y_j = r_j \mid C_k)^{I(y_j = r_j)}\, P(C_k)$$

where $r_j = 1, \ldots, R_j$ are the possible values for variable $Y_j$.
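As a worked instance of the two-class formula, consider the response pattern $y = (1, 0, 1, 1)$ (the pattern of observation 120 in the listing above). The observation index $i$ and sample size $n$ in the second line are notation introduced here for illustration; the slides omit the observation index.

```latex
% Likelihood contribution for the response pattern y = (1, 0, 1, 1)
L = \sum_{k=1}^{2}
    P(Y_1 = 1 \mid C_k)\,
    \bigl(1 - P(Y_2 = 1 \mid C_k)\bigr)\,
    P(Y_3 = 1 \mid C_k)\,
    P(Y_4 = 1 \mid C_k)\,
    P(C_k)

% Objective that is maximized: the sum of the log contributions
% over the n observations (i indexes observations)
\ell = \sum_{i=1}^{n} \log L_i
```

Each exponent $y_j$ or $1 - y_j$ in the general formula simply selects either the probability of a positive response or its complement.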

Stata (model-based) approach

The description before corresponds to a nonparametric estimation: we estimate the probabilities directly, not through a parameterization. Now, how do we do it in Stata?

    . gsem (accident play insurance stock <- ), logit lclass(C 2)

We are fitting a logit model for each class, with no covariates. Because there are no covariates, estimating the constant is equivalent to estimating the probability: $p = F(\text{constant})$, where $F$ is the inverse logit function.

The model-based approach can be represented as a mixture model:

$$L = f(y; \Theta_1)\, P(C_1) + f(y; \Theta_2)\, P(C_2)$$

where

$$f(y; \Theta_k) = \prod_{j=1}^{4} p_{jk}^{\,y_j} (1 - p_{jk})^{1 - y_j}$$

and $p_{jk}$ is expressed as

$$p_{jk} = \frac{\exp(\text{cons}_{jk})}{1 + \exp(\text{cons}_{jk})}$$

gsem also represents the class probabilities $P(C_k)$ with a logit model. By default, we are fitting the nonparametric model, but this flexibility allows us to include covariates to model the class membership probabilities, the conditional probabilities, or both. Now, let's fit the model.

    . gsem (accident play insurance stock <- ), logit lclass(C 2) ///
    >      vsquish nodvheader noheader nolog

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    1.C              (base outcome)
    2.C
             _cons  -.9482041   .2886333    -3.29   0.001   -1.513915   -.3824933

    Class : 1

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    accident
             _cons   .9128742   .1974695     4.62   0.000    .5258411    1.299907
    play
             _cons  -.7099072   .2249096    -3.16   0.002   -1.150722   -.2690926
    insurance
             _cons  -.6014307   .2123096    -2.83   0.005    -1.01755   -.1853115
    stock
             _cons  -1.880142   .3337665    -5.63   0.000   -2.534312   -1.225972

    Class : 2

                        Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    accident
             _cons   4.983017   3.745987     1.33   0.183   -2.358982    12.32502
    play
             _cons   2.747366   1.165853     2.36   0.018    .4623372    5.032395
    insurance
             _cons   2.534582   .9644841     2.63   0.009    .6442279    4.424936
    stock
             _cons   1.203416   .5361735     2.24   0.025    .1525356    2.254297
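Because the model has no covariates, each constant in the output above can be mapped to a probability with the inverse logit function. The sketch below does this by hand for the class-1 constants and for the constant of the 2.C equation; the numbers are copied from the output, and nothing here depends on the dataset being loaded.

```stata
* Class-1 constants copied from the estimation output above:
* invlogit() turns each constant into P(Y_j = 1 | C1)
display "accident:  " invlogit(.9128742)
display "play:      " invlogit(-.7099072)
display "insurance: " invlogit(-.6014307)
display "stock:     " invlogit(-1.880142)

* Constant of the 2.C equation: with two classes and class 1 as the
* base outcome, invlogit() of this constant is P(C2)
display "P(C2): " invlogit(-.9482041)
display "P(C1): " 1 - invlogit(-.9482041)
```

These values should agree, up to rounding, with the class-specific means and class probabilities that the postestimation commands report for this model.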

After our estimation, the predict command allows us to obtain many predictions.

Probabilities of positive outcome, conditional on class:

    P(Y_1 = 1 | C_2)                       predict pr1c, mu outcome(accident) class(2)
    P(Y_j = 1 | C_2), for all j            predict prc*, mu class(2)
    P(Y_1 = 1 | C_k), for all k            predict prc*, mu outcome(accident)
    P(Y_j = 1 | C_k), for all j, k         predict prc*, mu

Probabilities of positive outcome, marginal on class:

    P(Y_1 = 1)                             predict p1, mu outcome(1) pmarginal
    P(Y_j = 1), for all j                  predict p*, mu pmarginal

Prior probability of class membership, P(C_k):

    predict classpr*, classpr

Posterior probability of class membership (Bayes formula), P(C_k | Y = y):

    predict classpostpr*, classposteriorpr

To interpret the classes, we could compare the means of the (counterfactual) conditional probabilities for each answer in each class (the ones we get with predict by default). estat lcmean will do that.

    . estat lcmean

    Latent class marginal means              Number of obs = 216

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    1
        accident   .7135879   .0403588      .6285126    .7858194
            play   .3296193   .0496984      .2403572    .4331299
       insurance   .3540164   .0485528      .2655049    .4538042
           stock   .1323726   .0383331      .0734875    .2268872
    2
        accident   .9931933   .0253243      .0863544    .9999956
            play   .9397644   .0659957      .6135685    .9935191
       insurance   .9265309   .0656538      .6557086    .9881667
           stock    .769132   .0952072      .5380601    .9050206

"Marginal means" in the title refers to means averaged over the observations, but they are conditional on the class. The probability of giving a universalistic response to each question is higher in class 2 than in class 1.

We can also compute the predicted probabilities for each class. Prior probabilities are the ones predicted by the logistic model for the latent class, which (with no covariates) do not vary across observations.

    . predict classpr*, classpr

    . summ classpr*

        Variable    Obs       Mean   Std. Dev.        Min        Max
        classpr1    216   .7207538          0    .7207538   .7207538
        classpr2    216   .2792462          0    .2792462   .2792462

These means estimate the population expected values of these variables. The estimates and their confidence intervals can be obtained with estat lcprob.

    . estat lcprob

    Latent class marginal probabilities      Number of obs = 216

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    C
               1   .7207539   .0580926      .5944743    .8196407
               2   .2792461   .0580926      .1803593    .4055257
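The posterior probabilities returned by predict with the classposteriorpr option follow from the Bayes formula. As a hedged hand calculation, the sketch below applies that formula to the response pattern (1, 0, 1, 1), using the conditional probabilities reported by estat lcmean and the class probabilities reported by estat lcprob. The scalar names are arbitrary.

```stata
* P(Y = y | Ck) for y = (1, 0, 1, 1) under conditional independence,
* using the class-specific probabilities reported by estat lcmean
scalar f1 = .7135879 * (1 - .3296193) * .3540164 * .1323726
scalar f2 = .9931933 * (1 - .9397644) * .9265309 * .769132

* Prior class probabilities reported by estat lcprob
scalar prior1 = .7207539
scalar prior2 = .2792461

* Bayes formula: P(Ck | Y = y) = f_k * P(Ck) / sum over classes
scalar post1 = f1*prior1 / (f1*prior1 + f2*prior2)
scalar post2 = f2*prior2 / (f1*prior1 + f2*prior2)

display "P(C1 | y) = " post1
display "P(C2 | y) = " post2
```

After fitting the model, the same quantities can be requested with predict classpostpr*, classposteriorpr and inspected for an observation with this pattern, for example with list in 120.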

Stata provides some tools to evaluate goodness of fit:

    . estat lcgof

    Fit statistic              Value   Description
    Likelihood ratio
        chi2_ms(6)             2.720   model vs. saturated
        p > chi2               0.843
    Information criteria
        AIC                 1026.935   Akaike's information criterion
        BIC                 1057.313   Bayesian information criterion
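The degrees of freedom and the p-value of the likelihood-ratio test can be checked by hand. The sketch below assumes the usual parameter count for this model: four binary items give 2^4 - 1 = 15 free cell probabilities in the saturated model, and the two-class model has eight item constants plus one class constant.

```stata
* Saturated model: 2^4 - 1 = 15 free cell probabilities
* Two-class model: 8 item constants + 1 class constant = 9 parameters
display "degrees of freedom = " (2^4 - 1) - 9

* Upper-tail chi-squared probability for the reported statistic
display "p-value = " chi2tail(6, 2.720)
```

A large p-value here means that the two-class model is not rejected against the saturated model.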

Model with covariates: Geometry dataset (see Hagenaars and McCutcheon, 2002)

Variables pyit1 and pyit2 contain binary responses for two Pythagorean tests; alg is a score for a test on algebra. We fit three different models.

    . use algebra, clear

    . list in 1/5

         alg_sc~e   pyit1   pyit2   freq
      1.        0       0       0     61
      2.        0       0       1     24
      3.        0       1       0      9
      4.        0       1       1      6
      5.        1       0       0     92

    . expand freq
    (1,213 observations created)

Model 1: two classes are determined by the binary variables pyit1 and pyit2.

    . gsem (pyit1 pyit2 <-, logit), lclass(C 2)

Model 2: two classes are determined by the binary variables pyit1 and pyit2, and variable alg might contain helpful information to identify those groups.

    . gsem (pyit1 pyit2 <-, logit) (C <- alg), lclass(C 2)

Model 3: two classes are determined by the regressions of pyit1 and pyit2 on variable alg. We are accounting not only for variations in the response among groups, but also for how this response relates to the covariate.

    . gsem (pyit1 pyit2 <- alg, logit), lclass(C 2)

Model 1:

    . gsem (pyit1 pyit2 <-, logit), lclass(C 2) startvalues(randomid, draws(5) seed(23))

    . estat lcmean, vsquish

    Latent class marginal means              Number of obs = 1,241

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    1
           pyit1   .7707281   142.2577             0           1
           pyit2   .8156159   247.4665             0           1
    2
           pyit1   .1721594   253.6474             0           1
           pyit2   .2158945   146.3729             0           1

    . estat lcprob, vsquish

    Latent class marginal probabilities      Number of obs = 1,241

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    C
               1    .506648    241.258             0           1
               2    .493352    241.258             0           1

Model 2:

    . gsem (pyit1 pyit2 <-, logit) (C <- alg), lclass(C 2)

    . estat lcmean

    Latent class marginal means              Number of obs = 1,241

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    1
           pyit1   .1985894   .0236409      .1562666    .2489921
           pyit2   .3404315   .0202552      .3019188    .3811744
    2
           pyit1   .9923852   .0292546      .0619459    .9999961
           pyit2   .8545888   .0270487      .7932187    .9000403

    . estat lcprob

    Latent class marginal probabilities      Number of obs = 1,241

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    C
               1   .6512534   .0237176      .6034547    .6961911
               2   .3487466   .0237176      .3038089    .3965453

Model 3:

    . gsem (pyit1 pyit2 <- alg, logit), lclass(C 2) startvalues(randomid, draws(5) seed(15))

    . estat lcmean

    Latent class marginal means              Number of obs = 1,241

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    1
           pyit1   .5846306   .0193834      .5462094    .6220497
           pyit2   .6409796   .0220191       .596784     .682905
    2
           pyit1   .0633972   .0363614      .0199756    .1835298
           pyit2   .0618345    .036141      .0190673    .1826642

    . estat lcprob

    Latent class marginal probabilities      Number of obs = 1,241

                             Delta-method
                     Margin   Std. Err.     [95% Conf. Interval]
    C
               1   .7922178   .0294795       .728562     .844139
               2   .2077822   .0294795       .155861     .271438

From model 2, we see that variable alg helps us to identify groups with different scores. The identification of the high-score and low-score groups doesn't improve when accounting for their dependence on alg, suggesting there might be a different interpretation for the last model.
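One possible way to line the three models up side by side, which the slides do not show, is to store each fit and tabulate the log likelihoods and information criteria. The sketch assumes that the algebra dataset used in the slides is in the working directory (it is not a dataset known to ship with Stata), and the names model1 to model3 are arbitrary.

```stata
* Assumption: algebra.dta from the slides is in the working directory
use algebra, clear
expand freq

* Model 1: classes determined by the two binary items only
quietly gsem (pyit1 pyit2 <-, logit), lclass(C 2) ///
    startvalues(randomid, draws(5) seed(23))
estimates store model1

* Model 2: alg explains class membership
quietly gsem (pyit1 pyit2 <-, logit) (C <- alg), lclass(C 2)
estimates store model2

* Model 3: alg explains the responses within each class
quietly gsem (pyit1 pyit2 <- alg, logit), lclass(C 2) ///
    startvalues(randomid, draws(5) seed(15))
estimates store model3

* Log likelihood, number of parameters, AIC, and BIC for each model
estimates stats model1 model2 model3
```

Information criteria are only one input; the interpretation of the classes, as discussed above, still has to make substantive sense.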

Additional remarks:

- The order of the latent classes might vary when we vary the starting values.
- Fit the model repeatedly with different starting values to avoid local maxima.
- The conditional independence assumption might not hold. One way to account for dependence is to incorporate more discrete latent variables. Another way, for categorical responses, is to generate new categories from combinations of the correlated variables.
- Conditional independence is not necessary for Gaussian variables; we can include correlations among them.
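The advice about starting values can be sketched as a small loop that refits the role-conflict model with several random seeds and prints the log likelihood of each fit. The seeds 1 to 5 are arbitrary, and the sketch assumes the gsem_lca1 data can be loaded with webuse.

```stata
* Assumption: the gsem_lca1 data are available through webuse
webuse gsem_lca1, clear

* Refit the two-class model from different random starting values
forvalues s = 1/5 {
    quietly gsem (accident play insurance stock <- ), logit lclass(C 2) ///
        startvalues(randomid, draws(5) seed(`s'))
    display "seed `s': log likelihood = " e(ll)
}
```

If some fits end at a lower log likelihood than others, they are likely local maxima; fits that reach the same log likelihood may still label the classes in a different order.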

Concluding remarks:

- gsem offers a framework where we can fit models accounting for latent classes.
- Responses might take one or more of the distributions supported by gsem.
- We can fit nonparametric models by using only binary or categorical responses.
- We can also parameterize the responses and the probabilities of class membership by introducing covariates.
- Discrete latent variables might have more than two groups, and more than one latent variable might also be included.
- Some latent class models are a special case of finite mixture models. The fmm prefix allows us to fit finite mixture models for a variety of distributions.
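As a minimal sketch of the fmm prefix, the childweight regression from the beginning of the talk can be written as a two-component finite mixture of linear regressions. The example assumes the childweight data are available through webuse; the results are expected to be comparable to the two-class gsem fit, though this has not been checked here.

```stata
* Assumption: the childweight data are available through webuse
webuse childweight, clear

* Two-component finite mixture of linear regressions of weight on age
fmm 2: regress weight age

* Class probabilities and class-specific marginal means
estat lcprob
estat lcmean
```

The same postestimation commands used after gsem with lclass() are available after fmm, which makes it easy to move between the two.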