Computational Statistics and Data Analysis. Identifiability of extended latent class models with individual covariates


Computational Statistics and Data Analysis 52 (2008) 5263-5268

Identifiability of extended latent class models with individual covariates

Antonio Forcina
Dipartimento di Economia, Finanza e Statistica, University of Perugia, via Pascoli, 06100, Perugia, Italy

Article info
Article history: Received 1 March 2007; received in revised form 17 April 2008; accepted 28 April 2008; available online 4 May 2008.

Abstract
Identifiability is examined for a very flexible family of latent class models introduced recently. These models allow for an association between selected pairs of response variables conditionally on the latent, and are based on logistic regression models, in terms of subject-specific covariates, both for the latent weights and for the conditional distributions of the response variables. Generalized logits (global or continuation, which are relevant with ordered categorical responses and involve comparisons of cumulated probabilities) may be used as an alternative to the usual local logits, which are log-linear. A compact matrix formulation for the Jacobian of the parametrization and a simple algorithm for checking local identifiability numerically are described. A few examples involving causal inference are examined.
© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Conventional latent class models (see for example Goodman (1974)) assume that the association structure of a set of observed discrete responses is caused by a discrete latent variable which affects each response separately.
Subsequently this approach has been extended in two directions: (i) by relaxing the assumption that the response variables are independent given the latent (Hagenaars (1988) used constrained latent log-linear models, while Yang and Becker (1997), Ip (2001) and Bartolucci and Forcina (2005), among others, used marginal association parameters), thus allowing, for instance, a limited number of bivariate associations conditionally on the latent; (ii) by allowing the marginal distribution of the latent and the conditional distribution of the responses to depend on individual covariates as in a generalized linear model (Dayton and Macready, 1988; Formann, 1992; Melton et al., 1994; Huang and Bandeen-Roche, 2004; Bartolucci and Forcina, 2006).

While latent class models provide powerful tools for building flexible probabilistic explanations of the data, they are also extremely vulnerable with respect to identifiability. Most software for fitting latent class models can compute an estimate of the information matrix, which provides an indirect test of local identifiability at the maximum likelihood estimate; see for example Latent GOLD (Vermunt and Magidson, 2005, p. 55). Local identifiability (see Section 3 for a formal definition) is a crucial issue for the interpretation of results and the validity of asymptotic approximations. When latent class models are used to formulate problems of causal inference (which are meaningful only if the model is identifiable), one may want to assess the identifiability of different models even before collecting the data. For most conventional latent class models it is well known which are identifiable and which are not. For the class of extended models discussed in this paper, instead, results are available only in very special cases.
Catchpole and Morgan (1997) investigate the closely related notion of parametric redundancy and suggest a symbolic algorithm for checking when this deficiency is present; however, the actual implementation of such algorithms is not at all simple in the context described above, where the derivation of analytical results is, in most instances, a very challenging task. Instead, as we argue below, numerical tests of parametric redundancy based on the Jacobian of the transformation between the vector of canonical parameters of the manifest distribution and the vector of regression parameters are simple and fast; thus they may be used as a reliable diagnostic of model identifiability.

E-mail address: forcina@stat.unipg.it.

The family of latent class models proposed by Bartolucci and Forcina (2006) is taken as the basis for the present investigation and its basic properties are described briefly in Section 2. The connection between local identifiability and parametric redundancy is recalled in Section 3. A simple matrix formulation for the Jacobian of the transformation between the canonical parameters of the manifest distribution and the regression parameters is provided and the properties of a numerical test of model identifiability are examined. A few applications are discussed in Section 4; some of these concern problems of causal inference.

2. Extended latent class models

The family of latent class models presented below is essentially identical to the one proposed by Bartolucci and Forcina (2006) in the context of capture-recapture data. Let $Y_1, \dots, Y_K$ be discrete response variables, with $Y_j$ having $k_j$ categories. Suppose that observations are available on $n$ individuals, each characterized by a possibly distinct vector of covariates $z_i$. Let $r$ denote any of the $t = \prod_j k_j$ possible response configurations and let $q_r(z) = \Pr(Y_1 = r_1, \dots, Y_K = r_K \mid z)$ denote the corresponding cell probabilities. These probabilities (usually called manifest, in contrast to the latent ones to be introduced below) may be arranged into the $t \times 1$ vector $q(z)$ in lexicographic order by letting the elements of $r$ with a given index $j$ run from 1 to $k_j$ while those with a smaller index are kept fixed. Let $\gamma(z)$ denote a vector of canonical parameters for the saturated model of the corresponding multinomial distribution; these may be defined as $\gamma(z) = H \log[q(z)]$, where $H$ is any matrix of $t - 1$ linearly independent row contrasts (see for example Bartolucci et al. (2007), p. 699). Let $U$ denote a discrete latent variable with $c$ categories, which is assumed to explain the dependence among the response variables due to individual heterogeneity.
Let $\pi_{u,r}(z)$ denote the joint distribution of $(U, Y_1, \dots, Y_K) \mid z$; these probabilities may be arranged into the vector $\pi(z)$, again in lexicographic order, by letting $U$ run slowest from 1 to $c$. Clearly, the manifest probabilities $q(z)$ may be obtained from the latent ones by marginalization: $q(z) = L\pi(z)$, where $L = 1_c' \otimes I_t$. Conventional latent class models assume that $Y_1, \dots, Y_K$ are independent conditionally on $U, z$. As mentioned in the introduction, several extended models have been proposed which relax this assumption by allowing certain variables to be associated even conditionally on covariates and the latent.

In the formulation adopted in this paper the task of defining a relevant family of latent class models with a parsimonious dependence structure is achieved in two steps. First, it is assumed that the latent multinomial distribution is determined by a vector of canonical parameters $\theta(z) = H \log[\pi(z)]$ which corresponds to a hierarchical log-linear model and has a dimension $v$ considerably smaller than $ct - 1$. This is equivalent to assuming that there are $(ct - 1) - v$ higher order interactions constrained to 0. Secondly, the parameters of interest are defined as regression parameters which determine how marginal logits and log-odds ratios depend on covariates; this is explained in more detail in Section 2.1.

As concerns the first step, let $G$ be the right inverse of $H$; the joint probabilities may be computed from the canonical parameters by the following reconstruction formula:

$$\pi(z) = \frac{\exp[G\theta(z)]}{1_{ct}' \exp[G\theta(z)]}.$$

Canonical parameters can, alternatively, be defined by writing explicitly the design matrix $G$ of the log-linear model; a simple way of defining $G$ is given by Bartolucci et al. (2007, p. 699). Then $H$ may be determined as

$$H = \left(G'G - G' 1_{ct} 1_{ct}' G / ct\right)^{-1} \left(G' - G' 1_{ct} 1_{ct}' / ct\right).$$
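As a hedged numerical check of the reconstruction formula and of the expression for $H$, the following minimal Python/NumPy sketch builds $G$ for a hypothetical model with a binary latent and two binary responses that are conditionally independent given the latent. The dummy coding of $G$, the parameter values and all variable names are assumptions made for illustration; they are not the paper's Matlab functions.

```python
import itertools
import numpy as np

# Cells (u, y1, y2) in lexicographic order, with U running slowest.
cells = list(itertools.product([0, 1], repeat=3))

# Assumed dummy-coded log-linear design: U, Y1, Y2, U*Y1, U*Y2 (v = 5).
G = np.array([[u, a, b, u * a, u * b] for u, a, b in cells], dtype=float)
ct = G.shape[0]                      # c * t = 2 * 4 = 8 cells
one = np.ones((ct, 1))

def reconstruct(theta):
    """Reconstruction formula: pi = exp(G theta) / (1' exp(G theta))."""
    w = np.exp(G @ theta)
    return w / w.sum()

# H = (G'G - G'11'G/ct)^{-1} (G' - G'11'/ct), written with the
# centering matrix I - 11'/ct and solved as a linear system.
center = np.eye(ct) - one @ one.T / ct
H = np.linalg.solve(G.T @ center @ G, G.T @ center)

theta = np.array([0.4, -0.3, 0.8, 1.1, -0.6])   # arbitrary illustrative values
pi = reconstruct(theta)

# Marginalization over the latent: q = L pi with L = 1_c' (x) I_t.
L = np.kron(np.ones((1, 2)), np.eye(4))
q = L @ pi

print(np.allclose(H @ G, np.eye(5)))         # G is a right inverse of H
print(np.allclose(H @ np.log(pi), theta))    # theta = H log(pi)
```

Solving the linear system rather than forming the inverse explicitly is only a numerical convenience; it yields the same matrix $H$ as the formula.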
A visual display of the matrix $G$ (and of the matrices $C$, $M$ to be introduced below) in a typical example is available online.

2.1. Marginal models

The marginal parametrization adopted here is based on the work of Bergsma and Rudas (2002), who studied the properties of a class of models for categorical data which are, essentially, a combined set of log-linear models, each defined within a different marginal distribution. These models may be useful, for example, when certain univariate and bivariate marginals are of direct interest. In addition, with ordered categorical variables, it may be more meaningful to use parameters based on global or continuation logits, which are not log-linear (see for example Agresti (1990)), as an alternative to ordinary logits.

In the context of latent class models with continuous covariates and a few bivariate associations, marginal models offer, for example, the ability to formulate a regression model for the logits of any response variable conditionally on the latent but marginal with respect to the other response variables. This will usually be more meaningful than the logits of the same response conditionally on other responses in addition to the latent. In the context of causal inference they provide a direct way of modeling: (i) the marginal distribution of the latent given covariates, (ii) the distribution of the endogenous treatment given the latent and the covariates and (iii) the conditional distribution of each response given the latent, treatment and covariates.

It is well known (see for instance Lang (1996), p. 726) that any set of marginal parameters may be computed from the general formula

$$\eta(z) = C \log[M\pi(z)],$$

where $C$ is a matrix of $v$ row contrasts and $M$ is a matrix of 0s and 1s which selects the cell probabilities to be cumulated or marginalized. The $C$ and $M$ matrices are determined by the following elements:

- The number of categories of each variable.
- The type of logits used for each variable: local, global or continuation.
- A list of the marginal parameters of interest; this may be coded into a matrix where each row corresponds to a type of parameter and each column to a variable. Within each row a variable may be coded as 1 if active, as 0 if marginalized and as 2 if parameters of active variables are to be computed conditionally on each possible configuration of the conditioning variables. When only one variable is active, the row defines a set of univariate logits; when two variables are active, the row defines a set of log-odds ratios.

A set of Matlab functions for constructing these matrices is available from matfun.pdf

2.2. Linear models

The choice of a specific marginal parametrization is equivalent to the choice of a link function in a generalized linear model; as such it requires a suitable design matrix that specifies how each marginal parameter depends on covariates. Because the response distribution is multivariate, each individual (unit) contributes a whole vector of linear predictors $\eta_i(z_i) = X_i\beta$, where $X_i$ is a regression matrix whose elements depend on the vector of covariates $z_i$ and $\beta$ is a vector of parameters composed of intercepts and regression coefficients. Usually $X_i$ will be block diagonal with a block for each component of the model: (i) marginal distribution of the latent, (ii) conditional distribution of an endogenous treatment given the latent, (iii) univariate marginal distribution of responses given the latent and the treatment, (iv) specific bivariate marginal interactions between responses and treatment given the latent.
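As a hedged illustration of such a regression matrix, consider a hypothetical model with a binary latent $U$, two binary responses and two covariates: $x_{i1}$ acts on the logit of $U$ and $x_{i2}$ acts on the conditional logits of both responses, with a slope that is the same in the two latent classes. The ordering of $\eta_i$ and $\beta$ and the symbols $\alpha$, $\delta$ are assumptions introduced here, not notation from the paper.

```latex
% Hypothetical illustration: eta_i stacks logit P(U=1), then
% logit P(Y_1=1 | U=u) and logit P(Y_2=1 | U=u) for u = 0, 1.
\eta_i = X_i \beta, \qquad
X_i =
\begin{pmatrix}
1 & x_{i1} & 0 & 0 & 0      & 0 & 0 & 0      \\
0 & 0      & 1 & 0 & x_{i2} & 0 & 0 & 0      \\
0 & 0      & 0 & 1 & x_{i2} & 0 & 0 & 0      \\
0 & 0      & 0 & 0 & 0      & 1 & 0 & x_{i2} \\
0 & 0      & 0 & 0 & 0      & 0 & 1 & x_{i2}
\end{pmatrix},
\qquad
\beta = (\alpha_U, \delta_U, \alpha_{1|0}, \alpha_{1|1}, \delta_1,
         \alpha_{2|0}, \alpha_{2|1}, \delta_2)'.
```

The three diagonal blocks (of sizes $1 \times 2$, $2 \times 3$ and $2 \times 3$) correspond to the latent and to the two responses, and $\beta$ has 8 elements. Within each response block the slope column has two non-zero entries because the slope is shared by the two latent classes.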
The block diagonal structure follows from the fact that each of these components depends on distinct parameters. However, when experimenting with parsimonious models, it may be meaningful to impose linear constraints that involve parameters belonging to different blocks. For example, when modeling the univariate marginal distribution of certain response variables conditionally on the latent, one could constrain the slope coefficients to be constant across latent classes. Implementation of such constraints produces columns having non-zero elements in different blocks. In the following let $X$ be the matrix obtained by stacking the matrices $X_i$, $i = 1, \dots, n$, one above the other. We assume that this matrix is of full column rank r.

3. Assessment of local identifiability

For notational simplicity, in the following we will write $\gamma_i$ instead of $\gamma(z_i)$ to denote the vector of canonical parameters for the $i$-th individual with covariate values $z_i$. A similar convention will be used for vectors and matrices associated with the probability distribution conditionally on $z_i$. We will also write $\gamma$ to denote the vector obtained by stacking the vectors $\gamma_i$, $i = 1, \dots, n$, one above the other. By a similar convention we will write $\theta$, $\pi$, $\eta$. Though the elements of these vectors are functions of $\beta$, this dependence will be marked explicitly only when ambiguity may arise. Following Catchpole and Morgan (1997, p. 187) we recall the following definition.

Definition 1. A model is said to be locally identifiable if, for any $\beta_0$, every $\beta \neq \beta_0$ for which $\gamma(\beta) = \gamma(\beta_0)$ satisfies $\lVert \beta - \beta_0 \rVert > \delta$ for some $\delta > 0$.

If this condition were violated at a parameter value $\beta_0$, there would exist a neighborhood of $\beta_0$ whose points correspond to the same manifest distribution. As a consequence, the likelihood function would be flat around $\beta_0$ and the information matrix computed at $\beta_0$ would be singular (Catchpole and Morgan, 1997, Theorems 2 and 3).
As explained below, following Catchpole and Morgan (1997), local identifiability is closely related to the rank of the matrix of derivatives of the canonical parameters of the manifest distribution with respect to the regression parameters $\beta$.

3.1. The Jacobian matrix

The matrix of derivatives of $\gamma$ with respect to $\beta$ may be computed by the chain rule as

$$D = \frac{\partial \gamma}{\partial \beta'} = \frac{\partial \gamma}{\partial \theta'} \frac{\partial \theta}{\partial \eta'} \frac{\partial \eta}{\partial \beta'} = QRX;$$

because the canonical parameters for the multinomial distributions of different individuals are distinct, $Q$ and $R$ are block diagonal matrices, so that

$$D = \begin{pmatrix} Q_1 R_1 X_1 \\ \vdots \\ Q_n R_n X_n \end{pmatrix}.$$

The first two factors within each row may be computed, again by the chain rule, as follows:

$$Q_i = \frac{\partial \gamma_i}{\partial q_i'} \frac{\partial q_i}{\partial \pi_i'} \frac{\partial \pi_i}{\partial \theta_i'} = H \,\mathrm{diag}(q_i)^{-1} L \Omega_i G,$$

$$R_i = \left[ \frac{\partial \eta_i}{\partial \pi_i'} \frac{\partial \pi_i}{\partial \theta_i'} \right]^{-1} = \left[ C \,\mathrm{diag}(M\pi_i)^{-1} M \Omega_i G \right]^{-1},$$

where $\Omega_i = \mathrm{diag}(\pi_i) - \pi_i \pi_i'$. The crucial assumption in the calculations above is the non-singularity of the matrix $R_i$; this follows from the fact that, within the class of marginal models considered here, there is a diffeomorphism between $\eta_i$ and $\theta_i$ (Bartolucci et al., 2007, Theorem 1). Thus, having assumed that $X$ is of full column rank, the $Q_i$ matrices are the only component which may induce rank deficiency. On the other hand, $D$ may still be of full rank even if the $Q_i$ matrices are not, because the presence of covariates may restore full rank and thus make identifiable a model which would not be identifiable within a single stratum (see the examples in Section 4).

The results of Catchpole and Morgan (1997, Theorem 4) imply that $D$ being of full rank for every admissible $\beta$ is a necessary and sufficient condition for the model to be locally identifiable. Thus, to show that a model is not locally identifiable, it is sufficient to find a single $\beta$ for which $D$ is not of full rank; on the other hand, the fact that $D$ is of full rank on a grid of $\beta$ values may only provide substantial evidence that the model is likely to be locally identifiable. For a numerical test, it is much easier to establish lack of identifiability than its opposite, because there may exist parameter points where local identifiability fails even if we have been unable to find one. The strategy for a numerical assessment of local identifiability proposed here is to randomly sample a sufficiently large set of parameter points and to examine the distribution of the inverse condition number; this is below [value missing in source] when the matrix is rank deficient.
Thus, if, say, on 20,000 points the inverse condition number never goes below $10^{-10}$, we may conclude, with reasonable confidence, that the model is locally identified with probability close to one.

3.2. Computational issues

The web page describes a set of MatLab functions, available at the same address, which perform the following tasks:

(1) Computation of the design matrix $G$: this requires a vector containing the number of categories for the latent and each response variable and a set of generators defining the maximal interactions to be included in the log-linear model; the matrix of generators is a binary matrix with a generator in each row.
(2) Computation of the $C$ and $M$ matrices: this requires the specification of a marginal parametrization as described above.
(3) Computation of the $X_i$ matrices: this may be performed by a user-defined function with $z_i$ as input argument. This function must specify which covariates affect each marginal parameter. The same function may also be used to impose suitable restrictions, for instance that certain marginal parameters are equal or have equal intercepts.

4. Some examples

In the following a few examples are described where the presence of covariates seems to restore identifiability of models which, otherwise, would not be identifiable. Within each example, local identifiability is tested by drawing a sample of 20,000 parameter points $\beta$ from a $N(0_b, 4I_b)$, where $b$ is the size of $\beta$. When covariates are involved, 5 observations for each covariate are sampled independently from a $N(0, 4)$. A set of MatLab functions to replicate each example is provided on the same web site mentioned above.

Simple latent class. Suppose that $Y_1, Y_2$ are binary response variables conditionally independent given a binary latent $U$; it is well known that this model is not identifiable. In fact the manifest distribution is determined by 3 canonical parameters, while the latent has 5 parameters (the marginal logit of $U$ and two conditional logits for each response given $U$).
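The parameter count just given can be turned into a small numerical experiment. The sketch below, in Python with NumPy, is an illustration under stated assumptions rather than the author's Matlab code: it hard-codes $G$, $L$, $H$, $C$ and $M$ for the simple latent class model, evaluates one block $Q_i R_i X_i$ of the Jacobian from the formulas of Section 3.1, and samples parameter points as described at the start of this section. It also includes a hypothetical variant with two covariates (one acting on the latent logit, one on the response logits with slopes common to the two latent classes). All function names, the dummy coding of $G$, the column ordering of the design matrix and the choice to redraw the covariates at every parameter draw are assumptions.

```python
import itertools
import numpy as np

# Cells (u, y1, y2) in lexicographic order, with U running slowest.
cells = list(itertools.product([0, 1], repeat=3))

# Assumed dummy-coded design for conditional independence given U.
G = np.array([[u, a, b, u * a, u * b] for u, a, b in cells], dtype=float)
L = np.kron(np.ones((1, 2)), np.eye(4))         # q = L pi
H = np.hstack([-np.ones((3, 1)), np.eye(3)])    # 3 independent row contrasts

def rows(pred):
    return [float(pred(*cell)) for cell in cells]

# M cumulates the probabilities entering each logit; C forms the logits:
# logit of U, then logit of Y1 | U=0, Y1 | U=1, Y2 | U=0, Y2 | U=1.
M = np.array(
    [rows(lambda u, a, b, k=k: u == k) for k in (0, 1)]
    + [rows(lambda u, a, b, j=j, k=k, y=y: u == k and (a, b)[j] == y)
       for j in (0, 1) for k in (0, 1) for y in (0, 1)]
)
C = np.kron(np.eye(5), np.array([[-1.0, 1.0]]))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint(eta):
    """Joint probabilities pi of (U, Y1, Y2) implied by the five logits."""
    sign = lambda v: 2 * v - 1                   # +1 for category 1, -1 for 0
    return np.array([expit(sign(u) * eta[0])
                     * expit(sign(a) * eta[1 + u])
                     * expit(sign(b) * eta[3 + u]) for u, a, b in cells])

def design(x1, x2):
    """Hypothetical X_i: x1 acts on the latent, x2 on both responses."""
    X = np.zeros((5, 8))
    X[0, :2] = [1.0, x1]
    X[1, [2, 4]] = [1.0, x2]
    X[2, [3, 4]] = [1.0, x2]                     # same slope in both classes
    X[3, [5, 7]] = [1.0, x2]
    X[4, [6, 7]] = [1.0, x2]
    return X

def jac_block(eta, X):
    """One block Q_i R_i X_i of the Jacobian D."""
    pi = joint(eta)
    dpi = (np.diag(pi) - np.outer(pi, pi)) @ G   # Omega_i G
    Q = H @ np.diag(1.0 / (L @ pi)) @ L @ dpi
    R = np.linalg.inv(C @ np.diag(1.0 / (M @ pi)) @ M @ dpi)
    return Q @ R @ X

def jacobian(beta, covariates=None):
    """D without covariates (beta holds the 5 logits) or stacked over units."""
    if covariates is None:
        return jac_block(beta, np.eye(5))
    return np.vstack([jac_block(design(x1, x2) @ beta, design(x1, x2))
                      for x1, x2 in covariates])

def inverse_condition(D):
    """Smallest over largest singular value; 0 if D has too few rows."""
    if D.shape[0] < D.shape[1]:
        return 0.0
    s = np.linalg.svd(D, compute_uv=False)
    return s[-1] / s[0]

def min_inverse_condition(n_draws=200, n_obs=5, seed=0):
    """Smallest inverse condition number of D over random draws."""
    rng = np.random.default_rng(seed)
    worst = 1.0
    for _ in range(n_draws):
        beta = rng.normal(0.0, 2.0, size=8)      # N(0, 4): standard deviation 2
        z = rng.normal(0.0, 2.0, size=(n_obs, 2))
        worst = min(worst, inverse_condition(jacobian(beta, z)))
    return worst
```

Calling `jacobian(eta)` with five logits returns the $3 \times 5$ matrix of the model without covariates, for which `inverse_condition` is 0 by construction. Calling `min_inverse_condition(20000)` mimics the sampling scheme described in this section for the covariate variant; the default of 200 draws is only meant for a quick trial.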
Here $D$ is a $3 \times 5$ matrix and thus cannot be of full column rank. Now suppose that there are two covariates, $X_1$ affecting the logits of the latent and $X_2$ affecting the conditional logits of both responses. Under the assumption that the regression coefficient for the conditional logit of $Y_j \mid U$ does not depend on $U$, there are 8 marginal parameters and the model seems to be identifiable even if observations are available on very few individuals. The distribution of the inverse condition number of $D$ is given in Table 1; because Matlab works with a precision of at least $10^{-12}$, this indicates that it is close to absolute certainty that no parameter values in the range $\pm 10$ can make the matrix $D$ rank deficient.

Causal inference. Consider a context where a binary treatment $T$ may affect two binary response variables. If there is a binary latent $U$ which is assumed to affect both the treatment and the responses, the model is not identifiable. This is obvious because $\gamma$ for the joint distribution of $T, Y_1, Y_2$ has dimension 7 while the simplest latent class model has 9 parameters: 1 for the marginal of $U$, 2 for the conditional distribution of $T \mid U$ and 3 for the distribution of each response $Y_j \mid T, U$ (under the assumption that $T, U$ act additively on $Y_j$). Now assume that three covariates are available: $X_U$ which affects the latent, $X_T$ which affects the assignment of treatment and $X_Y$ which affects both responses. In the regression model we assume that the effect of $T$ on the responses may be different for the two latent classes but the effect of covariates does not depend on the latent. This model has 15 parameters, of which 11 are intercepts (1 for $U$, 2 for $T \mid U$ and 4 for each $Y_j \mid T, U$) and 4 are regression parameters concerning the effect of $X_U$ on $U$, $X_T$ on $T$ and $X_Y$ on $Y_1$ and $Y_2$. The results of our numerical test are given in Table 1 and indicate that the probability that the model is locally identifiable is close to absolute certainty.

Table 1. Frequency distribution of the inverse condition number of the $D$ matrix for the latent class models described in the text.
[Table body not recoverable from this copy: the rows are Simple latent class, Causal inference and Conditional dependence; the first column is headed $< 10^{-8}$; the surviving counts are truncated.]

Conditional dependence. Consider again a problem of causal inference with an endogenous treatment and three responses, all binary. Now assume that $Y_1, Y_2$ and $Y_1, Y_3$ are not independent given $U, T$ though, for simplicity, we assume that the two log-odds ratios do not depend on $U, T$. This model has 17 parameters while the canonical parameter of the manifest distribution of $T, Y_1, Y_2, Y_3$ has size 15. Now assume again that there are three covariates, affecting the latent, the treatment and the responses, and that the regression coefficients do not depend on the latent. This model has 22 parameters and, according to the simulation reported in Table 1, it should be locally identifiable.

The results summarized in Table 1 suggest the following strategy for assessing the local identifiability of a given latent class model numerically. Start with only a few sample points (this would normally take a few seconds); if an instance of rank deficiency is detected, the model is not identifiable. Otherwise, the model is probably locally identified. Almost absolute certainty may be achieved by a simulation of the kind reported in Table 1, which would normally take just a few minutes.

5. Discussion

As regards the flexibility of the latent class models considered here, it may be worth noting that log-linear parameters are a special case of marginal parameters where all variables are either active or conditioning. Thus the models considered by Hagenaars (1988) and their extension by Vermunt (1996) belong to the setting considered here. Though, for the sake of simplicity, only bivariate associations conditional on the latent have been considered explicitly, the definition of the $C$, $M$ matrices and the results concerning non-singularity of the $R_i$ matrices hold irrespective of the conditional association structure allowed among responses. However, in practice, it is very unlikely that interactions higher than the third order will be of interest. Note in addition that models with a complex structure of conditional dependence are very unlikely to be identifiable.

The examples indicate that the presence of individual covariates may help in restoring identifiability. The price for this is in the restrictions imposed by the linear model. It may be useful to compare a model where covariates are assumed to affect marginal logits linearly with a model where individuals are grouped into strata sharing similar covariate configurations and no restriction is imposed across strata; such a model would require a much larger number of parameters and would probably not be identifiable in most instances.

The approach advocated here has some advantages with respect to the usual diagnostic based on the information matrix:

- it could be performed before collecting the data;
- it is very fast and computationally efficient because it does not require the computation of the maximum likelihood estimate or even writing the likelihood function;
- the matrix $D$ is the crucial component; it is required to compute the information matrix, whose rank depends on the rank of $D$.
Finally note that the structure of the data may be such that the maximum likelihood estimate is close to the boundary of the parameter space, where identifiability may fail even if the model is identifiable for a wide range of parameter values. Thus, the test proposed here cannot ensure that the information matrix is far from being singular when the maximum is close to the boundary of the parameter space.

Acknowledgments

The author would like to thank F. Bartolucci and two referees for helpful comments. The author's work was supported by the Italian MIUR funds.

References

Agresti, A., 1990. Categorical Data Analysis. Wiley and Sons, New York.
Bartolucci, F., Colombi, R., Forcina, A., 2007. An extended class of marginal link functions for modeling contingency tables by equality and inequality constraints. Statistica Sinica 17.
Bartolucci, F., Forcina, A., 2005. Likelihood inference on the underlying structure of IRT models. Psychometrika 70.

Bartolucci, F., Forcina, A., 2006. A class of latent marginal models for capture-recapture data with continuous covariates. Journal of the American Statistical Association 101.
Bergsma, W., Rudas, T., 2002. Marginal models for categorical data. Annals of Statistics 30.
Catchpole, E.A., Morgan, B.J.T., 1997. Detecting parameter redundancy. Biometrika 84.
Dayton, C.M., Macready, G.B., 1988. Concomitant-variables latent class models. Journal of the American Statistical Association 83.
Formann, A.K., 1992. Linear logistic latent class analysis for polytomous data. Journal of the American Statistical Association 87.
Goodman, L., 1974. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61.
Hagenaars, J.A., 1988. Latent structure models with direct effects between indicators: local dependence models. Sociological Methods & Research 16.
Huang, G., Bandeen-Roche, K., 2004. Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika 69.
Ip, E.H., 2001. Testing for local dependency in dichotomous and polytomous item response models. Psychometrika 66.
Lang, J.B., 1996. Maximum likelihood methods for a generalized class of log-linear models. Annals of Statistics 24.
Melton, B., Liang, K.Y., Pulver, A.E., 1994. Extended latent class approach to the study of familial/sporadic forms of a disease: Its application to the study of the heterogeneity of schizophrenia. Genetic Epidemiology 11.
Vermunt, J.K., 1996. Log-linear event history analysis: A general approach with missing data, unobserved heterogeneity, and latent variables. Ph.D. Thesis. Tilburg University Press, Tilburg, 350 pages.
Vermunt, J.K., Magidson, J., 2005. Technical Guide for Latent GOLD 4.0: Basic and Advanced. Statistical Innovations Inc., Belmont, MA.
Yang, I., Becker, M.P., 1997. Latent variable modeling of diagnostic accuracy. Biometrics 53.
