Log-mean linear regression models for binary responses with an application to multimorbidity

Size: px

Start display at page:

Download "Log-mean linear regression models for binary responses with an application to multimorbidity"

Eustacia Preston
5 years ago
Views:

1 Log-mean linear regression models for inary responses with an application to multimoridity arxiv: v3 [stat.me] 16 May 2016 Monia Lupparelli and Alerto Roverato March 30, 2016 Astract In regression models for categorical data a linear model is typically related to the response variales via a transformation of proailities called the link function. We introduce an approach ased on two link functions for inary data named log-mean (LM) and log-mean linear (LML), respectively. The choice of the link function plays a key role for the interpretation of the model, and our approach is especially appealing in terms of interpretation of the effects of covariates on the association of responses. Similarly to Poisson regression, the LM and LML regression coefficients of single outcomes are log-relative risks, and we show that the relative risk interpretation is maintained also in the regressions of the association of responses. Furthermore, certain collections of zero LML regression coefficients imply that the relative risks for joint responses factorize with respect to the corresponding relative risks for marginal responses. This work is motivated y the analysis of a dataset otained from a case-control study aimed to investigate the effect of HIV-infection on multimoridity, that is simultaneous presence of two or more noninfectious commoridities in one patient. Keywords: Categorical data; Link function; Multimoridity pattern; Relative risk; Response association. epartment of Statistical Sciences, University of Bologna, IT (monia.lupparelli@unio.it) epartment of Statistical Sciences, University of Bologna, IT (alerto.roverato@unio.it) 1

2 1 Introduction In many research fields it is required to model the dependence of a collection of response variales on one or more explanatory variales. McCullagh and Nelder (1989, Sec. 6.5) specified that in this context there are typically three lines of inquiry: (i) the dependence structure of each response marginally on covariates, (ii) a model for the joint distriution of all responses and (iii) the joint dependence of response variales on covariates. When in a regression model responses are categorical, a linear model is typically related to the response variales via a transformation of proailities called the link function, and the choice of the link function plays a key role for the interpretation of the model along the lines (i) to (iii). We refer to Tutz (2011) and Agresti (2013) for a full account of regression models for categorical data; see also Ekholm et al. (2000, Section 5) for a review of some link functions commonly used in the inary case. This work is motivated y a research aimed to investigate the effect of HIVinfection on multimoriditiy which is defined as the co-occurrence of two or more chronic medical conditions in one person. It is well known that multimoridity is associated with age and, furthermore, that HIV-infected patients experience an increased prevalence of noninfectious comoridities, compared with the general population. Guaraldi et al. (2011) considered a dataset otained from a cross-sectional retrospective case-control study, and investigated the effect of HIV-infection on the prevalence of a set of noninfectious chronic medical conditions y applying an univariate regression to each response. However, multimoridity is characterised y complex interactions of co-existing diseases and to gain relevant insight it is necessary to use a multivariate approach aimed to investigate the effect of HIV on the way different chronic conditions associate. The main scientific ojective of this study is thus the line of enquiry (iii). However, to the est of our knowledge, this line has never een explicitly addressed in the literature, and this paper is fully devoted to this issue. The application we consider naturally requires a marginal modelling approach ecause the main interest is for the effect of HIV on the marginal association of susets of comoridities; see Tutz (2011, Chapter 13) and Agresti (2013, Chapter 12). For this reason, we focus on the case where the link function satisfies upward compatiility, that is every association term among responses can e computed in the relative marginal distriution. In this way, the parameterization of the response variales will include terms that can e regarded as single outcomes, computed marginally on univariate responses, and terms that can e regarded as association outcomes, hereafter referred 2

3 to as response associations, which are computed marginally on susets of responses. Regression models typically include coefficients encoding the effect of the covariates, as well as of interactions of covariates, on response associations and difficulties involve oth the interpretation of the response associations and the interpretation of the relevant regression coefficients. More seriously, the effect of a covariate on a response association might e removale in the sense that it disappears when a different link function is used. It is therefore crucially relevant to e ale to define models with interpretale regressions coefficients; see Berrington de González and Cox (2007) for a review on statistical interactions, with emphasis on interpretation. In marginal modelling a central role is played y the multivariate logistic regression model ecause it maintains a marginal logistic regression interpretation for the single outcomes (McCullagh and Nelder, 1989; Glonek and McCullagh, 1995). Nevertheless, this family of regression models does not provide a satisfying answer to the multimoridity application. This is due to the fact that, although the regression coefficients for the single outcomes can e interpreted in terms of odds ratios, this feature does not translate to the higher order regressions where oth the response association and the relevant regression coefficients are high-level log-linear parameters which are difficult to interpret. We consider two different parameterizations, the log-mean (LM) (rton and Richardson, 2008; rton, 2009) and the log-mean linear (LML) parameterization (Roverato et al., 2013; Roverato, 2015), and investigate the use of these parameterizations as link functions. In this way, we introduce an approach where, similarly to Poisson regression, regression coefficients can e interpreted in terms of relative risks. Furthermore, and more interestingly, the relative risk interpretation can e extended from the regressions of the single outcomes to the regressions of the response associations, therey providing interpretale coefficients. The LM and the LML links can e used to specify the same classes of sumodels ut the LML link has the advantage that relevant sumodels can e specified y setting regression coefficients to zero. Specifically, we show that certain collections of zero LML regression coefficients imply that the relative risks for joint responses factorize with respect to the corresponding relative risks for marginal responses. The paper is organized as follows. In Section 2 we descrie the motivating prolem concerning the analysis of the multimoridity data. Section 3 gives the ackground concerning the theory of regression for multivariate inary responses, as required for this paper. Section 4 introduces the LM and the LML regression models and descries the relevant properties of these models. The analysis of multimoridity data is carried 3

4 out in Section 5 and, finally, Section 6 contains a discussion. 2 Motivating prolem: multimoridity in HIVpositive patients Antiretroviral therapy (ART) for human immunodeficiency virus (HIV) infection has een a great medical success story. Nowadays, in countries with good access to treatment, clinical AIS is no longer the inevitale outcome of HIV infection and this disease, previously associated with extremely high mortality rates, is now generally thought of as a chronic condition (Mocroft et al., 2003; May and Ingle, 2011). espite a marked increase in life expectancy, mortality rates among HIV-infected persons remain higher than those seen in the general population. Some of the excess mortality oserved among HIV-infected persons can e directly attriuted to illnesses that occur as a consequence of immunodeficiency, however, more than half of the deaths oserved in recent years among ART-experienced HIV infected patients are attriutale to noninfectious comoridities (NICMs) (Phillips et al., 2008; Guaraldi et al., 2011). Multimoridity is defined as the co-occurrence of two or more chronic medical conditions in one person that is, for HIV positive patients, as the simultaneous presence of two or more NICMs. Multimoridity, which is associated with age, is perhaps the most common disease pattern found among the elderly and, for this reason, it is turning into a major medical issue for oth individuals and health care providers (Marengoni et al., 2011). It is well known that HIV-infected patients experience an increased prevalence of NICMs, compared with the general population, and it has een hypothesized that such increased prevalence is the result of premature aging of HIV-infected patients (eeks and Phillips, 2009; Shiels et al., 2010; Guaraldi et al., 2011). Multimoridity is characterised y the co-occurrence of NICMs and therefore investigating the effect of HIV-infection on multimoridity requires to investigate the effect of HIV on the way different chronic conditions associate. The dataset we analyse here comes from a study of Guaraldi et al. (2011) who investigated the effect of HIV-infection on the prevalence of a set of noninfectious chronic medical conditions. ata were otained from a cross-sectional retrospective case-control study with sample size n = (2854 cases and 8562 controls). Cases were ART-experienced HIV-infected patients older than 18 years of age who were consecutively enrolled at the Metaolic Clinic of Modena University in Italy from 2002 to Control sujects were matched according to age, sex, race (all white), 4

5 and geographical area. The oserved variales include oth a set of inary (response) variales encoding the presence of NICMs of interest and a set of context and clinical covariates. See Guaraldi et al. (2011) for details and additional references. 3 Background and notation 3.1 Möius inversion In this susection we introduce the notation used for matrices and recall a well known result named Möius inversion that will e extensively used in the following. For two finite sets V and U, with V = p and U = q, we write θ = {θ (E)} V,E U to denote a 2 p 2 q real matrix with rows and columns indexed y the susets of V and U, respectively. Furthermore, we will write θ(e) to denote the column of θ indexed y E U and θ to denote the row of θ indexed y V. Note that the notation we use may e easier to read if one associates with iseases and E with Exposure. Example 3.1 (Matrix notation.) For the case V = {, c, d} and U = {h, a} the matrix θ has eight rows indexed y the susets, {}, {c}, {d} {, c}, {, d}, {c, d}, {, c, d} and four columns indexed y the susets, {h}, {a}, {h, a}. The matrix θ, with row and column indexes, is given elow; note that we use the suppressed notation θ c (ha) to denote θ {,c} ({h, a}), and similarly for the other quantities. θ = {h} {a} {h, a} θ ( ) θ (h) θ (a) θ (ha) θ ( ) θ (h) θ (a) θ (ha) θ c ( ) θ c (h) θ c (a) θ c (ha) θ d ( ) θ d (h) θ d (a) θ d (ha) θ c ( ) θ c (h) θ c (a) θ c (ha) θ ( ) θ (h) θ (a) θ (ha) θ cd ( ) θ cd (h) θ cd (a) θ cd (ha) θ cd ( ) θ cd (h) θ cd (a) θ cd (ha) {} {c} {d} {, c} {, d} {c, d} {, c, d} 5

6 Furthermore, the rows and columns of θ are denoted as follows. θ = θ θ θ c θ d θ c θ θ cd θ cd ( ) = θ( ) θ(h) θ(a) θ(ha) This matrix notation is not standard in the literature concerning categorical data ut, in the case of regression models for inary data, it allows us to provide a compact representation of model parameters in matrix form and to compute alternative parameterizations, as well as regression coefficients, y direct application of Möius inversion. Let ω e another real matrix indexed y the susets of V and U. For a suset V Möius inversion states that θ (E) = ω (E ), E U ω (E) = E E E E( 1) E\E θ (E ), E U; (1) see, among others, Lauritzen (1996, Appendix A). Let Z U and M U e two (2 q 2 q ) matrices with entries indexed y the susets of U U such that the entry of Z indexed y the pair E, H U is equal to 1(E H) and the corresponding entry of M is equal to ( 1) H\E 1(E H), where 1( ) denotes the indicator function. Then, the equivalence (1) can e written in matrix form as θ = ω Z U ω = θ M U and Möius inversion follows y noticing that M U = Z 1. Note that it is straightforward to extend this result to the matrices ω and θ as U θ = ωz U ω = θm U (2) and, furthermore, that it makes sense to consider Möius inversion also with respect to the columns of ω and θ so that θ = ZV ω ω = M V θ. 6

7 3.2 Multivariate inary response models Let Y V = (Y v ) v V e a inary random vector of response variales with entries indexed y V and X U = (X u ) u U a vector of inary covariates with entries indexed y U. Without loss of generality, we assume that Y V and X U take value in {0, 1} p and {0, 1} q, respectively. The values of covariates denote different oservational or experimental conditions and we assume that, for every x U {0, 1} q, the distriution of Y V (X U = x U ) is multivariate Bernoulli. Furthermore, we assume that the latter distriutions are independent across conditions and that, when X U is regarded as a random vector, then also (Y V, X U ) follows a multivariate Bernoulli distriution. We can write the proaility distriutions of Y V X U y means of a matrix π = {π (E)} V,E U where, for every E U, the column vector π(e) is the proaility distriution of Y V given (X E = 1 E, X U\E = 0 U\E ) and, more specifically, π (E) = pr(y = 1, Y V \ = 0 V \ X E = 1 E, X U\E = 0 U\E ). We assume that all the entries of π are strictly positive. In the following, for a suset V we use the suppressed notation Y = 1 to denote Y = 1 and similarly for Y = 0 and the suvectors of X U. Given three random vectors X, Y and Z, we write X Y Z to say that X is independent of Y given Z (awid, 1979) or, in the case where Y and Z are not random, that the conditional distriution of X given Y and Z does not depend on Y. In regression models for categorical responses a linear regression is typically related to the response variales via a link function θ(π). All the link functions considered in this paper are such that, for every E U, the vector θ(e) = θ(π(e)) parameterizes the distriution of Y V (X E = 1, X U\E = 0). In this way, the link function induces a matrix θ = {θ (E)} V,E U and the associated linear regression, in the saturated case, has form θ (E) = E E β θ (E ) for every V, E U (3) where β θ = {β θ (E)} V,E U is a matrix of regression coefficients. It follows from (1) and (2) that (3) can e written in matrix form as θ = β θ Z U so that β θ = θm U. (4) The regression setting (3) involves multivariate cominations of oth the responses and the covariates and for this reason it is important to explicitly distinguish etween the -response associations which are given y θ for V with > 1, and the 7

8 E-covariate interactions given y β θ (E) for E U with E > 1. Hence, β θ (E) encodes the effect of the E-covariate interaction on the -response association, given a fixed level of X U\E that, without loss of generality, in our approach is the zero level. Example 3.2 (Matrix notation continued.) In all the examples of this paper we use the variales of the multimoridity data as given in Section 5. Specifically, we consider three of the response variales, which are Y = Bone fracture, Y c = Cardiovascular disease and Y d = iaetes, with level 1 encoding the presence of the disease, and the most relevant covariates, that is, X h = HIV with the level 1 encoding the presence of the infection and X a = Age with the value 1 for patients aged 45 or more. Hence, if the link function is the matrix θ given in Example 3.1, then the matrix of regression coefficients β θ is β θ = β θ ( ) β θ (h) β θ (a) β θ (ha) β θ ( ) β θ (h) β θ (a) β θ β c θ ( ) β c θ (h) β c θ (a) β c θ β θ d ( ) β θ d β θ c ( ) β θ ( ) β θ cd ( ) β θ cd ( ) β θ c (h) β θ (h) β θ cd (h) β θ cd (h) (h) β θ d β θ c (a) β θ (a) β θ cd (a) β θ cd (a) (ha) (ha) (a) β θ d (ha) β θ c (ha) β θ (ha) β θ cd (ha) β θ cd (ha) = Equation (3) states that there exists a one-to-one Möius inversion relationship etween every row θ of θ and the corresponding row β θ of β θ. For instance, θ is the link function of the marginal distriution of one fracture, and each of its entries, θ (E) for E U, can e computed y considering β θ and then y taking the sum of its entries which are indexed y a suset of E. More specifically, for E = {h} it holds that θ (h) = β θ ( ) + β θ (h) whereas for E = {h, a} it holds that θ (ha) = β θ ( ) + β θ (h) + β θ (a) + β θ (ha). Hence, β θ (h) and β θ (ha) encode the effect of HIV and of the interaction of HIV and age, respectively, on one fracture. Similarly, θ is the response association of one fracture and diaetes. Hence, for E = {h} it holds that θ (h) = β θ ( ) + β θ (h) whereas for E = {h, a} it holds that θ (ha) = β θ ( ) + β θ (h) + β θ (a) + β θ (ha). Here, β θ (h) and β θ (ha) encode the effect of HIV and of the interaction of HIV and age, respectively, on the {, d}-response association θ. 8 β θ β θ β θ c β θ d β θ c β θ β θ cd β θ cd

9 There exists an extensive literature on models for categorical data analysis ut, remarkaly, Lang (1996), extending previous work y Lang and Agresti (1994), introduced a very general method to specify regression models for categorical data therey defining an extremely road class of models named generalized log-linear models. This includes, as special cases, many of the existing models for multiple categorical responses such as log-linear and, more generally, marginal log-linear models (Bergsma et al., 2009). In particular, in marginal regression modelling a relevant instance within this class is otained when θ( ) is the multivariate logistic link function denoted y η( ) (McCullagh and Nelder, 1989; Glonek and McCullagh, 1995; Molenerghs and Lesaffre, 1999; Bartolucci et al., 2007; Bergsma et al., 2009; Marchetti and Lupparelli, 2011). This induces the parameterization η = {η (E)} V,E U where η (E) is the -way log-linear interaction computed in the margin Y (X E = 1, X U\E = 0) and, more specifically, η {v} is the usual logistic link for every v V. Marchetti and Lupparelli (2011) considered the regression framework (3) with θ = η and showed that β η (E) is the E -way log-linear interaction computed in the distriutions of (Y, X U ). Example 3.3 (Multivariate logistic regression) Consider the case V = {, c} and U = {h}. The multivariate logistic parameters are η i ( ) = log pr(y i = 1 X h = 0) 1 pr(y i = 1 X h = 0) and η i (h) = log pr(y i = 1 X h = 1) 1 pr(y i = 1 X h = 1) for i {, c}, which are logit links. Furthermore, if we denote y OR(Y c ) and y OR(Y c h) the odds ratio etween one fracture and cardiovascular disease for HIVnegative and HIV-positive patients, respectively, then η c ( ) = log OR(Y c ) and η c (h) = log OR(Y c h) so that the coefficients for the regression of η c are β η c ( ) = log OR(Y c ) and β η c (h) = log OR(Y c h) OR(Y c ). We now move from the saturated model to sumodels defined y means of linear constraints on the regression coefficients and, more specifically, to sumodels with regression coefficients equal to zero. If for every suset of U, denoted as E, that has non-empty intersection with U U, it holds that β θ (E) = 0, then the covariates X U have no-effect on θ. The following lemma states a connection etween no-effect 9

10 of a suset of variales and linear constraints in θ. Lemma 3.1 Let θ = {θ (E)} V,E U e a real matrix with entries indexed y two nonempty sets V and U. If β θ = θm U and U U then the following are equivalent, for every V, as they oth say that X U has no effect on θ ; (i) β θ (E) = 0 for every E U such that E U ; (ii) θ (E 1 ) = θ (E 2 ) for every E 1, E 2 U such that E 1 \U = E 2 \U. Proof. See Appendix A. 4 Log-mean and log-mean linear regression models The multivariate Bernoulli distriution elongs to the natural exponential family, and the mean parameter associated with the distriution π(e) is µ(e) where µ (E) = pr(y = 1 X E = 1, X U\E = 0) for V. rton and Richardson (2008) used the mean parameter to parameterize graphical models of marginal independence and called it the Möius parameter ecause µ(e) = Z V π(e) for every E V. Susequently, rton (2009) used the matrix µ = Z V π to parameterize regression graph models; see also Roverato et al. (2013) and Roverato (2015). The mean parameterization has the disadvantage that sumodels of interest are defined y, non-linear, multiplicative constraints. Roverato et al. (2013) introduced the log-mean linear parameter γ defined as a log-linear expansion of the mean parameters, formally, γ = MV log(µ), (5) and showed that this approach improves on the mean parameterization as sumodels of interest can e specified y setting certain zero log-mean linear interactions. We remark that oth the mean and the log-mean linear parameters are not variation independent, in the sense that setting some parameters to particular values may restrict the valid range of other parameters. As a consequence, unlike variation independent parameterizations, it might e more difficult to interpret separately each parameter. Here, we show that the analysis of multimoridity data can e effectively approached y applying the theory descried in the previous section to the log-mean (LM) and the log-mean linear (LML) parameterizations to develop the LM and the LML regression model, respectively. 10

11 4.1 Log-mean regression The LM regression model is otained y setting θ in (3) equal to the logarithm of the mean parameter, log(µ), so that, in the saturated case, log(µ) = β µ Z U that, in its extended form, is log µ (E) = E E β µ (E ) for every V, E U (6) where for = the equation (6) is trivial for every E U ecause µ is a row vector of ones so that oth log(µ ) = 0 and β µ = log(µ )M U = 0. LM regression can e regarded as one of the possile alternative ways to parameterize the distriution of Y V X U, ut it is of special interest for the application considered in this paper. This can e seen y noticing that one can write µ (E) = pr(y = 1 X E = 1, X U\E = 0) where Y = v Y v is the inary random variale associated with the multimoridity pattern. More specifically, pr(y = 1) is the proaility that the multimoridity pattern occurs, and therefore (6) is a sequence of regressions corresponding to the univariate inary responses Y for V. Similarly to Poisson regression, the parameters of LM regression can e interpreted in terms of relative risks. We first consider the case U = {u}, so that U = 1. Hence, if we denote the relative risk of an event E with respect to the two groups identified y X u = 1 and X u = 0 y RR u (E) = pr(e X u = 1) pr(e X u = 0) (7) then for every v V it holds that β v µ (u) = log RR u (Y v = 1) whereas, more generally, for V β µ (u) = log RR u(y = 1) = log RR u ( v {Y v = 1}) (8) where we use the convention that RR u (Y = 1) = 1. Example 4.1 (LM regression) Consider the case where V = {, c, d} and U = {h}. Then the saturated LM regression model contains a regression equation for every 11

12 V, where the case = is trivial. For = 1, equation (6) has form log µ i ( ) = log pr(y i = 1 X h = 0) = β µ i ( ) log µ i (h) = log pr(y i = 1 X h = 1) = β µ i ( ) + β µ i (h) where, for every i V, β µ i (h) = log RR h (Y i = 1) that is the log-relative risk of the disease Y i for HIV-positive patients compared to HIV-negative patients. For = 2, log µ ij ( ) = log pr(y i = 1, Y j = 1 X h = 0) = β µ ij ( ) log µ ij (h) = log pr(y i = 1, Y j = 1 X h = 1) = β µ ij ( ) + β µ ij (h) where, for every i, j V with i j, β µ ij (h) = log RR h(y ij = 1) that is the logrelative risk of the co-occurrence of the diseases Y i and Y j for HIV-positive patients compared to HIV-negative patients. Finally, for = 3 so that = V, log µ cd ( ) = log pr(y = 1, Y c = 1, Y d = 1 X h = 0) = β µ cd ( ) log µ cd (h) = log pr(y = 1, Y c = 1, Y d = 1 X h = 1) = β µ cd ( ) + β µ cd (h) where β µ cd (h) = log RR h(y cd = 1) is the log-relative risk of the co-occurrence of the three diseases for HIV-positive patients compared to HIV-negative patients. Consider the regression equation relative to the suset V. It follows from (6) that if for an element u U it holds that β µ (E) = 0 for every E U such that u E, then Y X u X U\{u}. This can e easily extended to a suset of covariates X U with U U, whereas to generalize this result to the vector Y it is necessary to consider a collection of regression equations as shown elow. Proposition 4.1 Let Y e the suvector of Y V indexed y V. For a suset U U, it holds that Y X U X U\U if and only if β µ (E) = 0 for every and E U such that E U. Proof. See Appendix A. We can conclude that it makes sense to focus on sumodels characterized y zero LM regression coefficients ecause they encode interpretale relationships, possily implying that one or more covariates have no-effect on the distriution of Y, or even on the joint distriution of Y. However, this approach did not identify any missing effects in the application to multimoridity data. One reason for that is that the zero 12

13 pattern of regression coefficients descried in Proposition 4.1 implies no-effect of X U on Y and therefore that X U has no-effect on Y v for every single v. Consider the case U = 1 of a unique covariate X h representing the HIV-infection. It is well estalished that HIV has a relevant effect on each of the comoridites considered singularly, that is, β v µ (h) 0 for every v V. Although in principle it is possile for a covariate X h with a non-zero effect on Y v for some v to have coefficients equal to zero in the regression relative to Y, this hardly happens in practice ecause of the strong association existing etween Y and every Y v with v ; recall that Y v = 0 for any v implies Y = 0 and, conversely, Y = 1 implies Y v = 1 for every v. In other words, it is well known that, if X h = 1 for HIV-postive patients, then pr(y v = 1 X h = 1) > pr(y v = 1 X h = 0) for every comoridity v V, and therefore it is reasonale to expect that also pr(y = 1 X h = 1) > pr(y = 1 X h = 0) for every comoridity pattern V, with > 1. To disclose the usefulness of LM regression for the application considered in this paper, it is necessary to propose a different approach. In the multimoridity analysis, it is useful to distinguish etween two different effects of HIV, specifically, one can e interest in the effect of HIV (i) on the prevalence of a comoridity pattern and (ii) on the association among comoridities in. As discussed aove, the former is in general a well known matter as multimoridity shows a higher prevalence in infected patients. Indeed, the main question is whether HIV plays a role in the way single comoridities comine together to produce the comoridity pattern, that is in the association of variales in Y. We approach the prolem y considering the extreme case of no-effect of HIV on the -response associations ecause responses are conditionally independent given the covariates. Our standpoint is that if for a suset V there exists a proper partition A B = such that Y A Y B X U, then there is no-effect of the covariates on the -way association of Y. This allows us to compare the performance of different link functions in multimoridity analysis and, more importantly, to provide a clear interpretation to the effect of HIV in LM regression. In the Example 4.2 elow we illustrate how this idea can e formalized. Example 4.2 (LM regression vs. multivariate logistic regression) For the case V = {, d} and U = {h}, assume Y Y d X h so that we can say that there is no-effect of X h = HIV on the association of Y and Y d. To see this in practice it is sufficient to consider any measure of association that represents independence y a constant value, for instance the value zero, so that the association is the same for the two values of X h. Exploiting the factorization of the proaility function of Y X h implied y conditional independence, it is not difficult to see that under the LM regression 13

14 oth Y Y d X h β µ ( ) = β µ ( ) + β µ d ( ) and β µ (u) = β µ (h) + β µ d (h), and, under the multivariate logistic regression, Y Y d X h β η ( ) = 0 and β η (h) = 0. The effect of HIV on the joint distriution of Y is represented y β µ (h) in LM regression and y β η (h) in multivariate logistic regression and therefore oth and β µ (h) = β µ (h) + β µ (h) (9) d β η (h) = 0 (10) can e used to state that there is no-effect of HIV on the association of Y and Y d. However, (9) and (10) refer to different kinds of association ecause, although they are necessary conditions for Y Y d X h, they are not sufficient for the same condition to hold true, and it can e easily checked that neither (9) implies (10) nor (10) implies (9). From this perspective, it makes little sense to state that HIV has no-effect on the association of Y and Y d if a clear interpretation to equalities (9) and (10) is not provided. The interpretation of equality (9) is straightforward. It states a connection etween regressions of different order and implies that the effect of HIV on the distriution of Y is explained y the effects of HIV on the marginal distriutions of Y and Y d. More interestingly, it provides an useful insight on the ehaviour of relative risks ecause (9) is equivalent to RR h ({Y = 1} {Y d = 1}) = RR h (Y = 1) RR h (Y d = 1) (11) that is, if we are willing to interpret the effect of HIV y means of relative risks, then (9) allows one to carry out the analysis marginally on the regression equations for the main responses, ecause the computation of the relative risk of the joint event {Y = 1} {Y d = 1} does not require the joint distriution of Y X h ut only the marginal distriutions of Y X h and Y d X h as in the case where Y Y d X h. Multivariate logistic regression parameters are naturally associated with odds ra- 14

15 tios, which play a very fundamental role among association measures for categorical data. However, we deem that, in this context, the interpretation of (9) is more straightforward than (10). Indeed, the latter is equivalent to the identity OR(Y ) = OR(Y h) that, unlike (11), does not explains how the marginal effects of HIV on one fracture and diaetes comine together to give the effect of HIV on the {, d}-response association. This example shows that the LM regression coefficients provide a clear way to investigate the effect of a covariate on the association of two inary variales and the following theorem generalises this result to -response associations. Theorem 4.2 Let µ e the mean parameter of Y V X U and β µ = log(µ)m U. Then, for a pair of disjoint nonempty susets A and B of V it holds that Y A Y B X U if and only if for every A B and E U, with oth A and B, it holds that β µ (E) = B µ (E) where B µ (E) = ( 1) \ β µ (E). (12) Proof. See Appendix A. Theorem 4.2 shows that, whenever Y can e split into two conditionally independent suvectors Y A and Y B, then every coefficient of the LM regression with response log(µ ) can e written as a linear comination of the corresponding coefficients in lower-order regressions. Hence, the difference etween case and control patients with respect to Y is not given y the effect of HIV on the association of all the comoridities in, ut it is only the consequence of the effect that HIV has on the occurrence of the susets A and B of comoridity patterns. For the case U = {u}, the relationship β µ (u) = B µ (u) can e used to state that X u has no-effect on the -response association and, as well as in Example 4.2, this statement means that the effect of X u on the distriution of Y is explained y the corresponding coefficients in lower-order regressions. Furtheremore, it follows immediately from (8) that the equality β µ (u) = B µ (u) can e equivalently stated in terms of factorization of the relative risk RR u (Y = 1) with respect to the collection of relative risks RR u (Y = 1) for. Example 4.3 (LM regression cont.) For the LM regression in Example 4.1 consider the case where Y A Y B X h where A and B are nonempty, disjoint susets of V 15

16 such that A B = V. Then, it follows from Theorem 4.2 oth that and that More generally, for the case U > 1 it follows from Theorem 4.2 that the relationship β µ cd ( ) = β µ ( ) β c µ ( ) β µ d ( ) + β µ c ( ) + β µ ( ) + β µ cd ( ) β µ cd (h) = β µ (h) β c µ (h) β µ d (h) + β µ c (h) + β µ (h) + β µ cd (h) RR h (Y cd = 1) = RR h(y c =1) RR h (Y =1) RR h (Y cd =1) RR h (Y =1) RR h (Y c =1) RR h (Y d =1). β µ (E) = B µ (E) for every E U such that u E (13) implies that every regression coefficient involving X u is the linear comination of the corresponding coefficients of lower-order regressions, and can e used to state that X u has no-effect on the -response association. Also in this case, (13) can e interpreted in terms of relative risks. Consider the collection of relative risks of the event Y = 1, with respect to X u, conditionally on the values of the remaining covariates X U\{u}, given y RR u (Y = 1 E) = pr(y = 1 X u = 1, X E = 1, X U\(E {u}) = 0) pr(y = 1 X u = 0, X E = 1, X U\(E {u}) = 0) for E U\{u};(14) we recall that we set RR u (Y = 1 E) = 1. Then we associate to every element of (14) a reference value defined as follows. efinition 4.1 For V with > 1, the reference relative risk of the event Y = 1 with respect to X u and a suset E U\{u} is defined as RR u (Y = 1 E) = RR u (Y = 1 E) ( 1) \ +1. The following lemma states the connection etween relative risks and regression coefficients as well as etween reference relative risks in efinition 4.1 and Theorem 4.2. Lemma 4.3 Let µ e the mean parameter of Y V X U and β µ = log(µ)m U. Then, 16

17 for every V, u U and E U\{u} it holds that log RR u (Y = 1 E) = E E β µ (E {u}) and, for > 1, that log RR u (Y = 1 E) = E E B µ (E {u}). Proof. See Appendix A. The introduction of the reference relative risks is motivated y the following result, Corollary 4.4 Let the suset V e such that there exists a proper partition = A B, with A, B, satisfying Y A Y B X U. Then for every u U it holds that RR u (Y = 1 E) = RR u (Y = 1 E) for every E U\{u}. (15) Proof. See Appendix A. Hence RR u (Y = 1 E) is a reference value in the sense that it is the value taken y the corresponding relative risk when Y can e split into two conditionally independent suvectors. If (15) is satisfied, then all the conditional relative risks of the events {Y = 1}, with respect to X u, are equal to their reference value. This implies that, as far as the relative risk of the multimoridity pattern is of concern, the analysis can e carried out marginally on the distriutions of Y X U with. The following corollary shows that one can interpret (13) in terms of relative risks ecause it is equivalent to (15). Corollary 4.5 Let µ e the mean parameter of Y V X U and β µ = log(µ)m U. For any V, with > 1, and u U the relationship (13) is satisfied if and only if (15) is satisfied. Proof. See Appendix A. We can conclude that LM regression provides a useful framework to investigate the effect of covariates on the association of responses. Sumodels of interest involve possily zero regression coefficients, as in Proposition 4.1, ut, more commonly, regression 17

18 coefficients which are linear comination of lower-order coefficients as in Theorem 4.2. This approach can e easily implemented ecause (i) it involves sumodels defined y linear constraints on the regression parameters and (ii) the computation of B µ (E) in (12) does not require to specify explicitly the partition of Y into independent suvectors. A shortcoming of LM regression is that in order to interpret the values of regression coefficients in terms of conditional independence the coefficients must e contrasted with the theoretical values (12) given in Theorem 4.2. In the next section we show that LML regression provides a solution to this prolem. 4.2 Log-mean linear regression The LML regression model is otained y setting θ in (3) equal to the LML parameter γ = M V log(µ) in (5), so that, in the saturated case, it follows from (4) that β γ = γm U and γ = β γ Z U ; the latter can e written as γ (E) = E E β γ (E ) for every V, E U. (16) There is a close connection etween LM and LML regression given y a linear relationship etween β γ and β µ. Lemma 4.6 Let µ and γ e the mean and LML parameter, respectively, of Y V X U so that β µ = log(µ)m U and β γ = γm U. Then it holds that β γ = M V β µ, that is for every V and E U β γ (E) = ( 1) \ β µ (E). (17) Proof. See Appendix A. As a consequence of Lemma 4.6, sumodels defined y zero LM regression coefficients as in Proposition 4.1 can e equivalently stated y setting to zero the corresponding LML regression coefficients. Corollary 4.7 Let Y e the suvector of Y V indexed y = V. For a suset U U, it holds that Y X U X U\U if and only if β γ (E) = 0 for every and E U such that E U. Proof. See Appendix A. 18

19 β γ Furthermore, it follows immediately from (17) that for 1 it holds that = β µ whereas for > 1 LML regression coefficients allow one to immediately check whether a LM regression coefficients coincides with the associated theoretical value given in Theorem 4.2. Proposition 4.8 Let µ and γ e the mean and LML parameter, respectively, of Y V X U so that β µ = log(µ)m U and β γ = γm U. Then for every V, such that > 1, and E U it holds that β γ (E) = β µ (E) B µ (E) so that β γ (E) = 0 if and only if β µ (E) = B µ (E). Proof. This is an immediate consequence of (17). Hence, the relationship β γ (E) = 0 for every E U such that u E (18) is equivalent to oth (13) and (15) and we can conclude that LML regression is equivalent to LM regression for the purposes of our analysis, with the advantage that all the sumodels of interest can e specified y setting LML regression coefficients to zero. Furthermore, the value of the regression coefficients β γ can e used to contrast relative risks with the corresponding reference values as follows. Corollary 4.9 Let γ e the LML parameter of Y V X U so that β γ = γm U. Then for every V, such that > 1, and E U it holds that log RR u(y = 1 E) RR u (Y = 1 E) = β γ (E {u}) (19) E E Proof. See Appendix A. Note that, interestingly, Corollary 4.9 implies that for the case U = 1 the LML regression coefficients such that > 1 have an immediate interpretation as deviation of a relative risk from its reference value β γ (u) = log RR u(y = 1) RR u (Y = 1). (20) We close this section y noticing that also the LM and the LML regression models elong to the family of generalized log-linear models. This allows us to exploit 19

20 the asymptotic results of Lang (1996) for the computation of maximum likelihood estimates (MLEs) and for model comparison. Nevertheless, we remark that the regression framework introduced in this paper is entirely novel. Both the LM and the LML link functions allow us to specify the functional relationship etween π and the regression coefficients in closed form. This is a specific property that is not shared y the wider class of generalised log-linear models, and that makes it possile to specify closed-form functional relationships etween regression coefficients of different responses. In turn, this is the asis for the factorization of relative risks with respect to the corresponding relative risks of lower order regressions. 5 Analysis of multimoridity data We now apply the LM and the LML regression models to the analysis of the multimoridity data descried in Section 2. We consider four response variales: Y = Bone fracture, Y c = Cardiovascular disease, Y d = iaetes and Y r = Renal failure. Hence, V = {, c, d, r} and Y V takes values in {0, 1} 4 where, for each variale, the value 1 encodes the presence of the disease. The four responses define 11 multimoridity patterns denoted y the susets V with 2 and we say that k = is the size of the multimoridity pattern. For the computation of MLEs we applied the algorithm and the maximization procedure given in Lang (1996) properly adjusted to account for the inclusion of the LM and LML link functions; for technical details and a review of further maximization approaches see also Evans and Forcina (2013) and references therein. It is well estalished that asymptotic methods are not efficient when the tale of oserved counts is sparse, that is when many cells have small frequencies (see Agresti, 2013, Section 10.6). For a given sample size, sparsity increases with the numer of variales included in the analysis, therefore inference is less reliale for response associatons of higher-order since they are computed on the relevant marginal tales. In order to keep sparsity at an acceptale level, we restrict the analysis to the effect of two inary covariates indexed y U = {a, h}; specifially, X h = HIV with the level 1 encoding the presence of the infection and X a = Age with the value 1 for patients aged 45 or more. Firstly, in Susection 5.1, we consider the regression model including only the covariate X h, so that each regression coefficient represents the marginal effect of HIV on a -response associations. Next, in Susection 5.2 we consider a regression model with two covariates including also the effect of age. We remark that the case-control design for this study is not ased on an outcome- 20

21 dependent sampling. In fact the enrollment of each patient is independent of the disease status given the HIV status and further individual covariates. Therefore, for this case-study the population relative risk represents an identifiale measure of association, unlike case-control designs where the sample selection depends on the outcome of interest. 5.1 The single covariate case When only the covariate X h is included in the model, the regression coefficients have a straightforward interpretation in terms of relative risks for cases versus controls. Indeed, in the LM regression model, it holds that for = 1 the coefficient β µ (h) is the log-relative risk of the occurrence of single commoridities and, otherwise, it is the log-relative risk of the occurrence of the multimoridity pattern ; see (8). In the LML regression, the coefficient β γ (h) is equal to β µ (h) in case of single responses, otherwise it is the log-ratio of the relative and the reference relative risks for the occurrence of the multimoridity pattern, as shown in (20). Hence, from Proposition 4.8, the no-effect of HIV defined y β γ (h) = 0 with > 1 implies that the relative risk equals the reference relative risk for the multimoridity pattern. Conversely, a positive or a negative value of β γ (h) states that the relative risk for the pattern is higher or lower than its reference relative risk, respectively. As a preliminary analysis, we provide in Tale 1 the MLEs under the saturated LM and LML regression models. To clarify the meaning of the values given in this tale consider, for instance, the disease pattern {, c}, that is one fracture diaetes. The µ estimated LM coefficient ˆβ c (h) = provides a strong evidence of the existence of an effect of HIV on the co-occurrence of the two diseases whereas the corresponding γ LML parameter estimate ˆβ c (h) = is not significantly different from zero, and this can e interpreted as no-effect of HIV on the association of the two diseases. More specifically, the estimated relative risk of the co-occurrence of one fracture and µ diaetes takes value exp{ ˆβ c (h)} = exp(2.941) = 18.9 and is significantly different from 1; however, it is not significantly different from the product of the two marginal relative risks given y exp{β µ (h)} exp{β c µ (h)}. The latter is the value taken y the relative risk of the co-occurrence of the two diseases when Y Y c X h and therefore it implies that the relative risk of the co-occurrence of the two diseases only depends on the marginal relative risks of the two diseases rather than on the way the two diseases associate to give the multimoridity pattern. ifferent is the case of the disease pattern {c, d} ecause in this case the estimated LML coefficient 21

22 ˆβ γ cd (h) = is significantly different form zero and negative and therefore the relative risk of the co-occurrence of the two diseases is significantly smaller than expected in the case the two diseases are conditionally independent given HIV. We remark that, when we say that a regression coefficient is significantly different from zero we refer to the statistical test, at level 5%, ased on the asymptotic normal distriution of the MLEs. As expected (see the discussion following Proposition 4.1) in Tale 1 the estimated log-relative risks ˆβ µ (h) are positive for every single disease and for every multimoridity pattern. Furthermore, all the corresponding regression coefficients are significantly different from zero, except for the pattern {, c, d}. The lowest relative risk corresponds to the pattern {c, d}, whereas the highest relative risks are in correspondence of the patterns {, r}, {c, r} and {, d, r}. More informative is the analysis of the LML regression coefficients. Indeed, most of the coefficients β γ (h) are not significantly different from zero, therey providing an empirical evidence that the log-relative risks for the corresponding disease patterns are positive as a consequence of the effect of HIV on the single diseases they are composed, rather than for an effect of HIV on the associations of diseases. It is interesting that most of the γ coefficients ˆβ (h), for 2, take a negative value. This is especially true for the multimoridity patterns of size two, suggesting that the corresponding relative risks, although positive, take a value that is smaller than the value one would expect in case of conditional independence of the diseases. The results provided y the saturated model can e summarized considering the average HIV-effect for disease patterns of the same size k, under the LM and LML γ approaches. In particular, we compute the average of (h) and of ˆβ (h), for - response associations of the same size; see Appendix B for further details. Confidence intervals for these effects are plotted in Figure 1. The plot in the left panel of Figure 1 gives the estimated LM average effect, i.e. the average log-relative risk for the occurrence of multimoridity patterns of the same size, which clearly increases with the size of the disease pattern. On the other hand, the plot in the right panel of Figure 1 shows the estimated LML average effect, that is the average effect of HIV on the association among diseases forming patterns of the same size. This effect appears to e negative for patterns of size k = 2 and it might e null for k = 3 and k = 4. Next, we apply the forward inclusion stepwise procedure descried in Appendix B and select, in this way, the LML regression sumodel given in Tale 2. The latter provides a very good fit of the data with a deviance 3.45 on 12 degrees of freedom and p-value, computed on the asymptotic chi-square distriution of the deviance, ˆβ µ 22

23 Tale 1: The saturated LML and LM regression models for Y V X h. The tale gives the MLEs of the regression coefficients with their standard error (s.e.) and p-value. ˆβ γ ( ) γ µ µ s.e. p-value ˆβ (h) s.e. p-value ˆβ ( ) s.e. p-value ˆβ (h) s.e. p-value {} < < < <.001 {c} < < < <.001 {d} < < < <.001 {r} < < < <.001 {, c} < <.001 {, d} < <.001 {, r} < <.001 {c, d} < < < {c, r} < <.001 {d, r} < < <.001 {, c, d} < {, c, r} < < <.001 {, d, r} < <.001 {c, d, r} < <.001 {, c, d, r} < <.001 LM regression model LML regression model average effect average effect disease pattern size disease pattern size Figure 1: Confidence intervals of the average HIV-effect for response associations for disease patterns of the same size under the saturated LM and LML regression models. 23

24 Tale 2: The selected LML, and the corresponding LM, regression model for Y V X h. The tale gives the MLEs of the regression coefficients with their standard error (s.e.) and p-value. ˆβ γ ( ) γ µ s.e. p-value ˆβ (h) s.e. p-value ˆβ ( ) s.e. ˆβ µ (h) s.e. {} < < {c} < < {d} < < {r} < < {, c} {, d} {, r} {c, d} < < {c, r} {d, r} < {, c, d} {, c, r} < < {, d, r} < < {c, d, r} {, c, d, r} < equal to In the selected model there is no effect of HIV on the associations {, c}, {, d}, {, r}, {c, r}, {d, r}, {, c, d} and {c, d, r}. Furthermore, it follows from Theorem 4.2 and Proposition 4.8 that, under the selected model, three pairs of responses are conditionally independent given HIV, specifically, Y Y d X h, Y Y r X h and Y c Y r X h. We now look more closely at the values taken y the estimated regression coefficients given in Tale 2. We ase this analysis on the asymptotic normal distriution of the MLEs, ut it is important to remark that, as we are dealing with post-selection parameter estimates, there might e distortions on the sampling distriutions; see Berk et al. (2013) for a discussion and recent developments. The highest estimated relative risks are for multimoridity patterns {, r} and {, d, r} with 95% confidence intervals (5.793; 6.130) and (5.146; 6.130), respectively. More generally, high relative risks are given in correspondence of multimoridity patterns including Renal failure. γ The negative values taken y the estimates ˆβ (h) for the remaining patterns of size two and three suggest that the relative risks of the multimoridity patterns {c, d}, {, d, r} and {, c, r} are lower than their reference relative risks. For the disease pattern of size 4 the estimate of β γ (h) is positive suggesting that the relative risk of the pattern {, c, d, r} is higher than its reference relative risk. 24

ARTICLE IN PRESS. Journal of Multivariate Analysis ( ) Contents lists available at ScienceDirect. Journal of Multivariate Analysis

ARTICLE IN PRESS. Journal of Multivariate Analysis ( ) Contents lists available at ScienceDirect. Journal of Multivariate Analysis Journal of Multivariate Analysis ( ) Contents lists available at ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Marginal parameterizations of discrete models