Hierarchical Linear Models and Structural Equation Modelling for the Children of Siblings model

U.U.D.M. Project Report 2008:20

Hierarchical Linear Models and Structural Equation Modelling for the Children of Siblings model

Ralf Kuja-Halkola

Degree project in mathematical statistics, 30 credits
Supervisor: Paul Lichtenstein, Medical Epidemiology and Biostatistics, Karolinska Institutet
Examiner: Ingemar Kaj
December 2008

Department of Mathematics, Uppsala University


Hierarchical Linear Models and Structural Equation Modelling for the Children of Siblings model

Ralf Kuja-Halkola

Abstract

The main aim of this paper is to examine and apply the Children of Siblings model to normally distributed outcome data. The model refers to siblings who are mothers/aunts and/or fathers/uncles and on whose children an outcome variable is measured. This is done through two statistical models. The first is a Hierarchical Linear Model (also known as a multi-level model), which takes into account clustering of the data in three levels. The levels correspond to increasing sizes of a family: individual, nuclear family and extended family. This model makes it possible to examine whether a specific exposure is responsible for a change in the outcome variable after confounders and clustering have been considered. Furthermore, genetic relatedness is taken into account, making it possible to draw conclusions regarding the familial effects on the association between exposure and outcome. The second model is a 2-level Structural Equation Model, which aims at splitting the variance/covariance into three predetermined components representing genetic influence, environmental similarity and environmental dissimilarity, using two levels of clustering in the data. The clustering levels are within and between nuclear families, which are also considered hierarchical. This model investigates the importance of genetic versus environmental heritability. An example study is performed using data on mothers' smoking habits during pregnancy and the psychological functioning capacity of their children, a prognosis of the ability to cope with war-time stress, as measured by Försvarsverket during the medical examination for military service. We found that smoking during pregnancy does not affect the psychological functioning capacity. The exposure's effect on the outcome shows familial confounding, and this is mainly explained through genetic relatedness.

Contents

1 Introduction and aims
  1.1 Background
  1.2 Quasi-experimental study
  1.3 The Children of Siblings model
2 Methods - Theory
  2.1 Hierarchical Linear Models
    2.1.1 CoS - a Hierarchical Linear Model
    2.1.2 The submodels included in the CoS
    2.1.3 Interaction terms
    2.1.4 Estimation and degrees of freedom
  2.2 Structural Equation Models
    What are Structural Equation Models?
    Matrix notation approach
    Model fit
    2-level SEM, ACE model
    2-level SEM for ACE
    Estimation: Maximum Likelihood using Expectation Maximization
3 Data
  The data
  The variables
4 Results
  The problem of belonging to multiple families
  HLM results
  SEM results
5 Discussion
  Limitations of the model
  Future improvements
References

1 Introduction and aims

The aim of this paper is to make the Children of Siblings (CoS) model accessible to researchers, both in theory and in practice. By using this model it is possible to draw conclusions from data that can be influenced by familial confounding. The confounding can be due to genetics or environment, and the model also tries to distinguish the importance of these two sources. For example, it is not unlikely that a child's reading ability is influenced by the environment in which he/she is brought up (access to books, parental encouragement). Probably his/her sibling will have a similar environment. Thus they will be more similar to each other than to a random child taken from the same population (e.g. Swedish children born a certain year). Furthermore, they are genetically similar, but the similarity differs between full and half siblings. They will also share genes with their cousins (here the genetic similarity also differs depending on whether they are full or half cousins), but the environment in which cousins are brought up may differ. By considering all these (dis-)similarities we are hopefully able to isolate a specific exposure's influence on an outcome and, if there is familial confounding, isolate its source. The outcome variable will be limited to being normally distributed. A study will be performed on whether a (male) child's exposure to the mother's Smoking During Pregnancy (SDP) influences his Psychological Functioning capacity score (PF), as measured by Försvarsverket during the medical examination for military service (henceforth called conscription). The data concern boys born between 1973 and 1988 (99% born between 1979 and 1988). We will try to explain the model in a conceptually simple way, and when examining the theory behind the model we keep the discussion at an overview level rather than focusing on every small problem, so that a non-mathematician researcher should be able to grasp, and apply, the results.
In the theory section some more advanced theoretical problems are addressed. The Background section introduces the idea of the model and indicates its applicability. The Methods - Theory section explains the models of CoS and deals with the mathematical theory behind the models and the obstacles to overcome. In the Data-, Analysis- and Results-sections we perform the analysis of how SDP influences the PF score. The data are taken from a large database called the MgrCrime database, which includes information about relatedness, maternal and paternal social and economic status, and birth data. Merging of data from different sub-databases and forming of variables are done with SAS; analyses are made with SAS and Mplus 1. Finally, the Discussion section considers possible conclusions to be drawn from the analyses. We also discuss and suggest improvements, both for the current study and for making the model applicable to other problems.

1.1 Background

Researchers in the epidemiological field often try to address causation. Does a mother's SDP influence the child's intellect? Does a father's bipolarity cause his infant child to

1 Data- and analysis-program code can be obtained from the author, raku9405@student.uu.se.

have a higher mortality rate? Does a certain lifestyle increase the risk of cardiovascular disease? Such questions are hard to answer since the causality is confounded by the interplay between outcome and exposure; an exposed child's mother also provides genes and other environment. For example, the exposure to SDP might be a negligible part of the whole picture making the child perform poorly in e.g. a test of PF. The problem of parents providing (possibly) hazardous genes as well as the child having a higher risk of being exposed to a hazardous environment is called gene-environment correlation (rGE). The child is at double risk of being exposed to a hazardous environment, both due to the upbringing and due to her/his own genes making her/him more prone to put herself/himself in such an environment 1. The causality in these studies is difficult to assess, but with designs such as sibling comparison or the Children of Twins (CoT) model there is a possibility to give some answers. The CoS model is another part of this tool-box. The CoS model extends the CoT model, which has been used for some time by researchers in e.g. epidemiology and psychology (see for example D'Onofrio, 2005; Harden et al., 2007). By using CoS rather than CoT a greater population is made available for analysis, and using CoS rather than ordinary sibling comparison can help differentiate the possible underlying mechanisms (D'Onofrio et al., in press). The CoT model, as well as the CoS model, aims at distinguishing whether a risk factor affects the outcome, and whether this effect is environmentally and/or genetically driven. This is not a simple task because of the rGE. In the CoS an important part consists of considering the genetic relatedness between siblings (full or half) and/or cousins (full or half), and some conclusions regarding genetic importance can, hopefully, be drawn. The approach used is similar to that of Brian M. D'Onofrio in (D'Onofrio et al., in press; D'Onofrio et al., 2008).
The idea to use this design is also mentioned in (Rutter, 2007).

1.2 Quasi-experimental study

The basis of statistical theory in general relies on the data being a valid statistical sample, e.g. acquired through an experimental design randomizing subjects to different exposures. For many studies, such as the one we have performed, this is not true. By treating the data as experimental a number of assumptions are made, which can more easily be violated in database studies or natural experiments (Rutter, 2007) as often used in epidemiology. By labeling the study quasi-experimental this is emphasized. Examples of when the assumptions may not hold are:

Voluntary or involuntary self-censorship. When data is collected through self-reporting, the supplier of the information may be less likely to report certain variables. E.g. mothers smoking during pregnancy might feel ashamed and choose not to report this, making the data biased.

1 That the parents passing on their genes can also create a hazardous environment is called passive gene-environment correlation; that the child creates her/his own hazardous environment due to genes inherited from the parents is called active gene-environment correlation (Rutter, 2007).

Sampling bias, reflecting the possibility that data collected in a voluntary manner are susceptible to e.g. social bias; people of low socioeconomic status may have less incentive to supply the wanted information.

Often a randomized case-control design is out of the question; for example, it would be unethical to randomize mothers to smoke or not to smoke during pregnancy. And randomizing parents to have a certain mental disorder, thus affecting their child, is impossible. The second best option in this case is to use available data, and to analyze these data as if they came from a designed experiment. Even though the researcher has to be aware of the possible pitfalls, these kinds of studies fill an important role. And even if the inferences can be questionable, they can at least be said to be hypothesis generating.

1.3 The Children of Siblings model

Our goal is to explore the influence of an exposure on an outcome after controlling for confounders and clustering effects. The confounders are the covariates that influence the outcome and might affect the causation. In the CoS model the clusters are families and subfamilies. The extended families are clusters which consist of grandparents, their children and spouses, and their grandchildren. Clusters within these clusters are the nuclear families, consisting of mothers, fathers and their children. These clusters can be said to be hierarchical. Another level of the hierarchy is the individual level, consisting of the children. By approaching the model in this way it is reasonable to consider a model with three hierarchical levels. The CoS model is designed to detect whether the (possible) influence of the exposure is due to genetics or environment. By assuming that siblings are raised equally (i.e. have the same environment in upbringing), and that cousins are not, a possibility to differentiate the variation arises.
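The genetic sharing that drives these comparisons can be tabulated directly. The following sketch (plain Python; the dictionary and function names are ours) encodes the expected proportion of genes shared identical-by-descent for each pair type used in the CoS design:

```python
# Expected proportion of genes shared identical-by-descent for the pair
# types compared in the CoS design. Full siblings share half of their
# genes on average; each step to a half-relationship or down a
# generation halves the expected sharing.
RELATEDNESS = {
    "full siblings": 0.5,
    "maternal half siblings": 0.25,
    "full cousins": 0.125,    # parents are full siblings
    "half cousins": 0.0625,   # parents are half siblings
}

def genetic_contrast(pair_a: str, pair_b: str) -> float:
    """Difference in expected genetic sharing between two pair types."""
    return RELATEDNESS[pair_a] - RELATEDNESS[pair_b]
```

The full-/half-sibling comparison in M5 thus rests on a 0.25 difference in expected sharing, while the full-/half-cousin comparison in M4 rests on only 0.0625, which is why the cousin contrast is the weaker source of genetic information.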
When comparing cousins, any similarity in the correlation between exposure and outcome will be due to both genetic and environmental (dis-)similarities (full cousins share 0.125 of their genes and half cousins 0.0625). And when comparing full siblings (0.5 shared genes) with maternal half siblings (0.25 shared genes), any detected difference between the two groups will be (thought of to be) mainly due to genetics. The confounders are nuisance parameters in the analyses and will not be investigated in depth.

2 Methods - Theory

The CoS model is hierarchical by construct. To cope with this, a Hierarchical Linear Model (HLM) will be applied. This model is able to incorporate the dependency structure within each cluster/family and divide the variance into level-specific parts. This way of taking care of the dependence using clusters can be thought of as taking into account unmeasured confounders apart from the measured ones included in the model (D'Onofrio et al., 2008). The hierarchy will also be incorporated into a Structural Equation Model (SEM). The SEM aims to explain the variability in the outcome variable in terms of genetic and

environmental similarity/dissimilarity. The model has two levels: one regarding the children of two female siblings, within each of the siblings separately; the other between the two siblings. This model uses an approach which divides the variation into three components: the A-component, which measures the genetic influence on likeness; the C-component, measuring the environmental influence that makes siblings similar; and the E-component, measuring the environmental influence that makes siblings different. When we refer to the ACE model, this is the model to have in mind.

2.1 Hierarchical Linear Models

The Hierarchical Linear Model (HLM) (also known as multi-level model) is a way to cope with dependent and nested structures in data using clustering at several levels. The basis is an ordinary linear model. By letting e.g. the intercept vary randomly at each level, the model produced represents a hierarchical structure (Raudenbush & Bryk, 2002). The variation of the intercept depends on the cluster membership of the outcome variable. This structure allows for hierarchical nesting; e.g. in CoS the nuclear families are nested within extended families (or households) and offspring are nested within nuclear families. To conceptualize this model we imagine how the data are created. Having in mind a three-level model, we try to understand the formation of the unit at the individual level. First, at the extended family level, a cluster is drawn from the extended family clusters according to some rules. The nuclear family level cluster is then drawn, and the information about the extended family level cluster is incorporated into this cluster. Finally the unit at the individual level is drawn, and this is affected by which clusters at the nuclear family and extended family levels the unit belongs to.
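This data-generating story can be sketched as a simulation. The following is an illustrative sketch only: the variance values, family sizes and all names are made up; only the order of the draws (extended family effect, then nuclear family effects, then individual errors) follows the text.

```python
import numpy as np

rng = np.random.default_rng(2008)

tau2_2 = 0.5    # extended family level variance (made-up value)
tau2_1 = 0.3    # nuclear family level variance (made-up value)
sigma2 = 1.0    # individual level variance (made-up value)
lam0 = 100.0    # grand mean (made-up value)

def draw_extended_family(n_nuclear, n_children):
    """One extended family: draw u_t first, then r_nt per nuclear
    family, then individual errors e_int, exactly in that order."""
    u_t = rng.normal(0.0, np.sqrt(tau2_2))
    return [lam0 + u_t + rng.normal(0.0, np.sqrt(tau2_1))
            + rng.normal(0.0, np.sqrt(sigma2), size=n_children)
            for _ in range(n_nuclear)]

fams = [draw_extended_family(2, 2) for _ in range(20000)]
# Siblings share u_t and r_nt; cousins share only u_t:
sib_cov = np.cov([f[0][0] for f in fams], [f[0][1] for f in fams])[0, 1]
cous_cov = np.cov([f[0][0] for f in fams], [f[1][0] for f in fams])[0, 1]
# sib_cov should be close to tau2_1 + tau2_2 = 0.8,
# cous_cov close to tau2_2 = 0.5.
```

The empirical sibling and cousin covariances recover the level-specific variance components, which is exactly the structure the HLM exploits.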
When comparing this unit to other units, the extended family level clusters of the two units are compared, and the same holds for the nuclear family level clusters.

2.1.1 CoS - a Hierarchical Linear Model

The hierarchical structure of the model enables a partitioning of the variance into parts corresponding to each level of the hierarchy. Let Y_{int} denote the outcome of the i:th offspring in the n:th nuclear family within the t:th extended family. The model without any covariates is

Y_{int} = λ_0 + e_{int} + r_{nt} + u_t.   (1)

This model relies upon the assumptions that the random errors are normally distributed, e_{int} ~ N(0, σ²), that the errors at the nuclear family level are normally distributed as well, r_{nt} ~ N(0, τ₁²), as are the errors at the extended family level, u_t ~ N(0, τ₂²). We interpret equation (1) as: the outcome Y_{int} equals a grand mean λ_0 plus three random effects: e_{int}, a random effect at the individual level for individual i; r_{nt}, a random effect at the nuclear family level for the nuclear family n in which individual i is included; and u_t, a random effect at the extended family level for the extended family t including nuclear family n and individual i. Another way to describe the random parts of (1) is that the variation of the

offspring from the nuclear family mean is measured by e_{int}, the variation of the nuclear family from the extended family mean is accounted for in r_{nt}, and the variation of the extended family from the overall mean is found in u_t. At each level there is a possibility to include covariates which are thought to have an impact on the outcome and therefore to confound the causality. At the individual level (level 1), consider a total of p covariates and the simple linear model

Y_{int} = π_{0nt} + Σ_{j=1}^{p} π_{jnt}(α_{jint}) + e_{int},   (2)

where the meaning of the notation π_{jnt}(α_{jint}) is explained below. To include the other levels' random errors as in (1), let the intercept π_{0nt} vary randomly while including covariates specific to the nuclear family level (level 2). With q covariates at the nuclear family level we obtain

π_{0nt} = β_{0t} + Σ_{j=1}^{q} β_{jt}(χ_{jnt}) + r_{nt}.   (3)

Similarly, the extended family level (level 3) terms are introduced by letting the intercept vary randomly once again and including covariates that are common to the whole extended family. Using the linear model with s covariates at the extended family level,

β_{0t} = λ_0 + Σ_{j=1}^{s} λ_j(ξ_{jt}) + u_t.   (4)

By combining (2) - (4) and using the fact that only the intercepts, not the slopes, vary between clusters (i.e. π_{jnt} = π_j for all n, t, and similarly for β), the model is

Y_{int} = λ_0 + Σ_{j=1}^{p} π_j(α_{jint}) + Σ_{j=1}^{q} β_j(χ_{jnt}) + Σ_{j=1}^{s} λ_j(ξ_{jt}) + e_{int} + r_{nt} + u_t.   (5)

Assumptions made are that Cov(e_{int}, r_{nt}) = 0, Cov(e_{int}, u_t) = 0 and Cov(r_{nt}, u_t) = 0, i.e. all random effects are uncorrelated with each other. The variables α_{jint}, χ_{knt} and ξ_{lt} represent indicator variables for covariates j, k and l at levels 1, 2 and 3, respectively, for fixed effect covariates. In the case of a continuous covariate (considered to be measured without error) they represent a centered version of the variable, creating contrast codes. E.g. the model containing (only) one regression coefficient π_1 associated with α_{1int} = continuous variable X_{1int} would look like Y_{int} = λ_0 + π_1(X_{1int} − X_{1.nt}) + e_{int} + r_{nt} + u_t. But if the continuous variable were on level 2, e.g. χ_{1nt} = X_{1nt}, the centering would look like Y_{int} = λ_0 + β_1(X_{1nt} − X_{1.t}) + e_{int} + r_{nt} + u_t. The reason for the cluster-level centering is that possible bias from correlation between the covariate and the random effects will be (approximately) removed (Neuhaus & McCulloch, 2006); furthermore, this contrast-coding approach yields parameters equivalent to fixed effects (Greene, 2003). There is also the possibility of letting the slopes vary randomly, but in this application they do not; see (Raudenbush &

Bryk, 2002) for a more complete treatment of the HLM models. As can be seen, the model in (5) is a linear mixed model, with a certain design of the random and fixed effects. The HLM design is reflected in the covariance matrix. For example, let the outcome vector Y = (Y_{111}, Y_{211}, Y_{121}, Y_{221}, Y_{112}, Y_{122})^T be all the data from the three-level HLM used in CoS. In this small example the first extended family consists of two nuclear families with two children each, and the second extended family consists of two nuclear families with one child each. The covariance matrix has the blocked structure of (6):

Cov(Y) =
| σ²+τ₁²+τ₂²  τ₁²+τ₂²     τ₂²         τ₂²         0           0          |
| τ₁²+τ₂²     σ²+τ₁²+τ₂²  τ₂²         τ₂²         0           0          |
| τ₂²         τ₂²         σ²+τ₁²+τ₂²  τ₁²+τ₂²     0           0          |
| τ₂²         τ₂²         τ₁²+τ₂²     σ²+τ₁²+τ₂²  0           0          |
| 0           0           0           0           σ²+τ₁²+τ₂²  τ₂²        |
| 0           0           0           0           τ₂²         σ²+τ₁²+τ₂² |   (6)

In the covariance matrix (6) the different families (nuclear and extended) can be identified through the covariance parameters of r_{11}, r_{21}, r_{12}, r_{22} (τ₁²) and u_1, u_2 (τ₂²). Let us turn our attention to the covariates at the different levels. We will describe each level and exemplify with variables used in the study in this paper. The individual level, level 1, includes the outcome variable and confounders specific to the individual, e.g. the mother's SDP, the mother's age at childbirth and the child's age when the outcome variable is measured. Level 2 is the nuclear family level; variables common to a nuclear family belong here. Examples are the socioeconomic status of the mother/father, whether the siblings are full or half siblings, and the cohabitation of the parents. Level 3 is the extended family level; variables common to the whole extended family are counted to this level, an example being whether the aunts are full or half siblings. In figure 1 a schematic illustration of the three levels of the hierarchy is given.
The outcome is measured at level 1. The ρ₁ represents the correlation between the aunts because of shared genes and environment. The ρ₂ points out that correlation between siblings is present, again due to genes and environment, while the ρ₃ represents a correlation between cousins that is (thought of to be) mainly due to genetics. The ρ₄ indicates the correlation between half siblings, which differs from ρ₂ in terms of genetic similarity. There are more correlations which have not been indicated in the figure. All of these correlations of exposure, confounding variables and outcome have to be taken into account as rigorously as possible, and the CoS model deals with this through clustering and submodelling.
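The blocked structure of (6) can be generated mechanically from the cluster memberships. A small sketch (the helper name is ours), reproducing the six-observation example with exactly-representable made-up variance values:

```python
import numpy as np

def cos_covariance(nuclear_ids, extended_ids, sigma2, tau2_1, tau2_2):
    """Cov(Y) for the three-level HLM: sigma^2 on the diagonal, tau_1^2
    added within nuclear families, tau_2^2 added within extended
    families. Nuclear ids are only unique within an extended family."""
    nuc = np.asarray(nuclear_ids)
    ext = np.asarray(extended_ids)
    same_ext = (ext[:, None] == ext[None, :]).astype(float)
    same_nuc = same_ext * (nuc[:, None] == nuc[None, :])
    return sigma2 * np.eye(len(nuc)) + tau2_1 * same_nuc + tau2_2 * same_ext

# The six-observation example: Y = (Y_111, Y_211, Y_121, Y_221, Y_112, Y_122)
V = cos_covariance([1, 1, 2, 2, 1, 2], [1, 1, 1, 1, 2, 2],
                   sigma2=1.0, tau2_1=0.25, tau2_2=0.5)
```

Entry (1,2) is then τ₁² + τ₂² (siblings), entry (1,3) is τ₂² (cousins), and entries across the two extended families are zero, matching the blocks of (6).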

[Figure 1: An illustration of the different levels (level 3 at the top, level 1 at the bottom), with the correlations ρ₁-ρ₄ drawn between family members; M = male, F = female.]

2.1.2 The submodels included in the CoS

The ingenuity of the CoS model lies in taking good care of the information about genetic similarity/dissimilarity among the individuals included in the data. This is addressed in a number of submodels. The first model, M1, is an ordinary HLM with no consideration of genetics or covariates; M1 performs a simple test of any correlation between the exposure and the outcome. The next model, M2, where all confounders are included, is an ordinary way of testing the significance of the exposure. But the idea of the CoS model is that the significance of the exposure in model M2 can be confounded by non-independence between e.g. cousins due to partly shared genetics. In model M3 siblings are compared with siblings and cousins with cousins regarding the exposure. This means that the similarity of siblings and cousins is addressed, but there is no differentiation between environmental and genetic effects. A mean across all cousins included in an extended family supplies an unrelated comparison. In models M4 (cousins) and M5 (siblings), consideration is given to whether the cousins (for M4)/siblings (for M5) are full or half. These models use subsamples which include two sisters and one of their children each in M4, or two children per mother in M5, where possible. The subsamples are indicated with superscripts (c) for cousins in M4 and (s) for siblings in M5. We favour the inclusion of mothers/sisters with diverse exposure where there is a possibility of important comparisons. For example, if a mother in M5 has more than two children and just one of these has been exposed, we will pick the one with exposure and randomly sample one of the others.
In M4 and M5 the interaction term between the exposure variable and the full-/half- sibling/cousin variable is included in the models. This term measures the difference between the differentially exposed siblings/cousins in the full/half pairs, yielding a genetically/environmentally informative result. The description of the submodels in equations (7)

- (11) below uses the exposure variable SDP from the study performed as an example; the confounders are left unspecified.

Model 1, M1 - An ordinary HLM, examines the magnitude of association between SDP (= 0 or 1) and PF using the entire sample.

Y_{int} = λ_0 + π_1(SDP_{int}) + e_{int} + r_{nt} + u_t.   (7)

Model 2, M2 - As above but controlling for covariates/confounders.

Y_{int} = λ_0 + π_1(SDP_{int}) + Σ_{j=2}^{p} π_j(α_{jint}) + Σ_{j=1}^{q} β_j(χ_{jnt}) + Σ_{j=1}^{s} λ_j(ξ_{jt}) + e_{int} + r_{nt} + u_t.   (8)

Model 3, M3 - This model compares cousins (with cousins) and siblings (with siblings) who are differentially exposed to SDP; in these calculations we have included contrast coding.

Y_{int} = λ_0 + π_1(SDP_{int} − SDP_{.nt}) + Σ_{j=2}^{p} π_j(α_{jint}) + β_1(SDP_{.nt} − SDP_{..t}) + Σ_{j=2}^{q} β_j(χ_{jnt}) + Σ_{j=1}^{s} λ_j(ξ_{jt}) + e_{int} + r_{nt} + u_t.   (9)

The contrast codes are calculated as follows:

For siblings: the average SDP for a mother over all her pregnancies is subtracted from the SDP for each son, thus comparing siblings differentially exposed to SDP within the nuclear family.

For cousins: the average SDP for the sisters over all their pregnancies is subtracted from the mean SDP for each nuclear family, comparing nuclear families differentially exposed to SDP between nuclear families, within extended families.

Model 4, M4 - Compares cousins whose parents are full siblings with those whose parents are half siblings, to further examine the aspect of genetic relatedness. For this a subsample is created, including two adult siblings, and one of their children each, within each extended family (where possible). The same approach to contrast coding as above is used, recalculated for the new subset, and a full-/half-cousin variable (values 0 and 1) is included. The interaction term between the contrast for smoking and the full-/half-sibling variable now estimates the difference between cousin pair types.

Y^{(c)}_{int} = λ_0 + β_1(SDP^{(c)}_{.nt} − SDP^{(c)}_{..t}) + Σ_{j=1}^{p} π_j(α_{jint}) + Σ_{j=2}^{q} β_j(χ_{jnt}) + Σ_{j=1}^{s} λ_j(ξ_{jt}) + e_{int} + r_{nt} + u_t.   (10)

Model 5, M5 - Compares full and half siblings. As above, but with sibling pairs and their types.

Y^{(s)}_{int} = λ_0 + π_1(SDP^{(s)}_{int} − SDP^{(s)}_{.nt}) + Σ_{j=2}^{p} π_j(α_{jint}) + Σ_{j=1}^{q} β_j(χ_{jnt}) + Σ_{j=1}^{s} λ_j(ξ_{jt}) + e_{int} + r_{nt} + u_t.   (11)

Of course each model is estimated separately, meaning that for example the grand mean λ_0 will (probably) take different values in each model. We have chosen to include all confounders in our model and to remove the non-significant ones by stepwise elimination, each step eliminating the least significant confounder. Another way to pick the confounders to be included in the models is to test for correlation prior to fitting the big models and thus come up with the risk factors/confounders (as in (D'Onofrio et al., in press)). The contrast coding is of great importance. Since the mean over all cousins in an extended family is included, we can find an estimate for unrelated individuals in each model. The differences when comparing within extended or nuclear families are found in the contrasts. E.g. the siblings in a family in M3 might have a high mean SDP, and also a lower mean PF result; the contrast allows us to examine whether the non-SDP siblings differ from the SDP siblings within this family. By doing this we can eliminate the main family effect and make statements regarding comparisons of individuals with similar starting positions.

2.1.3 Interaction terms

When deciding which interaction terms, if any, to include in the model there are a couple of things to be aware of. First, interaction between levels seems unfeasible (Montgomery, 2005) because all levels of a factor at the lower level are possibly not present for every higher level. In our analysis the interaction between the contrast-coded SDP variable and the full-/half- sibling/cousin variable must be there because it is of main interest for the conclusions.
From (D'Onofrio et al., in press) (as mentioned, this is one of the papers using the CoS model which we use as a template) it can be seen that none of the interaction terms between confounders were taken into consideration. This approach is also favorable when dealing with large datasets, to lower the computation time. Another way to approach this is to test for all possible interactions and stepwise eliminate the non-significant ones. Or perhaps the middle way: pick the interaction terms to be included using knowledge about them. No matter which approach is taken, an informed decision has to be made.
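The contrast codes of M3 are plain cluster-mean deviations, and can be computed directly from the family identifiers. A sketch in numpy (the function name is ours; the exposure values are made up):

```python
import numpy as np

def contrast_codes(sdp, nuclear_ids, extended_ids):
    """Contrast codes as in M3:
    within  = SDP_int - mean over the nuclear family,
    between = nuclear-family mean - mean over the extended family."""
    sdp = np.asarray(sdp, dtype=float)
    ext = np.asarray(extended_ids)
    # nuclear ids are only unique within an extended family
    nuc = np.array(list(zip(extended_ids, nuclear_ids)))
    nuc_mean = np.array([sdp[(nuc == key).all(axis=1)].mean() for key in nuc])
    ext_mean = np.array([sdp[ext == t].mean() for t in ext])
    return sdp - nuc_mean, nuc_mean - ext_mean

# The six children from the earlier example; exposures are made up:
sdp = [1, 0, 0, 0, 1, 0]
within, between = contrast_codes(sdp, [1, 1, 2, 2, 1, 2], [1, 1, 1, 1, 2, 2])
```

The within-family codes sum to zero inside each nuclear family, so π_1 in (9) is identified purely by differentially exposed siblings, while β_1 compares nuclear families within extended families.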

2.1.4 Estimation and degrees of freedom

The mixed model of the type in (5) is estimated using SAS's proc mixed. Proceeding from this to estimation of the parameters π, β and λ and the variance parameters is done using the General Linear Mixed Model. We rewrite equation (5) in matrix form:

Y = Xβ + Zu + ε.   (12)

X contains the values of the fixed effects in β, the values in Z are the dummy variables which indicate which random effects in u are of interest, and ε are the random errors corresponding to the e_{int}:s. To illustrate the transition from (5) to (12) we use the outcome vector Y = (Y_{111}, Y_{211}, Y_{121}, Y_{221}, Y_{112}, Y_{122})^T as we did when illustrating the covariance matrix (6). For each level of the hierarchy we include two effects, one fixed with two levels and one continuous. The β-vector specifying the covariates is

β_{(10×1)} = (λ_0, π_1(1), π_1(2), π_2, β_1(1), β_1(2), β_2, λ_1(1), λ_1(2), λ_2)^T.   (13)

With made-up values for the fixed effect covariates π_1, β_1 and λ_1 and centered versions for the continuous covariates π_2, β_2 and λ_2 we have (columns ordered as in (13)):

X_{(6×10)} =
| 1  1 0  (X_{2111} − X_{2.11})  1 0  (X_{211} − X_{2.1})  1 0  (X_{21} − X_{2.}) |
| 1  1 0  (X_{2211} − X_{2.11})  1 0  (X_{211} − X_{2.1})  1 0  (X_{21} − X_{2.}) |
| 1  0 1  (X_{2121} − X_{2.21})  1 0  (X_{221} − X_{2.1})  1 0  (X_{21} − X_{2.}) |
| 1  0 1  (X_{2221} − X_{2.21})  1 0  (X_{221} − X_{2.1})  1 0  (X_{21} − X_{2.}) |
| 1  1 0  (X_{2112} − X_{2.12})  1 0  (X_{212} − X_{2.2})  0 1  (X_{22} − X_{2.}) |
| 1  1 0  (X_{2122} − X_{2.22})  1 0  (X_{222} − X_{2.2})  0 1  (X_{22} − X_{2.}) |   (14)

Within each of the two extended families we have two nuclear families, which yields the random effects vector

u_{(6×1)} = (r_{11}, r_{21}, r_{12}, r_{22}, u_1, u_2)^T.   (15)

The specification of Z depends on which family each individual belongs to:

Z_{(6×6)} =
| 1 0 0 0 1 0 |
| 1 0 0 0 1 0 |
| 0 1 0 0 1 0 |
| 0 1 0 0 1 0 |
| 0 0 1 0 0 1 |
| 0 0 0 1 0 1 |   (16)

The random errors vector is simply

ε_{(6×1)} = (e_{111}, e_{211}, e_{121}, e_{221}, e_{112}, e_{122})^T.   (17)

Following the notation in (Littell et al., 2006), let u ~ N(0, G) and ε ~ N(0, R); Cov(u, ε) = 0 is also assumed. Proceeding from this we can write a joint normal distribution

for u and ε:

f(u, ε) = (2π)^{−(n+g)/2} |diag(G, R)|^{−1/2} exp( −(1/2) (u, y − Xβ − Zu)^T diag(G, R)^{−1} (u, y − Xβ − Zu) )   (18)

with n = the sample size and g = the number of elements in u. The joint distribution f(u, ε) can be thought of as a joint likelihood. But since the randomness in Zu is present, maximization of this (quasi-)likelihood would yield an incorrect solution; the estimation method starts out here and then refines the estimates. By maximizing the distribution (likelihood) with respect to β and u, solutions β̃ and ũ can be obtained for β and u (see (Littell et al., 2006) for details); via partial derivatives the mixed model equations can be found:

| X^T R^{−1} X   X^T R^{−1} Z           | | β̃ |   | X^T R^{−1} y |
| Z^T R^{−1} X   Z^T R^{−1} Z + G^{−1}  | | ũ  | = | Z^T R^{−1} y |   (19)

This can be solved as

β̃ = (X^T V^{−1} X)^{−1} X^T V^{−1} y,
ũ = G Z^T V^{−1} (y − X β̃),   (20)

where V = Var(Y) = Z G Z^T + R. We have ignored the possibility of needing to deal with generalized inverses (to which there are solutions implemented in proc mixed). These equations can of course be solved, but it can be time consuming to find the inverse of V, and the estimate ũ of u depends on the estimate of β. Through a method called SWEEP the time is considerably lowered; for an explanation of the method see for example (Smith & Graser, 1986). The default estimation method in SAS's proc mixed for solutions to the HLMs is Restricted Maximum Likelihood (REML). The reason for using REML rather than ordinary Maximum Likelihood (ML) is that ML tends to bias the estimates of the standard errors downwards (Littell et al., 2006). This results in smaller confidence intervals and, in the end, hypothesis tests that are more likely to be overly liberal.
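The equivalence between the mixed model equations and the V-based solution can be checked numerically. A sketch with a small made-up design (all values, cluster sizes and names are ours): the GLS/BLUP solution in the style of (20) is computed directly, and then verified to solve Henderson's system in the style of (19).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up problem: 3 clusters of 2 observations, a random intercept per
# cluster, and a fixed intercept plus one fixed covariate.
n, g = 6, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.kron(np.eye(g), np.ones((2, 1)))   # cluster membership dummies
G = 0.5 * np.eye(g)                       # Var(u), made-up value
R = np.eye(n)                             # Var(eps), made-up value
y = rng.normal(size=n)

# Solution in the style of (20):
V = Z @ G @ Z.T + R
Vinv = np.linalg.inv(V)
beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
u_hat = G @ Z.T @ Vinv @ (y - X @ beta)

# Mixed model equations in the style of (19):
Rinv = np.linalg.inv(R)
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + np.linalg.inv(G)]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = np.linalg.solve(lhs, rhs)
# sol stacks (beta, u_hat); the two routes agree.
```

Note that the mixed model equations only require inverting R and G (often diagonal or block-diagonal), which is why software prefers them over inverting the full n × n matrix V.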
The Minimum Variance Quadratic Unbiased Estimation (MIVQUE) procedure is non-iterative, in contrast to ML and REML; it finds the quadratic variance/covariance estimators that come as close as analytically possible to satisfying the Lehmann-Scheffé criterion for uniformly minimum variance (Slanger, 1996). This procedure provides unbiased estimates of the variance parameters. The reason for choosing MIVQUE over REML is that convergence can sometimes be hard to obtain using REML or ML, and also to save computing time when handling large data sets 1. Sometimes there can be trouble with convergence due to problems involving one or more confounders; a way to approach this problem is to use MIVQUE

1 In the current study the data sets for M1, M2 and M3 can be considered large, and convergence sometimes fails using REML.

during the stepwise elimination of non-significant variables to identify problem variables, but the final analysis should be made using REML. The REML procedure uses the MIVQUE values as starting points. REML is really just an estimation technique for the covariance parameters; the fixed effects β are estimated at the REML estimates of the covariance parameters.

Let us start with the ordinary ML method, which minimizes the -2 log likelihood. Let θ denote the covariance parameters and y the observed sample. From (20) we can find a solution β̂(θ) for the fixed effects,

$$
\hat{\beta}(\theta) = (X^T V(\theta)^{-1} X)^{-1} X^T V(\theta)^{-1} y. \quad (21)
$$

Thus the -2 log likelihood is

$$
-2l(\theta; y) = n \ln(2\pi) + \ln|V(\theta)| + (y - X\hat{\beta}(\theta))^T V(\theta)^{-1} (y - X\hat{\beta}(\theta)). \quad (22)
$$

By minimizing this function with respect to θ the ML solution can be found. The ML estimators for a sample y = (y_1, ..., y_n)^T of independent identically distributed observations from N(μ, σ²) with unknown μ and σ² are μ̂ = (1/n) Σ_i y_i and σ̂² = (1/n) Σ_i (y_i − ȳ)². The estimator σ̂² is biased, with E(σ̂²) = ((n−1)/n) σ², but if the mean were known the unbiased estimator σ̃² = (1/n) Σ_i (y_i − μ)² would constitute the ML estimator of σ². Thus the unknown mean is a source of bias for the covariance parameter; REML takes this fact into account.

The REML estimation procedure can be seen as a maximization of the likelihood of transformed data. Instead of minimizing the -2 log likelihood of Y, as in ordinary ML, REML minimizes the -2 log likelihood of KY, where K is chosen such that E(KY) = 0, i.e. the maximization is performed on error contrasts. A known result shown by Harville (1974) (see (Littell et al., 2006) for details) is that the restricted likelihood for REML is

$$
L_R(\theta) = |X^T X|^{1/2} \, (2\pi)^{-(n-p)/2} \, |V(\theta)|^{-1/2} \, |X^T V(\theta)^{-1} X|^{-1/2}
\exp\left( -\frac{1}{2} (y - X\hat{\beta}(\theta))^T V(\theta)^{-1} (y - X\hat{\beta}(\theta)) \right). \quad (23)
$$

Thus the -2 log likelihood is
$$
-2l_R(\theta; y) = \ln|V(\theta)| + \ln|X^T V(\theta)^{-1} X| + (y - X\hat{\beta}(\theta))^T V(\theta)^{-1} (y - X\hat{\beta}(\theta)) + c_R \quad (24)
$$

where the extra term ln|X^T V(θ)^{-1} X| is the only difference from the ML -2 log likelihood in terms of what is minimized. The term c_R is a constant which is not affected by the choice of the matrix K and is independent of θ. The minimization of (24) with respect to θ is performed iteratively, since the covariance parameters enter V(θ) non-linearly. The solution for the fixed effects is found through the estimated covariance components θ̃ as β̃(θ̃) = (X^T V(θ̃)^{-1} X)^{-1} X^T V(θ̃)^{-1} y. Proc mixed uses a ridge-stabilized Newton-Raphson algorithm for the minimization of the -2 log likelihood (Littell et al., 2006).

Now we turn to the problem of picking the correct degrees of freedom for the F-tests and for the construction of confidence intervals for the fixed factors in the model. Since the regular

ANOVA assumptions (e.g. independence of outcomes and homoscedasticity) do not hold when using the mixed model, a correction of the degrees of freedom has to be made. If no correction is made, the resulting confidence intervals will be too narrow and significance tests will be overly liberal; thus the importance of the correct degrees of freedom cannot be stressed enough. There are a couple of ways to do this. One is the Fai-Cornelius (FC) method, implemented through the satterthwaite option when specifying the model in proc mixed. Another option is Kenward-Roger, which is often more suitable when the covariance structure is complex; in this study the covariance structure is relatively simple and thus the FC option is better (Littell et al., 2006; Schaalje et al., 2001). FC uses a Satterthwaite approximation of the degrees of freedom, found through a spectral decomposition of the covariance matrix of the estimated β (for details see (Schaalje et al., 2001)). A third option is the between-within option, which is most suitable for highly unbalanced data. The best choice for the actual study is the satterthwaite option, but for data that are too unbalanced the between-within approach can be suitable.

2.2 Structural Equation Models

We will use SEM on data for siblings and their children to examine whether an exposure's correlation with an outcome is due to genetics, environment that makes siblings similar, or environment that makes siblings dissimilar. The model used to do this is called the ACE model, where A represents genetic influence, C represents environment that makes siblings similar and E represents environment that makes siblings dissimilar.

What are Structural Equation Models?

Structural Equation Models are models which, through analyses of mean vectors and covariance matrices, depict relationships between observed variables (Schumacker & Lomax, 2004).
Behind this fuzzy formulation lies a complex theory mainly developed by applied researchers in psychology, psychometrics and econometrics (Hays et al., 2005). The method is closely related to factor analysis. The difference is that while factor analysis focuses on covariance matrices and, through them, latent (unmeasured) factors, SEM includes more possible dependencies: within variables, within latent factors, and between variables and latent factors. SEM is able to cope with multiple latent variables in combination with measured indicator and outcome variables, using dependence between and within the latent, indicator and outcome variables. Since the dependencies can be complicated, researchers often choose to use a type of diagram to depict the models. An example is shown in figure 2, where F1 and F2 represent latent (unobserved) independent variables, x1 and x2 represent observed dependent variables and y is the outcome variable. The V1 and V2 imply that the latent unobservable variables are estimated with an error. Further explanation of the arrows is found below.

Figure 2: An illustration of SEM.

The illustrations often use circles to indicate latent unmeasured factors and squares for measured indicator and outcome variables. This presentation makes it easy to get a picture of what is going on, which is quite suitable if one is more interested in drawing conclusions than in understanding the underlying theory. Figure 2, for example, says that the indicator variables x1 and x2 and the outcome variable y are all three manifestations of two unmeasured factors; the indicator variables are also correlated. In general the model consists of two parts, the measurement model and the structural model. The measurement model contains the measured variables; the structural model describes the structure of the latent variables.

Matrix notation approach

Apart from the distinction between observed and latent variables there is also a distinction between dependent and independent variables. Independent variables are not influenced by any other variable, whereas dependent variables are influenced by other variables. The SEM can be written in matrix notation; the notation follows that of (Schumacker & Lomax, 2004). First the structural model. Let η be the latent dependent variables, an (m × 1) vector, and ξ the latent independent variables, an (n × 1) vector. Let the relationships among the latent dependent variables themselves be described by B (m × m), and let Γ (m × n) describe the relation of the latent independent variables to the latent dependent variables. Let the equation prediction errors be ζ (m × 1). Then the model is

$$
\eta = B\eta + \Gamma\xi + \zeta. \quad (25)
$$

The covariance matrix Cov(ξ) = Φ (n × n) describes the variances and covariances among the independent latent variables, and Cov(ζ) = Ψ (m × m) is the covariance matrix of the prediction errors in the latent dependent equations. Let us turn our attention to the measurement model.
For the latent dependent variables η, let Y (p × 1) be the observed measures (dependent or independent) and let Λ_y (p × m) describe the relationship between the observed variables and the latent

dependent variables. Let the measurement errors for Y be ε (p × 1); the measurement model is then

$$
Y = \Lambda_y \eta + \varepsilon. \quad (26)
$$

In the same manner the independent latent variables are found through the measured X (q × 1), and the relationship between the observed variables (dependent or independent) and the latent independent variables is contained in Λ_x (q × n). Let the measurement errors be δ (q × 1); the measurement model is

$$
X = \Lambda_x \xi + \delta. \quad (27)
$$

The covariance matrices of ε and δ are labelled Cov(ε) = Θ_ε (p × p) and Cov(δ) = Θ_δ (q × q) and contain the covariances between the errors of the observed dependent and independent variables respectively.

This rather complex notation for describing the model involves eight matrices, of which four (Φ, Ψ, Θ_ε and Θ_δ) are covariance matrices. All matrices can be estimated and used for inference. This is done by letting the parameters of the matrices be free, fixed or constrained. Which parameters are of interest depends of course on how the model is formulated; all matrices might not be of interest for a given model (Schumacker & Lomax, 2004).

Model fit

We will use two model fit measures, the Root Mean Squared Error of Approximation (RMSEA) and the Comparative Fit Index (CFI). The RMSEA is defined as (Muthén)

$$
RMSEA = \sqrt{\max\left( 2 F_{ML}(\hat{\pi})/d - 1/n,\; 0 \right)} \cdot \sqrt{G} \quad (28)
$$

where F_ML is the ML fitting function for G groups, d is the number of degrees of freedom for the model, n is the number of observations and π̂ is the ML estimate under H0. The fitting function is

$$
F_{ML}(\pi) = -\frac{\ln L_{H_0}}{n} + \frac{\ln L_{H_1}}{n} \quad (29)
$$

where L_{H0} is the likelihood of the fitted model and L_{H1} is the likelihood of an unrestricted model; unrestricted means that the means and covariances are not constrained. The RMSEA is a global measure, and it improves when more variables are added to the model. A value of 0.05 or less is deemed to be acceptable (Schumacker & Lomax, 2004).
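Equation (28) is straightforward to compute once the fitting function value is known. The helper below is a hedged sketch with invented numbers, using the identity χ² = 2nF_ML that follows from (29) for the likelihood-ratio statistic:

```python
import numpy as np

def rmsea(f_ml, d, n, groups=1):
    """RMSEA from the ML fitting function value, following equation (28)."""
    return np.sqrt(max(2.0 * f_ml / d - 1.0 / n, 0.0)) * np.sqrt(groups)

# Since chi2 = 2 * n * F_ML, a model with chi2 = 40 on d = 20 df and n = 500
# observations gives F_ML = 40 / (2 * 500); likewise for a worse-fitting model.
good = rmsea(40.0 / (2 * 500), d=20, n=500)
poor = rmsea(100.0 / (2 * 500), d=20, n=500)

assert good < 0.05 <= poor           # only the first model meets the 0.05 cut-off
assert rmsea(0.0, d=20, n=500) == 0.0  # a perfectly fitting model gives RMSEA 0
```

Note the truncation at zero: a model whose χ² falls below its degrees of freedom is simply reported as fitting perfectly rather than producing a negative index.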
The CFI is defined as

$$
CFI = 1 - \frac{\max(\chi^2_{H_0} - d_{H_0},\; 0)}{\max(\chi^2_{H_0} - d_{H_0},\; \chi^2_B - d_B,\; 0)} \quad (30)
$$

where H0 and B represent the fitted and baseline model respectively. The baseline model has uncorrelated outcomes with unrestricted means. As can be seen the CFI utilizes the χ²-distribution; this relies on the normality assumption for the outcome variable. The CFI spans

from 0 to 1, where 0 indicates no fit and 1 perfect fit. A value over 0.90 indicates a good fit (Schumacker & Lomax, 2004). The χ²_{H0} statistics for the models rest on the estimated parameters in the covariance matrix having normal errors: the squared difference between the observed and the hypothesized parameters, divided by the hypothesized parameter, follows a χ²-distribution with degrees of freedom equal to the number of estimated parameters minus 1. The actual χ²_{H0} statistic is more advanced, though, taking into account stratification, non-independence of observations due to cluster sampling, and/or unequal probabilities of selection; subpopulation analysis is also available (Muthén & Muthén).

When modifying the model according to findings, some parameters might be excluded. The estimation procedure includes forming confidence intervals constructed with the standard errors of the parameters. Following (Schumacker & Lomax, 2004), the rules for exclusion of parameters are based on three criteria: the parameter should be in the expected direction, be statistically different from zero and make practical sense. But Schumacker & Lomax also stress that "If a parameter is not significant but is of substantive interest, then the parameter should probably remain in the model." In the ACE model the parameters associated with the respective parts A, C and E must be considered as a group. Thus exclusion of just one parameter is probably unwise; if one is excluded, the rest must be excluded as well. This also applies in the opposite direction: if one of the parameters is slightly non-significant but the rest are not, then that parameter can, and should, stay in the model.

1-level SEM, ACE model

The ACE model is described in figure 3 for two sisters, their SDP and their children's PF. The left part of the figure represents one sibling, and the right part the other. The parameters connected to the A part (V_A, b_A and A_cov) model the variation in outcome due to genetics.
The parameters connected with the C factor represent the variation due to shared environment, the environmental influence that makes siblings similar. The parameters associated with the E factor represent the variation due to unshared environment, making siblings dissimilar. An arrow with a 1 beside it says that a variable is (perfectly) measured by the variable to which the arrow points. The arrows with parameters b_F (F = A, C, E) beside them are regression coefficients; e.g. PF1 is regressed on A1 and the resulting regression coefficient is b_A. Arrows pointing in two directions indicate correlation/covariation; e.g. A1 covaries with A2 with value A_cov.

Figure 3: An illustration of SEM.

By using constraints on the parameters we can find the values of V_A, V_C and V_E, the variances of the A, C and E parameters. We may also be interested in b_A, b_C and b_E, the regression parameters of PF on the A, C and E parameters. The r is the covariation/correlation between the outcomes of the two cousins, which might be of interest as well. The ACE parameters are independent latent variables (the covariances between the A:s and C:s are not considered to induce dependent variables) and the SDP:s and PF:s are dependent variables. The covariance matrices Φ and Θ_δ are the ones of main interest in the model, since the latent independent variables are measured by all observed variables. The constraints to put on the model are those that arise from its assumptions:

1. The genetic likeness of full siblings is twice that of half siblings.

2. The environment that makes siblings alike is equal for siblings.

The constraints also include setting covariances to zero for variables that are not considered to covary in the model. Only independent latent variables are involved, so all of the information is handled by (27) and the vectors and matrices X, Λ_x, ξ, δ, Φ and Θ_δ. The structural equation of interest, X = Λ_x ξ + δ, is

$$
\begin{bmatrix} SDP_1 \\ PF_1 \\ SDP_2 \\ PF_2 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
b_A & b_C & b_E & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & b_A & b_C & b_E
\end{bmatrix}
\begin{bmatrix} A_1 \\ C_1 \\ E_1 \\ A_2 \\ C_2 \\ E_2 \end{bmatrix}
+
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4 \end{bmatrix} \quad (31)
$$

where the unit loadings for the SDP:s correspond to the arrows marked with a 1 in figure 3.

And the covariance matrix for the latent independent variables is

$$
\Phi = \begin{bmatrix}
V_A & 0 & 0 & A_{cov} & 0 & 0 \\
0 & V_C & 0 & 0 & C_{cov} & 0 \\
0 & 0 & V_E & 0 & 0 & 0 \\
A_{cov} & 0 & 0 & V_A & 0 & 0 \\
0 & C_{cov} & 0 & 0 & V_C & 0 \\
0 & 0 & 0 & 0 & 0 & V_E
\end{bmatrix}. \quad (32)
$$

The last matrix is the measurement error covariance matrix of the observed dependent variables,

$$
\Theta_\delta = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & \varphi_{PF_1} & 0 & r \\
0 & 0 & 0 & 0 \\
0 & r & 0 & \varphi_{PF_2}
\end{bmatrix}. \quad (33)
$$

The model thus indicates that SDP is measured without error, but not PF. Further restrictions are laid upon A_cov and C_cov to fit our modelling assumptions. We assume equal environments for the siblings, which means that C_cov = V_C. For the genetic part, full and half siblings induce different restrictions: full siblings share 0.5 of their genes, so A_cov,full = (1/2)V_A. Half siblings share half of the amount shared by full siblings, so A_cov,half = (1/2)A_cov,full = (1/4)V_A. These last two restrictions are the most informative parts; by enforcing them on the model the genetic influence can be measured, since it differs between the groups of full and half siblings.

2-level SEM for ACE

The idea of multiple levels can be applied to SEM. The model is then split up into a within part and a between part, where within and between refer to within and between nuclear families. This does not mean that each part is estimated separately; as in the HLM they are estimated simultaneously. We apply this approach to the ACE model to incorporate the nuclear families into the model. The model is shown in figure 4, where the upper portion of the figure represents the within part and the lower portion the between part.
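To make the 1-level specification (31)-(33) concrete, the numpy sketch below (invented parameter values; it assumes, as in (31), unit loadings for the SDP:s) builds Λ_x, Φ and Θ_δ under the full- and half-sibling constraints and returns the model-implied covariance matrix Σ = Λ_x Φ Λ_x^T + Θ_δ of (SDP1, PF1, SDP2, PF2):

```python
import numpy as np

def ace_implied_cov(V_A, V_C, V_E, b_A, b_C, b_E, phi1, phi2, r, full_sibs=True):
    """Model-implied covariance of (SDP1, PF1, SDP2, PF2), sketching (31)-(33)."""
    # Loadings: SDP_i = A_i + C_i + E_i (no error); PF_i = b_A A_i + b_C C_i + b_E E_i + delta_i.
    Lam = np.array([[1,   1,   1,   0,   0,   0],
                    [b_A, b_C, b_E, 0,   0,   0],
                    [0,   0,   0,   1,   1,   1],
                    [0,   0,   0,   b_A, b_C, b_E]], dtype=float)
    A_cov = 0.5 * V_A if full_sibs else 0.25 * V_A  # genetic-likeness constraint
    C_cov = V_C                                     # equal-environment constraint
    Phi = np.diag([V_A, V_C, V_E, V_A, V_C, V_E])
    Phi[0, 3] = Phi[3, 0] = A_cov
    Phi[1, 4] = Phi[4, 1] = C_cov
    Theta = np.array([[0, 0,    0, 0],
                      [0, phi1, 0, r],
                      [0, 0,    0, 0],
                      [0, r,    0, phi2]], dtype=float)
    return Lam @ Phi @ Lam.T + Theta

# Invented illustrative values, not estimates from the study.
args = dict(V_A=1.0, V_C=0.5, V_E=0.8, b_A=0.3, b_C=0.2, b_E=0.1,
            phi1=1.0, phi2=1.0, r=0.05)
full = ace_implied_cov(**args, full_sibs=True)
half = ace_implied_cov(**args, full_sibs=False)

# Only terms involving A_cov differ, so cross-pair covariances shrink for half siblings
# while the within-person variances are unchanged -- exactly the contrast that
# identifies the genetic component.
assert full[0, 2] > half[0, 2]               # Cov(SDP1, SDP2)
assert full[1, 3] > half[1, 3]               # Cov(PF1, PF2)
assert np.allclose(np.diag(full), np.diag(half))
```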

Figure 4: An illustration of the 2-level SEM describing the ACE model.

We want to explain how the variation can be split up between the ACE parameters and from this draw conclusions regarding the environmental and genetic relevance of the influence of SDP on PF. To do this we want to incorporate the confounders in the model; one simple way is to perform a regression analysis of PF on the confounders and then do the SEM analysis on the residuals. Since the SEM aims to use the familial information, we perform a simple regression analysis without using clustering. The SDP variable is not included in the regression, since it is the indicator variable in the ACE model.

Estimation: Maximum Likelihood using Expectation Maximization

The estimation procedure uses an ML approach with Expectation Maximization. A full description of the procedure is out of the scope of this paper, but details are given in (Muthén). The model incorporates two possible classes, full and half siblings (more classes can be included if wanted). The method maximizes the likelihood, which is a mixture distribution over the separate classes. Through conditioning on data and class membership the likelihood is approximated and maximized.
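A deliberately simplified sketch of this class-specific likelihood follows (an invented toy covariance structure; class membership is treated as known instead of running full EM, so this illustrates the principle rather than Mplus's algorithm). The genetic parameter is shared across classes but enters with sharing factor 0.5 for full and 0.25 for half siblings, mirroring the A_cov restrictions:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def class_cov(a_cov):
    """Toy 4x4 covariance of (SDP1, PF1, SDP2, PF2) with one cross-pair parameter."""
    S = np.eye(4)
    S[0, 2] = S[2, 0] = a_cov          # cross-pair SDP covariance
    S[1, 3] = S[3, 1] = 0.3 * a_cov    # weaker cross-pair PF covariance
    return S

def loglik(V_A, data_full, data_half):
    """Log likelihood summed over the two known classes (full / half siblings)."""
    ll = 0.0
    for data, share in ((data_full, 0.5), (data_half, 0.25)):
        mvn = multivariate_normal(mean=np.zeros(4), cov=class_cov(share * V_A))
        ll += mvn.logpdf(data).sum()
    return ll

# Simulate 200 families of each class under a true V_A of 0.8.
data_full = rng.multivariate_normal(np.zeros(4), class_cov(0.5 * 0.8), size=200)
data_half = rng.multivariate_normal(np.zeros(4), class_cov(0.25 * 0.8), size=200)

# Profile the joint likelihood over a grid of V_A values; the contrast between the
# two classes is what identifies the genetic variance.
grid = np.linspace(0.1, 1.5, 29)
best = grid[np.argmax([loglik(v, data_full, data_half) for v in grid])]
```

With class membership observed, the "mixture" collapses to a sum of class-specific multivariate normal log densities; real EM is needed when membership probabilities must themselves be inferred.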

3 Data

3.1 The data

The main idea of this study is to examine whether SDP influences the child's Psychological Functioning capacity, PF. This variable is obtained from a standardized psychological interview of Swedish boys at conscription. PF is a prognosis of the ability to cope with stress in war-time (Nilsson et al., 2001). Any conclusions made by the study concern how exposure to prenatal nicotine affects the susceptibility to stress in late adolescence/early adulthood. The boys were born between 1973 and 1988 and were conscripted in 1997 or later. The ages vary between 18 and 30, but the majority are 18 or 19 (98.1%).

To be able to make correct inferences regarding the association of interest, other effects must be taken into account. To cope with this, the analysis must include covariates which may affect the outcome. How to pick these is always a subjective task, and prior knowledge must be used to make correct decisions regarding possible confounders to include in the model.

The data used for the CoS model are taken from a large dataset called the MgrCrime data base. This data base consists of many merged Swedish data bases, a couple of which are used in this study. The data bases used include information collected from: BRÅ (Swedish National Council for Crime Prevention), EpC (Centre for Epidemiology at the National Board of Health and Welfare), Pliktverket (National Service Administration) and SCB (Statistics Sweden). Variables of importance to this study are extracted from the data base and re-merged as suitable for the models M1-M5. The data used are stored in different data sets as described in figure 5.
Central is the Multi-Generation Register, which enables connecting children to parents/grandparents through unique identification numbers for each individual (variables originating from this register: lopnr, idfam, idextfam, idspouse, idgrandpa, lopnrmor, lopnrfar, lopnrmormor, lopnrmorfar, lopnrfarmor, lopnrfarfar). The National Crime Register is used to find out whether the parents have been convicted of a crime (crimem, crimef). The Medical Birth Register provides information concerning the birth of the children, such as weight (birthweight), gestational length (pregtime), mother's age at childbirth (agem) and mother's SDP (rok1, sdp). From the National Service Administration the outcome variable PF is collected; the age at conscription (conscriptage) is also a variable of interest. To find out about the socioeconomic status (seim, seif), income (incomem, incomef) and cohabitation (cohab) of the parents, the Population and Housing Census of 1990 is used. Educational levels (edum, eduf) of the parents are obtained from the Register of Education.

Figure 5: The data flow.

The variables

agem Mother's age at childbirth, a categorical variable taking on 5 values; 1: less than 20 years, 2: years, 3: years, 4: years, 5: greater than 35 years.

birthnr An integer indicating the child's number in the birth order.

birthnr2 As birthnr but recalculated for the M5 subsample.

birthweight_all The child's birthweight, centered for the unrelated comparison, created from bviktbs.

birthweight_cou The child's birthweight, centered for the cousin comparison.

birthweight_sib The child's birthweight, centered for the sibling comparison.

bviktbs The child's birthweight.

crimef A variable indicating whether the father has been convicted of a crime from 1973 onwards.

crimem A variable indicating whether the mother has been convicted of a crime from 1973 onwards.

cohab The cohabitation status of the parents at the time of the child's birth. 0: lives together with the child's father, 1: mother single, 2: not married to/divorced from the child's father (no cohabitation data), .: no information.

conscriptage The child's age at conscription, a categorical variable taking on 3 values; 1: less than 17.5 years, 2: 17.5-18.5 years, 3: greater than 18.5 years.

eduf The father's education in 2004, a categorical variable taking on 7 values; 1: Primary and lower secondary school, less than 9 years of education, 2: Primary and lower secondary school, 9 years of education, 3: Upper secondary school, 2-3 years of education, 4: Post-secondary school, less than 2 years of education, 5: Post-secondary school/college/university, 2-5 years of education, 6: Postgraduate education, 7: Unknown.

edum The mother's education in 2004, a categorical variable taking on values as above.

fullhalf_c A variable indicating whether the cousins are full or half cousins.

fullhalf_s A variable indicating whether the siblings are full or half siblings.

idextfam The index of the extended family.

idfam The index of the family.

idgrandpa The id-number of the man with whom the grandmother with id-number idextfam has the children included in the extended family.

idspouse The index of the man with whom the mother with id-number idfam has the child.

incomef The father's income in 1990, a categorical variable taking on 4 values; 1: less than 100 kkr, 2: kkr, 3: kkr, 4: greater than 300 kkr.

incomem The mother's income in 1990, values as above.

lopnr The index of the child of interest.

lopnrfar The index of the father of the child.

lopnrfarfar The index of the paternal grandfather of the child.

lopnrfarmor The index of the paternal grandmother of the child.

lopnrmor The index of the mother of the child, the same as idfam.

lopnrmorfar The index of the maternal grandfather of the child.

lopnrmormor The index of the maternal grandmother of the child.

pf The Psychological Functioning capacity as measured by Försvarsverket at conscription; a discrete variable taking on integer values 1-9, with a stipulated mean of 5 and variance of 4. The distribution is called stanine.

pregtime The gestational time, a categorical variable taking on 4 values; 1: weeks, 2: weeks, 3: weeks, 4: greater than 41 weeks.

rok1, sdp A dichotomous variable (0/1) indicating whether the mother smoked during pregnancy.

sdp_cou A contrast-coded variable giving the difference between a mother's mean SDP over her pregnancies and the extended family's mean SDP over pregnancies.

sdp_mean_cou The mean value of sdp for the cousins in an extended family.

sdp_sib A contrast-coded variable giving the difference between a mother's SDP for a single child and her mean over all pregnancies.

seif The socioeconomic status of the father in 1990, a categorical variable taking on the values; 1: Blue collar worker (in production of goods), not specially trained, 2: Blue collar worker, specially trained, 3: White collar worker, lower level, 4: White collar worker, intermediate level, 5: White collar worker, upper level, or self-employed, academic work, 6: Self-employed, 7: Uncategorized employed or no information.

seim The socioeconomic status of the mother in 1990, a categorical variable taking on values as above.

4 Results

Since we are considering normally distributed variables we will treat the PF variable as a (proposed) N(μ, σ²) variable, where μ = 5 and σ² = 4 are the predefined values.
No major problems have been encountered by doing this, but the F-tests, and consequently the inference and conclusions of the analyses, are questionable. We will only consider main effects of the confounders, not interactions, to save computational time and simplify

interpretation.

Figure 6: Left: the M1 quantile-quantile plot. Right: the M3 quantile-quantile plot.

In the left part of figure 6 the quantile-quantile plot shows that even though the data approximately follow the normal distribution, the discrete nature of the variable is still present after controlling for SDP and clustering as in M1. The deviations from the normal line must be regarded as major, since the sample size is so large. A second q-q plot, shown in the right part of figure 6, concerns the model M3, which controlled for SDP, covariates and clustering effects. The data follow the normal-quantile line better, but the deviations at the ends are still major. The histogram of the scaled residuals of M4 (i.e. scaled as a proposed N(0, 1) distribution) is displayed in figure 7, together with the assumed normal distribution and a fitted kernel density; this also shows that there are deviations from the normality assumption. Ways to deal with

Figure 7: The histogram of the residuals of M4.


More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Institute of Statistics and Econometrics Georg-August-University Göttingen Department of Statistics

More information
