Changing sets of parameters by partial mappings Michael Wiedenbeck and Nanny Wermuth


Center of Survey Research, Mannheim, Germany, and Department of Mathematical Statistics, Chalmers/Göteborgs Universitet, Sweden

Abstract: Changes between different sets of parameters are needed in multivariate statistical modeling. There may, for instance, be specific inference questions based on subject-matter interpretations, alternative well-fitting constrained models, compatibility judgements of seemingly distinct constrained models, and different reference priors under alternative parameterizations. We introduce and discuss a partial mapping, called partial replication, and derive from it a more complex mapping, called partial inversion. Both operations are shown to be tools for decomposing matrix operations, for explaining recursion relations among sets of linear parameters, for deriving effects of omitted variables in alternative model formulations, for changing between different types of linear chain graph models, for approximating maximum-likelihood estimates in exponential family models under independence constraints, and for switching partially between sets of canonical and moment parameters in distributions of the exponential family, or between sets of corresponding maximum-likelihood estimates.

Key words and phrases: Exponential family, graphical Markov models, independence constraints, partial replication, partial inversion, reduced model estimates, REML-estimates, sandwich estimates, structural equations.

1. Definitions and properties of two operators

1.1. Partial inversion

We start with a linear function connecting two real-valued column vectors y and x via a square matrix M of dimension d for which all principal submatrices are invertible,

\[ M y = x. \quad (1.1) \]

The index sets of rows and columns of M coincide and are ordered as V = (1, ..., d). For an arbitrary subset a of V we consider first a split of V as (a, b), with b = V \ a.
A linear operator for equation (1.1), called partial inversion, may be used to invert M in a sequence of steps and forms a starting point for proving properties of graphical

Markov models, see Wermuth and Cox (2004), Wermuth, Wiedenbeck and Cox (2006), and Marchetti and Wermuth (2007). Partial inversion is a minimally changed version of Gram-Schmidt orthogonalisation and of the sweep operator, the changes being such that an operator with attractive features results.

After partial inversion applied to rows and columns a of M, denoted by inv_a M, the argument and image in equation (1.1) relating to a are exchanged:

\[ \mathrm{inv}_a M \begin{pmatrix} x_a \\ y_b \end{pmatrix} = \begin{pmatrix} y_a \\ x_b \end{pmatrix}, \qquad \mathrm{inv}_a M = \begin{pmatrix} M_{aa}^{-1} & -M_{aa}^{-1} M_{ab} \\ M_{ba} M_{aa}^{-1} & M_{bb·a} \end{pmatrix}. \quad (1.2) \]

The matrix M_{bb·a} = M_{bb} - M_{ba} M_{aa}^{-1} M_{ab} is often called the Schur complement of M_{aa} in M, after Issai Schur, and M_{aa}^{-1} denotes the inverse of the submatrix M_{aa} of M. Partial inversion with respect to b applied to inv_a M is denoted in several equivalent ways, depending on the context,

\[ \mathrm{inv}_b(\mathrm{inv}_a M) = \mathrm{inv}_b \mathrm{inv}_a M = \mathrm{inv}_V M = \mathrm{inv}_{ab} M, \]

where inv_V M yields the inverse of M. Some of the basic properties of partial inversion are

\[ \text{(i)} \ \mathrm{inv}_a \mathrm{inv}_a M = M, \qquad \text{(ii)} \ (\mathrm{inv}_a M)^{-1} = \mathrm{inv}_b M, \qquad \text{(iii)} \ \mathrm{inv}_a M = \mathrm{inv}_b(M^{-1}). \quad (1.3) \]

The operator is commutative, can be undone, and gives the same result if applied with respect to an index set within a submatrix of M as if applied to M first, with the submatrix selected afterwards. More precisely, for three disjoint subsets α, β, γ of V, with a = α ∪ β,

\[ \text{(i)} \ \mathrm{inv}_α \mathrm{inv}_β M = \mathrm{inv}_β \mathrm{inv}_α M, \qquad \text{(ii)} \ \mathrm{inv}_{αβ} \mathrm{inv}_{βγ} M = \mathrm{inv}_{αγ} M, \qquad \text{(iii)} \ \mathrm{inv}_α M_{aa} = [\mathrm{inv}_α M]_{a,a}. \quad (1.4) \]

Indeed, all properties given so far result from equations (1.8) below, where partial inversion is expressed in terms of what will be defined as partial replication. Further properties, not used in this paper, may be derived by viewing partial inversion as a group isomorphism. From this point of view, partial inversion operates on the power

set of V, which has group structure with the symmetric difference, denoted by △, as law of composition. The neutral element is the empty set and the inverse of a subset a of V is a itself. The image space of partial inversion is then the set of transformations {inv_a, a ⊆ V}, a group under the composition of transformations with the identity as neutral element. Again, for every subset a of V, inv_a is its own inverse. Both groups are Abelian.

1.2. Partial replication

We now introduce another operator for equation (1.1), called partial replication, which represents a partial mapping. After partial replication applied to rows and columns a of M, denoted by rep_a M, the argument relating to a, denoted by y_a, is replicated, while the relation for the argument of b = V \ a is preserved as in equation (1.1):

\[ \mathrm{rep}_a M \begin{pmatrix} y_a \\ y_b \end{pmatrix} = \begin{pmatrix} y_a \\ x_b \end{pmatrix}, \qquad \mathrm{rep}_a M = \begin{pmatrix} I_{aa} & 0_{ab} \\ M_{ba} & M_{bb} \end{pmatrix}. \quad (1.5) \]

Partial replication with respect to b applied to rep_a M is denoted in a way analogous to partial inversion,

\[ \mathrm{rep}_b(\mathrm{rep}_a M) = \mathrm{rep}_b \mathrm{rep}_a M = \mathrm{rep}_V M = \mathrm{rep}_{ab} M, \]

where rep_V M yields the identity matrix I. A basic property of partial replication is

\[ \mathrm{rep}_a \mathrm{rep}_a M = \mathrm{rep}_a M. \quad (1.6) \]

The following matrix forms relate directly to components of inv_b M:

\[ \text{(i)} \ (\mathrm{rep}_a M)^{-1} = \begin{pmatrix} I_{aa} & 0_{ab} \\ -M_{bb}^{-1} M_{ba} & M_{bb}^{-1} \end{pmatrix}, \qquad \text{(ii)} \ M(\mathrm{rep}_a M)^{-1} = \begin{pmatrix} M_{aa·b} & M_{ab} M_{bb}^{-1} \\ 0_{ba} & I_{bb} \end{pmatrix}, \]

and the last matrix product has been used in the numerical-analytic technique of block Gaussian elimination as one special form in which the Schur complement M_{aa·b} = M_{aa} - M_{ab} M_{bb}^{-1} M_{ba} occurs.

The operator is commutative but cannot be undone. It gives the same result when applied with respect to an index set within a submatrix of M as when applied to M and

then selecting the submatrix. For three disjoint subsets α, β, γ of V and a = α ∪ β,

\[ \text{(i)} \ \mathrm{rep}_α \mathrm{rep}_β M = \mathrm{rep}_β \mathrm{rep}_α M, \qquad \text{(ii)} \ \mathrm{rep}_{αβ} \mathrm{rep}_{βγ} M = \mathrm{rep}_{αβγ} M, \qquad \text{(iii)} \ \mathrm{rep}_α M_{aa} = [\mathrm{rep}_α M]_{a,a}. \quad (1.7) \]

Thus it shares commutativity (i) and exchangeability (iii) with partial inversion but differs regarding property (ii). Nevertheless, partial inversion can be expressed in terms of partial replication. By direct computation and with property (1.3)(ii),

\[ \mathrm{inv}_b M = (\mathrm{rep}_b M)(\mathrm{rep}_a M)^{-1}, \qquad \mathrm{inv}_a M = (\mathrm{rep}_a M)(\mathrm{rep}_b M)^{-1}. \quad (1.8) \]

1.3. Partial inversion combined with partial replication

Basic properties of the two operators combined are

\[ \text{(i)} \ \mathrm{inv}_a \mathrm{rep}_a M = \mathrm{rep}_a M, \qquad \text{(ii)} \ \mathrm{inv}_b \mathrm{rep}_a M = (\mathrm{rep}_a M)^{-1} = \mathrm{rep}_a \mathrm{inv}_b M, \qquad \text{(iii)} \ \mathrm{rep}_b \mathrm{inv}_b M = M(\mathrm{rep}_a M)^{-1}. \quad (1.9) \]

For a partition of V into α, β, γ, δ and s = V \ δ, some of the derived properties are

\[ \text{(i)} \ \mathrm{inv}_α \mathrm{rep}_β M = \mathrm{rep}_β \mathrm{inv}_α M, \qquad \text{(ii)} \ \mathrm{inv}_{αβ} \mathrm{rep}_{βγ} M = \mathrm{inv}_α \mathrm{rep}_{βγ} M, \qquad \text{(iii)} \ \mathrm{rep}_α \mathrm{inv}_β M_{ss} = [\mathrm{rep}_α \mathrm{inv}_β M]_{s,s}. \quad (1.10) \]

Thus the order of combining the operators matters unless they are applied to disjoint index sets and, for partial inversion applied after partial replication, overlapping index sets are ignored: a contraction instead of the expansion property (1.7)(ii) of partial replication.

Partial replication applied after partial inversion and to an overlapping index set coincides, just like (1.8), with a matrix product involving two partially replicated matrices:

\[ \mathrm{rep}_{βγ} \mathrm{inv}_{αβ} M = (\mathrm{rep}_{αγ} M)(\mathrm{rep}_{γδ} M)^{-1}. \quad (1.11) \]

For the first component on the right-hand side, partial replication is with respect to the symmetric difference of c = β ∪ γ and a = α ∪ β. The complement of the argument of

partial inversion, b = V \ a = γ ∪ δ, defines the partial replication in the second component. Conversely, given the partially replicated matrices on the right-hand side, partial inversion of M on the left-hand side is with respect to V \ (γ ∪ δ) = a, and the result is partially replicated with respect to β ∪ γ, the symmetric difference of a and a △ c.

Proof. For the proof of equation (1.11), we obtain vectors from the basic equation (1.1), partitioned in the order (α, β, γ, δ), with, as before, a = α ∪ β and b = γ ∪ δ:

\[ (y_a; y_b) = \begin{pmatrix} y_α \\ y_β \\ y_γ \\ y_δ \end{pmatrix}, \qquad (x_a; y_b) = \begin{pmatrix} x_α \\ x_β \\ y_γ \\ y_δ \end{pmatrix}, \qquad (x_{βδ}; y_{αγ}) = \begin{pmatrix} y_α \\ x_β \\ y_γ \\ x_δ \end{pmatrix}. \]

From the definitions of partial replication (1.5) and partial inversion (1.2), we use for (1.11)

\[ \mathrm{rep}_b M\, y = (x_a; y_b), \qquad \mathrm{inv}_a M\,(x_a; y_b) = (y_a; x_b), \]

and let N = inv_a M. Partial replication of N with respect to c = β ∪ γ modifies the vector (x_a; y_b) into (x_{βδ}; y_{αγ}), so that

\[ (\mathrm{rep}_c N)\,\mathrm{rep}_b M\, y = (x_{βδ}; y_{αγ}) \quad \text{and} \quad \mathrm{rep}_{αγ} M\, y = (x_{βδ}; y_{αγ}), \]

thus completing the proof.

It is instructive to see also a direct matrix proof. For this, we denote by Q the matrix obtained by partial replication with respect to a △ c and by R the one obtained with respect to b:

\[ Q = \mathrm{rep}_{a△c} M = \begin{pmatrix} I_{αα} & 0_{αβ} & 0_{αγ} & 0_{αδ} \\ M_{βα} & M_{ββ} & M_{βγ} & M_{βδ} \\ 0_{γα} & 0_{γβ} & I_{γγ} & 0_{γδ} \\ M_{δα} & M_{δβ} & M_{δγ} & M_{δδ} \end{pmatrix}, \qquad R = \mathrm{rep}_b M = \begin{pmatrix} M_{αα} & M_{αβ} & M_{αγ} & M_{αδ} \\ M_{βα} & M_{ββ} & M_{βγ} & M_{βδ} \\ 0_{γα} & 0_{γβ} & I_{γγ} & 0_{γδ} \\ 0_{δα} & 0_{δβ} & 0_{δγ} & I_{δδ} \end{pmatrix}. \]

For equation (1.11) to hold, one needs to show that Q R^{-1} = rep_c N, where

\[ R^{-1} = \begin{pmatrix} M_{aa}^{-1} & -M_{aa}^{-1} M_{ab} \\ 0_{ba} & I_{bb} \end{pmatrix} = \begin{pmatrix} N_{aa} & N_{ab} \\ 0_{ba} & I_{bb} \end{pmatrix}. \]

The rows of components α and γ in Q R^{-1} coincide directly with those of rep_c N. The

rows of component β in rep_c N result since, for Q and R, the rows of β coincide, so that the product Q_{βV} R^{-1} gives zeros whenever the row index within β for Q_{βV} differs from the column index in R^{-1}, and equals one otherwise. Finally, for the rows of component δ we have for Q_{δV} R^{-1}, by the defining equation for partial inversion (1.2),

\[ \begin{pmatrix} M_{δa} M_{aa}^{-1} & \ M_{δγ} - M_{δa} M_{aa}^{-1} M_{aγ} & \ M_{δδ} - M_{δa} M_{aa}^{-1} M_{aδ} \end{pmatrix} = \begin{pmatrix} N_{δa} & N_{δb} \end{pmatrix}, \]

which completes the proof.

The following use of equation (1.11) represents a change of basis from (x_a; y_b) to (x_d; y_c):

\[ \mathrm{rep}_{a△c} N\,(x_a; y_b) = (x_d; y_c) = \mathrm{inv}_c M\,(x_c; y_d). \quad (1.12) \]

Proof. For equation (1.12), note that, by the definition of partial replication (1.5) with respect to a △ c = α ∪ γ, the components x_α and y_γ of (x_a; y_b) are replicated. By the definition of partial inversion of M with respect to a (1.2), one knows further that

\[ N_{βa} x_a + N_{βb} y_b = y_β, \qquad N_{δa} x_a + N_{δb} y_b = x_δ. \]

In the following sections, partial replication is applied to quite distinct statistical themes. In Section 2, we simplify the proofs of a number of results known for different aspects of linear models. In Sections 3 and 4, linear relations are obtained for sets of parameters and for sets of estimates in nonlinear models within the exponential family of distributions.

2. Decompositions of partial inversion

2.1. A partially inverted covariance matrix

Partial replication provides with (1.8) a decomposition of partial inversion. Let Σ denote the joint covariance matrix of the mean-centred random vector variables Y_a, Y_b. Then, for instance,

\[ \mathrm{inv}_b Σ = \begin{pmatrix} Σ_{aa|b} & Π_{a|b} \\ -Π_{a|b}^T & Σ_{bb}^{-1} \end{pmatrix} = (\mathrm{rep}_b Σ)(\mathrm{rep}_a Σ)^{-1} = \begin{pmatrix} Σ_{aa} & Σ_{ab} \\ 0_{ba} & I_{bb} \end{pmatrix} \begin{pmatrix} I_{aa} & 0_{ab} \\ Σ_{ba} & Σ_{bb} \end{pmatrix}^{-1}, \]

with

\[ Σ_{aa|b} = Σ_{aa} - Σ_{ab} Σ_{bb}^{-1} Σ_{ba}, \qquad Π_{a|b} = Σ_{ab} Σ_{bb}^{-1}, \]

a one-to-one transformation of marginal covariances and variances, namely the elements of Σ, into parameters in the concentration matrix Σ_{bb}^{-1} of Y_b and into linear least-squares

regression parameters of Y_a on Y_b, i.e. in the model defined by the equations

\[ Y_a - Π_{a|b} Y_b = ε_a, \qquad E(ε_a) = 0, \qquad \mathrm{cov}(ε_a, Y_b) = 0, \qquad \mathrm{cov}(ε_a) = Σ_{aa|b}, \]

where the interpretation of Π_{a|b} as a matrix of regression coefficients results by post-multiplication with Y_b^T and taking expectations.

Zero constraints on (Π_{a|b}, Σ_{aa|b}) have been studied especially in econometrics. The possible existence of multiple solutions of the estimating equations, derived by maximizing the Gaussian likelihood function in a Zellner model, also called seemingly unrelated regressions (Zellner, 1962), has more recently been demonstrated by Drton and Richardson (2004). Zero constraints on concentrations, such as in Σ_{bb}^{-1}, had been introduced as a tool for parsimonious estimation of covariances, called covariance selection (Dempster, 1972). The Gaussian likelihood function has a unique maximum for all possible sets of zero concentrations, though for some models iterative algorithms are necessary to find the solution. Zero constraints on covariances, such as in Σ_{aa|b}, had been studied by Anderson (1969, 1973).

Whether zero constraints on Σ^{-1} are preserved for all variable pairs under the transformation inv_V Σ^{-1} = Σ is one typical problem of independence equivalence, studied within the field of graphical Markov models. Similarly, one investigates in this context when zero constraints on Σ^{-1} are preserved in inv_a Σ^{-1} = inv_b Σ.

2.2. Three recursion relations among linear parameter matrices

Let a set of variables be partitioned according to V = (a, γ, δ), again with b = γ ∪ δ but now a = α, and let an estimate of inv_b Σ be given, so that implicitly Y_a is regressed on Y_b. Assume that inv_δ Σ has been estimated in another study. Then which systematic changes are to be expected between the two sets of parameters and the corresponding estimates?
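The two operators and the decomposition (1.8) just used for inv_b Σ lend themselves to a numerical sketch. The following is a minimal illustration, not from the paper; it assumes numpy, the function names `partial_inv` and `partial_rep` are chosen here, and the test matrix is an arbitrary positive definite covariance.

```python
import numpy as np

def partial_inv(M, a):
    # inv_a M of eq. (1.2), written back into the original row/column positions
    d = M.shape[0]
    b = [i for i in range(d) if i not in a]
    W = np.linalg.inv(M[np.ix_(a, a)])                  # M_aa^{-1}
    N = np.empty_like(M, dtype=float)
    N[np.ix_(a, a)] = W
    N[np.ix_(a, b)] = -W @ M[np.ix_(a, b)]
    N[np.ix_(b, a)] = M[np.ix_(b, a)] @ W
    # Schur complement M_{bb.a} = M_bb - M_ba M_aa^{-1} M_ab
    N[np.ix_(b, b)] = M[np.ix_(b, b)] - M[np.ix_(b, a)] @ W @ M[np.ix_(a, b)]
    return N

def partial_rep(M, a):
    # rep_a M of eq. (1.5): the rows of a are replaced by identity rows
    R = np.array(M, dtype=float)
    R[a, :] = 0.0
    R[a, a] = 1.0
    return R

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)       # a positive definite covariance matrix
a, b = [0, 1], [2, 3]

# properties (1.3)(i), (1.3)(ii) and the decomposition (1.8)
print(np.allclose(partial_inv(partial_inv(Sigma, a), a), Sigma))
print(np.allclose(np.linalg.inv(partial_inv(Sigma, a)), partial_inv(Sigma, b)))
print(np.allclose(partial_inv(Sigma, b),
                  partial_rep(Sigma, b) @ np.linalg.inv(partial_rep(Sigma, a))))

# blocks of inv_b Sigma: regression coefficients, conditional covariance, concentrations
N = partial_inv(Sigma, b)
Pi = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])   # Pi_{a|b}
print(np.allclose(N[np.ix_(a, b)], Pi))
print(np.allclose(N[np.ix_(a, a)], Sigma[np.ix_(a, a)] - Pi @ Sigma[np.ix_(b, a)]))
print(np.allclose(N[np.ix_(b, b)], np.linalg.inv(Sigma[np.ix_(b, b)])))
```

All six checks print True for any matrix with invertible principal submatrices; the diagonal shift in the example simply guarantees that condition.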
Written explicitly, the two sets of parameters are

\[ \mathrm{inv}_b Σ = \begin{pmatrix} Σ_{aa|b} & Π_{a|γ·δ} & Π_{a|δ·γ} \\ -Π_{a|γ·δ}^T & Σ_{γγ·a} & Σ_{γδ·a} \\ -Π_{a|δ·γ}^T & Σ_{δγ·a} & Σ_{δδ·a} \end{pmatrix}, \qquad \mathrm{inv}_δ Σ = \begin{pmatrix} Σ_{aa|δ} & Σ_{aγ|δ} & Π_{a|δ} \\ Σ_{γa|δ} & Σ_{γγ|δ} & Π_{γ|δ} \\ -Π_{a|δ}^T & -Π_{γ|δ}^T & Σ_{δδ·aγ} \end{pmatrix}, \]

where the covariance and concentration blocks are symmetric, while the blocks involving a matrix Π are symmetric except for the sign. The matrices Σ_{aa|b}, Π_{a|b} are as defined before, but the latter is split into two components corresponding to the explanatory variables Y_γ, Y_δ as Π_{a|b} = (Π_{a|γ·δ}, Π_{a|δ·γ}), and

Σ_{γγ·a} denotes the submatrix of components γ in the concentration matrix of Y_b, that is, after having marginalized the overall concentration matrix Σ^{-1} over Y_a. Since inv_δ Σ = inv_{γa} Σ^{-1}, one knows from (1.8) that inv_δ Σ = (rep_γ inv_b Σ)(rep_{aδ} inv_b Σ)^{-1}, or

\[ \mathrm{inv}_δ Σ = \begin{pmatrix} Σ_{aa|b} & Π_{a|γ·δ} & Π_{a|δ·γ} \\ 0_{γa} & I_{γγ} & 0_{γδ} \\ -Π_{a|δ·γ}^T & Σ_{δγ·a} & Σ_{δδ·a} \end{pmatrix} \begin{pmatrix} I_{aa} & 0_{aγ} & 0_{aδ} \\ Σ_{γa|δ} & Σ_{γγ|δ} & Π_{γ|δ} \\ 0_{δa} & 0_{δγ} & I_{δδ} \end{pmatrix} \quad (2.1) \]

by using the following implications of (1.3)(iii):

\[ Σ_{γγ|δ} = (Σ_{γγ·a})^{-1}, \qquad Π_{γ|δ} = -(Σ_{γγ·a})^{-1} Σ_{γδ·a}, \qquad Σ_{γγ|δ}\, Π_{a|γ·δ}^T = Σ_{γa|δ}. \]

By the matrix product (2.1), one obtains directly several known recursion relations: one for covariances (Anderson, 1958, Section 2.5) in position (a, a),

\[ Σ_{aa|δ} = Σ_{aa|γδ} + Π_{a|γ·δ}\, Σ_{γa|δ}, \]

one for concentrations (Dempster, 1969, Chapter 4) in position (δ, δ),

\[ Σ_{δδ·aγ} = Σ_{δδ·a} + Σ_{δγ·a}\, Π_{γ|δ}, \]

and one for linear least-squares regression coefficients (Cochran, 1938) in position (a, δ),

\[ Π_{a|δ} = Π_{a|δ·γ} + Π_{a|γ·δ}\, Π_{γ|δ}. \quad (2.2) \]

Each of the three equations relates a marginal to a conditional parameter matrix. There will be no change in each of the three types of parameter matrices if and only if at least one of the terms in the added matrix product is zero. For the regression coefficient matrices in (2.2), this can be achieved by study design. If, for instance, Y_δ represents treatments, Y_γ factors influencing treatments, and there is randomized allocation of individuals to treatments, then the resulting independence of Y_δ and Y_γ implies Π_{γ|δ} = 0. The same result is achieved by an intervention in which all levels y_γ of variable Y_γ can be fixed. By contrast, an only notional intervention is assumed in the literature on causal models (Rubin, 1974; Pearl, 2000; Lauritzen, 2000) to describe the special situation in which overall effects (Π_{a|δ} y_δ) coincide with conditional effects (Π_{a|δ·γ} y_δ).

For a more detailed discussion of the assumptions used in the literature on causal models,

especially of the assumed partial order, here V = (a, γ, δ), see Lindley (2002) and Cox and Wermuth (2004); for extensions of the recursion relation for regression coefficients to nonlinear relations, see Cox and Wermuth (2003), Ma, Xie and Geng (2006), and Cox (2007); and for conditions under which other types of association measures remain unchanged, see Xie, Ma and Geng (2007). Recursion relations are important for comparisons of the results of any two empirical studies in which similar research questions are investigated on a core set of common variables but which differ with respect to some variables that are explicitly included in one study but not in the other.

2.3. Reduced form equations and omitted variables

Suppose that, on subject-matter grounds, a full ordering can be given to mean-centred random variables Y_i, for i = 1, ..., d, and that linear equations are generated which have uncorrelated residuals. Then this system can be described with the random vector Y, in which element i corresponds to Y_i, and

\[ A Y = ε, \qquad \mathrm{cov}(ε) = Δ \ \text{diagonal}, \quad (2.3) \]

where A is a unit upper-triangular matrix for which the negative of the element in position (i, j) is a least-squares regression coefficient, -A_{ij} = β_{i|j·r(i)\j}, with r(i) = (i + 1, ..., d). Let there be some zero constraints on the upper off-diagonal elements of A, and let the effects of the generating process on the joint dependence of each component of Y_a on Y_b be of interest, where a = (1, ..., d_a) and b = (d_a + 1, ..., d); we then call V = (a, b) an order-compatible split of V = (1, ..., d). From

\[ A = \begin{pmatrix} A_{aa} & A_{ab} \\ 0_{ba} & A_{bb} \end{pmatrix}, \qquad \mathrm{inv}_a A = \begin{pmatrix} A_{aa}^{-1} & -A_{aa}^{-1} A_{ab} \\ 0_{ba} & A_{bb} \end{pmatrix}, \]

and the mapping

\[ \mathrm{inv}_a A \begin{pmatrix} ε_a \\ Y_b \end{pmatrix} = \begin{pmatrix} Y_a \\ ε_b \end{pmatrix}, \]

the resulting equation in Y_a is

\[ Y_a = -A_{aa}^{-1} A_{ab} Y_b + η_a, \qquad η_a = A_{aa}^{-1} ε_a. \quad (2.4) \]
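Cochran's recursion (2.2) above can be checked numerically from an arbitrary positive definite Σ. This is a hedged sketch, not from the paper; numpy and the helper name `coef` are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)      # arbitrary positive definite covariance
a, g, d = [0], [1], [2, 3]           # index sets a, gamma, delta; b = gamma + delta

def coef(S, resp, regr):
    # least-squares coefficient matrix Pi_{resp|regr} = S_{resp,regr} S_{regr,regr}^{-1}
    return S[np.ix_(resp, regr)] @ np.linalg.inv(S[np.ix_(regr, regr)])

Pi_a_b = coef(Sigma, a, g + d)       # joint coefficients of Y_a on (Y_gamma, Y_delta)
Pi_a_g_d = Pi_a_b[:, :len(g)]        # Pi_{a|gamma.delta}
Pi_a_d_g = Pi_a_b[:, len(g):]        # Pi_{a|delta.gamma}
Pi_a_d = coef(Sigma, a, d)           # marginal coefficients of Y_a on Y_delta alone
Pi_g_d = coef(Sigma, g, d)

# Cochran's recursion (2.2): Pi_{a|d} = Pi_{a|d.g} + Pi_{a|g.d} Pi_{g|d}
print(np.allclose(Pi_a_d, Pi_a_d_g + Pi_a_g_d @ Pi_g_d))
```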

In the context of structural equations, this type of equation has been called a reduced form equation. For it, the starting equation (2.3) implies cov(η_a, Y_b) = 0, so that each component Y_i of Y_a is related by a linear least-squares regression to Y_b, and

\[ Π_{a|b} = -A_{aa}^{-1} A_{ab}, \qquad Σ_{aa|b} = A_{aa}^{-1} Δ_{aa} A_{aa}^{-T}. \]

The equations imply that, typically, some zero constraints of the generating equations (2.3) will carry over directly to the reduced form equations (2.4), while others take on a more complex form than zero constraints.

For more general types of structural equations than those of a triangular system (2.3), questions of identifiability of parameters arise, and early results (Fisher, 1966) provided answers for models in which the residual covariance matrix in the reduced form equation is unconstrained, see also Eusebi (2007), or in which it is still a diagonal matrix. Conditions under which such assumptions are satisfied are provided by more recent results on Markov equivalence of models.

Let now a = (α, β) be a further order-compatible split of a, and let Y_β be unavailable in a new study. Which effects are to be expected for the dependence of Y_α on Y_b? By equation (1.8), inv_β A = (rep_β A)(rep_{αb} A)^{-1}, which gives

\[ \begin{pmatrix} A_{αα} & A_{αβ} & A_{αb} \\ 0_{βα} & I_{ββ} & 0_{βb} \\ 0_{bα} & 0_{bβ} & A_{bb} \end{pmatrix} \begin{pmatrix} I_{αα} & 0_{αβ} & 0_{αb} \\ 0_{βα} & A_{ββ}^{-1} & -A_{ββ}^{-1} A_{βb} \\ 0_{bα} & 0_{bβ} & I_{bb} \end{pmatrix} = \begin{pmatrix} A_{αα} & A_{αβ} A_{ββ}^{-1} & A_{αb·β} \\ 0_{βα} & A_{ββ}^{-1} & -A_{ββ}^{-1} A_{βb} \\ 0_{bα} & 0_{bβ} & A_{bb} \end{pmatrix}, \]

with A_{αb·β} = A_{αb} - A_{αβ} A_{ββ}^{-1} A_{βb}. Thus the equation for Y_α in (2.3) is changed via (1.2) to

\[ A_{αα} Y_α + A_{αb·β} Y_b = ξ_α, \qquad ξ_α = ε_α - A_{αβ} A_{ββ}^{-1} ε_β, \qquad \mathrm{cov}(ξ_α) = K, \]

with the residuals for some component Y_i of Y_α typically correlated with the regressors for Y_i, so that the residual covariance matrix, K, is no longer diagonal. Then the equation parameters A_{αα}, A_{αb·β} need not define least-squares regression coefficients in the variables Y_α, Y_b, so that, in particular, a zero equation parameter for the pair Y_i, Y_j need not correspond to Y_i and Y_j being uncorrelated given any subset of the remaining variables.
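The reduced form parameters derived above can be sketched numerically. The following assumes numpy and constructs an arbitrary triangular system with diagonal residual covariance; names and seeds are choices made here, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
# unit upper-triangular A as in (2.3), with diagonal residual covariance Delta
A = np.eye(d) + np.triu(0.5 * rng.standard_normal((d, d)), k=1)
Delta = np.diag(rng.uniform(0.5, 2.0, d))
Sigma = np.linalg.inv(A) @ Delta @ np.linalg.inv(A).T      # implied cov(Y)

a, b = [0, 1], [2, 3]
Aaa_inv = np.linalg.inv(A[np.ix_(a, a)])

# reduced form (2.4): Pi_{a|b} = -A_aa^{-1} A_ab, Sigma_{aa|b} = A_aa^{-1} Delta_aa A_aa^{-T}
Pi_a_b = -Aaa_inv @ A[np.ix_(a, b)]
Saa_b = Aaa_inv @ Delta[np.ix_(a, a)] @ Aaa_inv.T

# compare with the regression parameters computed directly from Sigma
Pi_direct = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])
Saa_b_direct = Sigma[np.ix_(a, a)] - Pi_direct @ Sigma[np.ix_(b, a)]
print(np.allclose(Pi_a_b, Pi_direct), np.allclose(Saa_b, Saa_b_direct))
```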
In econometrics, one speaks then of an endogenous response variable Y_i in a structural equation and of the endogeneity problem. The change in the dependence of Y_α on Y_b after ignoring the intermediate variable Y_β may be large whenever A_{αb} deviates much from A_{αb·β}. There is no change if, in the starting system (2.3), either A_{αβ} = 0, i.e. the final response Y_α is unaffected by the intermediate

variable, or A_{βb} = 0, i.e. the intermediate variable does not depend on Y_b. By contrast, the reduced form equations after omitting Y_β,

\[ Y_α = -A_{αα}^{-1} A_{αb·β} Y_b + η_α, \]

are the subset of the equations (2.4) corresponding to Y_α, with Π_{α|b} = -A_{αα}^{-1} A_{αb·β}. Thus the structural equations after omitting Y_β also make explicit the changes in the dependence of Y_α on Y_b when moving from the least-squares coefficients in (2.3) to the least-squares coefficients in the reduced form equations (2.4).

It may also be verified, for the given order V = (α, β, b), that marginalizing over Y_α has no effect on the remaining parameters in (2.3), but that marginalizing over the variables Y_b leads, in general, to a non-diagonal residual covariance matrix, K say. As a consequence of a non-diagonal K, not only may the interpretation of equation parameters as least-squares regression coefficients in the observed variables be lost, but an identification problem may also result. However, a large identified subclass has been characterized recently, see Brito and Pearl (2002), Wermuth and Cox (2007), Drton, Eichler and Richardson (2007). It has K_{ij} ≠ 0 only if A_{ij} = 0. Expressed differently, the two sets of variable pairs with constraints in A and K may overlap, but the two unconstrained sets of variable pairs within A and within K are disjoint. This class of models helps to detect a type of confounding, called indirect, that had so far essentially been overlooked, see Wermuth and Cox (2007). Indirect confounding poses a problem even for intervention studies that use randomized allocation of treatments to individuals, i.e. for what is regarded as the gold standard of designed studies.

2.4. Changing to a different split of three types of linear parameters

Let V = (α, β, γ, δ), again with a = α ∪ β, b = γ ∪ δ, let another split of V be given by c = β ∪ γ and d = α ∪ δ, and suppose that inv_b Σ has been studied and estimated in one study, but inv_d Σ in another.
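The omitted-variable formulas of Section 2.3 can also be checked numerically: the equation parameters after omitting Y_β reproduce the regression of Y_α on Y_b computed from the implied covariance matrix. A minimal sketch, assuming numpy and an arbitrary triangular system:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = np.eye(d) + np.triu(0.5 * rng.standard_normal((d, d)), k=1)  # unit upper triangular
Delta = np.diag(rng.uniform(0.5, 2.0, d))
Sigma = np.linalg.inv(A) @ Delta @ np.linalg.inv(A).T            # implied cov(Y)

al, be, b = [0], [1], [2, 3]          # alpha, beta, b; Y_beta is the omitted variable

# A_{alpha b.beta} = A_ab - A_{alpha beta} A_{beta beta}^{-1} A_{beta b}
A_al_b_be = (A[np.ix_(al, b)]
             - A[np.ix_(al, be)] @ np.linalg.inv(A[np.ix_(be, be)]) @ A[np.ix_(be, b)])
Pi_al_b = -np.linalg.inv(A[np.ix_(al, al)]) @ A_al_b_be

# regression of Y_alpha on Y_b computed directly from Sigma
Pi_direct = Sigma[np.ix_(al, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])
print(np.allclose(Pi_al_b, Pi_direct))
```

Setting the entry `A[0, 1]` to zero before computing `Sigma` illustrates the "no change" case A_{αβ} = 0, since then A_{αb·β} reduces to A_{αb}.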
Then the change is between the following two sets of parameters:

\[ \mathrm{inv}_b Σ = \begin{pmatrix} Σ_{αα|b} & Σ_{αβ|b} & Π_{α|γ·δ} & Π_{α|δ·γ} \\ Σ_{βα|b} & Σ_{ββ|b} & Π_{β|γ·δ} & Π_{β|δ·γ} \\ -Π_{α|γ·δ}^T & -Π_{β|γ·δ}^T & Σ_{γγ·a} & Σ_{γδ·a} \\ -Π_{α|δ·γ}^T & -Π_{β|δ·γ}^T & Σ_{δγ·a} & Σ_{δδ·a} \end{pmatrix}, \]

\[ \mathrm{inv}_d Σ = \begin{pmatrix} Σ_{αα·c} & -Π_{β|α·δ}^T & -Π_{γ|α·δ}^T & Σ_{αδ·c} \\ Π_{β|α·δ} & Σ_{ββ|d} & Σ_{βγ|d} & Π_{β|δ·α} \\ Π_{γ|α·δ} & Σ_{γβ|d} & Σ_{γγ|d} & Π_{γ|δ·α} \\ Σ_{δα·c} & -Π_{β|δ·α}^T & -Π_{γ|δ·α}^T & Σ_{δδ·c} \end{pmatrix}, \]

where, as before, the covariance and concentration blocks are symmetric and the blocks involving a matrix Π are symmetric except for the sign. If we take E(Y) = 0 and let

\[ Z = Σ^{-1} Y \quad (2.5) \]

then the implicit corresponding change in linear systems is between

\[ \mathrm{inv}_a Σ^{-1} \begin{pmatrix} Z_α \\ Z_β \\ Y_γ \\ Y_δ \end{pmatrix} = \begin{pmatrix} Y_α \\ Y_β \\ Z_γ \\ Z_δ \end{pmatrix} \quad \text{and} \quad \mathrm{inv}_c Σ^{-1} \begin{pmatrix} Y_α \\ Z_β \\ Z_γ \\ Y_δ \end{pmatrix} = \begin{pmatrix} Z_α \\ Y_β \\ Y_γ \\ Z_δ \end{pmatrix}. \quad (2.6) \]

Proof. For equation (2.6), we start with Σ^{-1} Y = Z in mean-centred Y, let σ^{ij} denote the overall concentration of Y_i and Y_j, and let Y_{a|b} = Y_a - Π_{a|b} Y_b denote the variable Y_a corrected for its linear least-squares prediction given Y_b. Then E(Z Z^T) = Σ^{-1} and E(Y Z^T) = I, so that Y_i corrected for least-squares regression on all remaining variables is

\[ Y_{i|j≠i} = Y_i + \sum_{j≠i} (σ^{ij}/σ^{ii}) Y_j = Z_i/σ^{ii}. \]

In matrix notation, with D a diagonal matrix having elements σ^{ii}, so that D^{-1} has elements 1/σ^{ii} = σ_{ii|j≠i}, and with U = D^{-1} Z, the equation Σ^{-1} Y = Z is modified first into D^{-1} Σ^{-1} Y = U and, after partial inversion of D^{-1} Σ^{-1} with respect to a, into the left equation in (2.6), i.e. into

\[ \mathrm{inv}_a Σ^{-1} (Z_a; Y_b) = (Y_a; Z_b). \quad (2.7) \]

For the final step of the proof of (2.6), we use the following result: for any square matrix M, defined as before, and for any positive definite diagonal matrix D,

\[ \mathrm{inv}_a(D M) = (\mathrm{rep}_a D)(\mathrm{inv}_a M)(\mathrm{rep}_b D)^{-1}. \]

Equation (2.7) can be rewritten as

\[ Y_{a|b} = Σ_{aa|b} Z_a, \qquad Y_b = Σ_{ba} Z_a + Σ_{bb} Z_b, \quad (2.8) \]

so that, for example, the linear independence of Y_{a|b} and Y_b in the given form is verified directly with

\[ \mathrm{cov}(Σ_{ba} Z_a + Σ_{bb} Z_b, Z_a) = Σ_{ba} Σ^{aa} + Σ_{bb} Σ^{ba} = 0. \]
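The identities in (2.8) hold pointwise for any realization of Y, which makes them easy to verify numerically. A minimal sketch, assuming numpy and an arbitrary positive definite Σ:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)
K = np.linalg.inv(Sigma)              # concentration matrix Sigma^{-1}
a, b = [0, 1], [2, 3]

Pi_a_b = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])
Saa_b = Sigma[np.ix_(a, a)] - Pi_a_b @ Sigma[np.ix_(b, a)]

Y = rng.standard_normal(4)            # any realization of the random vector
Z = K @ Y                             # Z = Sigma^{-1} Y, eq. (2.5)

# (2.8): Y_{a|b} = Sigma_{aa|b} Z_a and Y_b = Sigma_ba Z_a + Sigma_bb Z_b
print(np.allclose(Y[a] - Pi_a_b @ Y[b], Saa_b @ Z[a]))
print(np.allclose(Y[b], Sigma[np.ix_(b, a)] @ Z[a] + Sigma[np.ix_(b, b)] @ Z[b]))

# orthogonality used in the proof: Sigma_ba Sigma^{aa} + Sigma_bb Sigma^{ba} = 0
print(np.allclose(Sigma[np.ix_(b, a)] @ K[np.ix_(a, a)]
                  + Sigma[np.ix_(b, b)] @ K[np.ix_(b, a)], 0))
```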

After writing (2.8) as

\[ Y_a - Π_{a|b} Y_b = ε_a, \qquad \mathrm{cov}(ε_a) = Σ_{aa|b}, \qquad Σ_{bb·a} Y_b = Z_b + Π_{a|b}^T Z_a, \quad (2.9) \]

the change in parameter sets for the equations (2.6) is seen to be from (Π_{a|b}, Σ_{aa|b}, Σ_{bb·a}) to (Π_{c|d}, Σ_{cc|d}, Σ_{dd·c}), and the change in the equations (2.6) is defined by the following three reversible transformations:

\[ \mathrm{inv}_c Σ^{-1} = \mathrm{inv}_{a△c}\, \mathrm{inv}_a Σ^{-1}, \qquad (Z_c; Y_d) = (\mathrm{rep}_{a△d}\, \mathrm{inv}_a Σ^{-1})(Z_a; Y_b), \qquad (Z_d; Y_c) = (\mathrm{rep}_{a△c}\, \mathrm{inv}_a Σ^{-1})(Z_a; Y_b). \quad (2.10) \]

Proof. The transformations in (2.10) result from equations (1.8) and (1.12), and they lead, again with (1.8), to inv_c Σ^{-1} (Z_c; Y_d) defined in terms of (rep_{a△c} inv_a Σ^{-1})(Z_a; Y_b).

2.5. Changing from concentrations to parameters of linear chain graphs

Parameter sets in linear multivariate regression chains are obtained when, starting with equations (2.5), (2.9), the equations on the right of (2.9), involving the marginalized concentration matrix, are repeatedly modified in the same way as Σ^{-1} Y = Z is transformed into equations (2.9). Multivariate regression chains are used for the study of panel data and are simplified using zero constraints, see Cox and Wermuth (1993, 1996).

By using the same type of repeated transformations as for multivariate regression chains, but selecting different sets of parameters, distinct types of linear chain graph models are defined, see Wermuth, Wiedenbeck and Cox (2006). For instance, models result that have been called blocked concentration chains or LWF chains (for Lauritzen and Wermuth (1989) and Frydenberg (1990)), and those called mixed concentration-regression chains or AMP chains (for Andersson, Madigan and Perlman (2001)). The interpretation of zero constraints depends on the selected type of parameter, so that the different sets of zero constraints permit the formulation of various types of research hypotheses that may be of interest in different subject-matter contexts.
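The first transformation in (2.10), a consequence of property (1.4)(ii), can be sketched numerically; `partial_inv` is a name chosen here for the operator of eq. (1.2), and the test matrix is an arbitrary concentration matrix.

```python
import numpy as np

def partial_inv(M, s):
    # inv_s M of eq. (1.2), written back into the original row/column positions
    d = M.shape[0]
    r = [i for i in range(d) if i not in s]
    W = np.linalg.inv(M[np.ix_(s, s)])
    N = np.empty_like(M, dtype=float)
    N[np.ix_(s, s)] = W
    N[np.ix_(s, r)] = -W @ M[np.ix_(s, r)]
    N[np.ix_(r, s)] = M[np.ix_(r, s)] @ W
    N[np.ix_(r, r)] = M[np.ix_(r, r)] - M[np.ix_(r, s)] @ W @ M[np.ix_(s, r)]
    return N

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
K = np.linalg.inv(B @ B.T + 4 * np.eye(4))    # a concentration matrix Sigma^{-1}

a, c = [0, 1], [1, 2]                         # a = alpha + beta, c = beta + gamma
a_sym_c = [0, 2]                              # symmetric difference of a and c

# first transformation in (2.10): inv_c K = inv_{a sym-diff c} inv_a K
print(np.allclose(partial_inv(K, c), partial_inv(partial_inv(K, a), a_sym_c)))
```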
Some direct extensions are to constraints in nonlinear chain graph models within the context of the exponential family models discussed in the next sections. Typically, any conditional independence hypothesis then involves several parameters that are constrained to be zero.

3. Estimation in reduced exponential families

For a full exponential family, we take the log-likelihood, after disregarding terms that do not depend on the unknown parameter, in the form s^T φ - K(φ), where φ is the canonical parameter and s the sufficient statistic, a realization of the random variable S. The cumulant generating function of S under the full exponential family is K(φ + t) - K(φ). The mean or moment parameter, η = ∇K(φ), is the gradient of K(φ) with respect to φ, i.e. a vector of first derivatives. The maximum-likelihood estimate of η is η̂ = s.

The gradient of η with respect to φ gives the covariance matrix of S but also the concentration matrix of the maximum-likelihood estimate φ̂ of the canonical parameter, that is,

\[ \mathrm{cov}(S) = ∇∇^T K(φ) = \mathrm{con}(φ̂), \]

where φ̂ denotes for simplicity both a maximum-likelihood estimate of φ and the corresponding random variable. The notation I = ∇∇^T K(φ) reminds one that R. A. Fisher interpreted minus the second derivative of the log-likelihood, which here equals I, as the information about the canonical parameter contained in a single observation.

Given η̂ and Ī, the observed information matrix, i.e. I evaluated at η̂ = s, studentized statistics for testing that an individual component η_i of η is zero are obtained for a large sample size n as η̂_i/√(Ī_{ii}). Similarly, since the random variable Ī^{-1}(η̂ - η) has mean zero and covariance matrix Ī^{-1}, and hence the same mean and covariance as (φ̂ - φ), a studentized statistic for testing that an individual component φ_i of φ is zero may, for large n, be computed as φ̂_i/√((Ī^{-1})_{ii}).

At a maximum of the likelihood function under a full exponential model, often also called the saturated or the largest covering model, it holds that

\[ Ī^{-1} η̂ = z, \qquad z = Ī^{-1} s, \quad (3.1) \]

which is of the form of (2.5), so that the results of the previous section apply to it, especially as used below for (3.4).

We now consider special reduced exponential models, those given by constraints η_c = 0

for some subset of the elements of η, writing η = (η_u, η_c). By differentiating the Lagrangian

\[ s^T φ - K(φ) - λ^T η_c \]

with respect to φ, the maximum-likelihood estimating equations are obtained as

\[ η̂_u = s_u - Î_{uc} Î_{cc}^{-1} s_c, \qquad η̂_c = 0, \quad (3.2) \]

where the matrix I is evaluated at the maximum-likelihood estimate in the reduced model. Since the solution of the maximum-likelihood equations in general requires iterative algorithms and may not be unique, an efficient closed-form approximation is useful, see Cox and Wermuth (1990), Wermuth, Cox and Marchetti (2006); it has been called the reduced model estimate of η_u. Equation (3.2) is thereby modified into

\[ η̃_u = s_u - Ĩ_{uc} Ĩ_{cc}^{-1} s_c, \qquad η̃_c = 0, \quad (3.3) \]

where the (u, c) and (c, c) components of Î in (3.2) have been replaced by the corresponding components of the asymptotic covariance matrix Ĩ of S, which do not involve unknown parameters. Equations (3.2) and (3.3) result also, by using (2.9) and inv_u I^{-1} = inv_c I in (3.1), when the (u, c) and (c, c) components of the matrix I are replaced by Î and by Ĩ, respectively, and the (u, u) component is evaluated at η_c = 0.

Since the use of reduced model estimates is recommended only in situations in which the constraints agree well with the observed data, choosing the full observed matrix Ī of φ̂ under the saturated model and then partially inverting it on c should not give estimates which differ much from those in equations (3.2) and (3.3). By equations (2.9), (3.1), we then have

\[ η̄_u = s_u - Ī_{uc} Ī_{cc}^{-1} s_c, \qquad Ī_{cc·u}\, η̄_c = z_c + (Ī_{uc} Ī_{cc}^{-1})^T z_u, \quad (3.4) \]

with E(η̄_u) = η_u, cov(η̄_u) = Ī_{uu|c} and cov(η̄_c) = Ī_{cc}. In the case of a poor fit to the hypothesis η_c = 0, i.e. with some components of η̄_c deviating much from zero, equations (2.10) permit a direct change to the fit under an alternative hypothesis.
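The correspondence between (3.4) and (2.8) can be sketched numerically: the closed-form estimate η̄_u equals Ī_{uu|c} z_u, the analogue of Y_{a|b} = Σ_{aa|b} Z_a. In the following, Ī and s are simulated stand-ins (an arbitrary positive definite matrix and an arbitrary vector), not quantities from any particular model; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
Ibar = B @ B.T + 4 * np.eye(4)        # stand-in for the observed information matrix
s = rng.standard_normal(4)            # stand-in for the sufficient statistic
z = np.linalg.inv(Ibar) @ s           # z = Ibar^{-1} s, as in (3.1)
u, c = [0, 1], [2, 3]

# closed-form estimate in (3.4): eta_u = s_u - I_uc I_cc^{-1} s_c
eta_u = s[u] - Ibar[np.ix_(u, c)] @ np.linalg.inv(Ibar[np.ix_(c, c)]) @ s[c]

# the same estimate via the analogue of (2.8): eta_u = I_{uu|c} z_u
Iuu_c = (Ibar[np.ix_(u, u)]
         - Ibar[np.ix_(u, c)] @ np.linalg.inv(Ibar[np.ix_(c, c)]) @ Ibar[np.ix_(c, u)])
print(np.allclose(eta_u, Iuu_c @ z[u]))
```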
Since Ī_{uc} Ī_{cc}^{-1} can be viewed as the observed coefficient matrix of S_c in the linear least-squares regression of S_u on S_c, standard results apply for testing that I_{uc} I_{cc}^{-1} = 0, given an estimate of the appropriate covariance matrix. Expansions into relevant interaction parameters lead to studentized statistics for interaction terms and provide insight into where a possibly poor fit is located. See Cox and Wermuth (1990) for examples and Lauritzen and Wermuth

(1989) for a discussion of interaction parameters in the case of Conditional-Gaussian (CG) distributions and of CG-regressions. The latter contain, for instance, logistic regression as a special case, for which, in general, iterative fitting algorithms are needed to give η̂ even under the saturated model, see Edwards and Lauritzen (2001).

The closed-form estimates of η_u in (3.4) provide a new justification for the reduced model estimates: they result by partially inverting equations (3.1) with respect to the subset given by the set of unconstrained parameters. These estimates can also be viewed as the generalized least-squares estimates of Aitken (1935), which turn, for instance for the multinomial distribution, into the estimates of Grizzle, Starmer and Koch (1969), see Cox and Snell (1981, Appendix 1), Cox and Wermuth (1990, Section 7).

Under some general regularity conditions, equations (3.1), and hence the estimates of η_u in (3.4), arise from asymptotic theory in more general settings than the exponential family, see Cox (2006, Chapter 6). They are then called sandwich estimates, which were introduced in a special context by Huber (1964), or, in the context of generalized linear models, approximate residual maximum-likelihood (REML) estimates, which were derived by Patterson and Thompson (1971).

By similar arguments as above, the maximum-likelihood equations with zero constraints on canonical parameters are obtained in a form comparable to equations (3.2). From a theoretical viewpoint, these may be more attractive than zero constraints on moment parameters. Firstly, if the constrained canonical parameters exist, then the maximum-likelihood equations have a unique solution. Secondly, the sets of minimal sufficient statistics are of reduced size.
Thirdly, estimates are available in closed form for Gaussian and for multinomial distributions provided the model is decomposable, that is, the associated independence graph can be arranged in a sequence of possibly overlapping but complete prime graphs, see Cox and Wermuth (1999). For non-decomposable models, it is in addition often simple to find a not much larger, decomposable covering model.

4. Switching partially between canonical and moment parameters

So far, we have considered only linear mappings, even though these were relevant both for nonlinear model formulations and for correlated data. We shall now illustrate how changes between moment and canonical parameters in exponential families may be obtained in terms of partial replication.

A general formulation has been given by Cox (2006, Section 6.4), including a short proof of the orthogonality, i.e. uncorrelatedness, of canonical and moment parameters and of the asymptotic independence of the corresponding maximum-likelihood estimates, see also Barndorff-Nielsen (1978, Section 9.8). These general results have often been proven

for specific members of exponential families, involving the then necessary lengthy, detailed arguments. In the notation of the present paper, for the moment parameter η = K(φ) and the canonical parameter φ with maximum-likelihood estimate denoted by φ̂, let s = (s_a, s_b) represent a split of the sufficient statistic into two column vectors, and let φ_b be replaced by η_b, the corresponding component of η, to give the mixed parameter vector ψ = (φ_a, η_b). Then with

I = ∂η^T/∂φ,    rep_a I = ∂ψ^T/∂φ = ( I_aa  0_ab ; I_ba  I_bb ),

one obtains, given φ̂, a maximum-likelihood estimate of ψ under the saturated model as

ψ̂ = (rep_a Ī) φ̂,    côv(ψ̂) = (rep_a Ī) Ī^{-1} (rep_a Ī)^T = ( Ī^{-1}_{aa.b}  0_ab ; 0_ba  Ī_bb ).    (4.1)

The two random variables corresponding to φ̂_a and η̂_b are uncorrelated and hence, given their asymptotic joint Gaussian distribution, also asymptotically independent. One simple example is a joint Gaussian distribution, with η̂_b the observed mean of Y_b and φ̂_a the observed overall concentration matrix of Y_a. Another is for two dichotomous variables, with η̂_b the difference in observed frequencies for Y_b and φ̂_a the observed log-odds ratio of Y_a. Two complementary mappings may, respectively, be represented by

(φ_a; η_b) = (rep_a I) φ,    (η_a; φ_b) = (rep_b I) φ,

so that for such mappings, with the specific choice M = I as in equations (1.2) and (1.8),

(φ_a; η_b) = inv_a I (η_a; φ_b),    inv_a I = (rep_a I)(rep_b I)^{-1}.    (4.2)

One important consequence of (4.2) is that, given (4.1), the change from a split (a, b) to another split (c, d), with canonical parameters for c and moment parameters for d, is possible by using (2.10) after just replacing Σ by I. Another application is to constrained chain graph models of different types when the conditional distributions are members of the exponential family. For the computation of the observed information matrix in the case of both discrete and continuous variables, see Dempster (1973) and Cox and Wermuth (1990).
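The identities above can be checked numerically. The following sketch (ours, not from the paper) uses a 2×2 observed information matrix with singleton sets a and b, implements partial replication in the convention used above (the a-rows of the matrix replaced by the corresponding identity rows), and verifies both the block-diagonal covariance in (4.1) and the composition rule for inv_a I in (4.2); the matrix Ī and the estimate φ̂ are invented for illustration.

```python
# Numerical sketch of (4.1) and (4.2) for d = 2 with a = {0}, b = {1}.
# I_bar and phi_hat are invented; rep replaces the a-rows of a matrix
# by the corresponding identity rows and keeps the b-rows.

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inv2(A):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def rep(M, rows_a):
    # partial replication: identity rows for indices in rows_a, M-rows otherwise
    d = len(M)
    return [[1.0 if i == j else 0.0 for j in range(d)] if i in rows_a
            else list(M[i]) for i in range(d)]

I_bar = [[2.0, 0.5], [0.5, 1.0]]    # observed information, symmetric
phi_hat = [[0.3], [-0.7]]           # canonical parameter estimate
eta_hat = matmul(I_bar, phi_hat)    # corresponding moment parameters

rep_a, rep_b = rep(I_bar, {0}), rep(I_bar, {1})

# mixed estimate psi_hat = (phi_hat_a; eta_hat_b)
psi_hat = matmul(rep_a, phi_hat)
assert psi_hat[0][0] == phi_hat[0][0]
assert abs(psi_hat[1][0] - eta_hat[1][0]) < 1e-12

# covariance in (4.1): block-diagonal with inv(I_aa.b) and I_bb
cov = matmul(matmul(rep_a, inv2(I_bar)), transpose(rep_a))
assert abs(cov[0][1]) < 1e-12 and abs(cov[1][0]) < 1e-12
assert abs(cov[1][1] - I_bar[1][1]) < 1e-12
I_aa_b = I_bar[0][0] - I_bar[0][1] * I_bar[1][0] / I_bar[1][1]
assert abs(cov[0][0] - 1.0 / I_aa_b) < 1e-12

# equation (4.2): inv_a I = (rep_a I)(rep_b I)^{-1} maps (eta_a; phi_b)
# to (phi_a; eta_b)
inv_a = matmul(rep_a, inv2(rep_b))
out = matmul(inv_a, [[eta_hat[0][0]], [phi_hat[1][0]]])
assert abs(out[0][0] - phi_hat[0][0]) < 1e-12
assert abs(out[1][0] - eta_hat[1][0]) < 1e-12
print("all identities verified")
```

With larger index sets the same checks go through blockwise; only the 2×2 inverse helper would need to be generalized.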
Whenever the population matrix I is replaced by its observed counterpart Ī, the linear relation in equation (4.2) between sets of parameters in possibly nonlinear models turns into a linear relation between the corresponding maximum-likelihood estimates.

Acknowledgement

We thank the Swedish Research Society for supporting our cooperation via the Gothenburg Stochastic Center, and D. R. Cox and G. M. Marchetti for their most helpful comments.

References

Aitken, A. C. (1935). On least squares and linear combination of observations. Proc. Roy. Soc. Edin. A 55.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, T. W. (1969). Statistical inference for covariance matrices with linear structure. In: Multivariate Analysis II (Edited by P. R. Krishnaiah), 55-66, Academic Press, New York.
Anderson, T. W. (1973). Asymptotically efficient estimation of covariance matrices with linear structure. Ann. Statist. 1.
Andersson, S., Madigan, D. and Perlman, M. D. (2001). Alternative Markov properties for chain graphs. Scand. J. Statist. 28.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley, Chichester.
Brito, C. and Pearl, J. (2002). A new identification condition for recursive models with correlated errors. Struct. Equation Mod. 9.
Cochran, W. G. (1938). The omission or addition of an independent variate in multiple linear regression. Suppl. J. Roy. Statist. Soc. 5.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press, Cambridge.
Cox, D. R. (2007). On a generalization of a result of W. G. Cochran. Biometrika 94. To appear.
Cox, D. R. and Wermuth, N. (1990). An approximation to maximum-likelihood estimates in reduced models. Biometrika 77.
Cox, D. R. and Wermuth, N. (1993). Linear dependencies represented by chain graphs (with discussion). Statist. Science 8.
Cox, D. R. and Wermuth, N. (1996). Multivariate Dependencies: Models, Analysis and Interpretation. Chapman and Hall, London.

Cox, D. R. and Wermuth, N. (1999). Likelihood factorizations for mixed discrete and continuous variables. Scand. J. Statist. 26.
Cox, D. R. and Wermuth, N. (2003). A general condition for avoiding effect reversal after marginalization. J. Roy. Statist. Soc. B 65.
Cox, D. R. and Wermuth, N. (2004). Causality: a statistical view. Int. Statist. Rev. 72.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading.
Dempster, A. P. (1972). Covariance selection. Biometrics 28.
Dempster, A. P. (1973). Aspects of the multinomial logit model. In: Proc. 3rd Symp. Mult. Anal. (Edited by P. R. Krishnaiah), Academic Press, New York.
Drton, M. and Richardson, T. S. (2004). Multimodality of the likelihood in the bivariate seemingly unrelated regression model. Biometrika 91.
Drton, M., Eichler, M. and Richardson, T. S. (2007). Identification and likelihood inference for recursive linear models with correlated errors. Submitted.
Edwards, D. and Lauritzen, S. L. (2001). The TM algorithm for maximising a conditional likelihood function. Biometrika 88.
Eusebi, P. (2007). A graphical method for assessing the identification of linear structural equation models. Struct. Equation Modelling. To appear.
Fisher, F. M. (1966). The Identification Problem in Econometrics. McGraw Hill, New York.
Frydenberg, M. (1990). The chain graph Markov property. Scand. J. Statist. 17.
Grizzle, J. E., Starmer, C. F. and Koch, G. C. (1969). Analysis of categorical data by linear models. Biometrics 25.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35.
Lauritzen, S. L. (2000). Causal inference from graphical models. In: Complex Stochastic Systems (Edited by O. E. Barndorff-Nielsen, D. R. Cox and C. Klüppelberg), Chapman and Hall, London.
Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17.
Lindley, D. V. (2002). Seeing and doing: the concept of causation. Int. Statist. Rev. 70.

Ma, Z., Xie, X. and Geng, Z. (2006). Collapsibility of distribution dependence. J. R. Statist. Soc. B 68.
Marchetti, G. M. and Wermuth, N. (2007). Matrix representations and independencies in directed acyclic graphs. Submitted.
Patterson, H. D. and Thompson, R. (1971). Recovery of interblock information when block sizes are unequal. Biometrika 58.
Pearl, J. (2000). Causality. Cambridge University Press, Cambridge.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educational Psychol. 66.
Wermuth, N. and Cox, D. R. (2004). Joint response graphs and separation induced by triangular systems. J. Roy. Statist. Soc. B 66.
Wermuth, N. and Cox, D. R. (2007). Distortion of effects caused by indirect confounding. Biometrika. To appear.
Wermuth, N., Cox, D. R. and Marchetti, G. M. (2006). Covariance chains. Bernoulli 12.
Wermuth, N., Wiedenbeck, M. and Cox, D. R. (2006). Partial inversion for linear systems and partial closure of independence graphs. BIT Num. Math. 46.
Xie, X., Ma, Z. and Geng, Z. (2007). Some association measures and their collapsibility. Statistica Sinica. To appear.
Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Amer. Statist. Assoc. 57.

Center of Survey Research, Mannheim, Germany
wiedenbeck@zuma-mannheim.de

Department of Statistics, Chalmers/Göteborgs Universitet, Sweden
wermuth@math.chalmers.se


More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

RESEARCH REPORT. Estimation of sample spacing in stochastic processes. Anders Rønn-Nielsen, Jon Sporring and Eva B.

RESEARCH REPORT. Estimation of sample spacing in stochastic processes.   Anders Rønn-Nielsen, Jon Sporring and Eva B. CENTRE FOR STOCHASTIC GEOMETRY AND ADVANCED BIOIMAGING www.csgb.dk RESEARCH REPORT 6 Anders Rønn-Nielsen, Jon Sporring and Eva B. Vedel Jensen Estimation of sample spacing in stochastic processes No. 7,

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

2. Basic concepts on HMM models

2. Basic concepts on HMM models Hierarchical Multinomial Marginal Models Modelli gerarchici marginali per variabili casuali multinomiali Roberto Colombi Dipartimento di Ingegneria dell Informazione e Metodi Matematici, Università di

More information

Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business

Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business Michal Pešta Charles University in Prague Faculty of Mathematics and Physics Ostap Okhrin Dresden University of Technology

More information

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model

Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Problem Set 3 Issued: Thursday, September 25, 2014 Due: Thursday,

More information

A nonparametric test for path dependence in discrete panel data

A nonparametric test for path dependence in discrete panel data A nonparametric test for path dependence in discrete panel data Maximilian Kasy Department of Economics, University of California - Los Angeles, 8283 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095,

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Identification of the age-period-cohort model and the extended chain ladder model

Identification of the age-period-cohort model and the extended chain ladder model Identification of the age-period-cohort model and the extended chain ladder model By D. KUANG Department of Statistics, University of Oxford, Oxford OX TG, U.K. di.kuang@some.ox.ac.uk B. Nielsen Nuffield

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

TIME SERIES ANALYSIS. Forecasting and Control. Wiley. Fifth Edition GWILYM M. JENKINS GEORGE E. P. BOX GREGORY C. REINSEL GRETA M.

TIME SERIES ANALYSIS. Forecasting and Control. Wiley. Fifth Edition GWILYM M. JENKINS GEORGE E. P. BOX GREGORY C. REINSEL GRETA M. TIME SERIES ANALYSIS Forecasting and Control Fifth Edition GEORGE E. P. BOX GWILYM M. JENKINS GREGORY C. REINSEL GRETA M. LJUNG Wiley CONTENTS PREFACE TO THE FIFTH EDITION PREFACE TO THE FOURTH EDITION

More information

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004 Estimation in Generalized Linear Models with Heterogeneous Random Effects Woncheol Jang Johan Lim May 19, 2004 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure

More information

Analysis of Survival Data Using Cox Model (Continuous Type)

Analysis of Survival Data Using Cox Model (Continuous Type) Australian Journal of Basic and Alied Sciences, 7(0): 60-607, 03 ISSN 99-878 Analysis of Survival Data Using Cox Model (Continuous Type) Khawla Mustafa Sadiq Department of Mathematics, Education College,

More information

Estimation of Unique Variances Using G-inverse Matrix in Factor Analysis

Estimation of Unique Variances Using G-inverse Matrix in Factor Analysis International Mathematical Forum, 3, 2008, no. 14, 671-676 Estimation of Unique Variances Using G-inverse Matrix in Factor Analysis Seval Süzülmüş Osmaniye Korkut Ata University Vocational High School

More information

Intrinsic products and factorizations of matrices

Intrinsic products and factorizations of matrices Available online at www.sciencedirect.com Linear Algebra and its Applications 428 (2008) 5 3 www.elsevier.com/locate/laa Intrinsic products and factorizations of matrices Miroslav Fiedler Academy of Sciences

More information

(K + L)(c x) = K(c x) + L(c x) (def of K + L) = K( x) + K( y) + L( x) + L( y) (K, L are linear) = (K L)( x) + (K L)( y).

(K + L)(c x) = K(c x) + L(c x) (def of K + L) = K( x) + K( y) + L( x) + L( y) (K, L are linear) = (K L)( x) + (K L)( y). Exercise 71 We have L( x) = x 1 L( v 1 ) + x 2 L( v 2 ) + + x n L( v n ) n = x i (a 1i w 1 + a 2i w 2 + + a mi w m ) i=1 ( n ) ( n ) ( n ) = x i a 1i w 1 + x i a 2i w 2 + + x i a mi w m i=1 Therefore y

More information

Consistent Bivariate Distribution

Consistent Bivariate Distribution A Characterization of the Normal Conditional Distributions MATSUNO 79 Therefore, the function ( ) = G( : a/(1 b2)) = N(0, a/(1 b2)) is a solu- tion for the integral equation (10). The constant times of

More information

Instrumental Sets. Carlos Brito. 1 Introduction. 2 The Identification Problem

Instrumental Sets. Carlos Brito. 1 Introduction. 2 The Identification Problem 17 Instrumental Sets Carlos Brito 1 Introduction The research of Judea Pearl in the area of causality has been very much acclaimed. Here we highlight his contributions for the use of graphical languages

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental

More information

MEANINGFUL REGRESSION COEFFICIENTS BUILT BY DATA GRADIENTS

MEANINGFUL REGRESSION COEFFICIENTS BUILT BY DATA GRADIENTS Advances in Adaptive Data Analysis Vol. 2, No. 4 (2010) 451 462 c World Scientific Publishing Company DOI: 10.1142/S1793536910000574 MEANINGFUL REGRESSION COEFFICIENTS BUILT BY DATA GRADIENTS STAN LIPOVETSKY

More information