CAUSAL MODELS: THE MEANINGFUL INFORMATION OF PROBABILITY DISTRIBUTIONS

Jan Lemeire, Erik Dirkx
ETRO Dept., Vrije Universiteit Brussel
Pleinlaan 2, 1050 Brussels, Belgium

ABSTRACT
This paper claims that causal model theory describes the meaningful information of probability distributions after a factorization. If the minimal factorization of a distribution is incompressible, its Kolmogorov minimal sufficient statistic, the parents lists, can be represented by a directed acyclic graph (DAG). We show that a faithful Bayesian network is a minimal factorization and that a Bayesian network with random and unrelated conditional probability distributions (CPDs) is faithful and thus a minimal factorization. The validity of faithfulness depends on the presence of other regularities. The Bayesian network is a canonical representation: it uniquely decomposes the distribution into independent submodels, the CPDs. In the absence of further information, we may assume modularity and that the model offers a good hypothesis about the underlying mechanisms of the system.

KEY WORDS
Causal Models, Kolmogorov Complexity, Meaningful Information, Reductionism.

1 Introduction

Kolmogorov complexity gives an objective measure of the complexity of an object, which allows the formal application of Occam's Razor to modeling. The central idea is that modeling can be equated with finding regularities in data. An objective property of a regularity is its ability to compress the data, i.e. to describe the data using fewer symbols than the number of symbols needed to describe the data literally. The more regularities there are, the more the data can be compressed. The regularities of the data constitute its meaningful information. A good model captures only the regularities, not the accidental, random information of the data. The simplest model that does so is called the Kolmogorov minimal sufficient statistic for the data.

The theory of causal models, as developed by Pearl, Verma, Spirtes, Glymour, Scheines et al., gives a probabilistic view on causation and is based on the theory of Bayesian networks. A Bayesian network consists of a directed acyclic graph (DAG) and a conditional probability distribution (CPD) for each variable. A causal model gives a causal interpretation to the edges of a faithful Bayesian network. The causal interpretation of the edges and the validity of faithfulness are often criticized [4, 2, 18]. The causal interpretation is defined as the ability to predict the effect of changes to the system (the so-called interventions) and is based on the modularity assumption. Faithfulness demands that the model reflects all conditional independencies of the probability distribution. Conditional independencies are qualitative properties and, as we will show, are regularities that allow compression of the description of the probability distribution. Through these regularities we establish a correspondence between causal model theory and the Kolmogorov minimal sufficient statistic. Maximal use of the independencies leads to the minimal factorization. We show that in the absence of other regularities the model is faithful and results in a decomposition of the distribution into independent CPDs, which supports the modularity assumption.

The next section shows how meaningful information can be separated from random information by use of the Kolmogorov structure function. Section 3 defines causal models and Section 4 discusses related work. Section 5 applies the minimality principle to distributions, and in the following section the correspondence with causal model theory is shown. Finally, Section 7 discusses the assumptions.

2 Meaningful Information

The Kolmogorov complexity of an object x is defined as the length of the shortest computer program that prints x and then halts [8]:

K(x) = \min_{p:\, U(p) = x} l(p)    (1)

with U a universal computer. p gives the shortest description of x, but not all bits of p can be regarded as containing meaningful information. We consider meaningful information to be the regularities that allow compression of x [15]. We therefore seek a description in two parts: one containing the meaningful information, which we put in the model, and one containing the remaining random noise, which we put in the data-to-model code. A model can be related to a model set, containing all objects the model can represent. We look for a model set S that contains x and the objects that share x's regularities.
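
To make the two-part idea concrete, the following sketch (ours, not the paper's; the model class "n-bit strings with exactly k ones" and the example strings are purely illustrative) compares the literal description length of a binary string with an idealized two-part code length: the bits needed to state the model set plus the log-size of that set.

# A toy illustration of the two-part code idea: the "model" is the set S of
# all n-bit strings with exactly k ones, and the data-to-model code is the
# index of x within S (log2 |S| bits). The model class is an assumption made
# purely for illustration.
from math import comb, log2

def two_part_length(x: str) -> float:
    """Idealized two-part code length of a binary string x under the model
    class 'strings of length n with exactly k ones'."""
    n, k = len(x), x.count("1")
    model_bits = log2(n + 1)               # enough bits to state k in {0, ..., n}
    data_to_model_bits = log2(comb(n, k))  # log-size of the model set S
    return model_bits + data_to_model_bits

for x in ["0" * 60 + "1" * 4,              # highly regular: only four ones
          "0110100110010110" * 4]:         # a more balanced pattern
    print(len(x), "literal bits vs", round(two_part_length(x), 1), "two-part bits")

For the highly regular string the two-part code is far shorter than the literal 64 bits, while for the balanced string this model class buys nothing (the two-part length even slightly exceeds the literal length), mirroring the behaviour of the structure function for random strings described next.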

The Kolmogorov structure function of x is defined as the log-size of the smallest set containing x that can be described with no more than k bits [3]:

K(k, x) = \min_{p:\, l(p) \le k,\ U(p) = S,\ x \in S} \log|S|    (2)

with |S| the size of the set S. A typical graph of the structure function is illustrated in Figure 1. We start with k = 0 and increase the allowed complexity k of the model set. When k = 0, the only set that can be described is the entire set {0, 1}^n, so the corresponding log set size is n. As we increase k, the model can take advantage of the regularities of x, in such a way that each bit reduces the set's size by more than half; the slope of the curve is smaller than -1. When all regularities are exploited, each additional bit of k reduces the set only by half, and we proceed along the line of slope -1 until k = K(x), where the smallest set that can be described is the singleton {x}. The curve K(S) + \log|S| is also shown on the graph. It represents the descriptive complexity of x when using the two-part code; from k = k* on it reaches its minimum and equals K(x). For random strings the curve starts at \log|S| = n for k = 0 and drops with slope -1 until it reaches the x-axis at k = n: each bit of k reveals one of the bits of x and reduces the set by half.

The Kolmogorov minimal sufficient statistic is defined as the program p* that describes the smallest set S* such that K(S*) + \log|S*| \le K(x) [5]. The two-stage description of x is then as good as the best single-stage description of x. The descriptive complexity of S* is k*.

Figure 1. Kolmogorov structure function for an n-bit string x; k* is the Kolmogorov minimal sufficient statistic of x.

3 Causal Models

We elaborate the theory of causal models, or causally interpreted Bayesian networks, in three steps. First, we show how Bayesian networks describe probability distributions. Secondly, faithful models are defined as describing all independencies of a distribution. Ultimately, a causal interpretation is given to the network.

3.1 Representation of Distributions

Causal models offer a probabilistic interpretation of causality. They are fundamentally Bayesian networks, which offer dense representations of joint distributions. A joint distribution is defined over a set of stochastic variables X_1, ..., X_n and assigns a probability P \in [0, 1] to each possible state (x_1, ..., x_n) \in X_{1,dom} \times \cdots \times X_{n,dom}, where X_{i,dom} stands for the domain of variable X_i. The joint distribution can be factorized relative to a variable ordering (X_1, ..., X_n) as follows:

P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i | X_1, ..., X_{i-1})    (3)

Variable X_j can be removed from the conditioning set of variable X_i if X_i becomes conditionally independent of X_j given the rest of the set:

P(X_i | X_1, ..., X_{i-1}) = P(X_i | X_1, ..., X_{j-1}, X_{j+1}, ..., X_{i-1})    (4)

Such conditional independencies reduce the complexity of the factors in the factorization. The conditioning sets of the factors can be described by a directed acyclic graph (DAG), in which each node represents a variable and has incoming edges from all variables of the conditioning set of its factor. The joint distribution is then described by the DAG and the conditional probability distributions (CPDs) of the variables conditioned on their parents: P(X_i | parents(X_i)). A Bayesian network is a factorization that is minimal, in the sense that no edge can be deleted without destroying the correctness of the factorization. Although a Bayesian network is minimal, it depends on the chosen variable ordering.
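
Before turning to the effect of the variable ordering, a small sketch (ours; the variable names, the toy CPD values and the greedy removal order are illustrative assumptions, not the paper's) shows the reduction step of Eq. (4) at work: start from the chain-rule factorization of Eq. (3) along an ordering and drop a conditioning variable whenever the corresponding conditional independency holds exactly on the joint table.

# Sketch of the pruning step of Eq. (4) on a toy joint distribution.
from itertools import product

ORDER = ("A", "B", "C")                 # illustrative variable ordering
DOM = {v: (0, 1) for v in ORDER}        # binary domains

# Toy joint built from P(A) P(B|A) P(C|B), so C _||_ A | B holds by construction
# and pruning should rediscover exactly that structure.
P_A = {0: 0.3, 1: 0.7}
P_B_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}    # (b, a) -> P(b|a)
P_C_B = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.25, (1, 1): 0.75}  # (c, b) -> P(c|b)
JOINT = {(a, b, c): P_A[a] * P_B_A[(b, a)] * P_C_B[(c, b)]
         for a, b, c in product((0, 1), repeat=3)}

def prob(assign):
    """P(assign), where assign maps a subset of the variables to values."""
    return sum(p for vals, p in JOINT.items()
               if all(vals[ORDER.index(v)] == x for v, x in assign.items()))

def indep(x, y, given, tol=1e-12):
    """Exact check of x _||_ y | given via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    for vals in product(*(DOM[v] for v in [x, y] + given)):
        a = dict(zip([x, y] + given, vals))
        z = {v: a[v] for v in given}
        xz = {x: a[x], **z}
        yz = {y: a[y], **z}
        if abs(prob(a) * prob(z) - prob(xz) * prob(yz)) > tol:
            return False
    return True

# Chain-rule factorization along ORDER, then prune the conditioning sets (Eq. 4).
parents = {}
for i, xi in enumerate(ORDER):
    cond = list(ORDER[:i])
    for xj in list(cond):
        rest = [v for v in cond if v != xj]
        if indep(xi, xj, rest):
            cond = rest
    parents[xi] = cond

print(parents)   # expected: {'A': [], 'B': ['A'], 'C': ['B']}

On this toy table the greedy pass recovers the parent sets {A: [], B: [A], C: [B]}; in general the outcome of such a greedy pruning can depend on the order in which candidate parents are tested.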
Some orderings lead to the same network, while others result in different topologies. Take five stochastic variables A, B, C, D and E. Fig. 2 shows the graph that is constructed by simplifying the factorization based on the variable ordering (A, B, C, D, E) using the three given conditional independencies. The Bayesian network describing the same distribution but based on the ordering (A, B, C, E, D), depicted in Fig. 3, contains two edges fewer because of five useful independencies. We define the minimal factorization as the factorization that has the fewest variables in its conditioning sets.

Figure 2. Factorization based on variable ordering (A, B, C, D, E) and reduction by three independencies.

Figure 3. Bayesian network based on variable ordering (A, B, C, E, D) and five independencies.

3.2 Representation of Independencies

Pearl, Verma and others started to interpret the DAG of a Bayesian network as a representation of the conditional independencies of a joint distribution [10]. They constructed a graphical criterion, called d-separation, for retrieving from the graph the independencies that follow from the Markov condition, which states that a node becomes independent of its non-descendants by conditioning on its parents. Take the graph of Fig. 3. The d-separation criterion tells us that variable B separates A from E, since B blocks the path A -> B -> E. On the other hand, the path A -> C -> D <- E is blocked by C -> D <- E, which is called a v-structure; this path becomes unblocked given D. The Markov condition holds for a Bayesian network, so every independency found with the d-separation criterion in its DAG appears in the distribution. A Bayesian network is called faithful to the distribution if it represents all conditional independencies of the distribution.

3.3 Representation of Causal Relations

Where Bayesian networks are mainly concerned with offering a dense and manageable representation of joint distributions, causal models intend to describe graphically the structure of the underlying physical mechanisms governing the system under study. In a causal model the state of each variable, represented by a node in the graph, is generated by a stochastic process that is determined by the values of its parent variables in the graph. The stochastic variation of this assignment is assumed to be independent of the variations in all other assignments, and each assignment process remains invariant to possible changes in the assignment processes that govern the other variables in the system. This modularity assumption enables the prediction of the effect of interventions, which are defined as specific modifications of some factors in the product of the factorization (Eq. 3) [11].

A causal model corresponds to a joint distribution defined over the variables, and this results in a close connection between causal and probabilistic dependence [14]. For a causal model, the Causal Markov Condition tells us how variables depend on each other: each variable is probabilistically independent of its non-effects conditional on its direct causes. The probabilistic aspect of the condition is similar to the Markov condition. Hence, a causal model can be regarded as a Bayesian network in which all edges are interpreted as representing causal influences between the corresponding variables. This interpretation reflects the second aspect of the Causal Markov Condition: every probabilistic dependence must have a causal explanation (the so-called Principle of the Common Cause) [18]. Furthermore, causal model theory is based on the Minimality Principle (minimality of the model) and the Faithfulness Property (the model describes all independencies). Spirtes, Glymour and Scheines rely in their work on causal models on an axiomatization of these three conditions [13].
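
As a concrete companion to the d-separation criterion of Section 3.2, here is a minimal sketch (ours) of the classical test via the ancestral, moralized graph. The edge list is our reading of the Fig. 3 structure (A -> B, A -> C, B -> E, C -> D, E -> D) and is only illustrative.

# d-separation test via the standard construction: keep the ancestral subgraph
# of {x, y} and the conditioning set, moralize it (marry co-parents, drop
# directions), delete the conditioning nodes, and check whether x and y are
# still connected. EDGES is our reading of Fig. 3 and is an assumption.
EDGES = [("A", "B"), ("A", "C"), ("B", "E"), ("C", "D"), ("E", "D")]

def parents_of(node):
    return {p for p, c in EDGES if c == node}

def ancestors(nodes):
    result, frontier = set(nodes), set(nodes)
    while frontier:
        frontier = {p for n in frontier for p in parents_of(n)} - result
        result |= frontier
    return result

def d_separated(x, y, given):
    keep = ancestors({x, y} | set(given))
    # moralize the ancestral subgraph: original edges plus married co-parents
    undirected = {frozenset(e) for e in EDGES if set(e) <= keep}
    for node in keep:
        ps = list(parents_of(node) & keep)
        undirected |= {frozenset((a, b)) for a in ps for b in ps if a != b}
    # breadth-first search from x that never enters a conditioning node
    blocked, seen, frontier = set(given), {x}, {x}
    while frontier:
        frontier = {n for e in undirected for n in e
                    if (e - {n}) & frontier and n not in seen and n not in blocked}
        seen |= frontier
    return y not in seen

print(d_separated("A", "E", ["B"]))       # True:  B blocks the chain A -> B -> E
print(d_separated("A", "E", []))          # False: that chain is unblocked
print(d_separated("A", "E", ["B", "D"]))  # False: conditioning on the collider D
                                          #        unblocks A -> C -> D <- E
print(d_separated("A", "D", ["C", "E"]))  # True:  the Markov condition for D

The first and third calls reproduce the two statements made about Fig. 3 above; the last call is the Markov condition for D given its parents.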
4 Related Work

Kolmogorov complexity and related methods, such as Minimum Message Length (MML) [17, 16] and Minimum Description Length (MDL) [12], are mostly used for selecting the best model from a given set of models. The choice of the model class, however, determines which regularities are considered. In our discussion we try not to stick to an a priori chosen set of regularities, but search for the relevant regularities.

By a theorem in [11], Pearl describes for which distributions faithful graphs exist and can be learned: the absence of d-separation implies dependence in almost all distributions compatible with the graph G. The reason is that a precise tuning of the parameters is required to generate an independency along an unblocked path in the diagram, and such tuning is unlikely to occur in practice. Pearl solves this problem by imposing a stability restriction on the distribution [11] (sec. 2.4): the occurrence of any independency must remain invariant to any change in the distributional parametrization of the graph. This corresponds to regularities in the CPDs, as will be proved by Theorem 4; a change of the CPDs would break the regularity. Pearl claims that there exists at least one distribution faithful to the model, while we show that all typical distributions of the DAG model set are faithful. The interventions viewpoint on causality describes only one aspect of causality; see [18] for an overview of different views.

5 Minimal Description of Distributions

A joint distribution P(X_1, ..., X_n) can be described more compactly by a factorization that is reduced by conditional independencies. The minimal factorization leads to P(X_1, ..., X_n) = \prod_i CPD_i. The descriptive size of the CPDs is determined by the number of variables in the conditioning sets. The total number of conditioning variables thus defines the shortest factorization.
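
The bookkeeping behind this claim can be sketched as follows (ours; the binary domains, the precision d = 32 and the two parent-set assignments are illustrative, the second being our reading of the Fig. 3 structure): a CPD P(X | parents(X)) needs (|X_dom| - 1) times the product of its parents' domain sizes probabilities of d bits each, and each parents list costs one n-bit membership string.

# Description-length bookkeeping for a factorization (illustrative numbers).
from math import prod

def description_bits(parent_sets, dom_sizes, d=32):
    n = len(parent_sets)
    cpd_bits = sum((dom_sizes[x] - 1) * prod(dom_sizes[p] for p in ps) * d
                   for x, ps in parent_sets.items())
    parents_list_bits = n * n              # one n-bit membership string per variable
    return parents_list_bits + cpd_bits

dom = {v: 2 for v in "ABCDE"}              # binary variables, purely for illustration

# Chain-rule factorization along (A, B, C, D, E): no independencies exploited.
full = {"A": [], "B": ["A"], "C": ["A", "B"],
        "D": ["A", "B", "C"], "E": ["A", "B", "C", "D"]}
# A pruned factorization (our reading of the Fig. 3 structure, illustrative only).
pruned = {"A": [], "B": ["A"], "C": ["A"], "E": ["B"], "D": ["C", "E"]}

print(description_bits(full, dom), "bits vs", description_bits(pruned, dom), "bits")

With these numbers the pruned factorization needs 377 bits against 1017 bits for the unreduced chain-rule factorization.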

A two-part description is then:

descr(P(X_1, ..., X_n)) = {parents(X_1), ..., parents(X_n)} + {CPD_1, ..., CPD_n}    (5)

Note that the parents lists can be described very compactly, for example with an n-bit string per list in which bit i is 1 if X_i is present in that list. The following theorems show that the first part offers the minimal model if the CPDs are random and unrelated.

Theorem 1 The parents lists {parents(X_1), ..., parents(X_n)} in the two-part code given by Eq. 5 contain meaningful information of a probability distribution.

Every variable X_j that can be eliminated from the conditioning set of X_i due to a conditional independency as stated by Eq. 4 reduces the descriptive complexity by

(|X_{i,dom}| - 1) \cdot |X_{1,dom}| \cdots |X_{j-1,dom}| \cdot (|X_{j,dom}| - 1) \cdot |X_{j+1,dom}| \cdots |X_{i-1,dom}| \cdot d    (6)

with |X_{k,dom}| the size of the domain of X_k and d the precision in bits to which each probability is described. The description of variable X_j in the parents list takes no more than \log n bits, which is almost always lower than the above complexity reduction (except when d is taken absurdly small). Every bit of the parents lists thus reduces the descriptive complexity by more than one bit and is therefore meaningful information.

Theorem 2 If the two-part code description of a probability distribution, given by Eq. 5, results in an incompressible string, the first part is a Kolmogorov minimal sufficient statistic.

If a more compact description of the distribution existed, the two-part decomposition would contain redundant bits. Theorem 1 showed that the first part contains meaningful information. The second part does not, since it is incompressible. The first part, described minimally, is therefore the Kolmogorov minimal sufficient statistic.

The distribution decomposes uniquely and minimally into the CPDs, which are atomic and independent (there can be multiple minimal factorizations, but they are closely related; we come back to this in the next section). The decomposition thus offers a canonical representation. The system under study is decomposed into independent subsystems that are only connected via the variables. In the absence of further information, we may assume that each CPD represents a part of reality. This implies modularity: one subsystem can be replaced by another without affecting the rest of the system.

6 Equivalence with Causal Model Theory

We hypothesize that the above decomposition is equivalent to the theory of causal models. The relation between both is proved by two theorems.

6.1 Relation between minimal factorizations and Bayesian networks

Theorem 3 If a faithful Bayesian network exists for a distribution, it is the minimal factorization.

Oliver and Smith define the conditions for sound transformations of Bayesian networks, where sound means that the transformation does not introduce extraneous independencies [9]. No edge removal is permitted, only reorientation and addition of edges. Additionally, if a reorientation destroys a v-structure or creates a new one, an edge should be added connecting the common parents of the former or of the newly created v-structure. Such transformations, however, eliminate some independencies represented by the original graph. Assume the existence of a Bayesian network, based on a different variable ordering, that has fewer edges than the faithful network. It must be possible to transform one into the other. That network has fewer edges, so edges must be added by the transformation, and this destroys independencies. But no network can represent more independencies than the faithful network, because the faithful network represents all independencies of the distribution. The assumption thus leads to a contradiction.

Theorem 4 A Bayesian network with unrelated, random conditional probability distributions (CPDs) is faithful.

Recall that a Bayesian network is a factorization that is edge-minimal. This means that for each parent pa_{i,j} of variable X_i

P(X_i | pa_{i,1}, ..., pa_{i,j}, ..., pa_{i,k}) \neq P(X_i | pa_{i,1}, ..., pa_{i,j-1}, pa_{i,j+1}, ..., pa_{i,k})    (7)

The proof shows that any two variables that are d-connected are dependent, unless the probabilities of the CPDs are related. We consider the following possibilities: the two variables can be adjacent (a), related by a Markov chain (b) (recall that a Markov chain is a path not containing v-structures), related by a v-structure (c), or connected by a combination of both or by multiple paths (d).

First we prove that a variable marginally depends on each of its adjacent variables (a). Consider nodes D and E of the Bayesian network of Fig. 3. To keep the proof readable we demonstrate that P(D|E) \neq P(D); the proof can easily be generalized.

The first term can be written as

P(D|E) = P(D|E, c_1) P(c_1) + P(D|E, c_2) P(c_2) + \ldots    (8)

with c_1, c_2 \in C_{dom}. C is also a parent of D; thus, by Eq. 7, there are at least two values in C_{dom} for which P(D|E, c_i) \neq P(D|E). (P(D|E) is a weighted average of the P(D|E, C); if one probability P(D|E, c_1) differs from this average, say it is higher, then there must be at least one value lower than the average and thus different.) Take c_1 and c_2 to be such values, so that P(D|E, c_1) \neq P(D|E, c_2). There are also at least two such values in E_{dom}; take e_1 and e_2. To obtain an independency, Eq. 8 should equal P(D) for all values of E. This results in the following relation among the probabilities:

P(D|e_1, c_1) P(c_1) + P(D|e_1, c_2) P(c_2) = P(D|e_2, c_1) P(c_1) + P(D|e_2, c_2) P(c_2)    (9)

Note that the equation cannot be reduced: the conditional probabilities are equal neither to P(D) nor to each other.

Next, by the same arguments it can be proved that variables connected by a Markov chain are by default dependent (b). Take A -> B -> E in Fig. 3; independence of A and E requires that

P(E|a) = \sum_{b \in B_{dom}} P(E|b) P(b|a) = P(E)  for all a \in A_{dom},    (10)

and this also results in a regularity among the CPDs. In a v-structure, both causes are dependent when conditioned on their common effect (c): for C -> D <- E, P(D|C, E) \neq P(D|E) holds by Eq. 7. Finally, if there are multiple unblocked paths connecting two variables, then independence of both variables implies a regularity too (d). Take A and D in Fig. 3:

P(D|A) = \sum_{b \in B_{dom}} \sum_{c \in C_{dom}} \sum_{e \in E_{dom}} P(D|c, e) P(c|A) P(e|b) P(b|A)    (11)

Note that P(c, e|A) = P(c|A) P(e|A) follows from the independence of C and E given A. All factors in the equation satisfy Eq. 7, so the equation only equals P(D) if there is a relation among the CPDs.

Table 1 gives an example distribution P(D|E, C) for which D and E are independent, assuming that P(C=0) = P(C=1) = 0.5; the regularity of Eq. 9 applies to it.

Table 1. Example of a CPD P(D|C, E) for which P(D|E) = P(D), assuming that P(C=0) = P(C=1) = 0.5.

From the theorem it follows that the Bayesian network is a minimal factorization. Bayesian networks that are not based on a minimal factorization, such as the one of Fig. 2, are always compressible, namely by the regularities among the CPDs that follow from the independencies not represented by them.

Multiple faithful models can exist for a distribution, though. These models represent the same set of independencies and are therefore statistically indistinguishable; they define a Markov-equivalence class. It is proved that they share the same v-structures and only differ in the orientation of some edges [11]. The corresponding factorizations have the same number of conditioning variables and thus all have the same complexity. Observations alone cannot single out the correct model, but we have demarcated a set of closely related models which contains the correct model.
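
As a numerical companion to Table 1 and Eq. 9 (the CPD values below are ours, not the paper's, and C and E are taken as independent parents of D to keep the computation elementary), the following sketch shows a fine-tuned CPD that renders D marginally independent of E although E remains a genuine parent.

# Numerical check (illustrative values) of the regularity in Eq. 9: the rows
# are chosen so that P(D=1|E=0,C=0) + P(D=1|E=0,C=1)
#                 == P(D=1|E=1,C=0) + P(D=1|E=1,C=1).
# With P(C=0) = P(C=1) = 0.5 and C independent of E in this toy model,
# the fine-tuning makes D independent of E although the edge E -> D is present.
P_C = {0: 0.5, 1: 0.5}
P_E = {0: 0.4, 1: 0.6}                     # any marginal for E will do
P_D1 = {(0, 0): 0.2, (0, 1): 0.8,          # (e, c) -> P(D=1 | E=e, C=c)
        (1, 0): 0.3, (1, 1): 0.7}          # 0.2 + 0.8 == 0.3 + 0.7  (Eq. 9)

def p_d1_given_e(e):
    return sum(P_D1[(e, c)] * P_C[c] for c in (0, 1))

for e in (0, 1):
    print(f"P(D=1 | E={e}) = {p_d1_given_e(e):.3f}")        # 0.500 for both
print(f"P(D=1)         = {sum(p_d1_given_e(e) * P_E[e] for e in (0, 1)):.3f}")

Both conditionals and the marginal come out at 0.5, so the independency holds only because of the regularity among the CPD entries; perturbing any single entry destroys it.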
6.2 Equivalence

The conditions for causal models, Minimality, Faithfulness and the Causal Markov Condition (Section 3.3), are fulfilled for a minimal factorization with random CPDs. Minimality holds by definition, faithfulness is proved by Theorem 4, and the conditional independencies that follow from the Markov condition are present since it is a valid Bayesian network. Finally, the causal interpretation of the edges is correct as long as we define causality in terms of interventions: the modularity of the decomposition captures Pearl's interventions. An intervention, which Pearl considers an atomic operation, can be seen as replacing one specific CPD with a CPD that allows perfect control over the variable (for setting it to a certain state). We hypothesize that the consequences of causal models, like d-separation and the inference and identifiability algorithms, conform with the CPD decomposition: they depend solely on the CPDs and the variables that link them.

Take the flow of information through a causal model. In the model of Fig. 4, variables D and E contain information about A. This information is captured by C: the decrease of uncertainty about A depends on the information that D or E provide about C, but not on whether that information comes from D or from E. C screens A off from D and E, and also D from E. The interaction between these variables happens via C and is represented by the edges. The graphical representation of a causal model thus suggests that the edges constitute the atomic elements of the model. This cannot, however, explain the interaction between A and B: C does not screen A off from B, and moreover C must be known for a dependency between A and B to arise. This interaction pattern is captured by taking the CPDs as the atomic elements. We can say that the information travels between the CPDs through the variables.

Figure 4. Example Causal Model.
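
A minimal sketch (ours; the two-variable model and its numbers are illustrative) of the "intervention = CPD replacement" reading above: do(A = 1) swaps the CPD of A for a point mass and leaves the other CPDs untouched.

# Intervention as CPD replacement in a two-variable model A -> B:
# do(A = 1) replaces the CPD of A by a point mass on 1 and keeps P(B|A) intact.
P_A = {0: 0.7, 1: 0.3}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1,    # (b, a) -> P(B=b | A=a)
               (0, 1): 0.2, (1, 1): 0.8}

def p_b(b, cpd_a):
    """P(B=b) in the model defined by the CPD of A and the unchanged P(B|A)."""
    return sum(P_B_given_A[(b, a)] * cpd_a[a] for a in (0, 1))

observational = {b: p_b(b, P_A) for b in (0, 1)}
do_a1 = {b: p_b(b, {0: 0.0, 1: 1.0}) for b in (0, 1)}   # replaced CPD: A forced to 1

print("P(B)           =", observational)   # roughly {0: 0.69, 1: 0.31}
print("P(B | do(A=1)) =", do_a1)           # {0: 0.2, 1: 0.8}, i.e. P(B | A=1) here

Because A has no parents and no confounder in this sketch, the interventional distribution coincides with the conditional P(B | A=1); in general the two differ, and the replacement semantics is what distinguishes them.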

7 Validity

7.1 Validity of Faithfulness

Faithfulness of a causal model is the cornerstone of causal model theory and the accompanying learning algorithms. We showed that a causal model relies on a specific type of regularity, the conditional independencies that follow from the Markov condition. The simplest model should, however, exploit all regularities, and there are regularities that a causal model does not capture. If such regularities appear, the minimal Bayesian network can be either faithful or not.

If the model remains faithful, the additional regularities do not interfere with the conditional independencies. They can then be regarded as regularities of a lower level. A well-known example is when the description of an individual CPD can be further compressed; this regularity is called local structure [1] and appears inside a building block.

If the minimal Bayesian network is unfaithful, the regularities generate independencies that do not result from the Markov condition alone. This does not exclude that the distribution might still be described minimally by a causal model augmented with a description of the additional regularities, i.e. that the CPD decomposition is still valid. The best-known example of unfaithfulness is when, in the model of Fig. 5, A and D appear to be independent [13]. This happens when the influences along the paths A -> B -> D and A -> C -> D exactly balance, so that they cancel each other out and the net effect is an independence. The independence of A and D is, however, not expected by the causal model: the distribution is not typical for the set of distributions that can be described by the model. d-separation describes the independencies that can be expected from the typical distributions of the causal model set.

Figure 5. Causal model in which A is independent from D.

Distributions with deterministic or functional relations cannot be represented by a faithful graph either [13]. In [7] we show that this is related to a violation of the intersection condition, one of the conditions that Pearl imposes on a distribution in the elaboration of causal theory and its algorithms [10]. The solution we proposed in [7] is to incorporate the information about deterministic relations in an augmented causal model, and to extend the d-separation criterion so that it can be used to retrieve all conditional independencies from the model. In this way the faithfulness of the model can be reestablished, and the model again incorporates all regularities of the data.

These examples do not challenge the validity of the causal interpretation of the model. The next section focuses on other counterexamples.
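
Before turning to those counterexamples, here is a concrete instance (ours, not the paper's) of the cancellation just described for the Fig. 5 structure: B and C each copy A with a little noise, D is their exclusive-or, and the two copies of A cancel exactly, so A and D come out independent even though both paths are active.

# Exact cancellation along the two paths A -> B -> D and A -> C -> D
# (the noise level is an illustrative choice).
from itertools import product

NOISE = 0.1                                # P(B != A) = P(C != A) = 0.1

def joint():
    """Yield (a, b, c, d, probability) for every configuration."""
    for a, nb, nc in product((0, 1), repeat=3):
        p = 0.5 * (NOISE if nb else 1 - NOISE) * (NOISE if nc else 1 - NOISE)
        b, c = a ^ nb, a ^ nc
        yield a, b, c, b ^ c, p            # d = b XOR c

def prob(pred):
    return sum(p for a, b, c, d, p in joint() if pred(a, b, c, d))

for a_val in (0, 1):
    p_cond = (prob(lambda a, b, c, d: d == 1 and a == a_val)
              / prob(lambda a, b, c, d: a == a_val))
    print(f"P(D=1 | A={a_val}) = {p_cond:.3f}")   # 0.180 for both values of A
p_link = prob(lambda a, b, c, d: b == 1 and a == 1) / 0.5
print(f"P(B=1 | A=1)  = {p_link:.3f}")            # 0.900: the links are strong

The link probabilities P(B=A) = P(C=A) = 0.9 show that neither path is vacuous; the independence of A and D is purely a consequence of the fine-tuned combination, exactly the kind of regularity that d-separation does not anticipate.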
7.2 Validity of the CPD Decomposition

The CPD decomposition of a joint distribution implies that the CPDs represent independent mechanisms. In the model of Fig. 4, CPD_D and CPD_E are independent: the states of D and E only depend on C. This decomposition is, however, not valid for all systems; for some systems the CPDs do not represent independent mechanisms. Take the example of particle decay, one of the counterexamples to the Causal Markov Condition reported in [18], p. 55, taken from van Fraassen (1980, p. 29): suppose that a particle decays into two parts, that conservation of total momentum obtains, and that it is not determined by the prior state of the particle what the momentum of each part will be after the decay. By conservation, the momentum of one part will be determined by the momentum of the other part. By indeterminism, the prior state of the particle will not determine what the momenta of each part will be after the decay. Thus there is no prior screener-off: the prior state S fails to screen off the momenta. But by symmetry, neither of the two parts' momenta M_1 and M_2 can be the cause of the other. This system cannot be represented by a causal model. The generation of M_1 and M_2 by S should be considered as one (causal) mechanism, as shown in Fig. 6. Some of the other counterexamples to the Causal Markov Condition given in [18] are similar.

Figure 6. Particle with state S decays into two parts with momenta M_1 and M_2.

Take the set of strings of n bits in which m consecutive bits are 1 and the others are 0. For n = 8 and m = 2, a valid string consists of a block of two consecutive 1-bits at one of seven possible positions, the remaining bits being 0. Every bit can be regarded as a discrete variable. By picking valid strings at random, the joint distribution is observed.
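
A quick enumeration (ours) of this family for n = 8 and m = 2 makes the point concrete: there are seven equally likely strings, indexed by the start position of the block, and the induced bit variables are pairwise dependent.

# Enumerate the valid strings for n = 8, m = 2 and inspect the induced joint
# distribution over the eight bit variables.
n, m = 8, 2
valid = [[1 if start <= i < start + m else 0 for i in range(n)]
         for start in range(n - m + 1)]          # the latent "start position"
N = len(valid)                                   # 7 equally likely strings

def p(pred):
    return sum(1 for s in valid if pred(s)) / N

p0 = p(lambda s: s[0] == 1)
p1 = p(lambda s: s[1] == 1)
p01 = p(lambda s: s[0] == 1 and s[1] == 1)
print(round(p01, 3), "vs", round(p0 * p1, 3))    # 0.143 vs 0.041: bits 0 and 1 depend
print(["".join(map(str, s)) for s in valid])     # the seven valid strings

The start position acts as the single latent mechanism of the kind depicted in Fig. 7(b).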

All bits are correlated, but each pair becomes independent by conditioning on some other bits. The simplest model for this pattern contains a latent variable denoting the start position of the non-zero bit sequence. The causal model shown in Fig. 7(a), however, considers each edge as a separate mechanism. But the mechanisms are not unrelated: the decomposition is not valid, and the model fails to represent the many conditional independencies. The model of Fig. 7(b) is more accurate; it indicates that there is one mechanism generating the states of all bits.

Figure 7. Two models for a pattern in an 8-bit string.

Finally, note that faithfulness can be interpreted in a broader sense as the ability of a model to explain all regularities of the data.

8 Conclusions

The conditional independencies on which causal model theory is based can be regarded as the regularities that allow compression of distributions and the construction of minimal models. We showed that the meaningful information of a causal model lies in its DAG, which defines the decomposition of the distribution into independent submodels, the CPDs. If this decomposition exploits all regularities, causal model theory describes what we can expect from such a system, for example which conditional independencies appear, or what the effect of interventions will be. In the absence of more information, the model offers a good hypothesis about reality. This assumption is supported by the fact that science relies on falsification rather than on confirmation (Popper): one can never prove that a hypothesis is invariably correct, one can only search for observations that refute it.

References

[1] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence.
[2] N. Cartwright. What is wrong with Bayes nets? The Monist.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc.
[4] D. Freedman and P. Humphreys. Are there algorithms that discover causal structure? Synthese, 121:29-54.
[5] P. Gács, J. Tromp, and P. M. B. Vitányi. Algorithmic statistics. IEEE Trans. Inform. Theory, 47(6).
[6] K. B. Korb and E. Nyberg. The power of intervention. Minds and Machines, 16(3).
[7] J. Lemeire, S. Maes, S. Meganck, and E. Dirkx. The representation and learning of equivalent information in causal models. Technical Report IRIS-TR-0099, Vrije Universiteit Brussel.
[8] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Verlag.
[9] R. M. Oliver and J. Q. Smith. Influence Diagrams, Belief Nets and Decision Analysis. Wiley.
[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.
[11] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press.
[12] J. Rissanen. Modeling by shortest data description. Automatica, 14.
[13] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer Verlag, 2nd edition.
[14] W. Spohn. Bayesian nets are all there is to causal dependence. In Stochastic Causality, Maria Carla Galavotti, ed. CSLI Lecture Notes.
[15] P. M. B. Vitányi. Meaningful information. In P. Bose and P. Morin, editors, ISAAC, volume 2518 of Lecture Notes in Computer Science. Springer.
[16] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Springer.
[17] C. S. Wallace and D. L. Dowe. An information measure for classification. Computer Journal, 11(2).
[18] J. Williamson. Bayesian Nets and Causality: Philosophical and Computational Foundations. Oxford University Press, 2005.


4.1 Notation and probability review Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

CS Lecture 3. More Bayesian Networks

CS Lecture 3. More Bayesian Networks CS 6347 Lecture 3 More Bayesian Networks Recap Last time: Complexity challenges Representing distributions Computing probabilities/doing inference Introduction to Bayesian networks Today: D-separation,

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

Lecture Notes in Machine Learning Chapter 4: Version space learning

Lecture Notes in Machine Learning Chapter 4: Version space learning Lecture Notes in Machine Learning Chapter 4: Version space learning Zdravko Markov February 17, 2004 Let us consider an example. We shall use an attribute-value language for both the examples and the hypotheses

More information

Motivation. Bayesian Networks in Epistemology and Philosophy of Science Lecture. Overview. Organizational Issues

Motivation. Bayesian Networks in Epistemology and Philosophy of Science Lecture. Overview. Organizational Issues Bayesian Networks in Epistemology and Philosophy of Science Lecture 1: Bayesian Networks Center for Logic and Philosophy of Science Tilburg University, The Netherlands Formal Epistemology Course Northern

More information

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network.

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network. ecall from last time Lecture 3: onditional independence and graph structure onditional independencies implied by a belief network Independence maps (I-maps) Factorization theorem The Bayes ball algorithm

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Preliminaries Bayesian Networks Graphoid Axioms d-separation Wrap-up. Bayesian Networks. Brandon Malone

Preliminaries Bayesian Networks Graphoid Axioms d-separation Wrap-up. Bayesian Networks. Brandon Malone Preliminaries Graphoid Axioms d-separation Wrap-up Much of this material is adapted from Chapter 4 of Darwiche s book January 23, 2014 Preliminaries Graphoid Axioms d-separation Wrap-up 1 Preliminaries

More information

Bayesian Networks. Motivation

Bayesian Networks. Motivation Bayesian Networks Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Motivation Assume we have five Boolean variables,,,, The joint probability is,,,, How many state configurations

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Identifying Linear Causal Effects

Identifying Linear Causal Effects Identifying Linear Causal Effects Jin Tian Department of Computer Science Iowa State University Ames, IA 50011 jtian@cs.iastate.edu Abstract This paper concerns the assessment of linear cause-effect relationships

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine CS 484 Data Mining Classification 7 Some slides are from Professor Padhraic Smyth at UC Irvine Bayesian Belief networks Conditional independence assumption of Naïve Bayes classifier is too strong. Allows

More information

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997

A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 A Tutorial on Computational Learning Theory Presented at Genetic Programming 1997 Stanford University, July 1997 Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science

More information

Recovering Probability Distributions from Missing Data

Recovering Probability Distributions from Missing Data Proceedings of Machine Learning Research 77:574 589, 2017 ACML 2017 Recovering Probability Distributions from Missing Data Jin Tian Iowa State University jtian@iastate.edu Editors: Yung-Kyun Noh and Min-Ling

More information

Summary of the Bayes Net Formalism. David Danks Institute for Human & Machine Cognition

Summary of the Bayes Net Formalism. David Danks Institute for Human & Machine Cognition Summary of the Bayes Net Formalism David Danks Institute for Human & Machine Cognition Bayesian Networks Two components: 1. Directed Acyclic Graph (DAG) G: There is a node for every variable D: Some nodes

More information