Estimating Causal Networks from Multivariate Observational Data


Research Collection
Doctoral Thesis

Estimating Causal Networks from Multivariate Observational Data

Author(s): Nowzohour, Christopher
Publication Date: 2015
Permanent Link:
Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

Diss. ETH No.

Estimating Causal Networks from Multivariate Observational Data

A dissertation submitted to
ETH ZURICH
for the degree of
Doctor of Sciences

presented by
CHRISTOPHER NOWZOHOUR
Master of Science, University of Oxford
born July 2, 1986
citizen of Germany

accepted on the recommendation of
Prof. Dr. Peter Bühlmann, examiner
Prof. Dr. Marloes Maathuis, co-examiner

2015


To my family


Acknowledgements

First, I would like to thank my supervisor, Prof. Peter Bühlmann, who was ever encouraging and set an example balancing an unbelievable number of personal and professional commitments. I am also grateful to Prof. Marloes Maathuis, who co-supervised me during the second part of my doctoral studies and whose enthusiasm and attention to detail made our collaboration very enjoyable. Without the encouragement of Prof. Nicolai Meinshausen, while being a student at Oxford, I would have never considered a PhD in statistics. I would not have gotten through my PhD without the copious amounts of chocolate and good vibes generously supplied by my two amazing office mates Anna and Ewa. I won't forget the climbs done with Alain, the Salsa dancing with Sophie, or the rounds of Crazy Dog played with Jan, Anna, Jonas, Ruben, and all the others! I am thankful to all SfS colleagues for the great atmosphere at our institute, be it at work or outside. Finally, I am very grateful to my parents and my sister for their lasting support and encouragement.


Contents

Abstract
Zusammenfassung

1. Introduction
   1.1. Background
        Causal Models
        Structure Learning
   1.2. Outline of Thesis

2. Score-based Causal Learning in Additive Noise Models
   Introduction
   The Method
      Notation and Definitions
      Penalized maximum likelihood estimation
   Theoretical Results
   Numerical Results
      Identifiability depending on Linearity and Gaussianity
      Random Edge Functions
      Larger Networks and Thresholding
      Real Data
   Conclusions
   Consistency Proof

3. Structure Learning with Bow-free Acyclic Path Diagrams
   Introduction
   Model and Estimation
      Graph Terminology
      The Model
      Penalized Maximum Likelihood
   Equivalence Properties
   Greedy Search
      Score Decomposition
      Uniformly Random Restarts
      Greedy Equivalence Class Construction
      Implementation
   Empirical Results
      Causal Effects Discovery on Simulated Data
      Genomic Data
   Conclusions
   Distributional Equivalence
   Likelihood Separation

4. Conclusions and Future Work
   Specific Extensions for DAG learning
   Specific Extensions for BAP learning

A. 3-node DAGs
   Full
   Double Sink
   Double Source
   Chain
   Single Edge
   Empty

B. Equivalence Classes for 3-node BAPs
   Finding Equivalent BAPs
   Finding Equivalence Classes

C. Algorithms
   DAG Learning Algorithms
   BAP Learning Algorithms

Bibliography

Abstract

The field of statistical causal inference is concerned with estimating cause-effect relationships between some variables from i.i.d. observations. This is impossible in general: e.g. one cannot distinguish X → Y from Y → X without making further assumptions. However, when more variables are involved or certain structural or distributional assumptions are made, causal inference becomes possible. This is relevant for applications where randomized experiments are not feasible to test causal hypotheses (econometrics) or where the large number of hypotheses requires some kind of pre-screening (genomics).

This thesis is about structure learning, which means estimating the underlying causal graph from data. Specifically, the focus of interest is score-based methods, which assign every possible causal graph a numeric score (depending on the observed data) and then try to find the graph maximizing this score. The two main challenges are:

1. Defining a meaningful score that is maximized by the true underlying graph only and is easily computable at the same time.

2. Solving the combinatorial optimization problem of maximizing the score over all possible graphs.

An important class of causal models are directed acyclic graphs (DAGs), where there are no cyclic relations and no hidden variables. DAGs (also known as Bayesian networks) encode conditional independencies in the joint distribution, and are only identifiable up to their equivalence class in general (there is generally more than one DAG encoding the same set of conditional independencies). When the model is restricted to additive noise, the independence of the noise terms can be used to identify DAGs completely, unless the model is linear and Gaussian (in the continuous case). This thesis presents a score-based method for continuous identifiable additive noise models. Specifically, a penalized

pseudo-likelihood score is developed for this nonparametric setting and proved to be consistent. The method is also successfully tested on simulated and real datasets.

To also accommodate hidden variables, the class of DAGs needs to be extended. A useful way to do this are bow-free acyclic path diagrams (BAPs), which put some restrictions on the hidden structure but are statistically viable. The parametrization is assumed to be linear and Gaussian to facilitate likelihood scoring. This means full identifiability is no longer possible. In contrast to DAGs, no established theory exists about model equivalence for BAPs. This thesis presents a greedy search method for this case, which estimates the equivalence class of the underlying graph, as well as some theoretical results about model equivalence. The method is shown to work on simulated data and is applied to a well-known genomics dataset, where the statistical fit is shown to be much better than for DAG models.

Zusammenfassung

The field of statistical causal inference deals with estimating cause-effect relationships between a set of variables based on i.i.d. data. In full generality this is not possible (e.g. distinguishing X → Y from Y → X) without making additional assumptions. This changes as soon as more variables are involved or certain structural or distributional assumptions are made. Causal inference is particularly relevant for applications where randomized experiments are not possible (e.g. in econometrics) or where the large number of hypotheses to be tested requires some kind of pre-selection (e.g. in genomics).

This thesis is about estimating the underlying causal graph from data. The focus is in particular on score-based methods, which assign each graph a (data-dependent) numeric comparison value, the score, and then try to find the score-maximizing graph. The two main challenges are:

1. Defining a meaningful score function that is maximized only by the true causal graph and is at the same time easy to compute.

2. Solving the combinatorial optimization problem of maximizing the score over all graphs.

An important class of causal models are DAGs (directed acyclic graphs), in which there are no cyclic structures and no hidden variables. DAGs (also known as Bayesian networks) encode conditional independencies in the joint probability distribution and are in general only identifiable up to their equivalence class (there are usually several DAGs encoding the same conditional independencies). When the

model is restricted to additive noise terms, the independence of these noise terms can be used to identify the DAG completely, unless the model is linear and Gaussian (in the continuous case). This thesis presents a score-based method for continuous and identifiable models with additive noise terms. Since this setting is nonparametric, a penalized pseudo-likelihood score was developed and its consistency proved. The method was moreover successfully tested on simulated and real data.

To also model hidden variables, the class of DAGs has to be extended. One way to do this are BAPs (bow-free acyclic path diagrams). The hidden variables are subject to certain restrictions here, but in return the statistical model is tractable. The parametrization is linear and Gaussian, so that a likelihood score can be used. This also means, however, that the model is no longer completely identifiable. In contrast to DAGs, there is no complete theory of equivalence classes for BAPs. This thesis presents a greedy algorithm for this setting, which estimates the equivalence class of the underlying graph. In addition, some theoretical results on the equivalence structure of BAPs are presented. The method was likewise successfully tested on simulated data and was furthermore applied to a well-known genomics dataset, where the statistical fit was considerably better than for DAG models.

Chapter 1. Introduction

Happy the man who has been able to learn the causes of things.
Virgil, Georgics II

The goal of many scientific endeavours is to detect causal relationships in the world surrounding us. Associations are ubiquitous but often difficult to explain with causal models, and only a causal model is usually considered a sufficient explanation. Aside from the rather philosophical motivation of true understanding of a phenomenon, there are more tangible advantages to knowing cause and effect. Often, understanding a system is a precursor to intervening in the system in an effort to change its outcome. If the outcome cannot be controlled directly, the only way to change it is to influence its causes. If a drug is known to cause the cure of an illness, the way to make a sick person healthy is to administer the drug. The quest of causality is finding out which drugs cure the disease. Sometimes this is possible by direct experiment: a number of patients is randomly split into two groups, one of which receives the drug, the other a control. The random assignment rules out any other causes (like gender, age, etc.) for potential improvement, and any change in outcome between the two groups can then be attributed to the drug. For various reasons a randomized experiment is not always possible. It might simply be too costly to test all relevant causal hypotheses. There are thousands of genes in the genome of most species (and an exponential number of gene combinations), which typically cannot all be tested for their effect on a certain phenotype (like pest resilience for

crops). Another reason preventing direct experiments might be ethical concerns. It would be wrong to forcefully expose people to a harmful substance (like tobacco) to examine the causal connection with certain diseases. Lastly, some experiments are simply impossible, which is often the case in econometrics. One cannot change global interest rates to assess their effect on unemployment. Sometimes the time ordering of events can provide causal information (a principle that is known as Granger causality in econometrics), but this only works as long as one can exclude the possibility of confounding variables (post hoc fallacy). When experimental data are not available, one is left with observational data. This makes causal inference hard, since then statistical associations can have explanations other than direct causal effects. The well-known lore of "correlation does not imply causation" serves as a warning for this, and an example of correlation based on confounding is shown below.

Example. Consider the dataset shown in the left panel of Figure 1.1. There is a strong association between life expectancy and the number of internet users in 211 countries. Does this mean increasing your web mileage will lead to a longer life? Most likely not. Far more realistic is the causal model depicted in the right panel of the figure, where there is a confounding variable that is not observed but causes both other variables. In this case the confounder might be the economic development status of a country, of which gross national income (GNI) could be considered a proxy. Indeed, when conditioning on GNI, the association between the two other variables weakens significantly.

However, it would be wasteful to shun any purely observational data and just capitulate in these cases.

Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing "look over here".
Randall Munroe, xkcd 552

Figure 1.1.: Life expectancy versus number of internet users conditioned on gross national income in the left panel (2013 data from 211 countries, Worldbank). Possible causal graph connecting life expectancy (LE), internet users (IU), and development (Dev) in the right panel.

First, causal relationships imply statistical associations, so it would be futile to look for causation where there is no association. Second, it is possible to rule out a number of (and sometimes all but the correct) candidate causal models based on statistical associations in observational data, as research in this discipline during the last decades has shown. Third, with an increasingly restrictive cascade of assumptions, it is possible to identify more and more causal information from observational data. This thesis explores some of these cases and presents algorithms to extract causal models from observational data.

1.1. Background

The task of causal inference can be roughly split into two consecutive stages: learning the causal model from data and predicting intervention effects for a given causal model. While the former, also called structure learning, is concerned with distinguishing causes and effects, the latter is concerned with the specific types and strengths of effects.
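The weakening of the association after conditioning on a confounder, as in the GNI example above, is commonly quantified by a partial correlation. The sketch below uses synthetic data in place of the Worldbank dataset; the variable names and coefficients are illustrative assumptions only.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
dev = rng.normal(size=2000)              # hidden confounder ("development")
le = 2.0 * dev + rng.normal(size=2000)   # stand-in for life expectancy
iu = 1.5 * dev + rng.normal(size=2000)   # stand-in for internet users

marginal = np.corrcoef(le, iu)[0, 1]     # strong marginal association
conditional = partial_corr(le, iu, dev)  # nearly vanishes given the confounder
print(marginal > 0.5 and abs(conditional) < 0.1)
```

Both variables correlate strongly because they share a cause, yet the partial correlation given the confounder is close to zero, mirroring what conditioning on GNI does in the real data.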

This thesis is concerned mainly with structure learning. There is a strong connection with graphical modelling (and particularly Bayesian networks), of which causal inference can essentially be considered an application. A fairly comprehensive overview of the field can be found in Pearl (2009), with the earlier work of Spirtes et al. (1993) also providing valuable insight.

1.1.1. Causal Models

A causal model is usually visualised as a directed graph, consisting of nodes (representing variables) and directed edges between the nodes (representing causal relations). An example is given in the right panel of Figure 1.2, where a change in X1 would cause a change in X2 (but not vice versa). The idea is that every variable that can be measured is completely determined by some other variables (its causes). In practice, one rarely has access to (or interest in) all causes of all variables, so all the unknown causes for a given variable are lumped together into a noise term. As long as these unknown causes are distinct for each observed variable (to a reasonable approximation), the noise terms can be considered independent. If there are unobserved confounders, the noise terms are no longer independent (a scenario which is considered in Chapter 3), and the presence of a confounder is usually marked with a bidirected edge in the graph. Of course a cause-effect relationship is always relative to a resolution, since it can be decomposed into more and more intermediate stages. The key modelling step now is to associate the noise terms and the exogenous variables¹ with a set of probability distributions, which in turn give rise to corresponding joint probability distributions over the observed variables. This induces the statistical model (set of distributions) associated with the causal model (graph) and is essential for all statistical structure learning methods. The causal graph might well be cyclic, that is, it might have feedback loops.
However, it is usually assumed that this is not the case.

¹Exogenous variables are not caused by any variables in the system. They are also called source variables.

The

Figure 1.2.: Causal Inference in a nutshell.

reasons for this are twofold. On the one hand, most variables could be indexed by time and be measured in an arbitrarily dense fashion. The notion of causality that most people subscribe to then implies that, when measuring at the right time scale, the true causal model will not be cyclic. On the other hand, and more importantly, including cyclicity makes things much more complicated. This is why, in the spirit of progressing from simpler towards more complicated models, most methods so far still assume acyclicity.

The causal graph is typically associated with a structural equation model (SEM), which is supposed to model the data generating process (see Figure 1.2). The SEM is simply a system of equations, specifying each variable as a function of its direct causes and a noise term. As such, it encodes the causal graph with its functional dependencies (there is an edge from X1 to X2 if and only if f2 depends on X1 in Figure 1.2). The reason for the term "structural" is the asymmetric meaning of the equal sign in these systems: it separates causes from effects (usually the causes are written on the right-hand side). The joint distribution over the observed variables is completely determined by the SEM and the marginal distributions of the noise terms and the exogenous variables. This very general model can be constrained by various structural or distributional assumptions.

Structural Assumptions

The following are assumptions on the causal graph, which are made at various points in this thesis.

(1) Acyclicity. There are no feedback loops.

(2) Causal Sufficiency. There are no hidden confounders: the noise terms are independent and the causal graph consists of directed edges only.

(3) Bow-freeness. There is no hidden confounder between two variables if one of the variables is a direct cause of the other. This is strictly weaker than assumption (2), since it allows some hidden confounders.

When assumptions (1) and (2) hold, the causal graph is a directed acyclic graph (DAG). Causal structure learning for DAGs is discussed in Chapter 2. In this case, the distribution generated by the corresponding SEM is also said to be Markov to the DAG, since it satisfies certain conditional independence properties resulting from the DAG. When assumptions (1) and (3) hold, the causal graph is a bow-free acyclic path diagram (BAP), like the one shown in Figure 1.2. This rather special class of models is useful, since it can accommodate some hidden confounders while still being statistically tractable, and is discussed in Chapter 3.

Distributional Assumptions

The following are assumptions on the joint probability distribution or the parametrization of the model, which are made at various points in this thesis.

(4) Causal Minimality. There is no proper subgraph² of the true graph that can model the data equally well. This would be violated if there were a superfluous edge in the graph (e.g. if f2 were independent of X1 in Figure 1.2).

(5) Faithfulness. Every conditional independency in the distribution has to result from the causal graph. This is a strictly stronger version of assumption (4) and would be violated by cancelling paths, where different influences between variables cancel each other out. For a violation of causal minimality or faithfulness,

²A proper subgraph is the same as the original graph, but with some missing edges.

the distributional parameters would have to be finely tuned, so a violation of faithfulness would almost surely not occur with randomly generated models.

(6) Additive Noise. The structural equations are additive in the noise terms, essentially turning each structural equation into a nonlinear regression model.

(7) Gaussian and Linear. The marginal distributions of the noise terms and exogenous variables are Gaussian, and the structural equations are linear. This is a strictly stronger version of assumption (6) and turns each structural equation into a linear regression model. The resulting distribution over the observed variables is jointly Gaussian.

(8) Non-Gaussian or Non-linear. The negation of assumption (7) (but still keeping assumption (6)).

Another important assumption that is often made is that the data are i.i.d. This means in particular that the data cannot have time structure and must come from the same experimental conditions.

Equivalency

A fundamental problem in structure learning is that the causal graph is, in general, not determined by the joint distribution. Instead, the graphs cluster in equivalence classes, which are statistically indistinguishable, meaning they can all generate the exact same distributions. This means the causal graph is only identifiable up to its equivalence class. In the DAG case, when assumptions (1) and (4) hold, the equivalence classes are determined by conditional independencies that have to hold in the corresponding joint distribution and which can be read off the graphs using a criterion called d-separation (Pearl, 2009). In practice, one often makes assumption (7), since in a joint Gaussian distribution conditional independencies correspond to vanishing partial correlations, which can be estimated easily from data. However, it turns out that in the additive noise setting the linear Gaussian case stands out, in the sense that it is the only case where

full identifiability cannot be achieved (Hoyer et al., 2009). As soon as one departs from this and assumes either nonlinear structural equations or non-Gaussian noise, the causal graph becomes identifiable. This is because the model class then does not admit independent noise terms for anything but the true graph. This case is considered in Chapter 2. For more general model classes like BAPs (when assumption (3) holds) the equivalence structure is not completely known, which complicates structure learning significantly. This case is considered in Chapter 3.

When the graph is not identifiable for some model classes, structure learning results in the correct equivalence class at best. In lucky cases, this equivalence class might be small and agree on parts of the graph. But in general, equivalence classes can get very large. However, it might still be possible to identify causal effects between some of the variables, by considering each graph in the equivalence class individually and then taking lower bounds of the absolute causal effects (this is the IDA method by Maathuis et al. (2009)). This is what is done in the BAP case (Chapter 3), since there the equivalence classes can potentially be large.

1.1.2. Structure Learning

Methods for structure learning can be broadly categorized into two classes: constraint-based methods and score-based methods. The former class utilizes properties that the correct model has to fulfill and that are testable on data, and constructs the model step by step. The latter class assigns a numeric score to each candidate model and turns the problem into a combinatorial optimization problem (optimizing the score over all possible models). Methods for discrete and continuous data are often quite different; in this thesis only the continuous case is considered.
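For DAGs, the equivalence classes discussed above have a classical graphical characterization (Verma and Pearl): two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures. A minimal sketch of that check (the dict-of-parents representation is an illustrative choice, not the thesis's notation):

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edge set of a DAG given as {child: parent set}."""
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """Colliders a -> c <- b with a and b non-adjacent."""
    sk = skeleton(dag)
    return {(frozenset((a, b)), c)
            for c, ps in dag.items()
            for a, b in combinations(ps, 2)
            if frozenset((a, b)) not in sk}

def markov_equivalent(d1, d2):
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

# X -> Y -> Z and Z -> Y -> X are equivalent; the collider X -> Y <- Z is not.
chain1 = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
chain2 = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}
collider = {"X": set(), "Z": set(), "Y": {"X", "Z"}}
print(markov_equivalent(chain1, chain2))   # True
print(markov_equivalent(chain1, collider)) # False
```

Both chains encode the single conditional independence X ⊥ Z | Y, while the collider encodes X ⊥ Z marginally, which is why only the former pair is indistinguishable from observational data.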
The underlying problem of learning the best graph is known to be NP-hard in general, so both types of methods face the problem of either having worst-case exponential complexity or not being guaranteed to find the optimal model. However, it has been shown that in some cases polynomial-complexity algorithms exist. Specifically, when a

DAG model is correctly specified (i.e. the observed data really were generated from such a model), finding the true model is not NP-hard (Chickering, 2002). A similar result holds for sparse networks with hidden confounders (Claassen et al., 2013).

Constraint-based Methods

The most common methods of this type use the fact that the causal graph implies certain conditional independencies that are testable on the observed data. The PC algorithm for DAGs (Spirtes et al., 1993) and the FCI, RFCI, and FCI+ algorithms for models with hidden confounders (Spirtes et al., 1993; Maathuis et al., 2009; Claassen et al., 2013) all exploit this property. Although very general in theory, typically distributional assumptions like linearity and Gaussianity (assumption (7)) have to be made in order to facilitate conditional independence tests. The result of these methods is an estimate of the equivalence class, i.e. all graphs that imply precisely the conditional independencies consistent with the observed data.

For nonlinear or non-Gaussian additive noise models (when assumption (8) holds), the independence of the noise terms is used as a constraint. The first method of this type was LiNGAM (Shimizu et al., 2006), which assumes linear functions and non-Gaussian noise. This was later extended to the general case (Hoyer et al., 2009; Mooij et al., 2009). A very general method not needing the additive noise assumption is IGCI (Janzing et al., 2012), which is based on the independence of the cause and the mechanism mapping the cause to its effect. All of these methods result in a single causal graph, since in these model classes full identifiability is possible.

Score-based Methods

The advantage of a score-based approach is that models become inherently comparable, and optimization techniques like greedy search can be applied.
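The residual-independence idea behind the additive noise methods above can be sketched as follows: regress each variable on the other and compare how strongly the residuals depend on the regressor. In this sketch a polynomial fit stands in for a nonparametric regression, and a crude moment-based proxy stands in for a proper independence test such as HSIC; both are simplifying assumptions, not the cited methods.

```python
import numpy as np

def residuals(cause, effect, deg=5):
    """Regress effect on cause (polynomial stand-in for a smoother)."""
    return effect - np.polyval(np.polyfit(cause, effect, deg), cause)

def dependence(x, r):
    """Crude dependence proxy: correlation between functions of the
    regressor and the squared residuals (real methods use e.g. HSIC)."""
    return max(abs(np.corrcoef(f, r**2)[0, 1]) for f in (x, x**2, np.abs(x)))

rng = np.random.default_rng(2)
x = rng.normal(size=3000)
y = x**3 + rng.normal(size=3000)  # additive noise model X -> Y

forward = dependence(x, residuals(x, y))   # residuals look like independent noise
backward = dependence(y, residuals(y, x))  # wrong direction: residuals depend on y
print(forward < backward)
```

In the causal direction the residuals behave like the independent noise term, while the anticausal fit leaves structure behind, which is exactly the asymmetry these constraint-based methods test for.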
There is a long history of score-based methods for causal inference, going back to LISREL (Jöreskog, 2001) for the linear Gaussian case with possibly hidden confounders. A related method is presented

in Chapter 3 of this thesis. For the DAG case the most prominent method is greedy equivalence search (GES, Chickering (2002)), which efficiently traverses the search space by clustering graphs into their equivalence classes. For the identifiable setting of assumption (8) a score-based method is presented in Chapter 2. If the noise is assumed to be Gaussian, the CAM framework can be applied (Bühlmann et al., 2014).

The score for most of the methods in this class consists of a likelihood term, quantifying the model fit, as well as a penalty term for model pruning. The latter is necessary as otherwise overfitting would occur, and the full models would always attain the highest scores. Most often the Bayesian Information Criterion (BIC) is used as a score, which is consistent in the large sample limit if the model class is a curved exponential family. The score is usually optimized using some form of greedy hill climbing. This is only a heuristic in general but works reasonably well in practice. If the likelihood factorizes over the variables (i.e. can be written as a product of terms each depending on only a single variable and its parents in the graph), the score can be updated locally, leading to a large computational improvement for greedy search based methods.

1.2. Outline of Thesis

This thesis has two main contributions. In Chapter 2 a score-based method for the setting of nonlinear or non-Gaussian additive noise models is presented (assumptions (1), (2), (4) and (8)). The score is a penalized pseudo-likelihood, where the relevant densities are estimated from the data. The main challenge here was to come up with a sensible scoring method, since the model is nonparametric and the true likelihood function is not known. Some simulations show the method works if the model is correctly specified, and an application to a real dataset shows it is comparable with other state-of-the-art methods.
The simulations and real datasets considered are low-dimensional, but the method is in principle adaptable to higher dimensions. This chapter also includes some theoretical results, showing consistency of the score

as long as the distributions satisfy some technical conditions.

In Chapter 3 a method for linear Gaussian models with bow-free confounder structure is presented (assumptions (1), (3), (5), and (7)). The score is a penalized Gaussian likelihood, using the maximum likelihood estimator provided by Drton et al. (2009). In this chapter the main challenge was to find an efficient way to do greedy search in this setting, since the likelihood no longer factorizes completely over the variables. Additionally, the detailed structure of equivalence classes is unknown for this model class, making the evaluation of any method difficult (the graph found by the method might be different from the original graph yet be in the same equivalence class). This challenge was partly overcome by a mix of theoretical results, going some way towards characterizing equivalence classes, and a heuristic search method to find equivalent models. Simulations show the method is successful if the model is correctly specified, and an application to real data shows this model class has a much better fit than the more restrictive DAG models, suggesting the ubiquity of hidden confounders in many datasets.

In the appendices an overview of DAGs and BAPs over 3 nodes is given, as well as a numerical approach to explore the equivalence structure of small BAPs. Finally, all algorithms developed as part of this thesis are given as pseudocode in Appendix C.
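The BIC scoring and local score decomposition described in Section 1.1.2 can be sketched for the linear Gaussian case (assumption (7)); the node-wise regressions and function names below are illustrative choices, not the thesis's implementation.

```python
import numpy as np

def local_bic(data, child, parents):
    """BIC contribution of a single node: Gaussian log-likelihood of the
    node given its parents minus 0.5*log(n) per free parameter."""
    n = data.shape[0]
    y = data[:, child]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * (X.shape[1] + 1) * np.log(n)

def bic(data, dag):
    """The score decomposes over nodes, so a single-edge change during
    greedy hill climbing only requires recomputing one local term."""
    return sum(local_bic(data, c, ps) for c, ps in dag.items())

rng = np.random.default_rng(3)
n = 1000
x0 = rng.normal(size=n)
x1 = 1.5 * x0 + rng.normal(size=n)
data = np.column_stack([x0, x1])

with_edge = {0: [], 1: [0]}  # x0 -> x1
no_edge = {0: [], 1: []}
rev_edge = {0: [1], 1: []}   # x1 -> x0: same score (Markov equivalent)
print(bic(data, with_edge) > bic(data, no_edge))
```

The true edge improves the score despite the penalty, while the reversed edge attains exactly the same score, illustrating why linear Gaussian models are identifiable only up to the equivalence class.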


Chapter 2. Score-based Causal Learning in Additive Noise Models¹

Given data sampled from a number of variables, one is often interested in the underlying causal relationships in the form of a directed acyclic graph. In the general case, without interventions on some of the variables, it is only possible to identify the graph up to its Markov equivalence class. However, in some situations one can find the true causal graph just from observational data, for example in structural equation models with additive noise and nonlinear edge functions. Most current methods for achieving this rely on nonparametric independence tests. One of the problems there is that the null hypothesis is independence, which is what one would like to get evidence for. We take a different approach in our work by using a penalized likelihood as a score for model selection. This is practically feasible in many settings and has the advantage of yielding a natural ranking of the candidate models. When making smoothness assumptions on the probability density space, we prove consistency of the penalized maximum likelihood estimator. We also present empirical results for simulated scenarios and real two-dimensional data sets (cause-effect pairs) where we obtain similar results as other state-of-the-art methods.

¹This chapter has been published as Nowzohour and Bühlmann (2015).
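The data-generating process assumed in this chapter (an acyclic structural equation model with independent, additive noise and a nonlinear edge function) can be simulated in a few lines; the graph, edge function, and noise scales below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Causal chain X1 -> X2 -> X3: each variable is a (possibly nonlinear)
# function of its parents plus an independent additive noise term.
e1, e2, e3 = rng.normal(size=(3, n))
x1 = e1                        # exogenous source variable
x2 = np.tanh(x1) + 0.5 * e2    # nonlinear edge function, additive noise
x3 = 0.8 * x2 + 0.5 * e3

# By construction the noise terms are mutually independent, which is
# the property that identifiability arguments for these models exploit.
print(abs(np.corrcoef(e1, e2)[0, 1]) < 0.1)
```

Fitting any graph other than the true chain to such data leaves residuals that are not independent of the regressors, which is what both the independence-test methods and the score developed in this chapter pick up on.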

2.1. Introduction

Statistical causal inference is an important but relatively new field. Traditionally, most statistical statements and assertions are associational (X and Y are correlated), rather than causal (changes in X cause changes in Y). While the former are statements about the joint distribution, the latter are about the underlying causal mechanisms. In practice, the relevant question often is whether variable X has a causal effect² on variable Y, possibly mediated by some other variables Z1,..., Zd in the causal network. In general, the only way to completely identify the causal model is by performing experiments (interventions). However, it is often possible to at least narrow down the space of candidate models by using only observational data (Verma and Pearl, 1991; Spirtes et al., 1993). There are many situations where one is dependent on purely observational data, either because performing experiments is infeasible (e.g. astronomical data), unethical (e.g. clinical cancer studies), or both (e.g. economical data). Some real-life examples include identifying gene expression networks (Statnikov et al., 2012; Stekhoven et al., 2012) and analysing fMRI data from the human brain (Ramsey et al., 2010).

When modeling causal networks between some given variables, structural equation models are used frequently, where each variable is expressed as a function of some other variables (its causes) as well as some noise. Thus the model is determined by the cause-effect structure (in the form of a directed graph over the variables), the functional dependencies, and the joint distribution of the noise terms. Assumptions typically made include that the underlying causal model is acyclic (i.e. there are no feedback loops) and that the noise terms are independent (i.e. there are no unobserved variables). We furthermore assume that the noise is additive, i.e.
the effect variable minus some noise term is a deterministic function of the cause variables. Although quite restrictive, this is a common assumption in many other settings (e.g. regression) and allows straightforward estimation. The standard case then is to parameterize the model by making the functional dependencies linear and the noise Gaussian³. In this case the space of candidate models (in the form of directed acyclic graphs) clusters in equivalence classes, which prohibits full identification: every model in a given equivalence class can induce the same joint distribution over the variables. In a sense, this is quite exceptional, however. It has been shown that as soon as one departs from the linearity or the Gaussianity assumptions, the model becomes fully identifiable⁴ (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009; Peters et al., 2011b; Peters and Bühlmann, 2014). We are thus interested in the nonparametric case, where either the functional dependencies are nonlinear or the noise terms are non-Gaussian (or both). An inference procedure for this case based on nonparametric independence tests has been suggested by Mooij et al. (2009). Their method uses the fact that when fitting the wrong model the noise terms will not be independent. There are a few problems with this approach, however. First, the null hypothesis of the tests employed is independence, which is what one would like to show, and statistical hypothesis testing only allows one to reject such hypotheses. Second, because of the many tests involved there is a multiple testing problem. Third, nonparametric independence testing among many variables is statistically hard, and the tests tend to be computationally intensive. We take a different approach in the form of a score-based method, which is consistent, fast, and easily adaptable to greedy methods for large problems. Score-based methods are widely used for fitting Gaussian structural equation models (Chickering, 2002) or discrete Bayesian networks (Koller and Friedman, 2009). Maximum a posteriori estimation was used in the setting of non-linear models with Gaussian noise by Imoto et al. (2002).

² X has a causal effect on Y if manipulating X changes the distribution of Y, see Pearl (2009).
Two other score-based methods have recently been proposed: for the parametric setting of Gaussian and linear models with same error variances (Peters and Bühlmann, 2014) and for linear models with non-Gaussian noise (Hyvärinen and Smith, 2013). Most closely related to this paper is an approach from Bühlmann et al. (2014). They consider a semi-parametric structural equation model with additive, nonlinear functions in the parental variables and

³ In fact, this is how structural equation models were first introduced and continue to be used today (Bollen, 1989).
⁴ Except for a set of degenerate cases of measure zero.

additive Gaussian noise, and they prove consistency and present an algorithm for cases with potentially many variables. In contrast, we consider here a model with a nonparametric specification of the error distribution (while the focus is on cases with few variables only). Thus, our model is more general but harder to estimate from data. We propose a penalized maximum likelihood method and prove its asymptotic consistency for finding the true underlying graph, provided some technical assumptions about the class of probability densities hold. Our nonparametric setting also includes the well-known LiNGAM model (Shimizu et al., 2006) as a special case, and thus we provide here a score-based approach for LiNGAM. Independent work by Kpotufe et al. (2014) considers a similar problem to ours; however, while they only treat the case with two variables, we allow for more realistic multivariate settings. This paper is organized as follows: In Section 2 we review the basic notation and definitions we will use later on, before describing our method. In Section 3 we present our main theorem and the assumptions for proving consistency in the large sample limit. In Section 4 we discuss simulation results showing that the method works in practice under controlled conditions. In Section 5 we test our method on some real-world datasets and compare it to other causal inference methods.

2.2. The Method

Suppose data is sampled from real-valued random variables X_1, ..., X_d, which have some causal structure. We are interested in finding this causal structure (in the form of a directed acyclic graph) just by using observational data. Before we describe our method and the assumptions it rests on, we will give definitions of some of the basic terms used in this paper (some of which can be found in e.g. Lauritzen (1996), Pearl (2009), Triebel (1983)).
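As a concrete running instance of the setup just described, the following sketch simulates observational data from a bivariate structural equation model with additive noise. The cubic edge function, noise distributions, and sample size are illustrative choices of mine, not taken from the text:

```python
import numpy as np

def simulate_anm(n, rng):
    """Simulate n samples from the additive noise model
    X1 = eps1,  X2 = f(X1) + eps2, with independent noise terms."""
    eps1 = rng.standard_normal(n)
    eps2 = rng.standard_normal(n)
    x1 = eps1
    x2 = x1 + 0.5 * x1**3 + eps2   # f(x) = x + 0.5 x^3 (illustrative nonlinearity)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = simulate_anm(300, rng)
```

Data of exactly this form (two variables, nonlinear edge function, additive independent noise) is what the identifiability results discussed below apply to.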

Notation and Definitions

Given a set of vertices V = {1, ..., d} and edges E ⊆ V × V, we define the d-dimensional graph G as the ordered pair (V, E). If E is asymmetric, G is called a directed graph. Given two vertices α, β ∈ V, a directed path of length n from α to β is a sequence of vertices α = v_0, ..., v_n = β, s.t. (v_i, v_{i+1}) ∈ E for all i = 0, ..., n − 1. If G is directed and for all v ∈ V there is no path of length n ≥ 1 from v to itself, then G is called a directed acyclic graph (DAG). If V' ⊆ V and E' ⊆ E ∩ (V' × V'), then G' = (V', E') is called a subgraph of G, and we write G' ⊆ G. If E' ⊊ E ∩ (V' × V'), we call G' a proper subgraph of G and write G' ⊊ G. In a graph G we define the parents of a vertex v as the set pa_G(v) := {u ∈ V : (u, v) ∈ E}. The structural Hamming distance (SHD) between two graphs G, G' is defined as the number of single edge operations (edge additions, deletions, reversals) necessary to transform G into G'.

A joint density p over X_1, ..., X_d is Markov with respect to a DAG D if it factorizes along D:

p(x_1, \ldots, x_d) = \prod_{k=1}^d p(x_k \mid \{x_l\}_{l \in \mathrm{pa}_D(k)}).  (2.1)

A DAG D is causally minimal with respect to a joint density p if there is no proper subgraph D' ⊊ D s.t. p is Markov with respect to D'. A structural equation model (SEM) M = {f_k, p_{ε_k}}_{k=1,...,d} is a set of functions f_k and densities p_{ε_k}, specifying each variable X_k as a function of some of the other variables and a noise term ε_k (independent of the other noise terms) with density p_{ε_k}. The model M induces a directed graph D, where a directed edge (k, l) is added if the function for X_l directly depends on X_k. We will assume in this paper that M is recursive, i.e. its graph D is actually a DAG. We can write the model equations as

X_k = f_k(\{X_l\}_{l \in \mathrm{pa}_D(k)}, \epsilon_k),  k = 1, \ldots, d.

If the functions are additive in the noise, i.e. if

X_k = f_k(\{X_l\}_{l \in \mathrm{pa}_D(k)}) + \epsilon_k,  k = 1, \ldots, d,  (2.2)

the model is called an additive noise model (ANM). We call M := (F, P_ε) a functional model class⁵ of dimension d if F ⊆ C⁰(ℝ^{d−1}) is a class of functions containing the possible edge functions f_k and P_ε is a class of univariate probability densities containing the possible error densities p_{ε_k}. The joint density of an ANM is of the form (2.1) and thus Markov with respect to its DAG D. Vice versa, we say that D induces a class of joint densities P on X_1, ..., X_d from a functional model class M, where

P = \Big\{ \prod_{k=1}^d p_k\big(x_k - f_k(\{x_l\}_{l \in \mathrm{pa}_D(k)})\big) : f_k \in F, \; p_k \in P_\epsilon \Big\}.  (2.3)

Thus P contains all joint densities that can be generated by ANMs from class M with DAG D. The class M is said to be identifiable if the intersection of any two density classes P_1, P_2 induced by distinct graphs D_1, D_2 only contains densities for which there exists a unique graph that is causally minimal. We assume throughout the paper that the data generating process is an ANM with associated causally minimal DAG D_0, with induced density class P^0 and true joint density p^0 ∈ P^0. Causal minimality here essentially means that every edge in D creates a dependency in the joint distribution (i.e. there is an edge from X_l to X_k only if f_k is not constant in x_l). For the density class, we often consider the weighted Sobolev space of functions W^s_r(ℝ^n, ⟨x⟩_β), which is defined as follows:

W^s_r(ℝ^n, ⟨x⟩_β) := { f ∈ L_r(ℝ^n) : D^α(f · ⟨x⟩_β) ∈ L_r(ℝ^n) ∀ |α| ≤ s },

where ⟨x⟩_β = (1 + |x|²)^{β/2} is a polynomial weighting function parametrized by β ∈ ℝ, D^α is the partial derivative operator according to the multi-index α, and r, s are integers at least 1. Note that for β = 0 this is the usual Sobolev space, while for β > 0 it is more restrictive (as the tails get bigger weights), and for β < 0 it is less restrictive. We will mostly be interested in the case β < 0.

⁵ Here we implicitly assume that the model has additive noise.

Penalized maximum likelihood estimation

We now describe our method to learn the true causal structure from data. Suppose we measure d variables, and we have n i.i.d. samples {x^j_k} with j = 1, ..., n and k = 1, ..., d. Let D_1, ..., D_N be the candidate DAGs under consideration⁶ and P^1, ..., P^N their induced density classes for some model class M. If M is identifiable, we aim to infer the true DAG D_0 by finding the density class P^0 that contains the true joint density p^0 (if there is more than one such class, we choose the one corresponding to the smallest graph). Of course, we do not know p^0; instead we estimate it by computing best representatives \hat p^i_n from each class P^i. These are chosen via nonparametric maximum likelihood:

\hat p^i_n = argmax_{p ∈ P^i} Σ_{j=1}^n log p(x^j_1, ..., x^j_d).

Then each model is scored with a penalized log-likelihood:

S^i_n = (1/n) Σ_{j=1}^n log \hat p^i_n(x^j_1, ..., x^j_d) − #(edges)_i · a_n,  (2.4)

where a_n controls the strength of the penalty. Taking the maximum over these scores, we get the estimator \hat D_n = D_{\hat I_n}, where

\hat I_n = argmax_{i=1,...,N} S^i_n.

We will show in Section 2.3 that this procedure is consistent for a_n proportional to 1/log n and that therefore \hat D_n = D_0 in the large sample limit. The question arises how to find the maximum likelihood estimators \hat p^i_n in each class in this nonparametric setting. We present here an exemplary procedure that has proved useful in practice. To estimate the edge functions of the SEM, we employ a nonparametric regression method. The error densities are then inferred from the residuals using a

⁶ E.g. all DAGs with d nodes.

density estimation method. The estimated joint density is finally given by the product of the residual densities, in accordance with (2.3). This gives the following three-step procedure for each DAG D_i:

1. For each node k estimate the residuals \hat ε_k by nonparametrically regressing X_k on {X_l}_{l ∈ pa_{D_i}(k)}. If pa_{D_i}(k) = ∅, set \hat ε_k = x_k.
2. For each node k estimate the residual density \hat p_{ε_k} from the estimated residuals \hat ε_k.
3. Compute the penalized likelihood score

   S^i_n = (1/n) Σ_{j=1}^n Σ_{k=1}^d log \hat p_{ε_k}(\hat ε^j_k) − #(edges)_i · a_n.

Of course, an exhaustive search over all DAGs is only feasible for small values of d, since the number of DAGs grows super-exponentially with the number of vertices⁷ and nonparametric regression in d dimensions is in general ill-posed without structural constraints, due to the curse of dimensionality⁸. The methods used in steps 1 and 2 should be chosen depending on the model class M. Examples are (generalized) additive model regression (GAM) for step 1 and kernel density estimation for step 2. As an illustration we look at the two-dimensional case, where there are only two variables X_1 and X_2. There are three DAGs, inducing the

⁷ The first few values of the number of DAGs N(d) with d nodes are N(2) = 3, N(3) = 25, N(4) = 543, N(5) = 29281, N(6) = 3781503, for example.
⁸ The latter problem can be dealt with in certain cases, e.g. additive models, where the edge functions are additive in the parental variables.

following models:

D_1: X_1 → X_2
  X_1 = ε_1
  X_2 = f(X_1) + ε_2
  p_1(x_1, x_2) = p_{X_1}(x_1) · p_{X_2|X_1}(x_2 | x_1) = p_{ε_1}(x_1) · p_{ε_2}(x_2 − f(x_1))

D_2: X_1 ← X_2
  X_1 = g(X_2) + ε_1
  X_2 = ε_2
  p_2(x_1, x_2) = p_{X_1|X_2}(x_1 | x_2) · p_{X_2}(x_2) = p_{ε_1}(x_1 − g(x_2)) · p_{ε_2}(x_2)

D_3: X_1, X_2 (no edge)
  X_1 = ε_1
  X_2 = ε_2
  p_3(x_1, x_2) = p_{X_1}(x_1) · p_{X_2}(x_2) = p_{ε_1}(x_1) · p_{ε_2}(x_2)

We do steps 1, 2, and 3 as described above and choose the model with the highest penalized log-likelihood score. Comparing this score-based approach with independence-test-based methods, the main difference occurs at step 2, where we estimate the residual densities instead of testing their independence. In terms of complexity, we swap one d-dimensional independence test against d univariate density estimations. Simulations show that this is faster by a factor on the order of 100 with current implementations. However, even though we do not test residual independence directly, it is still the discriminatory property by which to identify the true model. By constructing the densities according to (2.3), we enforce the error terms to be independent in the estimated joint density. If they are not actually independent, the considered model will obtain a poor score. Thus, we are searching for the best fitting densities where the errors are independent.

2.3. Theoretical Results

We now show that our method is consistent, i.e. that it will identify the true underlying DAG given enough samples. In the following, P^D

denotes the induced density class of DAG D. We make the following assumptions:

(1) Identifiability: The data {x^j_k}_{k=1,...,d; j=1,...,n} are i.i.d. realizations (over j = 1, ..., n) of an identifiable structural equation model with induced d-dimensional DAG D_0. In particular, the SEM can be the additive noise model (2.2) with nonlinear edge functions f_k or non-Gaussian noise variables⁹ ε_k for all k = 1, ..., d (Peters et al., 2011b, Lemma 1). There are no hidden variables, i.e. the noise terms are jointly independent.

(2) Causal Minimality: There is no proper subgraph D' of D_0 s.t. p^0 is Markov with respect to D'.

(3) Smoothness of log-densities: For all DAGs D the log-densities of P^D (restricted to their respective support) are elements of a bounded weighted Sobolev space. That is, there exist r ≥ 1, s > d, β < 0, C > 0 s.t.

‖D^α(⟨x⟩_β · 1{p > 0} log p)‖_r < C  ∀ p ∈ P^D, |α| ≤ s,

where ‖·‖_r is the usual L_r-norm.

(4) Moment condition for densities: For all DAGs D there exists γ > s − d/r s.t.

‖p ⟨x⟩_{γ−β}‖_r < ∞  ∀ p ∈ P^D,

where r, s, d, and β are determined by (3).

(5) Uniformly bounded variance of log-densities: For all DAGs D and every p^0 ∈ P^D there exists K > 0 s.t.

sup_{p ∈ P^D} var_{p^0}(log p(X_1, ..., X_d)) < K.

(6) Closedness of density classes: For all DAGs D the induced density class P^D is a closed set, with the topology given by the Kullback–Leibler (KL) divergence

D_KL(p(x) ‖ q(x)) = ∫ p(x) log(p(x)/q(x)) dx.

⁹ Excluding a set of exceptions of measure zero (Hoyer et al., 2009, Theorem 1).
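The KL divergence appearing in (6) can be checked on a simple case. The following sketch is my own sanity check, not from the text: for two unit-variance Gaussians with means 0 and μ, the divergence equals μ²/2, which a direct numerical integration reproduces.

```python
import numpy as np

def kl_divergence(p, q, x):
    """Numerically approximate D_KL(p || q) = \int p log(p/q) dx on a fine grid x."""
    px, qx = p(x), q(x)
    return np.sum(px * np.log(px / qx)) * (x[1] - x[0])

def gauss(mu):
    """Unit-variance Gaussian density with mean mu."""
    return lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

x = np.linspace(-12.0, 12.0, 20001)
kl = kl_divergence(gauss(0.0), gauss(0.5), x)   # analytic value: 0.5**2 / 2 = 0.125
```

The positivity of this quantity for p ≠ q is exactly what separates the true density class from the wrong ones in the consistency argument below.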

The first two assumptions concern the general model setup and ensure identifiability (i.e. non-overlapping induced density classes). (1) requires the data to come from an identifiable ANM due to nonlinearity or non-Gaussianity, as in Hoyer et al. (2009). (2) ensures there are no superfluous edges in the true DAG, i.e. the true model is the most parsimonious one fitting the data. The last four assumptions are technical and used to prove consistency of the penalized maximum likelihood estimator. (3) essentially requires the log-densities to be smooth. (4) requires the densities to have some (at least fractional) finite moments. (5) requires the log-densities, for every underlying density p^0, to have uniformly bounded second moments. Finally, (6) guarantees the existence of the maximizers of the likelihood and the negative information entropy in each class. Furthermore, it is needed to ensure the true density p^0 has positive KL distance from all wrong density classes. Note that the latter statement alone would suffice to show consistency, since all statements can be written in terms of the suprema of likelihood and negative entropy, instead of their actual maximizers. However, for better comprehensibility we chose the present formulation with the slightly stronger assumption. Making these assumptions, the penalized maximum likelihood estimator is consistent. We show this by proving that the probability of the true model obtaining a smaller score than any other model vanishes in the large sample limit.

Theorem 1. Assume (1)–(6). Let S^i_n be the penalized likelihood score of DAG D_i, given by

S^i_n = (1/n) Σ_{j=1}^n log \hat p^i_n(x^j_1, ..., x^j_d) − #(edges)_i · a_n,

where #(edges)_i is the number of edges in DAG D_i, and a_n = 1/log n. Denote by i_0 the index of the true DAG D_0 = D_{i_0}. Then we have

P(S^{i_0}_n ≤ S^i_n) → 0 as n → ∞, ∀ i ≠ i_0.

The proof relies on entropy methods and is presented in the appendix. In practice the 1/log n penalty rate might be too large.
We used a_n = 1/√n for some simulations in Section 2.4 (where the noise is Gaussian), which led to reasonably good performance for finite sample size n = 300. Moreover, under stronger assumptions we have:

Remark 1. When replacing (5) with the stronger assumption of sub-exponential tails of log p(X_1, ..., X_d), we can improve the penalty rate a_n in Theorem 1 from 1/log n to c·n^{−1/(2+d/s)}, for some c > 0 sufficiently large.

2.4. Numerical Results

In this section we present simulation results to show that our method works under controlled conditions. In each case, the data generating process is an additive noise model with acyclic graph structure. We first reproduce some results from an earlier paper by Hoyer et al. (2009), where the model involves just two variables and is parametrized by two parameters, controlling linearity and Gaussianity respectively. Then we extend this setup to a slightly more general class of models. Finally, we look at cases with more than two variables. In our implementation we use (generalized) additive model regression (GAM, see Hastie and Tibshirani (1986)) or local polynomial regression (LOESS, see Cleveland (1979)) for step 1 and logspline density estimation (see Kooperberg and Stone (1991)) or kernel density estimation for step 2. For models with more than two variables, penalization becomes important. We used a factor of a_n = 1/√n instead of the very severe 1/log n. This can be justified since in the relevant simulations the noise is Gaussian and the log-densities can be assumed to be sub-exponential. In this case the faster rate can be used (see Remark 1). All computations were carried out in the statistical computing language R (using packages mgcv and logspline), and the code is available on request from the authors.

2.4.1. Identifiability depending on Linearity and Gaussianity

Hoyer et al. (2009) illustrate their method with a two-dimensional ANM

of the form

X_1 = ε_1
X_2 = X_1 + b·X_1³ + ε_2

with the parameter b ranging from −1 to 1, thus controlling the linearity of the model. The noise terms ε_1, ε_2 are transformed normal random variables:

ε_k = sgn(ν_k)·|ν_k|^q,  ν_k ~ iid N(0, 1),

where the parameter q ranges from 0.5 to 2 and thus controls Gaussianity. The true direction M_1: X_1 → X_2 cannot be identified with traditional methods (e.g. the PC algorithm), since the backwards model M_2: X_1 ← X_2 entails precisely the same conditional independence relations (none) and thus belongs to the same Markov equivalence class. If b = 0 and q = 1, there exists a backwards model entailing the same joint density. As soon as we move away from this point, however, the model becomes identifiable (Hoyer et al., 2009). We confirm this numerically, showing that our method performs as expected in this setting. We discretize the parameter space (b, q) ∈ [−1, 1] × [0.5, 2], and for each grid point we repeat the simulation 1000 times, with n = 300 samples per trial. We then count the number of times the backwards model gets wrongly chosen by the method¹⁰, and this false decision rate serves as our measure of the quality of the method. As can be seen in Figure 2.1, the false decision rate peaks around (b, q) = (0, 1) with around 50% wrong decisions, corresponding to random guessing. Away from this region it quickly drops to zero. In this setting the regressions were done using LOESS and the density estimations using logsplines.

2.4.2. Random Edge Functions

We now generalise the setup of the scenario from Section 2.4.1 by allowing a bigger function class for the edge function. Specifically, we randomly generate functions by sampling a random path from a

¹⁰ I.e. when the likelihood score of the backwards model is lower than that of the forwards model.

Figure 2.1.: False decision rates for a two-dimensional ANM with two parameters b and q, controlling linearity and Gaussianity (n = 300). For b = 0 the model is linear, for q = 1 the noise is Gaussian. Panels: (a) full b–q grid; (b) b fixed; (c) q fixed.

Wiener process and smoothing it with cubic splines¹¹. To measure their nonlinearity we use the normalised L₂-difference between the function and its best linear approximation on the interval [−1, 1], as described in Emancipator and Kroll (1993). A number of randomly generated functions with different nonlinearity values are shown in Figure 2.2. We again choose a uniform grid of nonlinearity values (in the interval [0, 0.4]) and, for each grid point, generate 100 random functions. With each function we perform 100 simulations and average the results. The noise is standard Gaussian in this setting. In Figure 2.2 we see the results for a small sample (n = 300) and a large sample (n = 1500) case. The findings are analogous to the simple cubic model: the false decision rate decreases with the nonlinearity of the edge function and the sample size. Again, the regressions were done using LOESS and the density estimations using logsplines.

¹¹ A Wiener path (random normal increments) is sampled on a 1000-point grid spanning [−1, 1] and the resulting vector rescaled to an interval of length 2 and subsequently smoothed using cubic splines. The resulting functions are linear outside [−1, 1] and nonlinear inside.
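The random-function construction of footnote 11 and the nonlinearity measure can be sketched as follows. This is a simplified stand-in: the smoothing here is a moving average rather than the cubic splines of the text, and the normalisation of the Emancipator–Kroll measure is my assumption, so the exact values are illustrative only:

```python
import numpy as np

def random_edge_function(rng, grid_size=1000, window=101):
    """Sample a Wiener path on [-1, 1] and smooth it (moving-average
    stand-in for the cubic-spline smoothing described in footnote 11)."""
    x = np.linspace(-1.0, 1.0, grid_size)
    path = np.cumsum(rng.standard_normal(grid_size)) / np.sqrt(grid_size)
    kernel = np.ones(window) / window
    return x, np.convolve(path, kernel, mode="same")

def nonlinearity(x, y):
    """Normalised L2-distance between y(x) and its best linear fit on [-1, 1]."""
    slope, intercept = np.polyfit(x, y, deg=1)   # best linear approximation
    resid = y - (slope * x + intercept)
    return np.sqrt(np.mean(resid**2)) / (np.sqrt(np.mean(y**2)) + 1e-12)

rng = np.random.default_rng(1)
x, f = random_edge_function(rng)
s = nonlinearity(x, f)
```

A perfectly linear function has nonlinearity zero under this measure, which is why the false decision rate in Figure 2.2 is highest near s = 0.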

Figure 2.2.: (a) False decision rates with randomly sampled edge functions and Gaussian noise decrease with the nonlinearity of the functions (n = 300 and n = 1500). (b) Examples of randomly generated functions, where the parameter s controls nonlinearity (s = 0, 0.1, 0.2, 0.3, 0.4).

2.4.3. Larger Networks and Thresholding

In a practical situation the reliability of any method invariably depends on whether its assumptions are met, as well as some other factors. In our case this would include the nonlinearity of the edge functions, the non-Gaussianity of the noise, the sample size, and the number of nodes. It would be desirable to have some criterion indicating that there is insufficient information to make a decision. While this is hard to make concrete, a good first heuristic seems to be the separation of the best-scoring model from the rest. Concretely, we look at the ratio of the smallest (Δ_1) and the largest (Δ_2) score difference (see Figure 2.3b). If this ratio is smaller than some threshold t, we make no decision (no selection of a model). The effect of this can be seen in Figure 2.3a. Starting from a full DAG with 3 nodes as the ground truth, we randomly generate 100 different sets of nonlinear¹² edge functions, and for each set of edge functions we generate 100 data sets with standard Gaussian noise of sample size n = 300. With each data set we run an exhaustive search over all

¹² With nonlinearity values in [0.39, 0.4].

Figure 2.3.: (a) Structural Hamming distance between the best-scoring DAG and the ground truth for a 3-node simulation with (t = 0.01) and without (t = 0) thresholding (categories: correct DAG, SHD = 1, SHD = 2, SHD = 3, no decision). (b) Illustration of thresholding for a single simulation run. Let s_1, ..., s_N be the ordered scores, with s_1 the best. Then Δ_1 = s_1 − s_2 and Δ_2 = s_1 − s_N.

candidate models and, if making a decision after thresholding, compute the structural Hamming distance (SHD) between the best-scoring DAG and the ground truth. Comparing the thresholds t = 0 and t = 0.01, the false decision rate falls from 3.9% to 2.4%, while in 3.1% of the cases no decision is made. We also look at two simulation settings suggested in Peters et al. (2011b), where the graph consists of 4 nodes and the edge functions are nonlinear but parametrized by 4 and 5 parameters respectively. In both cases, nonlinear1 and nonlinear2, 100 sets of parameters are drawn from a uniform distribution and then data (with a sample size of n = 400) is generated. Our method identifies the correct DAG in 96 / 97 out of the 100 cases for nonlinear1/2 (in the other cases, there is one additional edge). This clearly improves upon the results reported in Peters et al. (2011b) (86 correct decisions in both cases). In all of these multivariate settings, we used GAM for regression and logsplines for density estimation.
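The thresholding rule can be sketched as a small helper. The orientation of the score gaps follows my reading of Figure 2.3b (Δ_1 the gap between the best and second-best score, Δ_2 the gap between the best and worst), so treat the details as an assumption:

```python
def decide(scores, t):
    """Return the index of the best-scoring model, or None if the best
    score is not separated enough from the rest (no decision).

    delta1 = gap between best and second-best score,
    delta2 = gap between best and worst score; decide iff delta1/delta2 >= t.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    s = [scores[i] for i in order]
    delta1, delta2 = s[0] - s[1], s[0] - s[-1]
    if delta2 == 0 or delta1 / delta2 < t:
        return None          # scores too close together: make no decision
    return order[0]
```

For example, with log-likelihood scores [-1.0, -1.01, -5.0] and t = 0.01 the runner-up is too close and no decision is made, whereas [-1.0, -3.0, -5.0] yields model 0.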

2.5. Real Data

To determine the performance on real-world datasets, we apply our method to so-called cause-effect pairs. These are bivariate datasets where the true causal direction is known. An example would be the altitude and the average temperature of weather stations. Mooij and Janzing (2010) describe 8 such pairs and compare several methods that were submitted as part of the Causality Pot-Luck Challenge. Our method identifies 7 out of the 8 pairs correctly¹³, thus beating all other compared methods except Zhang and Hyvärinen (2010), who take into account post-nonlinear additive noise. We next consider the extended collection of cause-effect pairs, which is available online. This currently comprises 86 datasets, 81 of which are bivariate. Using our method on these 81 bivariate datasets, we identify the true model in 66% of the cases¹⁴. In Janzing et al. (2012) a subset of these datasets was used to compare various causal inference methods. Running our method on those datasets, it compares well with the other methods (see Table 2.1), being slightly better than independence testing (AN) and outperforming the LiNGAM method. In both of these settings we used LOESS and kernel density estimation.

2.6. Conclusions

We presented a new fully nonparametric likelihood score-based method for causal inference in nonlinear or non-Gaussian ANMs. We proved consistency of the penalized maximum likelihood estimator for finding the correct model. We showed via simulation studies that our method works well in practice when the ground truth is an ANM with sufficiently nonlinear edge functions or non-Gaussian error terms. Our method compares favourably to other causal inference procedures on both simulated and real-world data.

¹³ This corresponds to a small p-value under the random guessing null hypothesis.
¹⁴ This corresponds to a small p-value under the random guessing null hypothesis.
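The significance of identifying 7 out of 8 pairs can be recomputed as an exact one-sided binomial tail under random guessing. This is my own check, not a value quoted from the text:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# 7 or more correct out of 8 under fair coin flips:
p_value = binom_tail(8, 7)   # (C(8,7) + C(8,8)) / 2**8 = 9/256
```

The same computation with n = 81 and a 66% success rate gives an even smaller tail probability, consistent with the claim that the method performs well above chance on the extended collection.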

Method    SCL   AN    LiNGAM   PNL   IGCI   GPI
Accuracy  66%   63%   58%      68%   75%    70%

Table 2.1.: Success rates of different causal inference methods on cause-effect pairs at a decision rate of 100%. SCL = Score-based Causal Learning (our method), AN = Additive Noise with independence testing, PNL = Post-Nonlinear, IGCI = Information-Geometric Causal Inference, GPI = Gaussian Process Inference. All values except SCL taken from Janzing et al. (2012). All datasets were subsampled three times (if n > 500), and the results were averaged.

As a major open challenge, the current approach of exhaustively searching through the whole model space becomes computationally infeasible for more than a handful of variables. Since our method is score-based and the scoring criterion is local (i.e., decomposable), it is straightforward to implement a greedy algorithm, although there will be no guarantee of finding a global optimum.
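A greedy variant of the search exploiting the decomposable score can be sketched as follows. For brevity this stand-in scores each node with a simple penalized linear-Gaussian log-likelihood instead of the nonparametric score of Section 2.2, and only considers single-edge additions and deletions (no reversals); it illustrates the idea, not the thesis's estimator:

```python
import numpy as np

def local_score(X, k, parents, a):
    """Penalized Gaussian log-likelihood (per sample) of node k given its
    parents; a linear-regression stand-in for the nonparametric fit."""
    n, y = X.shape[0], X[:, k]
    if parents:
        Z = np.column_stack([X[:, sorted(parents)], np.ones(n)])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    else:
        resid = y - y.mean()
    var = resid.var() + 1e-12
    loglik = -0.5 * n * (np.log(2 * np.pi * var) + 1.0)
    return loglik / n - a * len(parents)   # edge penalty as in (2.4)

def is_acyclic(parents):
    seen, done = set(), set()
    def visit(v):
        if v in done: return True
        if v in seen: return False         # back edge found -> cycle
        seen.add(v)
        ok = all(visit(u) for u in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents)

def greedy_search(X, a=0.05):
    """Hill-climb over DAGs with single-edge additions/deletions,
    re-scoring only the affected node (decomposability)."""
    d = X.shape[1]
    parents = {k: set() for k in range(d)}
    scores = {k: local_score(X, k, parents[k], a) for k in range(d)}
    while True:
        best = None
        for u in range(d):
            for v in range(d):
                if u == v:
                    continue
                cand = {k: set(s) for k, s in parents.items()}
                if u in cand[v]:
                    cand[v].discard(u)     # try deleting edge u -> v
                else:
                    cand[v].add(u)         # try adding edge u -> v
                if not is_acyclic(cand):
                    continue
                gain = local_score(X, v, cand[v], a) - scores[v]
                if best is None or gain > best[0]:
                    best = (gain, v, cand[v])
        if best is None or best[0] <= 1e-9:
            return parents                 # local optimum reached
        _, v, pv = best
        parents[v] = pv
        scores[v] = local_score(X, v, pv, a)
```

Note that for linear-Gaussian data this local score cannot orient edges (both directions score equally in population), which is precisely the non-identifiable case discussed in Section 2.1; the nonparametric score is what breaks the tie.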

Appendix 2.A. Consistency Proof

The proof heavily relies on entropy methods and empirical process theory. For a good overview of the necessary material we refer to van de Geer (2000) or van der Vaart (1998). For an overview of Sobolev and related function spaces we refer to Triebel (1983). Throughout this section we will adopt the following notation for taking expectations of some random variable f with respect to a distribution Q (following van de Geer (2000)):

Qf := ∫ f dQ.

In particular, this means we will write expectations and means as

Pf = E[f(X)],   P_n f = (1/n) Σ_{j=1}^n f(X^j),

where P is the true distribution with density p^0, f : ℝ^d → ℝ is some function, X is a vector of random variables (one corresponding to each node) with distribution P, {X^j}_{j=1,...,n} are independent copies of X, and P_n is the empirical distribution (placing weight 1/n on each X^j). With this notation we can write the maximum likelihood estimator \hat p^i_n and the entropy minimizer p^i in class P^i (which exist by assumption (6) but need not be unique) as:

\hat p^i_n = argmax_{p ∈ P^i} P_n log p,  (2.5)
p^i = argmax_{p ∈ P^i} P log p.  (2.6)

Note that the true density p^0 minimizes the information entropy over the complete density space ⋃_{i=1}^N P^i, since the Kullback–Leibler divergence P log(p^0/p) is positive for all densities p ≠ p^0.
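The notation P_n f versus Pf can be illustrated numerically: for the standard normal, P log p is the negative differential entropy −½ log(2πe), and the empirical mean P_n log p approaches it as n grows. This small check is my own illustration, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)                  # i.i.d. sample from p^0 = N(0, 1)
log_p = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)     # log p^0 evaluated at the sample
Pn_log_p = log_p.mean()                           # empirical expectation P_n log p^0
P_log_p = -0.5 * np.log(2 * np.pi * np.e)         # true expectation P log p^0
```

The convergence P_n log p → P log p, made uniform over whole density classes, is exactly the ULLN established next.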

One of the building blocks of the proof of Theorem 1 is a uniform law of large numbers (ULLN) for the classes of log-densities:

sup_{p ∈ P^i} |(P_n − P) log p| → 0 in probability as n → ∞, ∀ i.

To show this, an entropy argument is used. We first define the bracketing entropy of a function space. Let G be a set of functions from ℝ^d to ℝ. Two functions g^L, g^U : ℝ^d → ℝ (not necessarily in G) form an ε-bracket for some g ∈ G if g^L ≤ g ≤ g^U and ‖g^L − g^U‖_{1,µ} < ε, where ‖·‖_{1,µ} is the weighted L_1-norm, i.e. ‖f‖_{1,µ} = ∫ |f(x)| µ(x) dx. Suppose {g^L_i, g^U_i}_{i=1,...,N_[]} is the smallest set s.t. for every g ∈ G there is an i s.t. g^L_i, g^U_i form an ε-bracket for g, where N_[] denotes the number of such pairs. Then

H_[](ε, G, ‖·‖_{1,µ}) := log N_[]

is called the bracketing entropy of G. The following result connects the bracketing entropy H_[](ε, G, ‖·‖_{1,p^0}) with respect to the L_1-norm weighted with the true density p^0 and the uniform convergence of the empirical process (P_n − P)g. Note that here and throughout this section we use the notation "a(ε) ≲ b(ε)" as shorthand for "a(ε) ≤ c·b(ε) ∀ ε > 0 for some constant c not depending on ε".

Lemma 1. Suppose that:

(i) ∃ 0 ≤ α < 1 s.t. H_[](ε, G, ‖·‖_{1,p^0}) ≲ ε^{−α} ∀ ε > 0, and
(ii) ∃ K s.t. var(g(X_1, ..., X_d)) < K ∀ g ∈ G.

Then G satisfies the ULLN:

P( sup_{g ∈ G} |(P_n − P)g| > δ_n ) → 0 as n → ∞,

where δ_n = c/log n for some c > 0.

Proof. We first show that it suffices to look at the supremum over the bracketing functions. Let g ∈ G and g^L_i, g^U_i be its δ_n-brackets. We then have

(P_n − P)g < (P_n − P)g^U_i + δ_n  and  (P_n − P)g > (P_n − P)g^L_i − δ_n.

So we have

|(P_n − P)g| < max( |(P_n − P)g^L_i|, |(P_n − P)g^U_i| ) + δ_n

and hence

sup_{g ∈ G} |(P_n − P)g| < max_{g ∈ {g^L_i, g^U_i}_i} |(P_n − P)g| + δ_n.

Now

P( sup_{g ∈ G} |(P_n − P)g| > 2δ_n ) ≤ P( max_{g ∈ {g^L_i, g^U_i}_i} |(P_n − P)g| > δ_n )
  ≤ 2 N_[](δ_n) · max_{g ∈ {g^L_i, g^U_i}_i} P( |(P_n − P)g| > δ_n )
  ≲ exp(δ_n^{−α}) · K/(n δ_n²),  (2.7)

where the last line follows from Chebyshev's inequality. Substituting for δ_n gives

P(...) ≲ log² n · exp(c^{−α} log^α n − log n) → 0 as n → ∞.

Note that if we replace condition (ii) with the assumption that the g(X_1, ..., X_d) are sub-exponential (as in Remark 1), we apply the sub-exponential tail bound (see Bühlmann and van de Geer (2011, Lemma 14.9) for example) instead of Chebyshev's inequality and obtain exp(δ_n^{−α} − n δ_n² · const.) instead of (2.7), which converges to zero for δ_n = c·n^{−1/(2+α)}, for c > 0 sufficiently large.

Lemma 1 shows that a sufficient condition for the ULLN is finite bracketing entropy. To this end, we make use of the following result:

Lemma 2 (Nickl and Pötscher (2007, Theorem 1)). Suppose G is a (non-empty) bounded subset of the weighted Sobolev space W^s_p(ℝ^d, ⟨x⟩_β) for some β < 0. Suppose there exists γ > s − d/p > 0 s.t. the moment condition

‖⟨x⟩_{γ−β}‖_{1,µ} = ∫ µ(x) ⟨x⟩_{γ−β} dx < ∞

holds for some Borel measure µ on R^d. Then:

    H_[](ɛ, G, ||·||_{1,µ}) ≲ ɛ^{−d/s}.

The relevant sets of functions G in this context are the log-densities of each class, i.e. {1{p > 0} log p : p ∈ P_i}, with the relevant Borel measure µ being the true density p^0.

Essentially, the idea of the proof of Theorem 1 is to show that the maximum log-likelihood in each induced density class converges to the minimal entropy. For non-overlapping models (e.g. X_1 → X_2 and X_2 → X_1), the minimal entropy will be different in each class (with the minimum occurring in the true model class), and the likelihood will eventually pick up on this difference. Since the penalty term vanishes asymptotically, an ever so small difference in entropy will differentiate the true model class from the others. For overlapping (e.g. hierarchical) models, the minimal entropy can occur in more than one class. In this case the penalty term picks out the most parsimonious model (which is the true model according to the Causal Minimality assumption).

Note that the penalty 1/log n is quite large compared with e.g. the BIC penalty (log n / n). This is due to the slow convergence of maximum likelihood to minimal entropy (Lemmas 3 and 1). If the penalty vanishes too quickly, it will be drowned out by the noise in the likelihood and have no effect. The convergence can be improved (and thus the penalty relaxed) when making stronger assumptions on the distributions, e.g. sub-Gaussian tails.

The following lemma shows convergence of the maximum log-likelihood to the minimal entropy in each class, given that a ULLN holds.

Lemma 3. Suppose that a ULLN for the classes log P_i holds with convergence rate δ_n, i.e.

    P( sup_{p ∈ P_i} |(P_n − P)(1{p > 0} log p)| > δ_n ) → 0  as n → ∞.

Then

    P( |P_n log p̂_n^i − P log p̄^i| > δ_n ) → 0  as n → ∞.

2.A. Consistency Proof

Proof. By the definition of the MLE (2.5) we have:

    P_n log p̂_n^i ≥ P_n log p̄^i = P log p̄^i + (P_n − P) log p̄^i,

i.e.

    P_n log p̂_n^i − P log p̄^i ≥ (P_n − P) log p̄^i.    (2.8)

Let P̃_n^i be the restriction of P_i to densities whose support contains the data, i.e. P̃_n^i = {p ∈ P_i : supp(p) ⊇ {X_1, ..., X_n}}. Note that the maximum log-likelihood as well as the minimum entropy are the same over P_i and P̃_n^i, since densities with support not including the data will yield values of −∞. So we also have:

    P_n log p̂_n^i = max_{p ∈ P_i} P_n log p = max_{p ∈ P̃_n^i} P_n log p
                  = max_{p ∈ P̃_n^i} ( P log p + (P_n − P) log p )
                  ≤ P log p̄^i + sup_{p ∈ P̃_n^i} (P_n − P) log p,

i.e.

    P_n log p̂_n^i − P log p̄^i ≤ sup_{p ∈ P̃_n^i} (P_n − P) log p.

This together with (2.8) yields:

    |P_n log p̂_n^i − P log p̄^i| ≤ max( |(P_n − P) log p̄^i|, |sup_{p ∈ P̃_n^i} (P_n − P) log p| )
                                 ≤ sup_{p ∈ P_i} |(P_n − P)(1{p > 0} log p)|.

We thus have:

    P( |P_n log p̂_n^i − P log p̄^i| > δ_n ) ≤ P( sup_{p ∈ P_i} |(P_n − P)(1{p > 0} log p)| > δ_n ),

which converges to zero as n → ∞ by assumption.

Finally, before proving Theorem 1, we show the following useful lemma.

Lemma 4. Let a, b, a′, b′ ∈ R and ɛ > 0. If one of the following holds:

1. a − b > ɛ and a′ − b′ ≤ 0, or

2. a − b < ɛ and a′ − b′ ≥ 2ɛ,

then we have |a − a′| > ɛ/2 or |b − b′| > ɛ/2.

Proof. Assume (i). Then we have

    ɛ = ɛ − 0 ≤ (a − b) + (b′ − a′) = (a − a′) − (b − b′) ≤ |a − a′| + |b − b′|,

and the result follows. Similarly for (ii):

    ɛ = 2ɛ − ɛ ≤ (a′ − b′) + (b − a) = (a′ − a) − (b′ − b) ≤ |a′ − a| + |b′ − b|.

We can now prove the main theorem.

Proof of Theorem 1. We will make repeated use of Lemma 3. For that matter, note that assumptions (3), (4), and (5), together with Lemmas 1 and 2 (taking µ = p^0), satisfy the sufficient conditions. Assumption (6) ensures the existence of p̂_n^i, p̄^i as defined in (2.5) and (2.6). Let i ≠ i_0. We differentiate two cases: (i) where P_i includes the true density p^0 and (ii) where it does not. Let

    δ_n = (#(edges)_i − #(edges)_{i_0}) · 1/log n

denote the difference of the penalties in the two scores.

Case (i). Here p^0 ∈ P_i, which implies p̄^i = p^0. Assumptions (1) and (2) together with Theorem 2 in Peters et al. (2011b) guarantee identifiability of the true graph. In particular this means that in this case P_i must correspond to a graph containing the true graph. Hence

#(edges)_i > #(edges)_{i_0}, i.e. δ_n > 0. We then have:

    P( S_n^{i_0} ≤ S_n^i ) ≤ P( P_n log p̂_n^i − P_n log p̂_n^{i_0} > δ_n/2 )
        ≤ P( |P_n log p̂_n^{i_0} − P log p^0| > δ_n/4  or  |P_n log p̂_n^i − P log p̄^i| > δ_n/4 )
        ≤ P( |P_n log p̂_n^{i_0} − P log p^0| > δ_n/4 ) + P( |P_n log p̂_n^i − P log p̄^i| > δ_n/4 )
        → 0  as n → ∞,

where the second line follows from p̄^i = p^0 and Lemma 4 (first case), and the convergence in the last line follows from Lemma 3.

Case (ii). Here p^0 ∉ P_i, which implies P log p^0 > P log p̄^i. Hence there exists δ > 0 such that P log p^0 > P log p̄^i + 4δ. Let N > 0 be such that #(edges)_{i_0} / log n < δ for all n ≥ N. Then we have for n ≥ N:

    P( S_n^{i_0} ≤ S_n^i ) = P( P_n log p̂_n^{i_0} − P_n log p̂_n^i ≤ −δ_n )
        ≤ P( P_n log p̂_n^{i_0} − P_n log p̂_n^i < δ )
        ≤ P( |P_n log p̂_n^{i_0} − P log p^0| > δ  or  |P_n log p̂_n^i − P log p̄^i| > δ )
        ≤ P( |P_n log p̂_n^{i_0} − P log p^0| > δ ) + P( |P_n log p̂_n^i − P log p̄^i| > δ )
        → 0  as n → ∞,

where the third line follows from Lemma 4 (second case), and the convergence in the last line follows again from Lemma 3.
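The convergence that drives Lemmas 1 and 3 — empirical averages of a log-density tending to the population quantity P log p, i.e. the negative entropy — can be illustrated numerically. The following sketch (not part of the proof) uses a fixed standard normal density, for which the negative entropy −(1/2) log(2πe) is known in closed form:

```python
import math
import random

# Illustration (not thesis code): for a fixed density p, the empirical
# average P_n log p converges to P log p, the negative entropy.
# Here p is the standard normal density, with entropy 0.5*log(2*pi*e).
def log_p(x):
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x

rng = random.Random(0)
n = 100_000
Pn_log_p = sum(log_p(rng.gauss(0, 1)) for _ in range(n)) / n
P_log_p = -0.5 * math.log(2 * math.pi * math.e)   # negative entropy

print(round(Pn_log_p, 3), round(P_log_p, 3))
```

For an estimated density the convergence is slower, which is exactly why the ULLN over the whole class P_i, and hence the large 1/log n penalty, is needed.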


Chapter 3. Structure Learning with Bow-free Acyclic Path Diagrams¹

We consider the problem of structure learning for bow-free acyclic path diagrams (BAPs). BAPs can be viewed as a generalization of linear Gaussian DAG models that allow for certain hidden variables. We present a first method for this problem using a greedy score-based search algorithm. We also investigate some distributional equivalence properties of BAPs, which are used in an algorithmic approach to compute (nearly) equivalent model structures, allowing us to infer lower bounds of causal effects. The application of our method to some datasets reveals that BAP models can represent the data much better than DAG models in these cases.

3.1. Introduction

We consider learning the causal structure among a set of variables from observational data. In general, the data can be modelled with a structural equation model (SEM) over the observed and unobserved variables, which expresses each variable as a function of its direct causes and a noise term, where the noise terms are assumed to be mutually independent. The structure of the SEM is visualized as a directed graph, with vertices representing variables and edges representing direct causal relationships. We assume the structure to be recursive

¹This chapter is heavily based on the preprint Nowzohour et al. (2015).

(acyclic), which results in a directed acyclic graph (DAG). DAGs can be understood as models of conditional independence, and many structure learning algorithms use this to find all DAGs which are compatible with the observed conditional independencies (Spirtes et al., 1993). Often, however, not all relevant variables are observed. The resulting marginal distribution over the observed variables might still satisfy some conditional independencies, but in general these will not have a DAG representation (Richardson and Spirtes, 2002). Also, there generally are additional constraints resulting from the marginalization of some of the variables (Evans, 2014; Shpitser et al., 2014).

In this paper we consider a model class which can accommodate certain hidden variables. Specifically, we assume that the graph over the observed variables is a bow-free acyclic path diagram (BAP). This means it can have directed as well as bidirected edges (with the directed part being acyclic), where the directed edges represent direct causal effects, and the bidirected edges represent hidden confounders. The bow-freeness condition means there cannot be both a directed and a bidirected edge between the same pair of variables. The BAP can be obtained from the underlying DAG over all (hidden and observed) variables via a latent projection operation (Pearl, 2009) (if the bow-freeness condition admits this). For many practical purposes, it is beneficial to consider this restricted class of hidden variable models. Such a restricted model class, if not heavily misspecified, results in a smaller distributional equivalence class, and estimation is expected to be more accurate than for more general hidden variable methods like FCI (Spirtes et al., 1993), RFCI (Colombo et al., 2012), or FCI+ (Claassen et al., 2013).

The goal of this paper is structure learning with BAPs, that is, finding the best set of BAPs given some observational data. Just like in other models, there is typically an equivalence class of BAPs that are statistically indistinguishable, so a meaningful structure search result should represent this equivalence class. We propose a penalized likelihood score that is greedily optimized, and a heuristic algorithm (supported by some theoretical results) for finding equivalent models once an optimum is found. This method is the first of its kind for BAP

models.

Example of a BAP

Consider the situation shown in Figure 3.1a, where we observe variables X_1, ..., X_4, but do not observe H_1. The only (conditional) independency over the observed variables is X_1 ⊥ X_3 | X_2, which is also represented in the corresponding BAP in Figure 3.1b. The parametrization of this BAP would be

    X_1 = ɛ_1
    X_2 = B_21 X_1 + ɛ_2
    X_3 = B_32 X_2 + ɛ_3
    X_4 = B_43 X_3 + ɛ_4

with (ɛ_1, ɛ_2, ɛ_3, ɛ_4)^T ~ N(0, Ω) and

        ( Ω_11   0     0     0   )
    Ω = (  0    Ω_22   0    Ω_24 )
        (  0     0    Ω_33   0   )
        (  0    Ω_24   0    Ω_44 )

Hence the model parameters in this case are B_21, B_32, B_43, Ω_11, Ω_22, Ω_33, Ω_44, and Ω_24. An example of a graph that is not a BAP is shown in Figure 3.1c.

Challenges

The main challenge, like with all structure search problems in graphical modelling, is the vastness of the model space. The number of BAPs grows super-exponentially. Exhaustively scoring all BAPs and finding the global score optimum is typically computationally infeasible. Another major challenge, specifically for our setting, is the fact that a graphical characterization of the (distributional) equivalence classes for BAP models is not yet known. In the (unconstrained) DAG case, for example, it is known that models are equivalent if and only if they

Figure 3.1.: (a) DAG with hidden variable H_1, (b) resulting BAP over the observed variables X_1, ..., X_4 with annotated edge weights, and (c) resulting graph if X_3 is also not observed, which is not a BAP.

share the same skeleton and v-structures (Verma and Pearl, 1991). A similar result is not known for BAPs (or the more general acyclic directed mixed graphs). This makes it hard to traverse the search space efficiently, since one cannot search over the equivalence classes (like the greedy equivalence search for DAGs (Chickering, 2002)). It also makes it difficult to evaluate simulation results, since the ground truth BAP and the found solution might not coincide yet be equivalent.

Contributions

We provide the first structure learning algorithm for BAPs. It is a score-based algorithm and uses greedy hill climbing to optimize a penalized likelihood score. By decomposing the score over the bidirected connected components of the graph and caching the score of each component, we are able to achieve a significant computational speedup. To mitigate the problem of local optima, we perform many random restarts of the greedy search.

We propose to approximate the distributional equivalence class of a BAP by using a greedy strategy for likelihood scoring. If two BAPs are similar with respect to their likelihoods within a tolerance, they should be treated as statistically indistinguishable and hence as belonging to the same class of (nearly) equivalent BAPs. Based on such greedily

computed (near) equivalence classes, we can then infer bounds of total causal effects, in the spirit of Maathuis et al. (2009, 2010). We present some theoretical results towards equivalence properties in BAP models, some of which generalize to acyclic path diagrams. We also provide a proof of Wright's path tracing formula that does not assume a proper model parametrization. Furthermore, we developed a Markov Chain Monte Carlo method for uniformly sampling BAPs based on ideas from Kuipers and Moffa (2015).

We obtain promising results on simulated data sets despite the challenges listed above. Comparing the maximal penalized likelihood scores between BAPs and DAGs on real datasets shows a clear advantage of BAP models.

Related Work

There are two main research communities that intersect at this topic. On the one side there are the path diagram models, going back to Wright (1934) and then being mainly developed in the behavioral sciences (Jöreskog, 1970; Duncan, 1975; Glymour and Scheines, 1986; Jöreskog, 2001). In this setting there is always a parametric model, usually assuming linear edge functions and Gaussian noise. In the most general formulation, the graph over the observed variables is assumed to be an acyclic directed mixed graph (ADMG), which can have bows. While in general the parameters for these models are not identified, Drton et al. (2011) give necessary and sufficient conditions for global identifiability. Complete necessary and sufficient conditions for the more useful almost everywhere identifiability remain unknown. BAP models are a useful subclass, since they are almost everywhere identified (Brito and Pearl, 2002). Drton et al. (2009) provided an algorithm, called residual iterative conditional fitting (RICF), for maximum likelihood estimation of the parameters for a given BAP.

On the other side there are the non-parametric hidden variable models, which are defined as marginalized DAG models (Pearl, 2009). The marginalized distributions are constrained by conditional independencies, as well as additional equality and inequality constraints (Evans,

2014). When just modelling the conditional independence constraints, the class of maximal ancestral graphs (MAGs) is sufficient (Richardson and Spirtes, 2002). Shpitser et al. (2014) have proposed the nested Markov model using ADMGs to also include the additional equality constraints. Finally, mDAGs can be used to model all resulting constraints (Evans, 2014). With each additional layer of constraints, learning the structure from data becomes more complicated, but at the same time more available information is utilized and a possibly more detailed structure can be learned.

Ancestral BAP models coincide with the Gaussian parametrization of MAGs (Richardson and Spirtes, 2002, Section 8). However, in general BAPs need to be neither maximal nor ancestral and can model additional constraints. They are easier to interpret as hidden variable models. This can be seen when comparing the BAP in Figure 3.1b with the corresponding MAG. The latter would have an additional edge between X_1 and X_4, since there is no (conditional) independency of these two variables. As can be verified, the BAP and the MAG in this example are not distributionally equivalent, since the former encodes additional non-independence constraints.

Structure search for MAGs can be done with the FCI (Spirtes et al., 1993), RFCI (Maathuis et al., 2009), or FCI+ (Claassen et al., 2013) algorithms. Silva and Ghahramani (2006) propose a fully Bayesian method for structure search in linear Gaussian ADMGs, sampling from the posterior distribution using an MCMC approach. Shpitser et al. (2012) employ a greedy approach to optimize a penalized likelihood over ADMGs for discrete parametrizations.

Outline of this Paper

This paper is structured as follows. In Section 3.2 we give an in-depth overview of the model and its estimation from data, as well as some distributional equivalence properties. In Section 3.3 we present the details of our greedy algorithm with various computational speedups. In Section 3.4 we present empirical results on simulated and real datasets. All proofs as well as further theoretical results and justifications can be found in the Appendix.

3.2. Model and Estimation

Graph Terminology

Let X_1, ..., X_d be a set of random variables and V = {1, ..., d} be their index set. The elements of V are also called nodes or vertices. A mixed graph or path diagram G on V is an ordered tuple G = (V, E_D, E_B) for some E_D, E_B ⊆ V × V. If (i, j) ∈ E_D, we say there is a directed edge from i to j and write i → j ∈ G. If (i, j) ∈ E_B, we must also have (j, i) ∈ E_B, and we say there is a bidirected edge between i and j and write i ↔ j ∈ G. The set pa_G(i) := {j : j → i ∈ G} is called the parents of i. This definition extends to sets of nodes S in the obvious way: pa_G(S) := ∪_{i ∈ S} pa_G(i). The in-degree of i is the number of arrowheads at i. If V′ ⊆ V, E′_D ⊆ E_D ∩ (V′ × V′), and E′_B ⊆ E_B ∩ (V′ × V′), then G′ = (V′, E′_D, E′_B) is called a subgraph of G, and we write G′ ⊆ G. If any of the subset relations are strict, we call G′ a strict subgraph of G and write G′ ⊂ G. The skeleton of G is the undirected graph over the same node set V and with edges i − j if and only if i → j ∈ G or i ↔ j ∈ G (or both).

A path π between i and j is an ordered tuple of (not necessarily distinct) nodes π = (v_0 = i, ..., v_l = j) such that there is an edge between v_k and v_{k+1} for all k = 0, ..., l−1. If the nodes are distinct, the path is called non-overlapping. The length of π is the number of edges, λ(π) = l. If π consists only of directed edges pointing in the direction of j, it is called a directed path from i to j. The tuple (i, j, k) is called a collider on π if i, j and j, k are each adjacent on π with arrowheads pointing from i to j and from k to j (that is, one of the following structures: i → j ← k, i → j ↔ k, i ↔ j ← k, i ↔ j ↔ k). If additionally there is no edge between i and k (and i ≠ k), the collider is called a v-structure. A path without colliders is called a trek.

Let A, B, C ⊆ V be three disjoint sets of nodes. The set an(C) := {i ∈ V : there exists a directed non-overlapping path from i to c for some c ∈ C} is called the ancestors of C. A non-overlapping path π from a to b is said to be m-connecting given C if every non-collider on π is not in C and every collider on π is in an(C). If there are no such

paths, A and B are m-separated given C, and we write A ⊥_m B | C. We use a similar notation for denoting conditional independence of the corresponding sets of variables X_A and X_B given X_C: X_A ⊥ X_B | X_C.

A graph G is called cyclic if there are at least two nodes i and j such that there are directed paths both from i to j and from j to i. Otherwise G is called acyclic or recursive. An acyclic path diagram is also called an acyclic directed mixed graph (ADMG). An acyclic path diagram having at most one edge between each pair of nodes is called a bow-free acyclic path diagram (BAP); the structure i → j together with i ↔ j is also known as a bow. An ADMG without any bidirected edges is called a directed acyclic graph (DAG).

The Model

A linear structural equation model (SEM) M is a set of linear equations involving the variables X = (X_1, ..., X_d)^T and some error terms ɛ = (ɛ_1, ..., ɛ_d)^T:

    X = B X + ɛ,    (3.1)

where B is a real d × d matrix, and cov(ɛ) = Ω is a positive semi-definite matrix. M has an associated graph G that reflects the structure of B and Ω. For every non-zero entry B_ij there is a directed edge from j to i, and for every non-zero entry Ω_ij there is a bidirected edge between i and j. Thus we can also write (3.1) as:

    X_i = Σ_{j ∈ pa_G(i)} B_ij X_j + ɛ_i,  for all i ∈ V,    (3.2)

with cov(ɛ_i, ɛ_j) = Ω_ij for all i, j ∈ V.

Our model is a special type of SEM; in particular, we make the following assumptions:

(1) The errors ɛ follow a multivariate Normal distribution N(0, Ω).

(2) The associated graph G is a BAP.
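To make the graph terminology concrete, the following is a hedged sketch (illustrative code, not from the thesis) of parents, ancestors, and the bidirected connected components for the BAP of Figure 3.1b; the edge-set representation is a made-up encoding, and the bidirected components reappear as the "districts" used for the score decomposition in Section 3.3:

```python
from collections import defaultdict, deque

# Figure 3.1b as two edge sets (hypothetical encoding, not thesis code):
directed = {(1, 2), (2, 3), (3, 4)}   # i -> j
bidirected = {(2, 4)}                  # i <-> j, stored once per pair

def parents(i):
    # pa(i): nodes with a directed edge into i
    return {a for (a, b) in directed if b == i}

def ancestors(C):
    # an(C): nodes with a directed path to some c in C
    out, frontier = set(), set(C)
    while frontier:
        new = set()
        for c in frontier:
            new |= parents(c)
        new -= out
        out |= new
        frontier = new
    return out

def districts(nodes):
    # connected components of the bidirected part of the graph
    adj = defaultdict(set)
    for a, b in bidirected:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in nodes:
        if v in seen:
            continue
        comp, queue = set(), deque([v])
        while queue:
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(sorted(ancestors({4})))        # ancestors of X4
print(districts([1, 2, 3, 4]))       # districts of the BAP
```

For Figure 3.1b this yields ancestors {1, 2, 3} for X_4 and the districts {1}, {2, 4}, {3}, matching the example given later in the score decomposition.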

It then follows from (1) that M is completely parametrized by θ = (B, Ω). Often M is specified via its graph G, and we are interested to find a parametrization θ_G compatible with G. We thus define the parameter spaces for the edge weight matrices B (directed edges) and Ω (bidirected edges) for a given BAP G as

    B_G = {B ∈ R^{d×d} : B_ij = 0 if j → i is not an edge in G}
    O_G = {Ω ∈ R^{d×d} : Ω_ij = 0 if i ≠ j and i ↔ j is not an edge in G; Ω is positive semi-definite}

and the combined parameter space as Θ_G = B_G × O_G. The covariance matrix for X is given by:

    φ(θ) = (I − B)^{-1} Ω (I − B)^{-T},    (3.3)

where φ : Θ_G → S_G maps parameters to covariance matrices, and S_G := φ(Θ_G) is the set of covariance matrices compatible with G. Note that φ(θ) exists since G is acyclic by (2) and therefore I − B is invertible.

We assume that the variables are normalized to have variance 1, that is, we are interested in the subset S̃_G ⊆ S_G, where S̃_G = {Σ ∈ S_G : Σ_ii = 1 for all i = 1, ..., d}, and its preimage under φ, Θ̃_G := φ^{-1}(S̃_G) ⊆ Θ_G. One of the main motivations for working with BAP models is parameter identifiability. This is defined below:

Definition 1. A normalized parametrization θ_G ∈ Θ̃_G is identifiable if there is no θ′_G ∈ Θ̃_G such that θ′_G ≠ θ_G and φ(θ′_G) = φ(θ_G).

Brito and Pearl (2002) show that for any BAP G, the set of normalized non-identifiable parametrizations has measure zero.

The causal interpretation of BAPs is the following. A directed edge from X to Y represents a direct causal effect of X on Y (we adopt the interventional definition of causality, i.e. there is a direct causal effect of X on Y if intervening at X changes Y). A bidirected

edge between X and Y represents a hidden variable which is a direct cause of both X and Y. In practice, one is often interested in predicting the effect of an intervention at X_i on another variable X_j. This is called the total causal effect of X_i on X_j and can be defined as

    E_ij = ∂/∂x E[X_j | do(X_i = x)],

where do(X_i = x) means replacing the respective equation in (3.2) with X_i = x (Pearl, 2009). For linear Gaussian path diagrams this is a constant quantity, given by

    E_ij = ((I − B)^{-1})_ij.    (3.4)

Penalized Maximum Likelihood

Consider a BAP G. A first objective is to estimate the parameters θ_G from i.i.d. samples {x_i^(s)} (i = 1, ..., d and s = 1, ..., n). This can be done by maximum likelihood estimation using the RICF method of Drton et al. (2009). Given the Gaussianity assumption (1) and the covariance formula (3.3), one can write down the log-likelihood for a given parameter set θ_G and a sample covariance matrix S:

    l(θ_G; S) = −(n/2) ( log|2πΣ_G| + ((n−1)/n) tr(Σ_G^{-1} S) ),    (3.5)

where Σ_G = φ(θ_G) is the covariance matrix implied by the parameters θ_G, see for example Mardia et al. (1979, (4.1.9)). However, due to the structural constraints on B and Ω it is not straightforward to maximize this for θ_G. RICF is an iterative method to do so, yielding the maximum likelihood estimate:

    θ̂_G = arg max_{θ_G ∈ Θ_G} l(θ_G; S).    (3.6)

We now extend this to the scenario where the graph G is also unknown, using a regularized likelihood score with a BIC-like penalty term that increases with the number of edges. Concretely, we use the following score for a given BAP G:

    s(G) := (1/n) ( max_{θ_G ∈ Θ_G} l(θ_G; S) − (#{nodes} + #{edges}) log n ).    (3.7)
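The implied covariance (3.3) and the total effects (3.4) can be evaluated without a general matrix inverse: for an acyclic B (ordered strictly lower triangular), B is nilpotent, so (I − B)^{-1} = I + B + B² + ... terminates. A hedged sketch with made-up edge weights for the Figure 3.1b BAP (not the thesis implementation):

```python
# Illustration (made-up weights, not thesis code): implied covariance (3.3)
# and total causal effects (3.4) for the BAP X1 -> X2 -> X3 -> X4, X2 <-> X4.
d = 4
B21, B32, B43, OM24 = 0.8, -0.5, 0.7, 0.4

B = [[0.0] * d for _ in range(d)]
B[1][0], B[2][1], B[3][2] = B21, B32, B43
Omega = [[float(i == j) for j in range(d)] for i in range(d)]
Omega[1][3] = Omega[3][1] = OM24

def matmul(A, C):
    return [[sum(A[i][k] * C[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

# Neumann series terminates because B is strictly lower triangular (acyclic):
# inv = I + B + B^2 + B^3 = (I - B)^{-1}
I = [[float(i == j) for j in range(d)] for i in range(d)]
inv, P = I, I
for _ in range(d - 1):
    P = matmul(P, B)
    inv = [[inv[i][j] + P[i][j] for j in range(d)] for i in range(d)]

E = inv                                     # total effects E_ij = ((I-B)^{-1})_ij
inv_T = [[inv[j][i] for j in range(d)] for i in range(d)]
Sigma = matmul(matmul(inv, Omega), inv_T)   # phi(theta) = (I-B)^{-1} Omega (I-B)^{-T}

print(E[3][0])        # total effect of X1 on X4 (= B21*B32*B43)
print(Sigma[1][3])    # implied cov(X2, X4)
```

With these weights the total effect of X_1 on X_4 is the product of the directed edge weights along the path, as (3.4) predicts.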

We have scaled the log-likelihood and penalty with 1/n so that the score converges to a limit for increasing n (for the true graph G, this limit is the negative entropy of the joint distribution). Compared with the usual BIC penalty, we chose our penalty to be twice as large, since this led to better performance in simulation studies.

Equivalence Properties

There is an important issue when doing structure learning with graphical models: typically the maximizers of (3.7) will not be unique. This is a fundamental problem for most model classes and a consequence of the model being underdetermined. In general, there are sets of graphs that are statistically indistinguishable (in the sense that they can all parametrize the same joint distributions over the variables). These graphs are called distributionally equivalent. For nonparametric DAG models (without non-linearity or non-Gaussianity constraints), for example, the distributional equivalence classes are characterized by conditional independencies and are called Markov equivalence classes. For BAPs, distributional equivalence is not completely characterized yet, but we can present some useful results (see Spirtes et al. (1998) or Williams (2012) for a discussion of the linear Gaussian ADMG case).

Let us first make precise the different notions of model equivalence.

Definition 2. Two BAPs G_1, G_2 over a set of nodes V are Markov equivalent if they imply the same m-separation relationships.

This essentially means they imply the same conditional independencies, and the definition coincides with the classical notion of Markov equivalence for DAGs. The following notion of distributional equivalence is stronger.

Definition 3. Two BAPs G_1, G_2 are distributionally equivalent if for all θ_G1 ∈ Θ_G1 there exists θ_G2 ∈ Θ_G2 (and vice versa) such that φ(θ_G1) = φ(θ_G2).

Spirtes et al. (1998) showed the following global Markov property for general linear path diagrams: if there are nodes a, b ∈ V and a possibly empty set C ⊆ V such that a ⊥_m b | C, then the partial correlation

of X_a and X_b given X_C is zero. As a direct consequence, we get the following first result:

Theorem 2. If two BAPs G_1, G_2 do not share the same m-separations, they are not distributionally equivalent.

Unlike for DAGs, the converse is not true, as the counterexample in Figure 3.2 shows.

An important tool in this context is Wright's path tracing formula (Wright, 1960), which expresses the covariance between any two variables in a path diagram as the sum-product over the edge labels of the treks (collider-free paths) between those variables, as long as all variables are normalized to variance 1. A precise statement as well as a proof of a more general version of Wright's formula can be found in the Appendix (Theorem 6). As an example, consider the BAP in Figure 3.1b. There are two treks between X_2 and X_4: X_2 → X_3 → X_4 and X_2 ↔ X_4. Hence cov(X_2, X_4) = B_32 B_43 + Ω_24, assuming normalized parameters. Similarly we have cov(X_1, X_4) = B_21 B_32 B_43.

As a consequence of Wright's formula, we can show that having the same skeleton and collider structure is sufficient for two acyclic path diagrams to be distributionally equivalent (Theorem 3 below). For DAGs, it is known that the weaker condition of having the same skeleton and v-structures is sufficient for being Markov equivalent. However, for BAPs this is not true, as the counterexample in Figure 3.2 shows.

Theorem 3. Let G_1, G_2 be two acyclic path diagrams that have the same skeleton and collider structure. Then G_1 and G_2 are distributionally equivalent.

It would be desirable to also have a necessary condition for distributional equivalence. We conjecture that having the same skeleton is a necessary condition. For DAGs this is trivial, since a missing edge between two nodes means they can be d-separated, and thus a conditional independency would have to be present in the corresponding distribution.
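The trek computations in this example can be checked by simulation. The following is a hedged sketch (made-up edge weights, not thesis code); the noise variances are chosen so that every variable has variance 1, as Wright's formula requires. Note that X_1 ↔ X_2 ↔ X_4-type detours do not contribute to cov(X_1, X_4): the path X_1 → X_2 ↔ X_4 has a collider at X_2 and is therefore not a trek.

```python
import math
import random

# Numerical check of the trek rule for the BAP of Figure 3.1b
# (X1 -> X2 -> X3 -> X4 plus X2 <-> X4); weights are illustrative.
b21, b32, b43, om24 = 0.8, -0.5, 0.7, 0.3
om22 = 1 - b21**2                          # normalize var(X2) = 1
om33 = 1 - b32**2                          # normalize var(X3) = 1
om44 = 1 - b43**2 - 2 * b43 * b32 * om24   # normalize var(X4) = 1

rng = random.Random(1)
n = 200_000
acc24 = acc14 = 0.0
for _ in range(n):
    e1 = rng.gauss(0, 1)                   # om11 = 1
    e2 = rng.gauss(0, math.sqrt(om22))
    e3 = rng.gauss(0, math.sqrt(om33))
    # eps4 correlated with eps2 so that cov(eps2, eps4) = om24:
    resid = om44 - om24**2 / om22
    e4 = (om24 / om22) * e2 + rng.gauss(0, math.sqrt(resid))
    x1 = e1
    x2 = b21 * x1 + e2
    x3 = b32 * x2 + e3
    x4 = b43 * x3 + e4
    acc24 += x2 * x4
    acc14 += x1 * x4

print(round(acc24 / n, 2), "vs trek sum", b32 * b43 + om24)   # X2->X3->X4 and X2<->X4
print(round(acc14 / n, 2), "vs trek sum", b21 * b32 * b43)    # only trek X1->X2->X3->X4
```

The empirical covariances agree with the trek sums up to Monte Carlo error.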
However, for BAPs a missing edge does not necessarily result in an m-separation, as the counterexample in Figure 3.2 shows. The following theorem at least shows that strictly nested models cannot be distributionally equivalent.

Theorem 4. Let G_1, G_2 be two BAPs over the same set of nodes,

Figure 3.2.: The two BAPs in (a) and (b) share the same skeleton and v-structures, but in (a) there are no m-separations, whereas in (b) we have X_2 ⊥_m X_3 | {X_1, X_4}. BAPs (a) and (c) share the same m-separations (none) but are not distributionally equivalent, since (a) is a strict subgraph of (c) (using Theorem 4).

such that G_1 is a strict subgraph of G_2. Then G_1 and G_2 are not distributionally equivalent.

3.3. Greedy Search

We aim to find the maximizer of (3.7) over all graphs over the node set V = {1, ..., d}. Since exhaustive search is infeasible, we use greedy hill-climbing. Starting from some graph G_0, this method obtains increasingly better estimates by exploring the local neighborhood of the current graph. At the end of each exploration, the highest-scoring graph is selected as the next estimate. This approach is also called greedy search and is often used for combinatorial optimization problems. Greedy search converges to a local optimum, although typically not the global one. To alleviate this we repeat it multiple times with different (random) starting points.

We use the following neighborhood relation. A BAP G′ is in the local neighborhood of G if it differs by exactly one edge, that is, one of the following holds:

1. G ⊂ G′ (edge addition),

2. G′ ⊂ G (edge deletion), or

3. G and G′ have the same skeleton (edge change).

If we only admit the first condition, the procedure is called forward search, and it is usually started with the empty graph. Instead of searching through the complete local neighborhood at every step (which can become prohibitive for large graphs), we can also select a random subset of neighbors and only consider those. Below we describe some adaptations of this general scheme that are specific to the problem of BAP learning, as well as our greedy equivalence class algorithm.

Score Decomposition

Greedy search becomes much more efficient when the score separates over the nodes or parts of the nodes. For DAGs, for example, the log-likelihood can be written as a sum of components, each of which only depends on one node and its parents. Hence, when considering a neighbor of some given DAG, one only needs to update the components affected by the respective edge change. A similar property holds for BAPs. Here, however, the components are not the nodes themselves, but rather the connected components of the bidirected part of the graph (that is, the partition of V into sets of vertices that are reachable from each other by only traversing the bidirected edges). For example, in Figure 3.1b the bidirected connected components (sometimes also called districts) are {X_1}, {X_2, X_4}, {X_3}. This decomposition property is known, but for completeness we give a derivation in the Appendix. We write out the special case of the Gaussian parametrization below.

Let us write p_G^X for the joint density of X under the model (3.2), and p_G^ɛ for the corresponding joint density of ɛ. Let C_1, ..., C_K be the connected components of the bidirected part of G. We separate the model G into submodels G_1, ..., G_K of the full SEM (3.2), where each G_k consists only of the nodes in V_k = C_k ∪ pa(C_k) and without any edges between nodes in pa(C_k). Then, as we show in the Appendix, the log-likelihood of the model with joint density p_G^X given data D = {x_i^(s)}

(with 1 ≤ i ≤ d and 1 ≤ s ≤ n) can be written as:

    l(p_G^X; D) = Σ_{s=1}^n log p_G^X(x_1^(s), ..., x_d^(s))
                = Σ_k [ l(p_{G_k}^X; {x_i^(s)}_{i ∈ V_k, s=1,...,n}) − l(p_{G_k}^X; {x_j^(s)}_{j ∈ pa(C_k), s=1,...,n}) ],

where l(p_{G_k}^X; {x_j^(s)}) refers to the likelihood of the X_j-marginal of p_{G_k}^X. For our Gaussian parametrization, using (3.5), this becomes

    l(Σ_{G_1}, ..., Σ_{G_K}; S) = −(n/2) Σ_k [ |C_k| log 2π + log( |Σ_{G_k}| / Π_{j ∈ pa(C_k)} σ_{kj}² )
                                   + ((n−1)/n) ( tr(Σ_{G_k}^{-1} S_{C_k ∪ pa(C_k)}) − Σ_{j ∈ pa(C_k)} S_jj / σ_{kj}² ) ],

where S_{C_k ∪ pa(C_k)} is the restriction of S to the rows and columns corresponding to V_k, and σ_{kj}² is the diagonal entry of Σ_{G_k} corresponding to parent node j. Note that now the log-likelihood depends on {x_i^(s)} and p_G^X only via S and Σ_{G_1}, ..., Σ_{G_K}. Furthermore, the log-likelihood is now a sum of contributions from the submodels G_k. This means we only need to re-compute the likelihood of the submodels that are affected by an edge change when scoring the local neighborhood. In practice, we also cache the submodel scores, that is, we assign each encountered submodel a unique hash and store the respective scores, so they can be re-used.

Uniformly Random Restarts

To restart the greedy search we need random starting points (BAPs), and it seems desirable to sample them uniformly at random (another motivation for uniform BAP generation is generating ground truths for simulations). Just like for DAGs, it is not straightforward to achieve this. What is often done in practice (and implemented in the pcalg package (Kalisch et al., 2012) as randomDAG()) is uniform sampling of triangular (adjacency)

matrices and subsequent uniform permutation of the nodes. However, this does not result in uniformly distributed graphs, since for some triangular matrices many permutations yield the same graph (the empty graph is an extreme example). The consequence is a shift of weight towards more symmetric graphs, which are invariant under some permutations of their adjacency matrices. A simple example with BAPs for d = 3 is shown in Figure 3.3.

Figure 3.3.: Relative frequencies of the 62 3-node BAPs when sampled n = 30000 times with the naive (triangular matrix sampling) and the MCMC method.

One way around this is to design a random process with graphs as states and a uniform limiting distribution. A corresponding Markov chain Monte Carlo (MCMC) approach is described for example in Melançon et al. (2001) for the case of DAGs. See also Kuipers and Moffa (2015) for an overview of different sampling schemes. We adapted this method for BAPs. Specifically, we used the following transition rules: in each step a position (i, j) is sampled uniformly at random; if there is an edge at (i, j), this edge is removed with probability 0.5; if there is no edge at (i, j):

add i → j with probability 0.5, as long as this does not create a directed cycle; add i ↔ j with probability 0.5. It is easy to check that the resulting transition matrix is irreducible and symmetric, and hence the Markov chain has a (unique) uniform stationary distribution. Thus, starting from any graph, after an initial burn-in period the distribution of the visited states will be approximately uniform over the set of all BAPs. In practice, we start the process from the empty graph and sample after taking O(d⁴) steps (c.f. Kuipers and Moffa (2015)). It is straightforward to adapt this sampling scheme to a number of constraints, for example uniform sampling over all BAPs with a given maximal in-degree.

Greedy Equivalence Class Construction

We propose the following recursive algorithm to greedily estimate the distributional equivalence class E(G) of a given BAP G with score ζ. We start by populating the empirical equivalence class Ê(G) with graphs that have the same skeleton and colliders as G, since these are guaranteed to be equivalent by Theorem 3. This is a significant computational shortcut, since these graphs do not have to be found greedily anymore. Then, starting from G, at each recursion level we search all edge-change neighbors of the current BAP for BAPs that have a score within ɛ of ζ (edge additions or deletions would result in non-equivalent graphs by Theorem 4). For each such BAP, we spark a new recursive search until a maximum depth of d(d−1)/2 (corresponding to the maximum number of possible edges) is reached, always comparing against the original score ζ. Already visited states are stored and ignored. Finally, all found graphs are added to Ê(G). The main tuning parameter here is ɛ, essentially specifying the threshold for statistically indistinguishable graphs.
This approach of approximating the equivalence class has the advantage of also including neighbouring equivalence classes that are statistically indistinguishable from the given data, thus erring on the conservative side.
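The recursion just described can be sketched over an abstract state space. The following Python sketch is illustrative only (the thesis implementation is in R): the states, neighbors() and score() are placeholder interfaces, and the skeleton/collider pre-population shortcut is omitted.

```python
def empirical_equivalence_class(g0, neighbors, score, eps, max_depth):
    """Greedily approximate the distributional equivalence class of g0:
    recursively visit edge-change neighbors whose score lies within
    eps of the original score zeta, up to the given recursion depth."""
    zeta = score(g0)
    found = {g0}
    visited = {g0}

    def search(g, depth):
        if depth >= max_depth:
            return
        for h in neighbors(g):
            if h in visited:
                continue
            visited.add(h)
            # Always compare against the original score zeta,
            # not against the score of the current state g.
            if abs(score(h) - zeta) <= eps:
                found.add(h)
                search(h, depth + 1)

    search(g0, 0)
    return found

# Toy state space: "graphs" are integers, neighbors differ by one,
# and the score is flat on {0, 1, 2} and jumps outside that set.
toy_score = lambda g: 0.0 if 0 <= g <= 2 else 5.0
toy_neighbors = lambda g: [g - 1, g + 1]
print(sorted(empirical_equivalence_class(0, toy_neighbors, toy_score,
                                         eps=0.1, max_depth=10)))  # [0, 1, 2]
```

For BAPs, the states would be graphs (hashed, e.g., by their edge sets), neighbors() would enumerate edge changes, score() would be the penalized likelihood, and max_depth would be d(d−1)/2.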

Implementation

Our implementation is done in the statistical computing language R (R Core Team, 2015). The code is available as supplemental material to this paper, and it will also be published in a future version of the R package pcalg (Kalisch et al., 2012). We make heavy use of the RICF implementation fitAncestralGraph() in the ggm package (Marchetti et al., 2015); despite the function name, this implementation is not restricted to ancestral graphs. We noted that there are sometimes convergence issues, so we adapted the implementation of RICF to include a maximal iteration limit (which we set to 10 by default).

Empirical Results

In this section we present some empirical results to show the effectiveness of our method. First, we consider a simulation setting where we can compare against the ground truth. Then we turn to a well-known genomic data set, where the ground truth is unknown, but the likelihood of the fitted models can be compared against that of other methods.

Causal Effects Discovery on Simulated Data

To validate our method, we randomly generate ground truths, simulate data from them, and try to recover them from the generated datasets. This procedure is repeated N = 100 times and the results are averaged. We now discuss each step in more detail.

Randomly generate a BAP G. We do this uniformly at random (for a fixed model size d = 10 and maximal in-degree α = 2). The sampling procedure is described in the section on uniformly random restarts above.

Randomly generate a parametrization θ_G. We sample the directed edge labels in B independently from a standard

Normal distribution. We do the same for the bidirected edge labels in Ω, and set the error variances (diagonal entries of Ω) to the respective row-sums of absolute values plus an independently sampled χ²(1) value (by Gershgorin's circle theorem, this is guaranteed to result in a positive definite matrix; to increase stability, we also repeat the sampling of Ω if its minimal eigenvalue is below a fixed threshold).

Simulate data {x^{(s)}_i} from θ_G. This is straightforward, since we just need to sample from a multivariate Normal distribution with mean 0 and covariance φ(θ_G). We use the function rmvnorm() from the package mvtnorm (Genz et al., 2014).

Find an estimate Ĝ from {x^{(s)}_i}. We use greedy search with R = 100 uniformly random restarts (as outlined in Section 3.3) for this, as well as one greedy forward search starting from the empty model.

Compare G and Ĝ. This is not so straightforward, since the structure of equivalence classes for BAPs is not known. We therefore make use of the greedy approach described in the section on greedy equivalence class construction, with a fixed value of ɛ, to get Ê(G) and Ê(Ĝ). We proceed by estimating the minimal absolute causal effects E^min_G between all nodes over the empirical equivalence class, analogous to the IDA method (Maathuis et al., 2009). Thus, for each graph G′ ∈ Ê(G) in the empirical equivalence class with estimated parameters θ̂_{G′} = (B̂, Ω̂), we compute the estimated causal effects matrix Ê according to (3.4). We then take absolute values and entry-wise minima over all Ê to obtain Ê^min_G. We do the same for Ĝ to get Ê^min_Ĝ. To compare the estimated causal effects of the ground truth G and those of the greedy search result Ĝ, we treat this as a classification problem, where (Ê^min_Ĝ)_ij gives a confidence score for the event (Ê^min_G)_ij > 0, and we report the area under the ROC curve (AUC, see ?). The AUC ranges from 0 to 1, with 1 meaning perfect classification and 0.5 being equivalent to random guessing. A bit of care has to be taken because the cases (Ê^min_G)_ij > 0 and

(Ê^min_G)_ji > 0 exclude each other, but we took this into account when computing the false positive rate.

Figure 3.4.: ROC curves for causal effect discovery for N = 100 simulation runs of BAPs with d = 10 nodes and a maximal in-degree of α = 2. Sample size was n = 1000, and the greedy search was repeated R = 100 times at uniformly random starting points. The thick curve is the point-wise average of the individual ROC curves.

The results can be seen in Figure 3.4. While the average AUC suggests that perfect graph discovery is usually not achieved, causal effects (which are often more relevant in practice) can be identified to some extent. The computations took 2.5 hours on an AMD Opteron 6174 processor using 20 cores.

Genomic Data

We also applied our method to a well-known genomics data set (Sachs et al., 2005), where the expression of 11 proteins in human T-cells was measured under 14 different experimental conditions. There are likely

hidden confounders, which makes this setting suitable for hidden variable models. However, it is questionable whether the bow-freeness, linearity, and Gaussianity assumptions hold to a reasonable approximation (in fact the data seem not to be multivariate Normal). Furthermore, there is no ground-truth network (although some individual links between pairs of proteins are reported as known in the original paper). We therefore abstain from presenting a best network and instead compare the model fit of BAPs and DAGs from a purely statistical point of view. To do this, we first log-transform all variables, since they are heavily skewed. We then run two sets of greedy searches for each of the 14 datasets: one with BAPs and one with DAGs. For the BAPs we use 100 and for the DAGs we use 1000 random restarts (DAG models can be fit much faster than BAP models). The results for datasets 1 to 8 can be seen in Figure 3.5. The penalized likelihood scores of the best BAP models are always much higher than the corresponding scores for the DAG models, despite the larger number of random restarts for the DAGs. Furthermore, while there is a marked improvement over the random starting scores for BAPs, the improvement for the DAG scores is marginal. For these datasets, BAPs seem to be superior to DAGs in modelling the data. The computations took 4 hours for the BAP models and 1.5 hours for the DAG models on an AMD Opteron 6174 processor using 20 cores.

Conclusions

We have presented a structure learning method for BAPs, which can be viewed as a generalization of Gaussian linear DAG models allowing for certain latent variables. Our method is computationally feasible and the first of its kind. The results on simulated data are promising, keeping in mind that structure learning and inferring causal effects are difficult tasks, even in the easier case of DAGs.
The main sources of errors (given the model assumptions are fulfilled) are sampling variability, finding a local optimum only, and not knowing

10 The results for the other 6 datasets are qualitatively similar.

Figure 3.5.: Greedy search runs on the first 8 of 14 genomic datasets with BAPs and DAGs. The x-axis is time in seconds, the y-axis is the (normalized) penalized likelihood score. Each path corresponds to a run of a greedy search with a different random restart, with each point on the path being a state visited by this greedy search run.