Estimating Causal Networks from Multivariate Observational Data



Diss. ETH No.

Estimating Causal Networks from Multivariate Observational Data

A dissertation submitted to ETH ZURICH for the degree of Doctor of Sciences

presented by
CHRISTOPHER NOWZOHOUR
Master of Science, University of Oxford
born July 2, 1986
citizen of Germany

accepted on the recommendation of
Prof. Dr. Peter Bühlmann, examiner
Prof. Dr. Marloes Maathuis, co-examiner

2015


To my family


Acknowledgements

First, I would like to thank my supervisor, Prof. Peter Bühlmann, who was ever encouraging and set an example balancing an unbelievable number of personal and professional commitments. I am also grateful to Prof. Marloes Maathuis, who co-supervised me during the second part of my doctoral studies and whose enthusiasm and attention to detail made our collaboration very enjoyable. Without the encouragement of Prof. Nicolai Meinshausen while I was a student at Oxford, I would never have considered a PhD in statistics. I would not have gotten through my PhD without the copious amounts of chocolate and good vibes generously supplied by my two amazing office mates Anna and Ewa. I won't forget the climbs done with Alain, the Salsa dancing with Sophie, or the rounds of Crazy Dog played with Jan, Anna, Jonas, Ruben, and all the others! I am thankful to all SfS colleagues for the great atmosphere at our institute, be it at work or outside. Finally, I am very grateful to my parents and my sister for their lasting support and encouragement.


Contents

Abstract
Zusammenfassung

1. Introduction
   1.1. Background
        Causal Models
        Structure Learning
   1.2. Outline of Thesis

2. Score-based Causal Learning in Additive Noise Models
   2.1. Introduction
   2.2. The Method
        Notation and Definitions
        Penalized maximum likelihood estimation
   2.3. Theoretical Results
   2.4. Numerical Results
        Identifiability depending on Linearity and Gaussianity
        Random Edge Functions
        Larger Networks and Thresholding
   2.5. Real Data
   2.6. Conclusions
   2.A. Consistency Proof

3. Structure Learning with Bow-free Acyclic Path Diagrams
   Introduction
   Model and Estimation
        Graph Terminology
        The Model
        Penalized Maximum Likelihood
   Equivalence Properties
   Greedy Search
        Score Decomposition
        Uniformly Random Restarts
        Greedy Equivalence Class Construction
        Implementation
   Empirical Results
        Causal Effects Discovery on Simulated Data
        Genomic Data
   Conclusions
   Distributional Equivalence
   Likelihood Separation

Conclusions and Future Work
   Specific Extensions for DAG learning
   Specific Extensions for BAP learning

3-node DAGs
   Full
   Double Sink
   Double Source
   Chain
   Single Edge
   Empty

Equivalence Classes for 3-node BAPs
   Finding Equivalent BAPs
   Finding Equivalence Classes

Algorithms
   DAG Learning Algorithms
   BAP Learning Algorithms

Bibliography

Abstract

The field of statistical causal inference is concerned with estimating cause-effect relationships between variables from i.i.d. observations. This is impossible in general: for example, one cannot distinguish X → Y from X ← Y without making further assumptions. However, when more variables are involved or certain structural or distributional assumptions are made, causal inference becomes possible. This is relevant for applications where randomized experiments are not feasible to test causal hypotheses (econometrics) or where the large number of hypotheses requires some kind of pre-screening (genomics).

This thesis is about structure learning, which means estimating the underlying causal graph from data. Specifically, the focus of interest is score-based methods, which assign every possible causal graph a numeric score (depending on the observed data) and then try to find the graph maximizing this score. The two main challenges are:

1. Defining a meaningful score that is maximized by the true underlying graph only and is easily computable at the same time.

2. Solving the combinatorial optimization problem of maximizing the score over all possible graphs.

An important class of causal models are directed acyclic graphs (DAGs), where there are no cyclic relations and no hidden variables. DAGs (also known as Bayesian networks) encode conditional independencies in the joint distribution and are in general only identifiable up to their equivalence class (there is generally more than one DAG encoding the same set of conditional independencies). When the model is restricted to additive noise, the independence of the noise terms can be used to identify DAGs completely, unless the model is linear and Gaussian (in the continuous case). This thesis presents a score-based method for continuous identifiable additive noise models. Specifically, a penalized pseudo-likelihood score is developed for this nonparametric setting and proved to be consistent. The method is also successfully tested on simulated and real datasets.

To also accommodate hidden variables, the class of DAGs needs to be extended. A useful way to do this are bow-free acyclic path diagrams (BAPs), which put some restrictions on the hidden structure but are statistically viable. The parametrization is assumed to be linear and Gaussian to facilitate likelihood scoring. This means full identifiability is no longer possible. In contrast to DAGs, no established theory exists about model equivalency for BAPs. This thesis presents a greedy search method for this case that estimates the equivalence class of the underlying graph, as well as some theoretical results about model equivalency. The method is shown to work on simulated data and is applied to a well-known genomics dataset, where the statistical fit is shown to be much better than for DAG models.

Zusammenfassung

The field of statistical causal inference deals with estimating cause-effect relationships between a set of variables from i.i.d. data. In general this is not possible (e.g. distinguishing X → Y from X ← Y) without making additional assumptions. This changes as soon as more variables are involved or certain structural or distributional assumptions are made. Causal inference is particularly relevant for applications where randomized experiments are not possible (e.g. in econometrics) or where the large number of hypotheses to be tested requires some form of pre-selection (e.g. in genomics).

This dissertation is about estimating the underlying causal graph from data. The focus lies in particular on score-based methods, which assign each graph a (data-dependent) numerical value, the score, and then try to find the score-maximizing graph. The two main challenges are:

1. Defining a sensible score function that is maximized only by the true causal graph and is at the same time easy to compute.

2. Solving the combinatorial optimization problem of maximizing the score over all graphs.

An important class of causal models are DAGs (directed acyclic graphs), in which there are no cyclic structures and no hidden variables. DAGs (also known as Bayesian networks) encode conditional independencies in the joint probability distribution and are in general only identifiable up to their equivalence class (there are usually several DAGs encoding the same conditional independencies). If the model is restricted to additive noise terms, the independence of these noise terms can be used to identify the DAG completely, unless the model is linear and Gaussian (in the continuous case). This thesis presents a score-based method for continuous and identifiable models with additive noise terms. Since this setting is nonparametric, a penalized pseudo-likelihood score was developed and its consistency proved. The method was moreover successfully tested on simulated and real data.

To also model hidden variables, the class of DAGs has to be extended. One way to do this are BAPs (bow-free acyclic path diagrams). The hidden variables are subject to certain restrictions here, but in return the statistical model is tractable. The parametrization is linear and Gaussian, so that a likelihood score can be used. This also means, however, that the model is no longer completely identifiable. In contrast to DAGs, there is no complete theory of equivalence classes for BAPs. In this thesis a greedy algorithm for this setting is presented, which estimates the equivalence class of the underlying graph. In addition, some theoretical results about the equivalence structure of BAPs are presented. The method was likewise tested successfully on simulated data and was furthermore applied to a well-known genomics dataset, where the statistical fit was considerably better than for DAG models.

Chapter 1. Introduction

Happy the man who has been able to learn the causes of things.
— Virgil, Georgics II

The goal of many scientific endeavours is to detect causal relationships in the world surrounding us. Associations are ubiquitous but often difficult to explain with causal models, and only a causal model is usually considered a sufficient explanation. Aside from the rather philosophical motivation of truly understanding a phenomenon, there are more tangible advantages to knowing cause and effect. Often, understanding a system is a precursor to intervening in the system in an effort to change its outcome. If the outcome cannot be controlled directly, the only way to change it is to influence its causes. If a drug is known to cause the cure of an illness, the way to make a sick person healthy is to administer the drug. The quest of causality is finding out which drugs cure the disease. Sometimes this is possible by direct experiment: a number of patients is randomly split into two groups, one of which receives the drug, the other a control. The random assignment rules out any other causes (like gender or age) of a potential improvement, and any change in outcome between the two groups can then be attributed to the drug.

For various reasons a randomized experiment is not always possible. It might simply be too costly to test all relevant causal hypotheses. There are thousands of genes in the genome of most species (and an exponential number of gene combinations), which typically cannot all be tested for their effect on a certain phenotype (like pest resilience for crops).

Another reason preventing direct experiments might be ethical concerns. It would be wrong to forcefully expose people to a harmful substance (like tobacco) to examine the causal connection with certain diseases. Lastly, some experiments are simply impossible, which is often the case in econometrics. One cannot change global interest rates to assess their effect on unemployment. Sometimes the time ordering of events can provide causal information (a principle known as Granger causality in econometrics), but this only works as long as one can exclude the possibility of confounding variables (post hoc fallacy).

When experimental data are not available, one is left with observational data. This makes causal inference hard, since statistical associations can then have explanations other than direct causal effects. The well-known lore that "correlation does not imply causation" serves as a warning for this, and an example of correlation based on confounding is shown below.

Example. Consider the dataset shown in the left panel of Figure 1.1. There is a strong association between life expectancy and the number of internet users in 211 countries. Does this mean that increasing your web mileage will lead to a longer life? Most likely not. Far more realistic is the causal model depicted in the right panel of the figure, where there is a confounding variable that is not observed but causes both other variables. In this case the confounder might be the economic development status of a country, of which gross national income (GNI) could be considered a proxy. Indeed, when conditioning on GNI, the association between the two other variables weakens significantly.

However, it would be wasteful to shun any purely observational data and just capitulate in these cases.

"Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over here'."
— Randall Munroe, xkcd 552

[Figure 1.1: Life expectancy at birth (years) versus internet users (per 100 people), conditioned on gross national income (Low, Lower Middle, Upper Middle, High), in the left panel (2013 data from 211 countries, Worldbank). Possible causal graph connecting life expectancy (LE), internet users (IU), and development (Dev) in the right panel.]

First, causal relationships imply statistical associations, so it would be futile to look for causation where there is no association. Second, it is possible to rule out a number of (and sometimes all but the correct) candidate causal models based on statistical associations in observational data, as research in this discipline during the last decades has shown. Third, with an increasingly restrictive cascade of assumptions, it is possible to identify more and more causal information from observational data. This thesis explores some of these cases and presents algorithms to extract causal models from observational data.

1.1. Background

The task of causal inference can be roughly split into two consecutive stages: learning the causal model from data and predicting intervention effects for a given causal model. While the former, also called structure learning, is concerned with distinguishing causes and effects, the latter is concerned with the specific types and strengths of effects.

This thesis is concerned mainly with structure learning. There is a strong connection with graphical modelling (and particularly Bayesian networks), of which causal inference can essentially be considered an application. A fairly comprehensive overview of the field can be found in Pearl (2009), with the earlier work of Spirtes et al. (1993) also providing valuable insight.

Causal Models

A causal model is usually visualised as a directed graph, consisting of nodes (representing variables) and directed edges between the nodes (representing causal relations). An example is given in the right panel of Figure 1.2, where a change in X_1 would cause a change in X_2 (but not vice versa). The idea is that every variable that can be measured is completely determined by some other variables (its causes). In practice, one rarely has access to (or interest in) all causes of all variables, so all the unknown causes of a given variable are lumped together into a noise term. As long as these unknown causes are distinct for each observed variable (to a reasonable approximation), the noise terms can be considered independent. If there are unobserved confounders, the noise terms are no longer independent (a scenario which is considered in Chapter 3), and the presence of a confounder is usually marked with a bidirected edge in the graph. Of course, a cause-effect relationship is always relative to a resolution, since it can be decomposed into more and more intermediate stages.

The key modelling step now is to associate the noise terms and the exogenous variables [1] with a set of probability distributions, which in turn give rise to corresponding joint probability distributions over the observed variables. This induces the statistical model (set of distributions) associated with the causal model (graph) and is essential for all statistical structure learning methods.

The causal graph might well be cyclic, that is, it might have feedback loops. However, it is usually assumed that this is not the case.

[1] Exogenous variables are not caused by any variables in the system. They are also called source variables.

[Figure 1.2: Causal inference in a nutshell.]

The reasons for this are twofold. On the one hand, most variables could be indexed by time and measured in an arbitrarily dense fashion. The notion of causality that most people subscribe to then implies that, when measuring at the right time scale, the true causal model will not be cyclic. On the other hand, and more importantly, including cyclicity makes things much more complicated. This is why, in the spirit of progressing from simpler towards more complicated models, most methods so far still assume acyclicity.

The causal graph is typically associated with a structural equation model (SEM), which is supposed to model the data generating process (see Figure 1.2). The SEM is simply a system of equations, specifying each variable as a function of its direct causes and a noise term. As such, it encodes the causal graph with its functional dependencies (there is an edge from X_1 to X_2 if and only if f_2 depends on X_1 in Figure 1.2). The reason for the term "structural" is the asymmetric meaning of the equal sign in these systems: it separates causes from effects (usually the causes are written on the right-hand side). The joint distribution over the observed variables is completely determined by the SEM and the marginal distributions of the noise terms and the exogenous variables. This very general model can be constrained by various structural or distributional assumptions.
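For concreteness, the following R sketch simulates data from one such SEM with additive noise; the three-variable graph X_1 → X_2 → X_3 and the particular edge functions are made up for illustration and are not taken from this thesis.

    set.seed(1)
    n <- 300

    # Hypothetical SEM over the DAG X1 -> X2 -> X3: each variable is a
    # function of its direct causes plus an independent additive noise term.
    eps1 <- rnorm(n); eps2 <- rnorm(n); eps3 <- rnorm(n)
    x1 <- eps1                    # exogenous (source) variable
    x2 <- sin(x1) + 0.5 * eps2    # f2 depends on X1, hence the edge X1 -> X2
    x3 <- x2^3 / 3 + 0.5 * eps3   # f3 depends on X2, hence the edge X2 -> X3

    dat <- data.frame(x1, x2, x3) # an i.i.d. sample from the induced joint distribution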

Structural Assumptions

The following are assumptions on the causal graph, which are made at various points in this thesis.

(1) Acyclicity. There are no feedback loops.

(2) Causal Sufficiency. There are no hidden confounders: the noise terms are independent and the causal graph consists of directed edges only.

(3) Bow-freeness. There is no hidden confounder between two variables if one of the variables is a direct cause of the other. This is strictly weaker than assumption (2), since it allows some hidden confounders.

When assumptions (1) and (2) hold, the causal graph is a directed acyclic graph (DAG). Causal structure learning for DAGs is discussed in Chapter 2. In this case, the distribution generated by the corresponding SEM is also said to be Markov to the DAG, since it satisfies certain conditional independence properties resulting from the DAG. When assumptions (1) and (3) hold, the causal graph is a bow-free acyclic path diagram (BAP), like the one shown in Figure 1.2. This rather special class of models is useful, since it can accommodate some hidden confounders while still being statistically tractable, and is discussed in Chapter 3.

Distributional Assumptions

The following are assumptions on the joint probability distribution or the parametrization of the model, which are made at various points in this thesis.

(4) Causal Minimality. There is no proper subgraph [2] of the true graph that can model the data equally well. This would be violated if there were a superfluous edge in the graph (e.g. if f_2 were independent of X_1 in Figure 1.2).

(5) Faithfulness. Every conditional independency in the distribution has to result from the causal graph. This is a strictly stronger version of assumption (4) and would be violated by cancelling paths, where different influences between variables cancel each other out. For a violation of causal minimality or faithfulness, the distributional parameters would have to be finely tuned, so a violation of faithfulness would almost surely not occur with randomly generated models.

[2] A proper subgraph is the same as the original graph, but with some missing edges.

(6) Additive Noise. The structural equations are additive in the noise terms, essentially turning each structural equation into a nonlinear regression model.

(7) Gaussian and Linear. The marginal distributions of the noise terms and exogenous variables are Gaussian, and the structural equations are linear. This is a strictly stronger version of assumption (6) and turns each structural equation into a linear regression model. The resulting distribution over the observed variables is jointly Gaussian.

(8) Non-Gaussian or Non-linear. The negation of assumption (7) (but still keeping assumption (6)).

Another important assumption that is often made is that the data are i.i.d. This means in particular that the data cannot have time structure and must come from the same experimental conditions.

Equivalency

A fundamental problem in structure learning is that the causal graph is, in general, not determined by the joint distribution. Instead, the graphs cluster in equivalence classes, which are statistically indistinguishable, meaning they can all generate the exact same distributions. This means the causal graph is only identifiable up to its equivalence class. In the DAG case, when assumptions (1) and (4) hold, the equivalence classes are determined by conditional independencies that have to hold in the corresponding joint distribution and which can be read off the graphs using a criterion called d-separation (Pearl, 2009). In practice, one often makes assumption (7), since in a joint Gaussian distribution conditional independencies correspond to vanishing partial correlations, which can be estimated easily from data.
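As a small illustration of that last point, the following R sketch (with made-up coefficients) estimates a partial correlation from simulated linear Gaussian data by correlating regression residuals; the marginal correlation is clearly nonzero, while the partial correlation given the common cause vanishes up to sampling error.

    set.seed(2)
    n <- 1000

    # Linear Gaussian SEM: Z -> X and Z -> Y, but no edge between X and Y,
    # so X and Y are conditionally independent given Z.
    z <- rnorm(n)
    x <- 0.8 * z + rnorm(n)
    y <- 0.8 * z + rnorm(n)

    cor(x, y)                                 # strong marginal correlation

    # Partial correlation of X and Y given Z via residuals of linear regressions on Z.
    cor(resid(lm(x ~ z)), resid(lm(y ~ z)))   # approximately zero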

However, it turns out that in the additive noise setting the linear Gaussian case stands out, in the sense that it is the only case where full identifiability cannot be achieved (Hoyer et al., 2009). As soon as one departs from this and assumes either nonlinear structural equations or non-Gaussian noise, the causal graph becomes identifiable. This is because the model class then does not admit independent noise terms for anything but the true graph. This case is considered in Chapter 2. For more general model classes like BAPs (when assumption (3) holds) the equivalence structure is not completely known, which complicates structure learning significantly. This case is considered in Chapter 3.

When the graph is not identifiable for some model classes, structure learning results in the correct equivalence class at best. In lucky cases, this equivalence class might be small and agree on parts of the graph. But in general, equivalence classes can get very large. However, it might still be possible to identify causal effects between some of the variables, by considering each graph in the equivalence class individually and then taking lower bounds of the absolute causal effects (this is the IDA method by Maathuis et al. (2009)). This is what is done in the BAP case (Chapter 3), since there the equivalence classes can potentially be large.

Structure Learning

Methods for structure learning can be broadly categorized into two classes: constraint-based methods and score-based methods. The former class utilizes properties that the correct model has to fulfill and that are testable on data, and constructs the model step by step. The latter class assigns a numeric score to each candidate model and turns the problem into a combinatorial optimization problem (optimizing the score over all possible models). Methods for discrete and continuous data are often quite different; in this thesis only the continuous case is considered.

The underlying problem of learning the best graph is known to be NP-hard in general, so both types of methods face the problem of either having worst-case exponential complexity or not guaranteeing to find the optimal model. However, it has been shown that in some cases polynomial-complexity algorithms exist.

Specifically, when a DAG model is correctly specified (i.e. the observed data really were generated from such a model), finding the true model is not NP-hard (Chickering, 2002). A similar result holds for sparse networks with hidden confounders (Claassen et al., 2013).

Constraint-based Methods

The most common methods of this type use the fact that the causal graph implies certain conditional independencies that are testable on the observed data. The PC algorithm for DAGs (Spirtes et al., 1993) and the FCI, RFCI, and FCI+ algorithms for models with hidden confounders (Spirtes et al., 1993; Maathuis et al., 2009; Claassen et al., 2013) all exploit this property. Although very general in theory, typically distributional assumptions like linearity and Gaussianity (assumption (7)) have to be made in order to facilitate conditional independence tests. The result of these methods is an estimate of the equivalence class, i.e. all graphs that imply precisely the conditional independencies consistent with the observed data.

For nonlinear or non-Gaussian additive noise models (when assumption (8) holds), the independence of the noise terms is used as a constraint. The first method of this type was LiNGAM (Shimizu et al., 2006), which assumes linear functions and non-Gaussian noise. This was later extended to the general case (Hoyer et al., 2009; Mooij et al., 2009). A very general method not needing the additive noise assumption is IGCI (Janzing et al., 2012), which is based on the independence of the cause and the mechanism mapping the cause to its effect. All of these methods result in a single causal graph, since in these model classes full identifiability is possible.

Score-based Methods

The advantage of a score-based approach is that models become inherently comparable, and optimization techniques like greedy search can be applied. There is a long history of score-based methods for causal inference, going back to LISREL (Jöreskog, 2001) for the linear Gaussian case with possibly hidden confounders. A related method is presented in Chapter 3 of this thesis.

For the DAG case the most prominent method is greedy equivalence search (GES, Chickering (2002)), which efficiently traverses the search space by clustering graphs into their equivalence classes. For the identifiable setting of assumption (8) a score-based method is presented in Chapter 2. If the noise is assumed to be Gaussian, the CAM framework can be applied (Bühlmann et al., 2014).

The score for most of the methods in this class consists of a likelihood term, quantifying the model fit, as well as a penalty term for model pruning. The latter is necessary as otherwise overfitting would occur, and the full models would always attain the highest scores. Most often the Bayesian Information Criterion (BIC) is used as a score, which is consistent in the large sample limit if the model class is a curved exponential family. The score is usually optimized using some form of greedy hill climbing. This is only a heuristic in general but works reasonably well in practice. If the likelihood factorizes over the variables (i.e. can be written as a product of terms, each depending on only a single variable and its parents in the graph), the score can be updated locally, leading to a large computational improvement for greedy-search-based methods.
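A minimal R sketch of this idea, using an illustrative Gaussian BIC-type local score (not the score developed in this thesis): because the graph score is a sum of per-node terms, a greedy move that changes the parent set of one node only requires re-scoring that node.

    # Local score of one node given a candidate parent set (illustrative
    # Gaussian log-likelihood with a BIC-type penalty).
    local_score <- function(dat, node, parents) {
      y <- dat[[node]]
      fit <- if (length(parents) == 0) lm(y ~ 1) else lm(y ~ ., data = dat[parents])
      n <- nrow(dat)
      -n / 2 * log(mean(resid(fit)^2)) - log(n) / 2 * (length(parents) + 1)
    }

    # Decomposable graph score: a sum of local scores over the nodes.
    graph_score <- function(dat, parent_sets) {
      sum(sapply(names(parent_sets), function(v) local_score(dat, v, parent_sets[[v]])))
    }

    # A greedy edge addition, say x1 -> x3, only changes the term of node x3:
    # delta <- local_score(dat, "x3", c("x1", "x2")) - local_score(dat, "x3", "x2")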

1.2. Outline of Thesis

This thesis has two main contributions. In Chapter 2 a score-based method for the setting of nonlinear or non-Gaussian additive noise models is presented (assumptions (1), (2), (4) and (8)). The score is a penalized pseudo-likelihood, where the relevant densities are estimated from the data. The main challenge here was to come up with a sensible scoring method, since the model is nonparametric and the true likelihood function is not known. Some simulations show the method works if the model is correctly specified, and an application to a real dataset shows it is comparable with other state-of-the-art methods. The simulations and real datasets considered are low-dimensional, but the method is in principle adaptable to higher dimensions. This chapter also includes some theoretical results, showing consistency of the score as long as the distributions satisfy some technical conditions.

In Chapter 3 a method for linear Gaussian models with bow-free confounder structure is presented (assumptions (1), (3), (5), and (7)). The score is a penalized Gaussian likelihood, using the maximum likelihood estimator provided by Drton et al. (2009). In this chapter the main challenge was to find an efficient way to do greedy search in this setting, since the likelihood no longer factorizes completely over the variables. Additionally, the detailed structure of equivalence classes is unknown for this model class, making the evaluation of any method difficult (the graph found by the method might be different from the original graph yet be in the same equivalence class). This challenge was partly overcome by a mix of theoretical results, going some way towards characterizing equivalence classes, and a heuristic search method to find equivalent models. Simulations show the method is successful if the model is correctly specified, and an application to real data shows that this model class has a much better fit than the more restrictive DAG models, suggesting the ubiquity of hidden confounders in many datasets.

In the appendices, an overview of DAGs and BAPs over 3 nodes is given, as well as a numerical approach to explore the equivalence structure of small BAPs. Finally, all algorithms developed as part of this thesis are given as pseudocode in the appendix.


Chapter 2. Score-based Causal Learning in Additive Noise Models [1]

Given data sampled from a number of variables, one is often interested in the underlying causal relationships in the form of a directed acyclic graph. In the general case, without interventions on some of the variables it is only possible to identify the graph up to its Markov equivalence class. However, in some situations one can find the true causal graph just from observational data, for example in structural equation models with additive noise and nonlinear edge functions. Most current methods for achieving this rely on nonparametric independence tests. One of the problems there is that the null hypothesis is independence, which is what one would like to get evidence for. We take a different approach in our work by using a penalized likelihood as a score for model selection. This is practically feasible in many settings and has the advantage of yielding a natural ranking of the candidate models. Making smoothness assumptions on the probability density space, we prove consistency of the penalized maximum likelihood estimator. We also present empirical results for simulated scenarios and real two-dimensional data sets (cause-effect pairs), where we obtain similar results as other state-of-the-art methods.

[1] This chapter has been published as Nowzohour and Bühlmann (2015).

2.1. Introduction

Statistical causal inference is an important but relatively new field. Traditionally, most statistical statements and assertions are associational ("X and Y are correlated") rather than causal ("changes in X cause changes in Y"). While the former are statements about the joint distribution, the latter are about the underlying causal mechanisms. In practice, the relevant question often is whether variable X has a causal effect [2] on variable Y, possibly mediated by some other variables Z_1, ..., Z_d in the causal network. In general, the only way to completely identify the causal model is by performing experiments (interventions). However, it is often possible to at least narrow down the space of candidate models by using only observational data (Verma and Pearl, 1991; Spirtes et al., 1993). There are many situations where one is dependent on purely observational data, either because performing experiments is infeasible (e.g. astronomical data), unethical (e.g. clinical cancer studies), or both (e.g. economic data). Some real-life examples include identifying gene expression networks (Statnikov et al., 2012; Stekhoven et al., 2012) and analysing fMRI data from the human brain (Ramsey et al., 2010).

When modeling causal networks between some given variables, structural equation models are used frequently, where each variable is expressed as a function of some other variables (its causes) as well as some noise. Thus the model is determined by the cause-effect structure (in the form of a directed graph over the variables), the functional dependencies, and the joint distribution of the noise terms. Assumptions typically made include that the underlying causal model is acyclic (i.e. there are no feedback loops) and that the noise terms are independent (i.e. there are no unobserved variables). We furthermore assume that the noise is additive, i.e. the effect variable minus some noise term is a deterministic function of the cause variables. Although quite restrictive, this is a common assumption in many other settings (e.g. regression) and allows straightforward estimation.

[2] X has a causal effect on Y if manipulating X changes the distribution of Y; see Pearl (2009).

The standard case then is to parameterize the model by making the functional dependencies linear and the noise Gaussian [3]. In this case the space of candidate models (in the form of directed acyclic graphs) clusters into equivalence classes, which prohibits full identification: every model in a given equivalence class can induce the same joint distribution over the variables. In a sense, this is quite exceptional, however. It has been shown that as soon as one departs from the linearity or the Gaussianity assumptions, the model becomes fully identifiable [4] (Shimizu et al., 2006; Hoyer et al., 2009; Zhang and Hyvärinen, 2009; Peters et al., 2011b; Peters and Bühlmann, 2014).

We are thus interested in the nonparametric case, where either the functional dependencies are nonlinear or the noise terms are non-Gaussian (or both). An inference procedure for this case based on nonparametric independence tests has been suggested by Mooij et al. (2009). Their method uses the fact that when fitting the wrong model the noise terms will not be independent. There are a few problems with this approach, however. First, the null hypothesis of the tests employed is independence, which is what one would like to show, and statistical hypothesis testing only allows one to reject such hypotheses. Second, because of the many tests involved there is a multiple testing problem. Third, nonparametric independence testing among many variables is statistically hard, and the tests tend to be computationally intensive.

We take a different approach in the form of a score-based method, which is consistent, fast, and easily adaptable to greedy methods for large problems. Score-based methods are widely used for fitting Gaussian structural equation models (Chickering, 2002) or discrete Bayesian networks (Koller and Friedman, 2009). Maximum a posteriori estimation was used in the setting of non-linear models with Gaussian noise by Imoto et al. (2002). Two other score-based methods have recently been proposed: for the parametric setting of Gaussian and linear models with same error variances (Peters and Bühlmann, 2014) and for linear models with non-Gaussian noise (Hyvärinen and Smith, 2013). Most closely related to this paper is an approach from Bühlmann et al. (2014).

[3] In fact, this is how structural equation models were first introduced and continue to be used today (Bollen, 1989).
[4] Except for a set of degenerate cases of measure zero.

They consider a semi-parametric structural equation model with additive, nonlinear functions in the parental variables and additive Gaussian noise, and they prove consistency and present an algorithm for cases with potentially many variables. In contrast, we consider here a model with a nonparametric specification of the error distribution (while the focus is on cases with few variables only). Thus, our model is more general but harder to estimate from data. We propose a penalized maximum likelihood method and prove its asymptotic consistency for finding the true underlying graph, provided some technical assumptions about the class of probability densities hold. Our nonparametric setting also includes the well-known LiNGAM model (Shimizu et al., 2006) as a special case, and thus we provide here a score-based approach for LiNGAM. Independent work by Kpotufe et al. (2014) considers a similar problem to ours; however, while they only treat the case with two variables, we allow for more realistic multivariate settings.

This paper is organized as follows: In Section 2 we review the basic notation and definitions we will use later on before describing our method. In Section 3 we present our main theorem and the assumptions for proving consistency in the large sample limit. In Section 4 we discuss simulation results showing that the method works in practice under controlled conditions. In Section 5, we test our method on some real-world datasets and compare it to other causal inference methods.

2.2. The Method

Suppose data is sampled from real-valued random variables X_1, ..., X_d, which have some causal structure. We are interested in finding this causal structure (in the form of a directed acyclic graph) just by using observational data. Before we describe our method and the assumptions it rests on, we will give definitions of some of the basic terms used in this paper (some of which can be found in e.g. Lauritzen (1996), Pearl (2009), Triebel (1983)).

Notation and Definitions

Given a set of vertices V = {1, ..., d} and edges E ⊆ V × V, we define the d-dimensional graph G as the ordered pair (V, E). If E is asymmetric, G is called a directed graph. Given two vertices α, β ∈ V, a directed path of length n from α to β is a sequence of vertices α = v_0, ..., v_n = β such that (v_i, v_{i+1}) ∈ E for all i = 0, ..., n−1. If G is directed and for all v ∈ V there is no path of length n ≥ 1 from v to itself, then G is called a directed acyclic graph (DAG). If V′ ⊆ V and E′ ⊆ E ∩ (V′ × V′), then G′ = (V′, E′) is called a subgraph of G, and we write G′ ⊆ G. If additionally E′ ⊊ E ∩ (V′ × V′), we call G′ a proper subgraph of G and write G′ ⊊ G. In a graph G we define the parents of a vertex v as the set pa_G(v) := {u ∈ V : (u, v) ∈ E}. The structural Hamming distance (SHD) between two graphs G, G′ is defined as the number of single edge operations (edge additions, deletions, reversals) necessary to transform G into G′.

A joint density p over X_1, ..., X_d is Markov with respect to a DAG D if it factorizes along D:

    p(x_1, ..., x_d) = ∏_{k=1}^d p( x_k | {x_l}_{l ∈ pa_D(k)} ).    (2.1)

A DAG D is causally minimal with respect to a joint density p if there is no proper subgraph D′ ⊊ D such that p is Markov with respect to D′.

A structural equation model (SEM) M = {f_k, p_{ε_k}}_{k=1,...,d} is a set of functions f_k and densities p_{ε_k}, specifying each variable X_k as a function of some of the other variables and a noise term ε_k (independent of the other noise terms) with density p_{ε_k}. The model M induces a directed graph D, where a directed edge (k, l) is added if the function for X_l directly depends on X_k. We will assume in this paper that M is recursive, i.e. its graph D is actually a DAG. We can write the model equations as

    X_k = f_k( {X_l}_{l ∈ pa_D(k)}, ε_k ),    k = 1, ..., d.

If the functions are additive in the noise, i.e. if

    X_k = f_k( {X_l}_{l ∈ pa_D(k)} ) + ε_k,    k = 1, ..., d,    (2.2)

the model is called an additive noise model (ANM).

We call M := (F, P_ε) a functional model class [5] of dimension d if F ⊆ C^0(R^{d−1}) is a class of functions containing the possible edge functions f_k and P_ε is a class of univariate probability densities containing the possible error densities p_{ε_k}. The joint density of an ANM is of the form (2.1) and thus Markov to its DAG D. Vice versa, we say that D induces a class of joint densities P on X_1, ..., X_d from a functional model class M, where

    P = { ∏_{k=1}^d p_k( x_k − f_k({x_l}_{l ∈ pa_D(k)}) ) : f_k ∈ F, p_k ∈ P_ε }.    (2.3)

Thus P contains all joint densities that can be generated by ANMs from class M with DAG D. The class M is said to be identifiable if the intersection of any two density classes P_1, P_2 induced by distinct graphs D_1, D_2 only contains densities for which there exists a unique graph that is causally minimal.

We assume throughout the paper that the data generating process is an ANM with associated causally minimal DAG D_0, induced density class P_0, and true joint density p_0 ∈ P_0. Causal minimality here essentially means that every edge in D_0 creates a dependency in the joint distribution (i.e. there is an edge from X_l to X_k only if f_k is not constant in x_l).

For the density class, we often consider the weighted Sobolev space of functions W_r^s(R^n, ⟨·⟩_β), which is defined as follows:

    W_r^s(R^n, ⟨·⟩_β) := { f ∈ L_r(R^n) : D^α( f ⟨·⟩_β ) ∈ L_r(R^n) for all |α| ≤ s },

where ⟨x⟩_β = (1 + |x|²)^{β/2} is a polynomial weighting function parametrized by β ∈ R, D^α is the partial derivative operator according to the multi-index α, and r, s are integers at least 1. Note that for β = 0 this is the usual Sobolev space, while for β > 0 this is more restrictive (as the tails get bigger weights), and for β < 0 it is less restrictive. We will mostly be interested in the β < 0 case.

[5] Here we implicitly assume that the model has additive noise.

Penalized maximum likelihood estimation

We now describe our method to learn the true causal structure from data. Suppose we measure d variables, and we have n i.i.d. samples {x_k^j} with j = 1, ..., n and k = 1, ..., d. Let D_1, ..., D_N be the candidate DAGs under consideration and P_1, ..., P_N their induced density classes for some model class M. If M is identifiable, we aim to infer the true DAG D_0 by finding the density class P_0 that contains the true joint density p_0 (if there is more than one such class, we choose the one corresponding to the smallest graph). Of course, we do not know p_0; instead we estimate it by computing best representatives p̂_n^i from each class P_i. These are chosen via nonparametric maximum likelihood:

    p̂_n^i = argmax_{p ∈ P_i} ∑_{j=1}^n log p(x_1^j, ..., x_d^j).

Then, each model is scored with a penalized log-likelihood:

    S_n^i = (1/n) ∑_{j=1}^n log p̂_n^i(x_1^j, ..., x_d^j) − #(edges)_i · a_n,    (2.4)

where a_n controls the strength of the penalty. Taking the maximum over these scores, we get Î_n = argmax_{i=1,...,N} S_n^i, and hence the estimated DAG is D̂_n = D_{Î_n}. We will show in Section 2.3 that this procedure is consistent for a_n proportional to 1/log n, and that therefore D̂_n = D_0 in the large sample limit.

The question arises how to find the maximum likelihood estimators p̂_n^i in each class in this nonparametric setting. We present here an exemplary procedure that has proved useful in practice. To estimate the edge functions of the SEM, we employ a nonparametric regression method.

The error densities are then inferred from the residuals using a density estimation method. The estimated joint density is finally given by the product of the residual densities, in accordance with (2.3). This gives the following three-step procedure for each DAG D_i:

1. For each node k, estimate the residuals ε̂_k by nonparametrically regressing X_k on {X_l}_{l ∈ pa_{D_i}(k)}. If pa_{D_i}(k) = ∅, set ε̂_k = x_k.

2. For each node k, estimate the residual density p̂_{ε_k} from the estimated residuals ε̂_k.

3. Compute the penalized likelihood score

       S_n^i = (1/n) ∑_{j=1}^n ∑_{k=1}^d log p̂_{ε_k}(ε̂_k^j) − #(edges)_i · a_n.

Of course, an exhaustive search over all DAGs is only feasible for small values of d, since the number of DAGs grows super-exponentially with the number of vertices [7], and nonparametric regression in d dimensions is in general ill-posed without structural constraints, due to the curse of dimensionality [8]. The methods used in steps 1 and 2 should be chosen depending on the model class M. Examples are (generalized) additive model regression (GAM) for step 1 and kernel density estimation for step 2.

As an illustration we look at the two-dimensional case, where there are only two variables X_1 and X_2.

[7] The first few values of the number of DAGs N(d) with d nodes are N(2) = 3, N(3) = 25, N(4) = 543, N(5) = 29281, for example.
[8] The latter problem can be dealt with in certain cases, e.g. additive models, where the edge functions are additive in the parental variables.

There are three DAGs, inducing the following models:

    D_1: X_1 → X_2
         X_1 = ε_1,  X_2 = f(X_1) + ε_2
         p_1(x_1, x_2) = p_{X_1}(x_1) p_{X_2|X_1}(x_2 | x_1) = p_{ε_1}(x_1) p_{ε_2}(x_2 − f(x_1))

    D_2: X_1 ← X_2
         X_1 = g(X_2) + ε_1,  X_2 = ε_2
         p_2(x_1, x_2) = p_{X_1|X_2}(x_1 | x_2) p_{X_2}(x_2) = p_{ε_1}(x_1 − g(x_2)) p_{ε_2}(x_2)

    D_3: X_1   X_2 (no edge)
         X_1 = ε_1,  X_2 = ε_2
         p_3(x_1, x_2) = p_{X_1}(x_1) p_{X_2}(x_2) = p_{ε_1}(x_1) p_{ε_2}(x_2)

We do steps 1, 2, and 3 as described above and choose the model with the highest penalized (log-)likelihood score. Comparing this score-based approach with independence-test-based methods, the main difference occurs at step 2, where we estimate the residual densities instead of testing their independence. In terms of complexity, we swap one d-dimensional independence test against d univariate density estimations. Simulations show that this is faster by a factor on the order of 100 with current implementations. However, even though we do not test residual independence directly, it is still the discriminatory property by which to identify the true model. By constructing the densities according to (2.3), we enforce the error terms to be independent in the estimated joint density. If they are not actually independent, the considered model will obtain a poor score. Thus, we are searching for the best fitting densities where the errors are independent.
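For the two-variable case, the whole procedure fits into a few lines of R. The sketch below uses LOESS for step 1 and a kernel density estimate in place of the logspline estimator for step 2; the simulated data-generating model is an illustrative choice, and the penalty rate a_n = 1/√n is the one also used later in the simulation section.

    set.seed(3)
    n  <- 300
    x1 <- rnorm(n)
    x2 <- x1 + x1^3 + 0.5 * rnorm(n)   # illustrative ANM with true direction X1 -> X2

    # Step 2 helper: mean log-density of residuals under a kernel density estimate.
    log_dens <- function(r) {
      d <- density(r)
      mean(log(approx(d$x, d$y, xout = r, rule = 2)$y))
    }

    a_n <- 1 / sqrt(n)   # per-edge penalty

    s1 <- log_dens(x1) + log_dens(resid(loess(x2 ~ x1))) - 1 * a_n   # D1: X1 -> X2
    s2 <- log_dens(resid(loess(x1 ~ x2))) + log_dens(x2) - 1 * a_n   # D2: X1 <- X2
    s3 <- log_dens(x1) + log_dens(x2)                                # D3: no edge

    which.max(c(s1, s2, s3))   # the forwards model D1 should obtain the highest score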

2.3. Theoretical Results

We now show that our method is consistent, i.e. that it will identify the true underlying DAG given enough samples. In the following, P_D denotes the induced density class of DAG D. We make the following assumptions:

(1) Identifiability: The data {x_k^j}_{k=1,...,d; j=1,...,n} are i.i.d. realizations (over j = 1, ..., n) of an identifiable structural equation model with induced d-dimensional DAG D_0. In particular, the SEM can be the additive noise model (2.2) with nonlinear edge functions f_k or non-Gaussian noise variables [9] ε_k for all k = 1, ..., d (Peters et al., 2011b, Lemma 1). There are no hidden variables, i.e. the noise terms are jointly independent.

(2) Causal Minimality: There is no proper subgraph D′ of D_0 such that p_0 is Markov with respect to D′.

(3) Smoothness of log-densities: For all DAGs D, the log-densities of P_D (restricted to their respective support) are elements of a bounded weighted Sobolev space. That is, there exist r ≥ 1, s > d, β < 0, and a constant C > 0 such that

    ‖ D^α( ⟨·⟩_β 1{p > 0} log p ) ‖_r < C    for all p ∈ P_D and all |α| ≤ s,

where ‖·‖_r is the usual L_r-norm.

(4) Moment condition for densities: For all DAGs D there exists γ > s − d/r such that

    ‖ p^γ ⟨·⟩_β ‖_r < C    for all p ∈ P_D,

where r, s, d, and β are determined by (3).

(5) Uniformly bounded variance of log-densities: For all DAGs D and all p_0 ∈ P_D there exists K > 0 such that

    sup_{p ∈ P_D} var_{p_0}( log p(X_1, ..., X_d) ) < K.

(6) Closedness of density classes: For all DAGs D the induced density class P_D is a closed set, with the topology given by the Kullback-Leibler (KL) divergence

    D_KL(p ‖ q) = ∫ p(x) log( p(x) / q(x) ) dx.

[9] Excluding a set of exceptions of measure zero (Hoyer et al., 2009, Theorem 1).

The first two assumptions concern the general model setup and ensure identifiability (i.e. non-overlapping induced density classes). (1) requires the data to come from an ANM that is identifiable due to nonlinearity or non-Gaussianity, as in Hoyer et al. (2009). (2) ensures there are no superfluous edges in the true DAG, i.e. the true model is the most parsimonious one fitting the data. The last four assumptions are technical and are used to prove consistency of the penalized maximum likelihood estimator. (3) essentially requires the log-densities to be smooth. (4) requires the densities to have some (at least fractional) finite moments. (5) requires the log-densities, for every underlying density p_0, to have uniformly bounded second moments. Finally, (6) guarantees the existence of the maximizers of the likelihood and the negative information entropy in each class. Furthermore, it is needed to ensure that the true density p_0 has positive KL distance from all wrong density classes. Note that the latter statement alone would suffice to show consistency, since all statements can be written in terms of the suprema of likelihood and negative entropy instead of their actual maximizers. However, for better comprehensibility we chose the present formulation with the slightly stronger assumption.

Under these assumptions, the penalized maximum likelihood estimator is consistent. We show this by proving that the probability of the true model obtaining a smaller score than any other model vanishes in the large sample limit.

Theorem 1. Assume (1)–(6). Let S_n^i be the penalized likelihood score of DAG D_i, given by

    S_n^i = (1/n) ∑_{j=1}^n log p̂_n^i(x_1^j, ..., x_d^j) − #(edges)_i · a_n,

where #(edges)_i is the number of edges in DAG D_i, and a_n = 1/log n. Denote by i_0 the index of the true DAG D_0 = D_{i_0}. Then we have

    P( S_n^{i_0} ≤ S_n^i ) → 0    as n → ∞, for all i ≠ i_0.

The proof relies on entropy methods and is presented in the appendix. In practice the 1/log n penalty rate might be too large.

We used a_n = 1/√n for some simulations in Section 2.4 (where the noise is Gaussian), which led to reasonably good performance for the finite sample size n = 300. Moreover, under stronger assumptions we have:

Remark 1. When replacing (5) with the stronger assumption of sub-exponential tails of log p(X_1, ..., X_d), we can improve the penalty rate a_n in Theorem 1 from 1/log n to c·n^{−1/(2+d/s)}, for some c > 0 sufficiently large.

2.4. Numerical Results

In this section we present simulation results to show that our method works under controlled conditions. In each case, the data generating process is an additive noise model with acyclic graph structure. We first reproduce some results from an earlier paper by Hoyer et al. (2009), where the model involves just two variables and is parametrized by two parameters, controlling linearity and Gaussianity respectively. Then, we extend this setup to a slightly more general class of models. Finally, we look at cases with more than two variables.

In our implementation we use (generalized) additive model regression (GAM, see Hastie and Tibshirani (1986)) or local polynomial regression (LOESS, see Cleveland (1979)) for step 1 and logspline density estimation (see Kooperberg and Stone (1991)) or kernel density estimation for step 2. For models with more than two variables, penalization becomes important. We used a factor of a_n = 1/√n instead of the very severe 1/log n. This can be justified since in the relevant simulations the noise is Gaussian and the log-densities can be assumed to be sub-exponential. In this case, the faster rate can be used (see Remark 1). All computations were carried out in the statistical computing language R (using the packages mgcv and logspline), and the code is available on request from the authors.

Identifiability depending on Linearity and Gaussianity

Hoyer et al. (2009) illustrate their method with a two-dimensional ANM of the form

    X_1 = ε_1,
    X_2 = X_1 + b·X_1^3 + ε_2,

with the parameter b ranging from −1 to 1, thus controlling the linearity of the model. The noise terms ε_1, ε_2 are transformed normal random variables:

    ε_k = sgn(ν_k)·|ν_k|^q,    ν_k i.i.d. N(0, 1),

where the parameter q ranges from 0.5 to 2 and thus controls Gaussianity. The true direction M_1: X_1 → X_2 cannot be identified with traditional methods (e.g. the PC algorithm), since the backwards model M_2: X_1 ← X_2 entails precisely the same conditional independence relations (none) and thus belongs to the same Markov equivalence class. If b = 0 and q = 1, there exists a backwards model entailing the same joint density. As soon as we move away from this point, however, the model becomes identifiable (Hoyer et al., 2009). We confirm this numerically, showing that our method performs as expected in this setting.

We discretize the parameter space (b, q) ∈ [−1, 1] × [0.5, 2], and for each grid point we repeat the simulation 1000 times, with n = 300 samples per trial. We then count the number of times the backwards model gets wrongly chosen by the method [10], and this false decision rate serves as our measure of the quality of the method. As can be seen in Figure 2.1, the false decision rate peaks around (b, q) = (0, 1) with around 50% wrong decisions, corresponding to random guessing. Away from this region it quickly drops to zero. In this setting the regressions were done using LOESS and the density estimations using logsplines.

[Figure 2.1: False decision rates for a two-dimensional ANM with two parameters b and q, controlling linearity and Gaussianity (n = 300). For b = 0 the model is linear; for q = 1 the noise is Gaussian. Panels: (a) full b–q grid, (b) b fixed, (c) q fixed.]

[10] I.e. when the likelihood score of the backwards model is lower than that of the forwards model.
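A short R sketch of this data-generating process for a single (b, q) grid point (the decision itself would reuse the scoring steps of Section 2.2); the chosen values of b and q are arbitrary examples.

    # One sample of size n from the two-variable model:
    # X1 = eps1, X2 = X1 + b*X1^3 + eps2, with eps_k = sgn(nu_k)*|nu_k|^q, nu_k ~ N(0,1).
    gen_pair <- function(n, b, q) {
      trans_noise <- function(m) { nu <- rnorm(m); sign(nu) * abs(nu)^q }
      x1 <- trans_noise(n)
      x2 <- x1 + b * x1^3 + trans_noise(n)
      data.frame(x1, x2)
    }

    dat <- gen_pair(300, b = 0.5, q = 1.5)   # nonlinear and non-Gaussian: identifiable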

Random Edge Functions

We now generalise the setup of the scenario from the previous section by allowing a bigger function class for the edge function. Specifically, we randomly generate functions by sampling a random path from a Wiener process and smoothing it with cubic splines [11]. To measure their nonlinearity we use the normalised L_2-difference between the function and its best linear approximation on the interval [−1, 1], as described in Emancipator and Kroll (1993). A number of randomly generated functions with different nonlinearity values are shown in Figure 2.2.

We again choose a uniform grid of nonlinearity values (in the interval [0, 0.4]) and, for each grid point, generate 100 random functions. With each function we perform 100 simulations and average the results. The noise is standard Gaussian in this setting. In Figure 2.2 we see the results for a small sample (n = 300) and a large sample (n = 1500) case. The findings are analogous to the simple cubic model: the false decision rate decreases with the nonlinearity of the edge function and the sample size. Again, the regressions were done using LOESS and the density estimations using logsplines.

[11] A Wiener path (random normal increments) is sampled on a 1000-point grid spanning [−1, 1], and the resulting vector is rescaled to an interval of length 2 and subsequently smoothed using cubic splines. The resulting functions are linear outside [−1, 1] and nonlinear inside.
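A rough R reimplementation of the recipe in the footnote (details such as the rescaling and the smoothing are guesses, and the nonlinearity measure below is only in the spirit of Emancipator and Kroll (1993)):

    set.seed(4)

    # Sample a Wiener path (random normal increments) on a 1000-point grid over [-1, 1],
    # rescale its values to an interval of length 2, and smooth it with cubic splines.
    grid <- seq(-1, 1, length.out = 1000)
    path <- cumsum(rnorm(1000))
    path <- 2 * (path - min(path)) / diff(range(path)) - 1
    fit  <- smooth.spline(grid, path)
    f    <- function(x) predict(fit, x)$y     # a randomly generated edge function

    # Nonlinearity: normalised L2 distance between f and its best linear
    # approximation (least squares fit) on [-1, 1].
    fx  <- f(grid)
    lin <- fitted(lm(fx ~ grid))
    sqrt(mean((fx - lin)^2)) / sqrt(mean(fx^2))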

[Figure 2.2: (a) False decision rates with randomly sampled edge functions and Gaussian noise (n = 300 and n = 1500); the rate decreases with the nonlinearity of the functions. (b) Examples of randomly generated functions, where the parameter s controls nonlinearity.]

Larger Networks and Thresholding

In a practical situation the reliability of any method invariably depends on whether its assumptions are met, as well as on some other factors. In our case this would include the nonlinearity of the edge functions, the non-Gaussianity of the noise, the sample size, and the number of nodes. It would be desirable to have some criterion indicating that there is insufficient information to make a decision. While this is hard to make concrete, a good first heuristic seems to be the separation of the best-scoring model from the rest. We concretely look at the ratio of the smallest (Δ_1) and the largest (Δ_2) score difference (see Figure 2.3b). If this ratio is smaller than some threshold t, we make no decision (no selection of a model). The effect of this can be seen in Figure 2.3a. Starting from a full DAG with 3 nodes as the ground truth, we randomly generate 100 different sets of nonlinear [12] edge functions, and for each set of edge functions we generate 100 data sets with standard Gaussian noise of sample size n = 300.

[12] With nonlinearity values in [0.39, 0.4].

With each data set we run an exhaustive search over all candidate models and, if making a decision after thresholding, compute the structural Hamming distance (SHD) between the best-scoring DAG and the ground truth. Comparing the thresholds t = 0 and t = 0.01, the false decision rate falls from 3.9% to 2.4%, while in 3.1% of the cases no decision is made.

[Figure 2.3: (a) Structural Hamming distance between the best-scoring DAG and the ground truth for a 3-node simulation with (t = 0.01) and without (t = 0) thresholding. (b) Illustration of thresholding for a single simulation run: let s_1, ..., s_N be the (increasingly) ordered scores; then Δ_1 = s_1/s_2 and Δ_2 = s_1/s_N.]

We also look at two simulation settings suggested in Peters et al. (2011b), where the graph consists of 4 nodes and the edge functions are nonlinear but parametrized by 4 and 5 parameters respectively. In both cases, nonlinear1 and nonlinear2, 100 sets of parameters are drawn from a uniform distribution and then data (with a sample size of n = 400) are generated. Our method identifies the correct DAG in 96 / 97 out of the 100 cases for nonlinear1/2 (in the other cases, there is one additional edge). This certainly improves upon the results reported in Peters et al. (2011b) (86 correct decisions in both cases). In all of these multivariate settings, we used GAM for regression and logsplines for density estimation.

2.5. Real Data

To determine the performance on real-world datasets, we apply our method to so-called cause-effect pairs. These are bivariate datasets where the true causal direction is known. An example would be the altitude and the average temperature of weather stations. Mooij and Janzing (2010) describe 8 such pairs and compare several methods that were submitted as part of the Causality Pot-Luck Challenge. Our method identifies 7 out of the 8 pairs correctly^13, thus beating all other compared methods except Zhang and Hyvärinen (2010), who take into account post-nonlinear additive noise.

We next consider the extended collection of cause-effect pairs, which is available online. This currently comprises 86 datasets, 81 of which are bivariate. Using our method on these 81 bivariate datasets, we identify the true model in 66% of the cases^14. In Janzing et al. (2012) a subset of these datasets was used to compare various causal inference methods. Running our method on those datasets, it compares well with the other methods (see Table 2.1), being slightly better than independence testing (AN) and outperforming the LiNGAM method. In both of these settings we used LOESS and kernel density estimation.

2.6. Conclusions

We presented a new fully nonparametric likelihood score-based method for causal inference in nonlinear or non-Gaussian ANMs. We proved consistency of the penalized maximum likelihood estimator for finding the correct model. We showed via simulation studies that our method works well in practice when the ground truth is an ANM with sufficiently nonlinear edge functions or non-Gaussian error terms. Our method compares favourably to other causal inference procedures on both simulated and real-world data.

13 This corresponds to a p-value of under the random guessing null hypothesis.
14 This corresponds to a p-value of under the random guessing null hypothesis.

Method     SCL    AN     LiNGAM   PNL    IGCI   GPI
Accuracy   66%    63%    58%      68%    75%    70%

Table 2.1.: Success rates of different causal inference methods on cause-effect pairs at a decision rate of 100%. SCL = Score-based Causal Learning (our method), AN = Additive Noise with independence testing, PNL = Post-Nonlinear, IGCI = Information-Geometric Causal Inference, GPI = Gaussian Process Inference. All values except SCL taken from Janzing et al. (2012). All datasets were subsampled three times (if n > 500), and the results were averaged.

As a major open challenge, the current approach of exhaustively searching through the whole model space becomes computationally infeasible for more than a handful of variables. Since our method is score-based and the scoring criterion is local (i.e., decomposable), it is straightforward to implement a greedy algorithm, although there will be no guarantee of finding a global optimum.
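As an illustration of the bivariate scoring used for the cause-effect pairs in Section 2.5, the following R sketch scores both causal directions with a LOESS regression and kernel density estimates of the cause and the residuals. It is a simplified illustration under these assumptions, not the thesis implementation.

score.direction <- function(cause, effect) {
  fit <- loess(effect ~ cause)                    # nonparametric regression of effect on cause
  res <- residuals(fit)
  logdens <- function(z) {                        # kernel density estimate evaluated at the data
    d <- density(z)
    log(approx(d$x, d$y, xout = z, rule = 2)$y)
  }
  sum(logdens(cause)) + sum(logdens(res))         # log-likelihood of the additive noise model
}

decide <- function(x, y)
  if (score.direction(x, y) > score.direction(y, x)) "x -> y" else "y -> x"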

Appendix 2.A. Consistency Proof

The proof heavily relies on entropy methods and empirical process theory. For a good overview of the necessary material we refer to van de Geer (2000) or van der Vaart (1998). For an overview of Sobolev and related function spaces we refer to Triebel (1983).

Throughout this section we will adopt the following notation for taking expectations of some random variable f with respect to a distribution Q (following van de Geer (2000)): Qf := ∫ f dQ. In particular, this means we will write expectations and means as

P f = E[f(X)],    P_n f = (1/n) Σ_{j=1}^n f(X_j),

where P is the true distribution with density p^0, f : R^d -> R is some function, X is a vector of random variables (one corresponding to each node) with distribution P, {X_j}_{j=1,...,n} are independent copies of X, and P_n is the empirical distribution (placing weight 1/n on each X_j). With this notation we can write the maximum likelihood estimator p̂^i_n and the entropy minimizer p^i in class P^i (which exist by assumption (6) but need not be unique) as:

p̂^i_n = arg max_{p ∈ P^i} P_n log p,    (2.5)
p^i = arg max_{p ∈ P^i} P log p.    (2.6)

Note that the true density p^0 minimizes the information entropy over the complete density space ∪_{i=1}^N P^i, since the Kullback-Leibler divergence P log(p^0/p) is positive for all densities p ≠ p^0.

One of the building blocks of the proof of Theorem 1 is a uniform law of large numbers (ULLN) for the classes of log-densities:

sup_{p ∈ P^i} |(P_n - P) log p| ->_P 0  as n -> ∞,  for all i.

To show this, an entropy argument is used. We first define the bracketing entropy of a function space. Let G be a set of functions from R^d to R. Two functions g^L, g^U : R^d -> R (not necessarily in G) form an ε-bracket for some g ∈ G if g^L <= g <= g^U and ||g^L - g^U||_{1,μ} < ε, where ||·||_{1,μ} is the weighted L1-norm, i.e. ||f||_{1,μ} = ∫ |f(x)| μ(x) dx. Suppose {g^L_i, g^U_i}_{i=1,...,N_[]} is the smallest set such that for every g ∈ G there is an i such that g^L_i, g^U_i form an ε-bracket for g, where N_[] denotes the number of such pairs. Then H_[](ε, G, ||·||_{1,μ}) := log N_[] is called the bracketing entropy of G.

The following result connects the bracketing entropy H_[](ε, G, ||·||_{1,p^0}) with respect to the L1-norm weighted with the true density p^0 and the uniform convergence of the empirical process (P_n - P)g. Note that here and throughout this section we use the notation "a(ε) ≲ b(ε)" as shorthand for "a(ε) <= c b(ε) for all ε > 0, for some constant c not depending on ε".

Lemma 1. Suppose that:

(i) there exists 0 <= α < 1 such that H_[](ε, G, ||·||_{1,p^0}) ≲ ε^{-α} for all ε > 0, and
(ii) there exists K such that var(g(X_1, ..., X_d)) < K for all g ∈ G.

Then G satisfies the ULLN:

P( sup_{g ∈ G} |(P_n - P) g| > δ_n ) -> 0  as n -> ∞,

where δ_n = c / log n for some c > 0.

Proof. We first show that it suffices to look at the supremum over the bracketing functions. Let g ∈ G and g^L_i, g^U_i be its δ_n-brackets. We then have

(P_n - P) g < (P_n - P) g^U_i + δ_n   and   (P_n - P) g > (P_n - P) g^L_i - δ_n.

So we have

(P_n - P) g < max_{i=1,...,N_[]} ( (P_n - P) g^L_i, (P_n - P) g^U_i ) + δ_n,

and hence

sup_{g ∈ G} (P_n - P) g < max_{g ∈ {g^L_i, g^U_i}_i} (P_n - P) g + δ_n.

Now

P( sup_{g ∈ G} (P_n - P) g > 2 δ_n ) <= P( max_{g ∈ {g^L_i, g^U_i}_i} (P_n - P) g > δ_n )
  <= 2 N_[](δ_n) max_{g ∈ {g^L_i, g^U_i}_i} P( (P_n - P) g > δ_n )
  ≲ exp(δ_n^{-α}) K / (n δ_n^2),    (2.7)

where the last line follows from Chebyshev's inequality. Substituting for δ_n gives

P(...) ≲ log^2 n · exp(c^{-α} log^α n - log n) -> 0  as n -> ∞.

Note that if we replace condition (ii) with the assumption that the g(X_1, ..., X_d) are sub-exponential (as in Remark 1), we apply the sub-exponential tail bound (see Bühlmann and van de Geer (2011, Lemma 14.9) for example) instead of Chebyshev's inequality and obtain exp(δ_n^{-α} - n δ_n^2 / const.) instead of (2.7), which converges to zero for δ_n = c n^{-1/(2+α)}, for c > 0 sufficiently large.

Lemma 1 shows that a sufficient condition for the ULLN is finite bracketing entropy. To this end, we make use of the following result:

Lemma 2 (Nickl and Pötscher (2007, Theorem 1)). Suppose G is a (non-empty) bounded subset of the weighted Sobolev space W^s_p(R^d, <x>^β) for some β < 0. Suppose there exists γ > s - d/p > 0 such that the moment condition

|| <x>^{γ-β} ||_{1,μ} = ∫ <x>^{γ-β} μ(x) dx < ∞

holds for some Borel measure μ on R^d. Then:

H_[](ε, G, ||·||_{1,μ}) ≲ ε^{-d/s}.

The relevant sets of functions G in this context are the log-densities of each class, i.e. {1{p > 0} log p | p ∈ P^i}, with the relevant Borel measure μ being the true density p^0.

Essentially, the idea of the proof of Theorem 1 is to show that the maximum log-likelihood in each induced density class converges to the minimal entropy. For non-overlapping models (e.g. X_1 -> X_2 and X_1 <- X_2), the minimal entropy will be different in each class (with the minimum occurring in the true model class), and the likelihood will eventually pick up on this difference. Since the penalty term vanishes asymptotically, an ever so small difference in entropy will differentiate the true model class from the others. For overlapping (e.g. hierarchical) models, the minimal entropy can occur in more than one class. In this case the penalty term picks out the most parsimonious model (which is the true model according to the Causal Minimality assumption).

Note that the penalty 1/log n is quite large compared with e.g. the BIC penalty (log n / n). This is due to the slow convergence of the maximum likelihood to the minimal entropy (Lemmas 3 and 1). If the penalty vanishes too quickly, it will be drowned out by the noise in the likelihood and have no effect. The convergence can be improved (and thus the penalty relaxed) when making stronger assumptions on the distributions, e.g. sub-Gaussian tails.

The following lemma shows convergence of the maximum log-likelihood to the minimal entropy in each class, given that a ULLN holds.

Lemma 3. Suppose that a ULLN for the classes log P^i holds with convergence rate δ_n, i.e.

P( sup_{p ∈ P^i} |(P_n - P)(1{p > 0} log p)| > δ_n ) -> 0  as n -> ∞.

Then

P( |P_n log p̂^i_n - P log p^i| > δ_n ) -> 0  as n -> ∞.
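The size difference between the two penalties is easy to see numerically; a quick R illustration (per edge, on the same 1/n-scaled score):

n <- c(1e2, 1e3, 1e4, 1e5)
rbind(penalty.used = 1 / log(n),   # the 1/log(n) penalty required by the slow convergence
      penalty.bic  = log(n) / n)   # a BIC-type penalty, much smaller for large n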

Proof. By the definition of the MLE (2.5) we have:

P_n log p̂^i_n >= P_n log p^i = P log p^i + (P_n - P) log p^i,

i.e.

P_n log p̂^i_n - P log p^i >= (P_n - P) log p^i.    (2.8)

Let P^i_n be the restriction of P^i to densities whose support contains the data, i.e. P^i_n = {p ∈ P^i | supp(p) ⊇ {X_1, ..., X_n}}. Note that the maximum log-likelihood as well as the minimum entropy are the same over P^i and P^i_n, since densities with support not including the data will yield values of -∞. So we also have:

P_n log p̂^i_n = max_{p ∈ P^i} P_n log p = max_{p ∈ P^i_n} P_n log p
  = max_{p ∈ P^i_n} ( P log p + (P_n - P) log p )
  <= P log p^i + sup_{p ∈ P^i_n} (P_n - P) log p,

i.e.

P_n log p̂^i_n - P log p^i <= sup_{p ∈ P^i_n} (P_n - P) log p.

This together with (2.8) yields:

|P_n log p̂^i_n - P log p^i| <= max( |(P_n - P) log p^i|, |sup_{p ∈ P^i_n} (P_n - P) log p| )
  <= max( |(P_n - P) log p^i|, sup_{p ∈ P^i_n} |(P_n - P) log p| )
  <= sup_{p ∈ P^i} |(P_n - P)(1{p > 0} log p)|.

We thus have:

P( |P_n log p̂^i_n - P log p^i| > δ_n ) <= P( sup_{p ∈ P^i} |(P_n - P)(1{p > 0} log p)| > δ_n ),

which converges to zero as n -> ∞ by assumption.

Finally, before proving Theorem 1, we show the following useful lemma.

Lemma 4. Let a, b, a', b' ∈ R and ε > 0. If one of the following holds:

1. a - b > ε and a' - b' <= 0, or
2. a - b < ε and a' - b' >= 2ε,

then |a - a'| > ε/2 or |b - b'| > ε/2.

Proof. Assume (i). Then we have

ε = ε - 0 <= (a - b) - (a' - b') = (a - a') - (b - b') <= |a - a'| + |b - b'|,

and the result follows. Similarly for (ii):

ε = 2ε - ε <= (a' - b') - (a - b) = (a' - a) - (b' - b) <= |a - a'| + |b - b'|.

We can now prove the main theorem.

Proof of Theorem 1. We will make repeated use of Lemma 3. For that matter, note that assumptions (3), (4), and (5), together with Lemmas 1 and 2 (taking μ = p^0), satisfy the sufficient conditions. Assumption (6) ensures the existence of p̂^i_n and p^i as defined in (2.5) and (2.6).

Let i ≠ i_0. We differentiate two cases: i) where P^i includes the true density p^0, and ii) where it does not. Let

δ_n = (#(edges)_i - #(edges)_{i_0}) · 1/log n

denote the difference of the penalties in the two scores.

Case i). p^0 ∈ P^i, which implies p^i = p^0. Assumptions (1) and (2) together with Theorem 2 in Peters et al. (2011b) guarantee identifiability of the true graph. In particular this means that in this case P^i must correspond to a graph containing the true graph. Hence

#(edges)_i > #(edges)_{i_0}, i.e. δ_n > 0. We then have:

P( S^{i_0}_n <= S^i_n ) <= P( P_n log p̂^i_n - P_n log p̂^{i_0}_n > δ_n / 2 )
  <= P( |P_n log p̂^{i_0}_n - P log p^0| > δ_n / 4  or  |P_n log p̂^i_n - P log p^i| > δ_n / 4 )
  <= P( |P_n log p̂^{i_0}_n - P log p^0| > δ_n / 4 ) + P( |P_n log p̂^i_n - P log p^i| > δ_n / 4 )
  -> 0  as n -> ∞,

where the second line follows from p^i = p^0 and Lemma 4 (first case), and the convergence in the last line follows from Lemma 3.

Case ii). p^0 ∉ P^i, which implies P log p^0 > P log p^i. Hence there exists δ > 0 such that P log p^0 >= P log p^i + 4δ. Let N > 0 be such that δ_n < δ for all n >= N (possible since δ_n -> 0). Then we have, for n >= N,

P( S^{i_0}_n <= S^i_n ) = P( P_n log p̂^{i_0}_n - P_n log p̂^i_n <= δ_n )
  <= P( P_n log p̂^{i_0}_n - P_n log p̂^i_n < δ )
  <= P( |P_n log p̂^{i_0}_n - P log p^0| > δ  or  |P_n log p̂^i_n - P log p^i| > δ )
  <= P( |P_n log p̂^{i_0}_n - P log p^0| > δ ) + P( |P_n log p̂^i_n - P log p^i| > δ )
  -> 0  as n -> ∞,

where the third line follows from Lemma 4 (second case), and the convergence in the last line again follows from Lemma 3.


Chapter 3. Structure Learning with Bow-free Acyclic Path Diagrams^1

We consider the problem of structure learning for bow-free acyclic path diagrams (BAPs). BAPs can be viewed as a generalization of linear Gaussian DAG models that allow for certain hidden variables. We present a first method for this problem using a greedy score-based search algorithm. We also investigate some distributional equivalence properties of BAPs, which are used in an algorithmic approach to compute (nearly) equivalent model structures, allowing to infer lower bounds of causal effects. The application of our method to some datasets reveals that BAP models can represent the data much better than DAG models in these cases.

3.1. Introduction

We consider learning the causal structure among a set of variables from observational data. In general, the data can be modelled with a structural equation model (SEM) over the observed and unobserved variables, which expresses each variable as a function of its direct causes and a noise term, where the noise terms are assumed to be mutually independent. The structure of the SEM is visualized as a directed graph, with vertices representing variables and edges representing direct causal relationships. We assume the structure to be recursive

1 This chapter is heavily based on the preprint Nowzohour et al. (2015).

(acyclic), which results in a directed acyclic graph (DAG). DAGs can be understood as models of conditional independence, and many structure learning algorithms use this to find all DAGs which are compatible with the observed conditional independencies (Spirtes et al., 1993). Often, however, not all relevant variables are observed. The resulting marginal distribution over the observed variables might still satisfy some conditional independencies, but in general these will not have a DAG representation (Richardson and Spirtes, 2002). Also, there generally are additional constraints resulting from the marginalization of some of the variables (Evans, 2014; Shpitser et al., 2014).

In this paper we consider a model class which can accommodate certain hidden variables. Specifically, we assume that the graph over the observed variables is a bow-free acyclic path diagram (BAP). This means it can have directed as well as bidirected edges (with the directed part being acyclic), where the directed edges represent direct causal effects, and the bidirected edges represent hidden confounders. The bow-freeness condition means there cannot be both a directed and a bidirected edge between the same pair of variables. The BAP can be obtained from the underlying DAG over all (hidden and observed) variables via a latent projection operation (Pearl, 2009) (if the bow-freeness condition admits this). We furthermore assume a parametrization with linear structural equations and Gaussian noise, where two noise terms are correlated only if there is a bidirected edge between the two respective nodes.

For many practical purposes, it is beneficial to consider this restricted class of hidden variable models. Such a restricted model class, if not heavily misspecified, results in a smaller distributional equivalence class, and estimation is expected to be more accurate than for more general hidden variable methods like FCI (Spirtes et al., 1993), RFCI (Colombo et al., 2012), or FCI+ (Claassen et al., 2013).

The goal of this paper is structure learning with BAPs, that is, finding the best set of BAPs given some observational data. Just like in other models, there is typically an equivalence class of BAPs that are statistically indistinguishable, so a meaningful structure search result should represent this equivalence class. We propose a penalized likelihood score that is greedily optimized and a heuristic algorithm (supported by some theoretical results) for finding equivalent models once an optimum is found. This method is the first of its kind for BAP

models.

Example of a BAP

Consider the situation shown in Figure 3.1a, where we observe variables X_1, ..., X_4, but do not observe H_1. The only (conditional) independency over the observed variables is X_1 ⊥ X_3 | X_2, which is also represented in the corresponding BAP in Figure 3.1b. The parametrization of this BAP would be

X_1 = ε_1
X_2 = B_21 X_1 + ε_2
X_3 = B_32 X_2 + ε_3
X_4 = B_43 X_3 + ε_4

with (ε_1, ε_2, ε_3, ε_4)^T ~ N(0, Ω) and

Ω = [ Ω_11   0      0      0
      0      Ω_22   0      Ω_24
      0      0      Ω_33   0
      0      Ω_24   0      Ω_44 ].

Hence the model parameters in this case are B_21, B_32, B_43, Ω_11, Ω_22, Ω_33, Ω_44, and Ω_24. An example of a graph that is not a BAP is shown in Figure 3.1c.

Challenges

The main challenge, like with all structure search problems in graphical modelling, is the vastness of the model space. The number of BAPs grows super-exponentially. Exhaustively scoring all BAPs and finding the global score optimum is typically computationally infeasible.

Another major challenge, specifically for our setting, is the fact that a graphical characterization of the (distributional) equivalence classes for BAP models is not yet known. In the (unconstrained) DAG case, for example, it is known that models are equivalent if and only if they

Figure 3.1.: (a) DAG with hidden variable H_1, (b) resulting BAP over the observed variables X_1, ..., X_4 with annotated edge weights, and (c) resulting graph if X_3 is also not observed, which is not a BAP.

share the same skeleton and v-structures (Verma and Pearl, 1991). A similar result is not known for BAPs (or the more general acyclic directed mixed graphs). This makes it hard to traverse the search space efficiently, since one cannot search over the equivalence classes (like the greedy equivalence search for DAGs (Chickering, 2002)). It also makes it difficult to evaluate simulation results, since the ground truth BAP and the found solution might not coincide yet be equivalent.

Contributions

We provide the first structure learning algorithm for BAPs. It is a score-based algorithm and uses greedy hill climbing to optimize a penalized likelihood score. By decomposing the score over the bidirected connected components of the graph and caching the score of each component we are able to achieve a significant computational speedup. To mitigate the problem of local optima, we perform many random restarts of the greedy search.

We propose to approximate the distributional equivalence class of a BAP by using a greedy strategy for likelihood scoring. If two BAPs are similar with respect to their likelihoods within a tolerance, they should be treated as statistically indistinguishable and hence as belonging to the same class of (nearly) equivalent BAPs. Based on such greedily

computed (near) equivalence classes, we can then infer bounds of total causal effects, in the spirit of Maathuis et al. (2009, 2010).

We present some theoretical results towards equivalence properties in BAP models, some of which generalize to acyclic path diagrams. We also provide a proof of Wright's path tracing formula that does not assume a proper model parametrization. Furthermore, we developed a Markov Chain Monte Carlo method for uniformly sampling BAPs based on ideas from Kuipers and Moffa (2015).

We obtain promising results on simulated data sets despite the challenges listed above. Comparing the maximal penalized likelihood scores between BAPs and DAGs on real datasets shows a clear advantage of BAP models.

Related Work

There are two main research communities that intersect at this topic. On the one side there are the path diagram models, going back to Wright (1934) and then being mainly developed in the behavioral sciences (Jöreskog, 1970; Duncan, 1975; Glymour and Scheines, 1986; Jöreskog, 2001). In this setting there is always a parametric model, usually assuming linear edge functions and Gaussian noise. In the most general formulation, the graph over the observed variables is assumed to be an acyclic directed mixed graph (ADMG), which can have bows. While in general the parameters for these models are not identified, Drton et al. (2011) give necessary and sufficient conditions for global identifiability. Complete necessary and sufficient conditions for the more useful almost everywhere identifiability remain unknown. BAP models are a useful subclass, since they are almost everywhere identified (Brito and Pearl, 2002). Drton et al. (2009) provided an algorithm, called residual iterative conditional fitting (RICF), for maximum likelihood estimation of the parameters for a given BAP.

On the other side there are the non-parametric hidden variable models, which are defined as marginalized DAG models (Pearl, 2009). The marginalized distributions are constrained by conditional independencies, as well as additional equality and inequality constraints (Evans,

2014). When just modelling the conditional independence constraints, the class of maximal ancestral graphs (MAGs) is sufficient (Richardson and Spirtes, 2002). Shpitser et al. (2014) have proposed the nested Markov model using ADMGs to also include the additional equality constraints. Finally, mDAGs can be used to model all resulting constraints (Evans, 2014). With each additional layer of constraints, learning the structure from data becomes more complicated, but at the same time more available information is utilized and a possibly more detailed structure can be learned.

Ancestral BAP models coincide with the Gaussian parametrization of MAGs (Richardson and Spirtes, 2002, Section 8). However, in general BAPs need to be neither maximal nor ancestral and can model additional constraints. They are easier to interpret as hidden variable models. This can be seen when comparing the BAP in Figure 3.1b with the corresponding MAG. The latter would have an additional edge between X_1 and X_4, since there is no (conditional) independency of these two variables. As can be verified, the BAP and the MAG in this example are not distributionally equivalent, since the former encodes additional non-independence constraints.

Structure search for MAGs can be done with the FCI (Spirtes et al., 1993), RFCI (Maathuis et al., 2009), or FCI+ (Claassen et al., 2013) algorithms. Silva and Ghahramani (2006) propose a fully Bayesian method for structure search in linear Gaussian ADMGs, sampling from the posterior distribution using an MCMC approach. Shpitser et al. (2012) employ a greedy approach to optimize a penalized likelihood over ADMGs for discrete parametrizations.

Outline of this Paper

This paper is structured as follows. In Section 3.2 we give an in-depth overview of the model and its estimation from data, as well as some distributional equivalence properties. In Section 3.3 we present the details of our greedy algorithm with various computational speedups. In Section 3.4 we present empirical results on simulated and real datasets. All proofs as well as further theoretical results and justifications can be found in the Appendix.

3.2. Model and Estimation

3.2.1. Graph Terminology

Let X_1, ..., X_d be a set of random variables and V = {1, ..., d} be their index set. The elements of V are also called nodes or vertices. A mixed graph or path diagram G on V is an ordered tuple G = (V, E_D, E_B) for some E_D, E_B ⊆ V × V. If (i, j) ∈ E_D, we say there is a directed edge from i to j and write i -> j ∈ G. If (i, j) ∈ E_B, we must also have (j, i) ∈ E_B, and we say there is a bidirected edge between i and j and write i <-> j ∈ G. The set pa_G(i) := {j | j -> i ∈ G} is called the parents of i. This definition extends to sets of nodes S in the obvious way: pa_G(S) := ∪_{i ∈ S} pa_G(i). The in-degree of i is the number of arrowheads at i. If V' ⊆ V, E'_D ⊆ E_D ∩ (V' × V'), and E'_B ⊆ E_B ∩ (V' × V'), then G' = (V', E'_D, E'_B) is called a subgraph of G, and we write G' ⊆ G. If any of the subset relations are strict, we call G' a strict subgraph of G and write G' ⊊ G. The skeleton of G is the undirected graph over the same node set V and with edges i - j if and only if i -> j ∈ G or i <-> j ∈ G (or both).

A path π between i and j is an ordered tuple of (not necessarily distinct) nodes π = (v_0 = i, ..., v_l = j) such that there is an edge between v_k and v_{k+1} for all k = 0, ..., l - 1. If the nodes are distinct, the path is called non-overlapping. The length of π is the number of edges λ(π) = l. If π consists only of directed edges pointing in the direction of j, it is called a directed path from i to j. The tuple (i, j, k) is called a collider on π if i, j and j, k are each adjacent on π with arrowheads pointing from i to j and from k to j^2. If additionally there is no edge between i and k (and i ≠ k), the collider is called a v-structure. A path without colliders is called a trek.

Let A, B, C ⊆ V be three disjoint sets of nodes. The set an(C) := {i ∈ V | there exists a directed non-overlapping path from i to c for some c ∈ C} is called the ancestors of C. A non-overlapping path π from a to b is said to be m-connecting given C if every non-collider on π is not in C and every collider on π is in an(C). If there are no such

2 That is, one of the following structures: i -> j <- k, i <-> j <- k, i -> j <-> k, i <-> j <-> k.

paths, A and B are m-separated given C, and we write A ⊥_m B | C. We use a similar notation for denoting conditional independence of the corresponding sets of variables X_A and X_B given X_C: X_A ⊥ X_B | X_C.

A graph G is called cyclic if there are at least two nodes i and j such that there are directed paths both from i to j and from j to i. Otherwise G is called acyclic or recursive. An acyclic path diagram is also called an acyclic directed mixed graph (ADMG). An acyclic path diagram having at most one edge between each pair of nodes is called a bow-free^3 acyclic path diagram (BAP). An ADMG without any bidirected edges is called a directed acyclic graph (DAG).

3.2.2. The Model

A linear structural equation model (SEM) M is a set of linear equations involving the variables X = (X_1, ..., X_d)^T and some error terms ε = (ε_1, ..., ε_d)^T:

X = B X + ε,    (3.1)

where B is a real (d × d) matrix, and cov(ε) = Ω is a positive semi-definite matrix. M has an associated graph G that reflects the structure of B and Ω. For every non-zero entry B_ij there is a directed edge from j to i, and for every non-zero entry Ω_ij there is a bidirected edge between i and j. Thus we can also write (3.1) as:

X_i = Σ_{j ∈ pa_G(i)} B_ij X_j + ε_i,  for all i ∈ V,    (3.2)

with cov(ε_i, ε_j) = Ω_ij for all i, j ∈ V.

Our model is a special type of SEM; in particular, we make the following assumptions:

(1) The errors ε follow a multivariate Normal distribution N(0, Ω).
(2) The associated graph G is a BAP.

3 The structure i -> j together with i <-> j is also known as a bow.
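As a concrete illustration of (3.1)-(3.2), the following R sketch simulates data from the example BAP of Figure 3.1b. The numerical edge weights are arbitrary choices for illustration, not values used in the thesis.

library(mvtnorm)

B <- matrix(0, 4, 4)
B[2, 1] <- 0.8; B[3, 2] <- -0.5; B[4, 3] <- 0.7      # directed edges X1 -> X2 -> X3 -> X4
Omega <- diag(4)
Omega[2, 4] <- Omega[4, 2] <- 0.4                    # bidirected edge X2 <-> X4

n   <- 1000
eps <- rmvnorm(n, sigma = Omega)                     # correlated Gaussian errors
X   <- t(solve(diag(4) - B) %*% t(eps))              # solve X = B X + eps for each sample

The implied covariance of X in this construction is (I - B)^{-1} Ω (I - B)^{-T}, which is exactly the map φ defined in (3.3) below.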

It then follows from (1) that M is completely parametrized by θ = (B, Ω). Often M is specified via its graph G, and we are interested to find a parametrization θ_G compatible with G. We thus define the parameter spaces for the edge weight matrices B (directed edges) and Ω (bidirected edges) for a given BAP G as

B_G = { B ∈ R^{d×d} | B_ij = 0 if j -> i is not an edge in G },
O_G = { Ω ∈ R^{d×d} | Ω_ij = 0 if i ≠ j and i <-> j is not an edge in G; Ω is positive semi-definite },

and the combined parameter space as Θ_G = B_G × O_G. The covariance matrix of X is given by

φ(θ) = (I - B)^{-1} Ω (I - B)^{-T},    (3.3)

where φ : Θ_G -> S_G maps parameters to covariance matrices, and S_G := φ(Θ_G) is the set of covariance matrices compatible with G. Note that φ(θ) exists since G is acyclic by (2) and therefore I - B is invertible. We assume that the variables are normalized to have variance 1, that is, we are interested in the subset S̃_G ⊆ S_G, where

S̃_G = { Σ ∈ S_G | Σ_ii = 1 for all i = 1, ..., d },

and its preimage under φ, Θ̃_G := φ^{-1}(S̃_G) ⊆ Θ_G.

One of the main motivations of working with BAP models is parameter identifiability. This is defined below:

Definition 1. A normalized parametrization θ_G ∈ Θ̃_G is identifiable if there is no θ'_G ∈ Θ̃_G such that θ'_G ≠ θ_G and φ(θ'_G) = φ(θ_G).

Brito and Pearl (2002) show that for any BAP G, the set of normalized non-identifiable parametrizations has measure zero.

The causal interpretation of BAPs is the following. A directed edge from X to Y represents a direct causal^4 effect of X on Y. A bidirected

4 We adopt the interventional definition of causality, i.e. there is a direct causal effect of X on Y if intervening at X changes Y.

edge between X and Y represents a hidden variable which is a direct cause of both X and Y.

In practice, one is often interested in predicting the effect of an intervention at X_i on another variable X_j. This is called the total causal effect of X_i on X_j and can be defined as E_ij = ∂/∂x E[X_j | do(X_i = x)], where do(X_i = x) means replacing the respective equation in (3.2) with X_i = x (Pearl, 2009). For linear Gaussian path diagrams this is a constant quantity and given by

E_ij = ( (I - B)^{-1} )_ij.    (3.4)

3.2.3. Penalized Maximum Likelihood

Consider a BAP G. A first objective is to estimate the parameters θ_G from i.i.d. samples {x^(s)_i} (i = 1, ..., d and s = 1, ..., n). This can be done by maximum likelihood estimation using the RICF method of Drton et al. (2009). Given the Gaussianity assumption (1) and the covariance formula (3.3), one can write down the log-likelihood for a given parameter set θ_G and a sample covariance matrix S:

l(θ_G; S) = -(n/2) ( log|2π Σ_G| + ((n-1)/n) tr(Σ_G^{-1} S) ),    (3.5)

where Σ_G = φ(θ_G) is the covariance matrix implied by the parameters θ_G, see for example Mardia et al. (1979, (4.1.9)). However, due to the structural constraints on B and Ω it is not straightforward to maximize this over θ_G. RICF is an iterative method to do so, yielding the maximum likelihood estimate:

θ̂_G = arg max_{θ_G ∈ Θ_G} l(θ_G; S).    (3.6)

We now extend this to the scenario where the graph G is also unknown, using a regularized likelihood score with a BIC-like penalty term that increases with the number of edges. Concretely, we use the following score for a given BAP G:

s(G) := (1/n) ( max_{θ_G ∈ Θ_G} l(θ_G; S) - (#{nodes} + #{edges}) log n ).    (3.7)
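The maps (3.3) and (3.4) and the score (3.7) translate directly into code. A minimal R sketch, assuming the maximized likelihood is summarized by a fitted covariance matrix Sigma.G (as returned, for example, by an RICF fit):

phi <- function(B, Omega) {                              # implied covariance (3.3)
  IB <- solve(diag(nrow(B)) - B)
  IB %*% Omega %*% t(IB)
}

total.effects <- function(B) solve(diag(nrow(B)) - B)    # matrix of total causal effects (3.4)

score <- function(Sigma.G, S, n, n.edges) {              # penalized likelihood score (3.7)
  d  <- nrow(S)
  ll <- -n / 2 * (log(det(2 * pi * Sigma.G)) +
                  (n - 1) / n * sum(diag(solve(Sigma.G) %*% S)))   # log-likelihood (3.5)
  (ll - (d + n.edges) * log(n)) / n
}

For the example of the previous sketch, phi(B, Omega) is the model covariance that the simulated data were effectively drawn from.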

We have scaled the log-likelihood and penalty with 1/n so that the score converges to a limit^5 for increasing n. Compared with the usual BIC penalty, we chose our penalty to be twice as large, since this led to better performance in simulation studies.

3.2.4. Equivalence Properties

There is an important issue when doing structure learning with graphical models: typically the maximizers of (3.7) will not be unique. This is a fundamental problem for most model classes and a consequence of the model being underdetermined. In general, there are sets of graphs that are statistically indistinguishable (in the sense that they can all parametrize the same joint distributions over the variables). These graphs are called distributionally equivalent. For nonparametric DAG models (without non-linearity or non-Gaussianity constraints), for example, the distributional equivalence classes are characterized by conditional independencies and are called Markov equivalence classes. For BAPs, distributional equivalence is not completely characterized yet, but we can present some useful results (see Spirtes et al. (1998) or Williams (2012) for a discussion of the linear Gaussian ADMG case).

Let us first make precise the different notions of model equivalence.

Definition 2. Two BAPs G_1, G_2 over a set of nodes V are Markov equivalent if they imply the same m-separation relationships.

This essentially means they imply the same conditional independencies, and the definition coincides with the classical notion of Markov equivalence for DAGs. The following notion of distributional equivalence is stronger.

Definition 3. Two BAPs G_1, G_2 are distributionally equivalent if for all θ_{G_1} ∈ Θ_{G_1} there exists θ_{G_2} ∈ Θ_{G_2} (and vice versa) such that φ(θ_{G_1}) = φ(θ_{G_2}).

Spirtes et al. (1998) showed the following global Markov property for general linear path diagrams: if there are nodes a, b ∈ V and a possibly empty set C ⊆ V such that a ⊥_m b | C, then the partial correlation

5 For the true graph G, this limit is the negative entropy of the joint distribution.

of X_a and X_b given X_C is zero. As a direct consequence, we get the following first result:

Theorem 2. If two BAPs G_1, G_2 do not share the same m-separations, they are not distributionally equivalent.

Unlike for DAGs, the converse is not true, as the counterexample in Figure 3.2 shows.

An important tool in this context is Wright's path tracing formula (Wright, 1960), which expresses the covariance between any two variables in a path diagram as the sum-product over the edge labels of the treks (collider-free paths) between those variables, as long as all variables are normalized to variance 1. A precise statement as well as a proof of a more general version of Wright's formula can be found in the Appendix (Theorem 6). As an example, consider the BAP in Figure 3.1b. There are two treks between X_2 and X_4: X_2 -> X_3 -> X_4 and X_2 <-> X_4. Hence cov(X_2, X_4) = B_32 B_43 + Ω_24, assuming normalized parameters. Similarly, we have cov(X_1, X_4) = B_21 B_32 B_43.

As a consequence of Wright's formula, we can show that having the same skeleton and collider structure is sufficient for two acyclic path diagrams to be distributionally equivalent (Theorem 3 below). For DAGs, it is known that the weaker condition of having the same skeleton and v-structures is sufficient for being Markov equivalent. However, for BAPs this is not true, as the counterexample in Figure 3.2 shows.

Theorem 3. Let G_1, G_2 be two acyclic path diagrams that have the same skeleton and collider structure. Then G_1 and G_2 are distributionally equivalent.

It would be desirable to also have a necessary condition for distributional equivalence. We conjecture that having the same skeleton is a necessary condition. For DAGs this is trivial, since a missing edge between two nodes means they can be d-separated, and thus a conditional independency would have to be present in the corresponding distribution. However, for BAPs a missing edge does not necessarily result in an m-separation, as the counterexample in Figure 3.2 shows. The following theorem at least shows that strictly nested models cannot be distributionally equivalent.

Theorem 4. Let G_1, G_2 be two BAPs over the same set of nodes,

Figure 3.2.: The two BAPs in (a) and (b) share the same skeleton and v-structures, but in (a) there are no m-separations, whereas in (b) we have X_2 ⊥_m X_3 | {X_1, X_4}. BAPs (a) and (c) share the same m-separations (none) but are not distributionally equivalent, since (a) is a strict subgraph of (c) (using Theorem 4).

such that G_1 is a strict subgraph of G_2. Then G_1 and G_2 are not distributionally equivalent.

3.3. Greedy Search

We aim to find the maximizer of (3.7) over all graphs over the node set V = {1, ..., d}. Since exhaustive search is infeasible, we use greedy hill-climbing. Starting from some graph G_0, this method obtains increasingly better estimates by exploring the local neighborhood of the current graph. At the end of each exploration, the highest-scoring graph is selected as the next estimate. This approach is also called greedy search and is often used for combinatorial optimization problems. Greedy search converges to a local optimum, although typically not the global one. To alleviate this we repeat it multiple times with different (random) starting points.

We use the following neighborhood relation. A BAP G' is in the local neighborhood of G if it differs by exactly one edge, that is, one of the following holds:

1. G ⊊ G' (edge addition),

2. G' ⊊ G (edge deletion), or
3. G and G' have the same skeleton (edge change).

If we only admit the first condition, the procedure is called forward search, and it is usually started with the empty graph. Instead of at every step searching through the complete local neighborhood (which can become prohibitive for large graphs), we can also select a random subset of neighbors and only consider those.

In Sections 3.3.1 and 3.3.2 we describe some adaptations of this general scheme that are specific to the problem of BAP learning. In Section 3.3.3 we describe our greedy equivalence class algorithm.

3.3.1. Score Decomposition

Greedy search becomes much more efficient when the score separates over the nodes or parts of the nodes. For DAGs, for example, the log-likelihood can be written as a sum of components, each of which only depends on one node and its parents. Hence, when considering a neighbor of some given DAG, one only needs to update the components affected by the respective edge change. A similar property holds for BAPs. Here, however, the components are not the nodes themselves, but rather the connected components of the bidirected part of the graph (that is, the partition of V into sets of vertices that are reachable from each other by only traversing the bidirected edges). For example, in Figure 3.1b the bidirected connected components (sometimes also called districts) are {X_1}, {X_2, X_4}, {X_3}. This decomposition property is known (?), but for completeness we give a derivation in the Appendix. We write out the special case of the Gaussian parametrization below.

Let us write p^X_G for the joint density of X under the model (3.2), and p^ε_G for the corresponding joint density of ε. Let C_1, ..., C_K be the connected components of the bidirected part of G. We separate the model G into submodels G_1, ..., G_K of the full SEM (3.2), where each G_k consists only of nodes in V_k = C_k ∪ pa(C_k) and without any edges between nodes in pa(C_k). Then, as we show in the Appendix, the log-likelihood of the model with joint density p^X_G given data D = {x^(s)_i}

(with 1 <= i <= d and 1 <= s <= n) can be written as:

l(p^X_G; D) = Σ_{s=1}^n log p^X_G(x^(s)_1, ..., x^(s)_d)
            = Σ_k [ l(p^X_{G_k}; {x^(s)_i}_{s=1,...,n; i ∈ V_k}) - l(p^X_{G_k}; {x^(s)_j}_{s=1,...,n; j ∈ pa(C_k)}) ],

where the second term in each summand refers to the likelihood of the marginal of p^X_{G_k} over the parent nodes pa(C_k). For our Gaussian parametrization, using (3.5), this becomes

l(Σ_{G_1}, ..., Σ_{G_K}; S) = -(n/2) Σ_k ( |C_k| log 2π + log|Σ_{G_k}| - Σ_{j ∈ pa(C_k)} log σ²_kj
                              + ((n-1)/n) [ tr(Σ_{G_k}^{-1} S_{G_k}) - Σ_{j ∈ pa(C_k)} S_jj / σ²_kj ] ),

where S_{G_k} is the restriction of S to the rows and columns corresponding to V_k, and σ²_kj is the diagonal entry of Σ_{G_k} corresponding to parent node j. Note that now the log-likelihood depends on {x^(s)_i} and p^X_G only via S and Σ_{G_1}, ..., Σ_{G_K}. Furthermore, the log-likelihood is now a sum of contributions from the submodels G_k. This means we only need to re-compute the likelihood of the submodels that are affected by an edge change when scoring the local neighborhood. In practice, we also cache the submodels' scores, that is, we assign each encountered submodel a unique hash and store the respective scores, so they can be re-used.

3.3.2. Uniformly Random Restarts

To restart the greedy search we need random starting points (BAPs), and it seems desirable to sample them uniformly at random^6. Just like for DAGs, it is not straightforward to achieve this. What is often done in practice (and implemented in the pcalg package (Kalisch et al., 2012) as randomDAG()) is uniform sampling of triangular (adjacency)

6 Another motivation for uniform BAP generation is generating ground truths for simulations.

Figure 3.3.: Relative frequencies of the 62 3-node BAPs when sampled 30000 times with the naive (triangular matrix sampling) and the MCMC method.

matrices and subsequent uniform permutation of the nodes. However, this does not result in uniformly distributed graphs, since for some triangular matrices many permutations yield the same graph (the empty graph is an extreme example). The consequence is a shift of weight to more symmetric graphs, which are invariant under some permutations of their adjacency matrices. A simple example with BAPs for d = 3 is shown in Figure 3.3.

One way around this is to design a random process with graphs as states and a uniform limiting distribution. A corresponding Markov Chain Monte Carlo (MCMC) approach is described for example in Melançon et al. (2001) for the case of DAGs. See also Kuipers and Moffa (2015) for an overview of different sampling schemes. We adapted this method for BAPs. Specifically, we used the following transition rules: in each step a position (i, j) gets sampled uniformly at random;

- if there is an edge at (i, j), remove this with probability 0.5;
- if there is no edge at (i, j):

  - add i -> j with probability 0.5, as long as this does not create a directed cycle;
  - add i <-> j with probability 0.5.

It is easy to check that the resulting transition matrix is irreducible and symmetric and hence the Markov chain has a (unique) uniform stationary distribution. Thus, starting from any graph, after an initial burn-in period, the distribution of the visited states will be approximately uniform over the set of all BAPs. In practice, we start the process from the empty graph and sample after taking O(d^4) steps (c.f. Kuipers and Moffa (2015)). It is straightforward to adapt this sampling scheme to a number of constraints, for example uniform sampling over all BAPs with a given maximal in-degree.

3.3.3. Greedy Equivalence Class Construction

We propose the following recursive algorithm to greedily estimate the distributional equivalence class E(G) of a given BAP G with score ζ. We start by populating the empirical equivalence class Ê(G) with graphs that have the same skeleton and colliders as G, since these are guaranteed to be equivalent by Theorem 3. This is a significant computational shortcut, since these graphs do not have to be found greedily anymore. Then, starting from G, at each recursion level we search all edge-change neighbors of the current BAP for BAPs that have a score within ε of ζ (edge additions or deletions would result in non-equivalent graphs by Theorem 4). For each such BAP, we spark a new recursive search until a maximum depth of d(d-1)/2 (corresponding to the maximum number of possible edges) is reached, and always comparing against the original score ζ. Already visited states are stored and ignored. Finally, all found graphs are added to Ê(G). The main tuning parameter here is ε, essentially specifying the threshold for statistically indistinguishable graphs. This approach of approximating the equivalence class has the advantage of also including neighbouring equivalence classes that are statistically indistinguishable from the given data, thus being on the conservative side.
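Referring back to the transition rule of Section 3.3.2, a minimal R sketch of one step is given below. The acyclicity check creates.cycle() is a hypothetical placeholder, and the handling of positions and bow-freeness is simplified, so this is an approximation of the rule rather than the thesis implementation.

mcmc.step <- function(dir, bi, creates.cycle) {
  d  <- nrow(dir)
  ij <- sample(d, 2); i <- ij[1]; j <- ij[2]            # an ordered position (i, j), i != j
  if (dir[i, j] || bi[i, j]) {                          # an edge sits at position (i, j)
    if (runif(1) < 0.5)                                 # remove it with probability 0.5
      dir[i, j] <- bi[i, j] <- bi[j, i] <- 0
  } else if (!dir[j, i] && !bi[j, i]) {                 # keep the graph bow-free and simple
    if (runif(1) < 0.5) {
      if (!creates.cycle(dir, i, j)) dir[i, j] <- 1     # add i -> j, unless it creates a cycle
    } else {
      bi[i, j] <- bi[j, i] <- 1                         # add i <-> j
    }
  }
  list(dir = dir, bi = bi)                              # dir[i, j] = 1 codes i -> j; bi is symmetric
}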

3.3.4. Implementation

Our implementation is done in the statistical computing language R (R Core Team, 2015). The code is available as supplemental material to this paper, and it will also be published in a future version of the R package pcalg (Kalisch et al., 2012). We make heavy use of the RICF implementation fitAncestralGraph()^7 in the ggm package (Marchetti et al., 2015). We noted that there are sometimes convergence issues, so we adapted the implementation of RICF to include a maximal iteration limit (which we set to 10 by default).

3.4. Empirical Results

In this section we present some empirical results to show the effectiveness of our method. First, we consider a simulation setting where we can compare against the ground truth. Then we turn to a well-known genomic data set, where the ground truth is unknown, but the likelihood of the fitted models can be compared against other methods.

3.4.1. Causal Effects Discovery on Simulated Data

To validate our method, we randomly generate ground truths, simulate data from them, and try to recover them from the generated datasets. This procedure is repeated N = 100 times and the results are averaged. We now discuss each step in more detail.

Randomly generate a BAP G. We do this uniformly at random (for a fixed model size d = 10 and maximal in-degree α = 2). The sampling procedure is described in Section 3.3.2.

Randomly generate a parametrization θ_G. We sample the directed edge labels in B independently from a standard

7 Despite the function name, the implementation is not restricted to ancestral graphs.

Normal distribution. We do the same for the bidirected edge labels in Ω, and set the error variances (diagonal entries of Ω) to the respective row-sums of absolute values plus an independently sampled χ²(1) value^8.

Simulate data {x^(s)_i} from θ_G. This is straightforward, since we just need to sample from a multivariate Normal distribution with mean 0 and covariance φ(θ_G). We use the function rmvnorm() from the package mvtnorm (Genz et al., 2014).

Find an estimate Ĝ from {x^(s)_i}. We use greedy search with R = 100 uniformly random restarts (as outlined in Section 3.3) for this, as well as one greedy forward search starting from the empty model.

Compare G and Ĝ. This is not so straightforward, since the structure of equivalence classes for BAPs is not known. We therefore make use of the greedy approach described in Section 3.3.3 with ε = to get Ê(G) and Ê(Ĝ). We proceed by estimating the minimal absolute causal effects E^min_G between all nodes over the empirical equivalence class, analogous to the IDA method (Maathuis et al., 2009). Thus, for each graph G' ∈ Ê(G) in the empirical equivalence class with estimated parameters θ̂_{G'} = (B̂, Ω̂), we compute the estimated causal effects matrix according to (3.4). We then take absolute values and take the entry-wise minima over all these matrices to obtain Ê^min_G. We do the same for Ĝ to get Ê^min_Ĝ.

To compare the estimated causal effects of the ground truth G and those of the greedy search result Ĝ, we treat this as a classification problem, where (Ê^min_Ĝ)_ij is giving a confidence score for the event (Ê^min_G)_ij > 0, and we report the area under the ROC curve (AUC, see ?). The AUC ranges from 0 to 1, with 1 meaning perfect classification and 0.5 being equivalent to random guessing^9.

8 By Gershgorin's circle theorem, this is guaranteed to result in a positive definite matrix. To increase stability, we also repeat the sampling of Ω if its minimal eigenvalue is less than

9 A bit of care has to be taken because of the fact that the cases (Ê^min_G)_ij > 0 and (Ê^min_G)_ji > 0 exclude each other, but we took this into account when computing the false positive rate.

Figure 3.4.: ROC curves for causal effect discovery for N = 100 simulation runs of BAPs with d = 10 nodes and a maximal in-degree of α = 2. Sample size was n = 1000, greedy search was repeated R = 100 times at uniformly random starting points. The average area under the ROC curves (AUC) is . The thick curve is the point-wise average of the individual ROC curves.

The results can be seen in Figure 3.4, with an AUC of . While this suggests that perfect graph discovery is usually not achieved, causal effects (which are often more relevant in practice) can be identified to some extent. The computations took 2.5 hours on an AMD Opteron 6174 processor using 20 cores.

3.4.2. Genomic Data

We also applied our method to a well-known genomics data set (Sachs et al., 2005), where the expression of 11 proteins in human T-cells was measured under 14 different experimental conditions. There are likely

hidden confounders, which makes this setting suitable for hidden variable models. However, it is questionable whether the bow-freeness, linearity, and Gaussianity assumptions hold to a reasonable approximation (in fact the data seem not multivariate Normal). Furthermore, there does not exist a ground truth network (although some individual links between pairs of proteins are reported as known in the original paper). So we abstain from presenting a best network, but instead compare the model fit of BAPs and DAGs from a purely statistical point of view.

To do this, we first log-transform all variables, since they are heavily skewed. We then run two sets of greedy searches for each of the 14 data sets: one with BAPs and one with DAGs. For the BAPs we use 100 and for the DAGs we use 1000 random restarts (DAG models can be fit much faster than BAP models). The results for datasets 1 to 8 can be seen in Figure 3.5^10. The penalized likelihood scores of the best BAP models are always much higher than the corresponding scores for the DAG models, despite the larger number of random restarts for the DAGs. Furthermore, while there is a marked improvement from the random starting scores for BAPs, the improvement for the DAG scores is marginal. For these datasets, BAPs seem to be superior to DAGs in modelling the data. The computations took 4 hours for the BAP models and 1.5 hours for the DAG models on an AMD Opteron 6174 processor using 20 cores.

3.5. Conclusions

We have presented a structure learning method for BAPs, which can be viewed as a generalization of Gaussian linear DAG models that allow for certain latent variables. Our method is computationally feasible and the first of its kind. The results on simulated data are promising, keeping in mind that structure learning and inferring causal effects are difficult, even for the easier case with DAGs. The main sources of errors (given the model assumptions are fulfilled) are sampling variability, finding a local optimum only, and not knowing

10 The results for the other 6 datasets are qualitatively similar.

Figure 3.5.: Greedy search runs on the first 8 of 14 genomic datasets with BAPs and DAGs. The x-axis is time in seconds, the y-axis is the (normalized) penalized likelihood score. Each path corresponds to a run of a greedy search with a different random restart, with each point on the path being a state visited by this greedy search run.


More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Deep Convolutional Neural Networks for Pairwise Causality

Deep Convolutional Neural Networks for Pairwise Causality Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, Delhi Tata Consultancy Services Ltd. {karamjit.singh,

More information

Learning causal network structure from multiple (in)dependence models

Learning causal network structure from multiple (in)dependence models Learning causal network structure from multiple (in)dependence models Tom Claassen Radboud University, Nijmegen tomc@cs.ru.nl Abstract Tom Heskes Radboud University, Nijmegen tomh@cs.ru.nl We tackle the

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Fast and Accurate Causal Inference from Time Series Data

Fast and Accurate Causal Inference from Time Series Data Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference

More information

Learning of Causal Relations

Learning of Causal Relations Learning of Causal Relations John A. Quinn 1 and Joris Mooij 2 and Tom Heskes 2 and Michael Biehl 3 1 Faculty of Computing & IT, Makerere University P.O. Box 7062, Kampala, Uganda 2 Institute for Computing

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

Structure Learning: the good, the bad, the ugly

Structure Learning: the good, the bad, the ugly Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform

More information

Inferring the Causal Decomposition under the Presence of Deterministic Relations.

Inferring the Causal Decomposition under the Presence of Deterministic Relations. Inferring the Causal Decomposition under the Presence of Deterministic Relations. Jan Lemeire 1,2, Stijn Meganck 1,2, Francesco Cartella 1, Tingting Liu 1 and Alexander Statnikov 3 1-ETRO Department, Vrije

More information

Gov 2002: 4. Observational Studies and Confounding

Gov 2002: 4. Observational Studies and Confounding Gov 2002: 4. Observational Studies and Confounding Matthew Blackwell September 10, 2015 Where are we? Where are we going? Last two weeks: randomized experiments. From here on: observational studies. What

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Causal Inference on Discrete Data via Estimating Distance Correlations

Causal Inference on Discrete Data via Estimating Distance Correlations ARTICLE CommunicatedbyAapoHyvärinen Causal Inference on Discrete Data via Estimating Distance Correlations Furui Liu frliu@cse.cuhk.edu.hk Laiwan Chan lwchan@cse.euhk.edu.hk Department of Computer Science

More information

Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example

Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example Nicholas Cornia & Joris M. Mooij Informatics Institute University of Amsterdam,

More information

D-optimally Lack-of-Fit-Test-efficient Designs and Related Simple Designs

D-optimally Lack-of-Fit-Test-efficient Designs and Related Simple Designs AUSTRIAN JOURNAL OF STATISTICS Volume 37 (2008), Number 3&4, 245 253 D-optimally Lack-of-Fit-Test-efficient Designs and Related Simple Designs Wolfgang Bischoff Catholic University ichstätt-ingolstadt,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Eun Yong Kang Department

More information

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4 Inference: Exploiting Local Structure aphne Koller Stanford University CS228 Handout #4 We have seen that N inference exploits the network structure, in particular the conditional independence and the

More information

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise

Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Dependence Minimizing Regression with Model Selection for Non-Linear Causal Inference under Non-Gaussian Noise Makoto Yamada and Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology

More information

Automatic Causal Discovery

Automatic Causal Discovery Automatic Causal Discovery Richard Scheines Peter Spirtes, Clark Glymour Dept. of Philosophy & CALD Carnegie Mellon 1 Outline 1. Motivation 2. Representation 3. Discovery 4. Using Regression for Causal

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Probabilistic latent variable models for distinguishing between cause and effect

Probabilistic latent variable models for distinguishing between cause and effect Probabilistic latent variable models for distinguishing between cause and effect Joris M. Mooij joris.mooij@tuebingen.mpg.de Oliver Stegle oliver.stegle@tuebingen.mpg.de Dominik Janzing dominik.janzing@tuebingen.mpg.de

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

Estimation of linear non-gaussian acyclic models for latent factors

Estimation of linear non-gaussian acyclic models for latent factors Estimation of linear non-gaussian acyclic models for latent factors Shohei Shimizu a Patrik O. Hoyer b Aapo Hyvärinen b,c a The Institute of Scientific and Industrial Research, Osaka University Mihogaoka

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Chapter 8&9: Classification: Part 3 Instructor: Yizhou Sun yzsun@ccs.neu.edu March 12, 2013 Midterm Report Grade Distribution 90-100 10 80-89 16 70-79 8 60-69 4

More information

Interpreting and using CPDAGs with background knowledge

Interpreting and using CPDAGs with background knowledge Interpreting and using CPDAGs with background knowledge Emilija Perković Seminar for Statistics ETH Zurich, Switzerland perkovic@stat.math.ethz.ch Markus Kalisch Seminar for Statistics ETH Zurich, Switzerland

More information

Econometrics with Observational Data. Introduction and Identification Todd Wagner February 1, 2017

Econometrics with Observational Data. Introduction and Identification Todd Wagner February 1, 2017 Econometrics with Observational Data Introduction and Identification Todd Wagner February 1, 2017 Goals for Course To enable researchers to conduct careful quantitative analyses with existing VA (and non-va)

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Advances in Cyclic Structural Causal Models

Advances in Cyclic Structural Causal Models Advances in Cyclic Structural Causal Models Joris Mooij j.m.mooij@uva.nl June 1st, 2018 Joris Mooij (UvA) Rotterdam 2018 2018-06-01 1 / 41 Part I Introduction to Causality Joris Mooij (UvA) Rotterdam 2018

More information

QUANTIFYING CAUSAL INFLUENCES

QUANTIFYING CAUSAL INFLUENCES Submitted to the Annals of Statistics arxiv: arxiv:/1203.6502 QUANTIFYING CAUSAL INFLUENCES By Dominik Janzing, David Balduzzi, Moritz Grosse-Wentrup, and Bernhard Schölkopf Max Planck Institute for Intelligent

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

Arrowhead completeness from minimal conditional independencies

Arrowhead completeness from minimal conditional independencies Arrowhead completeness from minimal conditional independencies Tom Claassen, Tom Heskes Radboud University Nijmegen The Netherlands {tomc,tomh}@cs.ru.nl Abstract We present two inference rules, based on

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Causal Inference. Prediction and causation are very different. Typical questions are:

Causal Inference. Prediction and causation are very different. Typical questions are: Causal Inference Prediction and causation are very different. Typical questions are: Prediction: Predict Y after observing X = x Causation: Predict Y after setting X = x. Causation involves predicting

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Bayesian Discovery of Linear Acyclic Causal Models

Bayesian Discovery of Linear Acyclic Causal Models Bayesian Discovery of Linear Acyclic Causal Models Patrik O. Hoyer Helsinki Institute for Information Technology & Department of Computer Science University of Helsinki Finland Antti Hyttinen Helsinki

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks

Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks Journal of Machine Learning Research 17 (2016) 1-102 Submitted 12/14; Revised 12/15; Published 4/16 Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks Joris M. Mooij Institute

More information

Variable selection and machine learning methods in causal inference

Variable selection and machine learning methods in causal inference Variable selection and machine learning methods in causal inference Debashis Ghosh Department of Biostatistics and Informatics Colorado School of Public Health Joint work with Yeying Zhu, University of

More information

Learning Multivariate Regression Chain Graphs under Faithfulness

Learning Multivariate Regression Chain Graphs under Faithfulness Sixth European Workshop on Probabilistic Graphical Models, Granada, Spain, 2012 Learning Multivariate Regression Chain Graphs under Faithfulness Dag Sonntag ADIT, IDA, Linköping University, Sweden dag.sonntag@liu.se

More information

Causal Inference & Reasoning with Causal Bayesian Networks

Causal Inference & Reasoning with Causal Bayesian Networks Causal Inference & Reasoning with Causal Bayesian Networks Neyman-Rubin Framework Potential Outcome Framework: for each unit k and each treatment i, there is a potential outcome on an attribute U, U ik,

More information

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models

Chapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models Chapter 5 Introduction to Path Analysis Put simply, the basic dilemma in all sciences is that of how much to oversimplify reality. Overview H. M. Blalock Correlation and causation Specification of path

More information

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Parametrizations of Discrete Graphical Models

Parametrizations of Discrete Graphical Models Parametrizations of Discrete Graphical Models Robin J. Evans www.stat.washington.edu/ rje42 10th August 2011 1/34 Outline 1 Introduction Graphical Models Acyclic Directed Mixed Graphs Two Problems 2 Ingenuous

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

arxiv: v6 [math.st] 3 Feb 2018

arxiv: v6 [math.st] 3 Feb 2018 Submitted to the Annals of Statistics HIGH-DIMENSIONAL CONSISTENCY IN SCORE-BASED AND HYBRID STRUCTURE LEARNING arxiv:1507.02608v6 [math.st] 3 Feb 2018 By Preetam Nandy,, Alain Hauser and Marloes H. Maathuis,

More information

Predicting the effect of interventions using invariance principles for nonlinear models

Predicting the effect of interventions using invariance principles for nonlinear models Predicting the effect of interventions using invariance principles for nonlinear models Christina Heinze-Deml Seminar for Statistics, ETH Zürich heinzedeml@stat.math.ethz.ch Jonas Peters University of

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Bioinformatics: Network Analysis

Bioinformatics: Network Analysis Bioinformatics: Network Analysis Model Fitting COMP 572 (BIOS 572 / BIOE 564) - Fall 2013 Luay Nakhleh, Rice University 1 Outline Parameter estimation Model selection 2 Parameter Estimation 3 Generally

More information

Discovery of non-gaussian linear causal models using ICA

Discovery of non-gaussian linear causal models using ICA Discovery of non-gaussian linear causal models using ICA Shohei Shimizu HIIT Basic Research Unit Dept. of Comp. Science University of Helsinki Finland Aapo Hyvärinen HIIT Basic Research Unit Dept. of Comp.

More information

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models

Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Lecture 6: Univariate Volatility Modelling: ARCH and GARCH Models Prof. Massimo Guidolin 019 Financial Econometrics Winter/Spring 018 Overview ARCH models and their limitations Generalized ARCH models

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

On the Identification of a Class of Linear Models

On the Identification of a Class of Linear Models On the Identification of a Class of Linear Models Jin Tian Department of Computer Science Iowa State University Ames, IA 50011 jtian@cs.iastate.edu Abstract This paper deals with the problem of identifying

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Statistics: revision

Statistics: revision NST 1B Experimental Psychology Statistics practical 5 Statistics: revision Rudolf Cardinal & Mike Aitken 29 / 30 April 2004 Department of Experimental Psychology University of Cambridge Handouts: Answers

More information

arxiv: v1 [cs.lg] 3 Jan 2017

arxiv: v1 [cs.lg] 3 Jan 2017 Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, New-Delhi, India January 4, 2017 arxiv:1701.00597v1

More information

arxiv: v4 [stat.ml] 2 Dec 2017

arxiv: v4 [stat.ml] 2 Dec 2017 Distributional Equivalence and Structure Learning for Bow-free Acyclic Path Diagrams arxiv:1508.01717v4 [stat.ml] 2 Dec 2017 Christopher Nowzohour Seminar für Statistik, ETH Zürich nowzohour@stat.math.ethz.ch

More information

1. Einleitung. 1.1 Organisatorisches. Ziel der Vorlesung: Einführung in die Methoden der Ökonometrie. Voraussetzungen: Deskriptive Statistik

1. Einleitung. 1.1 Organisatorisches. Ziel der Vorlesung: Einführung in die Methoden der Ökonometrie. Voraussetzungen: Deskriptive Statistik 1. Einleitung 1.1 Organisatorisches Ziel der Vorlesung: Einführung in die Methoden der Ökonometrie Voraussetzungen: Deskriptive Statistik Wahrscheinlichkeitsrechnung und schließende Statistik Fortgeschrittene

More information