Estimating Causal Direction and Confounding of Two Discrete Variables


arXiv:1611.01504v1 [stat.ML] 4 Nov 2016

Krzysztof Chalupka, Computation and Neural Systems, Caltech
Frederick Eberhardt, Humanities and Social Sciences, Caltech
Pietro Perona, Electrical Engineering, Caltech

Abstract

We propose a method to classify the causal relationship between two discrete variables given only the joint distribution of the variables, acknowledging that the method is subject to an inherent baseline error. We assume that the causal system is acyclic, but we do allow for hidden common causes. Our algorithm presupposes that the probability distribution P(C) of a cause C is independent of the probability distribution P(E|C) of the cause-effect mechanism. While our classifier is trained with a Bayesian assumption of flat hyperpriors, we do not make this assumption about our test data. This work connects to recent developments on the identifiability of causal models over continuous variables under the assumption of independent mechanisms. Carefully-commented Python notebooks that reproduce all our experiments are available online at vision.caltech.edu/~kchalupk/code.html.

1 Introduction

Take two discrete variables X and Y that are probabilistically dependent. Assume there is no feedback between the variables: it is not the case that both X causes Y and Y causes X. Further, assume that any probabilistic dependence between variables always arises due to a causal connection between the variables (Reichenbach, 1991). The fundamental causal question then splits into two parts: 1) Does X cause Y or does Y cause X? And 2) Do X and Y have a common cause H? Since we assume no feedback in the system, the options in 1) are mutually exclusive. Each of them, however, can occur together with a confounder. Fig. 1 enumerates the set of six possible hypotheses.

In this article we present a method to distinguish between these six hypotheses on the basis only of data from the observational joint probability P(X, Y), that is, without resorting to experimental intervention. Within the causal graphical models framework (Pearl, 2000; Spirtes et al., 2000), differentiating between any two of the causally interesting possibilities (shown in Fig. 1B-F) is in general only possible if one has the ability to intervene on the system. For example, to differentiate between the pure-confounding and the direct-causal case (Fig. 1B and C), one can intervene on X and observe whether that has an effect on the distribution of Y. Given only observations of X and Y, and no ability to intervene on the system, the problem is in general not identifiable. Roughly speaking, the reason is simply that without any further assumptions about the form of the distribution, any joint P(X, Y) can be factorized as P(X)P(Y|X) and as P(Y)P(X|Y), and the hidden confounder H can easily be endowed with a distribution that gives the marginal $\sum_H P(X, Y, H)$ any desired form.

2 Related Work

There are two common remedies to the fundamental unidentifiability of the two-variable causal system: 1) resort to interventions, or 2) introduce additional assumptions about the system and derive a solution that works under these assumptions. Whereas the first solution is straightforward, research in the second direction is a more recent and exciting enterprise.
2.1 Additive Noise Models

A recent body of work attacks the problem of establishing whether x → y or y → x when specific assumptions with respect to the functional form of the causal relationship hold. Shimizu et al. (2006) showed for continuous variables that when the effect is a linear function of the cause, with non-Gaussian noise, then the causal direction can be identified in the limit of infinite sample size.

This inspired further work on the so-called additive noise models. Hoyer et al. (2009) extended Shimizu's identifiability results to the case when the effect is any (except for a measure-theoretically small set of) nonlinear function of the cause, and the noise is additive, even Gaussian. Zhang and Hyvärinen (2009) showed that a post-nonlinear model, that is, y = f(g(x) + ε) where f is an invertible function and ε a noise term, is identifiable. The additive noise model framework was applied to discrete variables by Peters et al. (2011). Most of this work has focused on distinguishing the causal orientation in the absence of confounding (i.e. distinguishing hypotheses A, C and D in Fig. 1), although Hoyer et al. (2008) have extended the linear non-Gaussian methods to the general hypothesis space of Fig. 1, and Janzing et al. (2009) showed that the additive noise assumption can be used to detect pure confounding with some success, i.e. to distinguish hypothesis B from hypotheses C and D.

The assumption of additive noise supplies remarkable identifiability results, and has a natural application in many cases where the variation in the data is thought to derive from the measurement process on an otherwise deterministic functional relation. With respect to the more general space of possible probabilistic causal relations, however, it constitutes a very substantive assumption. In particular, in the discrete case its application is no longer so natural.

2.2 Bayesian Causal Model Selection

From a Bayesian perspective, the question of causal model identification is softened to the question of model selection based on the posterior probability of a causal structure given the data. In classic work on Bayesian network learning (then called belief net learning), Heckerman and Chickering (Heckerman et al., 1995; Chickering, 2002a,b) developed a Bayesian scoring criterion that allowed them to define the posterior probability of each possible Bayesian network given a dataset. Motivated by results of Geiger and Pearl (1988) and Meek (1995), which showed that for linear Gaussian parameterizations and for multinomial parameterizations causal models with identical (conditional) independence structure cannot in principle be distinguished given only the observational joint distribution over the observed variables, their score had the property that it assigned the same score to graphs with the same independence structure.

But the results of Geiger and Pearl (1988) and Meek (1995) only make an existential claim: that for any joint distribution, a parameterization of the appropriate form exists for any causal structure in the Markov equivalence class. One need not conclude from such an existential claim, as Heckerman and Chickering did, that there are no reasons in the data that could suggest that one causal structure in a Markov equivalence class is more probable than another. Thus, the crucial difference between the work of Heckerman and ours is that their goal was to find the Markov equivalence class of the true causal model. This, however, renders our task impossible: all the hypotheses enumerated in Fig. 1B-F are Markov-equivalent.

Figure 1: Possible causal structures linking X and Y. Assume X, Y and H are all discrete but H is unobserved. In principle, it is impossible to identify the correct causal structure given only X and Y samples. In this report, we tackle this problem using a minimalistic set of assumptions. Our final result is a classifier that differentiates between these six cases; the confusion matrices are shown in Fig. 9.
Our contribution is to use assumptions formally identical to theirs (when restricted to the causally sufficient case) to assess the likelihood of causal structures over two observed variables, bearing in mind that even in the limit of infinitely many samples the true model cannot be determined; the structures can only be deemed more or less likely.

Our approach is most similar in spirit to the work of Sun et al. (2006). Sun puts an explicit Bayesian prior on what a likely causal system is: if X causes Y, then the conditionals P(Y|X = x) are less complex than the reverse conditionals P(X|Y = y), where complexity is measured by the Hilbert space norm of the conditional density functions. This formulation is plausible and easily applicable to discrete systems (by defining the complexity of discrete probability tables by their entropy).

3 Assumptions

Our contribution is to create an algorithm with the following properties:

1. Applicable to discrete variables X and Y with finitely many states.
2. Decides between all the six possible graphs shown in Fig. 1.
3. Does not make assumptions about the functional form of the discrete parameterization (in contrast to, e.g., an additive noise assumption).

In a recent review, Mooij et al. (2014) compare a range of methods that decide the causal direction between two variables, including the methods discussed above. To our knowledge, none of these methods attempts to distinguish between the pure-causal (A, C, D), the confounded (B), and the causal-plus-confounded (E, F) cases.

We take an approach inspired by the Bayesian methods discussed in Sec. 2. Consider the Bayesian model in which P(X, Y) is sampled from an uninformative hyperprior with the property that the distribution of the cause is independent of the distribution of the effect conditioned on the cause:

1. Assume that P(effect | cause) is independent of P(cause).
2. Assume that P(effect | cause = c) is sampled from the uninformative hyperprior for each c.
3. Assume that P(cause) is sampled from the uninformative hyperprior.

Since all the distributions under consideration are multinomial, the uninformative hyperprior is the Dirichlet distribution with parameters all equal to 1 (which we will denote as Dir(1), remembering that 1 is actually a vector whose dimensionality will be clear from context). Which variable is a cause, and which the effect, or whether there is confounding, depends on which of the causal systems in Fig. 1 is sampled. For example, if X → Y and there is also confounding X ← H → Y (Fig. 1D), then our assumptions set

\[
\begin{aligned}
&P(X) \sim \mathrm{Dir}(1), && \forall x:\ P(Y \mid X = x) \sim \mathrm{Dir}(1),\\
&P(H) \sim \mathrm{Dir}(1), && \forall h:\ P(X \mid H = h) \sim \mathrm{Dir}(1), \quad \forall h:\ P(Y \mid H = h) \sim \mathrm{Dir}(1).
\end{aligned}
\]

A minimal code sketch of this sampling step is given below.
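To make the sampling assumptions concrete, here is a minimal Python sketch (our own illustration, not the authors' released notebooks; the helper names and the use of numpy are ours) that draws a joint table P_XY under flat Dir(1) priors for the direct-causal case X → Y and for the purely confounded case X ← H → Y.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_direct_causal(kx, ky):
    """X -> Y: P(X) ~ Dir(1), P(Y | X = x) ~ Dir(1) for each x."""
    p_x = rng.dirichlet(np.ones(kx))               # P(X)
    p_y_given_x = rng.dirichlet(np.ones(ky), kx)   # row x holds P(Y | X = x)
    return p_x[:, None] * p_y_given_x              # joint table P(X, Y), shape (kx, ky)

def sample_confounded(kx, ky, kh):
    """X <- H -> Y: P(H) ~ Dir(1), P(X | H = h) ~ Dir(1), P(Y | H = h) ~ Dir(1)."""
    p_h = rng.dirichlet(np.ones(kh))
    p_x_given_h = rng.dirichlet(np.ones(kx), kh)   # shape (kh, kx)
    p_y_given_h = rng.dirichlet(np.ones(ky), kh)   # shape (kh, ky)
    # Marginalize the hidden confounder H out of P(X, Y, H).
    return np.einsum('h,hx,hy->xy', p_h, p_x_given_h, p_y_given_h)

p_xy_causal = sample_direct_causal(2, 2)
p_xy_confounded = sample_confounded(2, 2, kh=int(rng.integers(2, 101)))
print(p_xy_causal.sum(), p_xy_confounded.sum())    # both sum to 1 up to rounding
```

Summing H out in the second helper mirrors the remark in the introduction that the hidden confounder shapes the observed joint only through the marginal over H.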
4 An Analytical Solution: Causal Direction

Consider first the problem of identifying the causal direction. That is, assume that either X → Y or Y → X, and there is no confounding. The assumptions of Sec. 3 then allow us to compute, for any given joint P(X, Y) (which we will from now on denote P_XY to simplify notation), the likelihood p(X → Y | P_XY) and the likelihood p(Y → X | P_XY). The likelihood ratio allows us to decide which causal direction P_XY more likely represents. We first derive and visualize the likelihood for the case of X and Y both binary. Next, we generalize the result to arbitrary X and Y. Finally, we analyze experimentally how sensitive such a causal direction classifier is to breaking the assumption of uninformative Dirichlet hyperpriors (while keeping the independent mechanisms assumption).

4.1 Optimal Classifier for Binary X and Y

Consider first the binary case. Let

\[
P_X = \begin{bmatrix} a \\ 1 - a \end{bmatrix}
\quad\text{and}\quad
P_{Y|X} = \begin{bmatrix} b & 1 - b \\ c & 1 - c \end{bmatrix}.
\]

Assume P_X is sampled independently from P_{Y|X}, and that the densities (parameterized by a and by b, c) are F_a, F_b, F_c : (0, 1) → R. This defines a density over (a, b, c), the three-dimensional parameterization of an x → y system, as F(a, b, c) = F_a(a) F_b(b) F_c(c) : (0, 1)^3 → R.

Now, consider

\[
P_{XY} = \begin{bmatrix} d & e \\ f & 1 - (d + e + f) \end{bmatrix},
\]

a three-dimensional parameterization of the joint. If we assume that P_XY is sampled according to the X → Y sampling procedure, we can compute its density H_XY : (0, 1)^3 → R as a function of F using the multivariate change of variables formula. We have

\[
d = ab, \qquad e = a(1 - b), \qquad f = (1 - a)c,
\]

and the inverse transformation is

\[
a = d + e, \qquad b = \frac{d}{d + e}, \qquad c = \frac{f}{1 - d - e}. \tag{1}
\]

The Jacobian of the inverse transformation is

\[
\frac{\partial(a, b, c)}{\partial(d, e, f)} =
\begin{pmatrix}
1 & 1 & 0 \\
\dfrac{e}{(d + e)^2} & \dfrac{-d}{(d + e)^2} & 0 \\
\dfrac{f}{(1 - d - e)^2} & \dfrac{f}{(1 - d - e)^2} & \dfrac{1}{1 - d - e}
\end{pmatrix},
\]

and the magnitude of its determinant is

\[
\left|\det \frac{\partial(a, b, c)}{\partial(d, e, f)}\right| = \frac{1}{(d + e) - (d + e)^2}.
\]

The change of variables formula then gives us

\[
H_{XY}(d, e, f) = \frac{F\!\left(d + e,\ \frac{d}{d + e},\ \frac{f}{1 - d - e}\right)}{(d + e) - (d + e)^2},
\]

where a, b, c are obtained from Eq. (1). We can repeat the same reasoning for the inverse causal direction, Y → X. In this case, we obtain

\[
H_{YX}(d, e, f) = \frac{F\!\left(d + f,\ \frac{d}{d + f},\ \frac{e}{1 - d - f}\right)}{(d + f) - (d + f)^2}.
\]

Given P_XY and the hyperpriors F, we can now test which causal direction P_XY most likely corresponds to. Assuming equal priors on both causal directions, we have

\[
\frac{p(X \to Y \mid (d, e, f))}{p(Y \to X \mid (d, e, f))}
= \frac{H_{XY}(d, e, f)}{H_{YX}(d, e, f)}
= \frac{F\!\left(d + e,\ \frac{d}{d + e},\ \frac{f}{1 - d - e}\right)}{F\!\left(d + f,\ \frac{d}{d + f},\ \frac{e}{1 - d - f}\right)}
\cdot \frac{(d + f) - (d + f)^2}{(d + e) - (d + e)^2}.
\]

p(x Y P XY ) p(y X P XY = (2) = F ( ) P X, P Y X F ( ) et J XY 1, (3) P Y, P X Y et J Y X 1 ( ) Figure 2: Log likelihoo-ratio log P (X Y (,e,f)) P (Y X (,e,f)) as a function of e, f for nine ifferent values of. Re correspons to values larger than 0 that is, X Y is more likely than the opposite causal irection in the re regions. Blue signifies the opposite. The ecision bounary is shown in black. It is a union of two orthogonal planes that cut the (, e, f) simplex into four connecte components along a skewe axis. Only the first factor in the likelihoo ratio epens on the hyperprior F. If we fix F a, F b, F c to all be Dir(1), the factor reuces to 1 an the likelihoo ratio becomes p(x Y (, e, f)) ( + f) ( + f)2 = p(y X (, e, f)) ( + e) ( + e) 2. Denote the uninformative-hyperprior likelihoo ratio function LR: P XY (, e, f) ( + f) ( + f)2 ( + e) ( + e) 2. The classifier that assigns the X Y class to P XY with LR(P XY ) > 1, an the Y X class otherwise is the optimal classifier uner our assumptions. Fig.2 shows LR across the three-imensional P XY simplex. The figure shows nine slices of this simplex for ifferent values of the coorinate. 4.2 Optimal Classifier for Arbitrary X an Y Deriving the optimal classifier for the case where X an Y are not binary is analogous to the binary erivation. The resulting likelihoo ratio is where J XY is the Jacobian of the linear transformation (P X, P Y X ) P XY an J Y X is the Jacobian of the transformation (P Y, P X Y ) P XY. The transformation, its eterminant an Jacobian are reaily computable on paper or using computer algebra systems. In our implementation, we use Theano (Theano Development Team, 2016) to perform the computation for us. Note that if X has carinality k X an Y has carinality k Y, the Jacobians have (k X k Y 1) 2 entries. Computing their eterminants has complexity O((k X k Y 1) 6 ) or, if we assume k X = k Y = k, O(k 12 ) it grows rather quickly with growing carinality. If F is flat, that is all the priors are Dir(1), we will call the causal irection classifier that follows Eq. (3) the LR classifier. That is, the LR classifier outputs X Y if the uninformative-hyperprior likelihoo ratio is larger than 1, an outputs Y X otherwise. Note that the optimal classifier is not perfect there is a baseline error that the optimal classifier has uner the assumptions it is built on. This error is E LR = p(y X P XY )I [LR(PXY )>1]+ p(x Y P XY )I [LR(PXY )<1]P XY, where the integral varies over all the possible joints P XY with uniform measure, an I [LR(PXY )<>1] is the inicator function that evaluates to 1 if its subscript conition hols, an to 0 otherwise. That is, assuming that each P XY is sample from the uninformative Dirichlet prior given that either X Y or Y X with given probability, in the limit of infinite classification trials the error rate of the LR classifier is E LR. Whereas this integral is not analytically computable (at least neither by the authors nor by available computer algebra systems), we can estimate it using Monte Carlo methos in the following sections. In Fig. 6, the leftmost entry on each curve correspons to E LR for various carinalities of X an Y. For example, for X = Y = 2, E LR.4 but alreay for X = Y = 10, E LR <.001. 4.3 Robustness: Changing the Hyperprior F What if we use the LR classifier, but our assumptions o not match reality? Namely, what if F is not Dir(1)? For example, what if F is a mixture of ten Dirichlet istribu-

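Returning to the general-cardinality ratio in Eq. (2, 3), the Jacobian-determinant factor can also be checked numerically without symbolic tools. The sketch below is our illustration (the paper's implementation uses Theano); it builds the factorization map for equal cardinalities k_X = k_Y, where the flat-prior factor F(P_X, P_{Y|X}) / F(P_Y, P_{X|Y}) cancels, and forms the likelihood ratio from finite-difference Jacobians. For k = 2 it reproduces the closed-form LR above.

```python
import numpy as np

def params_from_joint(p_xy):
    """Factorize a k x k joint as (P_X, P_{Y|X}) and keep the free coordinates."""
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    return np.concatenate([p_x[:-1], p_y_given_x[:, :-1].ravel()])

def joint_from_params(theta, k):
    """Inverse map: rebuild the joint's free coordinates from (P_X, P_{Y|X})."""
    p_x = np.append(theta[:k - 1], 1.0 - theta[:k - 1].sum())
    p_y_given_x = theta[k - 1:].reshape(k, k - 1)
    p_y_given_x = np.hstack([p_y_given_x,
                             1.0 - p_y_given_x.sum(axis=1, keepdims=True)])
    return (p_x[:, None] * p_y_given_x).ravel()[:-1]   # drop the dependent entry

def abs_det_jacobian(p_xy, eps=1e-6):
    """|det J| of the map (P_X, P_{Y|X}) -> P_XY, by central finite differences."""
    k = p_xy.shape[0]
    theta = params_from_joint(p_xy)
    jac = np.empty((theta.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        jac[:, j] = (joint_from_params(theta + step, k)
                     - joint_from_params(theta - step, k)) / (2 * eps)
    return abs(np.linalg.det(jac))

rng = np.random.default_rng(0)
k = 3
p_xy = rng.dirichlet(np.ones(k * k)).reshape(k, k)    # an arbitrary test joint
# For equal cardinalities the flat-prior factor cancels and
# LR = |det J_YX| / |det J_XY|; factorizing the transposed joint yields J_YX.
lr = abs_det_jacobian(p_xy.T) / abs_det_jacobian(p_xy)
print("LR(X->Y : Y->X) =", lr)
```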
4.3 Robustness: Changing the Hyperprior F

What if we use the LR classifier, but our assumptions do not match reality? Namely, what if F is not Dir(1)? For example, what if F is a mixture of ten Dirichlet distributions? (A mixture of Dirichlet distributions with arbitrarily many components can approximate any distribution over the simplex.)

We will draw F from mixtures with fixed log_2(α_max). Let the k-th component of the mixture have parameter α^k = (α^k_1, ..., α^k_N), where N is the cardinality of X or Y. Then fixed α_max means that we drew each α^k_i uniformly at random from the interval [2^{-α_max}, 2^{α_max}]. Fig. 3 shows samples from such mixtures with growing α_max. The figure shows that increasing the parameter allows the distributions to grow in complexity. Note that if α_max = 0, we recover the noninformative prior case.

Figure 3: Samples from Dirichlet mixtures. Each plot shows three random samples from a ten-component mixture of Dirichlet distributions over the 1D simplex. Each mixture component has a different, random parameter α. For each plot we fixed a different log_2(α_max), a parameter which limits both the smallest and the largest value of any of the two α coordinates that define each mixture component.

How do the likelihood ratio and the causal direction decision boundary change as we allow α_max to depart from 0? For binary X and Y, Fig. 4 illustrates the change. Comparing with Fig. 2, we see that as α_max grows, the likelihood ratios become more extreme, and the decision boundaries become more complex. Fig. 5 makes it clear that a fixed α_max still allows the decision boundary to vary significantly.

Figure 4: Log-likelihood ratios for the causal direction when F is a mixture of ten Dirichlet distributions with growing α_max (see Fig. 3).

That the independent mechanisms assumption, as we frame it, is not sufficient to provide identifiability of the causal direction was clear from the outset (since each joint can be factorized as P(X)P(Y|X) and as P(Y)P(X|Y)). However, the above considerations suggest that the assumption of noninformative hyperpriors is rather strong: in fact, it is possible to show that the decision surface can be precisely flipped with an appropriate adjustment of F, making the LR classifier's error precisely 100%. Our experiments, however, suggest that using the LR classifier is a reasonable choice in a wide range of circumstances, especially as the cardinality of X and Y grows.

In our experiments, we checked how the error changes as we allow the α_max parameter of all the hyperpriors to grow. Our experimental procedure is as follows (a minimal sketch of one trial is given after this list):

1. Fix the dimensionality of X and Y, and fix α_max.
2. Sample 100 hyperpriors for each dimensionality and α_max. Sample the α parameters of F within the given α_max bounds, where F consists of Dirichlet mixtures (with 10 components), as described above.
3. Sample 100 priors for each hyperprior. Sample P(cause) and P(effect | cause) 100 times for each hyperprior (that is, for each α setting).
4. Sample the causal label uniformly. If X → Y is chosen, let P_XY = P(cause)P(effect | cause). If Y → X is chosen, let P_XY = transpose[P(cause)P(effect | cause)].
5. Classify. Use the LR classifier to classify P_XY's causal direction and record an error if the causal label disagrees with the classifier.
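A minimal sketch of one such trial follows (our own code; the uniform mixture weights are our assumption, since the text does not state them, and the helper names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_alphas(n_states, alpha_max, n_components=10):
    """Each component's alpha coordinates are drawn uniformly from
    [2**(-alpha_max), 2**alpha_max] (our reading of the convention above)."""
    return rng.uniform(2.0 ** -alpha_max, 2.0 ** alpha_max,
                       size=(n_components, n_states))

def sample_from_mixture(alphas):
    """One draw from the mixture; uniform component weights assumed."""
    component = rng.integers(len(alphas))
    return rng.dirichlet(alphas[component])

# One trial for |X| = |Y| = 3 and alpha_max = 7.
kx = ky = 3
alphas_cause = sample_mixture_alphas(kx, alpha_max=7)
alphas_mech = sample_mixture_alphas(ky, alpha_max=7)

p_cause = sample_from_mixture(alphas_cause)
p_effect_given_cause = np.stack([sample_from_mixture(alphas_mech) for _ in range(kx)])
p_xy = p_cause[:, None] * p_effect_given_cause       # X -> Y joint table
if rng.random() < 0.5:                                # causal label sampled uniformly
    p_xy = p_xy.T                                     # relabel as Y -> X
# p_xy would now be handed to the LR classifier of Eq. (2, 3) and the error recorded.
```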

Figure 6 shows the results. As the cardinality of the system grows, the LR classifier's decision boundary approximates the decision boundary for most Dirichlet mixtures. Another trend is that as α_max grows, the variance of the error grows, but there is only a small growing trend in the error itself. In addition, Fig. 7 shows that the error does not increase as we allow more mixture components, up to 128 components, while holding α_max at the large value of 7. Thus, the LR classifier performs well even for extremely complex hyperpriors, at least on average.

Figure 5: Log-likelihood ratios for the causal direction when F is a mixture of ten Dirichlet distributions with α_max = 2^8 (see Fig. 3); each plot corresponds to a different, randomly sampled α.

Figure 6: Results of the direction-classification experiment. We varied the cardinality of X and Y as well as α_max of the mixture of Dirichlets F. For each setting, we sampled 100 P_XY distributions according to our causal model and recorded the classification error of the simple LR classifier. The results show that, as the cardinality of X and Y grows, the LR classifier's accuracy increases.

Figure 7: Results of the direction-classification experiment when the number of Dirichlet mixture model hyperprior components varies. We fixed α to vary between 2^{-7} and 2^{7}. The results show that the maximum-likelihood classifier that assumes the noninformative priors is not sensitive to the number of Dirichlet mixture components that the test data is sampled from.

5 A Black-box Solution: Detecting Confounding

Consider now the question of whether X → Y or X ← H → Y, where H is a latent variable (a confounder). In this section we present a solution to this problem under the assumptions of Sec. 3. Unfortunately, deriving the optimal classifier for this case is difficult without additional assumptions on the latent H. Instead, we propose a black-box classifier. We create a dataset of distributions from both the direct-causal and the confounded case, using the uninformative Dirichlet prior on either P(X) and P(Y|X) (the direct-causal case) or P(H), P(X|H) and P(Y|H) (the confounded case). For each confounded distribution, we chose the cardinality of H, the hidden confounder, uniformly at random between 2 and 100. Next, we trained a neural network to classify the causal structure (Python code that reproduces the experiment is available at vision.caltech.edu/~kchalupk/code.html). We then checked how well this classifier performs as we vary the cardinality of the variables, and as we allow the true hyperprior to be a mixture of 10 Dirichlets, analogously to the experiment from Sec. 4. A sketch of the data-generation step is given below.

Fig. 8 shows the results. Note that the classification errors are much lower than for the causal-direction case. Both problems (deciding causal direction and detecting confounding) are in principle unidentifiable, but it appears the latter is inherently easier. The neural net classifier seems little bothered by growing α_max. The largest source of error, for cardinalities of X and Y larger than 3, seems to be neural network training rather than anything else.
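The sketch below shows one way to reproduce the data-generation step with a stand-in classifier. The network architecture and training details actually used are those in the released notebooks; the scikit-learn MLP here, with hyperparameters chosen by us, is only an illustrative substitute.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
kx = ky = 4                                        # illustrative cardinalities

def direct_causal():
    p_x = rng.dirichlet(np.ones(kx))
    p_y_given_x = rng.dirichlet(np.ones(ky), kx)
    return p_x[:, None] * p_y_given_x

def confounded():
    kh = rng.integers(2, 101)                      # |H| uniform in {2, ..., 100}
    p_h = rng.dirichlet(np.ones(kh))
    p_x_given_h = rng.dirichlet(np.ones(kx), kh)
    p_y_given_h = rng.dirichlet(np.ones(ky), kh)
    return np.einsum('h,hx,hy->xy', p_h, p_x_given_h, p_y_given_h)

def make_dataset(n_per_class):
    tables = [direct_causal() for _ in range(n_per_class)] + \
             [confounded() for _ in range(n_per_class)]
    features = np.stack([t.ravel() for t in tables])           # flattened joint tables
    labels = np.array([0] * n_per_class + [1] * n_per_class)   # 0: X->Y, 1: X<-H->Y
    return features, labels

X_train, y_train = make_dataset(5000)
X_test, y_test = make_dataset(1000)
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```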

Figure 8: Results of the black-box confounding detector. We varied the cardinality of X and Y as well as α_max of the mixture of Dirichlets F. For each setting, we sampled 1000 P_XY distributions according to our causal model and recorded the classification error of a neural net classifier trained on noninformative Dirichlet hyperprior data. The results show that, as the cardinality of X and Y grows, the classifier's accuracy increases.

6 A Black-Box Solution to the General Problem

Finally, we present a solution to the general causal discovery problem over the two variables X, Y: deciding between the six alternatives shown in Fig. 1. The idea is a natural extension of the black-box classifier from Sec. 5. We create a dataset containing all the six cases, sampled under the assumptions of Sec. 3. We then train a neural network on this dataset (the neural network architecture, as well as the details of the training procedure, are available in the accompanying Python code). Figure 9 shows the results of applying the classifier to distributions sampled from flat hyperpriors (that is, from a test set with statistics identical to the training set), for cardinalities |X| = |Y| = 2 and |X| = |Y| = 10. As expected, the number of errors is much lower for the higher cardinality. For the cardinality of 2, the confusion matrix shows that the neural networks:

1. easily learn to classify independent vs dependent variables,
2. confuse the X → Y and Y → X cases, and
3. confuse the two directed-causal plus confounding cases (Fig. 1E, F).

However, all these are insignificant issues when |X| = 10, where the total error is 85 out of 10000 test points. For |X| = 2, the error is 25.7%. We remark again that the problem is not identifiable; that is, there is no true causal class for any point in our training or test dataset. Each distribution could arise from any of the possible five causal systems in which X and Y are not independent.
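For completeness, here is one plausible way to generate training examples for all six structures under the flat Dir(1) assumptions (our sketch, not the released code). How the two directed-plus-confounded classes combine the X → Y edge with the confounder is not spelled out above, so the P(Y | X, H) ~ Dir(1) choice below is our assumption, flagged in the comments.

```python
import numpy as np

rng = np.random.default_rng(0)
kx = ky = 10                                       # cardinalities as in Fig. 9B

def sample_table(label):
    """Sample one joint P_XY for one of the six structures of Fig. 1."""
    if label == 'independent':                     # no edge between X and Y
        return np.outer(rng.dirichlet(np.ones(kx)), rng.dirichlet(np.ones(ky)))
    if label == 'X->Y':
        p_x = rng.dirichlet(np.ones(kx))
        return p_x[:, None] * rng.dirichlet(np.ones(ky), kx)
    if label == 'Y->X':
        return sample_table('X->Y').T              # valid here because kx == ky
    kh = rng.integers(2, 101)                      # |H| uniform in {2, ..., 100}
    p_h = rng.dirichlet(np.ones(kh))
    p_x_given_h = rng.dirichlet(np.ones(kx), kh)
    p_y_given_h = rng.dirichlet(np.ones(ky), kh)
    if label == 'confounded':                      # X <- H -> Y only
        return np.einsum('h,hx,hy->xy', p_h, p_x_given_h, p_y_given_h)
    # Directed edge plus confounder (Fig. 1E/F). ASSUMPTION: we draw
    # P(Y | X = x, H = h) ~ Dir(1) for every (x, h); the paper's exact
    # parameterization of these two classes is in the released notebooks.
    p_y_given_xh = rng.dirichlet(np.ones(ky), (kx, kh))
    joint = np.einsum('h,hx,xhy->xy', p_h, p_x_given_h, p_y_given_xh)
    return joint if label == 'X->Y confounded' else joint.T

labels = ['independent', 'confounded', 'X->Y', 'Y->X',
          'X->Y confounded', 'Y->X confounded']
X_train, y_train = [], []
for _ in range(60000):                             # all classes sampled with equal chance
    idx = rng.integers(len(labels))
    X_train.append(sample_table(labels[idx]).ravel())
    y_train.append(idx)
# X_train / y_train can now be fed to the same kind of black-box classifier as in Sec. 5.
```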

Figure 9: Confusion matrices for the all-causal-classes classification task. The test set consists of distributions sampled from uniform hyperpriors, that is, sampled from the same statistics as the training data (equivalent to α_max = 0 in previous sections). A) Results for |X| = |Y| = 2; total number of errors = 2477. B) Results for |X| = |Y| = 10; total errors = 85. C) Average results for |X| = |Y| = 10, same classifier as in B) but with the test set sampled from non-uniform hyperpriors with α_max = 7 (see text); 201 errors on average. In each case, the test set contains 10000 distributions, with all the classes sampled with an equal chance.

The fact that the error nears 0 in the high-cardinality case indicates that the likelihoods under our assumptions grow very peaked as the cardinality grows. Thus, the optimal decision can quite safely be called the true decision. In addition, Fig. 9C shows the average confusion table for a hundred trials in which our classifier was applied to distributions over X and Y with cardinality 10, corresponding to all the possible six causal structures, but sampled from non-uniform hyperpriors with α_max = 7. The performance drop is not drastic compared to Fig. 9B.

7 Discussion

We developed a neural network that determines the causal structure linking two discrete variables. We allow for confounding between the two variables, but assume acyclicity. The classifier takes as input a joint probability table P_XY between the two variables and outputs the most likely causal graph that corresponds to this joint. The possible causal graphs span the range shown in Fig. 1, from independence to confounding co-occurring with direct causation. We emphasize two limitations of the classifier:

1. Since the classifier makes a forced choice between the six acyclic alternatives, it will necessarily produce 100% error on P_XY's generated from cyclic systems.
2. Our goal was not, and cannot be, to achieve 100% accuracy. For example, the error in Fig. 9A is about 25%. However, this is not necessarily a bad result. Our considerations in Sec. 3 and 4 show that even when all our assumptions hold, the optimal classifier has a non-zero error.

The latter is a consequence of the non-identifiability of the problem: it is not possible, in general, to identify the causal structure between two variables by looking at the joint distribution and without intervention. Our goal was to introduce a minimal set of assumptions that, while acknowledging the non-identifiability, enable us to make useful inferences. We note that as the cardinality of the variables rises, the task becomes more and more identifiable in the sense that, for each given P_XY, one out of the possible six causal graphs strongly dominates the others with respect to its likelihood. In this situation, the most likely causal structure becomes essentially the only possible one, barring a small error, and the problem becomes practically identifiable.

All of the above applies assuming that our generative model corresponds to reality. The assumptions, discussed in Sec. 3, boil down to two ideas: 1) the world creates causes independently of causal mechanisms, and 2) causes are random variables whose distributions are sampled from flat Dirichlet hyperpriors; causal mechanisms are conditional distributions of effects given causes, and are also sampled from flat Dirichlet hyperpriors. Whether these assumptions are realistic or not is an undecidable question. Nevertheless, through a series of simple experiments (Fig. 6, Fig. 7, Fig. 8, Fig. 9) we showed that the assumption of flat hyperpriors is not essential: our classifiers' average performance does not decrease significantly as we allow the hyperpriors to vary, although the variance of the performance grows. In future work, we will carefully analyze under what conditions the flat-hyperprior classifier performs well even if the hyperpriors are not flat. The current working hypothesis is that as long as the hyperprior on P(cause) is the same as the hyperprior on P(effect | cause), the classification performance does not change significantly on average, but, as seen in our experiments, it will have increased variance.
Shohei Shimizu explained our task (for the case of continuous variables) as: "Under what circumstances and in what way can one determine causal structure based on data which is not obtained by controlled experiments but by passive observation only?" (Shimizu et al., 2006). Our answer is: for high-cardinality discrete variables, it seems enough to assume independence of P(cause) from P(effect | cause), and to train a neural network that learns the black-box mapping between observations and their causal generative mechanism.

References

David Maxwell Chickering. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2(Feb):445-498, 2002a.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507-554, 2002b.

D. Geiger and J. Pearl. On the logic of causal models. In Proceedings of UAI, 1988.

David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.

Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689-696, 2009.

P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49:362-378, 2008.

Dominik Janzing, Jonas Peters, Joris Mooij, and Bernhard Schölkopf. Identifying confounders using additive noise models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 249-257. AUAI Press, 2009.

C. Meek. Strong completeness and faithfulness in Bayesian networks. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 411-418, 1995.

Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. arXiv preprint arXiv:1412.3773, 2014.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2436-2450, 2011.

Hans Reichenbach. The Direction of Time, volume 65. University of California Press, 1991.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003-2030, 2006.

P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

Xiaohai Sun, Dominik Janzing, and Bernhard Schölkopf. Causal inference by choosing graphs with most plausible Markov kernels. In ISAIM, 2006.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647-655. AUAI Press, 2009.