Causal Inference & Reasoning with Causal Bayesian Networks

Neyman-Rubin Framework. Potential outcome framework: for each unit k and each treatment i, there is a potential outcome on an attribute U, written U_{ik}, representing the outcome unit k would display if it received treatment i. A treatment effect can be defined in terms of potential outcomes: e.g., a drug (indexed by 1) has an effect on patient k's recovery, relative to a placebo (indexed by 0), if the potential recoveries differ, U_{1k} ≠ U_{0k}. It is a counterfactual framework: for each unit k, only one of the U_{ik} can be observed.

Example: Fisher's Tea-Tasting Lady. The treatment variable Z has 70 possible values, corresponding to the 70 possible arrangements of 8 cups of tea. Potential responses: U_{Z=i}, 1 ≤ i ≤ 70. Fisher's null hypothesis: Z has no effect on U, i.e., (the probability distribution of) U_{Z=i} does not vary with i. Under complete randomization, the null hypothesis implies that P(U = Z) = 1/70.
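
As an aside (not from the slides): assuming the classic design in which 4 of the 8 cups are milk-first and 4 are tea-first, the 70 arrangements come from choosing which 4 cups are milk-first: C(8,4) = 8!/(4! 4!) = 70.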

Generalization. The potential response Y_{X=x} can be interpreted as the value Y would take if X were intervened to take value x. Write P_{X=x}(Y) for the probability distribution of Y given that X is intervened to take value x. Multivariate generalization: P_{X=x}(Y), where X may be a set of variables intervened on simultaneously. Causal knowledge is useful for estimating such post-intervention quantities, and a well-developed machinery for this purpose is Causal Bayesian Networks.

Directed Acyclic Graphs (DAGs). A set of nodes and a set of directed edges between nodes, with the standard terminology: parent/child, ancestor/descendant, adjacency, path, directed path, etc. No directed cycles, i.e., no directed path from a variable back to itself. [Figure: the running example DAG over X1, ..., X5, with edges X1 → X3, X1 → X4, X2 → X4, X4 → X5.]

Probabilistic Interpretation of DAGs. Take the vertices as random variables. (Local) Markov condition: every variable is probabilistically independent of its non-descendants given its parents. [Figure: the example DAG.] For the example this gives: I(X1, X2); I(X2, {X1, X3}); I(X3, {X2, X4, X5} | X1); I(X4, X3 | {X1, X2}); I(X5, {X1, X2, X3} | X4).

Factorization. Given a DAG G, choose any ancestral order of the variables X_1, ..., X_n, i.e., an order such that if i < j then X_j is not an ancestor of X_i. For a joint distribution P that is Markov to G (i.e., satisfies the Markov condition with G), we can factorize:
P(X_1, ..., X_n) = P(X_1) P(X_2 | X_1) ... P(X_n | X_1, ..., X_{n-1})   [chain rule]
               = ∏_i P(X_i | Pa(X_i))   [exploiting the local Markov property]
Conversely, the factorization implies the Markov condition.

Example. For the example DAG over X1, ..., X5:
P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X1) P(X3 | X1, X2) P(X4 | X1, X2, X3) P(X5 | X1, X2, X3, X4)
                      = P(X1) P(X2) P(X3 | X1) P(X4 | X1, X2) P(X5 | X4)

Parameterization. Because of the factorization, to specify a joint distribution over the variables in a DAG, it suffices to specify the distribution of each variable conditional on its parents: P(X_i | Pa(X_i)). A DAG together with these local conditional distributions is known as a Bayesian network.
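
To make the parameterization concrete, here is a minimal Python sketch (not from the slides) of a Bayesian network over the running five-variable example; the DAG is the one used above, but the CPT numbers are made up purely for illustration.

    from itertools import product

    # Hypothetical Bayesian network over binary X1..X5 with the running DAG:
    # X1 -> X3, X1 -> X4, X2 -> X4, X4 -> X5.
    # Each CPT maps (value, parent values...) to P(value | parents).
    parents = {"X1": [], "X2": [], "X3": ["X1"], "X4": ["X1", "X2"], "X5": ["X4"]}
    cpt = {
        "X1": {(1,): 0.3, (0,): 0.7},
        "X2": {(1,): 0.6, (0,): 0.4},
        "X3": {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8},
        "X4": {(1, 1, 1): 0.95, (0, 1, 1): 0.05, (1, 1, 0): 0.7, (0, 1, 0): 0.3,
               (1, 0, 1): 0.5, (0, 0, 1): 0.5, (1, 0, 0): 0.1, (0, 0, 0): 0.9},
        "X5": {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.25, (0, 0): 0.75},
    }

    def joint(assignment):
        """P(x1,...,x5) as the product of P(Xi | Pa(Xi)), per the factorization above."""
        p = 1.0
        for v in parents:
            key = (assignment[v],) + tuple(assignment[pa] for pa in parents[v])
            p *= cpt[v][key]
        return p

    # Sanity check: the joint sums to 1 over all 2^5 assignments.
    total = sum(joint(dict(zip(parents, vals))) for vals in product([0, 1], repeat=5))
    print(round(total, 10))  # 1.0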

Example of a Bayesian Network. [Figure: a Bayesian network over W, C, A, F, R with local conditional distributions P(W), P(C | W), P(A | W), P(F | C), P(R | C, A).]

Causal Interpretation of DAGs. parent/child is read as direct cause/effect (relative to the set of variables in the graph). [Figure: causal DAG with Smoking → Yellow Teeth, Smoking → Lung Cancer, Radon → Lung Cancer, Lung Cancer → Weight Loss.]

Surgical Intervention. [Figure: the smoking DAG under an intervention on a cause and under an intervention on an effect; the intervened variable has its incoming arrows removed.]

Causal Bayesian Network. Recall the factorization of a Bayesian network: P(V) = ∏_{Y ∈ V} P(Y | Pa_G(Y)). A causal Bayesian network over V is a DAG G together with a set of distributions {P_{X=x}(V) : X ⊆ V} that satisfies the following intervention postulate:
P_{X=x}(V) = ∏_{Y ∈ V\X} P(Y | Pa_G(Y)), for values of V consistent with X = x,
where P(Y | Pa_G(Y)) = P_∅(Y | Pa_G(Y)) denotes the conditional distribution in the absence of interventions. A causal Bayesian network thus provides a link between post-intervention distributions and the pre-intervention distribution, the latter of which can be estimated from passive observations.

Causal Reasoning. The intervention postulate assumes the following:
- Interventions are effective (arrow-breaking) and local (no "fat hand");
- The post-intervention joint distribution factorizes according to the post-intervention causal DAG;
- Modularity: the local conditional distribution for a variable Y is invariant under interventions on other variables (including Y's causes).
The intervention postulate forms the basis for reasoning about the consequences of interventions with the help of a causal diagram.

Causal Reasoning. [Figure: the smoking DAG.]
P_{S=yes}(V) = P_{S=yes}(S) P_{S=yes}(YT | S) P_{S=yes}(R) P_{S=yes}(LC | S, R) P_{S=yes}(WL | LC)
             = P(YT | S) P(R) P(LC | S, R) P(WL | LC), for values consistent with S = yes.
From this we can derive, e.g., P_{S=yes}(LC) = P(LC | S = yes).
P_{YT=yes}(V) = P_{YT=yes}(YT) P_{YT=yes}(S) P_{YT=yes}(R) P_{YT=yes}(LC | S, R) P_{YT=yes}(WL | LC)
              = P(S) P(R) P(LC | S, R) P(WL | LC), for values consistent with YT = yes.
From this we can derive, e.g., P_{YT=yes}(LC) = P(LC).
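
To illustrate the truncated factorization in code, here is a self-contained Python sketch (not from the slides); the DAG is the smoking graph above, but the CPT numbers are hypothetical, chosen only so that the computation runs and the two identities above can be checked numerically.

    from itertools import product

    # Hypothetical CPTs for the smoking DAG: S -> YT, S -> LC, R -> LC, LC -> WL.
    # Variables are binary (1 = "yes", 0 = "no"); all numbers are made up.
    parents = {"S": [], "R": [], "YT": ["S"], "LC": ["S", "R"], "WL": ["LC"]}
    cpt = {
        "S":  {(): 0.3},                                              # P(S=1)
        "R":  {(): 0.2},                                              # P(R=1)
        "YT": {(1,): 0.9, (0,): 0.05},                                # P(YT=1 | S)
        "LC": {(1, 1): 0.4, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.02},  # P(LC=1 | S, R)
        "WL": {(1,): 0.7, (0,): 0.1},                                 # P(WL=1 | LC)
    }

    def p_local(var, val, assignment):
        """P(var=val | parents of var), read off the CPTs."""
        p1 = cpt[var][tuple(assignment[pa] for pa in parents[var])]
        return p1 if val == 1 else 1.0 - p1

    def post_intervention_marginal(do, target):
        """P_do(target=1) via the truncated factorization: intervened variables contribute no factor."""
        names, total = list(parents), 0.0
        for vals in product([0, 1], repeat=len(names)):
            a = dict(zip(names, vals))
            if any(a[v] != x for v, x in do.items()):
                continue  # keep only assignments consistent with the intervention
            w = 1.0
            for v in names:
                if v not in do:
                    w *= p_local(v, a[v], a)
            if a[target] == 1:
                total += w
        return total

    def conditional(target, given, value):
        """Pre-intervention P(target=1 | given=value), computed from the full joint."""
        names, num, den = list(parents), 0.0, 0.0
        for vals in product([0, 1], repeat=len(names)):
            a = dict(zip(names, vals))
            if a[given] != value:
                continue
            w = 1.0
            for v in names:
                w *= p_local(v, a[v], a)
            den += w
            if a[target] == 1:
                num += w
        return num / den

    print(post_intervention_marginal({"S": 1}, "LC"), conditional("LC", "S", 1))
    # Equal: P_{S=yes}(LC) = P(LC | S=yes).
    print(post_intervention_marginal({"YT": 1}, "LC"), conditional("LC", "YT", 1))
    # The first equals the pre-intervention P(LC=1); the second does not in general.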

Recall: Simpson's Reversal. [Figure: two causal DAGs over X, Y, and Z; adjustment for Z is required in the first but not in the second.]
In the first graph:
P_{X:=1}(Y=1) = P_{X:=1}(Y=1 | X=1)
= P_{X:=1}(Y=1 | X=1, Z=1) P_{X:=1}(Z=1 | X=1) + P_{X:=1}(Y=1 | X=1, Z=0) P_{X:=1}(Z=0 | X=1)
= P_{X:=1}(Y=1 | X=1, Z=1) P_{X:=1}(Z=1) + P_{X:=1}(Y=1 | X=1, Z=0) P_{X:=1}(Z=0)
= P(Y=1 | X=1, Z=1) P(Z=1) + P(Y=1 | X=1, Z=0) P(Z=0)
In the second graph:
P_{X:=1}(Y=1) = P_{X:=1}(Y=1 | X=1) = P(Y=1 | X=1)
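
The practical import of the two formulas can be shown with a small Python sketch (not from the slides; the counts are illustrative, kidney-stone-style numbers, not data from this course): when Z is a common cause, the adjusted comparison can reverse the naive observational one.

    # Illustrative counts (hypothetical): Z = severity (0 mild, 1 severe),
    # X = treatment (1) vs. control (0), Y = recovery.
    # counts[(z, x)] = (number recovered, number in that group)
    counts = {
        (0, 1): (81, 87),   (0, 0): (234, 270),
        (1, 1): (192, 263), (1, 0): (55, 80),
    }

    def p_y_given_xz(x, z):
        rec, n = counts[(z, x)]
        return rec / n

    n_total = sum(n for _, n in counts.values())
    p_z = {z: sum(counts[(z, x)][1] for x in (0, 1)) / n_total for z in (0, 1)}

    # Naive observational comparison: P(Y=1 | X=x).
    def p_y_given_x(x):
        rec = sum(counts[(z, x)][0] for z in (0, 1))
        n = sum(counts[(z, x)][1] for z in (0, 1))
        return rec / n

    # Adjusted (post-intervention) quantity when Z is a common cause:
    # P_{X:=x}(Y=1) = sum_z P(Y=1 | X=x, Z=z) P(Z=z).
    def p_do_x(x):
        return sum(p_y_given_xz(x, z) * p_z[z] for z in (0, 1))

    print(p_y_given_x(1), p_y_given_x(0))  # ~0.78 vs ~0.83: control looks better observationally
    print(p_do_x(1), p_do_x(0))            # ~0.83 vs ~0.78: treatment is better after adjusting for Z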

Interventionist Interpretation of Arrows. Take the notion of direct cause as primitive? Assuming there is a causal Bayesian network for V, we can derive that there is a uniquely minimal one: X is a parent of Y iff P_{X=x1, Z=z}(Y) ≠ P_{X=x2, Z=z}(Y) for some x1 ≠ x2 and some z, where Z = V \ {X, Y}. It is natural to refer to this minimal causal DAG as the true causal structure for V. This agrees well with the standard interventionist interpretation of direct cause.

Causal Markov Condition. A special case of the intervention postulate obtains when there is no intervention:
P_∅(V) = P(V) = ∏_{Y ∈ V} P(Y | Pa_G(Y))
Equivalently, the pre-intervention distribution satisfies the Markov condition with the causal DAG G. This is known as the Causal Markov Condition (CMC): every variable is probabilistically independent of its non-effects (non-descendants in the causal graph) given its direct causes (parents in the causal graph).

Intuitions. [Figure: the smoking DAG.] No causal relation and no common cause: no dependence, e.g., I(S, R). A common cause screens off its collateral effects: e.g., I(YT, LC | S). An intermediate cause screens off a remote cause: e.g., I(R, WL | LC). The CMC is a generalization of these intuitions.

Entailed Independence. The conditional independence relations explicitly stated in the CMC can entail others (in virtue of the probability calculus). These entailed conditional independence relations are in principle testable with observational data. A graphical criterion called d-separation captures precisely the (conditional) independence relations entailed by a causal structure.

Collider/Non-collider. Given a path u, X is a collider on u if the two edges on u incident to X both point into X; otherwise it is a non-collider on u. The notion of collider/non-collider is relative to a path: a node can be a collider on one path and a non-collider on another. [Figure: the smoking DAG; e.g., Lung Cancer is a collider on the path Smoking → Lung Cancer ← Radon.]

Peculiarity of Colliders. [Figure: No gas → Car dead ← Battery dead.] "No gas" and "Battery dead" are independent. However, after we learn that "Car dead" is true, they become dependent. This structure, called an unshielded collider, is very important in causal discovery from observational data.
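
Here is a quick numerical illustration of this explaining-away effect, a Python sketch with made-up probabilities (not from the slides):

    # Hypothetical priors: the two causes are marginally independent.
    p_gas, p_batt = 0.1, 0.2          # P(no gas), P(battery dead)

    def p_dead(gas, batt):
        """P(car dead | no gas, battery dead): the car dies iff at least one cause is present."""
        return 1.0 if (gas or batt) else 0.0

    # Joint over (gas, batt, dead) from the collider structure gas -> dead <- batt.
    def joint(gas, batt, dead):
        pg = p_gas if gas else 1 - p_gas
        pb = p_batt if batt else 1 - p_batt
        pd = p_dead(gas, batt) if dead else 1 - p_dead(gas, batt)
        return pg * pb * pd

    # P(no gas | car dead) vs. P(no gas | car dead, battery dead):
    p_dead_total = sum(joint(g, b, 1) for g in (0, 1) for b in (0, 1))
    p_gas_given_dead = sum(joint(1, b, 1) for b in (0, 1)) / p_dead_total
    p_gas_given_dead_batt = joint(1, 1, 1) / sum(joint(g, 1, 1) for g in (0, 1))
    print(round(p_gas_given_dead, 3), round(p_gas_given_dead_batt, 3))
    # Learning the battery is dead "explains away" the no-gas hypothesis: the second number is smaller.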

D-Separation. In a DAG G, a path between X and Y is active given Z iff (i) every non-collider on the path is not in Z, and (ii) every collider on the path is in Z or has a descendant in Z. X and Y are d-separated by Z if no path between a variable in X and a variable in Y is active given Z; otherwise they are d-connected given Z.
Theorem: X and Y are d-separated by Z in a DAG G if and only if G entails (via the Markov condition) that X and Y are independent conditional on Z.
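
For concreteness, here is a minimal Python sketch (not from the slides) of a d-separation test using the standard ancestral-moral-graph construction; the function names are my own, and the usage example is the running X1-X5 DAG.

    def ancestors(parents, nodes):
        """All ancestors of `nodes` in the DAG (including the nodes themselves)."""
        seen, stack = set(nodes), list(nodes)
        while stack:
            for p in parents[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(parents, X, Y, Z):
        """Test whether X and Y are d-separated by Z in a DAG given as {node: set of parents}."""
        # 1. Restrict to the ancestors of X, Y, and Z.
        A = ancestors(parents, set(X) | set(Y) | set(Z))
        # 2. Moralize: link each node to its parents, marry co-parents, drop directions.
        nbrs = {v: set() for v in A}
        for v in A:
            pa = parents[v] & A
            for p in pa:
                nbrs[v].add(p)
                nbrs[p].add(v)
            for p in pa:
                for q in pa:
                    if p != q:
                        nbrs[p].add(q)
        # 3. X and Y are d-separated by Z iff they are disconnected once Z is removed.
        reachable, frontier = set(X), list(X)
        while frontier:
            for w in nbrs[frontier.pop()] - set(Z):
                if w not in reachable:
                    reachable.add(w)
                    frontier.append(w)
        return reachable.isdisjoint(set(Y))

    # The running example DAG: X1 -> X3, X1 -> X4, X2 -> X4, X4 -> X5.
    parents_ex = {"X1": set(), "X2": set(), "X3": {"X1"}, "X4": {"X1", "X2"}, "X5": {"X4"}}
    print(d_separated(parents_ex, {"X2"}, {"X3"}, set()))   # True: the only path has a collider at X4
    print(d_separated(parents_ex, {"X2"}, {"X3"}, {"X4"}))  # False: conditioning on the collider activates the path
    print(d_separated(parents_ex, {"X1"}, {"X5"}, {"X4"}))  # True: X4 blocks the directed path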

Example. [Figure: a DAG over X, Y, U, V, W, Z, S.]
X and Y are d-separated (by the empty set).
X and Y are d-separated by {Z}.
X and Y are d-separated by {Z, V}.
X and Y are d-connected by {U, S}.

Causal Faithfulness Condition. The CMC reformulated: d-separation implies conditional independence. But it says nothing about d-connection. Causal Faithfulness Condition (CFC): all conditional independence relations that hold (in the pre-intervention population) are entailed by (the Markov condition of) the causal graph. In other words, d-connection implies dependence. Under CMC and CFC together: conditional independence holds if and only if the corresponding d-separation holds.

Aside: Two Facts Related to Topic 3. Proposition 1: in a DAG G, if X is not a descendant of Y and there is no directed path from X to Y, then X and Y are d-separated by the empty set or by some set of non-descendants of X. Proposition 2: in a DAG G, if there is a directed path from X to Y, then X and Y are d-connected given any set of non-descendants of X (including the empty set). Hints: Proposition 1 may be used to argue that, assuming CMC, a Suppes-type definition of genuine causal relevance gives sufficient conditions; Proposition 2 may be used to argue that, assuming CFC, the definition gives necessary conditions.

Causal Inference under CMC & CFC. [Figure: schematic of the inference pipeline.]
Data:
      A B C D
Ex 1: 0 0 1 0
Ex 2: 1 0 1 1
Ex 3: 0 0 0 0
Ex 4: 0 1 0 0
...
Independence facts obtained from the data (A is independent of B; A is NOT independent of B conditional on C; ...) correspond, under CMC and CFC, to d-separation features (A is d-separated from B; A is NOT d-separated from B by C; ...), from which a causal graph over A, B, C, D is to be inferred.

Bad News: Markov Equivalence. Different DAGs may share exactly the same d-separation structure; such DAGs are called Markov equivalent. In this sense, CMC and CFC do not reduce causality to probability: the d-separation features recovered from independence facts (A is d-separated from B; A is NOT d-separated from B by C; ...) determine only an equivalence class of DAGs.

Example. [Figure: two Markov equivalent DAGs over X1, ..., X5.]

Good News: Common Substructure. Markov equivalent networks may share common features. Even though we cannot distinguish between Markov equivalent causal DAGs based on patterns of conditional independence, their common substructure can be inferred. [Figure: the two Markov equivalent DAGs over X1, ..., X5.]

Representing Equivalence Classes. A pattern (or PDAG) can contain both directed edges (representing directions shared by all DAGs in the equivalence class) and undirected edges (indicating ambiguity about the direction). [Figure: a pattern over X1, ..., X5.]

Constraint-based Search. [Figure: schematic of the search pipeline.] Data → independence facts (A is independent of B; A is NOT independent of B conditional on C; ...) → d-separation features (A is d-separated from B; A is NOT d-separated from B by C; ...) → an equivalence class of DAGs represented by a pattern.

Latent Variables. So far we have assumed that all variables in V are observed. Given this assumption, algorithms are available that are sound and complete in extracting causal information from an oracle of conditional independence. More realistically, we observe only a subset O ⊆ V. We are then only entitled to assume that there is some set of latent variables L such that there is a causal Bayesian network over O ∪ L, and that the CMC and CFC hold of O ∪ L. Causal inference becomes much more difficult and complicated in the presence of latent variables, but not hopeless.

Partial Ancestral Graphs (PAGs). A PAG is a graph that can contain four kinds of edges: o-o, o->, ->, and <->, representing an (infinite) set of DAGs that entail exactly the same observed conditional independence relations.
A o-o B: either A is a cause of B, or B is a cause of A, or there is a latent common cause.
A o-> B: B is NOT a cause of A.
A -> B: A is a cause of B.
A <-> B: there is a latent common cause of A and B.
[Figure: an example PAG over X1, ..., X5.]

Reasoning with a PAG. Since a PAG gives only partial information about the underlying causal Bayesian network, not all post-intervention probabilities are identifiable based on a PAG. But the partial information is often enough to identify some post-intervention probabilities. Rules for when a post-intervention probability can be identified given a PAG are available.

Scheines' Example: College Plans. A real data set collected by Sewell and Shah (1968); 10,138 instances from Wisconsin high schools.
Variables and values:
SEX: male, female
IQ (intelligence quotient): low, lower middle, upper middle, high
CP (college plans): yes, no
PE (parental encouragement): low, high
SES (socioeconomic status): low, lower middle, upper middle, high

The PAG Learned from Data. [Figure: the PAG over SES, SEX, PE, IQ, and CP learned from the data.]

Estimating Causal Effect. [Figure: the learned PAG over SES, SEX, PE, IQ, and CP.]
P_{PE}(CP) = Σ_{SES} P(CP | PE, SES) P(SES)
P_{PE=high}(CP = yes) = 0.517
P_{PE=low}(CP = yes) = 0.081

References. A bunch of causal discovery algorithms are implemented in the Tetrad package: www.phil.cmu.edu/projects/tetrad. A fine online course if you want to know more but do not want to consult me any more: www.phil.cmu.edu/projects/csr. The single best reference in this field is Pearl, J. (2000), Causality: Models, Reasoning, and Inference. For computational aspects, see Neapolitan, R. E. (2003), Learning Bayesian Networks.