Graphical models and causality: Directed acyclic graphs (DAGs) and conditional (in)dependence
General overview
- Introduction
- Directed acyclic graphs (DAGs) and conditional independence
- DAGs and causal effects
- Learning DAGs from observational data
- IDA algorithm
- Further problems
Overview: DAGs and conditional (in)dependence
- Directed acyclic graphs
- Factorization of the joint density
- Markov property
- d-separation
Graph terminology
- A graph G = (V, E) consists of vertices (nodes) V and edges E.
- There is at most one edge between every pair of vertices.
- Two vertices are adjacent if there is an edge between them.
- If all edges are directed (i → j), the graph is called directed.
- A path between i and j is a sequence of distinct vertices (i, …, j) such that successive vertices are adjacent.
- A directed path from i to j is a path between i and j where all edges point towards j, i.e., i → ⋯ → j.
- A non-directed path from i to j is a path between i and j that is not directed.
- A cycle is a path (i, j, …, k) plus an edge between k and i.
- A directed cycle is a directed path (i, j, …, k) from i to k, plus an edge k → i.
- A directed acyclic graph (DAG) is a directed graph without directed cycles.
Example
Consider the DAG with vertices {1, 2, 3, 4, 5} and edges 1 → 2, 1 → 4, 2 → 4, 3 → 4, 4 → 5.
Which of the following are (directed) paths?
- (4,2,1,3): not a path, since 1 and 3 are not adjacent
- (3,4,2,1,4,5): not a path, since the vertices are not distinct
- (3,4,2,1): a path, but not a directed path
- (3,4,5): a directed path from 3 to 5
- (1,2,4,1): a cycle, but not a directed cycle
This graph is a DAG. Adding the edge 5 → 2 yields the directed cycle (2,4,5,2); the graph is then no longer a DAG.
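To experiment with these definitions, here is a minimal sketch in R using the dagitty package (an assumption; the accompanying R-code for this course may use different tools, and the node names v1 to v5 are mine, standing for vertices 1 to 5):

```r
library(dagitty)

# The example DAG: 1 -> 2, 1 -> 4, 2 -> 4, 3 -> 4, 4 -> 5
g <- dagitty("dag { v1 -> v2 ; v1 -> v4 ; v2 -> v4 ; v3 -> v4 ; v4 -> v5 }")

# All paths between 3 and 5, then directed paths only
paths(g, from = "v3", to = "v5")$paths
paths(g, from = "v3", to = "v5", directed = TRUE)$paths  # "v3 -> v4 -> v5"

# Adding the edge 5 -> 2 creates the directed cycle 2 -> 4 -> 5 -> 2
g2 <- dagitty("dag { v1 -> v2 ; v1 -> v4 ; v2 -> v4 ; v3 -> v4 ; v4 -> v5 ; v5 -> v2 }")
isAcyclic(g2)  # FALSE: no longer a DAG
```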
Graph terminology
- If i → j, then i is a parent of j, and j is a child of i.
- If there is a directed path from i to j, then i is an ancestor of j and j is a descendant of i. Each vertex is also an ancestor and a descendant of itself.
- The sets of parents, children, descendants and ancestors of i in G are denoted by pa(i, G), ch(i, G), desc(i, G), an(i, G). We omit G if the graph is clear from the context.
- We write sets of variables in bold face.
- All definitions are applied disjunctively to sets. Example: $\mathrm{pa}(S) = \bigcup_{k \in S} \mathrm{pa}(k)$.
- The non-descendants of S are the complement of desc(S): $\mathrm{nondesc}(S) = V \setminus \mathrm{desc}(S)$.
Example (same DAG as before: 1 → 2, 1 → 4, 2 → 4, 3 → 4, 4 → 5)
- 1 is a parent of 4, and 4 is a child of 1
- 1 is an ancestor of 5, and 5 is a descendant of 1
- ch(4) = {5}, desc(4) = {4,5}, pa(4) = {1,2,3}, an(4) = {1,2,3,4}
- pa({2,4}) = {1} ∪ {1,2,3} = {1,2,3}
- desc({2,3}) = {2,4,5} ∪ {3,4,5} = {2,3,4,5}
- nondesc({2,3}) = {1,2,3,4,5} \ {2,3,4,5} = {1}
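Continuing the sketch from above, dagitty's vertex-set helpers mirror these definitions (and, like the slides, count a node among its own ancestors and descendants):

```r
parents(g, "v4")      # "v1" "v2" "v3"
children(g, "v4")     # "v5"
ancestors(g, "v4")    # "v4" plus "v1" "v2" "v3" (node included; order may vary)
descendants(g, "v4")  # "v4" "v5"
```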
DAGs and random variables
- Each vertex represents a random variable: we use i to denote both the vertex i and the random variable $X_i$.
- An edge denotes a relationship between the pair of variables (we will make this more precise later).
Factorization of the joint density
Suppose we have p binary random variables. Then specifying the joint distribution requires a probability table of size $2^p$. Can we do this more efficiently?
We always have (chain rule): $f(x_1, \dots, x_p) = f(x_1)\, f(x_2 \mid x_1) \cdots f(x_p \mid x_1, \dots, x_{p-1})$.
A set of variables pa(j) is said to be the Markovian parents of $X_j$ if it is a minimal subset of $\{X_1, \dots, X_{j-1}\}$ such that $f(x_j \mid x_1, \dots, x_{j-1}) = f(x_j \mid \mathrm{pa}(j))$.
Then $f(x_1, \dots, x_p) = \prod_{j=1}^p f(x_j \mid \mathrm{pa}(j))$.
We can draw a DAG accordingly.
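As a small numerical sanity check, the following base-R sketch (with hypothetical probabilities) uses the DAG 1 → 2 → 3, where pa(3) = {2}: the factorization needs only 1 + 2 + 2 = 5 numbers instead of the $2^3 - 1 = 7$ of a full table, yet still defines a valid joint distribution:

```r
# Hypothetical conditional probability tables for binary X1, X2, X3
p1  <- c(0.6, 0.4)                     # f(x1), indexed by x1 + 1
p21 <- rbind(c(0.8, 0.2), c(0.3, 0.7)) # f(x2 | x1), rows indexed by x1 + 1
p32 <- rbind(c(0.9, 0.1), c(0.5, 0.5)) # f(x3 | x2), rows indexed by x2 + 1

# f(x1, x2, x3) = f(x1) * f(x2 | x1) * f(x3 | x2)
joint <- function(x1, x2, x3) {
  p1[x1 + 1] * p21[x1 + 1, x2 + 1] * p32[x2 + 1, x3 + 1]
}

# The 2^3 joint probabilities sum to 1
grid <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1)
sum(mapply(joint, grid$x1, grid$x2, grid$x3))  # 1
```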
Examples
Suppose the only (conditional) independence among $(X_1, X_2, X_3)$ is $X_1 \perp X_3 \mid X_2$, i.e., $f(x_1 \mid x_2, x_3) = f(x_1 \mid x_2)$ and $f(x_3 \mid x_1, x_2) = f(x_3 \mid x_2)$.
Then $f(x_1, x_2, x_3) = f(x_1)\, f(x_2 \mid x_1)\, f(x_3 \mid x_1, x_2) = f(x_1)\, f(x_2 \mid x_1)\, f(x_3 \mid x_2)$. DAG: 1 → 2 → 3
Or $f(x_3, x_2, x_1) = f(x_3)\, f(x_2 \mid x_3)\, f(x_1 \mid x_2, x_3) = f(x_3)\, f(x_2 \mid x_3)\, f(x_1 \mid x_2)$. DAG: 3 → 2 → 1
Or $f(x_1, x_3, x_2) = f(x_1)\, f(x_3 \mid x_1)\, f(x_2 \mid x_1, x_3)$, which does not simplify. DAG: 1 → 3, 1 → 2, 3 → 2
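The first two orderings give different DAGs that encode exactly the same conditional independence, which we can confirm with dagitty (sketch, my node names):

```r
library(dagitty)

g_a <- dagitty("dag { v1 -> v2 -> v3 }")
g_b <- dagitty("dag { v3 -> v2 -> v1 }")
impliedConditionalIndependencies(g_a)  # v1 _||_ v3 | v2
impliedConditionalIndependencies(g_b)  # v1 _||_ v3 | v2
```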
Example
First order Markov chain: $f(x_1, \dots, x_p) = f(x_1)\, f(x_2 \mid x_1) \cdots f(x_p \mid x_1, \dots, x_{p-1}) = f(x_1)\, f(x_2 \mid x_1) \cdots f(x_p \mid x_{p-1})$.
DAG: 1 → 2 → ⋯ → p
Bayesian network
A Bayesian network is a pair (G, f), where f factorizes according to G:
$f(x_1, \dots, x_p) = \prod_{j=1}^p f(x_j \mid \mathrm{pa}(j, G))$
Bayesian networks can be used for:
- Estimation of the joint density from low-order conditional densities (if the parent sets are small)
- Reading off conditional independencies from the DAG
- Probabilistic reasoning (expert systems)
- Causal inference
Reading off conditional independencies: Markov property
Markov model: 1 → 2 → ⋯ → t−1 → t → t+1. The future is independent of the past given the present: $X_{t+1} \perp (X_{t-1}, X_{t-2}, \dots, X_1) \mid X_t$.
In Bayesian networks, we have a similar Markov property. Let S be any collection of nodes. Then S is independent of its non-descendants given its parents: $S \perp \mathrm{nondesc}(S) \setminus \mathrm{pa}(S) \mid \mathrm{pa}(S)$.
Example: Markov property
DAG: smoking → yellow teeth, smoking → tar in lungs, tar in lungs → cancer, asbestos → cancer.
Note: pa(yellow teeth) = {smoking} and cancer ∈ nondesc(yellow teeth).
Hence, yellow teeth ⊥ cancer | smoking in any distribution that factorizes according to this DAG.
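This Markov-property statement can be checked mechanically with dagitty (a sketch; the underscore node names are mine):

```r
library(dagitty)

g_smoke <- dagitty("dag {
  smoking -> yellow_teeth
  smoking -> tar
  tar -> cancer
  asbestos -> cancer
}")
dseparated(g_smoke, "yellow_teeth", "cancer", "smoking")  # TRUE
```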
Graph terminology
- The skeleton of a graph is the graph obtained by removing all arrowheads.
- A node i is a collider on a path if the path contains → i ← (the arrows collide at i). Otherwise, it is a non-collider on the path.
Examples (in the DAG with edges 1 → 2, 1 → 4, 2 → 4, 3 → 4, 4 → 5):
- 4 is a collider on the path (3,4,1)
- 4 is a non-collider on the path (3,4,5)
Note: collider status is relative to a path.
Reading off conditional independencies: d-separation
The Markov property cannot be used to read off arbitrary conditional (in)dependencies. For this we have d-separation.
A path between i and j is blocked by a set S if at least one of the following holds:
- There is a non-collider on the path that is in S; or
- There is a collider on the path such that neither this collider nor any of its descendants is in S.
A path that is not blocked is d-connecting.
If all paths between i and j are blocked by S, then i and j are d-separated by S. Otherwise they are d-connected given S.
In any distribution that factorizes according to a DAG: if i and j are d-separated by S in the DAG, then $X_i$ and $X_j$ are conditionally independent given $X_S$ in the distribution.
Examples: d-separation
Same DAG: smoking → yellow teeth, smoking → tar in lungs, tar in lungs → cancer, asbestos → cancer.
Denote d-separation by ⊥_d. Which of the following hold?
- yellow teeth ⊥_d cancer | smoking: yes
- tar ⊥_d asbestos: yes
- tar ⊥_d asbestos | cancer: no (cancer is a collider on the path tar → cancer ← asbestos, and conditioning on it opens the path)
- yellow teeth ⊥_d asbestos | cancer: no
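The four queries can be verified with dagitty's d-separation test, reusing g_smoke from the sketch above:

```r
dseparated(g_smoke, "yellow_teeth", "cancer", "smoking")   # TRUE
dseparated(g_smoke, "tar", "asbestos", list())             # TRUE (empty conditioning set)
dseparated(g_smoke, "tar", "asbestos", "cancer")           # FALSE: conditioning on the collider opens the path
dseparated(g_smoke, "yellow_teeth", "asbestos", "cancer")  # FALSE
```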
Bayesian network (recap)
A Bayesian network is a pair (G, f), where f factorizes according to G:
$f(x_1, \dots, x_p) = \prod_{j=1}^p f(x_j \mid \mathrm{pa}(j, G))$
Bayesian networks can be used for:
- Estimation of the joint density from low-order conditional densities (if the parent sets are small)
- Reading off conditional independencies from the DAG
- Probabilistic reasoning (expert systems); see R-code
- Causal inference: next topic