The Algebra of Causal Trees and Chain Event Graphs


The Algebra of Causal Trees and Chain Event Graphs
Jim Smith (with Peter Thwaites and Eva Riccomagno), University of Warwick, August 2010

Jim Smith (Warwick) Chain Event Graphs August 2010 1 / 31

Introduction

Algebraic geometry is increasingly used to study the structure of discrete multivariate statistical models. Discrete graphical models, e.g. Bayesian Networks, are especially useful. For many families of graphical model, each atomic probability in the joint pmf is a polynomial in certain "primitive" conditional probabilities. Inferring the primitive probabilities from observed values of the marginal mass function of a measurable function then corresponds to taking the inverse map of the corresponding set of polynomials at those values. Well known examples include the relationship of decomposable graphical models to toric ideals (see e.g. Garcia et al, 2005) and binary phylogenetic tree models and their semi-algebraic characterisations in terms of Möbius inversions and hyperdeterminants (see e.g. Zwiernik and Smith, 2010).

A Typical Example of a Binary Tree Model (Settimi and Smith, 1998)

A hidden binary variable H with three observed binary children X_1, X_2, X_3, where X_i ∈ {−1, 1}, i = 1, 2, 3, and H ∈ {−1, 1}.

From an algebraic point of view the model on the margin (X_1, X_2, X_3) is saturated (dimension 7). However, e.g. when µ_1 = µ_2 = µ_3, we have

  (1 − µ_H²) µ_123² − 4 µ_H² µ_12 µ_23 µ_13 = 0.

Probabilities being real and in [0, 1] implies e.g. µ_12 µ_23 µ_13 > 0 and |µ_123| ≤ 4/(3√3). The observed means µ_1, µ_2, µ_3 also induce further inequality constraints.

More General Trees (Settimi and Smith, 2000)

Now a hidden binary H with four observed binary children X_1, X_2, X_3, X_4, where X_i ∈ {−1, 1}, i = 1, 2, 3, 4, and H ∈ {−1, 1}.

This model must now satisfy equations reducing the dimension of (X_1, X_2, X_3, X_4): we lose 6 dimensions, and the moments must satisfy 5 quadratic equations, e.g.

  µ_12 µ_34 − µ_14 µ_23 = 0,

and a quartic. Many other inequality constraints exist. The marginals on binary phylogenetic trees are now fully characterized: Allman and Rhodes (2010), Zwiernik and Smith (2010, 2010a).

Contents of this talk

Today I will focus on a different class of graphical model based on event trees. They are a much richer class of models than the BN and so support much more varied polynomial forms in their algebraic formulations.

- Definition of an event tree and a chain event graph (CEG).
- Their corresponding polynomial representation.
- Inferential questions asked of the CEG.
- Causal questions associated with the CEG.
- Future challenges.

Advantages of an Event Tree

- The most natural expression of a model describing how things happen.
- Does not need a preferred set of measurement variables a priori.
- Explicitly represents the event space of a model, e.g. levels of variables.
- Asymmetries of the model space are explicitly represented.
- A framework for probabilistic embellishment and algebraic descriptions.
- Causal hypotheses much more richly expressed than in their BN analogues.

Example of an Event Tree

[Figure: event tree for the gene example. The root v_0 branches on environment (hostile → v_1, benign → v_2); situations then branch on survive/die, on the gene being up- or down-regulated (through v_3, v_4 and v_8, v_9), and on full or partial recovery, terminating in the leaves v_5, ..., v_14.]

Chain Event Graphs

- Typically topologically much simpler than event trees but still describe how things happen.
- Their paths fully represent the structure of the sample space.
- Express a rich variety of dependence structures that can be graphically queried.
- Embellish to a probability model and its associated algebraic representation.
- Like BNs, provide a framework for fast propagation and conjugate learning.
- Almost as expressive of causal hypotheses as the event tree.

Constructing a CEG

Event tree → Staged tree → CEG [by positions and stages]

- Start with an event tree.
- Convert it into a staged tree.
- Then transform into a chain event graph by pooling positions and stages together.

Example of an Event Tree

[Figure: the gene-example event tree shown again: root v_0 branching on hostile/benign, then survive/die, up-/down-regulation and full/partial recovery, with leaves v_5, ..., v_14.]

Example of a CEG

Elicit stages, i.e. the partition of situations with the same associated edge-probability distribution:

  u_0 = {v_0}, u_1 = {v_1, v_2}, u_2 = {v_3, v_4}, u_3 = {v_8, v_9}, u_∞ = {leaves}

Deduce positions, i.e. the partition of situations with isomorphic subsequent trees:

  w_0 = {v_0}, w_1 = {v_1}, w_2 = {v_2}, w_3 = {v_3, v_4}, w_4 = {v_8, v_9}, w_∞ = {leaves}

Each position has an associated floret: that position and its emanating edges. Edges in florets of positions in the same stage are coloured to convey the isomorphism.

Example of a CEG

Draw the CEG with vertices as positions and colour the stages.

[Figure: CEG with positions w_0(0), w_1(1), w_2(2), w_3(3, 4), w_4(8, 9) and sink w_∞; the hostile edge leads from w_0 to w_1, and all paths terminate in w_∞.]

Properties of CEGs (Smith and Anderson, 2008)

Theorem. If the random variables X_1, X_2, ..., X_n with known sample spaces are fully expressed as a BN G, or as a context-specific BN G, and you know its CEG C, then the random variables X_1, X_2, ..., X_n and all their conditional independence structure, together with their sample spaces, can be retrieved from C.

Theorem. Downstream ⊥⊥ Upstream | w-cut

[Figure: a CEG illustrating that what happens downstream of a cut of positions is independent of what happened upstream, given the cut.]

Probabilities on the gene CEG

Embellish a CEG with probabilities just as in a tree. Note that positions in the same stage have the same associated edge probabilities. Probabilities of atoms are calculated by multiplying up the edge probabilities on each root-to-sink path.

[Figure: the gene CEG embellished with edge probabilities π_01, π_02 on the floret of w_0, π_11, π_12 on the florets of the stage containing w_1 and w_2, π_21, π_22 on w_3, and π_31, π_32 on w_4.]

Probabilities and Algebra of CEGs

Each stage u has an associated simplex of probabilities {π_i,u : 1 ≤ i ≤ l_u} associated with its emanating edges in the CEG. In our example l_u = 2 and the root-to-sink probabilities are given by:

  p(v_5) = π_02 π_21          p(v_6) = π_02 π_11
  p(v_7) = π_01 π_11 π_21     p(v_10) = π_01 π_12 π_21
  p(v_11) = π_01 π_11 π_22 π_31   p(v_12) = π_01 π_11 π_22 π_32
  p(v_13) = π_01 π_12 π_22 π_32   p(v_14) = π_01 π_12 π_22 π_31

To learn the margin of a random variable on this space is to learn about certain sums of these monomials. Note that the 8-vector of atomic probabilities is constrained to lie in a 4- (rather than 7-) dimensional space. Unlike in the BN case, the generating monomials need not be multilinear or homogeneous: in the above they range from degree 2 to degree 4.
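The monomial structure above is easy to compute mechanically. A minimal Python sketch: only the path/edge structure is taken from the slide; the numerical edge-probability values are my own illustrative choices, not from the talk.

```python
from functools import reduce

# Illustrative edge probabilities (hypothetical values; each floret sums to 1).
pi = {"01": 0.6, "02": 0.4,    # floret of w_0
      "11": 0.7, "12": 0.3,    # stage containing w_1 and w_2
      "21": 0.2, "22": 0.8,    # floret of w_3
      "31": 0.55, "32": 0.45}  # floret of w_4

# Root-to-sink paths as sequences of edge labels, as listed on the slide.
paths = {
    "v5":  ["02", "21"],             "v6":  ["02", "11"],
    "v7":  ["01", "11", "21"],       "v10": ["01", "12", "21"],
    "v11": ["01", "11", "22", "31"], "v12": ["01", "11", "22", "32"],
    "v13": ["01", "12", "22", "32"], "v14": ["01", "12", "22", "31"],
}

def atom_prob(path):
    """Multiply the edge probabilities along one root-to-sink path."""
    return reduce(lambda acc, edge: acc * pi[edge], path, 1.0)

atoms = {leaf: atom_prob(p) for leaf, p in paths.items()}

# The generating monomials are not homogeneous: degrees range from 2 to 4.
degrees = {len(p) for p in paths.values()}

# The survive-unharmed polynomial quoted later in the talk: pi_02 + pi_01 pi_22 pi_31.
survive = pi["02"] + pi["01"] * pi["22"] * pi["31"]
```

A margin of a random variable on this space is then just a sum of selected entries of `atoms`, i.e. the value of a polynomial in the π's.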

Manifest polynomials, identifiability and independence

Conditional independences appear as usual in terms of factorization. Thus

  π_21⁻¹ (p(v_5), p(v_10), p(v_14), p(v_13)) = π_11⁻¹ (p(v_6), p(v_7), p(v_11), p(v_12)) = (π_02, π_01 π_21, π_01 π_22 π_31, π_01 π_22 π_32)

so that, under the appropriate identification of events (as can be read from the CEG), X ⊥⊥ rest.

Now suppose we learn the distribution of a variable determining whether or not the organism survives unharmed. This probability is simply the value of a polynomial: π_02 + π_01 π_22 π_31. The flats of this polynomial within the model space above define the conditioned model space.

Causal Bayesian Networks

Recall that for causal BNs:

- Variables not downstream of X, a manipulated node, are unaffected by the manipulation.
- X is set to the manipulated value x̂ with probability 1.
- The effect on downstream variables is identical to ordinary conditioning.

But many manipulations don't follow these rules, e.g. "whenever a unit is in a set A of positions, take it to another position B". Recently the algebraic formulation of causal models has been studied.

Causal CEGs

This can be implemented on a CEG by making paths through a position w pass along a designated edge to a designated position w′, retaining all other floret distributions elsewhere. Similarly to Bayesian Networks:

- Probabilities of edges not after w are unchanged.
- An edge from w to w′ forces w′ after w.
- Downstream probabilities after w′ are unchanged.

Generalizations of Pearl's Backdoor Theorem can be proven: Thwaites et al (2010). These use the topology of the CEG to determine when the Bayes estimate of the effect of a manipulation is consistent, given partially observed data from the corresponding unmanipulated CEG.

An example of a causal CEG

Control: we plan to ensure the gene is up-regulated if in a hostile environment.

[Figure: the gene CEG with the hostile "up" edge out of w_1 set to probability 1; the other florets retain their probabilities π_01, π_02, π_21, π_22, π_31, π_32.]

Now the probability in the controlled system of being up-regulated and dying is

  p̂(v_7) = π_01 π_21,

whereas in the uncontrolled system it is

  p(v_7) = π_01 π_11 π_21.

Often probabilities in a controlled system have a degree lower than in the unmanipulated system.

An example of a causal CEG

Here we plan to ensure the gene is up-regulated if in a hostile environment. The root-to-sink probabilities become:

  p̂(v_5) = π_02 π_21       p̂(v_6) = π_02 π_11
  p̂(v_7) = π_01 π_21       p̂(v_10) = 0
  p̂(v_11) = π_01 π_22 π_31   p̂(v_12) = π_01 π_22 π_32
  p̂(v_13) = 0              p̂(v_14) = 0

Now suppose we want to estimate the probability of surviving unharmed if we force the gene to be up-regulated when it is in a hostile environment. The probability that this occurs is

  p̂(v_5) + p̂(v_6) + p̂(v_11) = π_02 + π_01 π_22 π_31.

Note that this is the probability it would have survived if we did not control the environment. So we can identify this probability by observing the idle system.
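A sketch of this manipulation in Python. The monomials are taken from the slides; the numerical edge probabilities are hypothetical values of my own. Forcing the hostile "up" edge prunes the "down" paths (their atoms become 0) and drops the π_11 factor from the manipulated monomials, lowering their degree by one.

```python
# Illustrative edge probabilities (hypothetical values; each floret sums to 1).
pi = {"01": 0.6, "02": 0.4, "11": 0.7, "12": 0.3,
      "21": 0.2, "22": 0.8, "31": 0.55, "32": 0.45}

# Idle (unmanipulated) atom for up-regulated-and-dies: degree-3 monomial.
p_idle_v7 = pi["01"] * pi["11"] * pi["21"]

# Manipulated root-to-sink probabilities, as listed on the slide: the hostile
# 'up' edge is forced (pi_11 -> 1 on that floret), so monomials on hostile
# paths lose their pi_11 factor and pruned 'down' paths get probability 0.
p_hat = {
    "v5": pi["02"] * pi["21"],
    "v6": pi["02"] * pi["11"],
    "v7": pi["01"] * pi["21"],            # degree 3 -> degree 2
    "v10": 0.0,
    "v11": pi["01"] * pi["22"] * pi["31"],
    "v12": pi["01"] * pi["22"] * pi["32"],
    "v13": 0.0,
    "v14": 0.0,
}
```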

Concluding Remarks

The geometry of moderate-dimensional polynomials is relatively well understood but only now being exploited to understand discrete statistical graphical models. In large problems, identifiability of functions of parameters (like causal functions), as well as the inverse image of these observed functions when there is no identifiability, are giving valuable insight into the reliability of estimation techniques and the robustness of statistical inferences, especially in a Bayesian domain. We can see where the problems are!

Two challenges in exploiting algebraic geometry for discrete statistical modeling are:

1. Many results in algebraic geometry apply to polynomials over the complex field.
2. Probabilities are positive and sum to one.

The first issue demands a semi-algebraic rather than algebraic description.

THANK YOU FOR YOUR ATTENTION!!

Selected References of mine

- Zwiernik, P. and Smith, J.Q. (2010) "Tree-cumulants and the identifiability of Bayesian tree models", CRiSM Research Report (submitted)
- Zwiernik, P. and Smith, J.Q. (2010) "The Geometry of Conditional Independence Tree Models with Hidden Variables", CRiSM Research Report (submitted)
- Thwaites, P., Riccomagno, E.M. and Smith, J.Q. (2010) "Causal Analysis with Chain Event Graphs", Artificial Intelligence, 174, 889-909
- Thwaites, P., Smith, J.Q. and Cowell, R. (2008) "Propagation using Chain Event Graphs", Proceedings of the 24th Conference in UAI, Editors D. McAllester and P. Myllymaki, 546-553
- Smith, J.Q. and Anderson, P.E. (2008) "Conditional independence and Chain Event Graphs", Artificial Intelligence, 172, 1, 42-68

Formal definitions of stages and positions

Two nodes v, v′ are in the same stage u exactly when X(v), X(v′) have the same distribution under a bijection ψ_u(v, v′), where

  ψ_u(v, v′) : X(v) = E(F(v, T)) → X(v′) = E(F(v′, T)).

In other words, the two nodes have identical probability distributions on their edges.

Two nodes v, v′ are in the same position w exactly when there exists a bijection φ_w(v, v′) from Λ(v, T), the set of paths in the tree from v to a leaf node, to Λ(v′, T), the set of paths from v′ to a leaf node, such that all edges in all the paths are coloured, and the sequence of colours in any path is the same as that in the path under the bijection.

Formal definition of a staged tree

A staged tree is a tree with stage set L(T) and edges coloured as follows:

- When v ∈ u ∈ L(T), but u contains only one node, all edges emanating from v are left uncoloured.
- When u contains more than one node, all edges emanating from v are coloured, such that an edge e emanating from v and an edge e′ emanating from v′ have the same colour if and only if ψ_u(e) = e′.

Formal Definition of a Probability Graph

The probability graph of a staged tree is a directed graph, possibly with some coloured edges. Each node represents a set of nodes from the probability tree that are in the same position in the staged tree. Its edges are constructed as follows:

- For each position w, choose a representative node v(w).
- For each edge from v(w) to a node v′, construct a single edge e(w, w′), where w′ = w_∞ if v′ is a leaf node in the tree; otherwise w′ is the position of v′. The colour of the edge is the colour of the edge between v(w) and v′.

So the number of edges emanating from each position in the probability graph is the same as in the staged tree.

A formal definition of the CEG

The chain event graph is the mixed graph with:

- the same nodes as the probability graph;
- the same directed edges as the probability graph; and
- undirected edges drawn between different positions that are in the same stage.

The colours of the edges are also inherited from the probability graph.

Conjugate Bayesian Inference on CEGs

Because the likelihood separates, the class of regular CEGs admits simple conjugate learning. Explicitly, the likelihood under complete random sampling is given by

  l(π) = ∏_{u ∈ U} l_u(π_u),   l_u(π_u) = ∏_i π_{i,u}^{x(i,u)},

where x(i, u) is the number of units entering stage u and proceeding along the edge labelled (i, u), and Σ_i π_{i,u} = 1.

Independent Dirichlet priors D(α(u)) on the vectors π_u lead to independent Dirichlet D(α*(u)) posteriors, where

  α*(i, u) = α(i, u) + x(i, u).
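The conjugate update is one line per stage: add the edge counts to the prior Dirichlet vector. A minimal Python sketch; the stage labels and counts below are hypothetical, not from the talk's data.

```python
# One Dirichlet prior vector alpha(u) per stage u (hypothetical stages/values).
prior = {"u0": [1.0, 1.0], "u1": [1.0, 1.0], "u2": [1.0, 1.0], "u3": [1.0, 1.0]}

# x(i, u): units entering stage u and proceeding along edge i (hypothetical).
counts = {"u0": [60, 40], "u1": [70, 30], "u2": [20, 80], "u3": [55, 45]}

def posterior(prior, counts):
    """alpha*(i, u) = alpha(i, u) + x(i, u), independently for each stage u."""
    return {u: [a + x for a, x in zip(alphas, counts[u])]
            for u, alphas in prior.items()}

post = posterior(prior, counts)
print(post["u0"])  # [61.0, 41.0]
```

Because the stages are a priori independent, each posterior vector is obtained without reference to any other stage, which is what makes propagation and scoring fast.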

Conjugate Bayesian Inference on CEGs

- Prior stage-floret independence is a generalisation of local and global independence in BNs.
- Just as in Geiger and Heckerman (1997), floret independence, together with appropriate Markov equivalence, characterises this product Dirichlet prior (see Freeman and Smith, 2009).
- Just as for BNs, non-ancestral sampling of CEG data destroys conjugacy, but inference is no more difficult than for a BN.

Learning the topology of a CEG

Choosing appropriate priors on model space and modular parameter priors over CEGs, for any CEG the log marginal likelihood score is linear in stage components. Explicitly, for α = (α_1, ..., α_k), let

  s(α) = log Γ(Σ_{i=1}^k α_i)  and  t(α) = Σ_{i=1}^k log Γ(α_i).

Then

  Ψ(C) = log p(C) = Σ_{u ∈ C} Ψ_u(C),
  Ψ_u(C) = s(α(u)) − s(α*(u)) + t(α*(u)) − t(α(u)).

Conjugacy and linearity imply that e.g. MAP model selection using AHC or weighted MAX-SAT is simple and fast over the (albeit vast) class of CEGs (see Freeman and Smith, 2009).
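The stage score is a direct translation of the formula above into log-gamma calls. A Python sketch, assuming (as the derivation suggests) that α(u) is the prior Dirichlet vector of a stage and α*(u) = α(u) + x(u) its posterior; the counts here are hypothetical.

```python
import math

def s(alpha):
    """s(alpha) = log Gamma(sum_i alpha_i)."""
    return math.lgamma(sum(alpha))

def t(alpha):
    """t(alpha) = sum_i log Gamma(alpha_i)."""
    return sum(math.lgamma(a) for a in alpha)

def stage_score(alpha, x):
    """Psi_u(C) = s(alpha) - s(alpha*) + t(alpha*) - t(alpha),
    with alpha* = alpha + x, the log marginal likelihood of one stage."""
    a_star = [a + xi for a, xi in zip(alpha, x)]
    return s(alpha) - s(a_star) + t(a_star) - t(alpha)

# The total score of a CEG is the sum of its independent stage components.
score = stage_score([1.0, 1.0], [7, 3]) + stage_score([1.0, 1.0], [2, 8])
```

The linearity in stage components is what lets search procedures (AHC, weighted MAX-SAT) re-score a candidate CEG by touching only the stages that changed.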

Challenges in searching for a CEG

The CEG search space for substantive problems is huge (orders of magnitude larger than for BNs). However, the rationale behind the CEG helps in the design of intelligent search procedures:

- Often contextual information describing how things happen reduces the size of the space and makes methods feasible. E.g. in the educational example (Freeman and Smith, 2009) we only need to search over CEGs consistent with the order of courses (the event tree).
- CEGs can also be used to embellish BN search. The score function above exactly corresponds to the Bayes Factor score for BNs. So: search BN tree space → search the BN space associated with the best partial order → search CEG embellishments.
- Without strong information, sparse tables seem to be combined to give MAP models with simple but unusual structure.

A CEG which extends a BN

[Figure: a BN on X, Y, Z is transformed into a CEG running from w_0 to w_∞; the context-specific BN, and its CEG, fits much better.]

(The distribution of Z is the same whether or not X takes a medium or large value.)