STAT 598L Probabilistic Graphical Models Instructor: Sergey Kirshner Bayesian Networks
Representing Joint Probability Distributions: a full joint table over n binary variables has 2^n − 1 free parameters.
Reducing Number of Parameters: Conditional Independence. Conditional independence can reduce the number of parameters: in a three-variable network over Z, X, Y (all binary), the root needs 1 parameter and each of the two conditionals needs 2 parameters, for 1 + 2 + 2 = 5 free parameters instead of 2^3 − 1 = 7.
Example: Naïve Bayes. Class variable L (like/dislike) with conditionally independent features A (ambience), P (price), F (food), E (ethnic).
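The Naïve Bayes factorization above can be sketched numerically. The CPT values below are made up for illustration; only the structure (class L with conditionally independent features A, P, F, E) comes from the slide.

```python
# Restaurant Naive Bayes sketch: P(L,A,P,F,E) = P(L) P(A|L) P(P|L) P(F|L) P(E|L)
# All CPT numbers are illustrative, not from the lecture.

p_L = {1: 0.6, 0: 0.4}                              # P(L)
p_A = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.4, 0: 0.6}}    # P(A|L), indexed p_A[L][A]
p_P = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.8, 0: 0.2}}    # P(P|L)
p_F = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.3, 0: 0.7}}    # P(F|L)
p_E = {1: {1: 0.4, 0: 0.6}, 0: {1: 0.4, 0: 0.6}}    # P(E|L)

def joint(l, a, p, f, e):
    # chain-rule factorization with the Naive Bayes independence assumptions
    return p_L[l] * p_A[l][a] * p_P[l][p] * p_F[l][f] * p_E[l][e]

# The factorization needs 1 + 4*2 = 9 free parameters instead of 2^5 - 1 = 31.
total = sum(joint(l, a, p, f, e)
            for l in (0, 1) for a in (0, 1) for p in (0, 1)
            for f in (0, 1) for e in (0, 1))
print(round(total, 10))  # a valid joint distribution sums to 1
```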
Structure. The graph is both a skeleton for the factorization of a joint distribution and a representation of a set of conditional independence relations. The two views are the same.
Causal Interpretation
With Probability Tables
Marginals
Causal Reasoning
Evidential Reasoning
Explaining Away
Bayesian Network Model. A graphical way to describe a particular chain-rule decomposition of the joint distribution: parents in the graph are the conditioning variables, and the factors are conditional probability distributions.
Conditional Independence Assumptions What conditional independence assumptions are made?
Representation Theorem. Local Markov assumption: each variable is independent of its non-descendants given its parents. The theorem: this holds if and only if P factorizes according to G.
Representation Theorem. G is an I-map for P: every independence encoded by G holds in P. P could potentially exhibit more CI relations.
Representation Theorem (example graph over X_1, ..., X_5). G is an I-map for P; P could potentially exhibit more CI relations. Is there an I-map graph for every P?
Representation Theorem. If each variable is independent of its non-descendants given its parents, then P factorizes according to G. Proof: 1. Assume a topological order (or reorder). 2. Expand P by the chain rule in that order. 3. In a topological order each variable's predecessors are non-descendants, so the local Markov assumption reduces each factor to a conditional on the parents only. 4. The resulting product of CPDs is the factorization according to G.
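The chain-rule step of the proof, written out (using Pa_{X_i} for the parents of X_i):

```latex
P(X_1,\dots,X_n)
  = \prod_{i=1}^{n} P(X_i \mid X_1,\dots,X_{i-1})  % chain rule in topological order
  = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})  % X_1,\dots,X_{i-1} are non-descendants
                                                   % of X_i, so the local Markov assumption
                                                   % drops all of them except the parents
```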
Representation Theorem (converse). If P factorizes according to G, then each variable is independent of its non-descendants given its parents. Homework!
Bayesian Network = Bayesian network structure (a DAG; conditioning variables = parents) + Bayesian network parameters (conditional probability tables). Joint distribution as a chain rule of conditional probabilities.
Dimensionality Reduction. Full probability table: O(2^n) free parameters. Bayesian network (BN): O(n·2^k) free parameters (assuming at most k parents per node).
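The counts above are easy to check for binary variables (the 2^k factor assumes each of the up-to-k parents is also binary):

```python
# Free-parameter counts for n binary variables.

def full_table_params(n):
    # one probability per joint configuration, minus the sum-to-one constraint
    return 2 ** n - 1

def bn_params(n, k):
    # each node: at most 2^k parent configurations, 1 free parameter for each
    return n * 2 ** k

print(full_table_params(20))  # 1048575
print(bn_params(20, 3))       # 160: linear in n for bounded fan-in
```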
Representation Theorem. Bayesian network structure (DAG: conditioning variables = parents; joint distribution as a chain rule of conditional probabilities) corresponds to a set of independencies (local Markov assumption); there can be more independencies in P.
Finding Conditional Independences (example graph over A, B, C, D, E, F, G, H, I, J).
Is A ⊥ E?
Is A ⊥ E | B?
Is A ⊥ E | B, G?
How do we convert local Markov properties into conditional independencies?
Simple Case: X and Y with a direct connection. Is it possible to find Z so that X ⊥ Y | Z? Not always, e.g., when Y is a deterministic function of X. Verdict: dependent; the edge is active in the flow of influence (active = dependence).
Independencies for Three Variables (Z unobserved):
indirect causal effect X → Z → Y: active
indirect evidential effect X ← Z ← Y: active
common cause X ← Z → Y: active
common effect X → Z ← Y: blocked
Independencies for Three Variables (Z observed):
indirect causal effect X → Z → Y: blocked
indirect evidential effect X ← Z ← Y: blocked
common cause X ← Z → Y: blocked
common effect X → Z ← Y: active
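The common-effect case (the one that behaves differently from the others) can be verified numerically. The CPT values below are illustrative, not from the slides:

```python
# Common effect (v-structure) X -> Z <- Y: X and Y are marginally
# independent, but conditioning on Z activates the trail.
from itertools import product

pX = {0: 0.5, 1: 0.5}
pY = {0: 0.5, 1: 0.5}
# P(Z=1 | X, Y): Z tends to be 1 when X and Y agree (illustrative numbers)
pZ = {(x, y): {1: 0.9 if x == y else 0.1} for x, y in product((0, 1), repeat=2)}
for xy in pZ:
    pZ[xy][0] = 1 - pZ[xy][1]

def joint(x, y, z):
    return pX[x] * pY[y] * pZ[(x, y)][z]

# Marginally: P(X=1, Y=1) equals P(X=1) * P(Y=1)
pxy = sum(joint(1, 1, z) for z in (0, 1))
print(abs(pxy - 0.25) < 1e-12)       # True: X and Y independent

# Given Z=1: P(X=1 | Z=1) differs from P(X=1 | Y=1, Z=1)
pz1 = sum(joint(x, y, 1) for x in (0, 1) for y in (0, 1))
px_given_z = sum(joint(1, y, 1) for y in (0, 1)) / pz1
px_given_yz = joint(1, 1, 1) / sum(joint(x, 1, 1) for x in (0, 1))
print(px_given_z != px_given_yz)     # True: dependent given Z
```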
Trails A B E F H I C G J D Trail = undirected path
Active Trails. Active trail = all consecutive triples are active. A v-structure triple is active when its middle node or one of its descendants is among the conditioning nodes; any other triple is active when its middle node is not a conditioning node.
Finding Active Trails. Algorithm 3.1 in the textbook: find the ancestors of the evidence nodes (to test v-structures), then run breadth-first search (a bit tricky, as both up and down directions have to be considered). Alternatively, play Bayes-Ball ("The Rational Pastime").
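A sketch of the ancestors-plus-BFS idea from Algorithm 3.1. The dict-of-parents graph representation and function names are my own; the algorithm is the standard reachability pass that tracks the direction of travel through each node (assumes the query nodes x and y are not in the evidence set z):

```python
# d-separation by searching for active trails: ancestor pass, then a BFS
# over (node, direction) pairs.
from collections import deque

def d_separated(parents, x, y, z):
    """True iff x and y are d-separated given evidence set z.
    parents: dict mapping each node to the set of its parents."""
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)

    # z plus all ancestors of z: a collider is opened by evidence at the
    # collider itself or at any of its descendants
    opened = set(z)
    stack = list(z)
    while stack:
        n = stack.pop()
        for p in parents[n]:
            if p not in opened:
                opened.add(p)
                stack.append(p)

    visited = set()
    queue = deque([(x, 'up')])   # start as if arriving from a child
    while queue:
        n, direction = queue.popleft()
        if (n, direction) in visited:
            continue
        visited.add((n, direction))
        if n == y:
            return False         # found an active trail to y
        if direction == 'up' and n not in z:
            for p in parents[n]:     # chain continuing upward
                queue.append((p, 'up'))
            for c in children[n]:    # common cause: turn downward
                queue.append((c, 'down'))
        elif direction == 'down':
            if n not in z:
                for c in children[n]:    # chain continuing downward
                    queue.append((c, 'down'))
            if n in opened:
                for p in parents[n]:     # collider opened by evidence
                    queue.append((p, 'up'))
    return True

# Chain A -> B -> C: observing B blocks the trail.
chain = {'A': set(), 'B': {'A'}, 'C': {'B'}}
print(d_separated(chain, 'A', 'C', set()))   # False
print(d_separated(chain, 'A', 'C', {'B'}))   # True
```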
http://ai.stanford.edu/~paskin/gm-short-course/lec2.pdf
Direction-dependent Separation. X and Y are d-separated given Z = there are no active trails between nodes in X and nodes in Y given the nodes in Z. Denoted d-sep_G(X; Y | Z). We will examine the set of independencies induced by d-separation. How can we formally tie d-separation and conditional independence?
Soundness of d-separation. For all P that factorize according to G (G is a BN structure for P): d-separation in G implies conditional independence in P, i.e., G is an I-map for P. Will prove later in the course.
Completeness of d-separation. What would be a good converse? Faithfulness: conditional independence in P implies d-separation in G. Does not hold! We relax the statement.
Completeness of d-separation. Relaxing the statement: instead of requiring the converse for every P, require it only for some P that factorizes according to G; by contraposition, the relaxed statement says that a failure of d-separation implies dependence in some such P.
Completeness of d-separation. Interpreting the statement: an active trail between X and Y given Z implies that X and Y are dependent given Z in some P that factorizes according to G. Sketch of proof (by construction): make all CPDs not on the trail uniform (this effectively removes the nodes and arrows not on the trail), then make the remaining dependencies deterministic (e.g., XOR).
More General Result: soundness, and completeness (almost). Intuition: for two binary variables X and Y, the space of conditional distributions (P(Y=1|X=0), P(Y=1|X=1)) is 2-dimensional, and independence (X ⊥ Y) holds only on a 1-dimensional curve inside it; dependence (X → Y) fills the rest.
Dimensionality Reduction (same picture: the 1-d independence curve inside the 2-d square of conditional distributions). Key for dimensionality reduction: finding a parametrization on such a low-dimensional manifold.
Equivalent Structures. Is the DAG structure unique for a set of conditional independencies? No: X → Z → Y, X ← Z ← Y, and X ← Z → Y all encode the same independence (X ⊥ Y | Z). What makes DAGs equivalent for CI modeling?
More Formally: I-Equivalence. I-equivalence: I(G_1) = I(G_2), where I(G) is the set of conditional independencies implied by G. What makes graphs I-equivalent?
Skeleton: the undirected graph obtained from a DAG by dropping edge directions (example graph over A, ..., J shown directed and as its skeleton).
Structure Equivalence. Is the skeleton enough? No, because of v-structures: X → Z ← Y has the same skeleton as the chains but different independencies. What if we add v-structures? skeleton(G_1) = skeleton(G_2) and v-structures(G_1) = v-structures(G_2) imply I(G_1) = I(G_2). Converse? No, e.g., two different complete DAGs have the same (empty) set of independencies but different v-structures.
Structure Equivalence. Still not a full characterization; refine the last piece. Immorality: X → Z ← Y with no edge between X and Y. Theorem: skeleton(G_1) = skeleton(G_2) and immoralities(G_1) = immoralities(G_2) if and only if I(G_1) = I(G_2).
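The characterization above translates directly into a test. A sketch (graphs as dicts of parent sets, my own representation):

```python
# I-equivalence test: same skeleton and same immoralities
# (v-structures X -> Z <- Y with no edge between X and Y).
from itertools import combinations

def skeleton(parents):
    # undirected edges as frozensets
    return {frozenset((p, n)) for n, ps in parents.items() for p in ps}

def immoralities(parents):
    skel = skeleton(parents)
    return {(frozenset((a, b)), z)
            for z, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in skel}   # unmarried parent pair

def i_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# X -> Z -> Y and X <- Z <- Y encode the same independencies,
# but the v-structure X -> Z <- Y does not.
chain1 = {'X': set(), 'Z': {'X'}, 'Y': {'Z'}}
chain2 = {'Y': set(), 'Z': {'Y'}, 'X': {'Z'}}
vstruct = {'X': set(), 'Y': set(), 'Z': {'X', 'Y'}}
print(i_equivalent(chain1, chain2))   # True
print(i_equivalent(chain1, vstruct))  # False: same skeleton, extra immorality
```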
How To Construct G? (Example over variables A, B, C, D, E, G.)
What Can We Reconstruct? Ideally, want G such that I(G) = I. Suppose such a graph exists: it may not be unique, and there are n! orderings to search over. We may not be able to find such a G at all. Will settle for G an I-map for I, i.e., I(G) ⊆ I, but with no extra edges (dependencies): removing edges can only add independencies. G is a minimal I-map = removing any edge in G adds independencies not in I.
How To Construct a Minimal I-Map? Assume an ordering (X_1, ..., X_n) is given. For each node X_i, find the smallest subset U ⊆ {X_1, ..., X_{i-1}} so that X_i ⊥ ({X_1, ..., X_{i-1}} \ U) | U, then add edges from the nodes in U to X_i.
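A sketch of this construction, assuming access to an independence oracle for P. The oracle interface and the toy Markov-chain ground truth below are my own illustration, not from the lecture:

```python
# Minimal I-map construction: for each node in order, pick the smallest
# predecessor subset U that renders it independent of the remaining predecessors.
from itertools import combinations

def minimal_i_map(order, indep):
    """indep(x, others, given) -> True iff x is independent of `others`
    given `given` in the target distribution P (an oracle)."""
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        for size in range(len(preds) + 1):   # smallest subsets first
            found = None
            for u in combinations(preds, size):
                rest = [p for p in preds if p not in u]
                if indep(x, rest, set(u)):
                    found = set(u)
                    break
            if found is not None:
                parents[x] = found
                break
    return parents

# Toy oracle: ground-truth CIs of the chain A -> B -> C -> D, where x and r
# are independent given Z iff some node strictly between them is observed.
chain = ['A', 'B', 'C', 'D']

def chain_indep(x, rest, given):
    xi = chain.index(x)
    for r in rest:
        ri = chain.index(r)
        lo, hi = min(xi, ri), max(xi, ri)
        if not any(chain[j] in given for j in range(lo + 1, hi)):
            return False
    return True

print(minimal_i_map(chain, chain_indep))
# {'A': set(), 'B': {'A'}, 'C': {'B'}, 'D': {'C'}}: the chain is recovered
```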
Example of Finding I-Maps. Variables A, B, C, D, E; order = (A, B, E, C, D). What if a different order is chosen?
Summary. Bayesian Network = DAG + CPDs. The distribution factorizes according to the graph (chain-rule decomposition) iff it satisfies the local independencies of the graph iff it satisfies the global independencies of the graph (d-separation). D-separation precisely characterizes the independencies in the distribution (almost). Converts a high-dimensional real-valued space (distributions) to a discrete space (graphs). However, the graph may not be unique, and may not be able to capture the exact set of independencies.