Conditional Independence and Factorization


Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

Graphical Models: Efficient Representation

Consider a set of (discrete) random variables $\{X_1,\ldots,X_n\}$, where each $X_i$ takes one of $r$ different values $x_i$. A direct representation of $p(x_1,\ldots,x_n)$ requires an $n$-dimensional table of $r^n$ values, one for each of the $r^n$ possible joint configurations of $X_1,\ldots,X_n$.

Graphical models provide an efficient means of representing joint probability distributions over many random variables in situations where each random variable is conditionally independent of all but a handful of the others.

A graphical model can be thought of as a probabilistic database: a machine that can answer queries regarding the values of sets of random variables. We build up the database in pieces, using probability theory to ensure that the pieces have a consistent overall interpretation. Probability theory also justifies the inferential machinery that allows the pieces to be put together on the fly to answer queries.

The chain rule gives

$$p(x_1,x_2,x_3,x_4) = p(x_4 \mid x_3,x_2,x_1)\, p(x_3 \mid x_2,x_1)\, p(x_2 \mid x_1)\, p(x_1).$$

[Figure: two directed graphs over $X_1,\ldots,X_4$: the fully connected GM implied by the chain rule, and the sparser GM of a Markov chain.]

Outline: Directed Graphical Models

- Directed graphs and joint probabilities
- Conditional independence and d-separation
- Three canonical graphs
- Bayes ball algorithm
- Characterization of directed graphical models

Notation

Given a set of random variables $\{X_1,\ldots,X_n\}$, let $x_i$ denote a realization of the random variable $X_i$. The probability mass function $p(x_1,\ldots,x_n)$ is defined as $P(X_1 = x_1,\ldots,X_n = x_n)$. We use $X$ to stand for $\{X_1,\ldots,X_n\}$ and $x$ to stand for $\{x_1,\ldots,x_n\}$. $X_A$ denotes the subset indexed by $A$; for example, $X_A = \{X_1, X_2\}$ if $A = \{1,2\}$.

Directed Graphs

A directed graph is a pair $G = (V, E)$, where $V$ is a set of nodes (vertices) and $E$ is a set of oriented edges. We assume that $G$ is acyclic.

- Nodes are associated with random variables, with a one-to-one mapping from nodes to random variables; $\pi_i$ denotes the set of parents of node $i$.
- Edges represent conditional dependence.

Conditional Independence

$X_A$ and $X_B$ are independent, written $X_A \perp X_B$, if $p(x_A, x_B) = p(x_A)\, p(x_B)$.

$X_A$ and $X_C$ are conditionally independent given $X_B$, written $X_A \perp X_C \mid X_B$, if

$$p(x_A, x_C \mid x_B) = p(x_A \mid x_B)\, p(x_C \mid x_B),$$

or equivalently $p(x_A \mid x_B, x_C) = p(x_A \mid x_B)$, for all $x_B$ such that $p(x_B) > 0$.
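As a concrete illustration (added here, not part of the original slides), the sketch below checks the defining identity numerically on a small joint table; the particular distribution is invented for the example and built so that $A \perp C \mid B$ holds by construction.

```python
import itertools

import numpy as np

# A made-up joint p(a, b, c) over three binary variables, constructed so that
# A and C are conditionally independent given B: p(a, b, c) = p(b) p(a|b) p(c|b).
p_b = np.array([0.4, 0.6])
p_a_given_b = np.array([[0.7, 0.3],   # p(a | b=0)
                        [0.2, 0.8]])  # p(a | b=1)
p_c_given_b = np.array([[0.1, 0.9],
                        [0.5, 0.5]])

joint = np.zeros((2, 2, 2))  # indexed as joint[a, b, c]
for a, b, c in itertools.product(range(2), repeat=3):
    joint[a, b, c] = p_b[b] * p_a_given_b[b, a] * p_c_given_b[b, c]

def cond_indep(joint, tol=1e-12):
    """Check p(a, c | b) == p(a | b) p(c | b) for every b with p(b) > 0."""
    for b in range(joint.shape[1]):
        p_acb = joint[:, b, :]           # p(a, B=b, c)
        pb = p_acb.sum()
        if pb <= 0:
            continue
        p_ac_given_b = p_acb / pb
        p_a_given = p_ac_given_b.sum(axis=1, keepdims=True)
        p_c_given = p_ac_given_b.sum(axis=0, keepdims=True)
        if not np.allclose(p_ac_given_b, p_a_given * p_c_given, atol=tol):
            return False
    return True

print(cond_indep(joint))  # True by construction
```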

An Example of a DAG

[Figure: a DAG over $X_1,\ldots,X_6$ with edges $X_1 \to X_2$, $X_1 \to X_3$, $X_2 \to X_4$, $X_3 \to X_5$, $X_2 \to X_6$, and $X_5 \to X_6$.]

$$p(x_1,x_2,x_3,x_4,x_5,x_6) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5).$$

Factorization of Joint Probability Distributions

Use the locality defined by the parent-child relationship to construct an economical representation of joint probability distributions. Associate a function $f_i(x_i, x_{\pi_i})$ with each node $i \in V$, satisfying the properties of conditional probability distributions (nonnegativity and sum-to-one). Given a set of functions $\{f_i(x_i, x_{\pi_i}) : i \in V\}$ for $V = \{1,2,\ldots,n\}$, we define a joint probability distribution as

$$p(x_1, x_2, \ldots, x_n) \triangleq \prod_{i=1}^{n} f_i(x_i, x_{\pi_i}).$$

Given that the $f_i(x_i, x_{\pi_i})$ are conditional probabilities, we write them as $p(x_i \mid x_{\pi_i})$:

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i}).$$
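A minimal sketch of this construction (added, not from the slides) for the six-node example: the CPT entries below are random placeholders, chosen only so that each local conditional sums to one, and the final check confirms that the product of the local functions defines a valid joint distribution.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# Parent sets of the six-node example DAG (1-indexed as in the slides).
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

# Random CPTs over binary variables: cpt[i][parent values] = [p(x_i=0 | pa), p(x_i=1 | pa)].
cpt = {}
for i, pa in parents.items():
    table = {}
    for vals in itertools.product([0, 1], repeat=len(pa)):
        p0 = rng.uniform()
        table[vals] = np.array([p0, 1.0 - p0])  # sums to one by construction
    cpt[i] = table

def joint(x):
    """p(x_1,...,x_6) = prod_i p(x_i | x_{pi_i}) for a full assignment x (a dict)."""
    prob = 1.0
    for i, pa in parents.items():
        prob *= cpt[i][tuple(x[j] for j in pa)][x[i]]
    return prob

# The factorization defines a valid distribution: the 2^6 probabilities sum to one.
total = sum(joint(dict(zip(range(1, 7), vals)))
            for vals in itertools.product([0, 1], repeat=6))
print(round(total, 10))  # 1.0
```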

An Example of a DAG: Revisited

[Figure: the same six-node DAG as before.]

$$p(x_1,x_2,x_3,x_4,x_5,x_6) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5).$$

Economical Representation?

Consider $n$ discrete random variables $\{X_1,\ldots,X_n\}$, each ranging over $r$ values.

- Naive approach: an $n$-dimensional table of size $r^n$.
- PGM: for each node $X_i$, an $(m_i + 1)$-dimensional table of size $r^{m_i + 1}$, where $m_i$ is the number of parents of node $X_i$.
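To make the savings concrete (an added illustration, not from the slides), the snippet below compares the two storage costs for the six-node binary example.

```python
# Table-size comparison for the six-node binary example (r = 2).
r, n = 2, 6
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

naive = r ** n                                            # 2^6 = 64 entries
pgm = sum(r ** (len(pa) + 1) for pa in parents.values())  # 2 + 4 + 4 + 4 + 4 + 8 = 26
print(naive, pgm)
```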

Conditional Independence in a DAG

Two different factorizations of a probability distribution:

- The chain rule gives $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$.
- A DAG gives $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i})$.

Comparing these expressions, we can interpret missing variables in the local conditional probability functions as missing edges (conditional independence) in the underlying graph.

Basic Conditional Independence Statements

An ordering $I$ of the nodes in a graph $G$ is said to be topological if, for every node $i \in V$, the nodes in $\pi_i$ appear before $i$ in $I$. ($I = (1,2,3,4,5,6)$ is a topological ordering in our example.)

Let $\nu_i$ denote the set of all nodes that appear earlier than $i$ in $I$, excluding $\pi_i$ (for example, $\nu_5 = \{1,2,4\}$).

Given a topological ordering $I$, the set of basic conditional independence statements is

$$\{X_i \perp X_{\nu_i} \mid X_{\pi_i}\}, \quad i \in V.$$

These statements can be verified by algebraic calculation. Example: $X_4 \perp \{X_1, X_3\} \mid X_2$, since

$$p(x_4 \mid x_1, x_2, x_3) = \frac{p(x_1,x_2,x_3,x_4)}{p(x_1,x_2,x_3)} = \frac{p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)}{p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)} = p(x_4 \mid x_2).$$

Markov Chain

[Figure: the chain $X \to Y \to Z$, with $Y$ observed (shaded), illustrating $X \perp Z \mid Y$.]

$$p(z \mid x, y) = \frac{p(x,y,z)}{p(x,y)} = \frac{p(x)\, p(y \mid x)\, p(z \mid y)}{p(x)\, p(y \mid x)} = p(z \mid y).$$

There are no other conditional independencies. Asserted conditional independencies always hold for the family of distributions associated with a given graph; non-asserted conditional independencies sometimes fail to hold, but sometimes do hold.

Hidden Cause

[Figure: the common-cause structure $X \leftarrow Y \rightarrow Z$, with $Y$ observed (shaded), illustrating $X \perp Z \mid Y$.]

$$p(x, z \mid y) = \frac{p(y)\, p(x \mid y)\, p(z \mid y)}{p(y)} = p(x \mid y)\, p(z \mid y).$$

We do not necessarily assume that $X$ and $Z$ are (marginally) dependent.

Explaining-Away

[Figure: the v-structure $X \to Y \leftarrow Z$.]

$$p(y \mid x, z) = \frac{p(x)\, p(z)\, p(y \mid x, z)}{p(x, z)}, \qquad p(x, z) = p(x)\, p(z),$$

so $X \perp Z$ holds marginally, but $X \perp Z \mid Y$ does not hold in general: observing the common effect $Y$ couples its causes.
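A small numeric illustration (added here, not in the slides): with two independent fair bits and $Y = X \oplus Z$, $X$ and $Z$ are marginally independent but become completely dependent once $Y$ is observed.

```python
import itertools

import numpy as np

# X and Z are independent fair bits; Y = X XOR Z is a deterministic common effect.
joint = np.zeros((2, 2, 2))  # indexed as joint[x, y, z]
for x, z in itertools.product([0, 1], repeat=2):
    joint[x, x ^ z, z] = 0.25

# Marginally, p(x, z) = p(x) p(z): X and Z are independent.
p_xz = joint.sum(axis=1)
print(np.allclose(p_xz, np.outer(p_xz.sum(axis=1), p_xz.sum(axis=0))))  # True

# Conditioned on Y = 0, X determines Z exactly: they are strongly dependent.
p_xz_given_y0 = joint[:, 0, :] / joint[:, 0, :].sum()
print(p_xz_given_y0)  # mass only on (x, z) = (0, 0) and (1, 1)
```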

Bayes Ball Algorithm

We wish to decide whether a given conditional independence statement, $X_A \perp X_B \mid X_C$, is true for a directed graph $G$. Bayes ball is a reachability algorithm, i.e., a d-separation test:

- Shade the nodes in $X_C$.
- Place balls at each node in $X_A$, let them bounce around according to the rules below, and then ask whether any of the balls reach any of the nodes in $X_B$.

Bayes Ball Algorithm - Rules 1, 2

[Figure: ball-passing rules for the chain $X \to Y \to Z$ and the hidden-cause structure $X \leftarrow Y \to Z$: the ball passes through an unshaded $Y$ and is blocked by a shaded $Y$.]

Bayes Ball Algorithm - Rules 3, 4, 5

[Figure: ball-passing rules for the v-structure $X \to Y \leftarrow Z$, where the ball passes only when $Y$ is shaded, together with the boundary rules for a single edge $X \to Y$.]

Example 1

[Figure: the six-node DAG with $X_2$ shaded.]

$X_4 \perp \{X_1, X_3\} \mid X_2$ is true.

Example 2

[Figure: the six-node DAG with $X_2$ and $X_3$ shaded.]

$X_1 \perp X_6 \mid \{X_2, X_3\}$ is true.

Example 3

[Figure: the six-node DAG with $X_1$ and $X_6$ shaded.]

$X_2 \perp X_3 \mid \{X_1, X_6\}$ is not true.
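The sketch below (added, not from the slides) checks these three statements with a test equivalent to Bayes ball: restrict the DAG to the ancestral subgraph of $A \cup B \cup C$, moralize it, delete the conditioning nodes, and test reachability from $A$ to $B$.

```python
from itertools import combinations

# Parent sets of the six-node example DAG.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

def ancestors(nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    result, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(parents[v])
    return result

def d_separated(A, B, C):
    """True iff X_A is independent of X_B given X_C in the DAG (moralization test)."""
    keep = ancestors(set(A) | set(B) | set(C))
    # Moralize: connect each node to its parents, and its parents to each other.
    edges = set()
    for v in keep:
        pa = [p for p in parents[v] if p in keep]
        edges.update(frozenset((v, p)) for p in pa)
        edges.update(frozenset(pair) for pair in combinations(pa, 2))
    # Remove the conditioning nodes and test whether A can still reach B.
    blocked = set(C)
    adj = {v: set() for v in keep if v not in blocked}
    for e in edges:
        u, w = tuple(e)
        if u not in blocked and w not in blocked:
            adj[u].add(w)
            adj[w].add(u)
    frontier = [v for v in A if v not in blocked]
    seen = set(frontier)
    while frontier:
        v = frontier.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return not (seen & set(B))

print(d_separated({4}, {1, 3}, {2}))  # True
print(d_separated({1}, {6}, {2, 3}))  # True
print(d_separated({2}, {3}, {1, 6}))  # False
```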

Characterization of Directed Graphical Models

A graphical model is associated with a family of probability distributions, and this family can be characterized in two equivalent ways:

- $\mathcal{D}_1 = \{\, p(x) : p(x) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i}) \,\}$.
- $\mathcal{D}_2 = \{\, p(x) : X_i \perp X_{\nu_i} \mid X_{\pi_i} \text{ for all } i \in V \,\}$ (the family of probability distributions associated with $G$ that includes all $p(x_V)$ satisfying every basic conditional independence statement associated with $G$).

$\mathcal{D}_1 = \mathcal{D}_2$ (details will be discussed later). This provides a strong and important link between graph theory and probability theory.

Markov Blanket

Markov blanket (a term coined by J. Pearl): $V$ is a Markov blanket for $X$ iff $X \perp Y \mid V$ for all $Y \notin V$.

In a DAG, the Markov blanket of node $i$ consists of its parents, its children, and its children's other parents (co-parents). The Markov blanket identifies all the variables that shield the node from the rest of the network; it is the only knowledge needed to predict the behaviour of that node.
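A short sketch for the six-node example (added, not from the slides), reading the blanket directly off the parent sets used above.

```python
# Markov blanket of a node in a DAG: its parents, its children, and the
# children's other parents (co-parents). Six-node example, 1-indexed.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

def markov_blanket(i):
    children = [c for c, pa in parents.items() if i in pa]
    coparents = {p for c in children for p in parents[c]}
    return (set(parents[i]) | set(children) | coparents) - {i}

print(markov_blanket(2))  # {1, 4, 5, 6}: parent 1, children 4 and 6, co-parent 5
print(markov_blanket(3))  # {1, 5}
```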


Outline: Undirected Graphical Models

- Markov networks
- Clique potentials
- Characterization of undirected graphical models

Markov Networks (Undirected Graphical Models)

[Figure: an undirected graph over $X_1,\ldots,X_5$.]

Examples of Markov networks: Boltzmann machines, Markov random fields.

Semantics: every node is conditionally independent of its non-neighbors, given its neighbors.

Markov blanket: $V$ is a Markov blanket for $X$ iff $X \perp Y \mid V$ for all $Y \notin V$. The Markov boundary is the minimal Markov blanket.

Boltzmann Machines

A Boltzmann machine is a Markov network over a vector of binary variables, $x_i \in \{0, 1\}$, where some variables may be hidden ($x^H$) and some may be observed (visible, $x^V$).

Learning for Boltzmann machines involves adjusting the weights $W$ so that the generative model

$$p(x \mid W) = \frac{1}{Z} \exp\left\{ \tfrac{1}{2} x^\top W x \right\}$$

is well matched to a set of examples $\{x^{(t)}\}$, $t = 1,\ldots,N$. The learning is done via EM!
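A tiny brute-force sketch of this model (added, not from the slides): the weight matrix is an arbitrary symmetric example, and $Z$ is computed by exhaustive enumeration, which is feasible only for very small models.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# A small Boltzmann machine over 4 binary units with an arbitrary symmetric
# weight matrix (zero diagonal).
n = 4
W = rng.normal(size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

def unnorm(x):
    """Unnormalized probability exp(0.5 x' W x)."""
    return np.exp(0.5 * x @ W @ x)

states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]
Z = sum(unnorm(x) for x in states)  # brute-force partition function

def p(x):
    """p(x | W) = exp(0.5 x' W x) / Z."""
    return unnorm(np.asarray(x, dtype=float)) / Z

print(p([1, 0, 1, 1]))
print(round(sum(p(s) for s in states), 10))  # 1.0
```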

Conditional Independence

For undirected graphs, the conditional independence semantics is the more intuitive one: if every path from a node in $X_A$ to a node in $X_C$ includes at least one node in $X_B$, then $X_A \perp X_C \mid X_B$.

[Figure: $X_B$ separates $X_A$ from $X_C$ in the graph.]
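A minimal separation check (added, not from the slides): remove the nodes in $X_B$ and test whether any node of $X_A$ can still reach a node of $X_C$; the adjacency list is an arbitrary small example graph.

```python
# Graph separation in an undirected graph: X_A is independent of X_C given X_B
# if removing B disconnects A from C. The adjacency list is an arbitrary example.
adj = {
    1: {2, 3},
    2: {1, 4, 6},
    3: {1, 5},
    4: {2},
    5: {3, 6},
    6: {2, 5},
}

def separated(A, C, B):
    """True iff every path from A to C passes through B."""
    frontier = [a for a in A if a not in B]
    seen = set(frontier)
    while frontier:
        v = frontier.pop()
        if v in C:
            return False
        for w in adj[v]:
            if w not in B and w not in seen:
                seen.add(w)
                frontier.append(w)
    return True

print(separated({1}, {6}, {2, 5}))  # True: removing 2 and 5 blocks every path
print(separated({1}, {6}, {2}))     # False: 1 - 3 - 5 - 6 avoids node 2
```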

Comparative Semantics

[Figure: two graphs illustrating that directed and undirected models capture different families of conditional independencies.]

- The undirected four-cycle over $W, X, Y, Z$ asserts $X \perp Y \mid \{W, Z\}$ and $W \perp Z \mid \{X, Y\}$; no directed graph over the same variables expresses exactly these independencies.
- The v-structure $X \to Z \leftarrow Y$ asserts $X \perp Y$ but not $X \perp Y \mid Z$; no undirected graph expresses exactly this pair of statements.

Clique Potentials

Clique: a fully connected subgraph (usually maximal). Denote by $x_{C_i}$ the set of variables in clique $C_i$.

[Figure: an undirected graph over $X_1,\ldots,X_5$ with its cliques marked.]

For each clique $C_i$ we assign a nonnegative function (potential function) $\psi_{C_i}(x_{C_i})$, which measures compatibility (agreement, constraint, or energy), and define

$$p(x) = \frac{1}{Z} \prod_{C_i \in \mathcal{C}} \psi_{C_i}(x_{C_i}),$$

where $Z = \sum_x \prod_{C_i \in \mathcal{C}} \psi_{C_i}(x_{C_i})$ is the normalization constant.

Example

[Figure: the six-node undirected graph with a $2 \times 2$ potential table attached to each pairwise clique. Each table takes the value 1.5 when the two variables agree ($\psi(0,0) = \psi(1,1) = 1.5$) and 0.2 when they disagree ($\psi(0,1) = \psi(1,0) = 0.2$), so agreeing configurations are favored.]
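A brute-force sketch of this construction (added, not from the slides), using the agreement-favoring pairwise potentials above on an assumed edge set for the six-node graph.

```python
import itertools

import numpy as np

# Pairwise potential psi(x_i, x_j): 1.5 if the two values agree, 0.2 otherwise.
psi = np.array([[1.5, 0.2],
                [0.2, 1.5]])

# Assumed pairwise cliques (edges) of the six-node undirected example graph.
cliques = [(1, 2), (1, 3), (2, 4), (3, 5), (2, 6), (5, 6)]

def unnorm(x):
    """Unnormalized probability: product of clique potentials (x is a 0/1 tuple)."""
    return np.prod([psi[x[i - 1], x[j - 1]] for i, j in cliques])

states = list(itertools.product([0, 1], repeat=6))
Z = sum(unnorm(x) for x in states)  # normalization constant

def p(x):
    return unnorm(x) / Z

# All-agreeing configurations get more mass than mixed ones, and p sums to one.
print(p((0, 0, 0, 0, 0, 0)), p((0, 1, 0, 1, 0, 1)), round(sum(p(s) for s in states), 10))
```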

Potential Functions: Parameterization

The potential functions must be nonnegative, so we parameterize them as

$$\psi_C(x_C) = \exp\{-H_C(x_C)\}.$$

Therefore, the joint probability for undirected models can be written as

$$p(x) = \frac{1}{Z} \exp\left\{ -\sum_{C_i \in \mathcal{C}} H_{C_i}(x_{C_i}) \right\}.$$

The sum in the exponent is generally referred to as the energy,

$$H(x) \triangleq \sum_{C_i \in \mathcal{C}} H_{C_i}(x_{C_i}).$$

Finally, the joint probability can be represented as a Boltzmann distribution:

$$p(x) = \frac{1}{Z} \exp\{-H(x)\}.$$

Potential Functions, Again!

The factorization of the joint probability distribution in a Markov network is given by

$$p(x) = \frac{1}{Z} \prod_{C_i \in \mathcal{C}} \psi_{C_i}(x_{C_i}).$$

If all potential functions are strictly positive, then we have

$$p(x) = \exp\left\{ \underbrace{\sum_{C_i \in \mathcal{C}} \log \psi_{C_i}(x_{C_i})}_{\sum_i \theta_i T_i(x)} - \underbrace{\log Z}_{A(\theta)} \right\}.$$

In some cases, we have

$$p(x) = \exp\left\{ \sum_i \theta_i T_i(x) - A(\theta) \right\},$$

which is known as the exponential family. In statistical physics, $\log Z$ is referred to as the free energy.

Markov Random Fields

Let $X = \{X_1,\ldots,X_n\}$ be a family of random variables defined on the set $S$, in which each random variable $X_i$ takes a value in $L$.

Definition. The family $X$ is called a random field.

Definition. $X$ is said to be a Markov random field (MRF) with respect to a neighborhood system $N$ if and only if the following two conditions are satisfied:
1. Positivity: $P(x) > 0$ for all $x \in \mathcal{X}$.
2. Markovianity: $P(x_i \mid x_{S \setminus i}) = P(x_i \mid x_{N_i})$.

Gibbs Random Fields

Definition. A set of random variables $X$ is said to be a Gibbs random field (GRF) if and only if its configurations obey a Gibbs distribution of the form

$$p(x) = \frac{1}{Z} \exp\left\{ -\frac{1}{T} E(x) \right\}, \quad \text{where } E(x) = \sum_{C} \psi_C(x).$$

We often consider cliques of size up to 2, so that

$$E(x) = \sum_{i \in S} \psi_1(x_i) + \sum_{i \in S} \sum_{j \in N_i} \psi_2(x_i, x_j).$$

Theorem (Hammersley-Clifford). $X$ is an MRF on $S$ with respect to $N$ if and only if $X$ is a GRF on $S$ with respect to $N$.

Characterization of Undirected Graphical Models

- $\mathcal{U}_1$: the family of probability distributions obtained by ranging over all possible choices of positive potential functions on the maximal cliques of the graph,
$$p(x) = \frac{1}{Z} \exp\{-H(x)\}.$$
- $\mathcal{U}_2$: the family of probability distributions defined via the conditional independence assertions ($X_A \perp X_B \mid X_C$) associated with $G$.

$\mathcal{U}_1 = \mathcal{U}_2$ by the Hammersley-Clifford theorem (proof: p. 85).