Undirected Graphical Models: Markov Random Fields


40-956 Advanced Topics in AI: Probabilistic Graphical Models
Sharif University of Technology, Soleymani, Spring 2015

Markov Random Field
Structure: undirected graph. Undirected edges represent correlations (non-causal relationships) between variables.
e.g., spatial image analysis: intensities of neighboring pixels are correlated.
[Figure: a Markov network over four nodes A, B, C, D]

MRF: Joint distribution
Factor φ(X_1, ..., X_k): a function φ: Val(X_1, ..., X_k) → R with scope {X_1, ..., X_k}.
Joint distribution parameterized by factors Φ = {φ_1(D_1), ..., φ_K(D_K)}:
P(X_1, ..., X_N) = (1/Z) ∏_k φ_k(D_k)
Z = Σ_X ∏_k φ_k(D_k)
D_k: the set of variables in the k-th factor.
Z: normalization constant, called the partition function.
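
As a minimal sketch of this definition (not from the slides), the snippet below builds an unnormalized product of two hypothetical factor tables over three binary variables and computes the partition function Z by brute-force enumeration; the factor values are arbitrary choices for illustration.

```python
# Minimal sketch: a Gibbs/MRF joint over three binary variables A, B, C built
# from two hypothetical factor tables, with the partition function Z computed
# by brute-force enumeration over all assignments.
from itertools import product

# phi_AB(a, b) and phi_BC(b, c): arbitrary nonnegative "compatibility" scores
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_BC = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

def unnormalized(a, b, c):
    """Product of factor values for one assignment."""
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Partition function: sum of the factor product over all assignments
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def joint(a, b, c):
    """P(A=a, B=b, C=c) = (1/Z) * prod_k phi_k(D_k)."""
    return unnormalized(a, b, c) / Z

print(Z, joint(0, 0, 0))  # the 8 probabilities sum to 1
```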

Relation between factorization and independencies
Theorem: Let X, Y, Z be three disjoint sets of variables. Then P ⊨ (X ⊥ Y | Z) if and only if P(X, Y, Z) = f(X, Z) g(Y, Z) for some functions f and g.

MRF: Gibbs distribution
Gibbs distribution with factors Φ = {φ_1(X_{C_1}), ..., φ_K(X_{C_K})}:
P_Φ(X_1, ..., X_N) = (1/Z) ∏_{i=1}^K φ_i(X_{C_i})
Z = Σ_X ∏_{i=1}^K φ_i(X_{C_i})
φ_i(X_{C_i}): potential function on clique C_i.
X_{C_i}: the set of variables in clique C_i.
The potential functions and the cliques in the graph completely determine the joint distribution.

MRF factorization: cliques
Factors are functions of the variables in cliques of the graph.
Clique: a subset of nodes that are fully connected (a complete subgraph).
Maximal clique: a clique is maximal if no strict superset of its nodes also forms a clique.
To reduce the number of factors, we can allow factors only over maximal cliques.
Example (graph over A, B, C, D): the cliques are {A,B,C}, {B,C,D}, {A,B}, {A,C}, {B,C}, {B,D}, {C,D}, {A}, {B}, {C}, {D}; the maximal cliques are {A,B,C} and {B,C,D}.
[Figure: the four-node Markov network A, B, C, D]

Interpretation of clique potentials
P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2)
P(X_1, X_2, X_3) = P(X_1, X_2) P(X_3 | X_1, X_2)
Clique potentials cannot in general all be marginal or conditional distributions.
A positive clique potential can be viewed as a general compatibility or "goodness" measure over the values of the variables in its scope.

MRF: local independencies
Pairwise independencies: X_i ⊥ X_j | X \ {X_i, X_j} for every non-adjacent pair X_i, X_j.
Markov blanket (local independencies): a variable is conditionally independent of all other variables given its neighboring nodes:
X_i ⊥ X \ ({X_i} ∪ MB(X_i)) | MB(X_i)
MB(X_i) = {X : (X_i, X) is an edge}
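
In an MRF the Markov blanket is simply the set of graph neighbors. The short sketch below (not from the slides) computes MB(X_i) from an edge list, using the A, B, C, D example graph as a hypothetical input.

```python
# Minimal sketch: in an MRF, MB(X) is the set of neighbors of X.
# The edge list is the hypothetical A-B-C-D example from the slides.
edges = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")}

def markov_blanket(node, edges):
    """MB(X) = all nodes sharing an edge with X."""
    return {y for (u, v) in edges for y in (u, v)
            if node in (u, v) and y != node}

print(markov_blanket("A", edges))  # {'B', 'C'}: A is independent of D given {B, C}
```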

MRF factorization and pairwise independencies
A distribution P_Φ with Φ = {φ_1(D_1), ..., φ_K(D_K)} factorizes over an MRF H if each D_k is a complete subgraph of H.
If there is no direct edge between X_i and X_j, then X_i ⊥ X_j | X \ {X_i, X_j}.
For this conditional independence to hold, variables X_i and X_j that are not directly connected must not appear together in the same factor of a distribution that factorizes over the graph.

MRFs: global independencies
A path is active given C if none of its nodes is in C.
A and B are separated given C if there is no active path between A and B given C.
Global independencies: for any disjoint sets A, B, C, A ⊥ B | C holds if every path connecting a node in A to a node in B passes through one or more nodes in C.
[Figure: node sets A and B separated by a set C]
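
The separation test is just graph reachability after deleting the conditioning set. Below is a minimal sketch (not from the slides) that checks sep_H(A; B | C) with a plain BFS; the example graph and queries are hypothetical.

```python
# Minimal sketch of the separation test for global independencies:
# A and B are separated given C if removing the nodes in C disconnects
# every path from A to B. Plain BFS on the graph with C deleted.
from collections import deque

def separated(edges, A, B, C):
    """True if every path from a node in A to a node in B passes through C."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    C = set(C)
    frontier = deque(a for a in A if a not in C)
    visited = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False              # found an active path avoiding C
        for w in adj.get(u, ()):
            if w not in C and w not in visited:
                visited.add(w)
                frontier.append(w)
    return True

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
print(separated(edges, {"A"}, {"D"}, {"B", "C"}))  # True: A separated from D by {B, C}
print(separated(edges, {"A"}, {"D"}, {"B"}))       # False: active path A-C-D
```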

MRF: independencies
Determining conditional independencies is much easier in undirected models than in directed ones.
Conditioning in undirected models can only eliminate dependencies, while in directed models it can also create new dependencies (v-structures).

Different factorizations
Maximal cliques:
P_Φ(X_1, X_2, X_3, X_4) = (1/Z) φ_123(X_1, X_2, X_3) φ_234(X_2, X_3, X_4)
Z = Σ_{X_1,X_2,X_3,X_4} φ_123(X_1, X_2, X_3) φ_234(X_2, X_3, X_4)
Sub-cliques:
P_Φ(X_1, X_2, X_3, X_4) = (1/Z) φ_12(X_1, X_2) φ_23(X_2, X_3) φ_13(X_1, X_3) φ_24(X_2, X_4) φ_34(X_3, X_4)
Z = Σ_{X_1,X_2,X_3,X_4} φ_12(X_1, X_2) φ_23(X_2, X_3) φ_13(X_1, X_3) φ_24(X_2, X_4) φ_34(X_3, X_4)
Canonical representation:
P_Φ(X_1, X_2, X_3, X_4) = (1/Z) φ_123(X_1, X_2, X_3) φ_234(X_2, X_3, X_4) φ_12(X_1, X_2) φ_23(X_2, X_3) φ_13(X_1, X_3) φ_24(X_2, X_4) φ_34(X_3, X_4) φ_1(X_1) φ_2(X_2) φ_3(X_3) φ_4(X_4)
Z = Σ_{X_1,X_2,X_3,X_4} of the same product of factors.
[Figure: the four-node Markov network X_1, X_2, X_3, X_4]

Pairwise MRF
All factors are over single variables or pairs of variables (X_i, X_j):
P(X) = (1/Z) ∏_{(X_i,X_j) ∈ H} φ_ij(X_i, X_j) ∏_i φ_i(X_i)
Pairwise MRFs are popular as a simple special case of general MRFs.
They consider only pairwise interactions, not interactions among larger subsets of variables.
In general, they do not have enough parameters to cover the full space of joint distributions.

Factor graph
A Markov network structure does not fully specify the factorization of P; it does not, in general, reveal all the structure of a Gibbs parameterization.
A factor graph has two kinds of nodes: variable nodes and factor nodes.
P(X_1, X_2, X_3) = f_1(X_1, X_2, X_3) f_2(X_1, X_2) f_3(X_2, X_3) f_4(X_3)
[Figure: factor graph with variable nodes X_1, X_2, X_3 and factor nodes f_1, f_2, f_3, f_4]
The factor graph is a useful structure for inference and parameterization (as we will see).
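
To make the "the Markov network does not determine the factorization" point concrete, the hypothetical sketch below (not from the slides) encodes two different factor graphs over X1, X2, X3 and checks that they induce the same Markov network, a triangle, even though their factorizations differ.

```python
# Minimal sketch: a factor graph is a bipartite structure linking factor nodes
# to the variables in their scope. The two hypothetical parameterizations below
# induce the same Markov network (a triangle) but different factor graphs.
factor_graph_a = {            # one factor over the whole triple
    "f123": ("X1", "X2", "X3"),
}
factor_graph_b = {            # three pairwise factors
    "f12": ("X1", "X2"),
    "f23": ("X2", "X3"),
    "f13": ("X1", "X3"),
}

def induced_mn_edges(factor_graph):
    """Markov network edges: connect every pair of variables sharing a factor."""
    edges = set()
    for scope in factor_graph.values():
        for i, u in enumerate(scope):
            for v in scope[i + 1:]:
                edges.add(frozenset((u, v)))
    return edges

# Same Markov network, different Gibbs parameterizations
assert induced_mn_edges(factor_graph_a) == induced_mn_edges(factor_graph_b)
```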

Energy function
Constraining clique potentials to be positive can be inconvenient, so we represent a clique potential in an unconstrained form using a real-valued "energy" function.
If the potential functions are strictly positive, φ_C(X_C) > 0:
φ_C(X_C) = exp(−E_C(X_C))
E_C(X_C) = −ln φ_C(X_C)   (the energy function)
P(X) = (1/Z) exp{−Σ_C E_C(X_C)}   (log-linear representation)

Log-linear models
Define the energy function as a linear combination of features.
A set of m features {f_1(D_1), ..., f_m(D_m)} on complete subgraphs, where D_i is the scope of the i-th feature:
the scope of a feature is a complete subgraph, and we can have several different features over the same subgraph.
P(X) = (1/Z) exp{Σ_{i=1}^m w_i f_i(D_i)}
Example: the Ising model uses f_ij(x_i, x_j) = x_i x_j.

Ising model
The most likely joint configurations usually correspond to "low-energy" states.
x_i ∈ {−1, 1}
P(x) = (1/Z) exp{Σ_i u_i x_i + Σ_{(i,j) ∈ E} w_ij x_i x_j}
Grid model: image processing, lattice physics, etc. The states of adjacent nodes are related.
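
The sketch below (not from the slides) instantiates the Ising model on a small grid and runs a few sweeps of Gibbs sampling, a standard sampler for this model that the slides do not cover; grid size and the shared weights u, w are hypothetical choices.

```python
# Minimal sketch of the Ising model on a 3x3 grid: the exponent
# sum_i u*x_i + sum_(i,j) w*x_i*x_j and one Gibbs-sampling sweep.
import math, random

n = 3                                  # 3x3 grid, x_i in {-1, +1}
u, w = 0.0, 0.5                        # shared node and edge weights (hypothetical)
nodes = [(r, c) for r in range(n) for c in range(n)]
edges = [((r, c), (r, c + 1)) for r in range(n) for c in range(n - 1)] + \
        [((r, c), (r + 1, c)) for r in range(n - 1) for c in range(n)]

def score(x):
    """Exponent of the Ising distribution (higher = more probable)."""
    return u * sum(x[i] for i in nodes) + w * sum(x[i] * x[j] for i, j in edges)

def gibbs_sweep(x):
    """Resample each spin from P(x_i | neighbors)."""
    for i in nodes:
        nbrs = [j for a, b in edges for j in (a, b) if i in (a, b) and j != i]
        field = u + w * sum(x[j] for j in nbrs)
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))   # P(x_i = +1 | rest)
        x[i] = 1 if random.random() < p_plus else -1
    return x

x = {i: random.choice([-1, 1]) for i in nodes}
for _ in range(100):
    x = gibbs_sweep(x)
print(score(x))   # with w > 0, sampled states tend toward agreeing neighbors
```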

Shared features in log-linear models
P(x) = (1/Z) exp{Σ_i u_i x_i + Σ_{(i,j) ∈ H} w_ij x_i x_j}
f_ij(x_i, x_j) = f(x_i, x_j) = x_i x_j
In most practical models, the same feature and weight are used over many scopes (w_ij = w):
P(x) = (1/Z) exp{Σ_i u x_i + Σ_{(i,j) ∈ H} w x_i x_j}

Image denoising
y_i ∈ {−1, 1}, i = 1, ..., D: array of observed noisy pixels
x_i ∈ {−1, 1}, i = 1, ..., D: noise-free image
[Bishop]

Image denoising
E(x, y) = h Σ_i x_i − β Σ_{(i,j) ∈ H} x_i x_j − η Σ_i x_i y_i
P(x, y) = (1/Z) exp{−E(x, y)}
x* = argmax_x P(x | y)
MPA: the most probable assignment of the x variables given evidence y.
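
One simple way to approximate this argmax, used in Bishop's treatment of the example, is iterated conditional modes (ICM): greedily set each pixel to the value that lowers the energy given its neighbors. The sketch below (not from the slides) does this on a toy synthetic image; the parameter values, image size, and noise level are hypothetical.

```python
# Minimal sketch of binary denoising with iterated conditional modes (ICM):
# greedily set each x_i to whichever of {-1, +1} lowers the energy
# E(x, y) = h*sum_i x_i - beta*sum_(i,j) x_i x_j - eta*sum_i x_i y_i.
import numpy as np

h, beta, eta = 0.0, 1.0, 2.1                    # hypothetical parameters
rng = np.random.default_rng(0)
clean = np.ones((20, 20), dtype=int)            # toy "clean" image, all +1
flip = rng.random(clean.shape) < 0.1
y = np.where(flip, -clean, clean)               # observed noisy pixels
x = y.copy()                                    # initialize x at the observation

def local_energy(x, y, i, j, value):
    """Terms of E(x, y) that involve pixel (i, j) when x_ij = value."""
    nbr_sum = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                  if 0 <= a < x.shape[0] and 0 <= b < x.shape[1])
    return h * value - beta * value * nbr_sum - eta * value * y[i, j]

for _ in range(5):                              # a few ICM sweeps
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = min((-1, +1), key=lambda v: local_energy(x, y, i, j, v))

print("pixels still flipped:", int(np.sum(x != clean)))
```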

Image denoising (gray-scale image)
E(x, y) = β Σ_{(i,j) ∈ H} min((x_i − x_j)^2, d) + η Σ_i (x_i − y_i)^2
f_ij(x_i, x_j) = f(x_i, x_j) = min((x_i − x_j)^2, d)
x* = argmax_x (1/Z) exp{−E(x, y)}
MPE: the most probable explanation of the x variables given evidence y.

Restricted Boltzmann Machine (RBM)
RBM (Hinton, 2002): binary units; efficient learning.
Two layers: hidden and visible.
P(v, h) = (1/Z) exp{Σ_i a_i h_i + Σ_j b_j v_j + Σ_{i,j} w_ij h_i v_j}
P(v | h) = ∏_j P(v_j | h)
P(h | v) = ∏_i P(h_i | v)

Restricted Boltzmann machine
P(h | v) = ∏_i P(h_i | v)
P(v | h) = ∏_j P(v_j | h)
P(h_i = 1 | v) = σ(a_i + Σ_j w_ij v_j)
P(v_j = 1 | h) = σ(b_j + Σ_i w_ij h_i)
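
Because both conditionals factorize, an RBM can be sampled by alternating the two layers (block Gibbs sampling). The sketch below (not from the slides) implements exactly these conditionals; the layer sizes and random weights are hypothetical, not a trained model.

```python
# Minimal sketch of block Gibbs sampling in an RBM using
# P(h_i = 1 | v) = sigma(a_i + sum_j w_ij v_j) and
# P(v_j = 1 | h) = sigma(b_j + sum_i w_ij h_i).
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_visible = 4, 6
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))   # w_ij (hypothetical)
a = np.zeros(n_hidden)                                   # hidden biases a_i
b = np.zeros(n_visible)                                  # visible biases b_j

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v):
    p = sigmoid(a + W @ v)          # hidden units are independent given v
    return (rng.random(n_hidden) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(b + W.T @ h)        # visible units are independent given h
    return (rng.random(n_visible) < p).astype(float)

v = (rng.random(n_visible) < 0.5).astype(float)
for _ in range(10):                 # alternate the two blocks
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h)
```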

MRF: global independencies
Independencies encoded by H (found using the graph-separation criterion discussed previously):
I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}
If P satisfies I(H), we say that H is an I-map (independency map) of P:
I(H) ⊆ I(P), where I(P) = {(X ⊥ Y | Z) : P ⊨ (X ⊥ Y | Z)}

Factorization & independence
Factorization ⇒ independence (soundness of the separation criterion)
Theorem: If P factorizes over H and sep_H(X; Y | Z), then P satisfies X ⊥ Y | Z (i.e., H is an I-map of P).
Independence ⇒ factorization
Theorem (Hammersley-Clifford): For a positive distribution P, if P satisfies I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}, then P factorizes over H.

Factorization & independence
Two equivalent views of the graph structure for positive distributions:
If P satisfies all independencies encoded by H, then it can be represented as a product of factors over the cliques of H.
If P factorizes over a graph H, then we can read from the graph structure independencies that must hold in P.

Relationship between local and global Markov properties
If P ⊨ I_ℓ(H), then P ⊨ I_p(H).
If P ⊨ I(H), then P ⊨ I_ℓ(H).
For a positive distribution P, the following three statements are equivalent: P ⊨ I_p(H), P ⊨ I_ℓ(H), P ⊨ I(H).

A loop of at least 4 nodes without a chord has no equivalent in BNs
Is there a BN that is a perfect map for this MN?
The chordless 4-cycle encodes A ⊥ C | {B, D} and B ⊥ D | {A, C}.
[Figure: candidate DAGs over A, B, C, D; each one either loses one of these two independencies or introduces an independency the MN does not have]

A v-structure has no equivalent in MNs
Is there an MN that is a perfect I-map of this BN?
The v-structure A → C ← B encodes A ⊥ B but not A ⊥ B | C.
[Figure: candidate MNs over A, B, C; none captures exactly this pattern of (in)dependencies]

Perfect map of a distribution
Not every distribution has an MN perfect map.
Not every distribution has a BN perfect map.
[Figure: Venn diagram of probabilistic models, with directed and undirected graphical models as overlapping subsets]

Minimal I-map
Since we may not find an MN that is a perfect map of a BN (and vice versa), we study the minimal I-map property.
H is a minimal I-map for G if I(H) ⊆ I(G) and removal of any single edge from H renders it no longer an I-map of G.

Minimal I-maps: from DAGs to MNs
The moral graph M(G) of a DAG G is the undirected graph that contains an undirected edge between X and Y if:
there is a directed edge between them (in either direction), or
X and Y are parents of the same node.
Moralization turns each node and its parents into a fully connected subgraph.
[Figure: the v-structure A → C ← B and its moral graph, with the added edge A-B]
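
As a minimal sketch of moralization (not from the slides), the function below keeps every directed edge as an undirected one and marries the co-parents of each node; the input DAG is the hypothetical v-structure A → C ← B.

```python
# Minimal sketch of moralization: the moral graph M(G) of a DAG keeps every
# directed edge (made undirected) and "marries" the parents of each node.
def moralize(parents):
    """parents: dict node -> list of parent nodes. Returns undirected edges."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:                       # keep each directed edge
            edges.add(frozenset((p, child)))
        for i, p in enumerate(pars):         # marry co-parents pairwise
            for q in pars[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges

dag = {"A": [], "B": [], "C": ["A", "B"]}    # v-structure A -> C <- B
print(moralize(dag))   # edges A-C, B-C, plus the marrying edge A-B
```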

Minimal I-maps: from DAGs to MNs
The moral graph M(G) of a DAG G is a minimal I-map for G.
The moral graph loses some independence information, but all independencies in the moral graph are also satisfied by G.
If a DAG G is moral, then its moralized graph M(G) is a perfect I-map of G.

Minimal I-maps: from MNs to DAGs
If G is a BN that is a minimal I-map for an MN, then G can have no immoralities.
If G is a minimal I-map for an MN, then G is chordal: any BN that is an I-map for an MN must add triangulating edges to the graph.
An undirected graph is chordal if every loop of more than three nodes has a chord.
[Figure: a 4-cycle MN over A, B, C, D and a chordal DAG G that is a minimal I-map of it]

Perfect I-map
Theorem: Let H be a non-chordal MN. Then there is no BN that is a perfect I-map for H.
If the independencies of an MN can be represented exactly by a BN, then the MN graph is chordal.
[Figure: a chordless 4-cycle over A, B, C, D]

Perfect I-map
Theorem: Let H be a chordal MN. Then there exists a DAG G that is a perfect I-map for H.
The independencies in a graph can be represented in both types of models if and only if the graph is chordal.
[Figure: a chordal MN over A, B, C, D, E and a DAG that is a perfect I-map for it]

Relationship between BNs and MNs
Directed and undirected models represent different families of independence assumptions.
Under certain conditions, they can be converted to each other: chordal graphs can be represented as both BNs and MNs.
For inference, we can use a single representation for both types of models, which allows simpler design and analysis of the inference algorithm.

Conditional Random Field (CRF)
Undirected graph H with nodes X ∪ Y, where X are the observed variables and Y are the target variables.
Consider factors Φ = {φ_1(D_1), ..., φ_K(D_K)} where each D_i ⊄ X (no factor scope lies entirely within X):
P(Y | X) = (1/Z(X)) P(Y, X)
P(Y, X) = ∏_{i=1}^K φ_i(D_i)
Z(X) = Σ_Y P(Y, X)
Nodes are connected by an edge in H whenever they appear together in the scope of some factor.

Linear-chain CRF
[Figure: a chain of target variables Y_1, Y_2, ..., Y_K, each connected to its observation X_1, X_2, ..., X_K]
P(Y, X) = ∏_{i=1}^{K−1} φ(Y_i, Y_{i+1}) ∏_{i=1}^K φ(Y_i, X_i)
P(Y | X) = (1/Z(X)) P(Y, X)
Z(X) = Σ_Y P(Y, X)
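
For the linear chain, Z(X) can be computed with the standard forward (sum-product) recursion rather than summing over all label sequences; the recursion itself is not spelled out in the slides. The sketch below uses hypothetical random potential tables (in practice φ(Y_i, X_i) and φ(Y_i, Y_{i+1}) come from log-linear features of X) and checks the result against brute-force enumeration.

```python
# Minimal sketch of the linear-chain CRF normalizer Z(X):
# Z(X) = sum over label sequences of prod_i phi(Y_i, Y_{i+1}) * prod_i phi(Y_i, X_i),
# computed with the forward (sum-product) recursion.
from itertools import product
import numpy as np

L = 3                                   # number of label values
K = 5                                   # sequence length
rng = np.random.default_rng(0)
trans = rng.random((L, L)) + 0.1        # phi(Y_i, Y_{i+1}) (hypothetical table)
emit = rng.random((K, L)) + 0.1         # phi(Y_i, X_i) for the observed X

def partition_function(trans, emit):
    alpha = emit[0].copy()              # alpha_1(y) = phi(y, x_1)
    for t in range(1, len(emit)):
        # alpha_t(y) = sum_{y'} alpha_{t-1}(y') * phi(y', y) * phi(y, x_t)
        alpha = (alpha @ trans) * emit[t]
    return alpha.sum()

# Brute-force check over all L^K label sequences
brute = sum(
    np.prod([emit[t, ys[t]] for t in range(K)]) *
    np.prod([trans[ys[t], ys[t + 1]] for t in range(K - 1)])
    for ys in product(range(L), repeat=K)
)
print(np.isclose(partition_function(trans, emit), brute))   # True
```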

CRF as a discriminative model
A discriminative approach for labeling: a CRF does not model the distribution over the observations.
Dependencies between the observed variables may be quite complex or poorly understood, but we do not need to worry about modeling them.
[Figure: a generative chain with per-step observations Y_1, ..., Y_T, X_1, ..., X_T versus a CRF in which each Y_i is connected to the whole observation X_1, ..., X_T]
When labeling X_i, future observations are also taken into account.

CRF: discriminative model
A CRF models the conditional probability P(Y | X) rather than the joint probability P(Y, X).
The probability of a transition between labels may depend on past and future observations.
A CRF is based on the conditional probability of the label sequence given the observation sequence.
It allows arbitrary dependencies among features of the observation sequence, as opposed to the independence assumptions made in generative models.

Naïve Markov model
[Figure: features X_1, X_2, ..., X_k all connected to a single label Y]
X_i: binary random variables; Y: binary random variable.
φ_i(X_i, Y) = exp{w_i I(X_i = 1, Y = 1)}
φ_0(Y) = exp{w_0 I(Y = 1)}
P(Y = 1 | X_1, X_2, ..., X_k) = σ(w_0 + Σ_{j=1}^k w_j X_j)

CRF: logistic model
Naïve Markov model:
P(Y, X) = exp{w_0 I(Y = 1) + Σ_{i=1}^m w_i I(X_i = 1, Y = 1)}
P(Y = 1, X) = exp{w_0 + Σ_{i=1}^m w_i X_i}
P(Y = 0, X) = exp{0} = 1
P(Y = 1 | X) = exp{w_0 + Σ_{i=1}^m w_i X_i} / (1 + exp{w_0 + Σ_{i=1}^m w_i X_i}) = σ(w_0 + Σ_{i=1}^m w_i X_i)
The number of parameters is linear in the number of features.
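
The sketch below (not from the slides, with hypothetical weights and inputs) checks this reduction numerically: normalizing the two unnormalized scores exp{w_0 + Σ_i w_i x_i} and exp{0} reproduces the sigmoid.

```python
# Minimal sketch: with the naive Markov factors, the only two unnormalized
# scores are exp(w0 + sum_i w_i x_i) for Y = 1 and exp(0) = 1 for Y = 0, so
# normalizing gives exactly the logistic (sigmoid) form.
import math

w0 = -1.0                                # hypothetical weights
w = [2.0, -0.5, 1.5]
x = [1, 0, 1]                            # hypothetical binary observations

score_y1 = math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x)))
score_y0 = math.exp(0.0)                 # all indicator features are 0 when Y = 0
p_y1 = score_y1 / (score_y0 + score_y1)

sigmoid = 1.0 / (1.0 + math.exp(-(w0 + sum(wi * xi for wi, xi in zip(w, x)))))
print(abs(p_y1 - sigmoid) < 1e-12)       # True: CRF normalization = sigmoid
```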

CRF: image segmentation example
A node Y_i for the label of each superpixel, with Val(Y_i) = {1, 2, ..., K} (e.g., grass, sky, water, ...).
An edge between Y_i and Y_j wherever the corresponding superpixels share a boundary.
A node X_i for the features (e.g., color, texture, location) of each superpixel.

CRF: image segmentation example
Simple pairwise potential: φ(Y_i, Y_j) = exp{−λ I(Y_i ≠ Y_j)}
More complex potentials can encode, e.g., that a horse is more likely to be adjacent to vegetation than to water, or can depend on the relative pixel locations, e.g., water below vegetation, sky above everything.

CRF: image segmentation example
[Figure: segmentation results, from Koller's book]

CRF: Named Entity Recognition
Factors: φ(Y_i, Y_{i+1}) and φ(Y_i, X_1, ..., X_T)   [Koller's book]
Features: word is capitalized, word appears in an atlas of locations, previous word is "Mrs", next word is "Times", ...