Undirected Graphical Models: Markov Random Fields
40-956 Advanced Topics in AI: Probabilistic Graphical Models
Sharif University of Technology, Soleymani, Spring 2015
Markov Random Field
Structure: undirected graph. Undirected edges represent correlations (non-causal relationships) between variables.
Example: in spatial image analysis, the intensities of neighboring pixels are correlated.
(Example Markov network over nodes A, B, C, D.)
MRF: Joint distribution
A factor φ(X_1, …, X_k) is a function φ: Val(X_1, …, X_k) → R with scope {X_1, …, X_k}.
Joint distribution parametrized by a set of factors Φ = {φ_1(D_1), …, φ_K(D_K)}:
P(X_1, …, X_N) = (1/Z) ∏_k φ_k(D_k)
D_k: the set of variables in the k-th factor (its scope)
Z = Σ_X ∏_k φ_k(D_k): the normalization constant, called the partition function
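A minimal sketch of a factor-product joint distribution over binary variables, computing the partition function Z by brute-force enumeration. The variable names and factor tables below are illustrative, not from the slides.

```python
import itertools

# Two illustrative factors over binary variables A, B, C.
# Each factor is (scope, table): the table maps an assignment of the
# scope (sorted by variable name) to a nonnegative value.
factors = [
    ({"A", "B"}, {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}),
    ({"B", "C"}, {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}),
]
variables = sorted({v for scope, _ in factors for v in scope})  # ['A', 'B', 'C']

def unnormalized(assignment):
    """Product of all factors evaluated at a full assignment {var: value}."""
    p = 1.0
    for scope, table in factors:
        key = tuple(assignment[v] for v in sorted(scope))
        p *= table[key]
    return p

# Partition function Z: sum of the unnormalized measure over all assignments.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def prob(assignment):
    """Normalized probability P(assignment) = (1/Z) * product of factors."""
    return unnormalized(assignment) / Z
```

Enumerating all assignments is exponential in the number of variables; it is only meant to make the definition of Z concrete.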
Relation between factorization and independencies
Theorem: Let X, Y, Z be three disjoint sets of variables. Then
P ⊨ (X ⊥ Y | Z) iff P(X, Y, Z) = f(X, Z) g(Y, Z) for some functions f and g.
MRF: Gibbs distribution
Gibbs distribution with factors Φ = {φ_1(X_{C_1}), …, φ_K(X_{C_K})}:
P_Φ(X_1, …, X_N) = (1/Z) ∏_{i=1}^{K} φ_i(X_{C_i})
Z = Σ_X ∏_{i=1}^{K} φ_i(X_{C_i})
φ_i(X_{C_i}): potential function on clique C_i
X_{C_i}: the set of variables in clique C_i
The potential functions and the cliques in the graph completely determine the joint distribution.
MRF factorization: cliques
Factors are functions of the variables in cliques.
Clique: a subset of nodes in the graph that is fully connected (a complete subgraph).
Maximal clique: a clique such that no strict superset of its nodes is also a clique.
To reduce the number of factors, we can allow factors only for maximal cliques.
Cliques: {A,B,C}, {B,C,D}, {A,B}, {A,C}, {B,C}, {B,D}, {C,D}, {A}, {B}, {C}, {D}
Maximal cliques: {A,B,C}, {B,C,D}
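A brute-force enumeration of the cliques and maximal cliques of the 4-node example graph on this slide. This is only workable for tiny graphs; a real implementation would use a dedicated algorithm (e.g. Bron-Kerbosch, as in `networkx.find_cliques`).

```python
from itertools import combinations

# Edges of the example graph: A-B, A-C, B-C, B-D, C-D.
edges = {frozenset(e) for e in
         [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]}
nodes = ["A", "B", "C", "D"]

def is_clique(subset):
    """A subset is a clique if every pair of its nodes is connected."""
    return all(frozenset(p) in edges for p in combinations(subset, 2))

# All cliques: every fully connected subset of nodes (singletons included).
cliques = [set(s) for r in range(1, len(nodes) + 1)
           for s in combinations(nodes, r) if is_clique(s)]

# Maximal cliques: cliques not strictly contained in another clique.
maximal = [c for c in cliques if not any(c < d for d in cliques)]
```

For this graph the maximal cliques are exactly {A,B,C} and {B,C,D}, matching the slide.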
Interpretation of clique potentials
P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2)
P(X_1, X_2, X_3) = P(X_1, X_2) P(X_3 | X_1, X_2)
Clique potentials cannot in general all be marginal or conditional distributions.
A positive clique potential can be viewed as a general compatibility or "goodness" measure over the values of the variables in its scope.
MRF: local independencies
Pairwise independencies: for every non-adjacent pair (X_i, X_j): X_i ⊥ X_j | X − {X_i, X_j}
Markov blanket (local independencies): a variable is conditionally independent of every other variable given only its neighboring nodes:
X_i ⊥ X − {X_i} − MB(X_i) | MB(X_i)
MB(X_i) = {X' ∈ X : (X_i, X') is an edge}
MRF factorization and pairwise independencies
A distribution P_Φ with Φ = {φ_1(D_1), …, φ_K(D_K)} factorizes over an MRF H if each D_k is a complete subgraph of H.
If there is no edge between X_i and X_j, then: X_i ⊥ X_j | X − {X_i, X_j}
For this conditional independence to hold, variables X_i and X_j that are not directly connected must not appear together in the scope of any factor of the distributions belonging to the graph.
MRFs: global independencies
A path is active given C if no node on it is in C.
A and B are separated given C, written sep_H(A; B | C), if there is no active path between A and B given C, i.e., all paths that connect a node in A to a node in B pass through one or more nodes in C.
Global independencies: for any disjoint sets A, B, C with sep_H(A; B | C): A ⊥ B | C
MRF: independencies
Determining conditional independencies in undirected models is much easier than in directed ones: conditioning in undirected models can only eliminate dependencies, while in directed models it can also create new dependencies (v-structures).
Different factorizations
Maximal cliques:
P_Φ(X_1, X_2, X_3, X_4) = (1/Z) φ_123(X_1, X_2, X_3) φ_234(X_2, X_3, X_4)
Sub-cliques:
P_Φ(X_1, X_2, X_3, X_4) = (1/Z) φ_12(X_1, X_2) φ_13(X_1, X_3) φ_23(X_2, X_3) φ_24(X_2, X_4) φ_34(X_3, X_4)
Canonical representation (factors over all cliques, including singletons):
P_Φ(X_1, X_2, X_3, X_4) = (1/Z) φ_123(X_1, X_2, X_3) φ_234(X_2, X_3, X_4) φ_12(X_1, X_2) φ_13(X_1, X_3) φ_23(X_2, X_3) φ_24(X_2, X_4) φ_34(X_3, X_4) φ_1(X_1) φ_2(X_2) φ_3(X_3) φ_4(X_4)
In each case, Z = Σ_{X_1,…,X_4} (product of that factorization's factors).
Pairwise MRF
All factors are over single variables or pairs of variables (X_i, X_j) connected by an edge:
P(X) = (1/Z) ∏_{(X_i, X_j) ∈ H} φ_ij(X_i, X_j) ∏_i φ_i(X_i)
Pairwise MRFs are popular as a simple special case of general MRFs.
They model pairwise interactions but not interactions among larger subsets of variables, so in general they do not have enough parameters to cover the full space of joint distributions.
Factor graph
The Markov network structure does not fully specify the factorization of P; it does not, in general, reveal all the structure of a Gibbs parameterization.
A factor graph has two kinds of nodes: variable nodes and factor nodes.
Example: P(X_1, X_2, X_3) ∝ f_1(X_1, X_2, X_3) f_2(X_1, X_2) f_3(X_2, X_3) f_4(X_3)
(Variable nodes X_1, X_2, X_3 connected to factor nodes f_1, f_2, f_3, f_4.)
The factor graph is a useful structure for inference and parametrization (as we will see).
Energy function
Constraining clique potentials to be positive can be inconvenient, so we represent each clique potential in an unconstrained form using a real-valued "energy" function.
If the potential functions are strictly positive, φ_C(X_C) > 0:
φ_C(X_C) = exp{−E_C(X_C)}, i.e., E_C(X_C) = −ln φ_C(X_C)
E_C(X_C): energy function
P(X) = (1/Z) exp{−Σ_C E_C(X_C)} (log-linear representation)
Log-linear models
Define the energy function as a linear combination of features.
A set of m features {f_1(D_1), …, f_m(D_m)} on complete subgraphs, where D_i is the scope of the i-th feature:
The scope of a feature is a complete subgraph; we can have several different features over the same subgraph.
P(X) = (1/Z) exp{ Σ_{i=1}^{m} w_i f_i(D_i) }
Example: the Ising model uses f_ij(x_i, x_j) = x_i x_j
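A minimal log-linear sketch over three ±1-valued variables on a chain x1 - x2 - x3, with Ising-style pairwise features f(x_i, x_j) = x_i x_j. The weights are illustrative; the point is only that P(X) ∝ exp of a weighted feature sum.

```python
import math
import itertools

# Illustrative weights for the two pairwise features on the chain.
weights = {("x1", "x2"): 1.0, ("x2", "x3"): -0.5}
names = ["x1", "x2", "x3"]

def score(x):
    """exp of the weighted feature sum for an assignment x = {name: +/-1}."""
    return math.exp(sum(w * x[a] * x[b] for (a, b), w in weights.items()))

# Partition function by enumerating the 2^3 assignments.
Z = sum(score(dict(zip(names, v)))
        for v in itertools.product([-1, 1], repeat=3))

def prob(x):
    return score(x) / Z
```

With a positive weight on (x1, x2), assignments where x1 and x2 agree get higher probability, as expected of the Ising feature.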
Ising model
X_i ∈ {−1, 1}
P(x) = (1/Z) exp{ Σ_i u_i x_i + Σ_{(i,j)∈E} w_ij x_i x_j }
Grid model: image processing, lattice physics, etc.
The most likely joint configurations usually correspond to "low-energy" states; the states of adjacent nodes are related.
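For the grid Ising model above, the conditional of a single site given its neighbors has a closed form: with P(x) ∝ exp{Σ_i u x_i + Σ_{(i,j)} w x_i x_j}, the odds of x_ij = +1 versus −1 are exp{2(u + w Σ_{neighbors} x)}, i.e., a sigmoid of twice the local field. A sketch (grid size and parameter values are illustrative):

```python
import math

def neighbors(i, j, n):
    """4-neighborhood of cell (i, j) in an n x n grid."""
    return [(a, b) for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if 0 <= a < n and 0 <= b < n]

def cond_prob_plus(x, i, j, u, w):
    """P(x_ij = +1 | all other sites) for an n x n grid x of +/-1 values.

    Only the local field matters: all factors not involving x_ij cancel
    in the conditional.
    """
    n = len(x)
    field = u + w * sum(x[a][b] for a, b in neighbors(i, j, n))
    return 1.0 / (1.0 + math.exp(-2.0 * field))
```

This single-site conditional is exactly what a Gibbs sampler for the Ising model would draw from.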
Shared features in log-linear models
P(x) = (1/Z) exp{ Σ_i u_i x_i + Σ_{(i,j)∈H} w_ij x_i x_j }
with shared feature f_ij(x_i, x_j) = f(x_i, x_j) = x_i x_j.
In most practical models, the same feature and weight are used over many scopes (w_ij = w):
P(x) = (1/Z) exp{ Σ_i u x_i + Σ_{(i,j)∈H} w x_i x_j }
Image denoising
y_i ∈ {−1, 1}, i = 1, …, D: array of observed noisy pixels
x_i ∈ {−1, 1}, i = 1, …, D: noise-free image
[Bishop]
Image denoising
E(x, y) = h Σ_i x_i − β Σ_{(i,j)∈H} x_i x_j − η Σ_i x_i y_i
P(x, y) = (1/Z) exp{−E(x, y)}
x* = argmax_x P(x | y)
MPA: most probable assignment of the x variables given the evidence y
Image denoising (gray-scale image)
E(x, y) = β Σ_{(i,j)∈H} min((x_i − x_j)^2, d) + η Σ_i (x_i − y_i)^2
f_ij(x_i, x_j) = f(x_i, x_j) = min((x_i − x_j)^2, d)
x* = argmax_x (1/Z) exp{−E(x, y)}
MPE: most probable explanation of the x variables given the evidence y
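A minimal sketch of coordinate-wise greedy minimization (iterated conditional modes) for the binary denoising energy E(x, y) = h Σ x_i − β Σ x_i x_j − η Σ x_i y_i. ICM itself is an assumption here (the slides only state the argmax problem), and the parameter values are illustrative.

```python
def icm_denoise(y, h=0.0, beta=1.0, eta=1.5, sweeps=5):
    """Greedy denoising of a 2D list y of +/-1 pixels.

    Repeatedly sets each pixel to the value that locally lowers
    E(x, y) = h*sum(x_i) - beta*sum_edges(x_i x_j) - eta*sum(x_i y_i),
    holding all other pixels fixed. Returns a local optimum, not
    necessarily the global MAP assignment.
    """
    n, m = len(y), len(y[0])
    x = [row[:] for row in y]  # initialize the estimate with the observation
    for _ in range(sweeps):
        for i in range(n):
            for j in range(m):
                nb = sum(x[a][b] for a, b in
                         [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                         if 0 <= a < n and 0 <= b < m)
                # E(x_ij=+1) - E(x_ij=-1) = 2h - 2*beta*nb - 2*eta*y[i][j],
                # so choose +1 exactly when beta*nb + eta*y[i][j] - h > 0.
                x[i][j] = 1 if (beta * nb + eta * y[i][j] - h) > 0 else -1
    return x
```

On an image that is all +1 except one flipped pixel, the neighbor agreement term outvotes the noisy observation and the pixel is restored.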
Restricted Boltzmann Machine (RBM)
RBM (Hinton, 2002): binary hidden and visible units; efficient learning.
P(v, h) = (1/Z) exp{ Σ_i a_i h_i + Σ_j b_j v_j + Σ_{i,j} w_ij h_i v_j }
P(v | h) = ∏_j P(v_j | h)
P(h | v) = ∏_i P(h_i | v)
Restricted Boltzmann machine
P(h | v) = ∏_i P(h_i | v), P(v | h) = ∏_j P(v_j | h)
P(h_i = 1 | v) = σ(a_i + Σ_j w_ij v_j)
P(v_j = 1 | h) = σ(b_j + Σ_i w_ij h_i)
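The RBM conditionals above, written out directly: each hidden (visible) unit is independently Bernoulli with a sigmoid activation of its weighted input. The parameter shapes and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_hidden_given_visible(v, a, W):
    """P(h_i = 1 | v) for each hidden unit i; W[i][j] couples h_i and v_j."""
    return [sigmoid(a[i] + sum(W[i][j] * v[j] for j in range(len(v))))
            for i in range(len(a))]

def p_visible_given_hidden(h, b, W):
    """P(v_j = 1 | h) for each visible unit j, using the same W[i][j]."""
    return [sigmoid(b[j] + sum(W[i][j] * h[i] for i in range(len(h))))
            for j in range(len(b))]
```

With zero weights and biases every unit is an unbiased coin, and a large positive w_ij pushes P(h_i = 1 | v) toward 1 when v_j = 1.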
MRF: global independencies
Independencies encoded by H (found using the graph-separation criterion discussed previously):
I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}
If P satisfies I(H), we say that H is an I-map (independency map) of P:
I(H) ⊆ I(P), where I(P) = {(X ⊥ Y | Z) : P ⊨ (X ⊥ Y | Z)}
Factorization & independence
Factorization ⇒ independence (soundness of the separation criterion):
Theorem: If P factorizes over H and sep_H(X; Y | Z), then P satisfies X ⊥ Y | Z (i.e., H is an I-map of P).
Independence ⇒ factorization:
Theorem (Hammersley-Clifford): For a positive distribution P, if P satisfies I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}, then P factorizes over H.
Factorization & independence
Two equivalent views of graph structure for positive distributions:
If P satisfies all the independencies encoded by H, then P can be represented as a product of factors over the cliques of H.
If P factorizes over a graph H, we can read off from the graph structure independencies that must hold in P.
Relationship between local and global Markov properties
If P ⊨ I_l(H) then P ⊨ I_p(H).
If P ⊨ I(H) then P ⊨ I_l(H).
For a positive distribution P, the following three statements are equivalent:
P ⊨ I_p(H); P ⊨ I_l(H); P ⊨ I(H)
A loop of at least 4 nodes without a chord has no equivalent in BNs
Is there a BN that is a perfect map for this MN (the 4-cycle over A, B, C, D)?
The MN encodes exactly A ⊥ C | {B, D} and B ⊥ D | {A, C}.
(Candidate DAGs over A, B, C, D each either lose one of these independencies, e.g. keeping only B ⊥ D | {A, C}, or introduce extra ones.) So no BN is a perfect map.
A v-structure has no equivalent in MNs
Is there an MN that is a perfect I-map of this BN (A → C ← B)?
The BN encodes A ⊥ B but not A ⊥ B | C.
(Candidate undirected graphs over A, B, C each either miss A ⊥ B or wrongly imply A ⊥ B | C.) So no MN is a perfect I-map.
Perfect map of a distribution
Not every distribution has an MN perfect map.
Not every distribution has a BN perfect map.
(Venn diagram: within the space of probabilistic models, the distributions with directed perfect maps and those with undirected perfect maps overlap, but neither set contains the other.)
Minimal I-map
Since we may not find an MN that is a perfect map of a BN (and vice versa), we study minimal I-maps.
H is a minimal I-map for G if I(H) ⊆ I(G) and removal of any single edge from H renders it no longer an I-map of G.
Minimal I-maps: from DAGs to MNs
The moral graph M(G) of a DAG G is the undirected graph that contains an undirected edge between X and Y if:
there is a directed edge between them (in either direction), or
X and Y are parents of the same node.
Moralization turns each node together with its parents into a fully connected subgraph.
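The moralization rule above can be sketched directly: keep every directed edge as an undirected one, and "marry" all co-parents of each child. The DAG representation as a child-to-parents dict is an assumption for this sketch.

```python
def moralize(parents):
    """Moral graph of a DAG given as {child: [list of parents]}.

    Returns the undirected edge set as a set of frozensets.
    """
    edges = set()
    for child, ps in parents.items():
        # Rule 1: each directed edge parent -> child becomes undirected.
        for p in ps:
            edges.add(frozenset((p, child)))
        # Rule 2: connect every pair of parents of the same child.
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges
```

For the v-structure A → C ← B this yields the triangle {A-C, B-C, A-B}: the marrying edge A-B is exactly what makes the moral graph lose the marginal independence A ⊥ B.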
Minimal I-maps: from DAGs to MNs
The moral graph M(G) of a DAG G is a minimal I-map for G.
The moral graph loses some independence information, but all independencies in the moral graph are also satisfied by G.
If a DAG G is already moral, then its moral graph M(G) is a perfect I-map of G.
Minimal I-maps: from MNs to DAGs
If G is a BN that is a minimal I-map for an MN, then G can have no immoralities.
If G is a minimal I-map for an MN, then G is chordal.
Any BN that is an I-map for an MN may have to add triangulating edges to the graph.
An undirected graph is chordal if every loop of more than three nodes has a chord.
(Example: a triangulated DAG over A, B, C, D is a minimal I-map of the 4-cycle MN.)
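Chordality can be tested with a simple (if naive) procedure: a graph is chordal iff it has a perfect elimination ordering, i.e., we can repeatedly remove a simplicial vertex (one whose remaining neighbors form a clique). A brute-force sketch for tiny graphs; production code would use e.g. `networkx.is_chordal`:

```python
def is_chordal(nodes, edge_list):
    """True iff the undirected graph has a perfect elimination ordering."""
    edges = {frozenset(e) for e in edge_list}
    remaining = set(nodes)

    def nb(v):
        return {u for u in remaining if u != v and frozenset((u, v)) in edges}

    while remaining:
        # Find a simplicial vertex: its neighbors are pairwise connected.
        simplicial = next(
            (v for v in remaining
             if all(frozenset((a, b)) in edges
                    for a in nb(v) for b in nb(v) if a != b)),
            None)
        if simplicial is None:
            return False  # no simplicial vertex left => a chordless cycle exists
        remaining.remove(simplicial)
    return True
```

The 4-cycle on A, B, C, D fails the test (no vertex is simplicial), while adding the chord B-D makes it pass, matching the slide's example.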
Perfect I-map
Theorem: Let H be a non-chordal MN. Then there is no BN that is a perfect I-map for H.
If the independencies in an MN can be represented exactly by a BN, then the MN graph is chordal.
Perfect I-map
Theorem: Let H be a chordal MN. Then there exists a DAG G that is a perfect I-map for H.
The independencies in a graph can be represented in both types of models if and only if the graph is chordal.
Relationship between BNs and MNs
Directed and undirected models represent different families of independence assumptions.
Under certain conditions, they can be converted into each other; chordal graphs can be represented as both BNs and MNs.
For inference, we can use a single representation for both types of models, which simplifies the design and analysis of inference algorithms.
Conditional Random Field (CRF)
Undirected graph H with nodes X ∪ Y
X: observed variables; Y: target variables
Consider factors Φ = {φ_1(D_1), …, φ_K(D_K)} where no scope D_i lies entirely within X:
P(Y, X) = ∏_{i=1}^{K} φ_i(D_i)
P(Y | X) = (1/Z(X)) P(Y, X)
Z(X) = Σ_Y P(Y, X)
Nodes are connected by an edge in H whenever they appear together in the scope of some factor.
Linear-chain CRF
(Chain over labels Y_1, …, Y_K, with each Y_i also connected to its observation X_i.)
P(Y, X) = ∏_{i=1}^{K−1} φ(Y_i, Y_{i+1}) ∏_{i=1}^{K} φ(Y_i, X_i)
P(Y | X) = (1/Z(X)) P(Y, X)
Z(X) = Σ_Y P(Y, X)
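A brute-force sketch of the linear-chain CRF above, with binary labels and small illustrative transition and emission potentials. Real implementations compute Z(X) with the forward-backward algorithm; enumeration here is only to make P(Y | X) = P(Y, X) / Z(X) concrete.

```python
import itertools

# phi(y_i, y_{i+1}): transition potential favoring label agreement (illustrative).
trans = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def emit(y, x):
    """phi(y_i, x_i): emission potential favoring y_i == x_i (illustrative)."""
    return 3.0 if y == x else 1.0

def unnormalized(ys, xs):
    """P(Y, X) as the product of transition and emission potentials."""
    p = 1.0
    for a, b in zip(ys, ys[1:]):
        p *= trans[(a, b)]
    for y, x in zip(ys, xs):
        p *= emit(y, x)
    return p

def p_y_given_x(ys, xs):
    """P(Y | X) = P(Y, X) / Z(X), with Z(X) by enumerating label sequences."""
    Zx = sum(unnormalized(c, xs)
             for c in itertools.product([0, 1], repeat=len(xs)))
    return unnormalized(ys, xs) / Zx
```

Note that Z(X) depends on the observation sequence, which is exactly what makes this a conditional rather than a joint model.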
CRF as a discriminative model
A discriminative approach to labeling: the CRF does not model the distribution over the observations.
Dependencies between observed variables may be quite complex or poorly understood, but we do not need to worry about modeling them.
When choosing the label Y_i, both past and future observations are taken into account.
CRF: discriminative model
Models the conditional probability P(Y | X) rather than the joint probability P(Y, X).
The probability of a transition between labels may depend on past and future observations.
The CRF is based on the conditional probability of the label sequence given the observation sequence.
It allows arbitrary dependencies among features of the observation sequence, as opposed to the independence assumptions of generative models.
Naïve Markov model
X_1, …, X_k: binary random variables; Y: binary random variable
φ_i(X_i, Y) = exp{ w_i I(X_i = 1, Y = 1) }
φ_0(Y) = exp{ w_0 I(Y = 1) }
P(Y = 1 | X_1, X_2, …, X_k) = σ( w_0 + Σ_{j=1}^{k} w_j X_j )
CRF: logistic model
For the naïve Markov model:
P(Y, X) ∝ exp{ w_0 I(Y = 1) + Σ_{i=1}^{k} w_i I(X_i = 1, Y = 1) }
P(Y = 1, X) ∝ exp{ w_0 + Σ_{i=1}^{k} w_i X_i }
P(Y = 0, X) ∝ exp{0} = 1
P(Y = 1 | X) = 1 / (1 + exp{−(w_0 + Σ_{j=1}^{k} w_j X_j)}) = σ( w_0 + Σ_{i=1}^{k} w_i X_i )
The number of parameters is linear in k.
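A numerical check (with illustrative weights) that the naïve Markov model's conditional P(Y = 1 | X) equals the logistic sigmoid of w_0 + Σ_i w_i x_i, exactly as derived above:

```python
import math
import itertools

# Illustrative weights for k = 3 binary features.
w0, w = -0.5, [1.0, -2.0, 0.7]

def joint_unnorm(y, xs):
    """Unnormalized measure exp{w0*I(y=1) + sum_i w_i * I(x_i=1, y=1)}.

    For y = 0 every indicator vanishes, so the measure is exp(0) = 1.
    """
    if y == 0:
        return 1.0
    return math.exp(w0 + sum(wi * xi for wi, xi in zip(w, xs)))

def cond_y1(xs):
    """P(Y = 1 | X = xs), normalizing over the two values of Y."""
    return joint_unnorm(1, xs) / (joint_unnorm(0, xs) + joint_unnorm(1, xs))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
```

The agreement holds for every one of the 2^3 feature configurations, which is the "logistic regression as a tiny CRF" observation the slide makes.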
CRF: image segmentation example
A node Y_i for the label of each superpixel, Val(Y_i) = {1, 2, …, K} (e.g., grass, sky, water, …)
An edge between Y_i and Y_j wherever the corresponding superpixels share a boundary
A node X_i for the features (e.g., color, texture, location) of each superpixel
CRF: image segmentation example
Simple pairwise potential: φ(Y_i, Y_j) = exp{−λ I(Y_i ≠ Y_j)}
More complex potentials can depend on the class pair and relative pixel location, e.g., horse is more likely adjacent to vegetation than to water; water tends to appear below vegetation, and sky above everything.
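The simple Potts-style pairwise potential from the slide, written out; the λ value is illustrative. It rewards neighboring superpixels that take the same label and penalizes disagreement by a constant factor exp(−λ):

```python
import math

def potts_potential(yi, yj, lam=1.0):
    """phi(y_i, y_j) = exp(-lam * I(y_i != y_j)): 1 if labels agree,
    exp(-lam) < 1 if they disagree (for lam > 0)."""
    return math.exp(-lam) if yi != yj else 1.0
```

Multiplying these potentials over all boundary-sharing superpixel pairs encourages smooth segmentations.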
CRF: image segmentation example (figure) [Koller's Book]
CRF: named entity recognition
Potentials φ(Y_i, Y_{i+1}) between adjacent labels and φ(Y_i, X_1, …, X_T) connecting each label to the whole observation sequence. [Koller's Book]
Features: word is capitalized, word appears in an atlas of locations, previous word is "Mrs", next word is "Times", …