Undirected Graphical Models


Readings: K&F 4.1, 4.2, 4.3, 4.4

Undirected Graphical Models
Lecture 4, April 6, 2011
CSE 515 Statistical Methods, Spring 2011
Instructor: Su-In Lee, University of Washington, Seattle

Bayesian Network Representation (review)
Directed acyclic graph structure
Conditional parameterization
Independencies in graphs
From distribution to BN graphs
Conditional probability distributions (CPDs): table, deterministic, context-specific (tree CPDs, rule CPDs), independence of causal influence (noisy-OR, GLMs), continuous variables, hybrid models

The Misconception Example
Four students get together in pairs to work on HWs: Alice (A), Bob (B), Charles (C), Debbie (D). Only the following pairs meet: (A&B), (B&C), (C&D), (D&A).
Let's say that the prof accidentally misspoke in class. Each student may subsequently have figured out the problem. In subsequent study pairs they may transmit this newfound understanding to their partners.
Consider 4 binary random variables A, B, C, D: whether the student has the misconception or not.
Independence assumptions? Can we find the P-map for these?

Reminder: Perfect Maps
G is a perfect map (P-map) for P if I(P) = I(G).
Does every distribution have a P-map? No: some structures cannot be represented in a BN.
Independencies in P: (A ⊥ C | B,D) and (B ⊥ D | A,C). In a candidate BN such as A→B, A→D, B→C, D→C, the independence (B ⊥ D | A,C) does not hold in the graph, while (B ⊥ D | A) also holds in the graph but not in P.

Representing Dependencies
(A ⊥ C | B,D) and (B ⊥ D | A,C) cannot be modeled with a Bayesian network. They can be modeled with an undirected graphical model (Markov network).

Undirected Graphical Models (Informal)
Nodes correspond to random variables. Edges correspond to direct probabilistic interaction: an interaction not mediated by any other variables in the network.
How to parameterize? Local factor models are attached to sets of nodes. Factor entries are positive, do not have to sum to 1, and represent affinities (compatibilities).
π1(A,D): a0 d0 = 100, a0 d1 = 1, a1 d0 = 1, a1 d1 = 100
π2(A,B): a0 b0 = 30, a0 b1 = 5, a1 b0 = 1, a1 b1 = 10
π3(C,D): c0 d0 = 1, c0 d1 = 100, c1 d0 = 100, c1 d1 = 1
π4(B,C): b0 c0 = 100, b0 c1 = 1, b1 c0 = 1, b1 c1 = 1000

Undirected Graphical Models (Informal)
Represents a joint distribution.
Unnormalized factor: F(a,b,c,d) = π1(a,d) · π2(a,b) · π3(c,d) · π4(b,c)
Partition function: Z = Σ_{a,b,c,d} π1(a,d) · π2(a,b) · π3(c,d) · π4(b,c)
Probability: P(a,b,c,d) = (1/Z) π1(a,d) · π2(a,b) · π3(c,d) · π4(b,c)
As undirected graphical models represent joint distributions, they can be used for answering queries.

Undirected Graphical Models Blurb
Useful when edge directionality cannot be assigned: simpler interpretation of structure, simpler inference, simpler independency structure. Harder to learn parameters/structures.
We will also see models with combined directed and undirected edges.
Undirected graphical models are also called Markov networks.
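
As a concrete check of the factorization just described, here is a minimal Python sketch (the factor tables are the ones transcribed above; treat the exact values as illustrative) that computes the unnormalized measure F, the partition function Z, and a simple marginal query by brute-force enumeration over the 16 joint assignments.

import itertools

# Factor tables from the misconception example (0/1 assignments).
pi1 = {(0, 0): 100, (0, 1): 1,   (1, 0): 1,   (1, 1): 100}   # pi1(A, D)
pi2 = {(0, 0): 30,  (0, 1): 5,   (1, 0): 1,   (1, 1): 10}    # pi2(A, B)
pi3 = {(0, 0): 1,   (0, 1): 100, (1, 0): 100, (1, 1): 1}     # pi3(C, D)
pi4 = {(0, 0): 100, (0, 1): 1,   (1, 0): 1,   (1, 1): 1000}  # pi4(B, C)

def F(a, b, c, d):
    # Unnormalized measure: product of the local factors.
    return pi1[(a, d)] * pi2[(a, b)] * pi3[(c, d)] * pi4[(b, c)]

# Partition function: sum of F over all joint assignments.
Z = sum(F(a, b, c, d) for a, b, c, d in itertools.product((0, 1), repeat=4))

def P(a, b, c, d):
    return F(a, b, c, d) / Z

# Example query: the marginal P(B = 1).
p_b1 = sum(P(a, 1, c, d) for a, c, d in itertools.product((0, 1), repeat=3))
print(Z, p_b1)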

Markov Network Structure
Undirected graph H. Nodes X1, ..., Xn represent random variables. H encodes independence assumptions.
A path X1 - X2 - ... - Xk is active if none of the Xi variables along the path are observed.
X and Y are separated in H given Z if there is no active path between any node x ∈ X and any node y ∈ Y given Z. Denoted sep_H(X; Y | Z).
Global independencies associated with H: I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}

Relationship with Bayesian Networks
Bayesian network: local independencies; independence by d-separation (global).
Markov network: global independencies; local independencies.
Can all independencies encoded by Markov networks be encoded by Bayesian networks? No. Counter example: (A ⊥ C | B,D) and (B ⊥ D | A,C).
Can all independencies encoded by Bayesian networks be encoded by Markov networks? No: immoral v-structures (explaining away).
Markov networks encode monotonic independencies: if sep_H(X; Y | Z) and Z ⊆ Z', then sep_H(X; Y | Z').
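
As a sketch of the separation test sep_H(X; Y | Z) defined above (not part of the original slides; the adjacency-dict representation is an assumption), note that X and Y are separated given Z exactly when no path from X reaches Y once the nodes in Z are removed:

from collections import deque

def separated(adj, X, Y, Z):
    # sep_H(X; Y | Z): no active path between X and Y once nodes in Z are removed.
    blocked = set(Z)
    frontier = deque(x for x in X if x not in blocked)
    visited = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in Y:
            return False          # found an active path into Y
        for v in adj[u]:
            if v not in visited and v not in blocked:
                visited.add(v)
                frontier.append(v)
    return True

# The misconception network: the 4-cycle A - B - C - D - A.
adj = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(separated(adj, {'A'}, {'C'}, {'B', 'D'}))   # True:  (A ⊥ C | B,D)
print(separated(adj, {'A'}, {'C'}, {'B'}))        # False: D leaves an active path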

Markov Network Factors
A factor is a function from value assignments of a set of random variables D to positive real numbers R+. The set of variables D is the scope of the factor.
Factors generalize the notion of CPDs: every CPD is a factor (with additional constraints).
Example (for a chain W - X - Y): π1(X,W): x0 w0 = 100, x0 w1 = 1, x1 w0 = 1, x1 w1 = 100; π2(X,Y): x0 y0 = 30, x0 y1 = 5, x1 y0 = 1, x1 y1 = 10.

Factors and Joint Distribution
Can we represent any joint distribution by using only factors that are defined on edges? No! Compare the number of parameters.
Example: n binary RVs. The joint distribution has 2^n - 1 independent parameters, while a Markov network with edge factors has 4·(n choose 2) parameters. For a fully connected graph over 7 variables: needed 2^7 - 1 = 127, but the edge parameters number only 4·(7 choose 2) = 84.
Factors introduce constraints on the joint distribution.

Factors and Graph Structure
Are there constraints imposed on the network structure H by a factor whose scope is D?
Hint 1: think of the independencies that must be satisfied. Hint 2: generalize from the basic case of |D| = 2.
The induced subgraph over D must be a clique (fully connected). Why? Otherwise two unconnected variables may be independent by blocking the active path between them, contradicting the direct dependency between them in the factor over D.
[Figure: a single factor over X1, X2, X3, X4 (entries such as π(FFFF) = 100, π(FFFT) = 5, π(FFTF) = 3, π(FFTT) = 100, ...) requires {X1, X2, X3, X4} to form a clique.]

Markov Network Factors: Examples
For the 4-cycle A - B - C - D, the maximal cliques are {A,B}, {B,C}, {C,D}, {A,D}.
Adding a chord (e.g. B - D) leaves only two maximal cliques: {A,B,D}, {B,C,D}.
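
The maximal cliques determine the coarsest factorization the structure allows. As a quick illustration (assuming networkx is available; the graphs are the two examples above), nx.find_cliques enumerates the maximal cliques:

import networkx as nx

# 4-cycle A - B - C - D: the maximal cliques are just the edges.
square = nx.Graph([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A')])
print(sorted(map(sorted, nx.find_cliques(square))))
# [['A', 'B'], ['A', 'D'], ['B', 'C'], ['C', 'D']]

# Adding the chord B - D merges them into two triangles.
chorded = square.copy()
chorded.add_edge('B', 'D')
print(sorted(map(sorted, nx.find_cliques(chorded))))
# [['A', 'B', 'D'], ['B', 'C', 'D']]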

Markov Network Distribution
A distribution P factorizes over H if it has:
A set of subsets D1, ..., Dm, where each Di is a complete (fully connected) subgraph in H.
Factors π1(D1), ..., πm(Dm) such that P(X1, ..., Xn) = (1/Z) f(X1, ..., Xn), where the unnormalized factor is f(X1, ..., Xn) = Π_i πi(Di) and Z = Σ_{X1, ..., Xn} f(X1, ..., Xn) is called the partition function.
P is also called a Gibbs distribution over H.

Pairwise Markov Networks
A pairwise Markov network over a graph H has:
A set of node potentials {π(Xi) : i = 1, ..., n}.
A set of edge potentials {π(Xi, Xj) : (Xi - Xj) ∈ H}.
Example: a grid network over X11, X12, ..., X34 (a 3 × 4 grid, each node connected to its neighbors).

Logarithmic Representation
We represent energy potentials by applying a log transformation to the original potentials: π = exp(-ε), where ε = -ln π.
Any Markov network parameterized with (positive) factors can be converted to a logarithmic representation. The log-transformed potentials can take on any real value.
The joint distribution decomposes as P(X1, ..., Xn) = (1/Z) exp( -Σ_{i=1}^{m} εi(Di) ).
Log P(X) is a linear function.

I-Maps and Factorization
Independency mappings (I-maps): I(P) is the set of independencies (X ⊥ Y | Z) that hold in P. A graph is an I-map of P if the set of independencies implied by the graph is a subset of I(P).
Bayesian networks: factorization and reverse factorization theorems. G is an I-map of P iff P factorizes as P(X1, ..., Xn) = Π_{i=1}^{n} P(Xi | Pa(Xi)).
Markov networks: factorization and reverse factorization theorems. H is an I-map of P iff P factorizes as P(X1, ..., Xn) = (1/Z) Π_i πi(Di).
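
A small numerical check of the logarithmic representation above, reusing the hypothetical factor dicts from the earlier sketch: each factor is converted to an energy ε = -ln π, log F becomes a sum of negative energies, and the normalized probability is recovered by subtracting log Z.

import itertools
import math

pi1 = {(0, 0): 100, (0, 1): 1,   (1, 0): 1,   (1, 1): 100}   # pi1(A, D)
pi2 = {(0, 0): 30,  (0, 1): 5,   (1, 0): 1,   (1, 1): 10}    # pi2(A, B)
pi3 = {(0, 0): 1,   (0, 1): 100, (1, 0): 100, (1, 1): 1}     # pi3(C, D)
pi4 = {(0, 0): 100, (0, 1): 1,   (1, 0): 1,   (1, 1): 1000}  # pi4(B, C)

# Energy form: eps = -ln(pi), so pi = exp(-eps).
def to_energy(table):
    return {k: -math.log(v) for k, v in table.items()}

eps1, eps2, eps3, eps4 = map(to_energy, (pi1, pi2, pi3, pi4))

def log_unnorm(a, b, c, d):
    # log F(a,b,c,d) is a linear function: a sum of negative energies.
    return -(eps1[(a, d)] + eps2[(a, b)] + eps3[(c, d)] + eps4[(b, c)])

logZ = math.log(sum(math.exp(log_unnorm(*x))
                    for x in itertools.product((0, 1), repeat=4)))
print(math.exp(log_unnorm(0, 0, 0, 0) - logZ))   # P(a0, b0, c0, d0)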

Reverse Factorization
If P(X1, ..., Xn) = (1/Z) Π_i πi(Di), then H is an I-map of P.
Proof: Let X, Y, W be any three disjoint sets of variables such that W separates X and Y in H. We need to show (X ⊥ Y | W) ∈ I(P).
Case 1: X ∪ Y ∪ W = U (all variables). As W separates X and Y, there are no direct edges between X and Y, so any clique in H is fully contained in X ∪ W or in Y ∪ W. Let I_X index the factors whose scope lies in X ∪ W, and I_Y index the remaining factors (whose scope lies in Y ∪ W but not in X ∪ W). Then
P(X1, ..., Xn) = (1/Z) Π_{i ∈ I_X} πi(Di) · Π_{i ∈ I_Y} πi(Di) = (1/Z) f(X, W) g(Y, W),
hence (X ⊥ Y | W) ∈ I(P).

Reverse Factorization (cont.)
Case 2: X ∪ Y ∪ W ⊂ U (not all variables). Let S = U - (X ∪ Y ∪ W). S can be partitioned into two disjoint sets S1 and S2 such that W separates X ∪ S1 and Y ∪ S2 in H. From Case 1 we can derive (X ∪ S1 ⊥ Y ∪ S2 | W) ∈ I(P). From decomposition of independencies, (X ⊥ Y | W) ∈ I(P).

Factorization
If H is an I-map of P, then P(X1, ..., Xn) = (1/Z) Π_i πi(Di). Holds only for positive distributions P (Hammersley-Clifford theorem; proof deferred).

Relationship with Bayesian Networks
Bayesian networks: semantics defined via local independencies I_L(G); global independencies induced by d-separation; local and global independencies are equivalent since one implies the other.
Markov networks: semantics defined via the global separation property I(H). Can we define the induced local independencies? We show two definitions (call them local Markov assumptions). All three definitions (the global one and the two local ones) are equivalent only for positive distributions P.

Pairwise Independencies
Every pair of non-adjacent nodes is separated given all other nodes in the network.
Formally: I_P(H) = { (X ⊥ Y | U - {X, Y}) : (X - Y) ∉ H }
[Example figure: a small network over A, B, C, D, E and the pairwise independencies it implies.]

Local Independencies
Every node is independent of all other nodes given its immediate neighbors in the network, the Markov blanket of X, denoted M_H(X).
Formally: I_L(H) = { (X ⊥ U - {X} - M_H(X) | M_H(X)) : X ∈ H }
[Example figure: the same network over A, B, C, D, E and the local (Markov blanket) independencies it implies.]
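
The two families of assertions above are purely graph-derived, so they can be read off an adjacency structure directly. A small sketch (the 5-node chain is an illustrative stand-in for the slide's example network) listing I_P(H) and I_L(H):

def pairwise_independencies(adj):
    # I_P(H): (X ⊥ Y | U - {X, Y}) for every non-adjacent pair X, Y.
    nodes = sorted(adj)
    return [(x, y, set(nodes) - {x, y})
            for i, x in enumerate(nodes) for y in nodes[i + 1:]
            if y not in adj[x]]

def local_independencies(adj):
    # I_L(H): (X ⊥ U - {X} - M_H(X) | M_H(X)) for every node X.
    nodes = set(adj)
    return [(x, nodes - {x} - set(adj[x]), set(adj[x])) for x in sorted(adj)]

# Illustrative chain A - B - C - D - E.
adj = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'C', 'E'}, 'E': {'D'}}
print(pairwise_independencies(adj))
print(local_independencies(adj))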

Relationship Between Properties
Let I(H) be the global separation independencies, I_L(H) the local (Markov blanket) independencies, and I_P(H) the pairwise independencies.
For any distribution P: if P satisfies I(H), then P satisfies I_L(H). The assertion in I_L(H) that a node is independent of all other nodes given its neighbors is part of the separation independencies, since there is no active path between a node and its non-neighbors given its neighbors.
If P satisfies I_L(H), then P satisfies I_P(H). This follows from the monotonicity of independencies in Markov networks (if (X ⊥ Y | Z) and Z ⊆ Z', then (X ⊥ Y | Z')).

Relationship Between Properties (cont.)
For any positive distribution P: if P satisfies I_P(H), then P satisfies I(H). The proof relies on the intersection property for probabilities, (X ⊥ Y | Z,W) and (X ⊥ W | Z,Y) ⇒ (X ⊥ Y,W | Z), which holds in general only for positive distributions (details in the textbook).
Thus, for positive distributions, I(H), I_L(H) and I_P(H) are equivalent. How about a non-positive distribution?

The Need for Positive Distributions
Let P satisfy: A is uniformly distributed and A = B = C (each variable is a copy of A). Let H be the graph over {A, B, C} with no edges.
P satisfies I_P(H): assertions such as (A ⊥ B | C) and (A ⊥ C | B) hold, since each variable determines all others.
P does not satisfy I_L(H): (A ⊥ {B, C}) needs to hold according to I_L(H), but it does not hold in the distribution.

Constructing a Markov Network for P
Goal: given a distribution P, we want to construct a Markov network which is an I-map of P. Complete (fully connected) graphs will satisfy this but are not interesting.
Minimal I-maps: a graph G is a minimal I-map for P if G is an I-map for P and removing any edge from G renders it not an I-map.
Goal: construct a graph which is a minimal I-map of P.

Constructing a Markov Network for P
If P is a positive distribution, then I(H), I_L(H) and I_P(H) are equivalent, so it is sufficient to construct a network that satisfies I_P(H).
Construction algorithm: for every pair (X, Y), add the edge X - Y if (X ⊥ Y | U - {X, Y}) does not hold in P.
Theorem: the resulting network is the minimal and unique I-map.
Proof: The I-map property follows since I_P(H) holds by construction and I(H) by equivalence. Minimality follows since deleting an edge implies (X ⊥ Y | U - {X, Y}), but we know by construction that this does not hold in P, since we added the edge in the construction process. Uniqueness follows since any other I-map has at least these edges and, to be minimal, cannot have additional edges.

Constructing a Markov Network for P (via local independencies)
If P is a positive distribution, then I(H), I_L(H) and I_P(H) are equivalent, so it is sufficient to construct a network that satisfies I_L(H).
Construction algorithm: connect each X to every node in the minimal set Y such that (X ⊥ U - {X} - Y | Y) holds in P.
Theorem: the resulting network is the minimal and unique I-map.
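
The pairwise construction above can be run literally on a small, explicitly tabulated distribution. A sketch (the distribution, the brute-force independence test and the tolerance are all illustrative assumptions, feasible only for a handful of variables):

import itertools

def minimal_imap(P, names):
    # Pairwise construction: add edge X - Y whenever (X ⊥ Y | U - {X, Y}) fails in P.
    # P maps full 0/1 assignment tuples (ordered as `names`) to probabilities.
    idx = {v: i for i, v in enumerate(names)}
    edges = set()
    for x, y in itertools.combinations(names, 2):
        others = [v for v in names if v not in (x, y)]
        for rest in itertools.product((0, 1), repeat=len(others)):
            ctx = dict(zip(others, rest))
            def prob(xv=None, yv=None):
                return sum(p for full, p in P.items()
                           if all(full[idx[v]] == ctx[v] for v in others)
                           and (xv is None or full[idx[x]] == xv)
                           and (yv is None or full[idx[y]] == yv))
            z = prob()
            if z == 0:
                continue
            if any(abs(prob(a, b) / z - (prob(xv=a) / z) * (prob(yv=b) / z)) > 1e-9
                   for a in (0, 1) for b in (0, 1)):
                edges.add((x, y))
                break
    return edges

# Illustrative chain A -> B -> C: the minimal I-map should be A - B - C, with no A - C edge.
names = ('A', 'B', 'C')
P = {}
for a, b, c in itertools.product((0, 1), repeat=3):
    pa = 0.6 if a == 1 else 0.4
    pb = 0.9 if b == a else 0.1
    pc = 0.8 if c == b else 0.2
    P[(a, b, c)] = pa * pb * pc
print(minimal_imap(P, names))   # edges {('A', 'B'), ('B', 'C')}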

Markov Network Parameterization
Markov networks have too many degrees of freedom: a clique over n binary variables has 2^n parameters, but the joint has only 2^n - 1 independent parameters.
If the network has cliques {A,B} and {B,C}, both capture information about B, and we can choose where we want to encode it (in which clique): we can add/subtract between the cliques, so we can come up with infinitely many sets of factor values that lead to the same distribution.
Need: conventions for avoiding ambiguity in parameterization. This can be done using a canonical parameterization (see K&F 4.4.2.1).

Factor Graphs
From the Markov network structure we do not know whether the parameterization involves maximal cliques or edge potentials. Example: a fully connected graph may have pairwise potentials or one large (exponential) potential over all nodes.
Solution: factor graphs. An undirected graph with two types of nodes: variable nodes and factor nodes. Parameterization: each factor node is associated with exactly one factor, and the scope of the factor is the set of all variables neighboring the factor node.
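
To make the distinction concrete, here is a minimal sketch (the dict-of-scopes representation and the factor names are assumptions, not the slides' notation) of two factor graphs for the same fully connected Markov network over {A, B, C}: one with a single joint factor, one with pairwise factors.

# Factor graph as a dict: factor name -> scope (its neighboring variable nodes).
joint_param = {
    'f_ABC': ('A', 'B', 'C'),     # one exponential-size factor over all nodes
}
pairwise_param = {
    'f_AB': ('A', 'B'),           # three pairwise factors
    'f_BC': ('B', 'C'),
    'f_AC': ('A', 'C'),
}

def bipartite_edges(factor_graph):
    # Edges of the bipartite factor graph: each factor node connects to every variable in its scope.
    return [(f, v) for f, scope in factor_graph.items() for v in scope]

print(bipartite_edges(joint_param))
print(bipartite_edges(pairwise_param))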

Factor Graphs Example
[Figure: the same Markov network drawn together with its factor graph for the exponential (joint) parameterization and its factor graph for the pairwise parameterization.]

Local Structure
Factor graphs still encode complete tables.
A feature φ on variables D is an indicator function: for some assignment y of D, φ(x) = 1 when x = y, and φ(x) = 0 otherwise.
A distribution P is a log-linear model over H if it has features φ1, ..., φk over scopes D1, ..., Dk, where each Di is a complete subgraph in H, and a set of weights w1, ..., wk, such that
P(X1, ..., Xn) = (1/Z) exp( Σ_{i=1}^{k} wi φi(Di) ).
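
A minimal sketch of a log-linear model in this sense (the variables, scopes, targets and weights are all made up for illustration; each scope is a node or edge of a chain A - B - C, hence a complete subgraph):

import itertools
import math

# Each feature is (target assignment over its scope, weight).
features = [
    ({'A': 1, 'B': 1}, 2.0),    # phi_1: indicator of A=1, B=1, weight w_1
    ({'B': 1, 'C': 1}, 1.5),    # phi_2: indicator of B=1, C=1, weight w_2
    ({'B': 0}, -0.5),           # phi_3: indicator of B=0,      weight w_3
]

def score(assign):
    # sum_i w_i * phi_i: a feature fires iff its target assignment matches.
    return sum(w for target, w in features
               if all(assign[v] == val for v, val in target.items()))

assignments = [dict(zip('ABC', vals)) for vals in itertools.product((0, 1), repeat=3)]
Z = sum(math.exp(score(x)) for x in assignments)
P = {tuple(x.values()): math.exp(score(x)) / Z for x in assignments}
print(P[(1, 1, 1)])   # probability of A=1, B=1, C=1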

Feature Representation
Several features can be defined on one clique, and any factor can be represented by features; in the most general case we define a feature and a weight for each entry in the factor.
The log-linear model is more compact for many distributions, especially with large-domain variables. The representation is intuitive and modular: features can be modularly added between any interacting sets of variables.

Markov Network Parameterizations
Choice 1: Markov network. Product over potentials. The right representation for discussing independence queries.
Choice 2: Factor graph. Product over factors. Useful for inference (later).
Choice 3: Log-linear model. Product over feature weights. Useful for discussing parameterizations and for representing context-specific structures.
All parameterizations are interchangeable.

Domain Application: Vision
The image segmentation problem. Task: partition an image into distinct parts of the scene. Example: separate water, sky, background.

Markov Network for Segmentation
Grid-structured Markov network. Random variable Xi corresponds to pixel i; its domain is {1, ..., K}, and its value represents the region assignment of pixel i. Neighboring pixels are connected in the network (e.g. a grid X11, X12, ..., X34).

Markov Network for Segmentation
Appearance distribution: w_ik measures the extent to which pixel i fits region k (e.g. the difference from the typical pixel of region k). Introduce the node potential exp(-w_ik · 1{X_i = k}).
Edge potentials: encode a contiguity preference via the edge potential exp(λ · 1{X_i = X_j}) for λ > 0.
[Figure: the grid network annotated with node potentials (appearance distribution) and edge potentials (contiguity preference).]
Solution: inference. Find the most likely assignment to the X_i variables.
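
To illustrate the two kinds of potentials just described, here is a toy sketch (the grid size, the appearance costs w, the value of λ and the brute-force MAP search are all illustrative assumptions; real segmentations use approximate inference):

import itertools

# Toy 2x2 grid, K = 2 regions; w[i] lists the appearance cost of assigning
# pixel i to each region (smaller = better fit).
pixels = [(0, 0), (0, 1), (1, 0), (1, 1)]
w = {(0, 0): [0.2, 1.5], (0, 1): [0.3, 1.2], (1, 0): [1.4, 0.1], (1, 1): [1.1, 0.4]}
lam = 0.8   # contiguity strength, lambda > 0

# 4-neighbor grid edges.
edges = [(p, q) for p in pixels for q in pixels
         if p < q and abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1]

def log_score(assign):
    # log of  prod_i exp(-w_ik) * prod_{(i,j)} exp(lambda * 1{X_i = X_j})
    node = -sum(w[p][assign[p]] for p in pixels)
    edge = lam * sum(assign[p] == assign[q] for p, q in edges)
    return node + edge

best = max((dict(zip(pixels, labels))
            for labels in itertools.product(range(2), repeat=len(pixels))),
           key=log_score)
print(best)   # MAP segmentation: here the top row takes region 0, the bottom row region 1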

Example Results
Result of segmentation using node potentials alone, so that each pixel is classified independently, compared with the result of segmentation using a pairwise Markov network encoding interactions between adjacent pixels.

Summary: Markov Network Representation
Independencies in graph H:
Global independencies: I(H) = {(X ⊥ Y | Z) : sep_H(X; Y | Z)}
Local independencies: I_L(H) = {(X ⊥ U - {X} - M_H(X) | M_H(X)) : X ∈ H}
Pairwise independencies: I_P(H) = {(X ⊥ Y | U - {X, Y}) : (X - Y) ∉ H}
For any positive distribution P they are equivalent.
(Reverse) factorization theorem: I-map ⟺ factorization.
Markov network factors: have to encompass cliques; maximal cliques vs. edge factors.
Log-linear model: features instead of factors.
Pairwise Markov network: node/edge potentials.
Application in vision (image segmentation).
What next? Constructing Markov networks from Bayesian networks; hybrid models (e.g. Conditional Random Fields).