Undirected Graphical Models


Hong Chang, Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012).

Outline: 1. Introduction; 2. Properties; 3. Generative vs. Conditional Models, MEMM and Label Bias.

Graphical Models: A Few Definitions. A graph consists of nodes (vertices) connected by links (arcs, edges). A node represents a random variable; a link represents a probabilistic relationship. Directed graphical models, or Bayesian networks, are useful for expressing causal relationships between variables. Undirected graphical models, or Markov random fields, are useful for expressing soft constraints between variables. Factor graphs are convenient for solving inference problems.

The Family of Graphical Models

Main References and Readings: Section 8.3 of C. M. Bishop, Pattern Recognition and Machine Learning. J. Lafferty, A. McCallum and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001 (Test-of-Time Award, 2011). A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NIPS 2001. B. Taskar, C. Guestrin and D. Koller. Max-Margin Markov Networks. NIPS 2003. D. McAllester, T. Hazan and J. Keshet. Direct Loss Minimization for Structured Output Learning. NIPS 2010.

Markov Random Fields: Also called Markov networks. They consist of nodes which correspond to variables or groups of variables. Links within the graph do not carry arrows. Conditional independence is determined by simple graph separation.

Conditional Independence: Conditional independence properties simplify both the model structure and the computations needed to perform inference and learning under that model. Let a, b and c be three variables. If p(a | b, c) = p(a | c), then we say that a is conditionally independent of b given c, denoted a ⊥ b | c.

We identify three sets of nodes A, B, and C. To test whether the conditional independence property A ⊥ B | C holds, we consider all possible paths that connect nodes in set A to nodes in set B. If all such paths pass through one or more nodes in set C, then the conditional independence property holds. Testing for conditional independence in undirected graphs is simpler than in directed graphs.

Alternative View: An alternative way to view the conditional independence test is to imagine removing all nodes in set C from the graph, together with any links that connect to those nodes. We then ask if there exists a path that connects any node in A to any node in B. If there are no such paths, then the conditional independence property must hold.
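
This separation test is straightforward to implement. The sketch below is my own illustration (the `is_separated` helper is not from the lecture): it removes the nodes in C and uses a breadth-first search to check whether any node in A can still reach a node in B.

```python
from collections import deque

def is_separated(edges, A, B, C):
    """Return True if every path from A to B passes through C,
    i.e. A and B are separated by C in the undirected graph."""
    A, B, C = set(A), set(B), set(C)
    # Build adjacency lists, dropping the nodes in C and their links.
    adj = {}
    for u, v in edges:
        if u in C or v in C:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # BFS from the nodes in A; if we ever reach B, they are not separated.
    seen, queue = set(A), deque(A)
    while queue:
        u = queue.popleft()
        if u in B:
            return False
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Chain a - c - b: conditioning on c separates a from b.
print(is_separated([("a", "c"), ("c", "b")], {"a"}, {"b"}, {"c"}))  # True
```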

Markov Blanket: The Markov blanket for an undirected graph takes a particularly simple form, because a node will be conditionally independent of all other nodes conditioned only on the neighboring nodes. Thus, the Markov blanket of a node simply consists of the set of all neighboring nodes.

Consideration: If we consider two nodes (variables) x_i and x_j that are not connected by a link, then they must be conditionally independent given all other nodes in the graph. Corresponding conditional independence property: p(x_i, x_j | x_\{i,j}) = p(x_i | x_\{i,j}) p(x_j | x_\{i,j}), where x_\{i,j} denotes the set x of all variables with x_i and x_j removed. The factorization of the joint distribution must be such that x_i and x_j do not appear in the same factor in order for the conditional independence property to hold for all possible distributions belonging to the graph.

Cliques: A clique is a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset, i.e., it is a fully connected or complete subgraph. A maximal clique is a clique such that it is not possible to include in the set any other nodes from the graph without it ceasing to be a clique. Figure: a four-node undirected graph over x_1, x_2, x_3, x_4 showing a clique (outlined in green) and a maximal clique (outlined in blue).

Factorization Based on Cliques: We can define the factors in the decomposition of the joint distribution to be functions of the variables in the maximal cliques. Let C denote a maximal clique and x_C the set of variables in C. Joint distribution: p(x) = (1/Z) ∏_C ψ_C(x_C) (1), which is a product of potential functions ψ_C(x_C) over the maximal cliques of the graph. The partition function Z is a normalization constant given by Z = Σ_x ∏_C ψ_C(x_C), which ensures that p(x) is a probability distribution.
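
To make equation (1) concrete, the following sketch (my own illustration, not from the lecture) enumerates all states of a small binary MRF, computes the partition function Z by brute force, and evaluates p(x) for one configuration. The potential values are arbitrary choices.

```python
import itertools
import numpy as np

# Maximal cliques of a small graph over binary variables x_0..x_3,
# each with an arbitrary non-negative potential psi_C(x_C).
cliques = {
    (0, 1, 2): lambda x: np.exp(0.5 * (x[0] == x[1]) + 0.3 * (x[1] == x[2])),
    (2, 3):    lambda x: np.exp(1.0 * (x[2] == x[3])),
}

def unnormalised(x):
    """Product of clique potentials for a full assignment x."""
    p = 1.0
    for clique, psi in cliques.items():
        p *= psi([x[i] for i in clique])
    return p

# Partition function: sum over all 2^4 joint states.
states = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalised(x) for x in states)

x = (1, 1, 1, 1)
print("Z =", Z, "  p(x) =", unnormalised(x) / Z)
```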

Potential Functions: To ensure that p(x) ≥ 0, we consider only potential functions such that ψ_C(x_C) ≥ 0. Unlike directed graphs, in which each factor represents the conditional distribution of the corresponding variable conditioned on the state of its parents, here we do not restrict the choice of potential functions to those that have a specific probabilistic interpretation as marginal or conditional distributions. Due to the generality of the potential functions, their product will in general not be correctly normalized, and hence the partition function has to be introduced as a normalization factor.

Partition Function: The need for the partition function as a normalization constant is one of the major limitations of undirected graphs: if a graph has M discrete nodes, each with K states, then evaluation of the partition function involves summing over K^M states, which is exponential in the size of the graph. The partition function is needed for parameter learning, because it is a function of any parameters that govern the potential functions ψ_C(x_C). However, there are situations in which evaluation of the partition function is not needed. For example, it is not needed for evaluating local conditional distributions, because a conditional distribution is the ratio of two marginal distributions, and the partition function cancels between numerator and denominator when evaluating this ratio.
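
As a small illustration of the last point (again my own sketch, reusing the `cliques` and `unnormalised` helpers defined in the sketch above), a local conditional p(x_i | rest) can be computed without ever evaluating the global Z, because only a sum over the states of x_i is needed:

```python
def conditional(i, value, x_rest, num_states=2):
    """p(x_i = value | all other variables), computed without the global Z:
    the partition function cancels, leaving a sum over x_i's states only."""
    def joint_with(v):
        x = dict(x_rest)
        x[i] = v
        return unnormalised([x[k] for k in sorted(x)])
    return joint_with(value) / sum(joint_with(v) for v in range(num_states))

# p(x_2 = 1 | x_0 = 1, x_1 = 1, x_3 = 0) for the toy MRF above.
print(conditional(2, 1, {0: 1, 1: 1, 3: 0}))
```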

Factorization and Conditional Independence: We need to restrict attention to potential functions ψ_C(x_C) that are strictly positive, i.e., ψ_C(x_C) > 0. Consider an undirected graph whose nodes correspond to a fixed set of variables. The Hammersley-Clifford theorem states that the following two sets of distributions are identical: (i) the set of distributions that are consistent with the set of conditional independence statements that can be read from the graph using graph separation; (ii) the set of distributions that can be expressed as a factorization of the form (1) with respect to the maximal cliques of the graph.

Exponential Representation of Potential Functions: Since ψ_C(x_C) > 0, it is convenient to express them as exponentials: ψ_C(x_C) = exp{−E(x_C)}, where E(x_C) is called an energy function and the exponential representation is called the Boltzmann distribution. The joint distribution is defined as the product of potentials, and so the total energy is the sum of the energies of the maximal cliques.

Illustrative Example: Original binary image (left) and corrupted image after randomly changing 10% of the pixels (right). Restored images obtained using iterated conditional modes (left), which gives a locally optimal solution, and using the graph-cut algorithm (right), which guarantees a globally optimal solution.

Illustrative Example (2): An undirected graphical model representing a Markov random field for image denoising. x_i ∈ {−1, +1} denotes the state of pixel i in the unknown noise-free image and y_i ∈ {−1, +1} denotes the corresponding value of pixel i in the observed noisy image. Two types of strong correlation give two types of cliques: cliques of the form {x_i, y_i} with energy function −η x_i y_i for some constant η > 0, and cliques of the form {x_i, x_j} for neighboring pixels i and j with energy function −β x_i x_j for some constant β > 0.

Illustrative Example (3): Because a potential function is an arbitrary, nonnegative function over a maximal clique, we can multiply it by any nonnegative function of a subset of the clique variables, or equivalently add the corresponding energies. This corresponds to adding an extra term h x_i for each pixel i in the noise-free image. Such a term has the effect of biasing the model towards pixel values that have one particular sign in preference to the other. (If h = 0, the two states of x_i are equally probable.) Complete energy function for the model: E(x, y) = h Σ_i x_i − β Σ_{i,j} x_i x_j − η Σ_i x_i y_i. Corresponding joint distribution over x and y: p(x, y) = (1/Z) exp{−E(x, y)}.
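
Below is a minimal sketch of iterated conditional modes (ICM) for this energy, assuming 2-D numpy arrays x and y with entries in {−1, +1}; the function name and the parameter values are illustrative, not the ones used for the figures in the lecture.

```python
import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
    """Iterated conditional modes for the pairwise MRF
    E(x, y) = h*sum_i x_i - beta*sum_{i~j} x_i x_j - eta*sum_i x_i y_i.
    y is a 2-D array with entries in {-1, +1}; returns the denoised x."""
    x = y.copy()
    H, W = y.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                # Sum over the 4-neighbourhood of pixel (i, j).
                nb = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                         if 0 <= a < H and 0 <= b < W)
                # Energy contribution of this pixel for x_ij = +1 and x_ij = -1;
                # keep the state of lower energy (higher probability).
                e_plus = h - beta * nb - eta * y[i, j]
                e_minus = -h + beta * nb + eta * y[i, j]
                x[i, j] = 1 if e_plus < e_minus else -1
    return x
```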

Illustrative Example (4): Image denoising corresponds to finding an image x that maximizes the conditional distribution p(x | y), with y fixed to the observed noisy image. In practice, it may not be feasible to find the x with maximum probability, only one with sufficiently high probability. Different iterative optimization algorithms may be used, starting from some initial value of x, e.g., setting x equal to y.

Directed-to-Undirected Graph Conversion: Simple Case. Given a directed graph, we convert it into an undirected graph. E.g., consider a directed chain x_1 → x_2 → ... → x_N and the corresponding undirected chain x_1 − x_2 − ... − x_N. Joint distribution of the directed graph: p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_2) ... p(x_N | x_{N−1}). Joint distribution of the undirected graph: p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N−1,N}(x_{N−1}, x_N).

Directed-to-Undirected Graph Conversion: Simple Case (2). From the two joint distributions we can identify the following equivalences: ψ_{1,2}(x_1, x_2) = p(x_1) p(x_2 | x_1), ψ_{2,3}(x_2, x_3) = p(x_3 | x_2), ..., ψ_{N−1,N}(x_{N−1}, x_N) = p(x_N | x_{N−1}). Note that the partition function Z = Σ_x ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N−1,N}(x_{N−1}, x_N) = Σ_x p(x_1) p(x_2 | x_1) p(x_3 | x_2) ... p(x_N | x_{N−1}) = 1.
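
A quick numerical check of this fact (my own sketch, with arbitrary conditional probability tables for a 3-node binary chain): summing the product of the potentials defined above over all joint states indeed gives Z = 1.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary CPTs for a binary chain x1 -> x2 -> x3.
p1 = rng.dirichlet(np.ones(2))            # p(x1)
p21 = rng.dirichlet(np.ones(2), size=2)   # p(x2 | x1), row-indexed by x1
p32 = rng.dirichlet(np.ones(2), size=2)   # p(x3 | x2), row-indexed by x2

# Clique potentials of the undirected chain, built from the conditionals.
psi12 = lambda x1, x2: p1[x1] * p21[x1, x2]
psi23 = lambda x2, x3: p32[x2, x3]

Z = sum(psi12(x1, x2) * psi23(x2, x3)
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))
print(Z)  # 1.0, up to floating-point error
```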

Directed-to-Undirected Graph Conversion: General Case. Conversion can be achieved if the clique potentials of the undirected graph are given by the conditional distributions of the directed graph. This requires that the set of variables appearing in each of the conditional distributions be a member of at least one clique of the undirected graph. If a node in the directed graph has one parent, this can be achieved simply by replacing the directed link with an undirected link. If it has multiple parents, we also need to add extra links between all pairs of parents (thus discarding some conditional independence properties).

Directed-to-Undirected Graph Conversion: General Case (2). Consider a directed graph over x_1, x_2, x_3, x_4 in which x_4 has parents x_1, x_2 and x_3. The factor p(x_4 | x_1, x_2, x_3) in the directed graph involves the four variables x_1, x_2, x_3, and x_4, so they must all belong to a single clique in the undirected graph if this conditional distribution is to be absorbed into a clique potential. The process of adding extra links between the parents (marrying the parents) is known as moralization.
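
A small sketch of the moralization step (my own helper, not from the lecture): for every node, connect all pairs of its parents and then drop the edge directions.

```python
import itertools

def moralize(parents):
    """parents: dict mapping each node to the list of its parents.
    Returns the undirected (moral) edge set."""
    edges = set()
    for child, pa in parents.items():
        # Drop directions: link the child to each of its parents.
        for p in pa:
            edges.add(frozenset((p, child)))
        # "Marry the parents": link every pair of parents.
        for p, q in itertools.combinations(pa, 2):
            edges.add(frozenset((p, q)))
    return edges

# The example above: x4 has parents x1, x2, x3.
print(moralize({"x1": [], "x2": [], "x3": [], "x4": ["x1", "x2", "x3"]}))
```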

Chain Graphs: The graphical framework can be extended in a consistent way to graphs that include both directed and undirected links, called chain graphs. Directed graphs and undirected graphs can be considered as special cases of chain graphs.

Conditional Random Fields (CRFs): Like a Markov random field, a conditional random field (CRF) is an undirected graphical model. Unlike a Markov random field, the distribution of each discrete variable Y in the graph is conditioned on an input sequence X. A CRF is a discriminative probabilistic model often used for labeling or segmenting sequential data, such as natural language text or biological sequences. A CRF is a generalization of a hidden Markov model (HMM) that turns the constant state transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence.

Generative Models: Hidden Markov models (HMMs) and stochastic grammars assign a joint probability to paired observation and label sequences. The parameters are typically trained to maximize the joint likelihood of the training examples.

Generative Models (2): Difficulties and disadvantages: Need to enumerate all possible observation sequences. Not practical to represent multiple interacting features or long-range dependencies of the observations. Very strict independence assumptions on the observations.

Conditional Models: Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x), i.e., specify the probability of possible label sequences given an observation sequence. They allow arbitrary, non-independent features of the observation sequence x; the probability of a transition between labels may depend on past and future observations. This relaxes the strong independence assumptions made in generative models.

Maximum Entropy Markov Models (MEMMs): Given a training set X with label sequence Y, train a model θ that maximizes P(Y | X, θ). For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ). Notice the per-state normalization.

MEMMs (2): MEMMs have all the advantages of conditional models. Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (conservation of score mass). They are subject to the label bias problem: a bias toward states with fewer outgoing transitions, due to the per-state normalization.

Label Bias Problem of MEMMs: In the example automaton (figure not shown), since P(2 | 1, x) = 1 and P(5 | 4, x) = 1 for all x (per-state normalization), we have P(1, 2 | r, i) = P(1 | r) P(2 | 1, i) = P(1 | r) and P(4, 5 | r, i) = P(4 | r) P(5 | 4, i) = P(4 | r). The probability does not depend on the second observation. If one path occurs slightly more often in training, it always wins at test time.

Solving the Label Bias Problem: One option is to change the state-transition structure of the model, but it is not always practical to change the set of states. Another is to start with a fully connected model and let the training procedure figure out a good structure, but this precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction).

Conditional Random Fields (CRFs): CRFs have all the advantages of MEMMs without the label bias problem. An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state, whereas a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. CRFs are undirected graphical models and allow some transitions to vote more strongly than others, depending on the corresponding observations.

Definition of CRFs: Let X be a random variable over data sequences to be labeled, and Y a random variable over the corresponding label sequences. Definition: let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.

Example of CRFs: Suppose p(Y_v | X, all other Y) = p(Y_v | X, neighbors of Y_v); then (X, Y) is a conditional random field. For instance, in a chain, p(Y_3 | X, all other Y) = p(Y_3 | X, Y_2, Y_4). Think of X as the observations and Y as the labels.

Graphical Comparison Among HMMs, MEMMs and CRFs: Graphical structures of a simple HMM (left), an MEMM (middle) and a chain-structured CRF (right). Open circles indicate that the variables are not generated by the model.

Conditional Distribution: If the graph G = (V, E) of Y is a chain, the conditional distribution over the label sequence y given x is p_θ(y | x) = (1/Z(x)) exp( Σ_{e ∈ E, k} λ_k f_k(e, y|_e, x) + Σ_{v ∈ V, k} µ_k g_k(v, y|_v, x) ). Here f_k and g_k are given and fixed: g_k is a Boolean vertex feature and f_k is a Boolean edge feature, with k indexing the features. θ = (λ_1, ...; µ_1, ...), where the λ_k and µ_k are parameters to be estimated. y|_e is the set of components of y defined by edge e, y|_v is the set of components of y defined by vertex v, and Z(x) is the normalization over the data sequence x.
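
The sketch below (my own illustration, with made-up features and parameter values) evaluates this conditional for a tiny linear-chain CRF by brute-force enumeration of all label sequences; for realistic sequence lengths one would compute Z(x) with the forward algorithm instead.

```python
import itertools
import numpy as np

LABELS = [0, 1]

def score(y, x, lam, mu):
    """Unnormalised log-score: weighted sum of vertex features g_k and
    edge features f_k for a label sequence y given observations x."""
    s = 0.0
    for v, yv in enumerate(y):            # vertex features g_k(v, y|_v, x)
        s += mu[yv] * (x[v] > 0)          # e.g. "observation positive and label is yv"
    for e in range(len(y) - 1):           # edge features f_k(e, y|_e, x)
        s += lam[y[e], y[e + 1]]          # transition feature for (y_e, y_{e+1})
    return s

def p_y_given_x(y, x, lam, mu):
    """p_theta(y | x) = exp(score) / Z(x), with Z(x) by enumeration."""
    Z = sum(np.exp(score(yp, x, lam, mu))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return np.exp(score(y, x, lam, mu)) / Z

lam = np.array([[1.0, -0.5], [-0.5, 1.0]])   # transition weights (lambda_k)
mu = np.array([0.2, 0.8])                    # vertex weights (mu_k)
x = [0.3, -1.2, 2.0]                         # toy observation sequence
print(p_y_given_x((1, 0, 1), x, lam, mu))
```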

Parameter Estimation for CRFs: Lafferty et al. presented iterative scaling algorithms, but they are very inefficient. The conditional log-likelihood of a training pair is log p_θ(y | x) = Σ_{e ∈ E, k} λ_k f_k(e, y|_e, x) + Σ_{v ∈ V, k} µ_k g_k(v, y|_v, x) − log Z(x). More efficient learning algorithms, such as L-BFGS with an approximate Hessian, work with its gradient (∂/∂θ) log p_θ(y | x); depending on the graph structure, log Z(x) and its derivative can be hard to compute. Other optimization algorithms also apply. Note: standard maximum conditional likelihood estimation over-fits; 2-norm regularization helps a lot.
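
A hedged sketch of regularized maximum conditional likelihood training for the toy model above, using scipy's L-BFGS-B; it reuses the brute-force `score` and `LABELS` from the previous sketch (fine only for very short chains) and adds a 2-norm penalty. All names and values here are illustrative assumptions.

```python
from scipy.optimize import minimize
import itertools
import numpy as np

def neg_log_likelihood(theta, data, c=1.0):
    """Negative regularized conditional log-likelihood for the toy CRF above.
    theta packs the 2x2 transition weights and the 2 vertex weights."""
    lam, mu = theta[:4].reshape(2, 2), theta[4:]
    nll = 0.5 * c * np.dot(theta, theta)          # 2-norm regularizer
    for x, y in data:
        logZ = np.log(sum(np.exp(score(yp, x, lam, mu))
                          for yp in itertools.product(LABELS, repeat=len(x))))
        nll -= score(y, x, lam, mu) - logZ
    return nll

# A couple of toy (x, y) training pairs.
data = [([0.3, -1.2, 2.0], (1, 0, 1)), ([1.5, 0.1, -0.4], (1, 1, 0))]
result = minimize(neg_log_likelihood, np.zeros(6), args=(data,), method="L-BFGS-B")
print(result.x)
```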

Summary: Locally normalized conditional models such as MEMMs are prone to the label bias problem. CRFs provide the benefits of conditional (discriminative) models while avoiding the label bias problem, and demonstrate good performance in practice.