Intelligent Systems: Undirected Graphical Models (Factor Graphs) (2 lectures)
Carsten Rother, 15/01/2015
Intelligent Systems: Probabilistic Inference in DGM and UGM
Roadmap for next two lectures
- Definition and Visualization of Factor Graphs
- Converting Directed Graphical Models to Factor Graphs
- Probabilistic Programming
- Queries and Making Decisions
- Binary-valued Factor Graphs: Models and Optimization (ICM, Graph Cut)
- Multi-valued Factor Graphs: Models and Optimization (Alpha Expansion)
Reminder: Structured Models - when to use what representation?
- Directed graphical model: the unknown variables have different meanings. Example: MaryCalls (M), JohnCalls (J), AlarmOn (A), BurglarInHouse (B)
- Factor graphs: the unknown variables all have the same meaning. Examples: pixels in an image, nuclei in C. elegans (a worm)
- Undirected graphical models are used instead of factor graphs when we are interested in studying conditional independence (not relevant in our context)
Reminder: Machine Learning: Structured versus Unstructured Models
Structured output prediction: f/p: Z^m -> X^n (for example X = R^n or X = N^n)
Important: the elements of X do not make independent decisions.
Definition (not formal): the output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Example: image labelling (computer vision). Input: image (Z^m). Output: labelling (K^n), where K has a fixed vocabulary, e.g. K = {Wall, Picture, Person, Clutter, ...}
Important: the labelling of neighbouring pixels is highly correlated.
Reminder: Machine Learning: Structured versus Unstructured Models
Structured output prediction: f/p: Z^m -> X^n (for example X = R^n or X = N^n)
Important: the elements of X do not make independent decisions.
Definition (not formal): the output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Example: text processing. Input: text (Z^m), e.g. "The boy went home". Output: X^n (parse tree of the sentence).
Factor Graph model - Example
A factor graph defines a distribution as:
p(x) = (1/f) prod_{F in 𝓕} ψ_F(x_{N(F)})   where   f = sum_x prod_{F in 𝓕} ψ_F(x_{N(F)})
f: partition function, so that the distribution is normalized
F: a factor; 𝓕: the set of all factors
N(F): neighbourhood of a factor
ψ_F: a function (not a distribution) depending on x_{N(F)}  (ψ_F: K^{|N(F)|} -> R, with x_i in K)
Example: p(x1, x2) = (1/f) ψ1(x1, x2) ψ2(x2), with x_i in {0,1}, K = 2
x_{N(1)} = {x1, x2}:  ψ1(0,0) = 1; ψ1(0,1) = 0; ψ1(1,0) = 1; ψ1(1,1) = 2
x_{N(2)} = {x2}:  ψ2(0) = 1; ψ2(1) = 0
f = 1·1 + 0·0 + 1·1 + 2·0 = 2
Check yourself that sum_x p(x) = 1.
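The toy model above can be checked directly in code. A minimal sketch in plain Python, enumerating all four assignments of the slide's factor tables:

```python
import itertools

# Factor tables from the slide: psi_1 over (x1, x2) and psi_2 over (x2,)
psi1 = {(0, 0): 1, (0, 1): 0, (1, 0): 1, (1, 1): 2}
psi2 = {0: 1, 1: 0}

# Partition function f: sum of the product of all factors over every assignment
f = sum(psi1[(x1, x2)] * psi2[x2]
        for x1, x2 in itertools.product((0, 1), repeat=2))

def p(x1, x2):
    return psi1[(x1, x2)] * psi2[x2] / f

print(f)  # 2
print(sum(p(x1, x2) for x1, x2 in itertools.product((0, 1), repeat=2)))  # 1.0
```

Enumerating all assignments is only feasible for tiny models; it is used here purely to verify the slide's numbers.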
Factor Graph model - Visualization
For visualization we use a graph G = (V, F, E), where V is the set of variable nodes, F the set of factor nodes, and E the set of edges.
A circle visualizes a variable node; a square visualizes a factor node; an edge means that the variable participates in that factor.
Example: p(x1, x2) = (1/f) ψ1(x1, x2) ψ2(x2)  (same model as before: ψ1(0,0) = 1; ψ1(0,1) = 0; ψ1(1,0) = 1; ψ1(1,1) = 2; ψ2(0) = 1; ψ2(1) = 0; f = 2)
Check yourself that sum_x p(x) = 1.
Factor Graph model - Visualization
Example: p(x1, x2, x3, x4, x5) = (1/f) ψ(x1, x2, x4) ψ(x2, x3) ψ(x3, x4) ψ(x4, x5) ψ(x4)
(the ψ's are specified in some way)
Visualization: circles are the variable nodes x1, ..., x5; squares are the factor nodes; an edge means that the variable participates in that factor.
Probabilities and Energies
p(x) = (1/f) prod_F ψ_F(x_{N(F)}) = (1/f) prod_F exp{-θ_F(x_{N(F)})} = (1/f) exp{-sum_F θ_F(x_{N(F)})} = (1/f) exp{-E(x)}
The energy E(x) is just a sum of factors: E(x) = sum_{F in 𝓕} θ_F(x_{N(F)})
The most likely solution x* is reached by minimizing the energy:
x* = argmax_x p(x) = argmin_x E(x)
Notes:
1) If x* is a maximizer of f(x), it is also a maximizer of log f(x), since log is monotone (x1 <= x2 iff log x1 <= log x2)
2) log p(x) = -log f - E(x) = constant - E(x)
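The equivalence of maximizing p(x) and minimizing E(x) can be illustrated on the same toy factor tables, setting θ = -log ψ (a ψ value of 0 corresponds to infinite energy). A small sketch:

```python
import itertools
import math

# Same toy factor tables as in the earlier example
psi1 = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 2.0}
psi2 = {0: 1.0, 1: 0.0}

def theta(v):
    # theta = -log(psi); psi = 0 corresponds to infinite energy
    return math.inf if v == 0.0 else -math.log(v)

def energy(x1, x2):
    return theta(psi1[(x1, x2)]) + theta(psi2[x2])

states = list(itertools.product((0, 1), repeat=2))
map_by_prob = max(states, key=lambda s: psi1[s] * psi2[s[1]])
map_by_energy = min(states, key=lambda s: energy(*s))
print(map_by_prob, map_by_energy)  # (0, 0) (0, 0)
```

Both searches agree on the same state (here there is a tie between (0,0) and (1,0); `max`/`min` both return the first).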
Names
The probability distribution p(x) = (1/f) exp{-E(x)} with energy E(x) = sum_{F in 𝓕} θ_F(x_{N(F)}) is a so-called Gibbs distribution, with f = sum_x exp{-E(x)}.
We define the order of a factor graph as the arity (number of variables) of its largest factor.
Example of an order-3 model: E(x) = θ(x1, x2, x4) + θ(x2, x3) + θ(x3, x4) + θ(x5, x4) + θ(x4)  (arity 3, arity 2, arity 2, arity 2, arity 1)
A different name for a factor graph / undirected graphical model is Markov Random Field (MRF). This is an extension of Markov chains to fields. The name "Markov" stands for the Markov property, which essentially means that the factors have small arity, i.e. each variable interacts directly with only a few others.
Examples: Order
- 4-connected pairwise MRF: E(x) = sum_{i,j in N4} θij(xi, xj) - order 2 (pairwise energy)
- Higher(8)-connected pairwise MRF: E(x) = sum_{i,j in N8} θij(xi, xj) - order 2 (pairwise energy)
- Higher-order RF: E(x) = sum_{i,j in N4} θij(xi, xj) + θ(x1, ..., xn) - order n (higher-order energy)
Roadmap for next two lectures
- Definition and Visualization of Factor Graphs
- Converting Directed Graphical Models to Factor Graphs
- Probabilistic Programming
- Queries and Making Decisions
- Binary-valued Factor Graphs: Models and Optimization (ICM, Graph Cut)
- Multi-valued Factor Graphs: Models and Optimization (Alpha Expansion)
Converting a Directed Graphical Model to a Factor Graph
A simple case:
p(x1, x2, x3) = p(x3|x2) p(x2|x1) p(x1)
becomes
p(x1, x2, x3) = (1/f) ψ(x3, x2) ψ(x2, x1) ψ(x1)
where: f = 1; ψ(x3, x2) = p(x3|x2); ψ(x2, x1) = p(x2|x1); ψ(x1) = p(x1)
Converting a Directed Graphical Model to a Factor Graph
A more complex case:
p(x1, x2, x3, x4) = p(x1|x2, x3) p(x2) p(x3|x4) p(x4)
becomes
p(x1, x2, x3, x4) = (1/f) ψ(x1, x2, x3) ψ(x2) ψ(x3, x4) ψ(x4)
where: f = 1; ψ(x1, x2, x3) = p(x1|x2, x3); ψ(x2) = p(x2); ψ(x3, x4) = p(x3|x4); ψ(x4) = p(x4)
Converting a Directed Graphical Model to a Factor Graph
Recipe:
- Take each conditional probability and convert it to a factor (without conditioning), i.e. replace conditioning bars with commas.
- Set the normalization constant f = 1.
- Visualization: all parents of a node, together with the node itself, form a new factor (this step is called moralization).
Comment: the other direction is more complicated, since the factors ψ have to be converted correctly into individual (conditional) probabilities such that the overall joint distribution stays the same.
Our example:
Directed GM: p(x1, x2, x3, x4) = p(x1|x2, x3) p(x2) p(x3|x4) p(x4)
Factor graph: p(x1, x2, x3, x4) = (1/f) ψ(x1, x2, x3) ψ(x2) ψ(x3, x4) ψ(x4)
where: f = 1; ψ(x1, x2, x3) = p(x1|x2, x3); ψ(x2) = p(x2); ψ(x3, x4) = p(x3|x4); ψ(x4) = p(x4)
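The recipe can be sketched in code: each conditional probability table becomes a factor verbatim, and f = 1, so the factor product is already a normalized distribution. The CPT values below are hypothetical, chosen only for illustration of the simple chain case p(x3|x2) p(x2|x1) p(x1):

```python
import itertools

# Hypothetical CPTs for the chain DGM p(x1, x2, x3) = p(x3|x2) p(x2|x1) p(x1)
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # key: (x2, x1)
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key: (x3, x2)

# Conversion: each conditional becomes a factor unchanged, and f = 1
psi_a, psi_b, psi_c = p_x3_given_x2, p_x2_given_x1, p_x1

total = sum(psi_a[(x3, x2)] * psi_b[(x2, x1)] * psi_c[x1]
            for x1, x2, x3 in itertools.product((0, 1), repeat=3))
print(total)  # 1.0 -- the factor product is already normalized
```

This also shows why f = 1 is correct: each conditional sums to 1 over its child variable, so the full product sums to 1 as well.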
Roadmap for next two lectures
- Definition and Visualization of Factor Graphs
- Converting Directed Graphical Models to Factor Graphs
- Probabilistic Programming
- Queries and Making Decisions
- Binary-valued Factor Graphs: Models and Optimization (ICM, Graph Cut)
- Multi-valued Factor Graphs: Models and Optimization (Alpha Expansion)
Probabilistic Programming Languages
A programming language for machine learning tasks, in particular for modelling, learning and making predictions in directed graphical models (DGM), undirected graphical models (UGM), and factor graphs (FG).
Comment: DGMs and UGMs are converted to factor graphs; all operations are run on factor graphs.
The basic idea is to associate with each variable a distribution:
  Bool coin = 0;                 // normal C++
  Bool coin = Bernoulli(0.5);    // probabilistic program; Bernoulli is a distribution with 2 states
See http://probabilistic-programming.org/wiki/home
An Example: Two coins
Example: you toss two fair coins. What is the chance that both are heads?
Random variables: coin1 (x1), coin2 (x2), and an event z about the state of both variables.
We know:
- coin1 (x1) and coin2 (x2) are independent
- each coin has equal probability of being heads (1) or tails (0)
- the new random variable z is true if and only if both coins are heads: z = x1 & x2
An Example: Two coins
x_i in {0,1}, p(x_i = 1) = p(x_i = 0) = 0.5

x1  x2 | P(z=1|x1,x2)  P(z=0|x1,x2)
 0   0 |      0              1
 0   1 |      0              1
 1   0 |      0              1
 1   1 |      1              0

Joint: p(x1, x2, z) = p(z|x1, x2) p(x1) p(x2)
Compute the marginal: p(z) = sum_{x1,x2} p(z, x1, x2) = sum_{x1,x2} p(z|x1, x2) p(x1) p(x2)
p(z=1) = 1 · 0.5 · 0.5 = 0.25
p(z=0) = 1 · 0.5 · 0.5 + 1 · 0.5 · 0.5 + 1 · 0.5 · 0.5 = 0.75
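The marginal computation above can be reproduced by brute-force enumeration. A minimal sketch in Python:

```python
import itertools

p_x = {0: 0.5, 1: 0.5}  # each fair coin

def p_z_given(z, x1, x2):
    # z is deterministic given the coins: z = x1 AND x2
    return 1.0 if z == (x1 & x2) else 0.0

p_z1 = sum(p_z_given(1, a, b) * p_x[a] * p_x[b]
           for a, b in itertools.product((0, 1), repeat=2))
p_z0 = sum(p_z_given(0, a, b) * p_x[a] * p_x[b]
           for a, b in itertools.product((0, 1), repeat=2))
print(p_z1, p_z0)  # 0.25 0.75
```

This is exactly the sum over x1, x2 written on the slide, with the deterministic CPT for z folded into a function.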
An Example: Two coins - Infer.NET
(Code screenshots on the slide: the program, running it, and adding evidence to the program.)
Roadmap for next two lectures
- Definition and Visualization of Factor Graphs
- Converting Directed Graphical Models to Factor Graphs
- Probabilistic Programming
- Queries and Making Decisions
- Binary-valued Factor Graphs: Models and Optimization (ICM, Graph Cut)
- Multi-valued Factor Graphs: Models and Optimization (Alpha Expansion)
What to infer?
Same as in directed graphical models:
- MAP inference (maximum a posteriori state): x* = argmax_x p(x) = argmin_x E(x)
- Probabilistic inference, the so-called marginals: p(x_i = k) = sum_{x with x_i = k} p(x1, ..., x_i = k, ..., xn)
  This can be used to make a maximum-marginal decision: x_i* = argmax_{x_i} p(x_i)
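MAP and max-marginal decisions can genuinely disagree. A small sketch with a hand-picked (hypothetical) joint distribution over two binary variables where the two queries give different answers:

```python
# A hand-picked toy joint over (x1, x2) where MAP and max-marginal differ
p = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.0, (1, 1): 0.4}

# MAP: the single most probable joint state
map_x = max(p, key=p.get)

def marginal(i, k):
    # p(x_i = k): sum the joint over all states with x_i = k
    return sum(v for x, v in p.items() if x[i] == k)

# Max-marginal: maximize each variable's marginal independently
max_marg = tuple(max((0, 1), key=lambda k: marginal(i, k)) for i in range(2))
print(map_x, max_marg)  # (1, 1) (0, 1)
```

Here p(x1=0) = 0.6 and p(x2=1) = 0.7, so the max-marginal decision is (0, 1), even though the most probable joint state is (1, 1).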
MAP versus Marginals - visually
(Slide images: input image; ground-truth labelling; MAP solution x*, where each pixel has a 0/1 label; marginals p(x_i), where each pixel has a probability between 0 and 1.)
MAP versus Marginals - Making Decisions
(Slide plot: p(x|z) over the space of all solutions x, sorted by pixel difference, for an input image z.)
Which solution x would you choose?
Reminder: How to make a decision
Assume the model p(x|z) is known.
Question: which solution x* should we give out?
Answer: choose the x* which minimizes the Bayesian risk:
x* = argmin_{x'} sum_x p(x|z) C(x, x')
C(x1, x2) is called the loss function (or cost function) for comparing two results x1, x2.
MAP versus Marginals - Making Decisions
(Slide plot: p(x|z) over the space of all solutions x, sorted by pixel difference, for an input image z.)
Which one is the MAP solution?
The MAP solution (red) is the globally optimal solution: x* = argmax_x p(x|z) = argmin_x E(x, z)
Reminder: The Cost Function behind MAP
The cost function for MAP: C(x, x') = 0 if x = x', otherwise 1.
x* = argmin_{x'} sum_x p(x|z) C(x, x') = argmin_{x'} (1 - p(x = x'|z)) = argmax_{x'} p(x'|z)
MAP estimation optimizes a global 0-1 loss.
The Cost Function behind Max-Marginals
Probabilistic inference gives the marginals. We can take the max-marginal solution:
x_i* = argmax_{x_i} p(x_i), where p(x_i = k) = sum_{x with x_i = k} p(x1, ..., x_i = k, ..., xn)
This represents the decision with minimum Bayesian risk x* = argmin_{x'} sum_x p(x|z) C(x, x') for the loss C(x, x') = sum_i (x_i - x_i')^2  (proof not done here).
For x_i in {0,1} this counts the number of differently labelled pixels.
Example: C(x1, x2) = 10, C(x2, x3) = 10, C(x1, x3) = 20 (numbers guessed).
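The link between losses and decisions can be checked by brute force: minimizing the Bayesian risk under the 0-1 loss recovers the MAP state, while the per-pixel (Hamming) loss recovers the max-marginal decision. A sketch on a hand-picked toy joint:

```python
# Hand-picked toy joint over two binary variables
p = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.0, (1, 1): 0.4}
states = list(p)

def zero_one(a, b):
    return 0 if a == b else 1

def hamming(a, b):
    # number of differently labelled "pixels"
    return sum(ai != bi for ai, bi in zip(a, b))

def risk_minimizer(loss):
    # Bayesian risk of candidate x': sum_x p(x) * loss(x, x')
    return min(states, key=lambda xh: sum(p[x] * loss(x, xh) for x in states))

print(risk_minimizer(zero_one))  # (1, 1): the MAP state
print(risk_minimizer(hamming))   # (0, 1): the max-marginal decision
```

Under the 0-1 loss the risk of x' is 1 - p(x'), so the minimizer is the mode; under the Hamming loss the risk decomposes per variable, so each coordinate is decided by its own marginal.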
MAP versus Marginals - Making Decisions
(Slide plot: four solutions with p = 0.1, 0.11, 0.1, 0.2 and pairwise costs C = 1, 1, 100 between them; all numbers are arbitrarily chosen.)
Which one is the max-marginal solution?
If x' is red, the risk is (summing only over the 4 solutions): 0.1 + 0.1 + 100·0.2 = 20.2
If x' is blue, the risk is (summing only over the 4 solutions): 11 + 10 + 10 = 31
Hence red is the max-marginal solution.
This lecture: MAP Inference in Order-2 Models
p(x) = (1/f) exp{-E(x)}  (Gibbs distribution)
E(x) = sum_i θ_i(x_i) + sum_{i,j} θ_ij(x_i, x_j) + sum_{i,j,k} θ_ijk(x_i, x_j, x_k) + ...
(unary terms + pairwise terms + higher-order terms)
We only look at energies with unary and pairwise factors.
MAP inference: x* = argmax_x p(x) = argmin_x E(x)
Label space: binary x_i in {0,1} or multi-label x_i in {0, ..., K}
Roadmap for next two lectures
- Definition and Visualization of Factor Graphs
- Converting Directed Graphical Models to Factor Graphs
- Probabilistic Programming
- Queries and Making Decisions
- Binary-valued Factor Graphs: Models and Optimization (ICM, Graph Cut)
- Multi-valued Factor Graphs: Models and Optimization (Alpha Expansion)
Image Segmentation
Input: image with user brush strokes (blue = background; red = foreground). Desired output: a binary labelling.
We will use the following energy with binary labels x_i in {0,1}:
E(x) = sum_i θ_i(x_i) + sum_{i,j in N4} θ_ij(x_i, x_j)
(unary term + pairwise term, where N4 is the set of all 4-connected neighbouring pixels)
Image Segmentation: Energy
Goal: formulate E(x) such that good segmentations get low energy and bad ones high energy, e.g. E(x) = 0.01, 0.05, 0.05, 10 for the four example labellings on the slide (illustrative numbers only).
MAP solution: x* = argmin_x E(x)
Unary Term
(Slide figure: red/green colour scatter plot of user-labelled pixels - crosses are foreground, dots are background - with a Gaussian mixture model fit; the foreground model is shown in blue, the background model in red.)
Unary Term
For a new query image z:
θ_i(x_i = 0) = -log P_red(z_i | x_i = 0)
θ_i(x_i = 1) = -log P_blue(z_i | x_i = 1)
For θ_i(x_i = 0), dark means likely background; for θ_i(x_i = 1), dark means likely foreground.
Optimum with unary terms only: x* = argmin_x E(x) with E(x) = sum_i θ_i(x_i)
Pairwise Term
We choose a so-called Ising prior: θ_ij(x_i, x_j) = |x_i - x_j|
which gives the energy E(x) = sum_{i,j in N4} θ_ij(x_i, x_j)
Questions: which labelling has the lowest energy? Which has the highest?
(Slide examples: two fully coherent labellings have the lowest energy, a labelling with one compact region has intermediate energy, and a scattered labelling has very high energy.)
This models the assumption that the object is spatially coherent.
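The Ising prior can be evaluated on small label grids to confirm that coherent labellings get a lower pairwise energy than scattered ones. A minimal sketch (the two example grids are made up for illustration):

```python
import itertools

def ising_energy(x, w=1.0):
    # x: 2-D list of 0/1 labels; sums w * |x_i - x_j| over 4-connected neighbour pairs
    rows, cols = len(x), len(x[0])
    e = 0.0
    for i, j in itertools.product(range(rows), range(cols)):
        if i + 1 < rows:
            e += w * abs(x[i][j] - x[i + 1][j])  # vertical edge
        if j + 1 < cols:
            e += w * abs(x[i][j] - x[i][j + 1])  # horizontal edge
    return e

coherent  = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]   # one compact region
scattered = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # checkerboard
print(ising_energy(coherent), ising_energy(scattered))  # 4.0 12.0
```

The compact region cuts only 4 of the grid's 12 edges, while the checkerboard cuts all 12, matching the intuition on the slide.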
Adding Unary and Pairwise Terms
Energy: E(x) = sum_i θ_i(x_i) + ω sum_{i,j in N4} |x_i - x_j|
(Slide results for ω = 0, 10, 40, 200.)
Question: what happens when ω increases further?
Question (done in the exercise): can the global optimum be computed with graph cut? Please prove it.
Is it the best we can do?
(Slide figure: 4-connected segmentation, with zoom-ins on the image.)
Is it the best we can do?
Given E(x) = sum_i θ_i(x_i) + sum_{i,j in N4} |x_i - x_j|, which segmentation has higher energy?

Segmentation A:      Segmentation B:
0 0 0 0 0 0          0 0 0 0 0 0
0 1 1 1 1 0          0 0 1 1 0 0
0 1 1 1 1 0          0 1 1 1 1 0
0 1 1 1 1 0          0 1 1 1 1 0
0 1 1 1 1 0          0 0 1 1 0 0
0 0 0 0 0 0          0 0 0 0 0 0

Answers:
1) It depends on the unary costs.
2) The pairwise cost is the same in both cases (16 edges of N4 are cut).
From 4-connected to 8-connected Factor Graphs
(Slide figure: two example paths measured under different connectivities.)

             path 1   path 2
Euclidean     5.65      8
4-connected   6.28      6.28
8-connected   5.08      6.75

Larger connectivity can model the true Euclidean length (other metrics are also possible) [Boykov et al. 2003; 2005].
Going to 8-connectivity
(Slide figure: segmentations with a 4-connected and an 8-connected Euclidean MRF, with zoom-ins on the image.)
Is it the best we can do?
Adapting the Pairwise Term
E(x) = sum_i θ_i(x_i) + sum_{i,j in N4} θ_ij(x_i, x_j)
with the edge-dependent pairwise term:
θ_ij(x_i, x_j) = |x_i - x_j| · exp(-β ||z_i - z_j||^2), where β is a constant
Question: what is this term doing?
(Slide figure: standard 4-connected versus edge-dependent 4-connected results.)
Question (done in the exercise): can the global optimum be computed with graph cut? Please prove it.
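The effect of the edge-dependent weight exp(-β ||z_i - z_j||^2) can be seen on two hypothetical pixel-colour pairs: the weight stays near 1 inside flat image regions (so a label change there is fully penalized) and drops towards 0 across strong image edges (so the segmentation boundary is cheap exactly where the image has an edge):

```python
import math

def contrast_weight(z_i, z_j, beta=0.5):
    # exp(-beta * ||z_i - z_j||^2) for two pixel colour vectors
    diff2 = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return math.exp(-beta * diff2)

# Hypothetical RGB values in [0, 1]
flat = contrast_weight((0.5, 0.5, 0.5), (0.52, 0.5, 0.5))  # nearly identical colours
edge = contrast_weight((0.1, 0.1, 0.1), (0.9, 0.9, 0.9))   # strong colour edge
print(flat, edge)  # flat is close to 1.0, edge is much smaller
```

The constant β controls how sharply the penalty falls off with colour contrast; its value (0.5 here) is an arbitrary choice for the sketch.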
A Probabilistic View
1. Just look at the conditional distribution (a Gibbs distribution):
p(x|z) = (1/f) exp{-E(x, z)}, with f = sum_x exp{-E(x, z)}
E(x, z) = sum_i θ_i(x_i, z_i) + sum_{i,j in N4} θ_ij(x_i, x_j)
θ_i(x_i = 0, z_i) = -log p_red(z_i | x_i = 0)
θ_i(x_i = 1, z_i) = -log p_blue(z_i | x_i = 1)
θ_ij(x_i, x_j) = |x_i - x_j|
2. Factorize the conditional distribution:
p(x|z) = p(z|x) p(x) / p(z), where p(z) is a constant factor
p(x) = (1/f1) exp{-sum_{i,j in N4} |x_i - x_j|}  (note: exp(0) = 1, exp(-1) ≈ 0.36)
p(z|x) = (1/f2) prod_i p(z_i|x_i) = (1/f2) prod_i (p_blue(z_i | x_i = 1) x_i + p_red(z_i | x_i = 0)(1 - x_i))
Check yourself: p(x|z) = (1/f) p(z|x) p(x) = (1/f) exp{-E(x, z)}
ICM - Iterated Conditional Modes
Energy: E(x) = sum_i θ_i(x_i) + sum_{i,j in N4} θ_ij(x_i, x_j)
Idea: fix all variables but one and optimize over that one.
Insight: the resulting optimization has an implicit energy that depends only on a few factors.
Example (x_{\1} means all labels but x_1 are fixed):
E(x_1 | x_{\1}) = θ_1(x_1) + θ_12(x_1, x_2) + θ_13(x_1, x_3) + θ_14(x_1, x_4) + θ_15(x_1, x_5) + constant
ICM - Iterated Conditional Modes
Energy: E(x) = sum_i θ_i(x_i) + sum_{i,j in N4} θ_ij(x_i, x_j)
Algorithm:
1. Initialize x = 0
2. For i = 1 ... n: update x_i = argmin_{x_i} E(x_i | x_{\i})
3. Go to step 2 if E(x) has changed w.r.t. the previous iteration
Problems: can get stuck in local minima; the result depends on the initialization.
(Slide figure: ICM result versus the global optimum computed with graph cut.)
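ICM itself is only a few lines of code. A sketch for binary labels on a 4-connected grid with Ising pairwise terms; the unary cost table is made up for illustration, and the one weakly mislabelled pixel gets smoothed over once the pairwise weight is large enough:

```python
import itertools

def icm(unary, w, sweeps=10):
    # unary[i][j][k]: cost of label k in {0,1} at pixel (i, j); w: Ising weight
    rows, cols = len(unary), len(unary[0])
    x = [[0] * cols for _ in range(rows)]  # initialize all labels to 0
    for _ in range(sweeps):
        changed = False
        for i, j in itertools.product(range(rows), range(cols)):
            def local(k):
                # energy terms that involve pixel (i, j) only
                e = unary[i][j][k]
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        e += w * abs(k - x[ni][nj])
                return e
            best = min((0, 1), key=local)
            if best != x[i][j]:
                x[i][j] = best
                changed = True
        if not changed:  # converged: a full sweep made no update
            break
    return x

# Made-up unaries: three pixels prefer label 1, the bottom-right weakly prefers 0
unary = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [0.4, 0.6]]]
print(icm(unary, w=0.0))  # [[1, 1], [1, 0]]: pure unary decision
print(icm(unary, w=0.4))  # [[1, 1], [1, 1]]: smoothing flips the outlier
```

This also makes the slide's caveats concrete: the result depends on the sweep order and on the all-zero initialization, and ICM only guarantees a local minimum of E(x).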
ICM - Parallelization
Normal procedure: update one variable at a time (steps 1-4 on the slide).
Parallel procedure: variables that do not share a factor can be updated simultaneously, e.g. in a checkerboard schedule on a 4-connected grid (steps 1-4 on the slide).
Scheduling is a more complex task in graphs that are not 4-connected.
Roadmap for next two lectures
- Definition and Visualization of Factor Graphs
- Converting Directed Graphical Models to Factor Graphs
- Probabilistic Programming
- Queries and Making Decisions
- Binary-valued Factor Graphs: Models and Optimization (ICM, Graph Cut)
- Multi-valued Factor Graphs: Models and Optimization (Alpha Expansion)