Course 16:198:520: Introduction To Artificial Intelligence Lecture 9. Markov Networks. Abdeslam Boularias. Monday, October 14, 2015


1 Course 16:198:520: Introduction To Artificial Intelligence. Lecture 9: Markov Networks. Abdeslam Boularias. Monday, October 14, 2015. 1 / 58

2 Overview Bayesian networks, presented in the previous lecture, are one type of graphical model. Bayesian networks are mostly used for diagnosis, for example in medicine and in business analytics. In this lecture, we focus on another important class of graphical models known as Markov Networks (a.k.a. Markov Random Fields). Markov Networks are extensively used in computer vision and image processing. Andrey Markov 2 / 58

3 Application 1: Image denoising (from Christopher Burger, Christian Schuler and Stefan Harmeling, CVPR 2012). 3 / 58

4 Application 1: Image denoising (from Pattern Recognition and Machine Learning by Christopher Bishop). [Figure: original image, noisy image, and the image reconstructed using a Markov Network.] Illustration of image de-noising using a Markov random field. The top row shows the original binary image on the left and the corrupted image after randomly changing 10% of the pixels on the right. The bottom row shows the restored images obtained using iterated conditional modes (ICM) on the left and using the graph-cut algorithm on the right. ICM produces an image where 96% of the pixels agree with the original image. 4 / 58

5 Application 1: Image denoising (from Pattern Recognition and Machine Learning by Christopher Bishop). Figure 8.31: An undirected graphical model representing a Markov random field for image de-noising, in which x_i is a binary variable denoting the state of pixel i in the unknown noise-free image, and y_i denotes the corresponding value of pixel i in the observed noisy image. We construct a graph (network), where each node corresponds to a pixel in the image. Nodes are binary random variables (e.g. blue or yellow). Nodes X describe the true color of each pixel. Nodes Y describe the observed color of each pixel. Adjacent pixels correspond to adjacent nodes in the graph. The complete energy function for the model is E(x, y) = h Σ_i x_i − β Σ_{i,j} x_i x_j − η Σ_i x_i y_i (8.42), where the term h x_i biases the model towards pixel values that have one particular sign in preference to the other. This defines a joint distribution over x and y given by p(x, y) = (1/Z) exp{−E(x, y)} (8.43). Fixing the elements of y to the values observed in the noisy image implicitly defines a conditional distribution p(x | y) over noise-free images. Inference problem: Find the Maximum A Posteriori (MAP) argmax_X P(X | Y). 5 / 58
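
As a concrete illustration of this energy-based formulation, here is a minimal Python sketch (not part of the original lecture) of the pairwise denoising model with iterated conditional modes (ICM) as the inference routine; pixels are encoded as ±1 as in Bishop, and the coefficients h, beta and eta are illustrative values.

```python
# A minimal sketch of the Bishop-style denoising energy with ICM inference.
# x[i, j], y[i, j] take values in {-1, +1}; h, beta, eta are illustrative values.
import numpy as np

def local_energy(x, y, i, j, value, h=0.0, beta=1.0, eta=2.0):
    """Terms of E(x, y) that involve pixel (i, j) when x[i, j] = value."""
    e = h * value - eta * value * y[i, j]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):          # 4-neighbourhood
        ni, nj = i + di, j + dj
        if 0 <= ni < x.shape[0] and 0 <= nj < x.shape[1]:
            e -= beta * value * x[ni, nj]
    return e

def icm_denoise(y, sweeps=5):
    """Iterated conditional modes: greedily set each pixel to the value lowering E(x, y)."""
    x = y.copy()
    for _ in range(sweeps):
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                if local_energy(x, y, i, j, -1) < local_energy(x, y, i, j, +1):
                    x[i, j] = -1
                else:
                    x[i, j] = +1
    return x

# Usage: y = np.where(np.random.rand(32, 32) > 0.5, 1, -1); x_hat = icm_denoise(y)
```

ICM is a coordinate-wise greedy minimizer of E(x, y), which is why it only reaches a local optimum, in contrast with the graph-cut result mentioned in the figure above.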

6 Application 2: Background/Foreground separation. Segmentation of Multivariate Mixed Data via Lossy Coding and Compression. Yi Ma, Harm Derksen, Wei Hong and John Wright. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007. 6 / 58

7 Application 3: detecting graspable parts of objects from depth images 7 / 58

8 Markov Networks Markov networks are useful for applications where relations between variables are symmetrical. Bayesian networks represent causal relations, e.g. Smoking → Cancer. Markov networks represent correlations between variables, e.g. Bob is a democrat ↔ Alice is a democrat: when Bob and Alice are friends, these two variables depend on each other, but neither is the cause of the other. A Markov network is a set of random variables having a Markov property described by an undirected graph. 8 / 58

9 Markov Networks An example of a Markov random field. Each edge represents dependency. In this example: A depends on B and D. B depends on A and D. D depends on A, B, and E. E depends on D and C. C depends on E. (From Wikipedia) 9 / 58

10 Complete subgraphs (cliques) Definition A complete subgraph, or a clique, is a subgraph where every two vertices are connected to each other. Example 1-vertex cliques: C 1 = {A}, C 2 = {B}, C 3 = {C}, C 4 = {D}, C 5 = {E}. 2-vertex cliques: C 6 = {A, B}, C 7 = {A, D}, C 8 = {B, D}, C 9 = {D, E}, C 10 = {E, C}. 3-vertex cliques: C 11 = {A, B, D}. 10 / 58

11 Factor Potential Definition Let X be a set of random variables. We define a factor potential to be a function from values of X to R+. Example Let X = {BobDemocrat, AliceDemocrat}. We define the following factor potential φ: φ([BobDemocrat = true, AliceDemocrat = true]) = 21, φ([BobDemocrat = true, AliceDemocrat = false]) = 2.7, φ([BobDemocrat = false, AliceDemocrat = true]) = 3.6, φ([BobDemocrat = false, AliceDemocrat = false]) = … 11 / 58
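
In code, such a potential is just a lookup table from joint assignments to positive numbers; here is a minimal sketch (the fourth value is assumed purely for illustration).

```python
# A factor potential over X = {BobDemocrat, AliceDemocrat} as a lookup table
# mapping each joint assignment to a positive real number.
phi = {
    (True, True):   21.0,
    (True, False):  2.7,
    (False, True):  3.6,
    (False, False): 7.0,   # assumed value for illustration
}

print(phi[(True, False)])   # 2.7
```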

12 Definition of a Markov Network Definition Let X = {X_1, X_2,..., X_n} be a set of random variables and G a graph that has X as vertices. Let C = {C_1, C_2,..., C_m} be the set of all the complete subgraphs in G. Let φ_1, φ_2,..., φ_m be factor potentials defined over C_1, C_2,..., C_m respectively. X is a Markov network if: P(X_1, X_2,..., X_n) = (1/Z) φ_1(C_1) φ_2(C_2) ··· φ_m(C_m). Z is a normalization constant, called the partition function; it is defined as Z = Σ_{X_1, X_2,..., X_n} φ_1(C_1) φ_2(C_2) ··· φ_m(C_m) (each C_i is a subset of {X_1, X_2,..., X_n}). 12 / 58

13 Example 1 1-vertex cliques: C_1 = {A}, C_2 = {B}, C_3 = {C}, C_4 = {D}, C_5 = {E}. 2-vertex cliques: C_6 = {A, B}, C_7 = {A, D}, C_8 = {B, D}, C_9 = {D, E}, C_10 = {E, C}. 3-vertex cliques: C_11 = {A, B, D}. Assume the random variables {A, B, C, D, E} are boolean. Define a factor potential φ_i for each clique C_i. Example: φ_6 is defined over the clique C_6 = {A, B}: φ_6([A = true, B = true]) = 21, φ_6([A = true, B = false]) = 2.7, φ_6([A = false, B = true]) = 3.6, φ_6([A = false, B = false]) = … 13 / 58

14 Example 1 Probability of (A = true, B = false, C = true, D = true, E = false) is given as: P(A = true, B = false, C = true, D = true, E = false) = (1/Z) φ_1(C_1) φ_2(C_2) φ_3(C_3) φ_4(C_4) φ_5(C_5) φ_6(C_6) φ_7(C_7) φ_8(C_8) φ_9(C_9) φ_10(C_10) φ_11(C_11). The variables inside each clique C_i take their corresponding values from (A = true, B = false, C = true, D = true, E = false). Z = Σ_{(A,B,C,D,E) ∈ {true,false}^5} Π_{i=1}^{11} φ_i(C_i). 14 / 58
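
For a network this small, both the partition function and any joint probability can be computed by brute-force enumeration of the 2^5 assignments. A minimal sketch in Python, with randomly chosen potential tables standing in for the unspecified φ_i:

```python
# Brute-force Z and one joint probability for Example 1 (assumed potential values).
import itertools
import random

variables = ["A", "B", "C", "D", "E"]
cliques = [("A",), ("B",), ("C",), ("D",), ("E",),
           ("A", "B"), ("A", "D"), ("B", "D"), ("D", "E"), ("E", "C"),
           ("A", "B", "D")]

random.seed(0)
# One table per clique: maps the tuple of values of its variables to a positive number.
potentials = [{vals: random.uniform(0.5, 5.0)
               for vals in itertools.product([True, False], repeat=len(c))}
              for c in cliques]

def unnormalized(assignment):
    """Product of all factor potentials evaluated at a full assignment (a dict)."""
    p = 1.0
    for clique, phi in zip(cliques, potentials):
        p *= phi[tuple(assignment[v] for v in clique)]
    return p

Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([True, False], repeat=5))

x = {"A": True, "B": False, "C": True, "D": True, "E": False}
print("P(x) =", unnormalized(x) / Z)
```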

15 Representation, Example 2. [Figure: a Markov network over four binary random variables P_1, P_2, P_3, P_4, standing for TB Patient 1, TB Patient 2, TB Patient 3 and TB Patient 4, shown together with its tables of factor potentials. Potentials of 1-vertex cliques are denoted by π(.) and potentials of 2-vertex cliques are denoted by π(.,.).] Two variables are linked if the corresponding patients had a physical contact with each other. (a) A simple Markov network describing the tuberculosis status of four patients. The links between patients indicate which patients have been in contact with each other. (b) The same Markov network, together with the node and edge potentials. Example taken from Daphne Koller, Nir Friedman, Lise Getoor and Ben Taskar. Graphical Models in a Nutshell. 15 / 58

16 Independence properties in Markov networks Let X be a Markov network. We use the notation (X ⊥ Y | Z) to indicate that X is independent of Y given Z, i.e. P(X, Y | Z) = P(X | Z) P(Y | Z), or in other terms P(X | Y, Z) = P(X | Z). 16 / 58

17 Independence properties in Markov networks Let X be a Markov network. We use the notation (X ⊥ Y | Z) to indicate that X is independent of Y given Z, i.e. P(X, Y | Z) = P(X | Z) P(Y | Z), or in other terms P(X | Y, Z) = P(X | Z). We show here some nice independence properties in Markov networks. Independence properties are useful for fast inference: we don't have to enumerate all the possible combinations of values of all the variables. 17 / 58

18 Independence properties in Markov networks Let X be a Markov network. We use the notation (X ⊥ Y | Z) to indicate that X is independent of Y given Z, i.e. P(X, Y | Z) = P(X | Z) P(Y | Z), or in other terms P(X | Y, Z) = P(X | Z). We show here some nice independence properties in Markov networks. Independence properties are useful for fast inference: we don't have to enumerate all the possible combinations of values of all the variables. Pairwise Markov property Any two non-adjacent variables, X_i and X_j, are conditionally independent of each other given all other variables X \ {X_i, X_j}: (X_i ⊥ X_j) | X \ {X_i, X_j}. In other terms, P(X_i | X_j, X \ {X_i, X_j}) = P(X_i | X \ {X_i, X_j}). 18 / 58

19 Example A and E are independent of each other, given B, C, and D. 19 / 58
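
This property can be checked numerically on the example network. Below is a minimal sketch (with arbitrary made-up potentials, since only φ_6 is specified in the slides) that verifies P(A | E, B, C, D) = P(A | B, C, D) by brute-force summation.

```python
# Verify the pairwise Markov property (A ⊥ E | B, C, D) on the 5-node example.
import itertools
import random

random.seed(1)
vals = [True, False]
cliques = [("A", "B"), ("A", "D"), ("B", "D"), ("D", "E"), ("E", "C"), ("A", "B", "D")]
# One random positive table per clique (1-vertex potentials are left at 1 for brevity).
potentials = [{v: random.uniform(0.5, 5.0) for v in itertools.product(vals, repeat=len(c))}
              for c in cliques]

def joint(x):
    """Unnormalized joint: product of all clique potentials at the assignment x (a dict)."""
    p = 1.0
    for c, phi in zip(cliques, potentials):
        p *= phi[tuple(x[v] for v in c)]
    return p

def prob(query, given):
    """P(query | given) by summing the unnormalized joint over the remaining variables."""
    def total(cond):
        return sum(joint(dict(zip("ABCDE", assignment)))
                   for assignment in itertools.product(vals, repeat=5)
                   if all(dict(zip("ABCDE", assignment))[k] == v for k, v in cond.items()))
    return total({**query, **given}) / total(given)

given = {"B": True, "C": False, "D": True}
lhs = prob({"A": True}, {**given, "E": True})   # P(A = true | E, B, C, D)
rhs = prob({"A": True}, given)                  # P(A = true | B, C, D)
print(lhs, rhs, abs(lhs - rhs) < 1e-9)          # the two values coincide
```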

20 Independence properties in Markov networks Local Markov property A variable is conditionally independent of all other variables given its neighbors: (X_i ⊥ X \ {X_i, neighbors(X_i)}) | neighbors(X_i). 20 / 58

21 Independence properties in Markov networks Example [Figure: the tuberculosis Markov network over TB Patient 1 to TB Patient 4 (variables P_1, P_2, P_3, P_4) from Example 2, shown with its tables of factor potentials.] This Markov Network describes the following local Markov assumptions: (P_1 ⊥ P_4 | P_2, P_3), (P_2 ⊥ P_3 | P_1, P_4). 21 / 58

22 Independence properties in Markov networks Global Markov property Any two subsets of variables, X_A ⊆ X and X_B ⊆ X, are conditionally independent given a separating subset X_S ⊆ X: (X_A ⊥ X_B) | X_S, where every path from a node in X_A to a node in X_B passes through X_S. 22 / 58

23 Example X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 X 13 X 14 X 15 X 16 X 17 X 18 X 19 X 20 X 21 A Markov Network 23 / 58

24 Example X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 X 13 X 14 X 15 X 16 X 17 X 18 X 19 X 20 X 21 Set X S = {X 4, X 11, X 17 } separates sets X A = {X 1, X 2, X 3, X 8, X 9, X 10, X 15, X 16 } and X B = {X 5, X 6, X 7, X 12, X 13, X 14, X 18, X 19, X 20, X 21 } 24 / 58

25 Example X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 X 13 X 14 X 15 X 16 X 17 X 18 X 19 X 20 X 21 Set X S = {X 4, X 11, X 17 } separates sets X A = {X 1, X 2, X 3, X 8, X 9, X 10, X 15, X 16 } and X B = {X 5, X 6, X 7, X 12, X 13, X 14, X 18, X 19, X 20, X 21 } Example: any path between X 8 and X 13 should pass through X 4, X 11 or X 17. 25 / 58

26 Example X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 X 13 X 14 X 15 X 16 X 17 X 18 X 19 X 20 X 21 Set X S = {X 4, X 11, X 17 } separates sets X A = {X 1, X 2, X 3, X 8, X 9, X 10, X 15, X 16 } and X B = {X 5, X 6, X 7, X 12, X 13, X 14, X 18, X 19, X 20, X 21 } Example: any path between X 8 and X 13 should pass through X 4, X 11 or X 17. 26 / 58

27 Example X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 X 13 X 14 X 15 X 16 X 17 X 18 X 19 X 20 X 21 Set X S = {X 4, X 11, X 17 } separates sets X A = {X 1, X 2, X 3, X 8, X 9, X 10, X 15, X 16 } and X B = {X 5, X 6, X 7, X 12, X 13, X 14, X 18, X 19, X 20, X 21 } Example: P (X 8 | X 13, X 4, X 11, X 17 ) = P (X 8 | X 4, X 11, X 17 ), and P (X 13 | X 8, X 4, X 11, X 17 ) = P (X 13 | X 4, X 11, X 17 ). 27 / 58
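
Checking whether a candidate set X_S separates X_A from X_B is a plain graph-reachability question: remove the nodes of X_S and test whether any node of X_A can still reach a node of X_B. A minimal sketch on the small 5-node network from the earlier example (the grid above works the same way, given its adjacency list):

```python
# Check graph separation, the graph-theoretic condition behind the global Markov property.
from collections import deque

edges = [("A", "B"), ("A", "D"), ("B", "D"), ("D", "E"), ("E", "C")]

def separates(edges, S, A, B):
    """True if every path from a node in A to a node in B passes through S,
    i.e. A and B are disconnected once the separator nodes S are removed."""
    S, A, B = set(S), set(A), set(B)
    adj = {}
    for u, v in edges:
        if u not in S and v not in S:       # drop edges touching the separator
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Breadth-first search from A; if we ever reach B, they are not separated.
    seen, queue = set(A), deque(A)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v in B:
                return False
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return True

print(separates(edges, S={"D"}, A={"A", "B"}, B={"C", "E"}))   # True
print(separates(edges, S={"E"}, A={"A", "B"}, B={"C"}))        # True
print(separates(edges, S={"B"}, A={"A"}, B={"C", "E"}))        # False (path A-D-E)
```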

28 Hammersley-Clifford theorem Theorem Let X = {X_1, X_2,..., X_n} be a set of random variables and G a graph that has X as vertices. If X satisfies the global Markov independence property (and the joint distribution is strictly positive), then P(X_1, X_2,..., X_n) = (1/Z) φ_1(C_1) φ_2(C_2) ··· φ_m(C_m), where C = {C_1, C_2,..., C_m} is the set of all the complete subgraphs in G, and φ_1, φ_2,..., φ_m are factor potentials defined over C_1, C_2,..., C_m respectively. In other terms, X is a Markov network. 28 / 58

29 Hammersley-Clifford theorem Theorem Let X = {X_1, X_2,..., X_n} be a set of random variables and G a graph that has X as vertices. If X satisfies the global Markov independence property (and the joint distribution is strictly positive), then P(X_1, X_2,..., X_n) = (1/Z) φ_1(C_1) φ_2(C_2) ··· φ_m(C_m), where C = {C_1, C_2,..., C_m} is the set of all the complete subgraphs in G, and φ_1, φ_2,..., φ_m are factor potentials defined over C_1, C_2,..., C_m respectively. In other terms, X is a Markov network. Global Markov independence ⇒ Factorization into potentials. 29 / 58

30 What exactly are the factor potentials φ_i(C_i)? In the previous examples, the factor potentials φ_i(C_i) look only at the values of the variables contained in clique C_i. For example: φ([Patient 1 has TB = true]) = 30, φ([Patient 1 has TB = false]) = 3, φ([Patient 1 has TB = true, Patient 2 has TB = true]) = 21, φ([Patient 1 has TB = true, Patient 2 has TB = false]) = 2.7. Clearly, we cannot just use the same φ([Patient 1 has TB = true]) for every patient x. Every patient is different. Also, we cannot write down φ([Patient x has TB = true]) for every possible patient x: there are infinitely many patients. 30 / 58

31 What exactly are the factor potentials φ_i(C_i)? In the previous examples, the factor potentials φ_i(C_i) look only at the values of the variables contained in clique C_i. For example: φ([Patient 1 has TB = true]) = 30, φ([Patient 1 has TB = false]) = 3, φ([Patient 1 has TB = true, Patient 2 has TB = true]) = 21, φ([Patient 1 has TB = true, Patient 2 has TB = false]) = 2.7. Clearly, we cannot just use the same φ([Patient 1 has TB = true]) for every patient x. Every patient is different. Also, we cannot write down φ([Patient x has TB = true]) for every possible patient x: there are infinitely many patients. We include side information, or features, regarding each variable and each clique. For example, the medical checkup of Patient x. φ([Patient x has TB = true]) is not directly a function of Patient x, but it is a function of the features of Patient x. 31 / 58

32 Clique features Each clique C_i can be described by a vector of features f_i. Example 1: Clique C_1 = {Patient 1} is described by a vector of features f_1: f_1[0] = Age of Patient 1, f_1[1] = Blood pressure of Patient 1, f_1[2] = Body temperature of Patient 1. Example 2: Clique C_2 = {Patient 1, Patient 2} is described by a vector of features f_2: f_2[0] = Type of interaction between Patients 1 and 2, f_2[1] = Duration of interaction between Patients 1 and 2. 32 / 58

33 Logistic model Typically, we use a log-linear model with feature functions f_i to represent factor potentials φ_i: φ_i(C_i) = exp( Σ_{j=0}^{k} w_{C_i}[j] f_i[j] ), where f_i[j] is the j-th feature of the clique C_i, and w_{C_i}[j] is its corresponding weight (a value in R). w_{C_i}, like φ_i(C_i), depends on the values taken by the variables in clique C_i. 33 / 58

34 Logistic model Typically, we use a log-linear model with feature functions f_i to represent factor potentials φ_i: φ_i(C_i) = exp( Σ_{j=0}^{k} w_{C_i}[j] f_i[j] ), where f_i[j] is the j-th feature of the clique C_i, and w_{C_i}[j] is its corresponding weight (a value in R). w_{C_i}, like φ_i(C_i), depends on the values taken by the variables in clique C_i. Therefore, P(X_1, X_2,..., X_n) = (1/Z) Π_{i=1}^{m} φ_i(C_i) = (1/Z) exp( Σ_{i=1}^{m} Σ_{j=0}^{k} w_{C_i}[j] f_i[j] ). 34 / 58
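
A minimal sketch of one such log-linear potential, where the weight vector is indexed by the joint assignment of the clique's variables (all feature values and weights below are made up for illustration):

```python
# A log-linear factor potential: one weight vector per joint assignment of the
# clique's variables, shared feature vector f_i for the clique.
import math

f_i = [0.3, 1.2, 0.5]                        # features of clique C_i (e.g. patient data)

w = {                                         # one weight vector per clique assignment
    (True, True):   [0.8, 0.1, -0.2],
    (True, False):  [-0.5, 0.3, 0.0],
    (False, True):  [0.2, -0.4, 0.6],
    (False, False): [0.1, 0.1, 0.1],
}

def phi(assignment, features):
    """phi_i(C_i) = exp( sum_j w_{C_i}[j] * f_i[j] ), with w chosen by the assignment."""
    return math.exp(sum(wj * fj for wj, fj in zip(w[assignment], features)))

print(phi((True, False), f_i))
```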

35 Associative Markov Networks Associative Markov Networks are a popular variant of Markov Networks that have been successfully used in computer vision. They are a special variant of Pairwise Markov Networks. In Pairwise Markov Networks, there are two types of factor potentials: potentials φ_node associated with individual variables, and potentials φ_edge associated with edges (links between variables). A logistic model is used to represent the potentials: φ_node(X_i) = exp( Σ_k sign(X_i) w_node[k] f_i[k] ), φ_edge(X_i, X_j) = exp( Σ_k w_edge[k] f_(i,j)[k] ), where sign(X_i) = +1 if X_i = true and sign(X_i) = −1 if X_i = false. 35 / 58

36 Associative Markov Networks Associative Markov Networks are a popular variant of Markov Networks that have been successfully used in computer vision. They are a special variant of Pairwise Markov Networks. In Pairwise Markov Networks, there are two types of factor potentials: potentials φ_node associated with individual variables, and potentials φ_edge associated with edges (links between variables). A logistic model is used to represent the potentials: φ_node(X_i) = exp( Σ_k sign(X_i) w_node[k] f_i[k] ), φ_edge(X_i, X_j) = exp( Σ_k w_edge[k] f_(i,j)[k] ), where sign(X_i) = +1 if X_i = true and sign(X_i) = −1 if X_i = false. The name logistic model comes from the fact that log φ_node(X_i) = Σ_k sign(X_i) w_node[k] f_i[k]. 36 / 58

37 Associative Markov Networks In Associative Markov Networks, edge potentials are defined as φ_edge(X_i, X_j) = exp( Σ_k w_edge[k] f_(i,j)[k] ) if X_i = X_j, and φ_edge(X_i, X_j) = exp(0) = 1 if X_i ≠ X_j. 37 / 58

38 Associative Markov Networks In Associative Markov Networks, edge potentials are defined as φ_edge(X_i, X_j) = exp( Σ_k w_edge[k] f_(i,j)[k] ) if X_i = X_j, and φ_edge(X_i, X_j) = exp(0) = 1 if X_i ≠ X_j. The joint probability distribution is given as P(X) = (1/Z) exp( Σ_{X_i} Σ_k sign(X_i) w_node[k] f_i[k] + Σ_{(X_i, X_j) s.t. X_i = X_j and X_i, X_j are neighbors} Σ_k w_edge[k] f_(i,j)[k] ). 38 / 58
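
The exponent of this distribution is just a sum of weighted node features plus, for every pair of neighbours that agree, weighted edge features. A minimal sketch of that unnormalized log-score on a made-up 3-node path (the features, weights and graph are assumptions, not from the lecture):

```python
# Unnormalized log-score of a labeling under an associative pairwise model.
import numpy as np

def amn_log_score(labels, node_feats, edges, edge_feats, w_node, w_edge):
    """sum_i sign(x_i) * w_node . f_i  +  sum over equal-label edges of w_edge . f_(i,j)"""
    score = 0.0
    for i, f in enumerate(node_feats):
        sign = +1.0 if labels[i] else -1.0
        score += sign * np.dot(w_node, f)
    for (i, j), f in zip(edges, edge_feats):
        if labels[i] == labels[j]:            # associative: only equal labels contribute
            score += np.dot(w_edge, f)
    return score

# Tiny example: 3 nodes on a path 0-1-2, 2-dim node features, constant edge feature.
node_feats = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8]])
edges, edge_feats = [(0, 1), (1, 2)], np.array([[1.0], [1.0]])
w_node, w_edge = np.array([2.0, -1.0]), np.array([0.5])

print(amn_log_score([True, True, False], node_feats, edges, edge_feats, w_node, w_edge))
```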

39 Application of Associative Markov Networks: Image segmentation. [Figure 4: example images and segmentations using PLSA-MRF on the 21-class data set, with topic vectors learned from labeled patches; classes include sky, building, void, flower, aeroplane, grass, bird, book, chair, road, cat, dog, boat, sign, body, bicycle, car, water, face, cow, tree and sheep. A table compares Textonboost, PLSA-MRF/P and PLSA-MRF/I across these classes.] Jakob Verbeek, Bill Triggs. Region Classification with Markov Field Aspect Models. In CVPR 2007. 39 / 58

40 Application of Associative Markov Networks: Image segmentation. [Figure 4 (detail): an example image and its segmentation into classes such as sky, building, void and flower.] Jakob Verbeek, Bill Triggs. Region Classification with Markov Field Aspect Models. In CVPR 2007. 40 / 58

41 Application of Associative Markov Networks: Image segmentation X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 sky 41 / 58

42 Learning Associative Markov Networks Each variable X_i in the network corresponds to a small patch in the image. X_i = true means patch i in the image corresponds to a flower. X_i = false means patch i in the image does not correspond to a flower. We represent the joint probability distribution over all possible values of the variables X_i as P(X) = (1/Z) exp( Σ_{X_i} Σ_k sign(X_i) w_node[k] f_i[k] + Σ_{(X_i, X_j) s.t. X_i = X_j and X_i, X_j are neighbors} Σ_k w_edge[k] f_(i,j)[k] ). The instance that has the highest probability should be the one where variables X_4, X_5, X_11, X_12, X_18 and X_19 are all true while all the remaining variables are false. 42 / 58

43 Application of Associative Markov Networks: Image segmentation X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 sky 43 / 58

44 Application of Associative Markov Networks: Image segmentation Edge features We can use a unique constant feature for the edges (links): f_(i,j)[0] = 1, ∀(i, j) s.t. X_i is a neighbor of X_j. Node features We can use the Histograms of Oriented Gradients (HOG) features to represent the patch corresponding to each variable. f_i[k] counts the number of occurrences of image gradient orientation 2πk/N in the portion i of the image (where N is the maximum number of orientations considered). We can also add RGB colours as features. 44 / 58
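
As a rough illustration (not a full HOG implementation; it omits gradient-magnitude weighting, cells and block normalization), here is a sketch of an orientation-histogram node feature plus mean RGB colour for one patch, together with the constant edge feature:

```python
# Simple patch features in the spirit of the slide: an orientation histogram
# over N bins plus mean RGB, and a single constant edge feature.
import numpy as np

def patch_features(gray_patch, rgb_patch, n_orientations=8):
    """Return [orientation histogram of length N, mean R, mean G, mean B] for one patch."""
    gy, gx = np.gradient(gray_patch.astype(float))
    angles = np.arctan2(gy, gx) % (2 * np.pi)              # orientations in [0, 2*pi)
    bins = (angles / (2 * np.pi) * n_orientations).astype(int) % n_orientations
    hist = np.bincount(bins.ravel(), minlength=n_orientations).astype(float)
    return np.concatenate([hist, rgb_patch.reshape(-1, 3).mean(axis=0)])

def edge_feature(i, j):
    """A single constant edge feature: f_(i,j)[0] = 1 for every pair of neighbours."""
    return np.array([1.0])

# Usage: feats = patch_features(np.random.rand(16, 16), np.random.rand(16, 16, 3))
```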

45 Learning Associative Markov Networks Now that we know what the feature vectors f_i and f_(i,j) are, how can we find their weight vectors w_node and w_edge? We can collect a lot of annotated examples, and find weights w_node and w_edge that maximize the likelihood. The log-likelihood function is concave and can be maximized using gradient ascent. 45 / 58

46 Learning Associative Markov Networks To learn the weights w_node and w_edge, we start by collecting examples of images of a particular object (e.g., flower). Examples of images containing a flower. For each image, we create an Associative Markov Network and label its nodes as true (object detected) or false. Red nodes are variables set to true, blue nodes are variables set to false. 46 / 58

47 Maximum Likelihood Let X_ex be an assignment of values to the variables (nodes) in one of the examples. We have L_Xex(w) = (1/Z_w) exp( Σ_{X_i ∈ X_ex} Σ_k sign(X_i) w_node[k] f_i[k] + Σ_{(X_i, X_j) s.t. X_i = X_j in X_ex and X_i, X_j are neighbors} Σ_k w_edge[k] f_(i,j)[k] ), where w = [w_node, w_edge]. L_Xex(w) is the likelihood of w given X_ex. L_Xex(w) is a function of both w and X_ex. But here, X_ex is known from the annotated example. We want to find w that maximizes L_Xex(w). We need to compute ∇_w L_Xex(w) = ( ∂L_Xex(w)/∂w_node[0], ∂L_Xex(w)/∂w_node[1],..., ∂L_Xex(w)/∂w_edge[0], ∂L_Xex(w)/∂w_edge[1],... ). 47 / 58

48 Maximum Likelihood L_Xex(w) = (1/Z_w) exp( Σ_{X_i ∈ X_ex} Σ_k sign(X_i) w_node[k] f_i[k] + Σ_{(X_i, X_j) s.t. X_i = X_j in X_ex and X_i, X_j are neighbors} Σ_k w_edge[k] f_(i,j)[k] ). ∂L_Xex(w)/∂w_node[k] = L_Xex(w) ( Σ_{X_i ∈ X_ex} sign(X_i) f_i[k] − Σ_X L_X(w) Σ_{X_i ∈ X} sign(X_i) f_i[k] ), where X ranges over arbitrary assignments of values to the variables. ∂L_Xex(w)/∂w_edge[k] = L_Xex(w) ( Σ_{(X_i, X_j) s.t. X_i = X_j in X_ex, X_i, X_j neighbors} f_(i,j)[k] − Σ_X L_X(w) Σ_{(X_i, X_j) s.t. X_i = X_j in X, X_i, X_j neighbors} f_(i,j)[k] ). 48 / 58
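
Both gradients have the same shape: observed feature counts in the annotated example minus feature counts expected under the current model. Here is a minimal sketch (tiny assumed graph and features, brute-force expectation, no regularization) of gradient ascent on the log-likelihood, which is the slides' gradient divided by L_Xex(w):

```python
# Gradient ascent on the log-likelihood of one annotated labeling, brute force.
import itertools
import numpy as np

node_feats = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8]])   # 3 nodes, 2 node features
edges = [(0, 1), (1, 2)]                                       # a 3-node path
edge_feats = np.array([[1.0], [1.0]])                          # constant edge feature

def feature_counts(labels):
    """[sum_i sign(x_i) f_i , sum over equal-label edges of f_(i,j)] for one labeling."""
    g_node = sum((+1.0 if labels[i] else -1.0) * node_feats[i] for i in range(len(labels)))
    g_edge = np.zeros(edge_feats.shape[1])
    for e, (i, j) in enumerate(edges):
        if labels[i] == labels[j]:
            g_edge += edge_feats[e]
    return np.concatenate([g_node, g_edge])

def log_likelihood_gradient(labels_ex, w):
    """Observed feature counts minus their expectation under P(X; w)."""
    observed = feature_counts(labels_ex)
    labelings = list(itertools.product([True, False], repeat=len(node_feats)))
    scores = np.array([w @ feature_counts(l) for l in labelings])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    expected = sum(p * feature_counts(l) for p, l in zip(probs, labelings))
    return observed - expected

w = np.zeros(3)                                # [w_node (2 entries), w_edge (1 entry)]
for _ in range(200):                           # plain gradient ascent on log-likelihood
    w += 0.1 * log_likelihood_gradient((True, True, False), w)
print(w)
```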

49 Most Likely Explanation After we learn the weights of the model using gradient ascent, we are given a completely new image and we are asked to label its nodes as true (e.g., flower) or false. Notice that we do not need to compute the probability of each combination of values; we only need to find the combination that has the maximum probability. For binary variables, this problem can be transformed into finding a minimum cut in a graph, and solved in polynomial time. 49 / 58

50 Inference in Markov Networks We return now to general Markov Networks, and let's say we want to compute P(X_i) for some variable X_i ∈ X. The belief propagation algorithm, a.k.a. the sum-product algorithm, provides an exact answer for tree-structured graphs, and an approximate answer for general graphs (for which the algorithm is known as loopy belief propagation). 50 / 58

51 Factor graphs We have seen that the joint probability is the product of factors φ i, defined over cliques C i. Each factor involves one or more variables. Each variable is involved in one or more factors. 51 / 58

52 Factor graphs 1-vertex cliques: C 1 = {A}, C 2 = {B}, C 3 = {C}, C 4 = {D}, C 5 = {E}. 2-vertex cliques: C 6 = {A, B}, C 7 = {A, D}, C 8 = {B, D}, C 9 = {D, E}, C 10 = {E, C}. 3-vertex cliques: C 11 = {A, B, D}. A B C D E φ 1 φ 2 φ 3 φ 4 φ 5 φ 6 φ 7 φ 8 φ 9 φ 10 φ / 58

53 Factor graphs The sum-product algorithm is based on repeatedly passing messages between variables and factors until convergence. A B C D E φ 1 φ 2 φ 3 φ 4 φ 5 φ 6 φ 7 φ 8 φ 9 φ 10 φ / 58

54 The sum-product algorithm We have two types of messages: a message µ_{X_i→φ} from a variable X_i to a factor φ, and a message µ_{φ→X_i} from a factor φ to a variable X_i. A B C D E µ_{A→φ_1} µ_{A→φ_6} µ_{A→φ_7} µ_{A→φ_11} φ_1 φ_2 φ_3 φ_4 φ_5 φ_6 φ_7 φ_8 φ_9 φ_10 φ_11 54 / 58

55 The sum-product algorithm We have two types of messages: a message µ_{X_i→φ} from a variable X_i to a factor φ, and a message µ_{φ→X_i} from a factor φ to a variable X_i. A B C D E µ_{φ_6→A} µ_{φ_6→B} φ_1 φ_2 φ_3 φ_4 φ_5 φ_6 φ_7 φ_8 φ_9 φ_10 φ_11 55 / 58

56 The sum-product algorithm A message µ_{X_i→φ} from a variable X_i to a factor φ is the product of the messages from all other factors involving X_i (except the recipient): ∀ x_i ∈ domain(X_i): µ_{X_i→φ}(x_i) = Π_{φ_j ≠ φ s.t. X_i ∈ C_j} µ_{φ_j→X_i}(x_i), where the φ_j are the factors of all the cliques that contain X_i, except the factor φ (the recipient). 56 / 58

57 The sum-product algorithm A message µ_{φ→X_i} from factor φ to variable X_i is the product of the factor with the messages from all other variables, marginalized over all variables except the one associated with X_i: ∀ x_i ∈ domain(X_i): µ_{φ→X_i}(x_i) = Σ_{X s.t. X_i = x_i} φ(X) Π_{X_j ∈ C \ {X_i}} µ_{X_j→φ}(x_j), where C is the clique of variables associated with φ, and X ranges over joint values of all the variables in clique C. We sum over all the possible values of the variables in the clique (except for X_i, which is set to x_i). x_j is the value that variable X_j takes in X. 57 / 58

58 The sum-product algorithm The initial messages (at the first iteration) are all equal to 1. Upon convergence (if convergence happened), the estimated marginal distribution of each node is: P(X_i = x_i) ∝ Π_{φ_j s.t. X_i ∈ C_j} µ_{φ_j→X_i}(x_i). The same algorithm can be used for finding the Maximum A Posteriori (MAP) by replacing the sums by maxima. This algorithm is known as Max-Product. 58 / 58
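
On a tree-structured factor graph these messages give exact marginals. A minimal sketch (assumed potential values) on a tiny chain A - φ_1 - B - φ_2 - C, with a brute-force check of the marginal of B:

```python
# Sum-product messages on a chain-structured factor graph A - phi1 - B - phi2 - C,
# compared with brute-force marginalization. Potential values are assumed.
import itertools

vals = [True, False]
phi1 = {(a, b): v for (a, b), v in zip(itertools.product(vals, vals), [21.0, 2.7, 3.6, 7.0])}
phi2 = {(b, c): v for (b, c), v in zip(itertools.product(vals, vals), [1.0, 4.0, 2.0, 0.5])}

# Variable-to-factor messages from the leaves A and C: no other factors, so all ones.
mu_A_to_phi1 = {a: 1.0 for a in vals}
mu_C_to_phi2 = {c: 1.0 for c in vals}

# Factor-to-variable messages into B: multiply the factor by the incoming messages
# and sum out every variable except B.
mu_phi1_to_B = {b: sum(phi1[(a, b)] * mu_A_to_phi1[a] for a in vals) for b in vals}
mu_phi2_to_B = {b: sum(phi2[(b, c)] * mu_C_to_phi2[c] for c in vals) for b in vals}

# Marginal of B: product of the incoming factor messages, then normalize.
unnorm = {b: mu_phi1_to_B[b] * mu_phi2_to_B[b] for b in vals}
Z_b = sum(unnorm.values())
print({b: p / Z_b for b, p in unnorm.items()})

# Brute-force check: P(B = b) is proportional to sum over a, c of phi1(a,b) * phi2(b,c).
brute = {b: sum(phi1[(a, b)] * phi2[(b, c)] for a in vals for c in vals) for b in vals}
Z = sum(brute.values())
print({b: p / Z for b, p in brute.items()})
```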
