Probabilistic Graphical Models
Lecture 10: Undirected Models
CS/CNS/EE 155, Andreas Krause

Announcements
Homework 2 due this Wednesday (Nov 4) in class.
Project milestones due next Monday (Nov 9); about half the work should be done. 4 pages of writeup, NIPS format: http://nips.cc/paperinformation/stylefiles

Markov Networks (a.k.a. Markov Random Fields, Gibbs Distributions)
A Markov Network consists of
an undirected graph, where each node represents a random variable (RV), and
a collection of factors defined over cliques in the graph.
(Figure: grid-structured graph over nodes X_1, ..., X_9.)
A distribution P factorizes over the undirected graph G if it can be written as a normalized product of factors over cliques of G:
P(X_1, ..., X_n) = (1/Z) ∏_c φ_c(X_c), where the partition function Z sums this product over all joint assignments.
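
A minimal brute-force sketch of this factorization (my own illustration, not from the slides), for a hypothetical three-node chain X_1 - X_2 - X_3 with binary variables and made-up factor tables:

```python
import itertools

# Hypothetical pairwise factors for a chain X1 - X2 - X3 (binary variables).
# phi[(i, j)][(xi, xj)] is the factor value for edge (Xi, Xj).
phi = {
    (1, 2): {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
    (2, 3): {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
}

def unnormalized(x1, x2, x3):
    """Product of the clique factors for one joint assignment."""
    return phi[(1, 2)][(x1, x2)] * phi[(2, 3)][(x2, x3)]

# Partition function Z: sum the unnormalized product over all assignments.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))

# Normalized joint probability of one assignment.
p_000 = unnormalized(0, 0, 0) / Z
print(f"Z = {Z}, P(X1=0, X2=0, X3=0) = {p_000:.4f}")
```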

Computing Joint Probabilities
Computing joint probabilities in BNs: multiply the relevant CPD entries, one per variable.
Computing joint probabilities in Markov Nets: multiply the clique factors, then normalize by the partition function Z.
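
The two computations the slide contrasts can be written as follows (standard forms, filled in here because the slide's formulas did not survive transcription):

P_{BN}(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}(x_i))

i.e., a product of locally normalized CPD entries, so a single joint probability is cheap to evaluate, whereas

P_{MN}(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c} \phi_c(x_c), \qquad Z = \sum_{x'} \prod_{c} \phi_c(x'_c)

requires the partition function Z, which sums over all joint assignments.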

Local Markov Assumption for MNs
(Figure: grid of nodes X_1, ..., X_9.)
The Markov blanket MB(X) of a node X is the set of neighbors of X.
Local Markov Assumption: X ⊥ EverythingElse | MB(X).
I_loc(G) = set of all local independencies.
G is called an I-map of distribution P if I_loc(G) ⊆ I(P).

Factorization Theorem for Markov Nets
(Figure: example graph over nodes s_1, ..., s_12.)
If the true distribution P can be represented exactly as a Markov net (G, P), then I_loc(G) ⊆ I(P), i.e., G is an I-map of P (independence map).

Factorization Theorem for Markov Nets: Hammersley-Clifford Theorem
(Figure: example graph over nodes s_1, ..., s_12.)
Conversely, if I_loc(G) ⊆ I(P), i.e., G is an I-map of P (independence map), and P > 0, then the true distribution P can be represented exactly as a Markov net (G, P).

Global independencies
(Figure: grid of nodes X_1, ..., X_9.)
A trail X - X_1 - ... - X_m - Y is called active for evidence E if none of X_1, ..., X_m is in E.
Variables X and Y are called separated by E if there is no active trail for E connecting X and Y. Write sep(X, Y | E).
I(G) = {X ⊥ Y | E : sep(X, Y | E)}
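
A minimal sketch (mine, not from the lecture) of checking sep(X, Y | E): X and Y are separated by E exactly when Y is unreachable from X once the evidence nodes are removed from the graph.

```python
from collections import deque

def separated(adj, x, y, evidence):
    """True if x and y are separated by the evidence set in the
    undirected graph given as an adjacency dict {node: [neighbors]}."""
    evidence = set(evidence)
    visited, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr in evidence or nbr in visited:
                continue  # evidence nodes block every trail through them
            if nbr == y:
                return False  # found an active trail connecting x and y
            visited.add(nbr)
            queue.append(nbr)
    return True

# Chain X1 - X2 - X3: X1 and X3 are separated by {X2}, but not by the empty set.
adj = {1: [2], 2: [1, 3], 3: [2]}
print(separated(adj, 1, 3, {2}))    # True
print(separated(adj, 1, 3, set()))  # False
```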

Soundness of separation
Know: for positive distributions P > 0, I_loc(G) ⊆ I(P) if and only if P factorizes over G.
Theorem (soundness of separation): for positive distributions P > 0, if I_loc(G) ⊆ I(P), then I(G) ⊆ I(P).
Hence, separation captures only true independencies.
How about I(G) = I(P)?

Completeness of separation
Theorem (completeness of separation): I(G) = I(P) for almost all distributions P that factorize over G.
"Almost all": except for a set of potential parameterizations of measure 0 (assuming no finite set has positive measure).

Minimal I-maps
For BNs: the minimal I-map is not unique.
(Figure: three different Bayes nets over the variables E, B, A, J, M, each a minimal I-map for the same distribution.)
For MNs: for positive P, the minimal I-map is unique!

Do P-maps always exist?
For BNs: no, P-maps do not always exist.
How about Markov nets?

Exact inference in MNs
Variable elimination and junction tree inference work exactly the same way!
Need to construct junction trees by obtaining a chordal graph through triangulation.

Pairwise MNs
A pairwise MN is an MN where all factors are defined over single variables or pairs of variables.
Can reduce any MN to a pairwise MN!
(Figure: example graph over nodes X_1, ..., X_5.)

Logarithmic representation
Can represent any positive distribution in the log domain.
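
Spelled out (standard form; the slide's equation was lost in transcription): since P > 0, every factor is strictly positive and can be written as an exponential,

\phi_c(x_c) = \exp\big(\epsilon_c(x_c)\big), \qquad P(x) = \frac{1}{Z} \exp\Big(\sum_{c} \epsilon_c(x_c)\Big),

so the product of factors becomes a sum of terms in the log domain.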

Log-linear models
Feature functions φ_i(D_i) defined over cliques.
Log-linear model over undirected graph G:
feature functions φ_1(D_1), ..., φ_k(D_k); domains D_i can overlap;
set of weights w_i learnt from data.
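
Putting these pieces together, the log-linear model over G has the standard form (reconstructed here since the slide's formula is missing):

P(x \mid w) = \frac{1}{Z(w)} \exp\Big(\sum_{i=1}^{k} w_i \, \phi_i(x_{D_i})\Big), \qquad Z(w) = \sum_{x'} \exp\Big(\sum_{i=1}^{k} w_i \, \phi_i(x'_{D_i})\Big).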

Converting BNs to MNs
(Figure: student Bayes net over variables C, D, I, G, S, L, H, J and its moralized graph.)
Theorem: the moralized Bayes net is a minimal Markov I-map.

Converting MNs to BNs
(Figure: example Markov net over nodes X_1, ..., X_9.)
Theorem: a minimal Bayes I-map for an MN must be chordal.

So far
Markov network representation: local/global Markov assumptions; separation; soundness and completeness of separation.
Markov network inference: variable elimination and junction tree inference work exactly as in Bayes nets.
How about learning Markov nets?

Parameter Learning for Bayes Nets
MLE decomposes over the network: each CPT P(X_i | Pa(X_i)) can be estimated separately from counts in the data.

Algorithm for BN MLE
Estimate each conditional probability by the relative frequency of the child value given each parent assignment.
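
The algorithm itself did not survive transcription; the following is a minimal sketch of the standard counting-based MLE for a single CPT P(X | Pa(X)), using made-up variable names.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child | parents) by relative frequencies.
    data: list of dicts mapping variable name -> value."""
    joint = Counter()   # counts of (parent assignment, child value)
    parent = Counter()  # counts of parent assignment alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        parent[pa] += 1
    return {key: cnt / parent[key[0]] for key, cnt in joint.items()}

# Hypothetical dataset over binary variables A (parent of B).
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 1, "B": 1}, {"A": 0, "B": 0}]
print(mle_cpt(data, child="B", parents=["A"]))
# -> {((1,), 1): 0.667, ((1,), 0): 0.333, ((0,), 0): 1.0}  (approximately)
```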

MLE for Markov Nets
Log-likelihood of the data.
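
Written out (standard form, reconstructed), for data D = {x^(1), ..., x^(N)} and a Markov net with factors φ_c parameterized by θ:

\ell(\theta; \mathcal{D}) = \sum_{j=1}^{N} \sum_{c} \log \phi_c\big(x_c^{(j)}; \theta\big) \;-\; N \log Z(\theta).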

Log-likelihood doesn't decompose
The log-likelihood l(θ; D) is a concave function!
But the log partition function log Z(θ) doesn't decompose.

Derivative of log-likelihood

Derivative of log-likelihood (cont'd)
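
In the log-linear parameterization, the derivative these two slides work out takes the standard "empirical counts minus expected counts" form:

\frac{\partial \ell}{\partial w_i} = \sum_{j=1}^{N} \phi_i\big(x^{(j)}\big) - N \, \mathbb{E}_{P(\cdot \mid w)}\big[\phi_i(X)\big] = N \Big( \hat{\mathbb{E}}[\phi_i] - \mathbb{E}_{P(\cdot \mid w)}[\phi_i] \Big).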

Computing the derivative
(Figure: student network over variables C, D, I, G, S, L, H, J.)
Computing P(c_i | θ) requires inference!
Can optimize using conjugate gradient, etc.

Alternative approach: Iterative Proportional Fitting (IPF)
At the optimum, the model's clique marginals must match the empirical clique marginals.
Solve the resulting fixed-point equation by iteration; the marginals (and hence parameters) must be recomputed every iteration.
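
The fixed-point condition and update can be written as follows (standard IPF, reconstructed): at the optimum, each clique marginal of the model matches the empirical marginal,

P_\theta(x_c) = \hat{P}(x_c) \quad \text{for every clique } c,

and IPF repeatedly rescales one clique potential at a time,

\phi_c^{(t+1)}(x_c) = \phi_c^{(t)}(x_c) \, \frac{\hat{P}(x_c)}{P_{\theta^{(t)}}(x_c)},

where computing P_{\theta^{(t)}}(x_c) requires inference in the current model.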

Parameter learning for log-linear models
Feature functions φ_i(C_i) defined over cliques; domains C_i can overlap.
Log-linear model over undirected graph G with feature functions φ_1(C_1), ..., φ_k(C_k).
Joint distribution: P(x | w) = (1/Z(w)) exp(Σ_i w_i φ_i(C_i)).
How do we get the weights w_i?

Derivative of Log-likelihood 1

Derivative of Log-likelihood 2

Optimizing parameters
Gradient of the log-likelihood: empirical feature expectations minus model feature expectations.
Thus, w is the MLE exactly when the model's expected feature counts match the empirical feature counts.
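
A minimal sketch (mine, not from the course) of this gradient-ascent MLE for a tiny log-linear model; the expectations are computed by brute-force enumeration so no inference machinery is needed, and the features and data are made up.

```python
import itertools
import math

# Hypothetical log-linear model over three binary variables with two features.
features = [
    lambda x: float(x[0] == x[1]),  # phi_1: X1 and X2 agree
    lambda x: float(x[1] == x[2]),  # phi_2: X2 and X3 agree
]
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 0, 1)]

def model_expectations(w):
    """E_{P(.|w)}[phi_i], by enumeration (only feasible for tiny models)."""
    states = list(itertools.product([0, 1], repeat=3))
    scores = [math.exp(sum(wi * f(x) for wi, f in zip(w, features))) for x in states]
    Z = sum(scores)
    return [sum(s * f(x) for s, x in zip(scores, states)) / Z for f in features]

def fit(w, lr=0.5, iters=200):
    emp = [sum(f(x) for x in data) / len(data) for f in features]  # empirical E[phi_i]
    for _ in range(iters):
        mod = model_expectations(w)
        # Gradient of the average log-likelihood: empirical minus model expectations.
        w = [wi + lr * (e - m) for wi, e, m in zip(w, emp, mod)]
    return w

print(fit([0.0, 0.0]))  # converges to roughly [log 3, 0] = [1.10, 0.0]
```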

Regularization of parameters
Put a prior on the parameters w (e.g., a Gaussian prior) and do MAP estimation.
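
With a zero-mean Gaussian prior on w (the usual choice), MAP estimation simply adds an L2 penalty to the log-likelihood:

\max_{w} \; \ell(w; \mathcal{D}) - \frac{1}{2\sigma^2} \sum_{i} w_i^2.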

Summary: parameter learning in MNs
MLE in BNs is easy (the score decomposes).
MLE in MNs requires inference (the score doesn't decompose).
Can optimize using gradient ascent or IPF.

Tasks
Read Koller & Friedman, Chapters 20.1-20.3 and 4.6.1.