Graphical Models and Kernel Methods
Jerry Zhu
Department of Computer Sciences, University of Wisconsin–Madison, USA
MLSS, June 17, 2014

Outline
Graphical Models
  Probabilistic Inference
  Directed vs. Undirected Graphical Models
  Inference
  Parameter Estimation
Kernel Methods
  Support Vector Machines
  Kernel PCA
  Reproducing Kernel Hilbert Spaces

The envelope quiz
Setup (as encoded by the probabilities below): one envelope contains a red ball and a black ball, the other contains two black balls; the red ball = $$$.
You randomly picked an envelope, randomly took out a ball, and it was black.
Should you choose this envelope or the other envelope?

The envelope quiz: probabilistic inference
Joint distribution on E ∈ {1, 0}, B ∈ {r, b}: P(E, B) = P(E) P(B | E)
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2,  P(B = r | E = 0) = 0
The graphical model: E → B
Statistical decision theory: switch if P(E = 1 | B = b) < 1/2
P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 · 1/2) / (3/4) = 1/3.  Switch.
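
The arithmetic above is easy to mechanize. Below is a minimal Python sketch of the same Bayes-rule computation (the dictionary encoding and variable names are my own):

```python
# Envelope quiz: compute P(E=1 | B=b) by Bayes rule from the CPDs above.
p_E = {1: 0.5, 0: 0.5}                    # P(E)
p_r_given_E = {1: 0.5, 0: 0.0}            # P(B=r | E)
p_b_given_E = {e: 1.0 - p for e, p in p_r_given_E.items()}  # P(B=b | E)

# P(B=b) = sum_e P(B=b | E=e) P(E=e) = 3/4
p_b = sum(p_b_given_E[e] * p_E[e] for e in p_E)

posterior = p_b_given_E[1] * p_E[1] / p_b
print(posterior)   # 0.333... < 1/2, so switch
```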

Reasoning with uncertainty
The world is reduced to a set of random variables x_1, …, x_d
  e.g. (x_1, …, x_{d−1}) a feature vector, x_d ≡ y the class label
Inference: given the joint distribution p(x_1, …, x_d), compute p(X_Q | X_E), where X_Q ∪ X_E ⊆ {x_1, …, x_d}
  e.g. Q = {d}, E = {1, …, d−1}; by the definition of conditional probability,
  p(x_d | x_1, …, x_{d−1}) = p(x_1, …, x_{d−1}, x_d) / ∑_v p(x_1, …, x_{d−1}, x_d = v)
Learning: estimate p(x_1, …, x_d) from training data X^(1), …, X^(N), where X^(i) = (x_1^(i), …, x_d^(i))

It is difficult to reason with uncertainty
The joint distribution p(x_1, …, x_d):
  exponential naive storage (2^d for binary r.v.)
  hard to interpret (conditional independence)
Inference p(X_Q | X_E):
  often can't afford to do it by brute force
If p(x_1, …, x_d) is not given, estimate it from data:
  often can't afford to do it by brute force
Graphical models: efficient representation, inference, and learning on p(x_1, …, x_d), exactly or approximately

What are graphical models?
Graphical model = joint distribution p(x_1, …, x_d)
  Bayesian network or Markov random field
  conditional independence
Inference = p(X_Q | X_E), in general X_Q ∪ X_E ⊆ {x_1, …, x_d}
  exact, MCMC, variational
If p(x_1, …, x_d) is not given, estimate it from data
  parameter and structure learning

Graphical-Model-Nots
Graphical models are the study of probabilistic models.
Just because there are nodes and edges doesn't mean it's a graphical model.
These are not graphical models: neural networks, decision trees, network flows, HMM templates (but HMMs are!).

Directed graphical models

Also called Bayesian networks.
A directed graph has nodes x_1, …, x_d, some of them connected by directed edges x_i → x_j.
A cycle is a directed path x_1 → … → x_k where x_1 = x_k.
A directed acyclic graph (DAG) contains no cycles.

A Bayesian network on the DAG is the family of distributions
  { p : p(x_1, …, x_d) = ∏_i p(x_i | Pa(x_i)) }
where Pa(x_i) is the set of parents of x_i.
p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i.
By specifying the CPDs for all i, we specify the joint distribution p(x_1, …, x_d).

Example: Burglary, Earthquake, Alarm, John and Mary
Binary variables, with edges B → A ← E, A → J, A → M:
  P(B) = 0.001    P(E) = 0.002
  P(A | B, E) = 0.95    P(A | B, ¬E) = 0.94    P(A | ¬B, E) = 0.29    P(A | ¬B, ¬E) = 0.001
  P(J | A) = 0.9    P(J | ¬A) = 0.05
  P(M | A) = 0.7    P(M | ¬A) = 0.01
P(B, ¬E, A, J, ¬M) = P(B) P(¬E) P(A | B, ¬E) P(J | A) P(¬M | A)
                   = 0.001 · (1 − 0.002) · 0.94 · 0.9 · (1 − 0.7) ≈ 0.000253
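
The factored joint is straightforward to evaluate in code. Here is a minimal sketch (the dictionary encoding of the CPTs is my own), reused in later examples:

```python
# Alarm network: joint probability via the chain rule over the DAG.
P_B = 0.001
P_E = 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.9, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}                                          # P(M=1 | A)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the five CPDs."""
    p  = P_B if b else 1 - P_B
    p *= P_E if e else 1 - P_E
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

print(joint(1, 0, 1, 1, 0))   # ≈ 0.000253
```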

Example: Naive Bayes
y → x_1, …, x_d (drawn with plate notation in the original slide):
  p(y, x_1, …, x_d) = p(y) ∏_{i=1}^d p(x_i | y)
p(y): multinomial
p(x_i | y) depends on the feature type: multinomial (count x_i), Gaussian (continuous x_i), etc.
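
For concreteness, here is a small sketch of naive Bayes prediction with binary features; the toy CPD numbers are invented purely for illustration:

```python
# Naive Bayes: p(y | x) ∝ p(y) ∏_i p(x_i | y), with binary features.
p_y = {0: 0.6, 1: 0.4}                      # class prior (toy numbers)
p_xi_given_y = {0: [0.2, 0.7, 0.5],         # p(x_i = 1 | y = 0), i = 1..3
                1: [0.8, 0.3, 0.5]}         # p(x_i = 1 | y = 1)

def posterior(x):
    """Return p(y | x) by normalizing p(y) ∏_i p(x_i | y)."""
    scores = {}
    for y in p_y:
        s = p_y[y]
        for i, xi in enumerate(x):
            pi = p_xi_given_y[y][i]
            s *= pi if xi else 1 - pi
        scores[y] = s
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(posterior([1, 0, 1]))
```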

No Causality Whatsoever
Left network: A → B with P(A) = a, P(B | A) = b, P(B | ¬A) = c.
Right network: B → A with P(B) = ab + (1 − a)c, P(A | B) = ab / (ab + (1 − a)c), P(A | ¬B) = a(1 − b) / (1 − ab − (1 − a)c).
The two BNs are equivalent in all respects.
Do not read causality from Bayesian networks; they only represent correlation (a joint probability distribution).
However, it is perfectly fine to design BNs causally.

What do we need probabilistic models for?
Make predictions: p(y | x) plus decision theory.
Interpret models: very natural to include latent variables.

Example: Latent Dirichlet Allocation (LDA)
A generative model for p(φ, θ, z, w | α, β):
  For each topic t: φ_t ∼ Dirichlet(β)
  For each document d: θ_d ∼ Dirichlet(α)
    For each word position in d:
      topic z ∼ Multinomial(θ_d)
      word w ∼ Multinomial(φ_z)
Inference goals: p(z | w, α, β) and argmax_{φ,θ} p(φ, θ | w, α, β)

Conditional Independence
Two r.v.s A, B are independent if P(A, B) = P(A) P(B); equivalently, P(A | B) = P(A), or P(B | A) = P(B).
Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C) P(B | C); equivalently, P(A | B, C) = P(A | C), or P(B | A, C) = P(B | C).
This extends to groups of r.v.s.
Conditional independence in a BN is precisely specified by d-separation ("directed separation").

d-separation Case 1: Tail-to-Tail
Structure A ← C → B.
A, B are in general dependent.
A, B are conditionally independent given C (observed nodes are shaded in the figures).
An observed C is a tail-to-tail node; it blocks the undirected path A–B.

d-separation Case 2: Head-to-Tail
Structure A → C → B.
A, B are in general dependent.
A, B are conditionally independent given C.
An observed C is a head-to-tail node; it blocks the path A–B.

d-separation Case 3: Head-to-Head
Structure A → C ← B.
A, B are in general independent.
A, B are conditionally dependent given C, or given any of C's descendants.
An observed C is a head-to-head node; it unblocks the path A–B.

d-separation
Variable groups A and B are conditionally independent given C if all undirected paths from nodes in A to nodes in B are blocked.

d-separation Example 1
Graph: A → E ← F → B, with an observed descendant C of E.
The undirected path from A to B is unblocked at E (head-to-head, because its descendant C is observed) and is not blocked at F (tail-to-tail, unobserved).
A, B are dependent given C.

d-separation Example 2
Same graph, but now F is observed instead of C.
The path from A to B is blocked both at E (head-to-head, with no observed descendant) and at F (tail-to-tail, observed).
A, B are conditionally independent given F.

Undirected graphical models

Also known as Markov random fields.
Recall that directed graphical models require a DAG and locally normalized CPDs: efficient computation, but restrictive.
A clique C in an undirected graph is a set of fully connected nodes (full of loops!).
Define a nonnegative potential function ψ_C : X_C → ℝ⁺.
An undirected graphical model is the family of distributions
  { p : p(X) = (1/Z) ∏_C ψ_C(X_C) }
where Z = ∫ ∏_C ψ_C(X_C) dX is the partition function.

Example: A Tiny Markov Random Field
Two nodes x_1 — x_2 with x_1, x_2 ∈ {−1, 1}, forming a single clique with ψ_C(x_1, x_2) = e^{a x_1 x_2}:
  p(x_1, x_2) = (1/Z) e^{a x_1 x_2}
  Z = e^a + e^{−a} + e^{−a} + e^a = 2e^a + 2e^{−a}
  p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})
  p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})
When the parameter a > 0, the model favors homogeneous chains; when a < 0, inhomogeneous chains.
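
Since there are only four joint states, the partition function can be checked by direct enumeration; a quick sketch:

```python
import math

# Tiny MRF: p(x1, x2) = exp(a * x1 * x2) / Z on x ∈ {-1, +1}^2.
a = 1.0   # a > 0 favors x1 == x2

states = [(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)]
Z = sum(math.exp(a * x1 * x2) for x1, x2 in states)   # 2e^a + 2e^{-a}
p = {s: math.exp(a * s[0] * s[1]) / Z for s in states}
print(p)   # the equal-valued pairs get the larger probability
```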

Log-Linear Models
Real-valued feature functions f_1(X), …, f_k(X) with real-valued weights w_1, …, w_k:
  p(X) = (1/Z) exp( ∑_{i=1}^k w_i f_i(X) )
Equivalent to an MRF p(X) = (1/Z) ∏_C ψ_C(X_C) with ψ_C(X_C) = exp(w_C f_C(X_C)).

Example: Ising Model
An undirected model on a grid with x_s ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
Features: f_s(X) = x_s, f_st(X) = x_s x_t

Example: Image Denoising [From Bishop PRML]
Given a noisy image Y, recover the clean image as argmax_X P(X | Y), with
  p_θ(X | Y) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
where θ_s = c if y_s = 1 and θ_s = −c if y_s = 0, and θ_st > 0.

Example: Gaussian Random Field
p(X) = N(µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2) (X − µ)ᵀ Σ⁻¹ (X − µ) )
A multivariate Gaussian; the n × n covariance matrix Σ is positive semi-definite.
Let Ω = Σ⁻¹ be the precision matrix.
x_i, x_j are conditionally independent given all other variables if and only if Ω_ij = 0.
When Ω_ij ≠ 0, there is an edge between x_i and x_j.
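
A quick numerical illustration of this fact (the particular tridiagonal Ω below is my own choice, encoding a chain x_1 — x_2 — x_3):

```python
import numpy as np

# Chain-structured Gaussian: Ω is tridiagonal (Ω_13 = 0, so x1 ⟂ x3 | x2),
# yet the covariance Σ = Ω^{-1} is dense (x1 and x3 are marginally dependent).
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])   # positive definite

Sigma = np.linalg.inv(Omega)
print(Omega[0, 2])   # 0.0: no edge between x1 and x3
print(Sigma)         # all entries nonzero
```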

Conditional Independence in Markov Random Fields
Two groups of variables A, B are conditionally independent given another group C if A and B become disconnected after removing C and all edges involving C.

Exact Inference

Inference by Enumeration
Let X = (X_Q, X_E, X_O) for query, evidence, and other variables. Goal: P(X_Q | X_E).
  P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = ∑_{X_O} P(X_Q, X_E, X_O) / ∑_{X_Q, X_O} P(X_Q, X_E, X_O)
This sums an exponential number of terms: with k variables in X_O, each taking r values, there are r^k terms.
Not covered: variable elimination and junction tree (aka clique tree).
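
A brute-force sketch of this for small binary models, reusing the joint() sketched earlier for the alarm network:

```python
from itertools import product

def enumerate_query(joint, n_vars, query_idx, evidence):
    """P(x_query = 1 | evidence) by summing the joint over all assignments.
    evidence: dict {variable index: observed value}."""
    num = den = 0.0
    for x in product((0, 1), repeat=n_vars):     # r^k terms: exponential!
        if any(x[i] != v for i, v in evidence.items()):
            continue
        p = joint(*x)
        den += p
        if x[query_idx] == 1:
            num += p
    return num / den

# e.g. P(B = 1 | J = 1, M = 1) on the alarm network (variable order B, E, A, J, M):
# enumerate_query(joint, 5, 0, {3: 1, 4: 1})
```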

Markov Chain Monte Carlo

Forward sampling
Gibbs sampling
Collapsed Gibbs sampling
Not covered: block Gibbs, Metropolis-Hastings, etc.
Unbiased (after burn-in), but can have high variance.

Monte Carlo Methods
Consider the inference problem p(x_Q = c_Q | X_E), where X_Q ∪ X_E ⊆ {x_1, …, x_d}:
  p(x_Q = c_Q | X_E) = ∫ 1_(x_Q = c_Q) p(x_Q | X_E) dx_Q
If we can draw samples x_Q^(1), …, x_Q^(m) ∼ p(x_Q | X_E), an unbiased estimator is
  p(x_Q = c_Q | X_E) ≈ (1/m) ∑_{i=1}^m 1_(x_Q^(i) = c_Q)
The variance of the estimator decreases as O(1/m).
Inference reduces to sampling from p(x_Q | X_E).
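
The estimator is just a sample average of indicators. A trivial sketch with a stand-in sampler (the target value 0.3 is made up for illustration):

```python
import random

def sampler():
    """Stand-in for a draw x_Q ~ p(x_Q | X_E); here p(x_Q = 1 | X_E) = 0.3."""
    return 1 if random.random() < 0.3 else 0

m = 100_000
estimate = sum(sampler() for _ in range(m)) / m
print(estimate)   # ≈ 0.3; the estimator's variance shrinks as O(1/m)
```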

Forward Sampling
Draw X ∼ P(X); throw away X if it doesn't match the evidence X_E.

Forward Sampling: Example
To generate a sample X = (B, E, A, J, M) from the alarm network above:
1. Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1, else B = 0
2. Sample E ∼ Ber(0.002)
3. If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on for the other (B, E) combinations
4. If A = 1, sample J ∼ Ber(0.9); else J ∼ Ber(0.05)
5. If A = 1, sample M ∼ Ber(0.7); else M ∼ Ber(0.01)
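
A minimal sketch of this ancestral sampler (function names are my own):

```python
import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)

def bern(p):
    """One Bernoulli(p) draw via the uniform-draw rule above."""
    return 1 if random.random() < p else 0

def sample_alarm():
    """One forward (ancestral) sample from the alarm network, parents first."""
    B = bern(0.001)
    E = bern(0.002)
    A = bern(P_A[(B, E)])
    J = bern(0.9 if A else 0.05)
    M = bern(0.7 if A else 0.01)
    return B, E, A, J, M
```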

Inference with Forward Sampling
Say the inference task is P(B = 1 | E = 1, M = 1).
Throw away all samples except those with (E = 1, M = 1); then
  p(B = 1 | E = 1, M = 1) ≈ (1/m) ∑_{i=1}^m 1_(B^(i) = 1)
where m is the number of surviving samples.
Can be highly inefficient (note that P(E = 1) is tiny).
Does not work for Markov random fields (we cannot sample from P(X) directly).
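
Continuing the sketch, the rejection step makes the inefficiency visible: with P(E = 1) = 0.002, almost every sample is discarded.

```python
# Rejection estimate of P(B = 1 | E = 1, M = 1), reusing sample_alarm() above.
kept = hits = 0
for _ in range(5_000_000):
    B, E, A, J, M = sample_alarm()
    if E == 1 and M == 1:          # keep only samples matching the evidence
        kept += 1
        hits += B
print(kept, hits / kept if kept else float("nan"))
```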

Gibbs Sampling: Example P(B = 1 | E = 1, M = 1)
Gibbs sampling is a Markov chain Monte Carlo (MCMC) method.
It samples directly from p(X_Q | X_E), and works for both directed and undirected graphical models.
Initialization: fix the evidence and randomly set the other variables, e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)

Gibbs Sampling
For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ∼ P(x_i | X_{−i}).
This is equivalent to x_i ∼ P(x_i | MarkovBlanket(x_i)).
For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children:
  P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y ∈ C(x_i)} P(y | Pa(y))
where Pa(x) are the parents of x and C(x) the children of x.
For many graphical models the Markov blanket is small.
For example, B ∼ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1).

Say we sampled B = 1; then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1).
Starting from X^(1), sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2).
Move on to J, then repeat B, A, J, B, A, J, …
Keep all samples after burn-in. P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.
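
Putting the updates together, here is a sketch of the whole chain for this query (Markov-blanket formulas as above, constants from the alarm CPTs; the loop sizes are arbitrary choices):

```python
import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def gibbs_alarm(n_iters=200_000, burn_in=10_000):
    """Gibbs chain estimating P(B = 1 | E = 1, M = 1); E and M are clamped."""
    E = 1                           # evidence (M = 1 enters via P(M=1 | A) below)
    B, A, J = 0, 0, 0               # arbitrary init of non-evidence variables
    count = total = 0
    for t in range(n_iters):
        # B ~ P(B | blanket) ∝ P(B) P(A | B, E)
        s1 = 0.001 * (P_A[(1, E)] if A else 1 - P_A[(1, E)])
        s0 = 0.999 * (P_A[(0, E)] if A else 1 - P_A[(0, E)])
        B = 1 if random.random() < s1 / (s1 + s0) else 0
        # A ~ P(A | blanket) ∝ P(A | B, E) P(J | A) P(M = 1 | A)
        s1 = P_A[(B, E)] * (0.9 if J else 0.1) * 0.7
        s0 = (1 - P_A[(B, E)]) * (0.05 if J else 0.95) * 0.01
        A = 1 if random.random() < s1 / (s1 + s0) else 0
        # J ~ P(J | A)   (J's Markov blanket is just its parent A)
        J = 1 if random.random() < (0.9 if A else 0.05) else 0
        if t >= burn_in:            # keep everything after burn-in; do not thin
            count += B
            total += 1
    return count / total

print(gibbs_alarm())
```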

Gibbs Sampling Example 2: The Ising Model
An undirected model on a grid with x_s ∈ {0, 1}; node x_s has neighbors A, B, C, D:
  p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )

The Markov blanket of x_s is its set of neighbors A, B, C, D.
In general, for undirected graphical models, p(x_s | x_{−s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
The Gibbs update is
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
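
The update is a sigmoid of the local field. A sketch of one Gibbs sweep over a grid (4-neighborhood, shared coupling θ_st; the parameter values are illustrative):

```python
import numpy as np

def gibbs_sweep(x, theta_s, theta_st):
    """One Gibbs sweep over a grid Ising model with x ∈ {0, 1}.
    theta_s: per-node field (2-D array); theta_st: shared coupling (scalar)."""
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= a < H and 0 <= b < W)       # 4-neighborhood sum
            # p(x_s = 1 | neighbors) = sigmoid(theta_s + theta_st * nb)
            p1 = 1.0 / (1.0 + np.exp(-(theta_s[i, j] + theta_st * nb)))
            x[i, j] = np.random.rand() < p1
    return x

# Denoising-style usage: theta_s = +c where y_s = 1 and -c where y_s = 0.
y = (np.random.rand(16, 16) < 0.5).astype(int)
x = gibbs_sweep(y.copy(), theta_s=np.where(y == 1, 1.0, -1.0), theta_st=0.8)
```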

Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T(X′ | X).
Certain Markov chains have a stationary distribution π such that π = Tπ.
The Gibbs sampler is such a Markov chain, with T_i((X_{−i}, x_i) → (X_{−i}, x_i′)) = p(x_i′ | X_{−i}) and stationary distribution p(X_Q | X_E).
But it takes time for the chain to reach its stationary distribution (to mix), and it can be difficult to assert mixing.
In practice, "burn in": discard X^(0), …, X^(T).
Use all of X^(T+1), … for inference (they are correlated); do not thin.

Collapsed Gibbs Sampling
In general, E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X^(i)) for X^(i) ∼ p.
Sometimes X = (Y, Z) where E_{Z|Y} has a closed form. If so, for Y^(i) ∼ p(Y),
  E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) ∑_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]
No need to sample Z: it is collapsed (marginalized out).
Collapsed Gibbs sampler: T_i((Y_{−i}, y_i) → (Y_{−i}, y_i′)) = p(y_i′ | Y_{−i})
Note p(y_i′ | Y_{−i}) = ∫ p(y_i′, Z | Y_{−i}) dZ.

Example: Collapsed Gibbs Sampling for LDA
Collapse θ, φ; the Gibbs update is
  P(z_i = j | z_{−i}, w) ∝ ( (n_{−i,j}^(w_i) + β) / (n_{−i,j}^(·) + Wβ) ) · ( (n_{−i,j}^(d_i) + α) / (n_{−i,·}^(d_i) + Tα) )
n_{−i,j}^(w_i): number of times word w_i has been assigned to topic j, excluding the current position
n_{−i,j}^(d_i): number of times a word from document d_i has been assigned to topic j, excluding the current position
n_{−i,j}^(·): number of times any word has been assigned to topic j, excluding the current position
n_{−i,·}^(d_i): length of document d_i, excluding the current position
(W is the vocabulary size and T the number of topics.)
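
A compact sketch of this sampler (the count-array names are my own; the document-length denominator is constant in j, so it cancels under normalization):

```python
import numpy as np

def collapsed_gibbs_lda(docs, W, T, alpha, beta, n_iters=100):
    """Collapsed Gibbs for LDA. docs: list of lists of word ids in [0, W)."""
    n_wt = np.zeros((W, T))            # word-topic counts
    n_dt = np.zeros((len(docs), T))    # document-topic counts
    n_t  = np.zeros(T)                 # topic totals
    z = []                             # topic assignment per word position
    for d, doc in enumerate(docs):     # random initialization
        zd = np.random.randint(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]            # remove the current assignment ("-i")
                n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                # P(z_i = j | z_-i, w) ∝ (n_wt + β)/(n_t + Wβ) · (n_dt + α)
                p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                t = np.random.choice(T, p=p / p.sum())
                z[d][i] = t            # add the new assignment back
                n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
    return z, n_wt, n_dt
```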

Belief Propagation

Factor Graph
Works for both directed and undirected graphical models.
Bipartite: edges only between a variable node and a factor node.
Factors represent computation: for an MRF clique, f(A, B, C) = ψ(A, B, C); for a BN fragment, f(A, B, C) = P(A) P(B) P(C | A, B).

The Sum-Product Algorithm
Also known as belief propagation (BP).
Exact if the graph is a tree; otherwise known as loopy BP, which is approximate.
The algorithm involves passing messages on the factor graph.
Alternative view: variational approximation (more later).

Example: A Simple HMM
The hidden Markov model template (not a graphical model): two states 1, 2 with
  initial distribution π_1 = π_2 = 1/2
  transitions P(z′ = 1 | z = 1) = 1/4, P(z′ = 2 | z = 1) = 3/4, P(z′ = 1 | z = 2) = 1/2, P(z′ = 2 | z = 2) = 1/2
  emissions over (R, G, B): P(x | z = 1) = (1/2, 1/4, 1/4), P(x | z = 2) = (1/4, 1/2, 1/4)

Observing x_1 = R, x_2 = G, the directed graphical model is z_1 → z_2 with observed emissions x_1 = R, x_2 = G.
Factor graph: f_1 — z_1 — f_2 — z_2, with f_1(z_1) = P(z_1) P(x_1 = R | z_1) and f_2(z_1, z_2) = P(z_2 | z_1) P(x_2 = G | z_2).

Messages
A message is a vector of length K, where K is the number of values x takes.
There are two types of messages:
1. µ_{f→x}: message from a factor node f to a variable node x; µ_{f→x}(i) is the ith element, i = 1, …, K.
2. µ_{x→f}: message from a variable node x to a factor node f.

Leaf Messages
Assume a tree factor graph. Pick an arbitrary root, say z_2, and start messages at the leaves.
If a leaf is a factor node f, µ_{f→x}(x) = f(x):
  µ_{f_1→z_1}(z_1 = 1) = P(z_1 = 1) P(R | z_1 = 1) = 1/2 · 1/2 = 1/4
  µ_{f_1→z_1}(z_1 = 2) = P(z_1 = 2) P(R | z_1 = 2) = 1/2 · 1/4 = 1/8
If a leaf is a variable node x, µ_{x→f}(x) = 1.

Message from Variable to Factor
A node (factor or variable) can send out a message once all its other incoming messages have arrived.
Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x excluding f_s:
  µ_{x→f_s}(x) = ∏_{f ∈ ne(x)\f_s} µ_{f→x}(x)
Here µ_{z_1→f_2}(z_1 = 1) = 1/4, µ_{z_1→f_2}(z_1 = 2) = 1/8.

Message from Factor to Variable
Let x be in factor f_s, and let the other variables in f_s be x_{1:M}:
  µ_{f_s→x}(x) = ∑_{x_1} … ∑_{x_M} f_s(x, x_1, …, x_M) ∏_{m=1}^M µ_{x_m→f_s}(x_m)
In this example,
  µ_{f_2→z_2}(s) = ∑_{s′=1}^2 µ_{z_1→f_2}(s′) f_2(z_1 = s′, z_2 = s)
                = 1/4 · P(z_2 = s | z_1 = 1) P(x_2 = G | z_2 = s) + 1/8 · P(z_2 = s | z_1 = 2) P(x_2 = G | z_2 = s)
We get µ_{f_2→z_2}(z_2 = 1) = 1/32 and µ_{f_2→z_2}(z_2 = 2) = 1/8.

Up to the Root, Back Down
The message has reached the root; now pass messages back down:
  µ_{z_2→f_2}(z_2 = 1) = 1, µ_{z_2→f_2}(z_2 = 2) = 1

Keep Passing Down
µ_{f_2→z_1}(s) = ∑_{s′=1}^2 µ_{z_2→f_2}(s′) f_2(z_1 = s, z_2 = s′)
              = 1 · P(z_2 = 1 | z_1 = s) P(x_2 = G | z_2 = 1) + 1 · P(z_2 = 2 | z_1 = s) P(x_2 = G | z_2 = 2)
We get µ_{f_2→z_1}(z_1 = 1) = 7/16 and µ_{f_2→z_1}(z_1 = 2) = 3/8.

From Messages to Marginals
Once a variable has received all incoming messages, we compute its marginal as
  p(x) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)
In this example:
  P(z_1 | x_1, x_2) ∝ µ_{f_1→z_1} ∘ µ_{f_2→z_1} = (1/4, 1/8) ∘ (7/16, 3/8) = (7/64, 3/64) ∝ (0.7, 0.3)
  P(z_2 | x_1, x_2) ∝ µ_{f_2→z_2} = (1/32, 1/8) ∝ (0.2, 0.8)
One can also compute the marginal of the set of variables X_s involved in a factor f_s:
  p(X_s) ∝ f_s(X_s) ∏_{x ∈ ne(f_s)} µ_{x→f_s}(x)
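
The whole message schedule for this two-node example fits in a few lines of numpy; a sketch that reproduces the numbers above:

```python
import numpy as np

pi   = np.array([0.5, 0.5])              # P(z1)
A    = np.array([[0.25, 0.75],           # P(z2 | z1): rows z1, cols z2
                 [0.50, 0.50]])
emit = np.array([[0.5, 0.25, 0.25],      # P(x | z): columns R, G, B
                 [0.25, 0.5, 0.25]])
R, G = 0, 1

f1 = pi * emit[:, R]                     # f1(z1) = P(z1) P(x1=R | z1)
f2 = A * emit[:, G][None, :]             # f2(z1, z2) = P(z2|z1) P(x2=G|z2)

mu_f1_z1 = f1                            # leaf factor -> variable: (1/4, 1/8)
mu_z1_f2 = mu_f1_z1                      # variable -> factor (no other factors)
mu_f2_z2 = mu_z1_f2 @ f2                 # sum over z1: (1/32, 1/8)
mu_z2_f2 = np.ones(2)                    # root sends back all-ones
mu_f2_z1 = f2 @ mu_z2_f2                 # sum over z2: (7/16, 3/8)

p_z1 = mu_f1_z1 * mu_f2_z1; p_z1 /= p_z1.sum()   # (0.7, 0.3)
p_z2 = mu_f2_z2 / mu_f2_z2.sum()                 # (0.2, 0.8)
print(p_z1, p_z2)
```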

Handling Evidence
Observing x = v, we can absorb it into the factor (as we did), or set the messages µ_{x→f}(x) = 0 for all x ≠ v.
Observing X_E, multiplying the incoming messages to x ∉ X_E gives the joint (not p(x | X_E)):
  p(x, X_E) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)
The conditional is easily obtained by normalization:
  p(x | X_E) = p(x, X_E) / ∑_x p(x, X_E)

Loopy Belief Propagation
So far, we assumed a tree factor graph.
When the factor graph contains loops, pass messages indefinitely until convergence.
Loopy BP may not converge, but it works well in many cases.

Parameter Learning
Assume the graph structure is given. The parameters are:
  θ_i in the CPDs p(x_i | Pa(x_i), θ_i) of directed graphical models: p(X) = ∏_i p(x_i | Pa(x_i), θ_i)
  weights w_i in undirected graphical models: p(X) = (1/Z) exp( ∑_{i=1}^k w_i f_i(X) )
Principle: the maximum likelihood estimate.

Parameter Learning: Maximum Likelihood Estimate
Fully observed (all dimensions of X are observed): given X^1, …, X^n, the MLE is
  θ̂ = argmax_θ ∑_{i=1}^n log p(X^i | θ)
The log likelihood factorizes for directed models (easy); use a gradient method for undirected models.
Partially observed: X = (X_o, X_h), where X_h is unobserved. Given X_o^1, …, X_o^n, the MLE is
  θ̂ = argmax_θ ∑_{i=1}^n log ( ∑_{X_h} p(X_o^i, X_h | θ) )
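
In the fully observed directed case, the MLE reduces to counting and normalizing per CPD. A sketch for the alarm network's P(A | B, E) (the data layout is my own assumption):

```python
import numpy as np

def mle_cpd_A(samples):
    """MLE of P(A = 1 | B, E) from fully observed rows (B, E, A, J, M)."""
    est = {}
    for b in (0, 1):
        for e in (0, 1):
            rows = samples[(samples[:, 0] == b) & (samples[:, 1] == e)]
            if len(rows):
                est[(b, e)] = rows[:, 2].mean()   # fraction of rows with A = 1
    return est

# For partially observed data there is no such closed form; one typically
# maximizes the marginal likelihood with EM or gradient ascent instead.
```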