Directed Probabilistic Graphical Models CMSC 678 UMBC
Announcement 1: Assignment 3 due Wednesday, April 11th, 11:59 AM. Any questions?
Announcement 2: Progress Report on Project due Monday, April 16th, 11:59 AM. Build on the proposal: update to address comments; discuss the progress you've made; discuss what remains to be done; discuss any new blocks you've experienced (or anticipate experiencing). Any questions?
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Recap from last time
Expectation Maximization (EM): E-step
0. Assume some value for your parameters.
Two-step, iterative algorithm:
1. E-step: count under uncertainty, assuming these parameters: use p(z_i) to form count(z_i, w_i)
2. M-step: maximize log-likelihood, assuming these uncertain counts: estimated counts under p^(t)(z) → new parameters p^(t+1)(z)
EM Math
E-step: count under uncertainty (expectation under the old parameters' posterior distribution):
E_{z ∼ p^(t)(· | w)}[log p_θ(z, w)]
M-step: maximize log-likelihood to get new parameters:
θ^(t+1) = argmax_θ E_{z ∼ p^(t)(· | w)}[log p_θ(z, w)]
M(θ) = marginal log-likelihood of observed data X
C(θ) = log-likelihood of complete data (X, Y)
P(θ) = posterior log-likelihood of incomplete data Y
M(θ) = E_{Y | θ^(t)}[C(θ) | X] − E_{Y | θ^(t)}[P(θ) | X]
EM does not decrease the marginal log-likelihood.
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Lagrange multipliers Assume an original optimization problem
Lagrange multipliers Assume an original optimization problem We convert it to a new optimization problem:
Lagrange multipliers: an equivalent problem?
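The equations on these slides were lost in extraction; as a reminder, the standard construction (a generic sketch, not necessarily the slides' exact notation) converts a constrained problem into an unconstrained one by forming the Lagrangian and looking for stationary points:

```latex
\max_{x} f(x) \quad \text{s.t. } g(x) = 0
\quad\Longrightarrow\quad
F(x, \lambda) = f(x) - \lambda\, g(x),
\qquad
\nabla_x F = 0, \quad \frac{\partial F}{\partial \lambda} = -g(x) = 0 .
```

Setting ∂F/∂λ = 0 recovers the constraint, which is exactly the pattern used for the die-rolling likelihood in the next section.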
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Probabilistic Estimation of Rolling a Die
N different (independent) rolls: p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
e.g., w_1 = 1, w_2 = 5, w_3 = 4
Generative Story: for roll i = 1 to N: w_i ∼ Cat(θ)
θ is a probability distribution over the 6 sides of the die: ∑_{k=1}^{6} θ_k = 1 and 0 ≤ θ_k ≤ 1 for all k
Probabilistic Estimation of Rolling a Die
N different (independent) rolls: p(w_1, w_2, …, w_N) = ∏_i p(w_i); e.g., w_1 = 1, w_2 = 5, w_3 = 4
Generative Story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize Log-likelihood: L(θ) = ∑_i log p_θ(w_i) = ∑_i log θ_{w_i}
Probabilistic Estimation of Rolling a Die
N different (independent) rolls: p(w_1, w_2, …, w_N) = ∏_i p(w_i)
Generative Story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize Log-likelihood: L(θ) = ∑_i log θ_{w_i}
Q: What's an easy way to maximize this, as written exactly (even without calculus)?
A: Just keep increasing θ_k (we know θ must be a distribution, but that's not specified in the objective)
Probabilistic Estimation of Rolling a Die
N different (independent) rolls: p(w_1, w_2, …, w_N) = ∏_i p(w_i)
Maximize Log-likelihood (with distribution constraints):
L(θ) = ∑_i log θ_{w_i}  s.t. ∑_{k=1}^{6} θ_k = 1
(We can include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.)
Solve using Lagrange multipliers.
Probabilistic Estimation of Rolling a Die
Maximize Log-likelihood (with distribution constraints):
F(θ, λ) = ∑_i log θ_{w_i} − λ(∑_{k=1}^{6} θ_k − 1)
∂F/∂θ_k = ∑_{i: w_i = k} 1/θ_k − λ
∂F/∂λ = −∑_{k=1}^{6} θ_k + 1
Probabilistic Estimation of Rolling a Die
F(θ, λ) = ∑_i log θ_{w_i} − λ(∑_{k=1}^{6} θ_k − 1)
Setting ∂F/∂θ_k = 0 gives θ_k = (∑_{i: w_i = k} 1) / λ
λ takes its optimal value when ∑_{k=1}^{6} θ_k = 1
Probabilistic Estimation of Rolling a Die
F(θ, λ) = ∑_i log θ_{w_i} − λ(∑_{k=1}^{6} θ_k − 1)
θ_k = (∑_{i: w_i = k} 1) / λ = (∑_{i: w_i = k} 1) / (∑_{k′} ∑_{i: w_i = k′} 1) = N_k / N
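The closed-form answer θ_k = N_k / N is easy to sanity-check in code (a minimal sketch; the function name `mle_die` is ours, not from the slides):

```python
from collections import Counter

def mle_die(rolls, num_sides=6):
    """Closed-form MLE for a categorical die: theta_k = N_k / N."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts.get(k, 0) / n for k in range(1, num_sides + 1)]

theta = mle_die([1, 5, 4, 1, 6, 1, 2])
# side 1 appeared 3 times out of 7 rolls, so theta_1 = 3/7
```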
Example: Conditionally Rolling a Die
p(w_1, w_2, …, w_N) = ∏_i p(w_i)
Add complexity to better explain what we see:
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(z_i) p(w_i | z_i)
e.g., z_1 = H, z_2 = T; w_1 = 1, w_2 = 5
penny: p(heads) = λ, p(tails) = 1 − λ
dollar coin: p(heads) = γ, p(tails) = 1 − γ
dime: p(heads) = ψ, p(tails) = 1 − ψ
Generative Story:
λ = distribution over penny; γ = distribution over dollar coin; ψ = distribution over dime
for item i = 1 to N:
  z_i ∼ Bernoulli(λ)
  if z_i = H: w_i ∼ Bernoulli(γ)
  else: w_i ∼ Bernoulli(ψ)
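The generative story can be run directly as a sampler (a sketch following the slides' story; the function name and parameter values are ours):

```python
import random

def sample_sequence(n, lam, gamma, psi, seed=0):
    """Sample (z_i, w_i) pairs: flip the penny to pick a coin, then flip that coin."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = "H" if rng.random() < lam else "T"      # z_i ~ Bernoulli(lam)
        p_heads = gamma if z == "H" else psi        # dollar coin if heads, else dime
        w = "H" if rng.random() < p_heads else "T"  # w_i ~ Bernoulli(gamma or psi)
        pairs.append((z, w))
    return pairs

pairs = sample_sequence(1000, lam=0.5, gamma=0.9, psi=0.1)
```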
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Classify with Bayes Rule
argmax_Y p(Y | X) = argmax_Y [log p(X | Y) + log p(Y)]
(likelihood + prior)
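As a toy illustration of classifying with Bayes rule (the probability table here is invented for the example, not from the slides):

```python
import math

def classify(x, classes, log_prior, log_likelihood):
    """Pick the class maximizing log p(x | y) + log p(y)."""
    return max(classes, key=lambda y: log_likelihood(x, y) + log_prior[y])

log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
likelihoods = {("offer", "spam"): 0.8, ("offer", "ham"): 0.1,
               ("hello", "spam"): 0.2, ("hello", "ham"): 0.9}

def log_likelihood(x, y):
    return math.log(likelihoods[(x, y)])

label = classify("offer", ["spam", "ham"], log_prior, log_likelihood)
# log(0.4) + log(0.8) > log(0.6) + log(0.1), so "offer" is labeled spam
```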
The Bag of Words Representation
[figure: a review document reduced to word counts, e.g. seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, …, fed to a classifier γ(doc) = c]
Adapted from Jurafsky & Martin (draft)
Naïve Bayes: A Generative Story
Generative Story:
φ = distribution over K labels (global parameter)
for label k = 1 to K: θ_k = generate parameters (global parameters)
for item i = 1 to N (local variables):
  y_i ∼ Cat(φ)
  for each feature j: x_ij ∼ F_j(θ_{y_i})
[graph: y → x_i1, x_i2, x_i3, x_i4, x_i5]
Each x_ij is conditionally independent of the others, given the label.
Naïve Bayes: A Generative Story
Generative Story: φ = distribution over K labels; for label k = 1 to K: θ_k = generate parameters; for item i = 1 to N: y_i ∼ Cat(φ); for each feature j: x_ij ∼ F_j(θ_{y_i})
Maximize Log-likelihood:
L(θ, φ) = ∑_i ∑_j log F_j(x_ij; θ_{y_i}) + ∑_i log φ_{y_i}
s.t. ∑_k φ_k = 1, and each θ_k is valid for F_k
Multinomial Naïve Bayes: A Generative Story
Generative Story: φ = distribution over K labels; for label k = 1 to K: θ_k = distribution over J feature values; for item i = 1 to N: y_i ∼ Cat(φ); for each feature j: x_ij ∼ Cat(θ_{y_i})
Maximize Log-likelihood:
L(θ, φ) = ∑_i ∑_j log θ_{y_i, x_ij} + ∑_i log φ_{y_i}
s.t. ∑_k φ_k = 1 and ∑_j θ_kj = 1 for each k
Multinomial Naïve Bayes: A Generative Story
Generative Story: φ = distribution over K labels; for label k = 1 to K: θ_k = distribution over J feature values; for item i = 1 to N: y_i ∼ Cat(φ); for each feature j: x_ij ∼ Cat(θ_{y_i})
Maximize Log-likelihood via Lagrange Multipliers:
F(θ, φ) = ∑_i ∑_j log θ_{y_i, x_ij} + ∑_i log φ_{y_i} − μ(∑_k φ_k − 1) − ∑_k λ_k(∑_j θ_kj − 1)
Multinomial Naïve Bayes: Learning
Calculate class priors: for each k, items_k = all items with class = k; p_k = |items_k| / # items
Calculate feature generation terms: for each k, obs_k = a single object containing all items labeled as k; for each feature j, n_kj = # of occurrences of j in obs_k; p(j | k) = n_kj / ∑_{j′} n_kj′
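These count-based updates can be written in a few lines (a minimal sketch of the counting step only: no smoothing, and the function name is ours):

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """MLE for multinomial Naive Bayes: class priors p_k and per-class p(j | k)."""
    n = len(docs)
    prior = {k: c / n for k, c in Counter(labels).items()}
    feat_counts = defaultdict(Counter)
    for doc, k in zip(docs, labels):
        feat_counts[k].update(doc)          # pool all docs of class k into obs_k
    cond = {}
    for k, counts in feat_counts.items():
        total = sum(counts.values())        # sum over j' of n_kj'
        cond[k] = {j: c / total for j, c in counts.items()}
    return prior, cond

docs = [["fun", "couple"], ["fast", "furious"], ["fun", "fast"]]
labels = ["comedy", "action", "comedy"]
prior, cond = train_multinomial_nb(docs, labels)
# prior["comedy"] = 2/3; cond["comedy"]["fun"] = 2/4
```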
With enough data, the classifier may not matter (Brill and Banko, 2001). Adapted from Jurafsky & Martin (draft).
Summary: Naïve Bayes is Not So Naïve, but not without issue
Pro:
Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)
Con:
Model the posterior in one go? (e.g., use conditional maxent)
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there better ways of handling missing/noisy data? (automated, more principled)
Adapted from Jurafsky & Martin (draft)
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Hidden Markov Models
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Class-based model: emissions p(w_i | z_i), a bigram model of the classes p(z_i | z_{i−1}), and we model all class sequences z_1, …, z_N: ∑_{z_1,…,z_N} p(z_1, w_1, z_2, w_2, …, z_N, w_N)
Hidden Markov Model
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
Goal: maximize the (log-)likelihood. In practice we don't actually observe the z values; we just see the words w.
If we did observe z, estimating the probability parameters would be easy, but we don't! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood, but we don't! :(
Hidden Markov Model Terminology
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
p(w_i | z_i): emission probabilities/parameters
p(z_i | z_{i−1}): transition probabilities/parameters
Each z_i can take the value of one of K latent states.
The transition and emission distributions do not change across positions.
Q: How many different probability values are there with K states and V vocab items?
A: VK emission values and K² transition values
Hidden Markov Model Representation
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N−1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i−1})
Represent the probabilities and independence assumptions in a graph:
z_1 → z_2 → z_3 → z_4, with transitions p(z_2 | z_1), p(z_3 | z_2), p(z_4 | z_3) and initial starting distribution p(z_1 | z_0) (z_0 = SEQ_SYM)
Each z_i emits w_i with p(w_i | z_i).
Each z_i can take the value of one of K latent states; the transition and emission distributions do not change.
Example: 2-state Hidden Markov Model as a Lattice
At each step i, z_i ∈ {N, V}; arcs between steps carry transition probabilities p(z_i | z_{i−1}) (including p(z_1 | start)), and each state emits w_i with probability p(w_i | z_i).

Emission probabilities p(w | state):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

Transition probabilities p(next | previous):
          N     V     end
start    .7    .2    .1
N        .15   .8    .05
V        .6    .35   .05
A Latent Sequence is a Path through the Graph
Using the emission and transition tables above:
Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)?
A: (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) ≈ 0.000247
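The path probability is just the product of transition and emission terms along the path; here is that computation with the slide's example tables (the dictionary layout is ours):

```python
trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
         ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
         ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def path_prob(states_seq, words):
    """Multiply p(state | previous) * p(word | state) along one latent path."""
    p, prev = 1.0, "start"
    for s, w in zip(states_seq, words):
        p *= trans[(prev, s)] * emit[s][w]
        prev = s
    return p

p = path_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"])
# (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 2.47e-4
```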
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Message Passing: Count the Soldiers
If you are the front soldier in the line, say the number one to the soldier behind you.
If you are the rearmost soldier in the line, say the number one to the soldier in front of you.
If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side. (ITILA, Ch. 16)
What's the Maximum Weighted Path?
[figure: a small weighted graph with node weights 3, 7, 4, 9, 6, 32, and 1; each node passes forward the best partial sum so far (e.g. +3 → 10, +10 → 42), so the maximum-weight path is found without enumerating all paths]
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
What's the Maximum Value?
Lattice with states A, B, C at each step. Consider any path ending with B (A→B, B→B, or C→B) and maximize across the previous hidden state values:
v(i, B) = max_s v(i−1, s) · p(B | s) · p(obs at i | B)
v(i, B) is the maximum probability of any path to state B from the beginning (emitting the observations along the way).
Computing v at time i−1 correctly incorporates (maximizes over) paths through time i−2: we correctly obey the Markov property.
This is the Viterbi algorithm.
v = double[n+2][k*] b = int[n+2][k*] Viterbi Algorithm v[*][*] = 0 v[0][start] = 1 backpointers/ book-keeping for(i = 1; i N+1; ++i) { for(state = 0; state < K*; ++state) { p obs = p emission (obs i state) for(old = 0; old < K*; ++old) { p move = p transition (state old) if(v[i-1][old] * p obs * p move > v[i][state]) { } } } } v[i][state] = v[i-1][old] * p obs * p move b[i][state] = old
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Forward Probability
Marginalize across the previous hidden state values:
α(i, B) = ∑_s α(i−1, s) · p(B | s) · p(obs at i | B)
α(i, B) is the total probability of all paths to state B from the beginning.
Computing α at time i−1 correctly incorporates paths through time i−2: we correctly obey the Markov property.
Forward Probability
α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)
  ∑_{s′}: sum over the immediate ways to get into state s
  α(i−1, s′): the total probability up until now
  p(s | s′) · p(obs at i | s): how likely it is to get into state s this way
α(i, s) is the total probability of all paths that: (1) start from the beginning, (2) end (currently) in s at step i, and (3) emit the observation obs at i.
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][END]
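Replacing Viterbi's max with a sum gives the forward algorithm. Below is a sketch for the 2-state example, checked against a brute-force enumeration of all latent paths (table layout and names are ours):

```python
from itertools import product

trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
         ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
         ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}

def forward_likelihood(words, states):
    """alpha[i][s] = total probability of all paths ending in s at step i."""
    alpha = [{s: 0.0 for s in states} for _ in range(len(words) + 1)]
    for s in states:
        alpha[1][s] = trans[("start", s)] * emit[s][words[0]]
    for i in range(2, len(words) + 1):
        for s in states:
            alpha[i][s] = sum(alpha[i - 1][o] * trans[(o, s)]
                              for o in states) * emit[s][words[i - 1]]
    # alpha[N+1][END]: fold in the end transition
    return sum(alpha[len(words)][s] * trans[(s, "end")] for s in states)

words, states = ["w1", "w2", "w3", "w4"], ["N", "V"]
lik = forward_likelihood(words, states)

# brute force: sum the probability of every one of the K^N latent paths
brute = 0.0
for seq in product(states, repeat=len(words)):
    p, prev = 1.0, "start"
    for s, w in zip(seq, words):
        p *= trans[(prev, s)] * emit[s][w]
        prev = s
    brute += p * trans[(prev, "end")]
```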
Outline Recap of EM Math: Lagrange Multipliers for constrained optimization Probabilistic Modeling Example: Die Rolling Directed Graphical Models Naïve Bayes Hidden Markov Models Message Passing: Directed Graphical Model Inference Most likely sequence Total (marginal) probability EM in D-PGMs
Forward & Backward Message Passing
α(i, s) is the total probability of all paths that: (1) start from the beginning, (2) end (currently) in s at step i, and (3) emit the observation obs at i.
β(i, s) is the total probability of all paths that: (1) start at step i in state s, (2) terminate at the end, and (3) emit the observations from obs at i+1 onward.
α(i, B) · β(i, B) = total probability of paths through state B at step i
α(i, B) · p(s′ | B) · p(obs at i+1 | s′) · β(i+1, s′) = total probability of paths through the B → s′ arc (at time i)
With Both Forward and Backward Values
α(i, s) · β(i, s) = total probability of paths through state s at step i:
p(z_i = s | w_1, …, w_N) = α(i, s) β(i, s) / α(N+1, END)
α(i, s) · p(s′ | s) · p(obs at i+1 | s′) · β(i+1, s′) = total probability of paths through the s → s′ arc (at time i):
p(z_i = s, z_{i+1} = s′ | w_1, …, w_N) = α(i, s) p(s′ | s) p(obs at i+1 | s′) β(i+1, s′) / α(N+1, END)
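Putting α and β together for the 2-state example gives a posterior over each position's state. One checkable property is that the posterior at every step sums to 1 (a sketch; table layout and names are ours):

```python
trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
         ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
         ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}
words, states = ["w1", "w2", "w3", "w4"], ["N", "V"]
n = len(words)

alpha = [{s: 0.0 for s in states} for _ in range(n + 1)]
beta = [{s: 0.0 for s in states} for _ in range(n + 1)]
for s in states:
    alpha[1][s] = trans[("start", s)] * emit[s][words[0]]
    beta[n][s] = trans[(s, "end")]
for i in range(2, n + 1):
    for s in states:
        alpha[i][s] = sum(alpha[i - 1][o] * trans[(o, s)] for o in states) * emit[s][words[i - 1]]
for i in range(n - 1, 0, -1):
    for s in states:
        beta[i][s] = sum(trans[(s, x)] * emit[x][words[i]] * beta[i + 1][x] for x in states)

L = sum(alpha[n][s] * trans[(s, "end")] for s in states)   # alpha(N+1, END)
# p(z_i = s | w_1..w_N) = alpha(i, s) * beta(i, s) / alpha(N+1, END)
posterior = [{s: alpha[i][s] * beta[i][s] / L for s in states} for i in range(1, n + 1)]
```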
Expectation Maximization (EM)
0. Assume some value for your parameters p_obs(w | s) and p_trans(s′ | s).
Two-step, iterative algorithm:
1. E-step: count under uncertainty, assuming these parameters (the estimated counts):
p(z_i = s | w_1, …, w_N) = α(i, s) β(i, s) / α(N+1, END)
p(z_i = s, z_{i+1} = s′ | w_1, …, w_N) = α(i, s) p(s′ | s) p(obs at i+1 | s′) β(i+1, s′) / α(N+1, END)
2. M-step: maximize log-likelihood, assuming these uncertain counts.
M-Step: maximize log-likelihood, assuming these uncertain counts
p_new(s′ | s) = c(s → s′) / ∑_x c(s → x)
…if we observed the hidden transitions.
M-Step: maximize log-likelihood, assuming these uncertain counts
p_new(s′ | s) = E[c(s → s′)] / ∑_x E[c(s → x)]
We don't observe the hidden transitions, but we can approximately count them: we compute these expected counts in the E-step, with our α and β values.
EM for HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for(i = N; i >= 0; --i) {
  for(next = 0; next < K*; ++next) {
    c_obs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
    for(state = 0; state < K*; ++state) {
      u = p_obs(obs_{i+1} | next) * p_trans(next | state)
      c_trans(next | state) += α[i][state] * u * β[i+1][next] / L
    }
  }
}
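The same E-step quantities drive the M-step update p_new(s′ | s). Below is a sketch for the 2-state example that accumulates expected transition counts and renormalizes them (names are ours; the end transition is left out of the re-estimation for brevity):

```python
trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
         ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
         ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1, "w4": .1}}
words, states = ["w1", "w2", "w3", "w4"], ["N", "V"]
n = len(words)

# forward and backward tables (1-indexed positions)
alpha = [{s: 0.0 for s in states} for _ in range(n + 1)]
beta = [{s: 0.0 for s in states} for _ in range(n + 1)]
for s in states:
    alpha[1][s] = trans[("start", s)] * emit[s][words[0]]
    beta[n][s] = trans[(s, "end")]
for i in range(2, n + 1):
    for s in states:
        alpha[i][s] = sum(alpha[i - 1][o] * trans[(o, s)] for o in states) * emit[s][words[i - 1]]
for i in range(n - 1, 0, -1):
    for s in states:
        beta[i][s] = sum(trans[(s, x)] * emit[x][words[i]] * beta[i + 1][x] for x in states)
L = sum(alpha[n][s] * trans[(s, "end")] for s in states)

# E-step: expected transition counts E[c(s -> s')], summed over positions
c_trans = {(s, x): 0.0 for s in states for x in states}
for i in range(1, n):
    for s in states:
        for x in states:
            c_trans[(s, x)] += alpha[i][s] * trans[(s, x)] * emit[x][words[i]] * beta[i + 1][x] / L

# M-step: p_new(s' | s) = E[c(s -> s')] / sum_x E[c(s -> x)]
p_new = {}
for s in states:
    total = sum(c_trans[(s, x)] for x in states)
    for x in states:
        p_new[(s, x)] = c_trans[(s, x)] / total
```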
Bayesian Networks: Directed Acyclic Graphs
p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i))
π(x_i) = the parents of x_i (an ordering is given by a topological sort)
Bayesian Networks: Directed Acyclic Graphs
p(x_1, x_2, x_3, …, x_N) = ∏_i p(x_i | π(x_i))
Exact inference in general DAGs is NP-hard; inference in trees can be exact.
D-Separation: Testing for Conditional Independence
Variables X & Y are conditionally independent given Z if all (undirected) paths from (any variable in) X to (any variable in) Y are d-separated by Z.
Independence ⇔ d-separation: X & Y are d-separated if, for all paths P, one of the following is true:
1. P has a chain X → Z → Y with an observed middle node: observing Z blocks the path from X to Y.
2. P has a fork X ← Z → Y with an observed parent node: observing Z blocks the path from X to Y.
3. P includes a v-structure or collider X → Z ← Y with all descendants unobserved: NOT observing Z blocks the path from X to Y.
For the collider: p(x, y, z) = p(x) p(y) p(z | x, y), so p(x, y) = ∑_z p(x) p(y) p(z | x, y) = p(x) p(y).
Markov Blanket = the set of nodes needed to form the complete conditional for a variable x_i:
p(x_i | x_{j≠i}) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_i   (factorization of the graph)
= ∏_k p(x_k | π(x_k)) / ∫ ∏_k p(x_k | π(x_k)) dx_i   (factor out terms not dependent on x_i)
= ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) / ∫ ∏_{k: k=i or i∈π(x_k)} p(x_k | π(x_k)) dx_i
The Markov blanket of a node x_i is its parents, its children, and its children's parents.