Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007
Review of Inference on Graphical Models
The elimination algorithm finds a single marginal probability $p(x_f)$ by:
- choosing an elimination ordering $I$ such that $f$ is the last node;
- keeping track of the active potentials;
- eliminating a node $i$ by removing all active potentials that reference $x_i$, taking their product, summing over $x_i$, and putting the resulting message on the active list.
Now consider trees with multiple queries. The resulting method, the Sum-Product algorithm, is a special case of the Junction Tree algorithm.
Choose a root node arbitrarily. A natural elimination order is a depth-first traversal of the tree.
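Inference on Trees for Single Query p(r)
[Figure: (a) node $j$ sends the message $m_{ji}(x_i)$ to node $i$, toward the root; (b) node $j$ receives the messages $m_{kj}(x_j)$ and $m_{lj}(x_j)$ from its children $k$ and $l$.]
Elimination of node $j$:
If $j$ is an evidence node, $\phi^E(x_j) = 1[x_j = \bar{x}_j]$; otherwise $\phi^E(x_j) = 1$.
$m_{ji}(x_i) = \sum_{x_j} \phi^E(x_j)\, \phi(x_i, x_j) \prod_{k \in c(j)} m_{kj}(x_j)$
The marginal of the root $x_r$ is proportional to the product of its incoming messages:
$p(x_r \mid \bar{x}_E) \propto \phi^E(x_r) \prod_{k \in c(r)} m_{kr}(x_r)$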
Inference on Trees for Multiple Query
[Figure: message passing on a four-node tree in which X2 is connected to X1, X3 and X4; panels (a)-(d).]
Fig a) Choosing 1 as root, 2 has all the messages from the nodes below it.
Fig b) To compute $p(x_2)$, $m_{12}(x_2)$ is needed.
Fig c) One pass from the leaves to the root and one pass backward give the messages required to compute all node marginals.
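The two-pass idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `tree_marginals`, the potential tables in `pots`, and the use of memoized recursion in place of an explicit leaves-to-root and root-to-leaves schedule are all assumptions made here. Each directed message $m_{ji}$ is still computed exactly once, which is the point of the two passes.

```python
def tree_marginals(n_states, edges, pair_pot, evidence=None):
    """Sum-product on a tree with pairwise potentials: all node marginals."""
    evidence = evidence or {}
    nodes = sorted({u for e in edges for u in e})
    nbrs = {u: [] for u in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    def phi_e(j, xj):
        # Evidence potential: indicator if j is observed, 1 otherwise.
        return 1.0 if j not in evidence or evidence[j] == xj else 0.0

    def phi(i, j, xi, xj):
        # Look up the edge potential regardless of how the edge was keyed.
        if (i, j) in pair_pot:
            return pair_pot[(i, j)][xi][xj]
        return pair_pot[(j, i)][xj][xi]

    msgs = {}

    def message(j, i):
        # m_ji(x_i) = sum_{x_j} phi_E(x_j) phi(x_i, x_j) prod_{k != i} m_kj(x_j)
        if (j, i) not in msgs:
            out = []
            for xi in range(n_states):
                total = 0.0
                for xj in range(n_states):
                    term = phi_e(j, xj) * phi(i, j, xi, xj)
                    for k in nbrs[j]:
                        if k != i:
                            term *= message(k, j)[xj]
                    total += term
                out.append(total)
            msgs[(j, i)] = out
        return msgs[(j, i)]

    marginals = {}
    for r in nodes:
        belief = [phi_e(r, xr) for xr in range(n_states)]
        for k in nbrs[r]:
            belief = [b * m for b, m in zip(belief, message(k, r))]
        z = sum(belief)
        marginals[r] = [b / z for b in belief]
    return marginals


# Toy tree matching the figure's shape; the potential values are made up.
edges = [(1, 2), (2, 3), (2, 4)]
pots = {
    (1, 2): [[2.0, 1.0], [1.0, 3.0]],
    (2, 3): [[1.0, 2.0], [2.0, 1.0]],
    (2, 4): [[3.0, 1.0], [1.0, 1.0]],
}
marg = tree_marginals(2, edges, pots)
```

With three edges there are six directed messages, so all four marginals cost roughly twice the work of a single query rather than four times.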
Inference on HMMs
[Figure: HMM chain with hidden labels $y_{i-1}, y_i, y_{i+1}$ and observations $x_{i-1}, x_i, x_{i+1}$.]
Hidden Markov Models are directed sequence models.
Let $x = (x_1, \ldots, x_l)$ and $y = (y_1, \ldots, y_l)$ be sequences.
$x_i \in O$, where $O$ is a set of possible observations.
$y_i \in Y$, where $Y$ is a set of possible labels.
$\pi_\sigma = p(y_1 = \sigma; \theta)$: initial state probabilities.
$T_{\sigma,\sigma'} = p(y_{i+1} = \sigma \mid y_i = \sigma'; \theta)$: transition probabilities for $\sigma, \sigma' \in Y$.
$O_{\sigma,u} = p(x_i = u \mid y_i = \sigma; \theta)$: observation probabilities for $\sigma \in Y$, $u \in O$.
$P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{l} O_{y_i, x_i} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i}$
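Forward-Backward Algorithm for HMMs
Set $y_l$ as the root; the observations $\bar{x}$ are evidence.
Forward pass, with $\alpha_i(\sigma) = m_{(i-1)i}(y_i = \sigma)$:
$p(\bar{x}_{1:i}, y_i = \sigma) = \alpha_i(\sigma) = \Big( \sum_{\sigma' \in Y} \alpha_{i-1}(\sigma')\, T_{\sigma,\sigma'} \Big) O_{\sigma, \bar{x}_i}$
Backward pass:
$p(\bar{x}_{i+1:l} \mid y_i = \sigma) = \beta_i(\sigma) = \sum_{\sigma' \in Y} T_{\sigma',\sigma}\, O_{\sigma', \bar{x}_{i+1}}\, \beta_{i+1}(\sigma')$
Combining messages:
$P(y_i = \sigma \mid \bar{x}_{1:l}) = \frac{p(\bar{x}_{1:l} \mid y_i)\, p(y_i)}{p(\bar{x}_{1:l})} \propto p(\bar{x}_{1:i} \mid y_i)\, p(\bar{x}_{i+1:l} \mid y_i)\, p(y_i) = \alpha_i(\sigma)\, \beta_i(\sigma)$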
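As a sanity check on the recursions, here is a minimal Python sketch of forward-backward. It follows the slide's index convention, `T[s][sp]` $= p(y_{i+1} = s \mid y_i = sp)$, so each column of `T` sums to one. The function name and the toy parameter values are assumptions for illustration, and no rescaling is done, so it would underflow on long sequences.

```python
def forward_backward(pi, T, O, x):
    """Return alpha, beta and the posteriors p(y_i = s | x) for one sequence."""
    S, l = len(pi), len(x)
    alpha = [[0.0] * S for _ in range(l)]
    beta = [[1.0] * S for _ in range(l)]  # beta_l(s) = 1

    # Forward pass: alpha_i(s) = p(x_1..x_i, y_i = s)
    for s in range(S):
        alpha[0][s] = pi[s] * O[s][x[0]]
    for i in range(1, l):
        for s in range(S):
            alpha[i][s] = O[s][x[i]] * sum(
                alpha[i - 1][sp] * T[s][sp] for sp in range(S)
            )

    # Backward pass: beta_i(s) = p(x_{i+1}..x_l | y_i = s)
    for i in range(l - 2, -1, -1):
        for s in range(S):
            beta[i][s] = sum(
                T[sp][s] * O[sp][x[i + 1]] * beta[i + 1][sp] for sp in range(S)
            )

    evidence = sum(alpha[l - 1])  # p(x_1..x_l)
    post = [
        [alpha[i][s] * beta[i][s] / evidence for s in range(S)] for i in range(l)
    ]
    return alpha, beta, post


# Toy two-state, three-symbol HMM; all numbers are made up.
pi = [0.6, 0.4]
T = [[0.7, 0.4], [0.3, 0.6]]            # T[next][prev], columns sum to 1
O = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # O[state][symbol]
x = [0, 1, 2, 0]
alpha, beta, post = forward_backward(pi, T, O, x)
```

Dividing by `evidence` supplies the normalizer that the proportionality in the combining step leaves implicit.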
Generalization to Junction Tree Algorithm
Data structure: generate a tree whose nodes are the maximal cliques. Then we get $p(x_C \mid x_{V \setminus C})$ using sum-product, and we can marginalize over this to get individual marginal probabilities.
Marginalization over different cliques containing node $i$ should yield the same $p(x_i \mid x_{V \setminus i})$.
There are multiple clique trees. We need a data structure that preserves global consistency via local consistency.
A junction tree is a clique tree such that, for each pair of cliques $C_i$, $C_j$ sharing some variables $C_s$, $C_s$ is included in all the nodes of the clique tree on the path between $C_i$ and $C_j$.
There exists a junction tree of a graph iff the graph is triangulated. (A triangulation can be obtained from an elimination ordering.)
Message-passing protocol: a clique $C_i$ can send a message to a neighbor $C_j$ if it has received messages from all its other neighbors.
Junction Tree Algorithm
1. Moralize the graph if it is directed.
2. Triangulate the graph G (using the variable elimination algorithm).
3. Construct a junction tree J from the triangulated graph via a maximal-weight spanning tree.
4. Initialize the potential of each clique C of J as the product of all factors from the model (the potentials of G) assigned to C.
5. Propagate local messages on the junction tree.
Maximizing instead of Summing
Finding maximum probability configurations:
Elimination and Belief Propagation both summed over all possible values of the (non-query, non-evidence) nodes to get a marginal probability.
What if we wanted to maximize over the non-query, non-evidence nodes to find the probability of the single best setting consistent with any query and evidence?
$\max_x p(x) = \max_{x_1} \max_{x_2} \max_{x_3} \max_{x_4} \max_{x_5} p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$
$= \max_{x_1} p(x_1) \max_{x_2} p(x_2 \mid x_1) \max_{x_3} p(x_3 \mid x_1) \max_{x_4} p(x_4 \mid x_2) \max_{x_5} p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$
This is known as the maximum a-posteriori or MAP configuration. It turns out that (on trees) we can use an algorithm exactly like belief propagation to solve this problem.
The same formulation can be used exactly for finding the MAP:
- Multiplication distributes over max as well as sum: $\max(ab, ac) = a \max(b, c)$ for $a \ge 0$.
- We can do the MAP computation using exactly the same algorithm, replacing sums with max.
[Figure: directed graph over X1, ..., X6 with edges X1→X2, X1→X3, X2→X4, X3→X5, X2→X6, X5→X6.]
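For a chain-structured model such as an HMM, replacing sums with max in the forward recursion gives the Viterbi algorithm. The sketch below is illustrative only: the function name and toy parameters are assumptions, and `T[s][sp]` denotes $p(y_{i+1} = s \mid y_i = sp)$. It keeps back-pointers so that the maximizing configuration, and not just its probability, can be read off.

```python
def viterbi(pi, T, O, x):
    """Max-product on an HMM chain: most probable label path and its joint prob."""
    S, l = len(pi), len(x)
    delta = [[0.0] * S for _ in range(l)]  # best joint prob of a prefix ending in s
    back = [[0] * S for _ in range(l)]     # argmax predecessor for each (i, s)

    for s in range(S):
        delta[0][s] = pi[s] * O[s][x[0]]
    for i in range(1, l):
        for s in range(S):
            # Same recursion as the forward pass, with sum replaced by max.
            best = max(range(S), key=lambda sp: delta[i - 1][sp] * T[s][sp])
            back[i][s] = best
            delta[i][s] = delta[i - 1][best] * T[s][best] * O[s][x[i]]

    # Trace the back-pointers from the best final state.
    last = max(range(S), key=lambda s: delta[l - 1][s])
    path = [last]
    for i in range(l - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, delta[l - 1][last]


# Toy two-state, three-symbol HMM; all numbers are made up.
pi = [0.6, 0.4]
T = [[0.7, 0.4], [0.3, 0.6]]            # T[next][prev]
O = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # O[state][symbol]
x = [0, 1, 2, 0]
path, best_prob = viterbi(pi, T, O, x)
```

The factor `O[s][x[i]]` can be pulled outside the max because it is nonnegative and does not depend on the previous state, which is the distributive law stated on the slide.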
Learning
Parameter learning: given the graph structure, how do we get the conditional probability distributions $P_\theta(X_i \mid X_{\Pi_i})$?
- The parameters $\theta$ are unknown constants.
- Given fully observed training data $D$, find the estimates that maximize the (penalized) log-likelihood:
  $\hat{\theta} = \arg\max_\theta \; \log P(D \mid \theta) \; (- \lambda R(\theta))$
- Parameter estimation with fully observed data.
- Parameter estimation with latent variables: Expectation Maximization (EM).
Structure learning: given fully observed training data $D$, how do we get the graph $G$ and its parameters $\theta$? (After studying the discriminative framework.)
Parameter Estimation with Complete Observations
Random variable $X = (X_1, \ldots, X_n)$. The data $D = (x^1, \ldots, x^m)$ is a set of $m$ IID observations.
The maximum likelihood estimate (MLE) can be found by maximizing the log probability of the data. In Bayes nets,
$l(\theta; D) = \log p(D \mid \theta) = \log \prod_{j=1}^{m} p(x^j \mid \theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log p(x_i^j \mid x_{\pi_i}^j, \theta_i)$
Let $X_i$ be a discrete random variable with $K$ possible values. Then $p(x_i \mid x_{\pi_i}, \theta_i) = \theta_i(x_i, x_{\pi_i})$ is a multinomial distribution such that $\sum_{x_i} \theta_i(x_i, x_{\pi_i}) = 1$.
$l(\theta; D) = \sum_i \sum_{x_{i,\pi_i}} N(x_{i,\pi_i}) \log \theta_i(x_i, x_{\pi_i})$
$N(x_{i,\pi_i}) = \sum_{j=1}^{m} 1[x_{i,\pi_i}^j = x_{i,\pi_i}]$ is the observed count of the assignment $x_{i,\pi_i}$ in the data.
Parameter Estimation with Complete Observations
$l(\theta; D) = \sum_i \sum_{x_{i,\pi_i}} N(x_{i,\pi_i}) \log \theta_i(x_i, x_{\pi_i})$
Estimation of $\theta_i$ is independent of $\theta_k$ for $k \neq i$.
To estimate $\theta_i$, ignore the data associated with nodes other than $i$ and $\pi_i$, and maximize $\sum_{x_{i,\pi_i}} N(x_{i,\pi_i}) \log \theta_i(x_i, x_{\pi_i})$ with respect to $\theta_i$.
Add a Lagrangian term for normalization and solve for $\theta_i$.
The ML estimate is the relative frequency:
$\hat{\theta}_i(x_i, x_{\pi_i}) = N(x_i, x_{\pi_i}) / N(x_{\pi_i})$
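The relative-frequency estimate amounts to two counters per node. The following is a small sketch under assumed conventions: each observation is a dict from variable name to value, and the function name `mle_cpt` and the rain/wet toy data are hypothetical, not from the lecture.

```python
from collections import Counter


def mle_cpt(data, child, parents):
    """MLE of theta_i(x_i, x_pi) = N(x_i, x_pi) / N(x_pi) from complete data."""
    joint = Counter()   # N(x_i, x_pi)
    parent = Counter()  # N(x_pi)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        parent[pa] += 1
    return {(xi, pa): n / parent[pa] for (xi, pa), n in joint.items()}


# Hypothetical two-node network rain -> wet with seven complete observations.
data = [
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 1},
]
theta_wet = mle_cpt(data, "wet", ["rain"])
theta_rain = mle_cpt(data, "rain", [])
```

Note that `mle_cpt` for `wet` never looks at any variable other than `wet` and its parent, which is the decoupling stated above. Assignments that never occur in the data are simply absent from the returned table.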
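Parameter Estimation in HMMs
$P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{l} O_{y_i, x_i} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i}$
MLE estimates from $m$ labeled sequences, where sequence $j$ has length $l_j$:
$\pi_\sigma = \frac{\sum_{j=1}^{m} 1[y_1^j = \sigma]}{m}$
$T_{\sigma,\sigma'} = \frac{\sum_{j=1}^{m} \sum_{i=1}^{l_j - 1} 1[y_{i+1}^j = \sigma \wedge y_i^j = \sigma']}{\sum_{j=1}^{m} \sum_{i=1}^{l_j - 1} 1[y_i^j = \sigma']}$
$O_{\sigma,u} = \frac{\sum_{j=1}^{m} \sum_{i=1}^{l_j} 1[y_i^j = \sigma \wedge x_i^j = u]}{\sum_{j=1}^{m} \sum_{i=1}^{l_j} 1[y_i^j = \sigma]}$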
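These three ratios can be computed in one pass over labeled sequences. The sketch below is an illustration under assumed conventions: the function name `hmm_mle`, the dictionary keys `(next, prev)` for transitions and `(state, symbol)` for observations, and the toy part-of-speech data are all hypothetical.

```python
from collections import Counter


def hmm_mle(sequences):
    """Relative-frequency estimates of pi, T and O from (x, y) sequence pairs."""
    init, trans, prev, emit, state = (Counter() for _ in range(5))
    for x, y in sequences:
        init[y[0]] += 1
        for i in range(len(y)):
            emit[(y[i], x[i])] += 1
            state[y[i]] += 1
            if i + 1 < len(y):
                # Only positions 1..l_j - 1 have an outgoing transition.
                trans[(y[i + 1], y[i])] += 1
                prev[y[i]] += 1
    m = len(sequences)
    pi = {s: c / m for s, c in init.items()}
    T = {(s, sp): c / prev[sp] for (s, sp), c in trans.items()}
    O = {(s, u): c / state[s] for (s, u), c in emit.items()}
    return pi, T, O


# Hypothetical labeled data: D = determiner, N = noun, V = verb.
sequences = [
    (["the", "dog", "runs"], ["D", "N", "V"]),
    (["a", "cat", "sleeps"], ["D", "N", "V"]),
    (["dogs", "run"], ["N", "V"]),
]
pi_hat, T_hat, O_hat = hmm_mle(sequences)
```

The transition denominator counts a state only when it has a successor, matching the upper limit $l_j - 1$ in the formula, while the observation denominator counts every position.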
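MLE with Complete Observations for MRF
$p(x \mid \theta) = \frac{\prod_C \phi_C(x_C)}{Z_\theta}, \qquad Z_\theta = \sum_{x \in Y^n} \prod_C \phi_C(x_C)$
In MRFs, $Z$ couples the parameters. There is no closed-form solution for $\theta$.
For potential functions $\phi_C(x_C) = \exp(\langle f(x_C), \theta_C \rangle)$:
$l(\theta; D) = \sum_{j=1}^{m} \log p(x^j \mid \theta)$
$= \sum_j \Big( \sum_C \langle f(x_C^j), \theta_C \rangle - \log Z_\theta \Big)$
$= \sum_C \sum_{x_C} N(x_C) \langle f(x_C), \theta_C \rangle - m \log Z_\theta$
Take the derivative of $l$ with respect to $\theta$, equate it to 0, and solve for $\theta$.
The derivative of $\log Z_\theta$ with respect to $\theta_C$ is the expectation $E_{x \sim p_\theta}[f(x_C)]$.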
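Spelling out the stationarity condition makes the coupling visible. The worked step below uses only the quantities defined on this slide; it is a sketch of the standard derivation, not a transcription of the lecture.

```latex
\frac{\partial l(\theta; D)}{\partial \theta_C}
  = \sum_{x_C} N(x_C)\, f(x_C) \;-\; m\, E_{x \sim p_\theta}\!\left[ f(x_C) \right] = 0
\quad\Longrightarrow\quad
E_{x \sim p_\theta}\!\left[ f(x_C) \right]
  = \frac{1}{m} \sum_{x_C} N(x_C)\, f(x_C)
```

At the optimum, the model's expected features match the empirical feature averages. The left-hand side depends on all of $\theta$ through $p_\theta$, so the condition must be solved iteratively, and each iteration needs inference to compute the expectation.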
Parameter Estimation with Latent Variables
When there are latent variables, we need to marginalize over them, which leads to coupling between parameters.
Let $X$ be the observed variables and $Z$ the latent variables.
$l(\theta; D) = \log \sum_z p(x, z \mid \theta)$
The Expectation Maximization (EM) algorithm is a general approach to maximum likelihood parameter estimation (MLE) with latent (hidden) variables.
EM can be viewed as a coordinate-descent algorithm that minimizes a Kullback-Leibler (KL) divergence.
EM Algorithm
Since $Z$ is not observed, the log-likelihood of the data is a marginal over $Z$, which does not decompose:
$l(\theta; x) := \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
Replace this with the expected complete log-likelihood with respect to the averaging distribution $q(z \mid x)$. This is linear in $l_c(\theta; x, z) := \log p(x, z \mid \theta)$:
$\langle l_c(\theta; x, z) \rangle_q := \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$
This leads to a lower bound on $l(\theta; x)$. Using Jensen's inequality:
$l(\theta; x) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)} \ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = L(q, \theta)$
EM Algorithm
Until convergence:
Maximize with respect to $q$ (E step): $q^{t+1} = \arg\max_q L(q, \theta^t)$
Maximize with respect to $\theta$ (M step): $\theta^{t+1} = \arg\max_\theta L(q^{t+1}, \theta)$
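EM Algorithm
The M step in detail. The entropy term of $L$ does not depend on $\theta$, so maximizing $L$ over $\theta$ is the same as maximizing the expected complete log-likelihood:
$\arg\max_\theta L(q^{t+1}, \theta) = \arg\max_\theta \Big( \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) \Big) = \arg\max_\theta \langle l_c(\theta; x, z) \rangle_q$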
EM Algorithm
The E step in detail: $q^{t+1} = \arg\max_q L(q, \theta^t)$.
$l(\theta; x) - L(q, \theta) = \sum_z q(z \mid x) \log p(x \mid \theta) - \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = D(q(z \mid x) \,\|\, p(z \mid x, \theta))$
Since $l(\theta; x)$ does not depend on $q$, minimizing $l(\theta; x) - L(q, \theta)$ with respect to $q$ is equivalent to maximizing $L(q, \theta)$.
The KL divergence is minimized when $q(z \mid x) = p(z \mid x, \theta^t)$.
Intuitively, given $p(x, z \mid \theta)$, $p(z \mid x, \theta^t)$ is the best guess for the latent variables conditioned on $x$.
M step: $\theta^{t+1} = \arg\max_\theta L(q^{t+1}, \theta)$.
The M step increases a lower bound on the likelihood. The E step closes the gap to yield $l(\theta^t; x) = L(q^{t+1}, \theta^t)$.
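A small example helps to see the two steps alternate. The sketch below runs EM on a mixture of two biased coins, a toy model chosen here for illustration and not taken from the lecture: each data point is the number of heads in `n` flips of one of two coins, and the identity of the coin is the latent variable $z$. The function names, the data, and the initial values are all assumptions.

```python
from math import comb, log


def binom(k, n, p):
    """Probability of k heads in n flips of a coin with bias p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def log_likelihood(data, n, w, pa, pb):
    """Marginal log-likelihood l(theta; x), summing out the coin identity."""
    return sum(log(w * binom(k, n, pa) + (1 - w) * binom(k, n, pb)) for k in data)


def em_two_coins(data, n, w, pa, pb, iters=20):
    history = [log_likelihood(data, n, w, pa, pb)]
    for _ in range(iters):
        # E step: q(z = A | x) = p(z = A | x, theta_t), the posterior over coins.
        r = []
        for k in data:
            a = w * binom(k, n, pa)
            b = (1 - w) * binom(k, n, pb)
            r.append(a / (a + b))
        # M step: relative frequencies computed from expected counts.
        ra = sum(r)
        rb = len(data) - ra
        w = ra / len(data)
        pa = sum(ri * k for ri, k in zip(r, data)) / (n * ra)
        pb = sum((1 - ri) * k for ri, k in zip(r, data)) / (n * rb)
        history.append(log_likelihood(data, n, w, pa, pb))
    return w, pa, pb, history


# Made-up data: heads out of 10 flips for eight draws of an unknown coin.
data = [9, 8, 2, 1, 7, 3, 9, 2]
w, pa, pb, history = em_two_coins(data, 10, 0.5, 0.6, 0.4)
```

The `history` list records $l(\theta^t; x)$ after every iteration; since the E step makes the bound tight and the M step can only raise it, the recorded values should never decrease.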
EM for HMMs
E step: get the expected counts using the Forward-Backward algorithm.
$q^t = p(z \mid x, \theta^{t-1})$
M step: find the MLE estimate with respect to the expected counts (relative frequency).
$\theta^t = \arg\max_\theta \langle l_c(\theta; x, z) \rangle_{q^t}$
Discriminative Training of HMMs
Maximum Entropy Markov Models (MEMM) [McCFrePer00]. (Generalization to Bayes Nets is trivial.)
[Figure 1, from [McCFrePer00]: (a) the dependency graph for a traditional HMM; (b) the dependency graph for the conditional maximum entropy Markov model.]
a) HMM: $P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{N} O_{y_i, x_i} \prod_{i=1}^{N-1} T_{y_{i+1}, y_i}$
b) MEMM: $P(y \mid x; \theta) = \pi_{y_1, x_1} \prod_{i=1}^{N-1} T_{y_{i+1}, y_i, x_{i+1}}$
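Inference and Parameter Estimation in MEMMs
Forward-Backward algorithm:
$\alpha_i(\sigma) = \sum_{\sigma' \in Y} \alpha_{i-1}(\sigma')\, T_{\sigma, \sigma', \bar{x}_i}$
$\beta_i(\sigma) = \sum_{\sigma' \in Y} T_{\sigma', \sigma, \bar{x}_{i+1}}\, \beta_{i+1}(\sigma')$
Representation, for $\sigma, \sigma' \in Y$, $\bar{u} \in O$:
$T_{\sigma, \sigma', \bar{u}} = \frac{1}{Z(\bar{u}, \sigma')} \exp(\langle \theta, f(\bar{u}, \sigma', \sigma) \rangle)$
$Z(\bar{u}, \sigma') = \sum_{\tilde{\sigma} \in Y} \exp(\langle \theta, f(\bar{u}, \sigma', \tilde{\sigma}) \rangle)$
The coupling between the components of $\theta$ through $Z$ means there is no closed-form solution. Use a convex optimization technique.
Note the difference from undirected models, where $Z$ is a sum over all possible values of all nodes ($Y^n$). Here it is a sum over $Y$ only.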
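The local normalization is easy to see in code. The sketch below computes one MEMM transition distribution as a softmax over the label set. Everything concrete in it is an assumption made for illustration: the function names, the representation of $f$ as a list of active binary feature names, the weights in `theta`, and the two labels.

```python
from math import exp


def memm_transition(theta, features, u, prev, labels):
    """Distribution over the next label given the previous label and observation u."""
    # Score <theta, f(u, prev, s)> with binary features: sum the active weights.
    scores = {
        s: sum(theta.get(f, 0.0) for f in features(u, prev, s)) for s in labels
    }
    # Z(u, prev) sums over the |Y| candidate labels only, not over Y^n.
    z = sum(exp(v) for v in scores.values())
    return {s: exp(v) / z for s, v in scores.items()}


def features(u, prev, s):
    """Hypothetical overlapping features of the observation, paired with label s."""
    feats = [f"prev={prev}&y={s}"]
    if u[0].isupper():
        feats.append(f"cap&y={s}")
    if u == "Brown":
        feats.append(f"word=Brown&y={s}")
    return feats


# Made-up weights; features without an entry get weight 0.
theta = {"cap&y=NAME": 1.5, "word=Brown&y=NAME": 1.0, "prev=O&y=O": 0.5}
dist = memm_transition(theta, features, "Brown", "O", ["NAME", "O"])
```

The word "Brown" fires both the capitalization feature and the word-identity feature. Because the model conditions on the observation, the two overlapping features simply add their weights and no independence assumption between them is required.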
Generative vs Discriminative Approaches
Generative framework: HMMs model $p(x, y)$.
Advantages:
- Efficient learning algorithm: relative frequency.
- Can handle missing observations (simply treat them as latent variables).
- Naturally incorporates prior knowledge.
Disadvantages:
- Solves a harder problem: $p(x \mid y)\, p(y)$ vs $p(y \mid x)$.
- Questionable independence assumption: $x_i \perp x_j \mid y_i, \; \forall j \neq i$.
- Limited representation. Overlapping features are problematic.
  Assuming the features are independent gives $p(x_i \mid y_i) = \prod_k p(f_k(x_i) \mid y_i)$, an assumption that overlapping features violate. Example: $f_1(u)$ = Is $u$ the word "Brown"?, $f_2(u)$ = Is $u$ capitalized?
  Conditioning instead increases the number of parameters: $p(x_i \mid y_i) = \prod_k p(f_k(x_i) \mid y_i, f_1(x_i), \ldots, f_{k-1}(x_i))$.
- Increased complexity to capture dependencies between labels and neighboring observations, e.g. $p(x_i \mid y_i, y_{i-1})$.
Generative vs Discriminative Approaches
Discriminative learning: MEMMs model $p(y \mid x)$.
Advantages:
- Solves a more direct problem.
- Richer representation via kernels or feature selection. Because the model conditions on $x$, overlapping features are trivial to handle.
- No independence assumption on the observations.
- Lower asymptotic error.
Disadvantages:
- Expensive learning. There is no closed-form solution as in HMMs, so training needs the Forward-Backward algorithm.
- Missing values in the observations are problematic because of the conditioning.
- Requires more data ($O(d)$) to reach its asymptotic error, whereas generative models require $O(\log d)$, where $d$ is the number of parameters [NgJor01].