CS395T: Structured Models for NLP Lecture 10: Loopy Graphical Models

Size: px

Start display at page:

Download "CS395T: Structured Models for NLP Lecture 10: Loopy Graphical Models"

Allyson Griffin
5 years ago
Views:

1 CS395T: Structured Models for NLP Lecture 10: Loopy Graphical Models Recall: Global vs. Greedy State space Start state Gold end state Greg Durrett Greedy: 2n local training examples, only see gold states Global: one global example, might see new states Recall: Global Training with Early UpdaMng For each epoch For each sentence For i=1 2*len(sentence) # 2n transimons in arc-standard beam[i] = compute_successors(beam[i-1]) If beam[i] does not contain gold: # Feats are cumulamve up unml this point apply_gradient_update(feats(gold[0:i]) - feats(beam[i,0])) break # If we got to the end, gold may smll not be one-best If i == 2*len(sentence): apply_gradient_update(feats(gold) - feats(beam[2*len(sentence),0])) Administrivia Survey results: pace a bit too fast (assumes too much prior knowledge) Fast pace for a couple of lectures on graph-structured models, classical machine translamon More moderate pace on fundamentals of NNs / RNNs / neural MT Details for projects: I ll try to do this more FronMers / current research: a]er RNNs More materials: precision/recall of readings? Don t have expectamons for the final project It starts at 9:30am: sorry :(

2 Road Map Text Text Analysis Annota/ons POS tagging SyntacMc parsing NER Coreference resolumon First half of the class: more text analysis Second half of the class: more applicamons Applica/ons Summarize Extract informamon Answer quesmons IdenMfy senmment Translate Sequences: POS, NER Road Map Trees: consmtuency parsing, dependency parsing, semanmc role labeling Today: graph-structured models with two inference techniques: belief propagamon, Gibbs sampling Next Mme: classical (non-neural) machine translamon Then, part 2 of the class: neural networks, RNNs, CNNs, etc. P (y x) = 1 Z i=2 e Recall: CRFs t y n exp( t (y i 1,y i )) exp( e (y i,i,x)) i=1 Skip-chain CRFs The news agency Tanjug reported on the outcome of the meemng.? PER? The delegamon met the president at the airport, Tanjug said. Z is normalizing constant: how did we compute it? And marginals? Forward-backward: efficient dynamic program for summing things out Coreferent enmmes should be the same type One sense per discourse assumpmon: bank (river) and bank (financial) rarely occur in the same context

3 P (y x) / i=2 PER Skip-chain CRFs (j, k) are pairs of variables that we manually linked up PER exp( t (y i 1,y i )) exp( e (y i,i,x)) exp( `(y j,y k )) i=1 (j,k)? Inference P (y x) / exp( t (y i 1,y i )) exp( e (y i,i,x)) exp( `(y j,y k )) i=2 i=1 (j,k) How do we do forward-backward in this case? Assume just one sentence What if there are no links (j, k)? What if there s one link (j, k)? Iterate upward through i: keep track of state i-1, keep track of state j Dynamic program now tracks two states, so an extra factor of s For k links, state blows up by factor of s k Inference Inference Forward matrix (tensor): The news agency Tanjug reported on the outcome of the meemng. state j The delegamon met the president at the airport, Tanjug said. doc len state at curr Mmestep state j O(s 2 ) predecessors The news agency Tanjug reported on the outcome of the meemng. The delegamon met the president at the airport, Tanjug said. esterday, Tanjug also reported on Now would need to track two prior states generally becomes intractable

4 SoluMon 1: Belief propagamon Inference SoluMon 2: Gibbs sampling Belief PropagaMon Belief PropagaMon Forward-backward: instance of sum-product algorithm for inference in general tree-structured CRFs Sum-product doesn t work when there are loops, but it s usually a good approximamon, so we can just use it anyway f 1 v 1 f 2 v 2 Sum-Product Algorithm P (, x) = NotaMon: N(v) factors that are neighbors of variable v (use y for values) N(f) variables that are neighbors of factor f y f y values associated with a factor f Messages µ : vectors of values on edges between variable and factor (one message in each direcmon along edge). DistribuMons over y Posteriors are products of messages from factors: exp( ( )) exp( (, )) exp( (y1 0 )) exp( (y,y0 2 )) P y,y0 2 P (y i = y x) / f2n(v i ) µ f!vi (y)

5 V->f messages: F->v messages: Sum-Product Algorithm µ v!f (y) = f 0 2N(v),f 0 6=f Sum over all values of y for this factor with the ith coordinate set to y i µ f 0!v(y) Value of y is a product of what all other incoming messages say about y. I.e., propagate informamon from the rest of the graph, but don t feed the factor its own outputs µ f!vi (y i )= X y f, i exp( (y f, i,y i )) k:v k 2N(f),k6=i µ vk!f (y f,k ) Product over all other factors messages f 1 v 1 f 2 v 2 Sum-Product Algorithm exp( ( )) = [0.9, 0.1] exp( (, )) = 1 0 Factor requires = (these are zeroes in real space!) P (, )= P (, x) = exp( ( )) exp( (, )) exp( (y1 0 )) exp( (y,y0 2 )) P y,y0 2 Probability reflects both factors f 1 µ v!f (y) = v 1 f 2 v 2 Sum-Product Algorithm P (, x) = exp( ( )) = [0.9, 0.1] exp( (, )) = 1 0 IniMalize messages arbitrarily, then iterate over nodes (in some order): µ f!vi (y i )= X f 0 2N(v),f 0 6=f µ f 0!v(y) exp( ( )) exp( (, )) exp( (y1 0 )) exp( (y,y0 2 )) P y,y0 2 y f, i exp( (y f, i,y i )) k:v k 2N(f),k6=i µ vk!f (y f,k ) f 1 v 1 f 2 v 2 Final marginals: Sum-Product Algorithm P (, x) = P (y i = y x) / exp( ( )) exp( (, )) exp( (y1 0 )) exp( (y,y0 2 )) P y,y0 2 If graph is tree structured: once every node/factor has talked to every other node/factor, we have convergence For linear chains: need to run a forward pass and a backward pass f2n(v i ) µ f!vi (y)

6 ConnecMons to Forward-Backward µ v!f (y) = µ f 0!v(y) f 0 2N(v),f 0 6=f µ f!vi (y i )= X exp( (y f, i,y i )) y f, i Message from variable to next factor: product over current emission and previous factor message Message from factor to variable: incorporates transimon scores k:v k 2N(f),k6=i µ vk!f (y f,k ) Loopy Sum-Product y 3 exp( ( )) = [0.9, 0.1] exp( (, )) = 1 0 exp( (,y 3 )) = exp( (y 3, )) = What happens in this case? Posteriors blow up! Sum-product algorithm is not correct with loops We ve just broken the forward update into two pieces! Belief PropagaMon Algorithm y 3 exp( (, )) = exp( (,y 3 )) = exp( ( )) = [0.9, 0.1] 1 0 exp( (y 3, )) = Belief propaga/on: ignore this problem. Run sum-product for a while and use what it computes as an approxima3on to the true posterior Most of the Mme cyclic dependencies are not strong and it works out Some momvamon from stamsmcal physics, no guarantees on results /PER? EnMty Analysis Model for joint coreference, NER, and enmty linking to Wikipedia; here we ll just look at coref+ner Coreference ResoluMon is headquartered just outside AusMn. The company was founded LOC SemanMc Typing PER

7 EnMty Analysis EnMty Analysis Each menmon chooses an antecedent AusMn NEW PERSON,,... c 3 t 3 The company Coreference SemanMc typing f(x,c 3 ) c 1 c 3 Coreference t 1 c 2 t 2 AusMn t 3 The company f(x,t 3 ) SemanMc typing EnMty Analysis EnMty Analysis Coreference If coreferent, these menmons should have the same semanmc type c 1 c 2 SemanMc type PERSON,,... t 1 t 2 The company The company 5 menmons and a typical document has 200

8 What does BP inference look like? Belief PropagaMon c 2 Coreference Be a tree factor imposes the tree constraint on the variables t 1 t 2 New SemanMc type Sibling factors make it loopy PERSON The company PERSON Achieve the same thing as Koo s higher-order features Bansal et al. (2014) Skip-chain CRFs PER PER? P (y x) / exp( t (y i 1,y i )) exp( e (y i,i,x)) exp( `(y j,y k )) i=2 i=1 (j,k) Can we approximate P(y x) in other ways?

9 i=2 Inference P (y x) / exp( t (y i 1,y i )) exp( e (y i,i,x)) exp( `(y j,y k )) i=1 Can we sample from P(y x) and use those samples to approximate it? (Monte Carlo methods) For distribumons that are very peaked, samples look like the max anyway (j,k) Key idea: resample a single variable at a Mme condimoned on all others P (y i = y y i, x) / exp [ t(y i 1,y)+ t (y, y i+1 )+ e (y, i, x)+ 3 X X `(y, y k )+ `(y k,y) 5 k:(i,k) linked k:(k,i) linked P (y i = y y i, x) / exp [ t(y i 1,y)+ t (y, y i+1 )+ e (y, i, x)+ 3 X X `(y, y k )+ `(y k,y) 5 k:(i,k) linked Orange things are all constants now! k:(k,i) linked Fix all predicmons except one, easy to compute condimonal probabilimes (normalize scores for this parmcular variable y) Iterate over all variables repeatedly, like belief propagamon IniMalize y values to something reasonable for k=1 t iteramons: for i=1 m words in the document: y i = Sample from P (y i,...,i 1,i+1,...,m, x) Note: we need to iterate over the document several Mmes! The Gibbs sampling procedure forms a Markov chain whose equilibrium distribumon is the posterior However, you might need to run it for a very long Mme to get samples which don t depend on the inimalizamon.

Problems with P (x 1,x 2 ) x 2 x 1 0 0.49 0.01 1 0.01 0.49 Start with x = (0, 0) P (x 2 x 1 = 0) = [0.98, 0.

02] stay at (0, 0) 98% of the Mme stay at (0, 0) 98% of the Mme Takes ~50 steps before we switch to (1, 1) need to run Gibbs sampling for a long Mme to get a good approximamon of the posterior

10 Problems with P (x 1,x 2 ) x 2 x Start with x = (0, 0) P (x 2 x 1 = 0) = [0.98, 0.02] P (x 1 x 2 = 0) = [0.98, 0.02] stay at (0, 0) 98% of the Mme stay at (0, 0) 98% of the Mme Takes ~50 steps before we switch to (1, 1) need to run Gibbs sampling for a long Mme to get a good approximamon of the posterior Unsupervised POS inducmon with alignments across languages Takeaways Can define loopy factor graphs and smll do inference Belief propagamon and Gibbs sampling both work best if there are only weak cyclic dependencies. This is usually the case if the loopy factors incorporate features and the loops are large Can incorporate nice features this way, not as commonplace and a bit harder to get working, but everyone thinks this stuff is cool Other ways of doing this: output reranking, beam search (give up on doing principled inference), Naseem et al. (2009)

CS395T: Structured Models for NLP Lecture 19: Advanced NNs I

CS395T: Structured Models for NLP Lecture 19: Advanced NNs I Greg Durrett Administrivia Kyunghyun Cho (NYU) talk Friday 11am GDC 6.302 Project 3 due today! Final project out today! Proposal due in 1 week