Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu

1 Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu

2 Teaching evaluation survey. Fill it in!

3 Presentations next week 21 projects 20/6: 12 presentations 27/6: 9 presentations 10 minutes per presentation + 2 minutes for questions and transition 4 projects per hour Some wiggle room if there are any problems Put presentations in folder

4 Presentations next week Describe precisely the task and its importance Describe prior work What you are planning to do and why Evaluation What are potential problems and how you will address them

5 Last class Sequence to sequence models LSTMs and GRUs Attention Pointer networks

6 Today Weak supervision Tree structured neural networks A nice paper on combining search and learning Stuff we didn't cover Concluding remarks

7 Weak supervision

8 Weak supervision We have assumed that we have as input pairs of natural language and logical form In practice those are hard to collect and we usually have (language, denotation) pairs

9 The problem Before, we trained with cross entropy over tokens, but we don't have tokens here. [Figure: decoder softmax at step t with Type (0.7) and Profession (0.2); the argmax, Type, is passed on to step t+1]

10 This looks familiar Search with CKY Can we do something similar with a seq2seq model?

11 Markov Decision Process Sequence of states, actions and rewards: states $s_0, s_1, s_2, \ldots, s_T$ from a set $S$; actions $a_0, a_1, a_2, \ldots, a_T$ from a set $A$. Let's assume a deterministic transition function $f: S \times A \to S$. Rewards $r_0, r_1, r_2, \ldots, r_T$ are given by a reward function $r(s, a)$. We want a policy $\pi(a \mid s)$ providing a distribution over actions that will maximize expected future reward. [Figure: chain of states $s_0, s_1, s_2, \ldots, s_T$ linked by actions $a_0, a_1, a_2, \ldots, a_{T-1}$]

12 Liang et al., 2017, Guu et al., 2017 Seq2seq as MDP The state $s_t$ is the hidden state $h_t$. The action $a_t$ is in $A(s_t)$: either all symbols in the target vocabulary, or all valid symbols if we check grammaticality. $r_t$ is zero in all steps except the last. Then, it is 1 if execution results in a correct answer and 0 otherwise. [Figure: input "tall Lebron James?", decoder starts from <s> and emits HeightOf]

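13 Seq2seq as MDP: policy
$$p_\theta(z \mid x) = \prod_t p_\theta(z_t \mid x, z_0, \ldots, z_{t-1}) = \prod_t p_\theta(a_t \mid x, a_0, \ldots, a_{t-1}) = \prod_t \pi_\theta(a_t \mid s_t)$$
$$\pi_\theta(a_t \mid s_t) = \mathrm{softmax}(W^{(s)} h_t)$$
[Figure: input "tall Lebron James?", decoder starts from <s> and emits HeightOf] How do we learn?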
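The factorization of the policy can be made concrete with a small sketch. It assumes the per-step scores (the vectors fed to the softmax) are already computed; the score values, the three-symbol vocabulary, and the function names are illustrative only and are not taken from the slides.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def sequence_log_prob(step_scores, actions):
    """log p(z | x) as the sum over steps of log pi(a_t | s_t)."""
    return sum(log_softmax(scores)[a] for scores, a in zip(step_scores, actions))

# Hypothetical scores over a 3-symbol target vocabulary, for two decoding steps.
step_scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
actions = [0, 2]  # the symbols chosen at steps 0 and 1
logp = sequence_log_prob(step_scores, actions)
```

Because each step is normalized on its own, the probabilities of all length-2 sequences sum to one, which is what "locally normalized" means here.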

14 Option 1: Maximum marginal likelihood Our data is language-denotation pairs (x, y). We obtain y by constructing a logical form z, so we can use maximum marginal likelihood like before. Interleave search and learning: apply search to get candidate logical forms (with a fixed model), then update parameters based on the candidates. Difference from before: there, search was done with CKY and learning used a globally-normalized model; here, search can be done with beam search and we have a locally-normalized model.

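15 Maximum marginal likelihood y is independent of x conditioned on z:
$$p_\theta(y \mid x) = \sum_z p_\theta(z \mid x)\, p(y \mid z) = \sum_z p_\theta(z \mid x)\, R(z) = E_{p_\theta(z \mid x)}[R(z)]$$
$$L_{MML}(\theta) = \log \prod_{(x,y)} p_\theta(y \mid x) = \log \prod_{(x,y)} E_{p_\theta(z \mid x)}[R(z)] = \sum_{(x,y)} \log \sum_z p_\theta(z \mid x)\, R(z)$$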

16 Gradient of MML Gradient has similar form to what we have seen in the past, except that we are not in a log-linear model. Let's assume a binary reward:
$$\nabla_\theta \log \sum_z p_\theta(z \mid x)\, R(z) = \frac{\sum_z p_\theta(z \mid x)\, R(z)\, \nabla_\theta \log p_\theta(z \mid x)}{\sum_{z'} p_\theta(z' \mid x)\, R(z')} = \sum_z p_\theta(z \mid x, R(z) = 1)\, \nabla_\theta \log p_\theta(z \mid x)$$
Compute the gradient of the log probability for every logical form, and weight the gradient using the reward.

17 Computing the gradient We cannot enumerate all of the logical forms. Instead we perform beam search as usual and get a beam Z containing K logical forms. We imagine that this beam is the entire set of possible logical forms:
$$\sum_{z \in Z} p_\theta(z \mid x, R(z) = 1)\, \nabla_\theta \log p_\theta(z \mid x)$$
For every z we can compute the gradient of $\log p_\theta(z \mid x)$ since this is now the usual seq2seq setup.
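As a rough illustration of the beam approximation, the sketch below computes the per-candidate weights p(z | x, R(z) = 1) when the beam is treated as the whole space of logical forms. The probabilities and rewards are made-up numbers, and returning all-zero weights when no candidate is correct is an assumption of this sketch (the log of a zero marginal is undefined), not something the slides specify.

```python
def mml_weights(beam_probs, rewards):
    """Weights p(z | x, R(z) = 1), renormalized over the beam only."""
    mass = sum(p * r for p, r in zip(beam_probs, rewards))
    if mass == 0.0:
        # Assumption: no correct logical form on the beam, so skip the update.
        return [0.0] * len(beam_probs)
    return [p * r / mass for p, r in zip(beam_probs, rewards)]

# Hypothetical beam of K = 3 logical forms with model probabilities and binary rewards.
beam_probs = [0.5, 0.3, 0.1]
rewards = [0, 1, 1]
weights = mml_weights(beam_probs, rewards)  # roughly [0.0, 0.75, 0.25]
```

Each weight then multiplies the ordinary seq2seq gradient of the log probability of the corresponding logical form.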

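18 Option 2: policy gradient We would like to simply maximize our expected reward:
$$E_{p_\theta(z \mid x)}[R(z)] = \sum_z p_\theta(z \mid x)\, R(z)$$
$$L_{RL}(\theta) = \sum_{(x,y)} \sum_z p_\theta(z \mid x)\, R(z) = \sum_{(x,y)} E_{p_\theta(z \mid x)}[R(z)]$$
$$\nabla_\theta L_{RL}(\theta) = \sum_{(x,y)} \sum_z p_\theta(z \mid x)\, R(z)\, \nabla_\theta \log p_\theta(z \mid x) = \sum_{(x,y)} E_{p_\theta(z \mid x)}[R(z)\, \nabla_\theta \log p_\theta(z \mid x)]$$
Weight the gradient by the product of the reward and the model probability.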

19 Computing the gradient Again, we cannot sum over all logical forms. But the gradient for every example is an expectation over a distribution we can sample from! So we can sample many logical forms, compute the gradient for each, and average them weighted by the reward; sampling from the model already accounts for the model-probability factor. Again, for every sample this is regular seq2seq and we can compute an approximate gradient.
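A toy sketch of the sampling estimate follows. To keep it checkable by hand it assumes a one-step policy: a softmax over a handful of complete logical forms, parameterized directly by logits. That simplification, the function names, and the numbers in the usage note are assumptions of the sketch, not the lecture's model. In this setting the gradient of log p(z) with respect to logit k is (1 if k = z else 0) minus p(k).

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    """Gradient of sum_z p(z) R(z) w.r.t. the logits: p_k * (R_k - E[R])."""
    p = softmax(logits)
    expected_reward = sum(pk * rk for pk, rk in zip(p, rewards))
    return [pk * (rk - expected_reward) for pk, rk in zip(p, rewards)]

def sampled_gradient(logits, rewards, n_samples, rng):
    """Monte Carlo estimate: average of R(z) * grad log p(z) over samples z ~ p."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        z = rng.choices(range(len(p)), weights=p)[0]
        for k in range(len(p)):
            indicator = 1.0 if k == z else 0.0
            grad[k] += rewards[z] * (indicator - p[k]) / n_samples
    return grad
```

With logits such as `[0.0, 1.0, -1.0]`, rewards `[0.0, 1.0, 0.0]` and a few thousand samples, `sampled_gradient` should land close to `exact_gradient`. Samples with zero reward contribute nothing, which is the bootstrapping difficulty when correct logical forms start out with low probability.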

20 Some differences Using MML with beam search gives a biased estimator and has less exploration - we only observe the approximate top-k logical forms. Using RL could be harder to train. If we have a correct logical form z* that has low probability at the beginning of training, then its contribution to the gradient would be very small and it would be hard to bootstrap.

21 Intermediate summary Training a seq2seq model with weak supervision is problematic because the loss function is not a differentiable function of the input. We saw both MML and RL approaches for getting around that. In both we find a set of logical forms, compute the gradient for them like in supervised learning, and weight them in some way to form a final gradient. This lets us train with SGD. Often this is still hard to train - more ahead

22 Bengio et al, 2013, Jang et al, 2016 Other approaches Other approaches have been suggested to overcome the non-differentiability problem Using softmax directly Temperature Straight-through estimator

23 Neelakantan et al, 2016 Softmax Replace argmax with softmax at training time: pass as input the average of embeddings. Use argmax at test time. Could approximate argmax (skewed distributions). Train-test mismatch (you don't train on logical forms). [Figure: softmax output WeightOf (0.7), HeightOf (0.2), followed by an embed-and-average step]

24 Adding temperature $p(a) = \frac{\exp(s(a)/t)}{\sum_{a'} \exp(s(a')/t)}$ Start with high temperature: when t = 1 this is softmax; when t is high the distribution is close to uniform. Anneal temperature towards 0: get close to argmax.
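The effect of the temperature can be checked numerically with a small sketch. The scores and the three temperature values are arbitrary choices for illustration.

```python
import math

def softmax_with_temperature(scores, t):
    """p(a) proportional to exp(s(a) / t); t must be positive."""
    scaled = [s / t for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]
hot = softmax_with_temperature(scores, 100.0)   # close to uniform
plain = softmax_with_temperature(scores, 1.0)   # ordinary softmax
cold = softmax_with_temperature(scores, 0.01)   # close to one-hot on the argmax
```

An annealing schedule would move t from a high value towards 0 over the course of training; the schedule itself is not specified here.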

25 Bengio, 2013 Straight-through estimator Use argmax at forward pass Pretend that you had softmax in the backward pass
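A minimal sketch of this idea, written out by hand rather than with an autodiff library: the forward pass emits a hard one-hot vector, and the backward pass propagates the incoming gradient through the softmax Jacobian as if the forward pass had been a softmax. Function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def st_forward(scores):
    """Forward pass: hard one-hot vector at the argmax."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def st_backward(scores, grad_output):
    """Backward pass: pretend the forward pass was softmax(scores).

    Returns dL/dscores_j = p_j * (g_j - sum_i p_i g_i), i.e. the softmax
    Jacobian applied to the incoming gradient g.
    """
    p = softmax(scores)
    dot = sum(pi * gi for pi, gi in zip(p, grad_output))
    return [pi * (gi - dot) for pi, gi in zip(p, grad_output)]

scores = [1.0, 3.0, 2.0]
y = st_forward(scores)                        # [0.0, 1.0, 0.0]
grad = st_backward(scores, [0.5, -1.0, 0.2])  # gradient w.r.t. the scores
```

The mismatch between the forward and backward computations is what makes the estimator biased, in exchange for using a discrete choice at training time.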

26 Global loss functions Gold: I saw the big tree there. Pred 1: I saw the the tree there. Pred 2: I saw the tree over there. The first is better than the second from a maximum likelihood point of view.

27 Exposure bias Training time: the model observes correct tokens only (I saw the big). At test time we observe predicted tokens (I saw the the). After an error our hidden state might be different from anything we have seen and errors can accumulate - train-test mismatch

28 Solution - RL? We can just define a reward on the entire token sequence! Then we have the same setup as semantic parsing But training with REINFORCE is much harder than training with ML Imitation learning

29 Imitation learning

30 Imitation learning In sequence to sequence the expert is simply the correct sequence. So this is exactly maximum likelihood Imitation learning algorithms provide a way to avoid exposure bias

31 Dagger For an example (x, y) and an expert $\pi^*$: define a policy $\pi = \beta \pi^* + (1 - \beta) \pi_t$. Sample from $\pi$, and use (x, y) to define the loss. Train. Reduce $\beta$ exponentially.
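The mixture policy can be sketched in a few lines. The sketch assumes the simplest possible expert, one that proposes the gold token at the same position regardless of the sampled prefix; `model_policy`, `beta0` and `decay` are hypothetical names and settings, not part of the slides.

```python
import random

def mixture_action(expert_action, model_action, beta, rng):
    """One step of pi = beta * pi_star + (1 - beta) * pi_t."""
    return expert_action if rng.random() < beta else model_action

def beta_schedule(beta0, decay, iteration):
    """Exponentially decaying beta (beta0 and decay are assumed settings)."""
    return beta0 * decay ** iteration

def rollout(expert_seq, model_policy, beta, rng):
    """Roll out the mixture policy and collect (visited prefix, expert token) pairs.

    Assumption: the expert simply proposes the gold token at each position.
    """
    prefix, examples = [], []
    for gold in expert_seq:
        examples.append((tuple(prefix), gold))
        prefix.append(mixture_action(gold, model_policy(prefix), beta, rng))
    return examples
```

With `beta = 1.0` the visited prefixes are exactly the gold prefixes, which is ordinary maximum likelihood training; as beta decays, the model is trained on prefixes it produced itself.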

32 Dagger This gets rid of exposure bias. Problems: in some cases after you sample something wrong, it is hard to define an expert/oracle at all! Still haven't dealt with the problem of bad reward

33 Choi et al., 2017 Global reward A similar annealing approach can be taken for changing the reward. Begin with the ML objective and slowly transition to your true reward as your model becomes better. Mix your two loss functions in some way and change their weights with time. At the end of training you are sampling just from your model and using your true reward - simple RL!

34 Liang et al., 2017 Semantic parsing In semantic parsing there is no expert A way to get around that is to interleave maximum likelihood training with RL training ML training: Run search: find some approximate gold logical form z* Train with an ML objective on z* RL training: Sample from your model and update according to REINFORCE

35 Summary Using sequence to sequence models with delayed reward does not have a consensus solution yet Solutions range from Continuous relaxations REINFORCE like algorithm Mixing ML with RL in various ways Many flavors of curriculum learning/annealing

36 Tree RNNs

37 Back to Trees We learned about constituency parsing with linear models. We saw we can replace linear scoring functions with deep neural networks and maintain exact decoding. But what if we forgo exact decoding and instead use a powerful non-linear model? Can we use an LSTM-like architecture to build trees?

38 Meaning composition We compose meanings left-to-right (over prefixes). [Figure: "tall Lebron James?" composed word by word: tall, Lebron, ...]

39 But language is hierarchical A snowboarder is leaping over a mogul. A person on a snowboard jumps into the air. Our understanding of language depends on building the meaning of larger units. Having a prior on the structure of language should be beneficial. Should we compute meaning up a tree structure?

41 Reality Unfortunately it is not always easy to show that using a tree structure over language is better than a sequence because LSTMs are strong learning machines and have good memory Where do you get the trees from? Often trees are wrong and expensive to compute

42 Distributed representations for sentences

43 Embedding a sentence Use compositionality to build the representations This is more than just syntactic parsing! We hope to achieve a semantic representation as well

44 If we have a parse tree

45 Recursive neural networks

46 Socher et al., 2012 Recursive neural networks $p = \tanh(W_1 c_1 + W_2 c_2 + b)$, $s(p) = w^\top p$
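The composition and scoring functions can be written out directly. This is a sketch in plain Python with made-up two-dimensional parameters and word vectors; real models learn these values and use much larger dimensions.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def compose(W1, W2, b, c1, c2):
    """Parent vector p = tanh(W1 c1 + W2 c2 + b)."""
    left, right = matvec(W1, c1), matvec(W2, c2)
    return [math.tanh(l + r + bias) for l, r, bias in zip(left, right, b)]

def score(w, p):
    """Plausibility score s(p) = w^T p of the node with vector p."""
    return sum(wi * pi for wi, pi in zip(w, p))

# Hypothetical 2-dimensional parameters and word vectors.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
w = [1.0, -1.0]
tall, lebron, james = [0.2, -0.1], [0.4, 0.3], [-0.2, 0.1]

name = compose(W1, W2, b, lebron, james)  # vector for (Lebron James)
phrase = compose(W1, W2, b, tall, name)   # vector for (tall (Lebron James))
phrase_score = score(w, phrase)
```

The same parameters are reused at every node, so the parent vector has the same dimension as its children and can be composed again higher up the tree.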

47 Combine with greedy parsing

48 Combining with greedy parsing

49 Combining with greedy parsing

50 Global score The score of the tree is the sum of decisions along the way. They used a max-margin loss: for a training pair (x, y) maximize $s(x, y) - \max_{y'} \left( s(x, y') + \Delta(y, y') \right)$. Solving this max is exponential! Approximate with beam search
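A small sketch of the margin objective, with the max taken over an explicit candidate list standing in for the trees found by beam search. The scores and costs are invented for illustration; the cost plays the role of the structured loss between the gold tree and a candidate.

```python
def margin_objective(gold_score, candidates):
    """s(x, y) - max over y' of (s(x, y') + cost(y, y')).

    candidates: list of (score, cost) pairs, e.g. the trees on a beam.
    """
    most_violating = max(s + cost for s, cost in candidates)
    return gold_score - most_violating

# Hypothetical beam: the gold tree itself (cost 0) plus two wrong trees.
gold_score = 3.0
beam = [(3.0, 0.0), (2.5, 1.0), (1.0, 2.0)]
objective = margin_objective(gold_score, beam)  # -0.5
```

When the gold tree is among the candidates the objective is at most zero, and it reaches zero only once every wrong tree scores below the gold tree by at least its cost.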

51 Syntactic parsing This did not result in state-of-the-art performance Problem: A single matrix that composes all pairs of words regardless of their syntax

52 Socher et al., 2013 Compositional vector grammars Use PCFGs to build the structure of the tree and then compose the meaning with vectors. We have a different matrix for every pair of syntactic categories that are on the RHS of a grammar rule. The score for category P, which has vector p, is obtained by multiplying p with a vector and adding the log probability of the rule in the grammar: $s(p, P) = \log P(P \to B\,C) + w^\top f(W^{(BC)} [b; c])$

53 Compositional vector grammars

54 Socher et al., 2013 Problems Many parameters Hard to compute everything from scratch Instead they did re-ranking on a simple PCFG model

55 Evaluation

56 Sentence representations Anecdotal

57 Tree LSTMs for CCG parsing Nice slides by Kenton Lee!

58 Stuff we didn't cover

59 Coreference resolution Coreference resolution: the task of clustering together expressions that refer to the same concept/entity. Michelle LaVaughn Robinson Obama is an American lawyer and writer. She is the wife of the 44th president of the United States, Barack Obama, and the first African-American first lady of the United States.

60 Importance Crucial for general natural language understanding Who is Barack Obama's spouse? Information extraction Anywhere you need to understand more than one sentence there are coreference issues

61 Main sub-tasks Entity extraction Coreference resolution Entity linking

62 Entity extraction Find all mentions of entities in a text Michelle LaVaughn Robinson Obama She The wife of

63 Coreference resolution Clustering of the entities extracted Michelle Obama she the wife of Barack Obama The president

64 Entity linking Michelle Obama she the wife of

65 General approaches Score every pair of entity mentions Find the best clustering In practice this is hard so usually a more greedy approach is used
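One possible greedy scheme, shown only as an illustration: process mentions left to right and attach each one to its best-scoring earlier mention, or start a new cluster if no score clears a threshold. The slides do not say which greedy method is meant, so this choice, the threshold, and the hand-written score table are all assumptions of the sketch.

```python
def greedy_cluster(mentions, pair_score, threshold):
    """Link each mention to its best-scoring antecedent above the threshold."""
    clusters = []    # list of clusters, each a list of mentions
    cluster_of = {}  # mention index -> index into clusters
    for i, mention in enumerate(mentions):
        best_j, best_score = None, threshold
        for j in range(i):
            s = pair_score(mentions[j], mention)
            if s > best_score:
                best_j, best_score = j, s
        if best_j is None:
            cluster_of[i] = len(clusters)
            clusters.append([mention])
        else:
            cluster_of[i] = cluster_of[best_j]
            clusters[cluster_of[i]].append(mention)
    return clusters

# Hypothetical pairwise scores; a real system would learn this function.
TOY_SCORES = {
    ("Michelle Obama", "She"): 0.9,
    ("Barack Obama", "the president"): 0.8,
}

def toy_pair_score(antecedent, mention):
    return TOY_SCORES.get((antecedent, mention), 0.1)

mentions = ["Michelle Obama", "She", "Barack Obama", "the president"]
clusters = greedy_cluster(mentions, toy_pair_score, 0.5)
```

Each decision is made once and never revisited, which keeps the procedure quadratic in the number of mentions instead of searching over all clusterings.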

66 Question answering A central task in NLP One can argue that any task can be reduced to QA with natural language A lot of interest in last couple of years Dominated by many large scale corpora and deep learning

67 SQuAD Rajpurkar et al, 2016

68 Daily Mail Hermann et al, 2015

69 Hewlett et al, 2016 WikiReading

70 Nguyen et al, 2016 Ms-MARCO Q: what is rba A: Results-Based Accountability is a disciplined way of thinking and taking action that communities can use to improve the lives of children, youth, families, adults and the community as a whole p1: Since 2007, the RBA's outstanding reputation has been affected by p2: The Reserve Bank of Australia (RBA) came into being on 14 January p3: Results-Based Accountability (also known as RBA) is a disciplined

71 TriviaQA Joshi et al, 2017

72 MCTest Richardson et al, 2013

73 ProcessBank Berant et al, 2014

74 Bowman et al., 2015 Natural Language Inference

75 Dependency parsing In constituency parsing we saw both greedy and graph-based methods. Similar distinctions also occur in dependency parsing, but the algorithms are different

76 Character-based models Ling et al., 2015

77 More Multi-task learning ILP inference and LP relaxation Summarization Interactive learning

78 Broad takeaways

79 What did we cover? Word embeddings Language models Sequence tagging Syntactic parsing Semantic parsing

80 Main technical tools Structured prediction Deep learning General recipe: Define a parameterized mapping from input to output Define a loss function Optimize Find best output at test time

81 High-level observations Often linear models can be replaced with nonlinear ones without change to the guarantees A growing trend is to move the burden from inference to learning. Let information flow between various variables in a neural network and make inference very simple It is still an active research area

82 Hopefully you Appreciate the complexity of building systems for natural language Understand the main tools used to build state-of-the-art systems nowadays Have a solid background to read papers Have a solid background to develop models for NLP

83 Apology This is the first time this class is taught under hard time constraints. Sorry for: the typos in the slides, the clarity of HW. We will learn and improve next time. Thanks for helping us debug the class!

84 Thank you!
