Natural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu


Projects: project descriptions due today!

Last class: sequence to sequence models; attention; pointer networks.

Today: weak supervision; supplementary material on semantic parsing.

Weak supervision

Weak supervision: we have assumed that the input is pairs of natural language and logical form. In practice those are hard to collect, and we usually have (language, denotation) pairs instead.

The problem: before, we trained with cross-entropy over tokens, but we don't have gold tokens here. [Figure: a decoder step t with a softmax over symbols, e.g. Type (0.7), Profession (0.2); an argmax picks Type and feeds it to step t+1.]

This looks familiar: search with CKY. Can we do something similar with a seq2seq model?

Markov Decision Process: a sequence of states, actions, and rewards. States $s_0, s_1, s_2, \dots, s_T$ from a set $S$; actions $a_0, a_1, a_2, \dots, a_T$ from a set $A$ (let's assume a deterministic transition function $f: S \times A \to S$); rewards $r_0, r_1, r_2, \dots, r_T$ given by a reward function $r(s, a)$. We want a policy $\pi(a \mid s)$ providing a distribution over actions that will maximize the expected future reward. [Figure: the chain $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \cdots \xrightarrow{a_{T-1}} s_T$.]

Seq2seq as MDP (Liang et al., 2017; Guu et al., 2017): the state is $s_t = h_t$ (the decoder hidden state); the action $a_t$ is in $A(s_t)$, which is either all symbols in the target vocabulary, or all valid symbols if we check grammaticality; the reward $r_t$ is zero at all steps except the last, where it is 1 if execution results in a correct answer and 0 otherwise. [Figure: encoder reading "tall Lebron James ?", decoder starting from <s> and emitting HeightOf ...]

Seq2seq as MDP: policy. $p(z \mid x) = \prod_t p(z_t \mid x, z_0, \dots, z_{t-1}) = \prod_t p(a_t \mid x, a_0, \dots, a_{t-1}) = \prod_t \pi(a_t \mid s_t)$, where $\pi(a_t \mid s_t) = \mathrm{softmax}(W^{(s)} h_t)$. [Figure: encoder over "tall Lebron James ?", decoder starting from <s> and emitting HeightOf ...]
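
A minimal sketch of this policy, assuming illustrative sizes and names (`h_t` for the decoder hidden state, `W_s` for the output projection); it is not code from the lecture:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; in practice h_t comes from the seq2seq decoder at step t.
hidden_size, vocab_size = 256, 50
h_t = torch.randn(hidden_size)               # state s_t = h_t
W_s = torch.randn(vocab_size, hidden_size)   # output projection W^(s)

scores = W_s @ h_t                    # one score per action (target symbol)
policy = F.softmax(scores, dim=-1)    # pi(a_t | s_t)

# One decoding step: sample an action, i.e. the next output token.
a_t = torch.multinomial(policy, num_samples=1).item()
```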

Seq2seq as MDP: policy. Same policy as above: $p(z \mid x) = \prod_t \pi(a_t \mid s_t)$ with $\pi(a_t \mid s_t) = \mathrm{softmax}(W^{(s)} h_t)$. How do we learn?

Option 1: Maximum marginal likelihood. Our data is (language, denotation) pairs (x, y). We obtain y by constructing a logical form z. We can use maximum marginal likelihood like before, interleaving search and learning: apply search to get candidate logical forms (with a fixed model), then update the parameters based on those candidates. Difference from before: search was done with CKY and learning used a globally-normalized model; now search can be done with beam search and we have a locally-normalized model.
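
A hedged sketch of the interleaved loop; `beam_search`, `execute`, and `update` are placeholder callables standing in for the usual seq2seq machinery, not an existing API:

```python
def train_mml(model, data, beam_search, execute, update,
              beam_size=10, epochs=5):
    """Interleave search and learning under weak supervision (sketch).

    beam_search(model, x, k)        -> list of candidate logical forms for x
    execute(z)                      -> denotation obtained by executing z
    update(model, x, zs, rewards)   -> one gradient step on the candidates
    """
    for _ in range(epochs):
        for x, y in data:
            # Search step: beam search with the current (fixed) parameters.
            candidates = beam_search(model, x, beam_size)
            # Reward each candidate by comparing its denotation with y.
            rewards = [1.0 if execute(z) == y else 0.0 for z in candidates]
            if any(rewards):
                # Learning step: update parameters, weighting candidates by reward.
                update(model, x, candidates, rewards)
```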

Explanation: there is no exact search → we need a search algorithm. In linear models, we used the globally-normalized scoring function to score partial trees. Should this work here?

Explanation: for sequence to sequence models: $\log p(z)\, R(z)$. [Figure: a beam of partial sequences over decoding steps t = 1 to t = 6.]

Maximum marginal likelihood: y is independent of x conditioned on z. $p_\theta(y \mid x) = \sum_z p_\theta(z \mid x)\, p(y \mid z) = \sum_z p_\theta(z \mid x)\, R(z) = \mathbb{E}_{p_\theta(z \mid x)}[R(z)]$. $L_{MML}(\theta) = \log \prod_{(x,y)} p_\theta(y \mid x) = \log \prod_{(x,y)} \mathbb{E}_{p_\theta(z \mid x)}[R(z)] = \sum_{(x,y)} \log \sum_z p_\theta(z \mid x)\, R(z)$.

Gradient of MML: the gradient has a similar form to what we have seen in the past, except that we are not in a log-linear model. Let's assume a binary reward: $\nabla \log \sum_z p_\theta(z \mid x)\, R(z) = \sum_z \frac{p_\theta(z \mid x)\, R(z)}{\sum_{z'} p_\theta(z' \mid x)\, R(z')} \nabla \log p_\theta(z \mid x) = \sum_z p_\theta(z \mid x, R(z) = 1)\, \nabla \log p_\theta(z \mid x)$. Compute the gradient of the log probability for every logical form, and weight the gradients using the reward.

Computing the gradient: we cannot enumerate all of the logical forms. Instead we perform beam search as usual and get a beam Z containing K logical forms, and we pretend that this beam is the entire set of possible logical forms: $\sum_{z \in Z} p_\theta(z \mid x, R(z) = 1)\, \nabla \log p_\theta(z \mid x)$. For every z we can compute the gradient of $\log p_\theta(z \mid x)$, since this is now the usual seq2seq setup.
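
A minimal sketch of this beam approximation, assuming `log_probs` holds log p(z|x) for the K candidates (produced by the decoder, so autograd can differentiate through it) and `rewards` holds the binary R(z); differentiating the masked log-marginal recovers exactly the renormalized weighting above:

```python
import torch

def mml_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Negative marginal log-likelihood restricted to a beam.

    log_probs: shape (K,), log p(z | x) for each candidate z in the beam.
    rewards:   shape (K,), binary R(z) (1 if execution is correct, else 0).
    """
    # -log sum_z p(z|x) R(z) over the beam; its gradient w.r.t. log_probs is
    # minus the renormalized posterior p(z | x, R(z) = 1).
    masked = log_probs + torch.log(rewards.clamp(min=1e-12))
    return -torch.logsumexp(masked, dim=0)

# Toy usage: 3 candidates, two of them execute correctly.
log_probs = torch.tensor([-1.2, -2.5, -0.7], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0])
mml_loss(log_probs, rewards).backward()
print(log_probs.grad)  # (almost) all gradient mass on the correct candidates
```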

Option 2: policy gradient. We would like to simply maximize our expected reward $\mathbb{E}_{p_\theta(z \mid x)}[R(z)] = \sum_z p_\theta(z \mid x)\, R(z)$. $L_{RL}(\theta) = \sum_{(x,y)} \sum_z p_\theta(z \mid x)\, R(z) = \sum_{(x,y)} \mathbb{E}_{p_\theta(z \mid x)}[R(z)]$. $\nabla L_{RL}(\theta) = \sum_{(x,y)} \sum_z p_\theta(z \mid x)\, R(z)\, \nabla \log p_\theta(z \mid x) = \sum_{(x,y)} \mathbb{E}_{p_\theta(z \mid x)}[R(z)\, \nabla \log p_\theta(z \mid x)]$. Weight the gradient by the product of the reward and the model probability.

Computing the gradient: again, we cannot sum over all logical forms. But the gradient for every example is an expectation over a distribution we can sample from! So we can sample many logical forms, compute the gradient for each, and sum them weighted by the product of the model probability and the reward. Again, for every sample this is regular seq2seq and we can compute an approximate gradient.
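
A hedged sketch of the sample-based estimate in one common form: when the logical forms are sampled from p(z|x) itself, the model-probability weighting is implicit in the sampling, so each sample's gradient is weighted by its reward. `sampled_log_probs` is assumed to come from the decoder's own sampling routine:

```python
import torch

def reinforce_loss(sampled_log_probs: torch.Tensor,
                   rewards: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is the REINFORCE estimate.

    sampled_log_probs: shape (N,), log p(z | x) for N sampled logical forms,
                       computed by the seq2seq decoder (keeps the graph).
    rewards:           shape (N,), R(z) for each sample (no gradient).
    """
    # grad of -(1/N) sum_i R(z_i) log p(z_i|x) = -(1/N) sum_i R(z_i) grad log p,
    # a Monte Carlo estimate of -grad E[R(z)].
    return -(rewards.detach() * sampled_log_probs).mean()

# Toy usage: 4 samples, two of them receive reward 1.
logp = torch.tensor([-0.9, -1.4, -2.0, -0.3], requires_grad=True)
R = torch.tensor([1.0, 0.0, 1.0, 0.0])
reinforce_loss(logp, R).backward()
print(logp.grad)
```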

Some differences: using MML with beam search is a biased estimator and has less exploration, since we only observe the approximate top-k logical forms. Using RL could be harder to train: if a correct logical form z* has low probability at the beginning of training, then its contribution to the gradient is very small and it is hard to bootstrap.

Intermediate summary: training a seq2seq model with weak supervision is problematic because the loss is not a differentiable function of the model's (discrete) output. We saw both MML and RL approaches for getting around that. In both, we find a set of logical forms, compute the gradient for them as in supervised learning, and weight the gradients in some way to form a final gradient. This lets us train with SGD. Often this is still hard to train; more ahead.

Variant 1: ε-greedy beam search. Do MML, but instead of plain beam search, at each time step t: compute all possible continuations; then for i = 1, ..., K, with probability ε randomly sample a continuation without replacement, otherwise take the top continuation. Motivation: improve exploration in beam search.
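
A sketch of one decoding step of this variant; `continuations_fn` and `scores_fn` are placeholder callables (all continuations of a partial sequence, and their model scores), not an existing API:

```python
import random

def epsilon_greedy_step(beam, continuations_fn, scores_fn, epsilon, k):
    """Pick k continuations: random with probability epsilon, else the best."""
    candidates = [c for z in beam for c in continuations_fn(z)]
    new_beam = []
    for _ in range(k):
        if not candidates:
            break
        if random.random() < epsilon:
            # Explore: sample a continuation uniformly, without replacement.
            choice = random.choice(candidates)
        else:
            # Exploit: take the highest-scoring remaining continuation.
            choice = max(candidates, key=scores_fn)
        candidates.remove(choice)
        new_beam.append(choice)
    return new_beam
```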

Variant 2: replay buffer / memory. A concept from DQN, a value-based RL method: the agent performs actions and observes rewards; those are stored in a huge memory buffer; training is done by sampling from the buffer. Motivation: make RL look like supervised learning by decorrelating samples.

Variant 2: replay buffer / memory. We can do something similar with a different motivation: for every example, keep a buffer with all programs that received positive reward; replace beam search by marginalizing over (or sampling from) the buffer, plus sampling a negative example. It can be shown that this reduces the gradient variance. It requires smaller beams → faster training.
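
A minimal sketch of such a per-example buffer of positive-reward programs; the names and the choice to store programs as hashable token tuples are assumptions:

```python
import random
from collections import defaultdict

class PositiveProgramBuffer:
    """Per-example memory of programs that obtained positive reward."""

    def __init__(self):
        self.buffer = defaultdict(set)   # example id -> set of programs

    def add(self, example_id, program, reward):
        # Programs are assumed hashable, e.g. tuples of tokens.
        if reward > 0:
            self.buffer[example_id].add(program)

    def sample(self, example_id):
        """Return one stored positive program for this example, if any."""
        programs = list(self.buffer[example_id])
        return random.choice(programs) if programs else None
```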

Variant 3: MML/RL hybrids. There is a relation between the MML and RL gradients: $\nabla L_{RL} = p_+(z)\, \nabla L_{MML}$, where $p_+(z)$ is the sum of probabilities of all positive-reward programs. Problem: at the beginning of training $p_+(z)$ is really small. Fix: replace $p_+(z)$ with $\alpha$ (= 0.1) whenever $p_+(z) < \alpha$. Effect: more exploitation at the beginning of training without too much bias.
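
A tiny sketch of the clipping rule above (illustrative, not the authors' exact code):

```python
def hybrid_weight(p_plus: float, alpha: float = 0.1) -> float:
    """Scale relating the gradients: grad_RL = p_plus * grad_MML, where p_plus
    is the total model probability of the positive-reward programs. Clipping
    it from below keeps the update from vanishing early in training."""
    return max(p_plus, alpha)

# Early in training p_plus is tiny, so the clipped value takes over:
print(hybrid_weight(0.003), hybrid_weight(0.4))  # 0.1 0.4
```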

Variant 4: continuous relaxations (Bengio et al., 2013; Jang et al., 2016): using the softmax directly; temperature; the straight-through estimator.

Softmax (Neelakantan et al., 2016): replace the argmax with a softmax at training time and pass as input the average of the embeddings; use the argmax at test time. The softmax can approximate the argmax (skewed distributions), but there is a train-test mismatch (you don't train on discrete logical forms). [Figure: a softmax over symbols, e.g. WeightOf (0.7), HeightOf (0.2); their embeddings are averaged and fed to the next step.]

Adding temperature: $p(a) = \frac{\exp(s(a)/t)}{\sum_{a'} \exp(s(a')/t)}$. Start with a high temperature: when t = 1 this is the ordinary softmax; when t is high it is close to uniform. Anneal the temperature towards 0 to get close to the argmax.
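
A short sketch of the temperature-controlled softmax and its limiting behaviour (the score values are illustrative):

```python
import torch
import torch.nn.functional as F

def temperature_softmax(scores: torch.Tensor, t: float) -> torch.Tensor:
    """p(a) = exp(s(a)/t) / sum_a' exp(s(a')/t)."""
    return F.softmax(scores / t, dim=-1)

scores = torch.tensor([2.0, 1.0, 0.1])
for t in (5.0, 1.0, 0.1):
    # High t -> near uniform; t = 1 -> ordinary softmax; t -> 0 -> near argmax.
    print(t, temperature_softmax(scores, t))
```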

Straight-through estimator (Bengio, 2013): use the argmax in the forward pass, and pretend that you had used the softmax in the backward pass.
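
A common way to write the straight-through trick (a sketch, not code from the slides): the forward value is the one-hot argmax, while gradients flow through the softmax:

```python
import torch
import torch.nn.functional as F

def straight_through_argmax(scores: torch.Tensor) -> torch.Tensor:
    """Forward: one-hot argmax. Backward: gradient of the softmax."""
    soft = F.softmax(scores, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), num_classes=scores.size(-1)).float()
    # The returned value equals `hard`; the detach routes gradients via `soft`.
    return hard - soft.detach() + soft

scores = torch.tensor([0.5, 2.0, -1.0], requires_grad=True)
out = straight_through_argmax(scores)          # one-hot in the forward pass
(out * torch.arange(3.0)).sum().backward()     # gradients flow via the softmax
print(out, scores.grad)
```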

Global loss functions. Gold: "I saw the big tree there". Pred 1: "I saw the the tree there". Pred 2: "I saw the tree over there". The first prediction is better than the second from a maximum likelihood point of view, even though it is arguably the worse sentence.

Exposure bias. Training time: the model observes correct tokens only ("I saw the big ..."). At test time we observe predicted tokens ("I saw the the ..."). After an error, our hidden state might be different from anything we have seen, and errors can accumulate: a train-test mismatch.

Solution: RL? We can just define a reward on the entire token sequence! Then we have the same setup as semantic parsing. But training with REINFORCE is much harder than training with ML → imitation learning.

Imitation learning

Imitation learning: in sequence to sequence, the expert is simply the correct sequence, so this is exactly maximum likelihood. Imitation learning algorithms provide a way to avoid exposure bias.

DAgger: for an example (x, y) and an expert π*: define a policy π = β π* + (1 − β) π_t; sample from π, and use (x, y) to define the loss; train; reduce β exponentially.
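
A hedged sketch of one per-step reading of this mixture rollout; `expert_action` and `model_action` are placeholder callables returning the next token given the prefix produced so far:

```python
import random

def mixed_rollout(expert_action, model_action, beta, target_length):
    """Roll out the mixture policy pi = beta * pi_expert + (1 - beta) * pi_model."""
    prefix = []
    for _ in range(target_length):
        if random.random() < beta:
            prefix.append(expert_action(prefix))   # follow the expert (gold y)
        else:
            prefix.append(model_action(prefix))    # follow the current model
    return prefix

# The loss is still defined against the gold sequence y for the example (x, y);
# beta is typically decayed exponentially across epochs, e.g. beta_e = 0.9 ** e.
```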

DAgger gets rid of exposure bias. Problems: in some cases, after you sample something wrong it is hard to define an expert/oracle at all!

Summary: using sequence to sequence models with delayed reward does not have a consensus solution yet. Solutions range from continuous relaxations, to REINFORCE-like algorithms, to mixing MML with RL in various ways, to many flavors of curriculum learning/annealing.

Adding structure

Structure: we have treated semantic parsing as a sequence to sequence problem, ignoring the syntax of the output language. We can reduce the search space by taking that syntax into account.

CFG formal language: if the formal language is a CFG (highly likely), we can use existing algorithms. One can efficiently decide, for any prefix, whether it can be extended to a string generated by the CFG, using an Earley parser (a top-down, left-to-right parser). Try all possible continuations and allow only the valid ones, at both training time (as a pre-process) and test time (online).
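
A hedged sketch of masking invalid continuations during decoding; `is_valid_continuation` stands in for an Earley-style prefix check and is not a real library call:

```python
import torch

def constrained_log_probs(scores, prefix, is_valid_continuation):
    """Set to -inf the continuations the grammar cannot generate.

    scores: shape (vocab_size,), decoder scores at the current step.
    prefix: list of symbols generated so far.
    """
    mask = torch.tensor([is_valid_continuation(prefix + [a])
                         for a in range(scores.size(0))])
    masked = scores.masked_fill(~mask, float("-inf"))
    return torch.log_softmax(masked, dim=-1)
```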

Design the language so it's trivial.

Design the language so it's trivial. This is a top-down parser.

Arbitrary shift-reduce parsers: for any structure of a logical form, if we can define a shift-reduce parser that outputs that structure, then we can map the input language to a sequence of discrete parser actions and use sequence to sequence models.

Graph-based parsing: I am unaware of graph-based neural approaches for semantic parsing. There is a bit of work on that in syntactic parsing (Lee et al., EMNLP 2016 best paper).