What's so Hard about Natural Language Understanding?


What's so Hard about Natural Language Understanding? Alan Ritter, Computer Science and Engineering, The Ohio State University. Collaborators: Jiwei Li, Dan Jurafsky (Stanford); Bill Dolan, Michel Galley, Jianfeng Gao (MSR); Colin Cherry (Google); Jeniya Tabassum, Alexander Konovalov, Wei Xu (Ohio State); Brendan O'Connor (UMass)

Q: Why are we so good at Speech and MT (but bad at NLU)? People naturally translate and transcribe.
Q: Large, end-to-end datasets for NLU? Web-scale conversations? Web-scale structured data?

Data-Driven Conversation. Twitter: ~500 million public SMS-style conversations per month. Goal: learn conversational agents directly from massive volumes of data.

[Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model. Input: Who wants to come over for dinner tomorrow? Output: Yum! I want to be there tomorrow!
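The noisy-channel framing scores a candidate reply r for an input i by combining a reverse translation model p(i | r) with a response language model p(r). A minimal sketch of that ranking idea (the two scoring functions below are crude stand-ins assumed for illustration, not the phrase-based translation and language models used in the paper):

```python
# Hedged sketch of noisy-channel response ranking: pick the reply r that
# maximizes log p(input | r) + log p(r). The scorers are toy stand-ins.
import math

def reverse_translation_logprob(input_text, response):
    """Stand-in for log p(input | response); a real system would use an SMT model."""
    shared = set(input_text.lower().split()) & set(response.lower().split())
    return math.log(1e-6 + len(shared))            # crude lexical-overlap proxy

def language_model_logprob(response):
    """Stand-in for log p(response); a real system would use an n-gram or neural LM."""
    return -0.1 * len(response.split())            # crude length penalty as a proxy

def rank_responses(input_text, candidates, lm_weight=1.0):
    scored = [(reverse_translation_logprob(input_text, r)
               + lm_weight * language_model_logprob(r), r) for r in candidates]
    return sorted(scored, reverse=True)

candidates = ["Yum! I want to be there tomorrow!", "No idea.", "I love cats."]
print(rank_responses("Who wants to come over for dinner tomorrow?", candidates))
```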

Neural Conversation [Sordoni et al. 2015] [Xu et al. 2016] [Wen et al. 2016] [Li et al. 2016] [Kannan et al. 2016] [Serban et al. 2016]

Example of a simulated conversation (Slide Credit: Jiwei Li):
How old are you?
i 'm 16. (Bad Action)
16?
i don 't know what you 're talking about (Outcome)
you don 't know what you 're saying
i don 't know what you 're talking about

Deep Reinforcement Learning [Li, Monroe, Ritter, Galley, Gao, Jurafsky EMNLP 2016]
[Figure: a seq2seq encoder-decoder. The encoder reads the input "how old are you? EOS" to produce the state; the decoded reply "i 'm 16. EOS" is the action.]

Learning: Policy Gradient, via the REINFORCE algorithm (Williams, 1992).
[Figure: the same encoder-decoder; the decoder's policy for generating the response is what we want to learn.]
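To make the policy-gradient step concrete, here is a minimal REINFORCE sketch for a toy response generator (the tiny GRU policy, the vocabulary size, and the placeholder reward are assumptions for illustration, not the paper's implementation): sample a reply token by token, then weight the gradient of its log-probability by the reward.

```python
# Minimal REINFORCE sketch for a toy response generator (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 12, 16
policy = nn.GRUCell(vocab_size, hidden)            # toy decoder cell
readout = nn.Linear(hidden, vocab_size)            # logits over the vocabulary
optimizer = torch.optim.Adam(list(policy.parameters()) + list(readout.parameters()), lr=1e-3)

def sample_response(state, max_len=5):
    """Sample a reply token by token; return tokens and the summed log-probability."""
    log_probs, tokens = [], []
    inp = torch.zeros(1, vocab_size)                # start-of-sequence input
    for _ in range(max_len):
        state = policy(inp, state)
        dist = torch.distributions.Categorical(logits=readout(state))
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens.append(tok.item())
        inp = F.one_hot(tok, vocab_size).float()
    return tokens, torch.stack(log_probs).sum()

state = torch.zeros(1, hidden)                      # stands in for the encoding of "how old are you?"
tokens, log_prob = sample_response(state)
reward = 1.0                                        # placeholder reward (e.g. a discriminator score)
loss = -reward * log_prob                           # REINFORCE: reward-weighted log-likelihood of the sample
optimizer.zero_grad(); loss.backward(); optimizer.step()
```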

Q: Rewards? A: Turing Test Adversarial Learning (Goodfellow et al., 2014)

Adversarial Learning for Neural Dialogue [Li, Monroe, Shi, Jean, Ritter, Jurafsky EMNLP 2016] (alternate between training the generator and the discriminator).
[Figure: the response generator produces a reply; the discriminator receives either a sampled human response from real-world conversations or the generated response and predicts real or fake. The generator is trained with the REINFORCE algorithm (Williams, 1992).]
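A hedged sketch of that alternation (the toy stand-in models, the randomly generated "encoded responses", and the constant baseline are assumptions, purely to show the structure, not the paper's code): the discriminator is updated with a real-vs-fake classification loss, and its probability that a generated reply is human then serves as the REINFORCE reward for the generator.

```python
# Toy sketch of the generator/discriminator alternation for dialogue.
import torch
import torch.nn as nn

emb_dim = 16
baseline = 0.5                                     # constant reward baseline (assumption)
discriminator = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(3):
    # Discriminator step: distinguish encoded human responses from generated ones.
    human = torch.randn(8, emb_dim)                # placeholder for encoded human responses
    fake = torch.randn(8, emb_dim)                 # placeholder for encoded generated responses
    d_loss = bce(discriminator(human), torch.ones(8, 1)) + \
             bce(discriminator(fake), torch.zeros(8, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: the discriminator's "probability human" becomes the reward.
    with torch.no_grad():
        reward = torch.sigmoid(discriminator(fake)).mean().item()
    advantage = reward - baseline
    # generator_loss = -advantage * log_prob_of_sampled_response
    # (the generator update itself follows the REINFORCE sketch shown earlier)
    print(f"step {step}: d_loss={d_loss.item():.3f}, reward={reward:.3f}, advantage={advantage:+.3f}")
```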

Adversarial Learning Improves Response Generation (vs. vanilla generation model).
Human Evaluator: Adversarial Win 62%, Adversarial Lose 18%, Tie 20%.
Machine Evaluator [Bowman et al. 2016], Adversarial Success (how often can you fool a machine): Adversarial Learning 8.0%, Standard Seq2Seq model 4.9%.
Slide Credit: Jiwei Li

Q: Why are we so good at Speech and MT (but bad at NLU)? People naturally translate and transcribe.
Q: Large, end-to-end datasets for NLU?
Web-scale conversations? Generates fluent open-domain replies, but is it really natural language understanding?
Web-scale structured data?

Learning from Distant Supervision [Mintz et al. 2009]
1) Named Entity Recognition. Challenge: highly ambiguous labels [Ritter et al. EMNLP 2011]
2) Relation Extraction. Challenge: missing data [Ritter et al. TACL 2013]
3) Time Normalization. Challenge: diversity in noisy text [Tabassum, Ritter, Xu EMNLP 2016]
4) Event Extraction. Challenge: lack of negative examples [Ritter et al. WWW 2015] [Konovalov et al. WWW 2017]
Objective: $O(\theta) = \underbrace{\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)}_{\text{Log Likelihood}} \;-\; \underbrace{\lambda_U \, D\big(\bar{p}_\theta \,\|\, \hat{p}_{\text{unlabeled}}\big)}_{\text{Label regularization}}$
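A small numeric sketch of this objective (the toy numbers and the exact direction of the divergence term follow one common convention and are assumptions, not the paper's code): sum the log-likelihood of the distantly labeled examples, then subtract a weighted divergence between the model's average label distribution on unlabeled data and an expected label prior.

```python
# Toy computation of the distant-supervision objective with label regularization.
import numpy as np

def objective(log_probs_labeled, probs_unlabeled, p_hat, lam=1.0):
    """log_probs_labeled: log p(y_i | x_i) for the distantly labeled examples.
    probs_unlabeled: model label distributions on unlabeled examples, shape (M, K).
    p_hat: expected label distribution over the K labels."""
    log_likelihood = np.sum(log_probs_labeled)
    p_bar = probs_unlabeled.mean(axis=0)           # model's average predicted label distribution
    kl = np.sum(p_bar * np.log(p_bar / p_hat))     # divergence D(p_bar || p_hat)
    return log_likelihood - lam * kl

log_probs = np.log([0.9, 0.7, 0.8])                            # toy labeled-example probabilities
unlabeled = np.array([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])     # toy predictions on unlabeled tweets
print(objective(log_probs, unlabeled, p_hat=np.array([0.3, 0.7])))
```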

Time Normalization with Distant Supervision (no human labels or rules!) [Tabassum, Ritter, Xu EMNLP 2016]
[Figure: state-of-the-art time resolvers (TempEx, HeidelTime, SUTime, UWTime) map a time expression to a normalized date such as 1 Jan 2016.]

Distant Supervision Assumption. Example event: Mercury Transit, May 9, 2016.
[Figure: tweets mentioning the event, posted on 8 May, 9 May, and 10 May, shown relative to the known event date.]
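A hedged sketch of what that assumption looks like as a labeling rule (illustrative only, not the paper's pipeline): a tweet that mentions a database event inherits sentence-level tags derived from the known event date, and a past/present/future tag derived from the tweet's timestamp.

```python
# Heuristic distant labels for a tweet that mentions a known event (sketch).
from datetime import date

def distant_labels(event_date: date, tweet_date: date):
    """Sentence-level tags implied by the event database entry."""
    if tweet_date < event_date:
        tense = "Future"        # the tweet talks about an upcoming event
    elif tweet_date == event_date:
        tense = "Present"
    else:
        tense = "Past"
    return {
        "day_of_week": event_date.strftime("%a"),   # Mon..Sun
        "day_of_month": event_date.day,             # 1..31
        "month": event_date.strftime("%b"),         # Jan..Dec
        "tense": tense,                             # Past / Present / Future
    }

# A tweet about the Mercury transit posted the day before the event:
print(distant_labels(date(2016, 5, 9), date(2016, 5, 8)))
```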

Multiple Instance Learning Tagger.
[Figure: the event database entry (Mercury, 5/9/2016) supervises sentence-level tags t_1..t_4 (day of week Mon..Sun, day of month 1..31, month 1..12, and Past/Present/Future). Each word w_i receives a word-level tag z_i from a local classifier $\exp(\theta^\top f(w_i, z_i))$, and the word-level tags are linked to the sentence-level tags through a deterministic OR [Hoffmann et al. 2011]. Training maximizes the conditional likelihood $\sum_z P(z, t \mid w, \theta)$.]
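A tiny sketch of the deterministic-OR link (toy tag inventory and word-level assignments, not the paper's inference code): a sentence-level tag is on exactly when at least one word in the tweet carries that tag.

```python
# Deterministic-OR aggregation from word-level tags to sentence-level tags (sketch).
TAGS = ["Future", "May", "9", "Mon"]                # toy sentence-level tag inventory

def deterministic_or(word_level_tags):
    """Aggregate word-level tags z_1..z_n into sentence-level indicators t."""
    return {tag: any(z == tag for z in word_level_tags) for tag in TAGS}

# Word-level assignments for "Im Hella excited for tomorrow" (toy values):
z = ["NA", "NA", "Future", "NA", "Future"]
print(deterministic_or(z))    # {'Future': True, 'May': False, '9': False, 'Mon': False}
```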

Missing Data Problem. Sentence-level tags: TL = Future, MOY = May, DOM = 9, DOW = Mon.

Missing Data Extension. Missing Data Problem in Distant Supervision [Ritter et al. TACL 2013].
[Figure: the model separates the tags actually mentioned in the text (t'_1..t'_4) from the tags implied by the event date (m_1..m_4), and encourages, rather than forces, agreement between the two.]
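One way to write this idea down as a schematic scoring function (a sketch; the exact form, including the agreement potential $\psi$, is an assumption, not necessarily the paper's parameterization):

$$P(z, t', m \mid w) \;\propto\; \prod_i \exp\!\big(\theta^\top f(w_i, z_i)\big)\; \prod_j \mathbb{1}\!\Big[t'_j = \bigvee_i \mathbb{1}[z_i = j]\Big]\; \prod_j \psi(t'_j, m_j)$$

Here $t'_j$ indicates whether tag $j$ is mentioned in the text, $m_j$ whether it is implied by the event date, and $\psi$ rewards but does not force agreement between the two, so database facts that are never expressed in a tweet no longer generate false-positive word-level labels.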

Example Tags
Word: Im     Hella   excited   for   tomorrow
Tag:  NA     NA      Future    NA    Future
Word: Thnks  for   a    Christmas   party   on   fri
Tag:  NA     NA    NA   December    NA      NA   Friday

Evaluation: 17% increase in F-score over SUTime.

Where can we find NLU? Follow the data!
Opportunistically gathered data: Twitter events (time normalization); billions of Internet conversations.
Design models for the data (rather than the other way around).
Thank You!