What's so Hard about Natural Language Understanding?


What's so Hard about Natural Language Understanding? Alan Ritter, Computer Science and Engineering, The Ohio State University. Collaborators: Jiwei Li, Dan Jurafsky (Stanford); Bill Dolan, Michel Galley, Jianfeng Gao (MSR); Colin Cherry (Google); Jeniya Tabassum, Alexander Konovalov, Wei Xu (Ohio State); Brendan O'Connor (UMass)

Q: Why are we so good at Speech and MT (but bad at NLU)? People naturally translate and transcribe.
Q: Large, end-to-end datasets for NLU? Web-scale conversations? Web-scale structured data?

Data-Driven Conversation. Twitter: ~500 million public SMS-style conversations per month. Goal: learn conversational agents directly from massive volumes of data.

[Ritter, Cherry, Dolan EMNLP 2011] Noisy Channel Model. Input: Who wants to come over for dinner tomorrow? Output: Yum! I want to be there tomorrow!
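The noisy-channel framing scores a candidate reply r for an input i by combining a reverse translation model p(i | r) with a response language model p(r). A minimal sketch of that ranking idea (the two scoring functions below are crude stand-ins assumed for illustration, not the phrase-based translation and language models used in the paper):

```python
# Hedged sketch of noisy-channel response ranking: pick the reply r that
# maximizes log p(input | r) + log p(r). The scorers are toy stand-ins.
import math

def reverse_translation_logprob(input_text, response):
    """Stand-in for log p(input | response); a real system would use an SMT model."""
    shared = set(input_text.lower().split()) & set(response.lower().split())
    return math.log(1e-6 + len(shared))            # crude lexical-overlap proxy

def language_model_logprob(response):
    """Stand-in for log p(response); a real system would use an n-gram or neural LM."""
    return -0.1 * len(response.split())            # crude length penalty as a proxy

def rank_responses(input_text, candidates, lm_weight=1.0):
    scored = [(reverse_translation_logprob(input_text, r)
               + lm_weight * language_model_logprob(r), r) for r in candidates]
    return sorted(scored, reverse=True)

candidates = ["Yum! I want to be there tomorrow!", "No idea.", "I love cats."]
print(rank_responses("Who wants to come over for dinner tomorrow?", candidates))
```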

Neural Conversation [Sordoni et al. 2015] [Xu et al. 2016] [Wen et al. 2016] [Li et al. 2016] [Kannan et al. 2016] [Serban et al. 2016]

Example of a simulated conversation (Slide Credit: Jiwei Li):
How old are you?
i 'm 16. (Bad Action)
16?
i don 't know what you 're talking about (Outcome)
you don 't know what you 're saying
i don 't know what you 're talking about

Deep Reinforcement Learning [Li, Monroe, Ritter, Galley, Gao, Jurafsky EMNLP 2016]
[Figure: a seq2seq encoder-decoder. The encoder reads the input "how old are you? EOS" to produce the state; the decoded reply "i 'm 16. EOS" is the action.]

Learning: Policy Gradient, via the REINFORCE algorithm (Williams, 1992).
[Figure: the same encoder-decoder; the decoder's policy for generating the response is what we want to learn.]
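To make the policy-gradient step concrete, here is a minimal REINFORCE sketch for a toy response generator (the tiny GRU policy, the vocabulary size, and the placeholder reward are assumptions for illustration, not the paper's implementation): sample a reply token by token, then weight the gradient of its log-probability by the reward.

```python
# Minimal REINFORCE sketch for a toy response generator (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 12, 16
policy = nn.GRUCell(vocab_size, hidden)            # toy decoder cell
readout = nn.Linear(hidden, vocab_size)            # logits over the vocabulary
optimizer = torch.optim.Adam(list(policy.parameters()) + list(readout.parameters()), lr=1e-3)

def sample_response(state, max_len=5):
    """Sample a reply token by token; return tokens and the summed log-probability."""
    log_probs, tokens = [], []
    inp = torch.zeros(1, vocab_size)                # start-of-sequence input
    for _ in range(max_len):
        state = policy(inp, state)
        dist = torch.distributions.Categorical(logits=readout(state))
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens.append(tok.item())
        inp = F.one_hot(tok, vocab_size).float()
    return tokens, torch.stack(log_probs).sum()

state = torch.zeros(1, hidden)                      # stands in for the encoding of "how old are you?"
tokens, log_prob = sample_response(state)
reward = 1.0                                        # placeholder reward (e.g. a discriminator score)
loss = -reward * log_prob                           # REINFORCE: reward-weighted log-likelihood of the sample
optimizer.zero_grad(); loss.backward(); optimizer.step()
```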

Q: Rewards? A: Turing Test Adversarial Learning (Goodfellow et al., 2014)

Adversarial Learning for Neural Dialogue [Li, Monroe, Shi, Jean, Ritter, Jurafsky EMNLP 2016] (alternate between training the generator and the discriminator).
[Figure: the response generator produces a reply; the discriminator receives either a sampled human response from real-world conversations or the generated response and predicts real or fake. The generator is trained with the REINFORCE algorithm (Williams, 1992).]
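A hedged sketch of that alternation (the toy stand-in models, the randomly generated "encoded responses", and the constant baseline are assumptions, purely to show the structure, not the paper's code): the discriminator is updated with a real-vs-fake classification loss, and its probability that a generated reply is human then serves as the REINFORCE reward for the generator.

```python
# Toy sketch of the generator/discriminator alternation for dialogue.
import torch
import torch.nn as nn

emb_dim = 16
baseline = 0.5                                     # constant reward baseline (assumption)
discriminator = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(3):
    # Discriminator step: distinguish encoded human responses from generated ones.
    human = torch.randn(8, emb_dim)                # placeholder for encoded human responses
    fake = torch.randn(8, emb_dim)                 # placeholder for encoded generated responses
    d_loss = bce(discriminator(human), torch.ones(8, 1)) + \
             bce(discriminator(fake), torch.zeros(8, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: the discriminator's "probability human" becomes the reward.
    with torch.no_grad():
        reward = torch.sigmoid(discriminator(fake)).mean().item()
    advantage = reward - baseline
    # generator_loss = -advantage * log_prob_of_sampled_response
    # (the generator update itself follows the REINFORCE sketch shown earlier)
    print(f"step {step}: d_loss={d_loss.item():.3f}, reward={reward:.3f}, advantage={advantage:+.3f}")
```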

Adversarial Learning Improves Response Generation (vs. vanilla generation model).
Human Evaluator: Adversarial Win 62%, Adversarial Lose 18%, Tie 20%.
Machine Evaluator [Bowman et al. 2016], Adversarial Success (how often can you fool a machine): Adversarial Learning 8.0%, Standard Seq2Seq model 4.9%.
Slide Credit: Jiwei Li

Q: Why are we so good at Speech and MT (but bad at NLU)? People naturally translate and transcribe.
Q: Large, end-to-end datasets for NLU?
Web-scale conversations? Generates fluent open-domain replies, but is it really natural language understanding?
Web-scale structured data?

Learning from Distant Supervision [Mintz et al. 2009]
1) Named Entity Recognition. Challenge: highly ambiguous labels [Ritter et al. EMNLP 2011]
2) Relation Extraction. Challenge: missing data [Ritter et al. TACL 2013]
3) Time Normalization. Challenge: diversity in noisy text [Tabassum, Ritter, Xu EMNLP 2016]
4) Event Extraction. Challenge: lack of negative examples [Ritter et al. WWW 2015] [Konovalov et al. WWW 2017]
Objective: $O(\theta) = \underbrace{\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)}_{\text{Log Likelihood}} \;-\; \underbrace{\lambda_U \, D\big(\bar{p}_\theta \,\|\, \hat{p}_{\text{unlabeled}}\big)}_{\text{Label regularization}}$
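A small numeric sketch of this objective (the toy numbers and the exact direction of the divergence term follow one common convention and are assumptions, not the paper's code): sum the log-likelihood of the distantly labeled examples, then subtract a weighted divergence between the model's average label distribution on unlabeled data and an expected label prior.

```python
# Toy computation of the distant-supervision objective with label regularization.
import numpy as np

def objective(log_probs_labeled, probs_unlabeled, p_hat, lam=1.0):
    """log_probs_labeled: log p(y_i | x_i) for the distantly labeled examples.
    probs_unlabeled: model label distributions on unlabeled examples, shape (M, K).
    p_hat: expected label distribution over the K labels."""
    log_likelihood = np.sum(log_probs_labeled)
    p_bar = probs_unlabeled.mean(axis=0)           # model's average predicted label distribution
    kl = np.sum(p_bar * np.log(p_bar / p_hat))     # divergence D(p_bar || p_hat)
    return log_likelihood - lam * kl

log_probs = np.log([0.9, 0.7, 0.8])                            # toy labeled-example probabilities
unlabeled = np.array([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])     # toy predictions on unlabeled tweets
print(objective(log_probs, unlabeled, p_hat=np.array([0.3, 0.7])))
```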

Time Normalization with Distant Supervision (no human labels or rules!) [Tabassum, Ritter, Xu EMNLP 2016]
[Figure: state-of-the-art time resolvers (TempEx, HeidelTime, SUTime, UWTime) map a time expression to a normalized date such as 1 Jan 2016.]

Distant Supervision Assumption. Example event: Mercury Transit, May 9, 2016.
[Figure: tweets mentioning the event, posted on 8 May, 9 May, and 10 May, shown relative to the known event date.]
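A hedged sketch of what that assumption looks like as a labeling rule (illustrative only, not the paper's pipeline): a tweet that mentions a database event inherits sentence-level tags derived from the known event date, and a past/present/future tag derived from the tweet's timestamp.

```python
# Heuristic distant labels for a tweet that mentions a known event (sketch).
from datetime import date

def distant_labels(event_date: date, tweet_date: date):
    """Sentence-level tags implied by the event database entry."""
    if tweet_date < event_date:
        tense = "Future"        # the tweet talks about an upcoming event
    elif tweet_date == event_date:
        tense = "Present"
    else:
        tense = "Past"
    return {
        "day_of_week": event_date.strftime("%a"),   # Mon..Sun
        "day_of_month": event_date.day,             # 1..31
        "month": event_date.strftime("%b"),         # Jan..Dec
        "tense": tense,                             # Past / Present / Future
    }

# A tweet about the Mercury transit posted the day before the event:
print(distant_labels(date(2016, 5, 9), date(2016, 5, 8)))
```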

Multiple Instance Learning Tagger.
[Figure: the event database entry (Mercury, 5/9/2016) supervises sentence-level tags t_1..t_4 (day of week Mon..Sun, day of month 1..31, month 1..12, and Past/Present/Future). Each word w_i receives a word-level tag z_i from a local classifier $\exp(\theta^\top f(w_i, z_i))$, and the word-level tags are linked to the sentence-level tags through a deterministic OR [Hoffmann et al. 2011]. Training maximizes the conditional likelihood $\sum_z P(z, t \mid w, \theta)$.]
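A tiny sketch of the deterministic-OR link (toy tag inventory and word-level assignments, not the paper's inference code): a sentence-level tag is on exactly when at least one word in the tweet carries that tag.

```python
# Deterministic-OR aggregation from word-level tags to sentence-level tags (sketch).
TAGS = ["Future", "May", "9", "Mon"]                # toy sentence-level tag inventory

def deterministic_or(word_level_tags):
    """Aggregate word-level tags z_1..z_n into sentence-level indicators t."""
    return {tag: any(z == tag for z in word_level_tags) for tag in TAGS}

# Word-level assignments for "Im Hella excited for tomorrow" (toy values):
z = ["NA", "NA", "Future", "NA", "Future"]
print(deterministic_or(z))    # {'Future': True, 'May': False, '9': False, 'Mon': False}
```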

Missing Data Problem. Sentence-level tags: TL = Future, MOY = May, DOM = 9, DOW = Mon.

Missing Data Extension. Missing Data Problem in Distant Supervision [Ritter et al. TACL 2013].
[Figure: the model separates the tags actually mentioned in the text (t'_1..t'_4) from the tags implied by the event date (m_1..m_4), and encourages, rather than forces, agreement between the two.]
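One way to write this idea down as a schematic scoring function (a sketch; the exact form, including the agreement potential $\psi$, is an assumption, not necessarily the paper's parameterization):

$$P(z, t', m \mid w) \;\propto\; \prod_i \exp\!\big(\theta^\top f(w_i, z_i)\big)\; \prod_j \mathbb{1}\!\Big[t'_j = \bigvee_i \mathbb{1}[z_i = j]\Big]\; \prod_j \psi(t'_j, m_j)$$

Here $t'_j$ indicates whether tag $j$ is mentioned in the text, $m_j$ whether it is implied by the event date, and $\psi$ rewards but does not force agreement between the two, so database facts that are never expressed in a tweet no longer generate false-positive word-level labels.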

Example Tags
Word: Im     Hella   excited   for   tomorrow
Tag:  NA     NA      Future    NA    Future
Word: Thnks  for   a    Christmas   party   on   fri
Tag:  NA     NA    NA   December    NA      NA   Friday

Evaluation: 17% increase in F-score over SUTime.

Where can we find NLU? Follow the data!
Opportunistically gathered data: Twitter events (time normalization); billions of Internet conversations.
Design models for the data (rather than the other way around).
Thank You!