Conditional Language Modeling. Chris Dyer


Unconditional LMs. A language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$. It is convenient to decompose this probability using the chain rule, as follows:
$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times \cdots \times p(w_\ell \mid w_1, \ldots, w_{\ell-1}) = \prod_{t=1}^{\ell} p(w_t \mid w_1, \ldots, w_{t-1})$$
This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.

Evaluating unconditional LMs. How good is our unconditional language model?
1. Held-out per-word cross entropy or perplexity:
$$H = -\frac{1}{|w|} \sum_{i=1}^{|w|} \log_2 p(w_i \mid w_{<i}) \quad \text{(units: bits per word)}, \qquad \text{ppl} = 2^{H} \quad \text{(units: uncertainty per word)}$$
This is the same as the training criterion: how uncertain is the model at each time position, on average?
2. Task-based evaluation: use the language model in a model for some task, in place of some other language model. Does task performance improve?
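To make the two formulas concrete, here is a minimal sketch (not from the slides) that computes held-out cross entropy and perplexity from a list of per-token model probabilities; the probability values are invented for illustration.

```python
import math

def cross_entropy_and_perplexity(token_probs):
    """Per-word cross entropy (bits/word) and perplexity from p(w_i | w_<i) values."""
    H = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return H, 2 ** H

# Hypothetical per-token probabilities assigned by some model to a held-out sentence.
probs = [0.2, 0.05, 0.1, 0.5]
H, ppl = cross_entropy_and_perplexity(probs)
print(f"cross entropy = {H:.2f} bits/word, perplexity = {ppl:.2f}")
```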

History-based LMs. A common strategy is to make a Markov assumption, which is a conditional independence assumption:
$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times p(w_4 \mid w_1, w_2, w_3) \times \cdots$$
Markov: forget the distant past; e.g. under a second-order Markov assumption, $p(w_4 \mid w_1, w_2, w_3) \approx p(w_4 \mid w_2, w_3)$. Is this valid for language? No. Is it practical? Often! Why RNNs are great for language: no more Markov assumptions!
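As a concrete illustration of the Markov assumption (a sketch, not part of the slides), a second-order (trigram) model estimates each conditional from counts; the toy corpus below is invented.

```python
from collections import Counter

corpus = "<s> <s> tom likes beer </s>".split()  # toy corpus, padded for trigram contexts

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w, u, v):
    """MLE estimate of p(w | u, v) under a second-order Markov assumption."""
    return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

print(p_trigram("beer", "tom", "likes"))  # 1.0 on this tiny corpus
```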

History-based LMs with RNNs. [Figure: an RNN unrolled over an observed context $w_1, w_2, w_3, w_4$. Each observed context word is mapped to a vector (word embedding); the RNN hidden states $h_1, \ldots, h_4$ (random-variable vectors, starting from $h_0$) summarize the history, and the output at the last position is a vector of length $|\text{vocab}|$ representing $p(w_5 \mid w_1, w_2, w_3, w_4)$.]

Bridle (1990), Probabilistic interpretation of feedforward classification. Distributions over words: each dimension corresponds to a word in a closed vocabulary, V. [Figure: a bar chart of probabilities over vocabulary items such as the, a, and, cat, dog, horse, runs, says, walked, walks, walking, pig, Lisbon, sardines.]
$$u = Wh + b, \qquad p_i = \frac{\exp u_i}{\sum_j \exp u_j}$$
The $p_i$'s form a distribution, i.e. $p_i > 0\ \forall i$ and $\sum_i p_i = 1$.

Distributions over words. [Figure: the same bar chart over the vocabulary, computed from a hidden state h that summarizes the history.]
$$u = Wh + b, \qquad p_i = \frac{\exp u_i}{\sum_j \exp u_j}$$
Histories are sequences of words:
$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times p(w_4 \mid w_1, w_2, w_3) \times \cdots$$
With $h \in \mathbb{R}^d$ and $|V| = 100{,}000$: what are the dimensions of b? What are the dimensions of W?
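A minimal numpy sketch of the softmax output layer above, with the shapes spelled out; the sizes of d and |V| are chosen small purely for illustration, and the shape comments answer the slide's questions about W and b.

```python
import numpy as np

d, V = 4, 10                   # hidden size and vocabulary size (tiny, for illustration)
h = np.random.randn(d)         # hidden state summarizing the history, shape (d,)
W = np.random.randn(V, d)      # output projection, shape (|V|, d)
b = np.random.randn(V)         # output bias, shape (|V|,)

u = W @ h + b                                        # logits, shape (|V|,)
p = np.exp(u - u.max()) / np.exp(u - u.max()).sum()  # softmax (shifted for stability)

assert p.shape == (V,) and np.isclose(p.sum(), 1.0) and (p > 0).all()
```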

RNN language models. [Figure, built up step by step: an RNN unrolled over the sentence "⟨s⟩ tom likes beer ⟨/s⟩". Starting from $h_0$ and the embedding $x_1$ of ⟨s⟩, each hidden state $h_t$ produces a distribution $\hat{p}_t$ over the next word, giving $p(\text{tom} \mid \langle s \rangle)$, $p(\text{likes} \mid \langle s \rangle, \text{tom})$, $p(\text{beer} \mid \langle s \rangle, \text{tom}, \text{likes})$, and $p(\langle/s\rangle \mid \langle s \rangle, \text{tom}, \text{likes}, \text{beer})$.]
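A sketch (under assumed, randomly initialized parameters) of the unrolled computation in the figure: a simple Elman-style RNN reads "⟨s⟩ tom likes beer" and produces a next-word distribution at every step. The tiny vocabulary and parameter shapes are invented for illustration, not taken from the lecture.

```python
import numpy as np

vocab = ["<s>", "tom", "likes", "beer", "</s>"]
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8

rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))       # word embeddings x_t
Wh = rng.normal(size=(d, 2 * d))  # recurrent weights applied to [h_{t-1}; x_t]
bh = np.zeros(d)
Wo = rng.normal(size=(V, d))      # output projection
bo = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

h = np.zeros(d)                   # h_0
history = ["<s>"]
for w in ["tom", "likes", "beer", "</s>"]:
    x = E[idx[history[-1]]]                        # embedding of the previous word
    h = np.tanh(Wh @ np.concatenate([h, x]) + bh)  # h_t = g(W[h_{t-1}; x_t] + b)
    p = softmax(Wo @ h + bo)                       # \hat{p}_t over the vocabulary
    print(f"p({w} | {' '.join(history)}) = {p[idx[w]]:.3f}")
    history.append(w)
```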

Training RNN language models. [Figure: the same unrolled RNN over "⟨s⟩ tom likes beer ⟨/s⟩"; at each position the predicted distribution $\hat{p}_t$ is scored against the observed next word, giving per-step losses $\text{cost}_1, \ldots, \text{cost}_4$ (log loss / cross entropy), which are summed into the training objective $F$.]

Training RNN language models. Minimizing the cross-entropy objective is equivalent to the maximum likelihood (MLE) objective: find the parameters that make the training data most likely. You will overfit. Remedies: 1. Stop training early, based on a validation set. 2. Weight decay / other regularizers. 3. Dropout during training. In contrast to count-based models, zeroes aren't a problem.
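A hedged PyTorch sketch of this training setup: an RNN language model trained with token-level cross entropy, using dropout, weight decay, and validation-based early stopping as the three remedies listed above. The model sizes, random stand-in batches, and stopping rule are invented for illustration, not the lecture's actual configuration.

```python
import torch
import torch.nn as nn

V, d = 10_000, 256  # assumed vocabulary and hidden sizes

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, d)
        self.rnn = nn.RNN(d, d, batch_first=True)
        self.drop = nn.Dropout(0.3)          # remedy 3: dropout
        self.out = nn.Linear(d, V)

    def forward(self, tokens):               # tokens: (batch, time)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(self.drop(h))        # logits: (batch, time, V)

model = RNNLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # remedy 2
loss_fn = nn.CrossEntropyLoss()              # log loss == MLE objective

def step(batch):                             # batch: (batch, time+1) token ids
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    return loss_fn(logits.reshape(-1, V), targets.reshape(-1))

best_val, patience = float("inf"), 0
for epoch in range(100):
    model.train()
    loss = step(torch.randint(0, V, (32, 21)))           # stand-in training batch
    opt.zero_grad(); loss.backward(); opt.step()

    model.eval()
    with torch.no_grad():
        val = step(torch.randint(0, V, (32, 21))).item()  # stand-in validation batch
    if val < best_val - 1e-3:
        best_val, patience = val, 0
    else:
        patience += 1
        if patience >= 5:                    # remedy 1: early stopping
            break
```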

RNN language models. Unlike Markov (n-gram) models, RNNs never forget. However, they don't always remember so well (recall Felix's lectures on RNNs vs. LSTMs). Algorithms: sample a sequence from the probability distribution defined by the RNN; train the RNN to minimize cross entropy (aka MLE). What about: what is the most probable sequence?
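The first algorithm, sampling a sequence from the distribution defined by the RNN (ancestral sampling), is sketched below with randomly initialized placeholder parameters and a toy vocabulary; it is illustrative code, not from the lecture.

```python
import numpy as np

vocab = ["<s>", "tom", "likes", "beer", "</s>"]
V, d = len(vocab), 8
rng = np.random.default_rng(1)
E, Wh, Wo = rng.normal(size=(V, d)), rng.normal(size=(d, 2 * d)), rng.normal(size=(V, d))

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def sample_sequence(max_len=20):
    """Ancestral sampling: draw w_t ~ p(. | w_<t) until </s> (or max_len is hit)."""
    h, prev, out = np.zeros(d), 0, []                    # start from h_0 and <s>
    for _ in range(max_len):
        h = np.tanh(Wh @ np.concatenate([h, E[prev]]))   # h_t = g(W[h_{t-1}; x_t])
        p = softmax(Wo @ h)                              # distribution over next word
        prev = rng.choice(V, p=p)                        # sample w_t ~ \hat{p}_t
        if vocab[prev] == "</s>":
            break
        out.append(vocab[prev])
    return out

print(sample_sequence())
```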

How well do RNN LMs do?

Model | Perplexity | Word Error Rate (WER)
order=5 Markov, Kneser-Ney freq. est. | 221 | 13.5
RNN, 400 hidden | 171 | 12.5
3x RNN interpolation | 151 | 11.6

Mikolov et al. (2010, Interspeech), "Recurrent neural network based language model".

Conditional LMs. A conditional language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, x. As with unconditional models, it is again helpful to use the chain rule to decompose this probability:
$$p(w \mid x) = \prod_{t=1}^{\ell} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$
What is the probability of the next word, given the history of previously generated words and the conditioning context x?

Conditional LMs. Examples of conditioning contexts x (input) and output text w:

x (input) | w (output text)
An author | A document written by that author
A topic label | An article about that topic
{SPAM, NOT_SPAM} | An email
A sentence in French | Its English translation
A sentence in English | Its French translation
A sentence in English | Its Chinese translation
An image | A text description of the image
A document | Its summary
A document | Its translation
Meteorological measurements | A weather report
Acoustic signal | Transcription of speech
Conversational history + database | Dialogue system response
A question + a document | Its answer
A question + an image | Its answer

Data for training conditional LMs. To train conditional language models, we need paired samples, $\{(x_i, w_i)\}_{i=1}^{N}$. Data availability varies. It's easy to think of tasks that could be solved by conditional language models, but the data just doesn't exist. Relatively large amounts of data exist for: translation, summarisation, caption generation, speech recognition.

Evaluating conditional LMs. How good is our conditional language model?
These are language models, so we can use cross-entropy or perplexity. (Okay to implement, hard to interpret.)
Task-specific evaluation: compare the model's most likely output to a human-generated reference output using a task-specific evaluation metric L:
$$w^* = \arg\max_w p(w \mid x), \qquad L(w^*, w_{\text{ref}})$$
Examples of L: BLEU, METEOR, WER, ROUGE. (Easy to implement, okay to interpret.)
Human evaluation. (Hard to implement, easy to interpret.)
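As one concrete instance of the metric L, the sketch below scores a hypothetical model output against a reference with sentence-level BLEU from NLTK; the sentences and the choice of NLTK are illustrative assumptions, not part of the lecture.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "artistry can 't be taught".split()   # w_ref (tokenized)
hypothesis = "art can 't be taught".split()       # hypothetical w* = argmax_w p(w | x)

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")   # L(w*, w_ref)
```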

Lecture overview. The rest of this lecture will look at encoder-decoder models that learn a function mapping x into a fixed-size vector, and then use a language model to decode that vector into a sequence of words, w. [Figure: x = "Kunst kann nicht gelehrt werden" → encoder → representation → decoder → w = "Artistry can't be taught". A second example: x = an image → encoder → representation → decoder → w = "A dog is playing on the beach."]

Lecture overview. Two questions: 1. How do we encode x as a fixed-size vector, c? Problem (or at least modality) specific; think about the assumptions being made. 2. How do we condition on c in the decoding model? Less problem specific; we will review one standard solution: RNNs.

Kalchbrenner and Blunsom 2013. Encoder: $c = \text{embed}(x)$, $s = Vc$.
Recurrent decoder:
$$h_t = g(W[h_{t-1}; w_{t-1}] + s + b), \qquad u_t = P h_t + b', \qquad p(w_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
Here $[h_{t-1}; w_{t-1}]$ combines the recurrent connection with the embedding of $w_{t-1}$, $s$ is the source-sentence representation, and $b$ is a learnt bias.
Recall the unconditional RNN: $h_t = g(W[h_{t-1}; w_{t-1}] + b)$.

K&B 2013: Encoder. How should we define $c = \text{embed}(x)$? The simplest model possible: $c = \sum_i x_i$, the sum of the source word embeddings. [Figure: source words $x_1, \ldots, x_6$ mapped to embedding vectors and summed into c.] What do you think of this model?
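A minimal numpy sketch, under invented shapes and random parameters, of this additive encoder together with one step of the K&B recurrent decoder defined above; the embeddings, V, W, P, and biases here are placeholder assumptions, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, d_e, d_h = 50, 60, 16, 32   # assumed vocabulary and layer sizes

E_src = rng.normal(size=(V_src, d_e))     # source word embeddings
E_tgt = rng.normal(size=(V_tgt, d_e))     # target word embeddings
V_mat = rng.normal(size=(d_h, d_e))       # s = V c
W = rng.normal(size=(d_h, d_h + d_e))     # applied to [h_{t-1}; w_{t-1}]
b = np.zeros(d_h)
P = rng.normal(size=(V_tgt, d_h))         # output projection
b_out = np.zeros(V_tgt)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def encode(src_ids):
    c = E_src[src_ids].sum(axis=0)        # c = sum_i x_i  (bag of embeddings)
    return V_mat @ c                      # s = V c

def decoder_step(h_prev, prev_word_id, s):
    h = np.tanh(W @ np.concatenate([h_prev, E_tgt[prev_word_id]]) + s + b)
    p = softmax(P @ h + b_out)            # p(w_t | x, w_<t)
    return h, p

s = encode(np.array([3, 7, 11]))          # a hypothetical 3-word source sentence
h, p = decoder_step(np.zeros(d_h), 0, s)  # first step, conditioned on <s> (id 0)
print(p.shape, p.sum())                   # (60,) and ~1.0
```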

K&B 2013: RNN Decoder. [Figure, built up step by step: the decoder RNN unrolled over "⟨s⟩ tom likes beer ⟨/s⟩", exactly as in the unconditional RNN LM earlier, except that the source representation s feeds into every hidden state. The steps produce $p(\text{tom} \mid s, \langle s \rangle)$, $p(\text{likes} \mid s, \langle s \rangle, \text{tom})$, $p(\text{beer} \mid s, \langle s \rangle, \text{tom}, \text{likes})$, and $p(\langle/s\rangle \mid s, \langle s \rangle, \text{tom}, \text{likes}, \text{beer})$.]

A word about decoding. In general, we want to find the most probable (MAP) output given the input, i.e.
$$w^* = \arg\max_w p(w \mid x) = \arg\max_w \sum_{t=1}^{|w|} \log p(w_t \mid x, w_{<t})$$
This is, for general RNNs, a hard problem ("undecidable :(" as the slide puts it). We therefore approximate it with a greedy search:
$$w_1^* = \arg\max_{w_1} p(w_1 \mid x), \quad w_2^* = \arg\max_{w_2} p(w_2 \mid x, w_1^*), \quad \ldots, \quad w_t^* = \arg\max_{w_t} p(w_t \mid x, w_{<t}^*)$$
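A sketch of greedy decoding against any model exposing a next-word distribution; the `next_word_probs(x, prefix)` interface, the end-of-sentence symbol, the length cap, and the dummy model in the usage example are assumptions for illustration.

```python
import numpy as np

def greedy_decode(next_word_probs, x, vocab, eos="</s>", max_len=50):
    """Pick argmax_w p(w_t | x, w_<t) at every step (no lookahead)."""
    prefix = []
    for _ in range(max_len):
        p = next_word_probs(x, prefix)       # assumed: returns a length-|V| array
        w = vocab[int(np.argmax(p))]
        if w == eos:
            break
        prefix.append(w)
    return prefix

# Toy usage with a dummy model that prefers "beer", then "</s>".
vocab = ["beer", "I", "drink", "</s>"]
def dummy_probs(x, prefix):
    return np.array([0.1, 0.1, 0.1, 0.7]) if "beer" in prefix else np.array([0.7, 0.1, 0.1, 0.1])

print(greedy_decode(dummy_probs, x=None, vocab=vocab))   # ['beer']
```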

A word about decoding. A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses. E.g., for b = 2 and x = "Bier trinke ich" (gloss: beer drink I): [Figure, built up over steps $w_0, w_1, w_2, w_3$: starting from ⟨s⟩ (logprob = 0), the candidate first words are "beer" (logprob = -1.82) and "I" (logprob = -2.11). Expanding both beams gives "beer drink" (-6.93), "beer I" (-5.80), "I beer" (-8.66), and "I drink" (-2.87); the top two are kept. Expanding again gives continuations "drink" (-6.28) and "like" (-7.31) on one beam and "beer" (-3.04) and "wine" (-5.12) on the other, so the best-scoring hypothesis is "I drink beer".]
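A compact sketch of beam search with the same interface as the greedy decoder above; the `next_word_log_probs` function, the scoring by summed log probability, and the uniform dummy model in the usage line are assumptions consistent with the slide's example, not code from the lecture.

```python
import numpy as np

def beam_search(next_word_log_probs, x, vocab, b=2, eos="</s>", max_len=50):
    """Keep the b highest-scoring partial hypotheses (sum of log probs) at each step."""
    beams = [([], 0.0)]                      # (prefix, total log prob), starting at <s>
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))       # finished hypothesis
                continue
            logp = next_word_log_probs(x, prefix)        # assumed: length-|V| array
            for i, lp in enumerate(logp):
                candidates.append((prefix + [vocab[i]], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams[0]

# Toy usage: a uniform dummy model (real scores would come from the conditional RNN).
vocab = ["I", "drink", "beer", "</s>"]
dummy = lambda x, prefix: np.log(np.full(len(vocab), 1 / len(vocab)))
print(beam_search(dummy, x=None, vocab=vocab, b=2, max_len=3))
```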

How well does this model do?

Model | Perplexity (2011) | Perplexity (2012)
order=5 Markov, Kneser-Ney freq. est. | 222 | 225
RNN LM | 178 | 181
RNN LM + x | 140 | 142

How well does this model do? [Figure: several example model outputs shown on the slides; only the literal gloss of one example survives in the transcription: "I will feel bad if you do not find a solution."]

Summary. Conditional language modeling provides a convenient formulation for a lot of practical applications. Two big problems: model expressivity, and decoding difficulties. Next time: a better encoder for vector-to-sequence models, attention for better learning, and lots of results on machine translation.

Questions?