Conditional Language Modeling. Chris Dyer
|
|
- Duane Ramsey
- 5 years ago
- Views:
Transcription
1 Conditional Language Modeling Chris Dyer
2 Unconditional LMs A language model assigns probabilities to sequences of words,. w =(w 1,w 2,...,w`) It is convenient to decompose this probability using the chain rule, as follows: p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) p(w` w 1,...,w` 1 ) = w Y t=1 p(w t w 1,...,w t 1 ) This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.
3 Evaluating unconditional LMs How good is our unconditional language model? 1. Held-out per-word cross entropy or perplexity H = ppl = b 1 w i=1 P w i=1 log b p(w i w <i ) Same as training criterion. How uncertain is the model at each time position, an average? 2. Task-based evaluation 1 w w X log 2 p(w i w <i ) (units: bits per word) (units: uncertainty per word) Use in a task-model that uses a language model in place of some other language model. Does it improve?
4 History-based LMs A common strategy is to make a Markov assumption, which is a conditional independence assumption. p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) p(w 4 w 1,w 2,w 3 )...
5 History-based LMs A common strategy is to make a Markov assumption, which is a conditional independence assumption. p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) Markov: forget the distant past. Is this valid for language? No Is it practical? Often! p(w 4 w 1,w 2,w 3 )...
6 History-based LMs A common strategy is to make a Markov assumption, which is a conditional independence assumption. p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) Markov: forget the distant past. Is this valid for language? No Is it practical? Often! p(w 4 w 1,w 2,w 3 )... Why RNNs are great for language: no more Markov assumptions!
7 History-based LMs with RNNs RNN hidden state random variable vector, length= vocab p(w 5 w 1,w 2,w 3,w 4 ) z } { h 1 h 2 h 3 h 4 h 0 w 1 w 2 w 3 w 4 observed context word w 1 w 2 w 3 w 4 vector (word embedding)
8 History-based LMs with RNNs RNN hidden state random variable vector, length= vocab p(w 5 w 1,w 2,w 3,w 4 ) z } { h 1 h 2 h 3 h 4 h 0 w 1 w 2 w 3 w 4 observed context word w 1 w 2 w 3 w 4 vector (word embedding)
9 Bridle. (1990) Probabilistic interpretation of feedforward classification Distributions over words Each dimension corresponds to a word in a closed vocabulary, V. h the a and cat dog horse runs says walked walks walking pig Lisbon sardines u = Wh + b p i = exp u i P j exp u j The pi s form a distribution, i.e. p i > 0 8i, X i p i =1
10 Distributions over words h the a and cat dog horse runs says walked walks walking pig Lisbon sardines u = Wh + b p i = exp u i P j exp u j p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) p(w 4 w 1,w 2,w 3 )...
11 Distributions over words h the a and cat dog horse runs says walked walks walking pig Lisbon sardines u = Wh + b p i = exp u i P j exp u j p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) histories are sequences of words p(w 4 w 1,w 2,w 3 )...
12 Distributions over words h the a and cat dog horse runs says walked walks walking pig Lisbon sardines u = Wh + b p i = exp u i P j exp u j h 2 R d V = 100, 000 What are the dimensions of b? p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) histories are sequences of words p(w 4 w 1,w 2,w 3 )...
13 Distributions over words h the a and cat dog horse runs says walked walks walking pig Lisbon sardines u = Wh + b p i = exp u i P j exp u j h 2 R d V = 100, 000 What are the dimensions of W? p(w) =p(w 1 ) p(w 2 w 1 ) p(w 3 w 1,w 2 ) histories are sequences of words p(w 4 w 1,w 2,w 3 )...
14 RNN language models
15 RNN language models h 1 h 0 x 1 <s>
16 RNN language models ˆp 1 h 1 h 0 x 1 <s>
17 RNN language models p(tom hsi) tom ˆp 1 h 1 h 0 x 1 <s>
18 RNN language models p(tom hsi) tom ˆp 1 h 1 h 0 x 1 x 2 <s>
19 RNN language models p(tom hsi) tom ˆp 1 h 1 h 2 h 0 x 1 x 2 <s>
20 RNN language models p(tom hsi) p(likes hsi, tom) tom likes ˆp 1 h 1 h 2 h 0 x 1 x 2 <s>
21 RNN language models p(tom hsi) p(likes hsi, tom) p(beer hsi, tom, likes) tom likes beer ˆp 1 h 1 h 2 h 3 h 0 x 1 x 2 x 3 <s>
22 RNN language models p(tom hsi) p(likes hsi, tom) p(beer hsi, tom, likes) p(h/si hsi, tom, likes, beer) tom likes beer </s> ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
23 Training RNN language models tom likes beer </s> ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
24 Training RNN language models tom likes beer </s> cost 1 cost 2 cost 3 cost 4 ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
25 Training RNN language models log loss/ cross entropy tom likes beer </s> {cost 1 cost 2 cost 3 cost 4 ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
26 Training RNN language models F log loss/ cross entropy { tom likes beer </s> cost 1 cost 2 cost 3 cost 4 ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
27 Training RNN language models F log loss/ cross entropy { tom likes beer </s> cost 1 cost 2 cost 3 cost 4 ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
28 Training RNN language models The cross-entropy objective seeks the maximum likelihood (MLE) objective. Find the parameters that make the training data most likely.
29 Training RNN language models The cross-entropy objective seeks the maximum likelihood (MLE) objective. Find the parameters that make the training data most likely. You will overfit. 1. Stop training early, based on a validation set 2. Weight decay / other regularizers 3. Dropout during training. In contrast to count-based models, zeroes aren t a problem.
30 RNN language models Unlike Markov (n-gram) models, RNNs never forget However, they don t always remember so well (recall Felix s lectures on RNNs vs. LSTMs) Algorithms Sample a sequence from the probability distribution defined by the RNN Train the RNN to minimize cross entropy (aka MLE) What about: what is the most probable sequence?
31 How well do RNN LMs do? perplexity Word Error Rate (WER) order=5 Markov Kneser-Ney freq. est RNN 400 hidden xRNN interpolation Mikolov et al. (2010 Interspeech) Recurrent neural network based language model
32 How well do RNN LMs do? perplexity Word Error Rate (WER) order=5 Markov Kneser-Ney freq. est RNN 400 hidden xRNN interpolation Mikolov et al. (2010 Interspeech) Recurrent neural network based language model
33 Conditional LMs A conditional language model assigns probabilities to sequences of words, w =(w 1,w 2,...,w`), given some conditioning context, x. As with unconditional models, it is again helpful to use the chain rule to decompose this probability: Ỳ p(w x) = p(w t x,w 1,w 2,...,w t 1 ) t=1 What is the probability of the next word, given the history of previously generated words and conditioning context x?
34 Conditional LMs x input text output An author A document written by that author A topic label An article about that topic {SPAM, NOT_SPAM} A sentence in French A sentence in English A sentence in English An image A document A document Meterological measurements Acoustic signal Conversational history + database A question + a document A question + an image w An Its English translation Its French translation Its Chinese translation A text description of the image Its summary Its translation A weather report Transcription of speech Dialogue system response Its answer Its answer
35 Conditional LMs x input text output An author A document written by that author A topic label An article about that topic {SPAM, NOT_SPAM} A sentence in French A sentence in English A sentence in English An image A document A document Meterological measurements Acoustic signal Conversational history + database A question + a document A question + an image w An Its English translation Its French translation Its Chinese translation A text description of the image Its summary Its translation A weather report Transcription of speech Dialogue system response Its answer Its answer
36 Conditional LMs x input text output An author A document written by that author A topic label An article about that topic {SPAM, NOT_SPAM} A sentence in French A sentence in English A sentence in English An image A document A document Meterological measurements Acoustic signal Conversational history + database A question + a document A question + an image w An Its English translation Its French translation Its Chinese translation A text description of the image Its summary Its translation A weather report Transcription of speech Dialogue system response Its answer Its answer
37 Data for training conditional LMs To train conditional language models, we need paired samples, {(x i, w i )} N i=1. Data availability varies. It s easy to think of tasks that could be solved by conditional language models, but the data just doesn t exist. Relatively large amounts of data for: Translation, summarisation, caption generation, speech recognition
38 Evaluating conditional LMs How good is our conditional language model? These are language models, we can use cross-entropy or perplexity. okay to implement, hard to interpret Task-specific evaluation. Compare the model s most likely output to human-generated expected output using a task-specific evaluation metric L. w = arg max w p(w x) L(w, w ref ) Examples of L: BLEU, METEOR, WER, ROUGE. easy to implement, okay to interpret Human evaluation. hard to implement, easy to interpret
39 Evaluating conditional LMs How good is our conditional language model? These are language models, we can use cross-entropy or perplexity. okay to implement, hard to interpret Task-specific evaluation. Compare the model s most likely output to human-generated expected output using a task-specific evaluation metric L. w = arg max w p(w x) L(w, w ref ) Examples of L: BLEU, METEOR, WER, ROUGE. easy to implement, okay to interpret Human evaluation. hard to implement, easy to interpret
40 Lecture overview The rest of this lecture will look at encoder-decoder models that learn a function that maps into a fixed-size vector and then uses a language model to decode that vector into a sequence of words, w. x x Kunst kann nicht gelehrt werden encoder representation decoder w Artistry can t be taught
41 Lecture overview The rest of this lecture will look at encoder-decoder models that learn a function that maps into a fixed-size vector and then uses a language model to decode that vector into a sequence of words, w. x x encoder representation decoder w A dog is playing on the beach.
42 Lecture overview Two questions How do we encode x as a fixed-size vector, c? - Problem (or at least modality) specific - Think about assumptions How do we condition on c in the decoding model? - Less problem specific - We will review one standard solution: RNNs
43 Kalchbrenner and Blunsom 2013 Encoder c =embed(x) s = Vc
44 Kalchbrenner and Blunsom 2013 Encoder c =embed(x) s = Vc Recurrent decoder Recurrent connection Embedding of w t 1 h t = g(w[h t 1 ; w t 1 ] + s + b]) u t = Ph t + b 0 p(w t x, w <t ) = (u t ) Learnt bias Source sentence
45 Kalchbrenner and Blunsom 2013 Encoder c =embed(x) s = Vc Recurrent decoder Recurrent connection Embedding of w t 1 h t = g(w[h t 1 ; w t 1 ] + s + b]) u t = Ph t + b 0 p(w t x, w <t ) = (u t ) Learnt bias Source sentence Recall unconditional RNN h t = g(w[h t 1 ; w t 1 ]+b])
46 K&B 2013: Encoder How should we define c =embed(x)? The simplest model possible: c = X i x i x 1 x 2 x 3 x 4 x 5 x 6 x 1 x 2 x 3 x 4 x 5 x 6 What do you think of this model?
47 K&B 2013: RNN Decoder Encoder c =embed(x) s = Vc Recurrent decoder Recurrent connection Embedding of w t 1 h t = g(w[h t 1 ; w t 1 ] + s + b]) u t = Ph t + b 0 p(w t x, w <t ) = (u t ) Learnt bias Source sentence Recall unconditional RNN h t = g(w[h t 1 ; w t 1 ]+b])
48 K&B 2013: RNN Decoder s h 1 h 0 x 1 <s>
49 K&B 2013: RNN Decoder s ˆp 1 h 1 h 0 x 1 <s>
50 K&B 2013: RNN Decoder p(tom s, hsi) s tom ˆp 1 h 1 h 0 x 1 <s>
51 K&B 2013: RNN Decoder p(tom s, hsi) p(likes s, hsi, tom) s tom likes ˆp 1 h 1 h 2 h 0 x 1 x 2 <s>
52 K&B 2013: RNN Decoder p(tom s, hsi) p(likes s, hsi, tom) p(beer s, hsi, tom, likes) s tom likes beer ˆp 1 h 1 h 2 h 3 h 0 x 1 x 2 x 3 <s>
53 K&B 2013: RNN Decoder p(tom s, hsi) s p(likes s, hsi, tom) p(beer s, hsi, tom, likes) p(h\si s, hsi, tom, likes, beer) tom likes beer </s> ˆp 1 h 1 h 2 h 3 h 4 h 0 x 1 x 2 x 3 x 4 <s>
54 A word about decoding In general, we want to find the most probable (MAP) output given the input, i.e. w = arg max w = arg max w p(w x) w X t=1 log p(w t x, w <t )
55 A word about decoding In general, we want to find the most probable (MAP) output given the input, i.e. w = arg max w = arg max w p(w x) w X t=1 undecidable :( log p(w t x, w <t ) This is, for general RNNs, a hard problem. We therefore approximate it with a greedy search: w 1 arg max w 1 p(w 1 x) w 2 arg max w 2 p(w 2 x,w 1).. w t arg max w t p(w t x, w <t)
56 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I hsi logprob=0 w 0 w 1 w 2 w 3
57 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 w 0 w 1 w 2 w 3
58 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I drink logprob=-6.93 hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 I logprob=-5.80 w 0 w 1 w 2 w 3
59 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I drink logprob=-6.93 hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 I logprob=-5.80 beer logprob=-8.66 drink logprob=-2.87 w 0 w 1 w 2 w 3
60 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I drink logprob=-6.93 hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 I logprob=-5.80 beer logprob=-8.66 drink logprob=-2.87 w 0 w 1 w 2 w 3
61 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I drink logprob=-6.93 drink logprob=-6.28 hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 I logprob=-5.80 beer logprob=-8.66 like logprob=-7.31 beer logprob=-3.04 drink logprob=-2.87 wine logprob=-5.12 w 0 w 1 w 2 w 3
62 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I drink logprob=-6.93 drink logprob=-6.28 hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 I logprob=-5.80 beer logprob=-8.66 like logprob=-7.31 beer logprob=-3.04 drink logprob=-2.87 wine logprob=-5.12 w 0 w 1 w 2 w 3
63 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of top b hypothesis. E.g., for b=2: x = Bier trinke ich beer drink I drink logprob=-6.93 drink logprob=-6.28 hsi logprob=0 beer logprob=-1.82 I logprob=-2.11 I logprob=-5.80 beer logprob=-8.66 like logprob=-7.31 beer logprob=-3.04 drink logprob=-2.87 wine logprob=-5.12 w 0 w 1 w 2 w 3
64 How well does this model do? perplexity (2011) perplexity (2012) order=5 Markov Kneser-Ney freq. est RNN LM RNN LM + x
65 How well does this model do?
66 How well does this model do?
67 How well does this model do?
68 How well does this model do?
69 How well does this model do?
70 How well does this model do?
71 How well does this model do? (Literal: I will feel bad if you do not find a solution.)
72 How well does this model do? (Literal: I will feel bad if you do not find a solution.)
73 Summary Conditional language modeling provides a convenient formulation for a lot of practical applications Two big problems: Model expressivity Decoding difficulties Next time A better encoder for vector to sequence models Attention for better learning Lots of results on machine translation
74 Questions?
Better Conditional Language Modeling. Chris Dyer
Better Conditional Language Modeling Chris Dyer Conditional LMs A conditional language model assigns probabilities to sequences of words, w =(w 1,w 2,...,w`), given some conditioning context, x. As with
More informationSequence Modeling with Neural Networks
Sequence Modeling with Neural Networks Harini Suresh y 0 y 1 y 2 s 0 s 1 s 2... x 0 x 1 x 2 hat is a sequence? This morning I took the dog for a walk. sentence medical signals speech waveform Successes
More informationCSC321 Lecture 7 Neural language models
CSC321 Lecture 7 Neural language models Roger Grosse and Nitish Srivastava February 1, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 7 Neural language models February 1, 2015 1 / 19 Overview We
More informationDeep Learning Sequence to Sequence models: Attention Models. 17 March 2018
Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:
More informationLecture 5 Neural models for NLP
CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm
More informationNatural Language Processing
Natural Language Processing Info 159/259 Lecture 7: Language models 2 (Sept 14, 2017) David Bamman, UC Berkeley Language Model Vocabulary V is a finite set of discrete symbols (e.g., words, characters);
More informationRecap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP
Recap: Language models Foundations of atural Language Processing Lecture 4 Language Models: Evaluation and Smoothing Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp
More informationSegmental Recurrent Neural Networks for End-to-end Speech Recognition
Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave
More informationFoundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model
Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January
More informationEmpirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model
Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model (most slides from Sharon Goldwater; some adapted from Philipp Koehn) 5 October 2016 Nathan Schneider
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationNeural Networks Language Models
Neural Networks Language Models Philipp Koehn 10 October 2017 N-Gram Backoff Language Model 1 Previously, we approximated... by applying the chain rule p(w ) = p(w 1, w 2,..., w n ) p(w ) = i p(w i w 1,...,
More informationN-gram Language Modeling
N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical
More informationLearning to translate with neural networks. Michael Auli
Learning to translate with neural networks Michael Auli 1 Neural networks for text processing Similar words near each other France Spain dog cat Neural networks for text processing Similar words near each
More informationANLP Lecture 6 N-gram models and smoothing
ANLP Lecture 6 N-gram models and smoothing Sharon Goldwater (some slides from Philipp Koehn) 27 September 2018 Sharon Goldwater ANLP Lecture 6 27 September 2018 Recap: N-gram models We can model sentence
More informationNatural Language Processing
Natural Language Processing Info 59/259 Lecture 4: Text classification 3 (Sept 5, 207) David Bamman, UC Berkeley . https://www.forbes.com/sites/kevinmurnane/206/04/0/what-is-deep-learning-and-how-is-it-useful
More informationCSC321 Lecture 15: Recurrent Neural Networks
CSC321 Lecture 15: Recurrent Neural Networks Roger Grosse Roger Grosse CSC321 Lecture 15: Recurrent Neural Networks 1 / 26 Overview Sometimes we re interested in predicting sequences Speech-to-text and
More informationMachine Learning for natural language processing
Machine Learning for natural language processing N-grams and language models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 25 Introduction Goals: Estimate the probability that a
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationperplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3)
Chapter 5 Language Modeling 5.1 Introduction A language model is simply a model of what strings (of words) are more or less likely to be generated by a speaker of English (or some other language). More
More informationConditional Language modeling with attention
Conditional Language modeling with attention 2017.08.25 Oxford Deep NLP 조수현 Review Conditional language model: assign probabilities to sequence of words given some conditioning context x What is the probability
More informationCS230: Lecture 10 Sequence models II
CS23: Lecture 1 Sequence models II Today s outline We will learn how to: - Automatically score an NLP model I. BLEU score - Improve Machine II. Beam Search Translation results with Beam search III. Speech
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Language Models. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Language Models Tobias Scheffer Stochastic Language Models A stochastic language model is a probability distribution over words.
More informationThe Naïve Bayes Classifier. Machine Learning Fall 2017
The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning
More informationN-gram Language Modeling Tutorial
N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures
More informationNaïve Bayes, Maxent and Neural Models
Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words
More informationRecurrent Neural Networks Deep Learning Lecture 5. Efstratios Gavves
Recurrent Neural Networks Deep Learning Lecture 5 Efstratios Gavves Sequential Data So far, all tasks assumed stationary data Neither all data, nor all tasks are stationary though Sequential Data: Text
More informationChapter 3: Basics of Language Modelling
Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple
More information] Automatic Speech Recognition (CS753)
] Automatic Speech Recognition (CS753) Lecture 17: Discriminative Training for HMMs Instructor: Preethi Jyothi Sep 28, 2017 Discriminative Training Recall: MLE for HMMs Maximum likelihood estimation (MLE)
More informationLanguage Modelling. Steve Renals. Automatic Speech Recognition ASR Lecture 11 6 March ASR Lecture 11 Language Modelling 1
Language Modelling Steve Renals Automatic Speech Recognition ASR Lecture 11 6 March 2017 ASR Lecture 11 Language Modelling 1 HMM Speech Recognition Recorded Speech Decoded Text (Transcription) Acoustic
More informationACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging
ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers
More informationCSC321 Lecture 10 Training RNNs
CSC321 Lecture 10 Training RNNs Roger Grosse and Nitish Srivastava February 23, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 10 Training RNNs February 23, 2015 1 / 18 Overview Last time, we saw
More informationwith Local Dependencies
CS11-747 Neural Networks for NLP Structured Prediction with Local Dependencies Xuezhe Ma (Max) Site https://phontron.com/class/nn4nlp2017/ An Example Structured Prediction Problem: Sequence Labeling Sequence
More informationTopics in Natural Language Processing
Topics in Natural Language Processing Shay Cohen Institute for Language, Cognition and Computation University of Edinburgh Lecture 9 Administrativia Next class will be a summary Please email me questions
More informationNeural Architectures for Image, Language, and Speech Processing
Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationNEURAL LANGUAGE MODELS
COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,
More informationStatistical Machine Translation
Statistical Machine Translation Marcello Federico FBK-irst Trento, Italy Galileo Galilei PhD School University of Pisa Pisa, 7-19 May 2008 Part V: Language Modeling 1 Comparing ASR and statistical MT N-gram
More informationAlgorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address
More informationNatural Language Understanding. Recap: probability, language models, and feedforward networks. Lecture 12: Recurrent Neural Networks and LSTMs
Natural Language Understanding Lecture 12: Recurrent Neural Networks and LSTMs Recap: probability, language models, and feedforward networks Simple Recurrent Networks Adam Lopez Credits: Mirella Lapata
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationImproved Learning through Augmenting the Loss
Improved Learning through Augmenting the Loss Hakan Inan inanh@stanford.edu Khashayar Khosravi khosravi@stanford.edu Abstract We present two improvements to the well-known Recurrent Neural Network Language
More informationN-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24
L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,
More informationLanguage Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high
More informationMachine Translation. 10: Advanced Neural Machine Translation Architectures. Rico Sennrich. University of Edinburgh. R. Sennrich MT / 26
Machine Translation 10: Advanced Neural Machine Translation Architectures Rico Sennrich University of Edinburgh R. Sennrich MT 2018 10 1 / 26 Today s Lecture so far today we discussed RNNs as encoder and
More informationBasic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology
Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are
More informationCSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9
CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification
More informationLecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron)
Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor (http://brenocon.com) 1 Models for
More informationLong-Short Term Memory and Other Gated RNNs
Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling
More informationLanguage Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high
More informationLecture 15: Recurrent Neural Nets
Lecture 15: Recurrent Neural Nets Roger Grosse 1 Introduction Most of the prediction tasks we ve looked at have involved pretty simple kinds of outputs, such as real values or discrete categories. But
More informationRecurrent Neural Networks (Part - 2) Sumit Chopra Facebook
Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech
More informationStatistical Methods for NLP
Statistical Methods for NLP Language Models, Graphical Models Sameer Maskey Week 13, April 13, 2010 Some slides provided by Stanley Chen and from Bishop Book Resources 1 Announcements Final Project Due,
More informationNLP: N-Grams. Dan Garrette December 27, Predictive text (text messaging clients, search engines, etc)
NLP: N-Grams Dan Garrette dhg@cs.utexas.edu December 27, 2013 1 Language Modeling Tasks Language idenfication / Authorship identification Machine Translation Speech recognition Optical character recognition
More informationLanguage Model. Introduction to N-grams
Language Model Introduction to N-grams Probabilistic Language Model Goal: assign a probability to a sentence Application: Machine Translation P(high winds tonight) > P(large winds tonight) Spelling Correction
More informationClassification: Naïve Bayes. Nathan Schneider (slides adapted from Chris Dyer, Noah Smith, et al.) ENLP 19 September 2016
Classification: Naïve Bayes Nathan Schneider (slides adapted from Chris Dyer, Noah Smith, et al.) ENLP 19 September 2016 1 Sentiment Analysis Recall the task: Filled with horrific dialogue, laughable characters,
More informationDeep Learning for NLP
Deep Learning for NLP Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Greg Durrett Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks
More informationNatural Language Processing
Natural Language Processing Language models Based on slides from Michael Collins, Chris Manning and Richard Soccer Plan Problem definition Trigram models Evaluation Estimation Interpolation Discounting
More informationNeural Networks in Structured Prediction. November 17, 2015
Neural Networks in Structured Prediction November 17, 2015 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving
More informationEmpirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs
Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21
More informationForward algorithm vs. particle filtering
Particle Filtering ØSometimes X is too big to use exact inference X may be too big to even store B(X) E.g. X is continuous X 2 may be too big to do updates ØSolution: approximate inference Track samples
More informationLecture 15: Exploding and Vanishing Gradients
Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train
More informationSeq2Seq Losses (CTC)
Seq2Seq Losses (CTC) Jerry Ding & Ryan Brigden 11-785 Recitation 6 February 23, 2018 Outline Tasks suited for recurrent networks Losses when the output is a sequence Kinds of errors Losses to use CTC Loss
More informationMidterm sample questions
Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts
More informationIntroduction to RNNs!
Introduction to RNNs Arun Mallya Best viewed with Computer Modern fonts installed Outline Why Recurrent Neural Networks (RNNs)? The Vanilla RNN unit The RNN forward pass Backpropagation refresher The RNN
More informationNeural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35
Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How
More informationLanguage Modeling. Michael Collins, Columbia University
Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting
More informationDeep Learning Recurrent Networks 2/28/2018
Deep Learning Recurrent Networks /8/8 Recap: Recurrent networks can be incredibly effective Story so far Y(t+) Stock vector X(t) X(t+) X(t+) X(t+) X(t+) X(t+5) X(t+) X(t+7) Iterated structures are good
More informationNatural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale
Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale Noah Smith c 2017 University of Washington nasmith@cs.washington.edu May 22, 2017 1 / 30 To-Do List Online
More informationGoogle s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill
More informationCSC321 Lecture 16: ResNets and Attention
CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the
More informationCS 188: Artificial Intelligence Fall 2011
CS 188: Artificial Intelligence Fall 2011 Lecture 20: HMMs / Speech / ML 11/8/2011 Dan Klein UC Berkeley Today HMMs Demo bonanza! Most likely explanation queries Speech recognition A massive HMM! Details
More informationRecurrent Neural Network
Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks
More informationCross-Lingual Language Modeling for Automatic Speech Recogntion
GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The
More informationCSC321 Lecture 9 Recurrent neural nets
CSC321 Lecture 9 Recurrent neural nets Roger Grosse and Nitish Srivastava February 3, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 9 Recurrent neural nets February 3, 2015 1 / 20 Overview You
More informationDeep Learning Basics Lecture 10: Neural Language Models. Princeton University COS 495 Instructor: Yingyu Liang
Deep Learning Basics Lecture 10: Neural Language Models Princeton University COS 495 Instructor: Yingyu Liang Natural language Processing (NLP) The processing of the human languages by computers One of
More informationRecurrent Neural Networks. Jian Tang
Recurrent Neural Networks Jian Tang tangjianpku@gmail.com 1 RNN: Recurrent neural networks Neural networks for sequence modeling Summarize a sequence with fix-sized vector through recursively updating
More information1. Markov models. 1.1 Markov-chain
1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited
More informationLanguage Modeling. Hung-yi Lee 李宏毅
Language Modeling Hung-yi Lee 李宏毅 Language modeling Language model: Estimated the probability of word sequence Word sequence: w 1, w 2, w 3,., w n P(w 1, w 2, w 3,., w n ) Application: speech recognition
More informationLog-Linear Models, MEMMs, and CRFs
Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx
More informationDeep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017
Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017 Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion
More informationStructured Neural Networks (I)
Structured Neural Networks (I) CS 690N, Spring 208 Advanced Natural Language Processing http://peoplecsumassedu/~brenocon/anlp208/ Brendan O Connor College of Information and Computer Sciences University
More informationCS230: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention
CS23: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention Today s outline We will learn how to: I. Word Vector Representation i. Training - Generalize results with word vectors -
More informationCS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev
CS4705 Probability Review and Naïve Bayes Slides from Dragomir Radev Classification using a Generative Approach Previously on NLP discriminative models P C D here is a line with all the social media posts
More informationProbabilistic Language Modeling
Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November
More informationRecurrent and Recursive Networks
Neural Networks with Applications to Vision and Language Recurrent and Recursive Networks Marco Kuhlmann Introduction Applications of sequence modelling Map unsegmented connected handwriting to strings.
More informationModelling Time Series with Neural Networks. Volker Tresp Summer 2017
Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,
More informationNatural Language Processing and Recurrent Neural Networks
Natural Language Processing and Recurrent Neural Networks Pranay Tarafdar October 19 th, 2018 Outline Introduction to NLP Word2vec RNN GRU LSTM Demo What is NLP? Natural Language? : Huge amount of information
More informationCSC321 Lecture 15: Exploding and Vanishing Gradients
CSC321 Lecture 15: Exploding and Vanishing Gradients Roger Grosse Roger Grosse CSC321 Lecture 15: Exploding and Vanishing Gradients 1 / 23 Overview Yesterday, we saw how to compute the gradient descent
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.
More informationLanguage Models. Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN
Language Models Data Science: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM PHILIP KOEHN Data Science: Jordan Boyd-Graber UMD Language Models 1 / 8 Language models Language models answer
More informationCOMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017
COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SEQUENTIAL DATA So far, when thinking
More informationData Mining & Machine Learning
Data Mining & Machine Learning CS57300 Purdue University April 10, 2018 1 Predicting Sequences 2 But first, a detour to Noise Contrastive Estimation 3 } Machine learning methods are much better at classifying
More informationNatural Language Processing. Statistical Inference: n-grams
Natural Language Processing Statistical Inference: n-grams Updated 3/2009 Statistical Inference Statistical Inference consists of taking some data (generated in accordance with some unknown probability
More informationLanguage Modelling. Marcello Federico FBK-irst Trento, Italy. MT Marathon, Edinburgh, M. Federico SLM MT Marathon, Edinburgh, 2012
Language Modelling Marcello Federico FBK-irst Trento, Italy MT Marathon, Edinburgh, 2012 Outline 1 Role of LM in ASR and MT N-gram Language Models Evaluation of Language Models Smoothing Schemes Discounting
More informationSpeech Translation: from Singlebest to N-Best to Lattice Translation. Spoken Language Communication Laboratories
Speech Translation: from Singlebest to N-Best to Lattice Translation Ruiqiang ZHANG Genichiro KIKUI Spoken Language Communication Laboratories 2 Speech Translation Structure Single-best only ASR Single-best
More informationAn overview of word2vec
An overview of word2vec Benjamin Wilson Berlin ML Meetup, July 8 2014 Benjamin Wilson word2vec Berlin ML Meetup 1 / 25 Outline 1 Introduction 2 Background & Significance 3 Architecture 4 CBOW word representations
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data Statistical Machine Learning from Data Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL),
More informationNatural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)
Natural Language Processing SoSe 2015 Language Modelling Dr. Mariana Neves April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Motivation Estimation Evaluation Smoothing Outline 3 Motivation
More information