Conditional Language Modeling. Chris Dyer

1 Conditional Language Modeling Chris Dyer

2 Unconditional LMs A language model assigns probabilities to sequences of words, $\mathbf{w} = (w_1, w_2, \ldots, w_\ell)$. It is convenient to decompose this probability using the chain rule, as follows:
$$p(\mathbf{w}) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times \cdots \times p(w_\ell \mid w_1, \ldots, w_{\ell-1}) = \prod_{t=1}^{\ell} p(w_t \mid w_1, \ldots, w_{t-1})$$
This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.

3 Evaluating unconditional LMs How good is our unconditional language model? 1. Held-out per-word cross entropy or perplexity (the same as the training criterion: how uncertain is the model at each time position, on average?):
$$H = -\frac{1}{|\mathbf{w}|} \sum_{i=1}^{|\mathbf{w}|} \log_2 p(w_i \mid \mathbf{w}_{<i}) \quad \text{(units: bits per word)}$$
$$\mathrm{ppl} = b^{-\frac{1}{|\mathbf{w}|} \sum_{i=1}^{|\mathbf{w}|} \log_b p(w_i \mid \mathbf{w}_{<i})} \quad \text{(units: uncertainty per word)}$$
2. Task-based evaluation: use it in a system that uses a language model in place of some other language model. Does the task improve?
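
To make the units above concrete, here is a minimal sketch in plain Python (my own illustration; the per-token probabilities are made up) of computing held-out cross entropy and perplexity from a model's next-word probabilities.

```python
import math

# Hypothetical per-token probabilities p(w_i | w_<i) assigned by some model
# to a held-out sequence of 4 tokens (values are made up for illustration).
token_probs = [0.2, 0.05, 0.1, 0.4]

# Per-word cross entropy in bits: H = -(1/|w|) * sum_i log2 p(w_i | w_<i)
cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponentiation of the cross entropy (base must match).
perplexity = 2 ** cross_entropy

print(f"cross entropy: {cross_entropy:.3f} bits per word")
print(f"perplexity:    {perplexity:.3f}")
```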

4–6 History-based LMs A common strategy is to make a Markov assumption, which is a conditional independence assumption:
$$p(\mathbf{w}) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times p(w_4 \mid w_1, w_2, w_3) \times \cdots$$
Markov: forget the distant past, so that, e.g., $p(w_4 \mid w_1, w_2, w_3) \approx p(w_4 \mid w_2, w_3)$. Is this valid for language? No. Is it practical? Often! Why RNNs are great for language: no more Markov assumptions!
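
As a concrete illustration of the Markov assumption (not from the lecture), the following sketch estimates a second-order (trigram) Markov model by relative frequency on a toy corpus; the corpus and padding conventions are my own choices.

```python
from collections import defaultdict

# Toy corpus; <s> padding and </s> mark sentence boundaries.
corpus = [
    "<s> <s> tom likes beer </s>".split(),
    "<s> <s> tom likes wine </s>".split(),
]

# Count trigrams and their (w_{t-2}, w_{t-1}) contexts.
trigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for sent in corpus:
    for i in range(2, len(sent)):
        context = (sent[i - 2], sent[i - 1])
        trigram_counts[(context, sent[i])] += 1
        context_counts[context] += 1

def p_trigram(w, u, v):
    """Relative-frequency estimate of p(w | u, v) under a 2nd-order Markov assumption."""
    context = (u, v)
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[(context, w)] / context_counts[context]

print(p_trigram("beer", "tom", "likes"))  # 0.5
print(p_trigram("wine", "tom", "likes"))  # 0.5
```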

7–8 History-based LMs with RNNs [Figure: an RNN reads the observed context words w_1, w_2, w_3, w_4 (each as a word-embedding vector), updates its hidden states h_0, h_1, ..., h_4, and outputs a vector of length |V| representing p(w_5 | w_1, w_2, w_3, w_4).]

9–13 Distributions over words (Bridle, 1990: probabilistic interpretation of feedforward classification) Each dimension corresponds to a word in a closed vocabulary, V. [Figure: a hidden vector h is mapped to a score for every vocabulary word: the, a, and, cat, dog, horse, runs, says, walked, walks, walking, pig, Lisbon, sardines, ...]
$$\mathbf{u} = \mathbf{W}\mathbf{h} + \mathbf{b}, \qquad p_i = \frac{\exp u_i}{\sum_j \exp u_j}$$
The $p_i$'s form a distribution, i.e. $p_i > 0\ \forall i$ and $\sum_i p_i = 1$. In the chain-rule decomposition
$$p(\mathbf{w}) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times p(w_4 \mid w_1, w_2, w_3) \times \cdots$$
histories are sequences of words. With $\mathbf{h} \in \mathbb{R}^d$ and $|V| = 100{,}000$: what are the dimensions of $\mathbf{b}$? What are the dimensions of $\mathbf{W}$?
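
A small numpy sketch of the softmax output layer above, which also answers the dimension questions: with $\mathbf{h} \in \mathbb{R}^d$ and $|V| = 100{,}000$, $\mathbf{W}$ is $|V| \times d$ and $\mathbf{b}$ has length $|V|$. The sizes here are scaled down so the sketch runs instantly, and the parameter values are random placeholders.

```python
import numpy as np

d, V = 8, 14          # scaled-down hidden size and vocabulary size
rng = np.random.default_rng(0)

W = rng.normal(size=(V, d))   # in the slide's setting: 100,000 x d
b = rng.normal(size=V)        # in the slide's setting: length 100,000
h = rng.normal(size=d)        # hidden state summarising the history

u = W @ h + b                 # one score per vocabulary word
p = np.exp(u - u.max())       # subtract max for numerical stability
p /= p.sum()                  # softmax: p_i = exp(u_i) / sum_j exp(u_j)

assert np.all(p > 0) and np.isclose(p.sum(), 1.0)  # a valid distribution
```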

14–22 RNN language models [Figure, built up over several slides: the RNN is unrolled over the sentence <s> tom likes beer </s>. At each step the embedding x_t of the previous word and the previous hidden state h_{t-1} produce a new hidden state h_t, from which a distribution p̂_t over the next word is read off: p(tom | <s>), p(likes | <s>, tom), p(beer | <s>, tom, likes), p(</s> | <s>, tom, likes, beer).]
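
The following numpy sketch (a minimal Elman-style RNN with my own parameter shapes and random, untrained weights, not the lecture's exact parameterisation) shows how the unrolled computation in the figure turns a sentence into a sum of next-word log probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<s>", "tom", "likes", "beer", "</s>"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 16

# Model parameters (randomly initialised; an untrained model for illustration).
E = rng.normal(scale=0.1, size=(V, d))        # word embeddings x_t
W_h = rng.normal(scale=0.1, size=(d, 2 * d))  # recurrence over [h_{t-1}; x_t]
b_h = np.zeros(d)
W_o = rng.normal(scale=0.1, size=(V, d))      # output projection
b_o = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def sentence_log_prob(words):
    """log p(w_1 ... </s>) under the RNN LM, via the chain rule."""
    h = np.zeros(d)
    prev = "<s>"
    total = 0.0
    for w in words + ["</s>"]:
        x = E[word_to_id[prev]]                          # embedding of previous word
        h = np.tanh(W_h @ np.concatenate([h, x]) + b_h)  # new hidden state h_t
        p_hat = softmax(W_o @ h + b_o)                   # distribution over next word
        total += np.log(p_hat[word_to_id[w]])
        prev = w
    return total

print(sentence_log_prob(["tom", "likes", "beer"]))
```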

23–27 Training RNN language models [Figure: the same unrolled RNN over <s> tom likes beer </s>; at each time step the predicted distribution p̂_t is scored against the observed next word, giving per-step log-loss / cross-entropy terms cost_1, ..., cost_4, which are summed into the training objective F.]

28–29 Training RNN language models The cross-entropy objective is the maximum likelihood (MLE) objective: find the parameters that make the training data most likely. You will overfit. Remedies: 1. Stop training early, based on a validation set. 2. Weight decay / other regularizers. 3. Dropout during training. In contrast to count-based models, zeroes aren't a problem.
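
A minimal PyTorch sketch (my own illustration, not code from the lecture) of training an RNN LM with the cross-entropy / MLE objective on a toy corpus; the architecture, hyperparameters, and data are made up, and validation-based early stopping is only indicated in a comment.

```python
import torch
import torch.nn as nn

vocab = ["<s>", "tom", "likes", "beer", "wine", "</s>"]
stoi = {w: i for i, w in enumerate(vocab)}
data = [["<s>", "tom", "likes", "beer", "</s>"],
        ["<s>", "tom", "likes", "wine", "</s>"]]

class RNNLM(nn.Module):
    def __init__(self, V, d=32):
        super().__init__()
        self.embed = nn.Embedding(V, d)
        self.rnn = nn.RNN(d, d, batch_first=True)
        self.out = nn.Linear(d, V)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)            # next-word logits at every position

model = RNNLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()       # log loss / cross entropy per time step

for epoch in range(50):
    for sent in data:
        ids = torch.tensor([[stoi[w] for w in sent]])
        inputs, targets = ids[:, :-1], ids[:, 1:]   # predict w_t from w_<t
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, len(vocab)), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    # In practice: evaluate on a validation set here and stop early, and/or
    # use weight decay or dropout to control overfitting.
```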

30 RNN language models Unlike Markov (n-gram) models, RNNs never forget. However, they don't always remember so well (recall Felix's lectures on RNNs vs. LSTMs). Algorithms: sample a sequence from the probability distribution defined by the RNN; train the RNN to minimize cross entropy (aka MLE). What about: what is the most probable sequence?
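
The sampling algorithm mentioned above is just ancestral sampling through the chain rule: draw $w_t \sim p(\cdot \mid \mathbf{w}_{<t})$ and feed it back in as the next input. A minimal sketch, with a dummy `next_word_probs` function standing in for the trained RNN:

```python
import random

vocab = ["tom", "likes", "beer", "</s>"]

def next_word_probs(history):
    """Stand-in for the RNN: return p(w | history) over the vocabulary.
    A real implementation would run the RNN over `history` and apply softmax."""
    return [0.25, 0.25, 0.25, 0.25]          # dummy uniform distribution

def sample_sequence(max_len=20):
    history = ["<s>"]
    while len(history) < max_len:
        probs = next_word_probs(history)
        w = random.choices(vocab, weights=probs, k=1)[0]   # w_t ~ p(. | w_<t)
        history.append(w)
        if w == "</s>":
            break
    return history

print(sample_sequence())
```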

31–32 How well do RNN LMs do? [Table from Mikolov et al. (2010, Interspeech), "Recurrent neural network based language model": perplexity and word error rate (WER) for an order-5 Kneser-Ney Markov model (frequency estimation), an RNN with 400 hidden units, and an RNN interpolation; the RNN-based models improve both perplexity and WER over the n-gram baseline.]

33 Conditional LMs A conditional language model assigns probabilities to sequences of words, $\mathbf{w} = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $\mathbf{x}$. As with unconditional models, it is again helpful to use the chain rule to decompose this probability:
$$p(\mathbf{w} \mid \mathbf{x}) = \prod_{t=1}^{\ell} p(w_t \mid \mathbf{x}, w_1, w_2, \ldots, w_{t-1})$$
What is the probability of the next word, given the history of previously generated words and the conditioning context $\mathbf{x}$?

34–36 Conditional LMs Examples of conditioning context x and text output w:
x (input) → w (text output)
An author → A document written by that author
A topic label → An article about that topic
{SPAM, NOT_SPAM} → An email
A sentence in French → Its English translation
A sentence in English → Its French translation
A sentence in English → Its Chinese translation
An image → A text description of the image
A document → Its summary
A document → Its translation
Meteorological measurements → A weather report
Acoustic signal → Transcription of speech
Conversational history + database → Dialogue system response
A question + a document → Its answer
A question + an image → Its answer

37 Data for training conditional LMs To train conditional language models, we need paired samples, $\{(\mathbf{x}_i, \mathbf{w}_i)\}_{i=1}^{N}$. Data availability varies. It's easy to think of tasks that could be solved by conditional language models, but the data just doesn't exist. Relatively large amounts of data exist for: translation, summarisation, caption generation, speech recognition.

38–39 Evaluating conditional LMs How good is our conditional language model? These are language models, so we can use cross-entropy or perplexity (okay to implement, hard to interpret). Task-specific evaluation: compare the model's most likely output to a human-generated reference output using a task-specific evaluation metric $L$:
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{x}), \qquad L(\mathbf{w}^*, \mathbf{w}_{\text{ref}})$$
Examples of $L$: BLEU, METEOR, WER, ROUGE (easy to implement, okay to interpret). Human evaluation (hard to implement, easy to interpret).

40–41 Lecture overview The rest of this lecture will look at encoder-decoder models: they learn a function that maps x into a fixed-size vector and then use a language model to decode that vector into a sequence of words, w. [Figures: x = "Kunst kann nicht gelehrt werden" → encoder → representation → decoder → w = "Artistry can't be taught"; and, for captioning, x = an image → encoder → representation → decoder → w = "A dog is playing on the beach."]

42 Lecture overview Two questions: How do we encode x as a fixed-size vector, c? (Problem, or at least modality, specific; think about assumptions.) How do we condition on c in the decoding model? (Less problem specific; we will review one standard solution: RNNs.)

43–45 Kalchbrenner and Blunsom 2013 Encoder: $\mathbf{c} = \mathrm{embed}(\mathbf{x})$, $\mathbf{s} = \mathbf{V}\mathbf{c}$. Recurrent decoder:
$$\mathbf{h}_t = g(\mathbf{W}[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}] + \mathbf{s} + \mathbf{b})$$
$$\mathbf{u}_t = \mathbf{P}\mathbf{h}_t + \mathbf{b}', \qquad p(w_t \mid \mathbf{x}, \mathbf{w}_{<t}) = \mathrm{softmax}(\mathbf{u}_t)$$
(Here $[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}]$ stacks the recurrent connection with the embedding of $w_{t-1}$, $\mathbf{s}$ injects the source sentence, and $\mathbf{b}$ is a learnt bias.) Recall the unconditional RNN: $\mathbf{h}_t = g(\mathbf{W}[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}] + \mathbf{b})$.

46 K&B 2013: Encoder How should we define $\mathbf{c} = \mathrm{embed}(\mathbf{x})$? The simplest model possible: sum the source word embeddings,
$$\mathbf{c} = \sum_i \mathbf{x}_i$$
[Figure: the source words $x_1, \ldots, x_6$ are mapped to embedding vectors $\mathbf{x}_1, \ldots, \mathbf{x}_6$ and summed.] What do you think of this model?
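
A minimal sketch of this encoder (my own illustration, with hypothetical vocabulary and dimensions): the source representation $\mathbf{c}$ is just the sum of the source word embeddings, so it is completely insensitive to word order.

```python
import numpy as np

rng = np.random.default_rng(2)
src_vocab = {"Bier": 0, "trinke": 1, "ich": 2}
d = 16

E_src = rng.normal(scale=0.1, size=(len(src_vocab), d))  # source embeddings
V_proj = rng.normal(scale=0.1, size=(d, d))              # the V matrix above

def embed(source_words):
    """c = sum_i x_i: an order-insensitive bag-of-embeddings encoder."""
    return sum(E_src[src_vocab[w]] for w in source_words)

c = embed(["Bier", "trinke", "ich"])
s = V_proj @ c          # s = Vc, fed into every decoder step
assert np.allclose(embed(["ich", "trinke", "Bier"]), c)  # word order is lost
```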

47 K&B 2013: RNN Decoder Recall the decoder: $\mathbf{h}_t = g(\mathbf{W}[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}] + \mathbf{s} + \mathbf{b})$, $\mathbf{u}_t = \mathbf{P}\mathbf{h}_t + \mathbf{b}'$, $p(w_t \mid \mathbf{x}, \mathbf{w}_{<t}) = \mathrm{softmax}(\mathbf{u}_t)$, with the source sentence entering through $\mathbf{s}$; compare the unconditional RNN $\mathbf{h}_t = g(\mathbf{W}[\mathbf{h}_{t-1}; \mathbf{w}_{t-1}] + \mathbf{b})$.
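
A numpy sketch of one decoder step under the equations above (shapes, the nonlinearity $g = \tanh$, and the initialisation are my own choices): the only difference from the unconditional RNN is the extra source term $\mathbf{s}$ added inside the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(3)
d, V = 16, 5                      # hidden size and target vocabulary size

W = rng.normal(scale=0.1, size=(d, 2 * d))   # applied to [h_{t-1}; w_{t-1}]
b = np.zeros(d)
P = rng.normal(scale=0.1, size=(V, d))       # output projection
b_out = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def decoder_step(h_prev, w_prev_embed, s):
    """h_t = g(W [h_{t-1}; w_{t-1}] + s + b);  p(w_t | x, w_<t) = softmax(P h_t + b')."""
    h_t = np.tanh(W @ np.concatenate([h_prev, w_prev_embed]) + s + b)
    p_t = softmax(P @ h_t + b_out)
    return h_t, p_t

h0 = np.zeros(d)
w0 = rng.normal(scale=0.1, size=d)   # embedding of <s>
s = rng.normal(scale=0.1, size=d)    # source representation s = Vc from the encoder
h1, p1 = decoder_step(h0, w0, s)
print(p1)                            # distribution over the first target word
```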

48–53 K&B 2013: RNN Decoder [Figure, built up over several slides: the decoder RNN unrolled over <s> tom likes beer </s>; the source representation s feeds into every hidden state, and the model produces p(tom | s, <s>), p(likes | s, <s>, tom), p(beer | s, <s>, tom, likes), p(</s> | s, <s>, tom, likes, beer).]

54 A word about decoding In general, we want to find the most probable (MAP) output given the input, i.e.
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{x}) = \arg\max_{\mathbf{w}} \sum_{t=1}^{|\mathbf{w}|} \log p(w_t \mid \mathbf{x}, \mathbf{w}_{<t})$$

55 A word about decoding
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{x}) = \arg\max_{\mathbf{w}} \sum_{t=1}^{|\mathbf{w}|} \log p(w_t \mid \mathbf{x}, \mathbf{w}_{<t})$$
This is, for general RNNs, a hard problem (undecidable :( ). We therefore approximate it with a greedy search:
$$w_1^* = \arg\max_{w_1} p(w_1 \mid \mathbf{x}), \quad w_2^* = \arg\max_{w_2} p(w_2 \mid \mathbf{x}, w_1^*), \quad \ldots, \quad w_t^* = \arg\max_{w_t} p(w_t \mid \mathbf{x}, \mathbf{w}_{<t}^*)$$
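
A sketch of the greedy approximation, self-contained, with a dummy scoring function standing in for the trained conditional model (the log probabilities are made up):

```python
vocab = ["beer", "I", "drink", "</s>"]

def next_word_logprobs(x, prefix):
    """Stand-in for log p(w_t | x, w_<t); a real system would run the decoder RNN."""
    if len(prefix) >= 3:                       # toy model: prefer to stop after 3 words
        return {"beer": -5.0, "I": -5.0, "drink": -5.0, "</s>": -0.1}
    return {"beer": -1.8, "I": -2.1, "drink": -2.9, "</s>": -3.5}  # made-up scores

def greedy_decode(x, max_len=20):
    prefix = []
    for _ in range(max_len):
        scores = next_word_logprobs(x, prefix)
        w = max(scores, key=scores.get)        # w_t* = argmax_w p(w | x, w_<t*)
        prefix.append(w)
        if w == "</s>":
            break
    return prefix

print(greedy_decode("Bier trinke ich"))
```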

56–63 A word about decoding A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses. E.g., for b = 2 and x = "Bier trinke ich" (gloss: beer drink I): [Figure, built up over several slides: a search tree over output prefixes starting from <s> (log prob = 0); the beam keeps the two best-scoring prefixes at each step, e.g. "beer" (log prob = −1.82) and "I" (log prob = −2.11) after the first word, and only those are extended further.]
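
A compact beam-search sketch with beam size b, again with a stand-in scoring function; the vocabulary and log probabilities are made up and do not reproduce the numbers in the figure.

```python
vocab = ["beer", "I", "drink", "wine", "</s>"]

def next_word_logprobs(x, prefix):
    """Stand-in for log p(w | x, prefix); a real system would query the decoder RNN."""
    if len(prefix) >= 3:
        return {w: (-0.1 if w == "</s>" else -4.0) for w in vocab}
    return {"beer": -1.8, "I": -2.1, "drink": -1.5, "wine": -2.5, "</s>": -6.0}

def beam_search(x, b=2, max_len=10):
    beam = [([], 0.0)]                              # (prefix, cumulative log prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == "</s>":     # finished hypotheses stay as-is
                candidates.append((prefix, score))
                continue
            for w, lp in next_word_logprobs(x, prefix).items():
                candidates.append((prefix + [w], score + lp))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]  # keep top b
        if all(p and p[-1] == "</s>" for p, _ in beam):
            break
    return beam

for prefix, score in beam_search("Bier trinke ich"):
    print(" ".join(prefix), f"logprob={score:.2f}")
```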

64 How well does this model do? [Table: perplexity on the 2011 and 2012 test sets for an order-5 Kneser-Ney Markov model (frequency estimation), an unconditional RNN LM, and the conditional RNN LM (RNN LM + x).]

65–72 How well does this model do? [Figures: example outputs produced by the conditional model; one example is glossed literally as "I will feel bad if you do not find a solution."]

73 Summary Conditional language modeling provides a convenient formulation for a lot of practical applications. Two big problems: model expressivity and decoding difficulties. Next time: a better encoder for vector-to-sequence models, attention for better learning, and lots of results on machine translation.

74 Questions?
