Chapter 3: Basics of Language Modelling


Motivation
Language models are used in:
- Speech Recognition
- Machine Translation
- Natural Language Generation
- Query completion
For research and development: we need a simple evaluation metric for fast turnaround times in experiments.

Test your Language Model: What's in your hometown newspaper ???

Test your Language Model: What's in your hometown newspaper today

Test your Language Model: It's raining cats and ???

Test your Language Model: It's raining cats and dogs

Test your Language Model: President Bill ???

Test your Language Model: President Bill Gates

Definition: Language Model
A language model either:
- predicts the next word w given a sequence of predecessor words h (the history): P(w_i | h_i) = P(w_i | w_1, ..., w_{i-1}), where i is the position in the text, or
- predicts the probability of a whole sequence of words: P(w_1, w_2, ..., w_N) = P(W_1^N).
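
To make the two readings concrete, here is a minimal sketch (not from the slides) tying them together via the chain rule; cond_prob is a made-up placeholder model that simply returns a uniform probability.

```python
VOCAB = ["to", "be", "or", "not"]

def cond_prob(w, h):
    """Hypothetical next-word model P(w | h); uniform over VOCAB for simplicity."""
    return 1.0 / len(VOCAB)

def sequence_prob(words):
    """P(w_1, ..., w_N) via the chain rule: product over P(w_i | w_1 ... w_{i-1})."""
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:i]))  # history = all preceding words
    return p

print(sequence_prob("to be or not to be".split()))  # (1/4)**6, about 0.000244
```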

Language Model in Speech Recognition (ASR)
Speech input → Feature Extraction → stream of feature vectors A → Search: Ŵ = argmax_W [P(A|W) · P(W)], combining the Acoustic Model P(A|W) and the Language Model P(W) → recognised word sequence Ŵ.

Language Model in Machine Translation (MT)
Source sentence W_F → Search: Ŵ_E = argmax_{W_E} [P(W_F|W_E) · P(W_E)], combining the Lexicon (translation) Model P(W_F|W_E) and the Language Model P(W_E) → translated word sequence Ŵ_E.
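
Both architectures share the same decision rule. The following sketch, with entirely invented candidate hypotheses and probabilities, shows how a search step might rescore hypotheses by combining the channel model and the language model in log space.

```python
import math

# Toy illustration of W^ = argmax_W [P(A|W) * P(W)]: each candidate carries a
# channel-model score (acoustic or translation model) and a language-model score.
candidates = {
    "president bill clinton": (1e-4, 1e-6),  # (channel prob, LM prob), made up
    "president bill gates":   (2e-4, 1e-8),
}

def score(channel_prob, lm_prob):
    # Combine the two knowledge sources in log space.
    return math.log(channel_prob) + math.log(lm_prob)

best = max(candidates, key=lambda w: score(*candidates[w]))
print(best)  # "president bill clinton": here the LM outweighs the channel model
```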

Sentence Completion
Directly uses P(w | h); needs good search algorithms.
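
As a rough illustration of direct use of P(w | h), here is a greedy completion sketch over an invented bigram table; a real system would use a proper search (e.g. beam search) rather than this greedy argmax.

```python
# Invented bigram table P(next word | previous word), for illustration only.
bigram = {
    "raining": {"cats": 0.4, "hard": 0.3, "again": 0.3},
    "cats":    {"and": 0.8, ".": 0.2},
    "and":     {"dogs": 0.5, "cats": 0.1, ".": 0.4},
}

def complete(history, steps=3):
    words = history.split()
    for _ in range(steps):
        nxt = bigram.get(words[-1])
        if not nxt:
            break
        words.append(max(nxt, key=nxt.get))  # greedy argmax_w P(w | h)
    return " ".join(words)

print(complete("it is raining"))  # it is raining cats and dogs
```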

The Markov Assumption: truncate the history
Cutting the history to M-1 words: P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-M+1}, ..., w_{i-1})
Special cases:
- Zerogram (uniform distribution): P(w_i | w_1, w_2, ..., w_{i-1}) = 1/|W|
- Unigram: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i)
- Bigram: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i | w_{i-1})
- Trigram: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
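
For the bigram case, a minimal maximum-likelihood sketch could look like this; the toy corpus, whitespace tokenisation, and the absence of any smoothing are assumptions of the sketch, not part of the slides.

```python
from collections import Counter

# Maximum-likelihood bigram (M = 2): P(w_i | w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1}).
corpus = "to be or not to be that is the question".split()

history_counts = Counter(corpus[:-1])                  # N(w_{i-1})
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))  # N(w_{i-1}, w_i)

def bigram_prob(w, h):
    return bigram_counts[(h, w)] / history_counts[h]

print(bigram_prob("be", "to"))  # 1.0: in this corpus "to" is always followed by "be"
print(bigram_prob("or", "be"))  # 0.5
```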

How would you measure the quality of an LM?

Definition of Perplexity
Let w_1, w_2, ..., w_N be an independent test corpus not used during training.
Perplexity: PP = P(w_1, w_2, ..., w_N)^(-1/N)
Interpretation: length-normalised (inverse) probability of the test corpus.

Example: Zerogram Language Model
Zerogram (uniform distribution): P(w_i | w_1, w_2, ..., w_{i-1}) = 1/|W|
PP = P(w_1, w_2, ..., w_N)^(-1/N) = (prod_{i=1}^{N} 1/|W|)^(-1/N) = |W|
Interpretation: perplexity is the average de-facto size of the vocabulary.
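
The zerogram result is easy to verify numerically. A small sketch, in which the test corpus and the vocabulary size are arbitrary placeholders:

```python
import math

# Perplexity PP = P(w_1, ..., w_N)^(-1/N), computed in log space for numerical
# stability. Under the zerogram P(w|h) = 1/|W| it should come out as |W|.
test_corpus = "the cat sat on the mat".split()
VOCAB_SIZE = 10  # assumed vocabulary size |W|, made up for this check

def zerogram_prob(w, h):
    return 1.0 / VOCAB_SIZE

def perplexity(words, cond_prob):
    log_p = sum(math.log(cond_prob(w, tuple(words[:i]))) for i, w in enumerate(words))
    return math.exp(-log_p / len(words))

print(perplexity(test_corpus, zerogram_prob))  # 10.0 (= |W|), up to floating point
```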

Alternate Formulation
Assumption: use an M-gram language model with history h_i = w_{i-M+1} ... w_{i-1}
Idea: collapse all identical histories.

Alternate Formulation
Example: M = 2, corpus: "to be or not to be"
h_2 = "to", w_2 = "be" and h_6 = "to", w_6 = "be" → calculate P(be | to) only once and scale it by 2.
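
One way to picture the collapsing step, as a sketch using the same toy corpus as the slide:

```python
from collections import Counter

# Collapse identical (history, word) events for M = 2: each distinct event
# (h, w) is evaluated once and weighted by its count N(w, h).
corpus = "to be or not to be".split()
events = list(zip(corpus[:-1], corpus[1:]))  # (h, w) pairs for positions 2..N
counts = Counter(events)

print(counts[("to", "be")])  # 2: P(be | to) is needed only once, scaled by 2
print(counts)
```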

Alternate Formulation
PP = P(w_1, w_2, ..., w_N)^(-1/N)
   = (prod_{i=1}^{N} P(w_i | h_i))^(-1/N)               (chain-rule / Bayes decomposition)
   = exp( -(1/N) · sum_{i=1}^{N} log P(w_i | h_i) )     (rewrite to get rid of the product)

Alternate Formulation
exp( -(1/N) · sum_{i=1}^{N} log P(w_i | h_i) )
 = exp( -(1/N) · sum_{w,h} N(w,h) · log P(w | h) )   with N(w,h): absolute frequency of the sequence h,w in the test corpus
 = exp( -sum_{w,h} f(w,h) · log P(w | h) )           with f(w,h) = N(w,h)/N: relative frequency of the sequence h,w in the test corpus

Alternate Formulation: Final Result
PP = P(w_1 ... w_N)^(-1/N) = exp( -sum_{w,h} f(w,h) · log P(w | h) )
Perplexity can thus be calculated from conditional probabilities estimated on the training corpus and relative frequencies taken from the test corpus.
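
A quick numerical check that the two formulations agree; the bigram probabilities and the test corpus below are placeholders invented for the sketch, and the first word (which has no bigram history) is ignored for simplicity.

```python
import math
from collections import Counter

# P is keyed by (h, w): P[("to", "be")] stands for P(be | to). Values are made up.
P = {("to", "be"): 0.9, ("be", "or"): 0.2, ("or", "not"): 0.5, ("not", "to"): 0.4}
test = "to be or not to be".split()
events = list(zip(test[:-1], test[1:]))  # (h, w) events
N = len(events)

# Direct form: product over positions, PP = (prod_i P(w_i|h_i))^(-1/N).
pp_direct = math.exp(-sum(math.log(P[e]) for e in events) / N)

# Alternate form: relative frequencies f(w,h) = N(w,h)/N over distinct events.
f = {e: c / N for e, c in Counter(events).items()}
pp_alt = math.exp(-sum(f[e] * math.log(P[e]) for e in f))

print(pp_direct, pp_alt)  # identical values
```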

Alternate Formulation: Zerogram Example
Zerogram (uniform distribution): P(w_i | w_1, w_2, ..., w_{i-1}) = 1/|W|
PP = exp( -sum_{w,h} f(w,h) · log P(w | h) ) = exp( -sum_{w,h} f(w,h) · log(1/|W|) ) = exp( log |W| · sum_{w,h} f(w,h) ) = exp( log |W| · 1 ) = |W|
Identical result.

Perplexity and Error Rate
Perplexity is correlated with word error rate; the relation follows a power law (Klakow and Peters, Speech Communication).

Quality Measure: Mean Rank
Definition: let w follow h in the test text.
- Sort all words after a given history h according to P(w | h)
- Determine the position (rank) of the correct word w
- Average over all events in the test text
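
A compact sketch of the mean-rank computation, over invented bigram probabilities and test events:

```python
# For each test event (h, w), sort the vocabulary by P(. | h) in descending
# order and record the 1-based position of the correct word w.
P = {
    "to": {"be": 0.7, "go": 0.2, "the": 0.1},  # made-up P(w | h) values
    "be": {"or": 0.5, "the": 0.3, "go": 0.2},
}
test_events = [("to", "be"), ("be", "the"), ("to", "the")]

def rank(h, w):
    ordered = sorted(P[h], key=P[h].get, reverse=True)
    return ordered.index(w) + 1  # position of the correct word

mean_rank = sum(rank(h, w) for h, w in test_events) / len(test_events)
print(mean_rank)  # (1 + 2 + 3) / 3 = 2.0
```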

Alternative: Average Rank
Task: guess the next word.
Metric: measure how many attempts it takes.
Source: Witten and Bell, Text Compression

Quality Measure: Mean Rank
Correlation with ASR error rate:
  Perplexity: 0.955
  Mean Rank:  0.957
Mean rank is equally good at predicting ASR performance.

Summary
- Language models predict the next word
- Applications: Speech Recognition, Machine Translation, Natural Language Generation, Information Retrieval, Sentence Completion
- Perplexity measures the quality of a language model