Chapter 3: Basics of Language Modeling

Section 3.1. Language Modeling in Automatic Speech Recognition (ASR)
All graphs in this section are from the book by Schukat-Talamazzini unless indicated otherwise: http://www.minet.uni-jena.de/fakultaet/schukat/mypub/schukattalamazzini95:asg.pdf

Complexity of Speech Recognition Applications (figure from Schukat-Talamazzini: Spracherkennung)

What makes Speech Recognition difficult (figure from Schukat-Talamazzini: Spracherkennung)

Milestones in ASR (figure from Schukat-Talamazzini: Spracherkennung)

Introduction to ASR (source: Steve Young, The HTK Book)

Basic Components of an ASR System
Speech input → Feature Extraction → stream of feature vectors A → Search → recognised word sequence Ŵ
The search combines the acoustic model P(A | W) and the language model P(W):
Ŵ = argmax_W P(A | W) · P(W)
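
The search equation can be sketched in a few lines of Python; `candidates`, `acoustic_logprob`, and `lm_logprob` are hypothetical placeholders for illustration only (a real recogniser searches a lattice of hypotheses rather than scoring a flat list):

```python
def decode(candidates, acoustic_logprob, lm_logprob):
    """Pick the word sequence W maximizing P(A|W) * P(W), i.e. the sum of
    log-probabilities; acoustic_logprob(W) and lm_logprob(W) stand in for
    the acoustic model and the language model respectively (assumed interface)."""
    return max(candidates, key=lambda W: acoustic_logprob(W) + lm_logprob(W))
```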

The Markov Assumption
Cutting the history to M-1 words: P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-M+1}, ..., w_{i-1})
Special cases:
Trigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
Bigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-1})
Unigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i)
Zerogram (uniform distribution): P(w_i | w_1, ..., w_{i-1}) ≈ 1/|W|
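
As a minimal sketch of the bigram special case (not from the slides), the conditional probabilities can be estimated from counts, P(w_i | w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1}); no smoothing is applied, so unseen bigrams get probability zero:

```python
from collections import Counter

def train_bigram(tokens):
    """Maximum-likelihood bigram model estimated from a token list."""
    history_counts = Counter(tokens[:-1])             # N(w_{i-1})
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # N(w_{i-1}, w_i)
    return {(h, w): c / history_counts[h] for (h, w), c in bigram_counts.items()}

# Toy corpus, also used in the "Alternate Formulation" example below
model = train_bigram("to be or not to be".split())
print(model[("to", "be")])  # 1.0 -- in this corpus "to" is always followed by "be"
```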

Section 3.2. A Measure for LM Quality: Perplexity

Motivation
Speech recognition can be computationally expensive.
For research and development we need a simple evaluation metric that allows fast turn-around times in experiments.

Test your Language Model: What's in your hometown newspaper ???

Test your Language Model: What's in your hometown newspaper today

Test your Language Model: It's raining cats and ???

Test your Language Model: It's raining cats and dogs

Test your Language Model: President Bill ???

Test your Language Model: President Bill Gates

Definition of Perplexity
Let w_1, w_2, ..., w_N be an independent test corpus not used during training.
Perplexity: PP = P(w_1, w_2, ..., w_N)^{-1/N}
(the normalized probability of the test corpus)
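
A minimal sketch of this definition, evaluated in log space to avoid numerical underflow; `cond_prob(history, word)` is an assumed interface returning P(w_i | h_i) for whatever language model is being evaluated:

```python
import math

def perplexity(test_tokens, cond_prob):
    """PP = P(w_1 ... w_N)^(-1/N), computed as exp(-1/N * sum log P(w_i | h_i))."""
    N = len(test_tokens)
    log_prob = sum(
        math.log(cond_prob(tuple(test_tokens[:i]), w))  # h_i = all preceding words
        for i, w in enumerate(test_tokens)
    )
    return math.exp(-log_prob / N)
```

With a zerogram model, `cond_prob = lambda h, w: 1.0 / V`, this returns exactly V, matching the interpretation on the next slide.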

Example: Zerogram Language Model
Zerogram (uniform distribution): P(w_i | w_1, ..., w_{i-1}) = 1/|W|
PP = P(w_1, w_2, ..., w_N)^{-1/N} = ( ∏_{i=1}^{N} 1/|W| )^{-1/N} = |W|
Interpretation: perplexity is the average de-facto size of the vocabulary.

Alternate Formulation
Assumption: use an M-gram language model with history h_i = w_{i-M+1} ... w_{i-1}
Idea: collapse all identical histories

Alternate Formulation
Example: M = 2, corpus: "to be or not to be"
h_2 = "to", w_2 = "be" and h_6 = "to", w_6 = "be"
→ calculate P(be | to) only once and scale it by 2

Alternate Formulation
PP = P(w_1, w_2, ..., w_N)^{-1/N}
   = ( ∏_{i=1}^{N} P(w_i | h_i) )^{-1/N}          (use Bayes decomposition)
   = exp( -1/N ∑_{i=1}^{N} log P(w_i | h_i) )      (get rid of the product)

Alternate Formulation
exp( -1/N ∑_{i=1}^{N} log P(w_i | h_i) )
   = exp( -1/N ∑_{w,h} N(w,h) log P(w | h) )
   = exp( -∑_{w,h} f(w,h) log P(w | h) )
N(w,h): absolute frequency of the sequence h,w in the test corpus
f(w,h) = N(w,h)/N: relative frequency of the sequence h,w in the test corpus

Alternate Formulation: Final Result
PP = P(w_1 ... w_N)^{-1/N} = exp( -∑_{w,h} f(w,h) log P(w | h) )
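
A sketch of this collapsed-histories formulation under the same assumed `cond_prob` interface as above: each distinct (history, word) pair is scored once and weighted by its relative frequency f(w,h), so in the "to be or not to be" example P(be | to) is looked up once with weight 2/6:

```python
import math
from collections import Counter

def perplexity_collapsed(test_tokens, cond_prob, M=2):
    """PP = exp( -sum_{w,h} f(w,h) * log P(w|h) ) for an M-gram model."""
    N = len(test_tokens)
    # N(w,h): absolute frequency of each (history, word) event in the test corpus
    event_counts = Counter(
        (tuple(test_tokens[max(0, i - M + 1):i]), w)
        for i, w in enumerate(test_tokens)
    )
    return math.exp(
        -sum(n / N * math.log(cond_prob(h, w)) for (h, w), n in event_counts.items())
    )
```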

Alternate Formulation: Zerogram Example
Zerogram (uniform distribution): P(w_i | w_1, ..., w_{i-1}) = 1/|W|
PP = exp( -∑_{w,h} f(w,h) log P(w | h) )
   = exp( -∑_{w,h} f(w,h) log(1/|W|) )
   = exp( log|W| · ∑_{w,h} f(w,h) )
   = exp( log|W| · 1 ) = |W|
Identical result.

Perplexity and Error Rate
Perplexity is correlated with the word error rate (power-law relation).
(Figure from Klakow and Peters, Speech Communication)

Quality Measure: Mean Rank
Definition: Let w follow h in the test text.
Sort all words after a given history h according to P(w_i | h).
Determine the position (rank) of the correct word w.
Average over all events in the test text.
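
A minimal sketch of the mean rank measure (same assumed `cond_prob` interface; `vocab` is the word list to sort, and ranking the full vocabulary for every event makes this O(|V| log |V|) per position):

```python
def mean_rank(test_tokens, cond_prob, vocab, M=2):
    """Average rank of the correct word when the vocabulary is sorted by
    P(w | h) after each history in the test text (rank 1 = best)."""
    ranks = []
    for i, w in enumerate(test_tokens):
        h = tuple(test_tokens[max(0, i - M + 1):i])
        ordered = sorted(vocab, key=lambda v: cond_prob(h, v), reverse=True)
        ranks.append(ordered.index(w) + 1)
    return sum(ranks) / len(ranks)
```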

Alternative: Average Rank (figure from Witten, Bell: Text Compression)

Quality Measure: Mean Rank

Measure        Perplexity   Mean Rank
Correlation    0.955        0.957

Quality Measure: Worst Score
Which fraction of the test text has a score worse than a cutoff S_c?
Correlation: 0.919
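
The slide only states the idea; the sketch below assumes that "score" means the conditional probability P(w_i | h_i) and that "worse than S_c" means falling below that cutoff (both are assumptions, using the same hypothetical `cond_prob` interface as before):

```python
def worst_score_fraction(test_tokens, cond_prob, cutoff, M=2):
    """Fraction of test events whose score P(w_i | h_i) is worse than the cutoff S_c."""
    scores = [
        cond_prob(tuple(test_tokens[max(0, i - M + 1):i]), w)
        for i, w in enumerate(test_tokens)
    ]
    return sum(s < cutoff for s in scores) / len(scores)
```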

Summary
Speech recognition: the basic task and the language model as one of its components
Perplexity: measures the quality of a language model