DT2118 Speech and Speaker Recognition


DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56

Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 2 / 56

Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 3 / 56

Components of ASR System [block diagram] Representation: Speech Signal → Spectral Analysis → Feature Extraction. Constraints / Knowledge: Acoustic Models, Lexical Models, Language Models. Decoder: Search and Match → Recognised Words. 4 / 56

Why do we need language models? Bayes rule: P(words | sounds) = P(sounds | words) P(words) / P(sounds) where P(words) is the a priori probability of the words (Language Model). We could use non-informative priors (P(words) = 1/N), but... 5 / 56
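
As an illustration (not part of the original slides), the sketch below shows how the language-model prior enters the decoding decision; all probabilities are made-up numbers for the ice cream / I scream ambiguity discussed on a later slide.

```python
import math

# Hypothetical scores for two competing transcriptions of the same audio.
# p_sounds_given_words is the acoustic likelihood P(sounds | words);
# p_words is the language-model prior P(words). All values are made up.
hypotheses = {
    "ice cream": {"p_sounds_given_words": 0.20, "p_words": 1e-4},
    "I scream":  {"p_sounds_given_words": 0.22, "p_words": 1e-6},
}

def log_posterior(h):
    # P(words | sounds) is proportional to P(sounds | words) * P(words);
    # P(sounds) is the same for all hypotheses and drops out of the argmax.
    return math.log(h["p_sounds_given_words"]) + math.log(h["p_words"])

best = max(hypotheses, key=lambda w: log_posterior(hypotheses[w]))
print(best)   # "ice cream": the LM prior breaks the near-tie in acoustics
```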

Branching Factor If we have N words in the dictionary, at every word boundary we have to consider N equally likely alternatives; N can be in the order of millions. [diagram: a word branching into word1, word2, ..., wordN] 6 / 56

Ambiguity ice cream vs I scream /ai s k ɹ i: m/ 7 / 56

Language Models in ASR We want to: 1. limit the branching factor in the recognition network 2. augment and complete the acoustic probabilities We are only interested in whether the sequence of words is grammatically plausible or not; this kind of grammar is integrated in the recognition network prior to decoding 8 / 56

Language Models in Dialogue Systems we want to assign a class to each word (noun, verb, attribute... parts of speech) parsing is usually performed on the output of a speech recogniser The grammar is used twice in a Dialogue System!! 9 / 56

Language Models in ASR small vocabulary: often formal grammar specified by hand example: loop of digits as in the HTK exercise large vocabulary: often stochastic grammar estimated from data 10 / 56

Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 11 / 56

Formal Language Theory grammar: formal specification of permissible structures for the language parser: algorithm that can analyse a sentence and determine if its structure is compliant with the grammar 12 / 56

Chomsky's formal grammar Noam Chomsky: linguist, philosopher,... 13 / 56

Chomsky's formal grammar Noam Chomsky: linguist, philosopher,... A grammar is defined as G = (V, T, P, S), where V: set of non-terminal constituents; T: set of terminals (lexical items); P: set of production rules; S: start symbol 13 / 56

Example S = sentence V = {NP (noun phrase), NP1, VP (verb phrase), NAME, ADJ, V (verb), N (noun)} T = {Mary, person, loves, that,... } P = {S → NP VP, NP → NAME, NP → ADJ NP1, NP1 → N, VP → V NP, NAME → Mary, V → loves, N → person, ADJ → that} [parse tree for "Mary loves that person": S splits into NP (NAME: Mary) and VP (V: loves, NP (ADJ: that, NP1 (N: person)))] 14 / 56

Chomsky's hierarchy Greek letters: sequence of terminals or non-terminals; upper-case Latin letters: single non-terminal; lower-case Latin letters: single terminal
Type | Constraints | Automata
Phrase structure grammar | α → β (this is the most general grammar) | Turing machine
Context-sensitive grammar | length of α ≤ length of β | Linear bounded automaton
Context-free grammar | A → β, equivalent to A → w, A → BC | Push-down automaton
Regular grammar | A → w, A → wB | Finite-state automaton
Context-free and regular grammars are used in practice 15 / 56

Are languages context-free? Mostly true, with exceptions Swiss German:... das mer d chind em Hans es huus lönd häfte aastriiche Word-by-word:... that we the children Hans the house let help paint Translation:... that we let the children help Hans paint the house 16 / 56

Parsers assign each word in a sentence to a part of speech; originally developed for programming languages (no ambiguities); only available for context-free and regular grammars. Top-down: start with S and generate rules until you reach the words (terminal symbols). Bottom-up: start with the words and work your way up until you reach S. 17 / 56

Example: Top-down parser
Parts of speech | Rules
S |
NP VP | S → NP VP
NAME VP | NP → NAME
Mary VP | NAME → Mary
Mary V NP | VP → V NP
Mary loves NP | V → loves
Mary loves ADJ NP1 | NP → ADJ NP1
Mary loves that NP1 | ADJ → that
Mary loves that N | NP1 → N
Mary loves that person | N → person
18 / 56
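
As a rough illustration (not part of the original slides), here is a minimal recursive-descent (top-down) parser in Python for the toy grammar above; the dictionary encoding of the grammar and the generator-based backtracking are implementation choices of this sketch.

```python
# A minimal top-down (recursive-descent) parser for the toy grammar on the
# slides, accepting "Mary loves that person". As in the derivation above,
# NP -> NAME is tried before NP -> ADJ NP1.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["NAME"], ["ADJ", "NP1"]],
    "NP1":  [["N"]],
    "VP":   [["V", "NP"]],
    "NAME": [["Mary"]],
    "V":    [["loves"]],
    "N":    [["person"]],
    "ADJ":  [["that"]],
}

def parse(symbol, words, pos):
    """Try to derive words[pos:...] from symbol.
    Yields every position reached after a successful derivation."""
    if symbol not in GRAMMAR:                 # terminal symbol
        if pos < len(words) and words[pos] == symbol:
            yield pos + 1
        return
    for expansion in GRAMMAR[symbol]:         # try each production in turn
        positions = [pos]
        for sub in expansion:
            positions = [q for p in positions for q in parse(sub, words, p)]
        for q in positions:
            yield q

sentence = "Mary loves that person".split()
ok = any(end == len(sentence) for end in parse("S", sentence, 0))
print(ok)  # True: the sentence is derivable from S
```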

Example: Bottom-up parser
Parts of speech | Rules
Mary loves that person |
NAME loves that person | NAME → Mary
NAME V that person | V → loves
NAME V ADJ person | ADJ → that
NAME V ADJ N | N → person
NP V ADJ N | NP → NAME
NP V ADJ NP1 | NP1 → N
NP V NP | NP → ADJ NP1
NP VP | VP → V NP
S | S → NP VP
19 / 56

Top-down vs bottom-up parsers Top-down characteristics: + very predictive + only consider grammatical combinations - predict constituents that do not have a match in the text Bottom-up characteristics: + check input text only once + suitable for robust language processing - may build trees that do not lead to a full parse All in all, similar performance 20 / 56

Chart parsing (dynamic programming) For the sentence "Mary loves that person", the chart is filled with completed constituents indexed by word spans (e.g. NAME[1,1] over "Mary", V[2,2] over "loves"), together with the rules that produced them: NAME → Mary, NP → NAME, V → loves, ADJ → that, N → person, NP1 → N, NP → ADJ NP1, VP → V NP, S → NP VP [chart diagram over the words Mary loves that person] 21 / 56

Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 22 / 56

Stochastic Language Models (SLM) 1. formal grammars lack coverage (for general domains) 2. spoken language does not strictly follow the grammar Model sequences of words statistically: P(W) = P(w_1 w_2 ... w_n) 23 / 56

Probabilistic Context-free grammars (PCFGs) Assign probabilities to generative rules: P(A → α | G) Then calculate the probability of generating a word sequence w_1 w_2 ... w_n as the probability of the rules necessary to go from S to w_1 w_2 ... w_n: P(S ⇒ w_1 w_2 ... w_n | G) 24 / 56

Training PCFGs If annotated corpus, Maximum Likelihood estimate: P(A → α_j) = C(A → α_j) / Σ_{i=1}^{m} C(A → α_i) If non-annotated corpus: inside-outside algorithm (similar to HMM training, forward-backward) 25 / 56
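
A small sketch of the maximum-likelihood estimate above, assuming we already have the list of rules read off an annotated corpus (the rule list here is hypothetical):

```python
from collections import Counter

# Sketch of the ML estimate for PCFG rule probabilities from an annotated
# (treebank) corpus. Here the "corpus" is just a made-up list of rule
# occurrences read off hand-annotated parse trees.
observed_rules = [
    ("S", ("NP", "VP")),
    ("NP", ("NAME",)), ("NP", ("ADJ", "NP1")), ("NP", ("NAME",)),
    ("VP", ("V", "NP")), ("VP", ("V", "NP")),
]

rule_counts = Counter(observed_rules)
lhs_counts = Counter(lhs for lhs, _ in observed_rules)

def rule_prob(lhs, rhs):
    # P(A -> alpha_j) = C(A -> alpha_j) / sum_i C(A -> alpha_i)
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("NP", ("NAME",)))       # 2/3
print(rule_prob("NP", ("ADJ", "NP1")))  # 1/3
```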

Independence assumption [parse tree for "Mary loves that person": S → NP VP; NP → NAME → Mary; VP → V NP; V → loves; NP → ADJ NP1; ADJ → that; NP1 → N → person] 26 / 56

Inside-outside probabilities Chomsky's normal form: A_i → A_m A_n or A_i → w_l inside(s, A_i, t) = P(A_i ⇒ w_s w_{s+1} ... w_t) outside(s, A_i, t) = P(S ⇒ w_1 ... w_{s-1} A_i w_{t+1} ... w_T) [diagram: S spans w_1 ... w_T, the subtree rooted in A_i spans w_s ... w_t] 27 / 56

Probabilistic Context-free grammars: limitations Probabilities help sorting alternative explanations, but there is still a problem with coverage: the production rules P(A → α | G) are hand-made 28 / 56

N-gram Language Models Flat model: no hierarchical structure
P(W) = P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, w_2, ..., w_{n-1}) = Π_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})
Approximations:
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i) (Unigram)
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-1}) (Bigram)
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1}) (Trigram)
P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-N+1}, ..., w_{i-1}) (N-gram)
29 / 56

Example (Bigram) P(Mary, loves, that, person) = P(Mary | <s>) P(loves | Mary) P(that | loves) P(person | that) P(</s> | person) 30 / 56

N-gram estimation (Maximum Likelihood)
P(w_i | w_{i-N+1}, ..., w_{i-1}) = C(w_{i-N+1}, ..., w_{i-1}, w_i) / C(w_{i-N+1}, ..., w_{i-1}) = C(w_{i-N+1}, ..., w_{i-1}, w_i) / Σ_{w_i} C(w_{i-N+1}, ..., w_{i-1}, w_i)
(the numerator counts an N-word sequence, the denominator its N-1 word history)
Problem: data sparseness
31 / 56

N-gram estimation example
Corpus: 1: John read her book 2: I read a different book 3: John read a book by Mulan
P(John | <s>) = C(<s>, John) / C(<s>) = 2/3
P(read | John) = C(John, read) / C(John) = 2/2
P(a | read) = C(read, a) / C(read) = 2/3
P(book | a) = C(a, book) / C(a) = 1/2
P(</s> | book) = C(book, </s>) / C(book) = 2/3
P(John, read, a, book) = P(John | <s>) P(read | John) P(a | read) P(book | a) P(</s> | book) = 0.148
P(Mulan, read, a, book) = 0, since P(Mulan | <s>) = 0
32 / 56
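
The numbers on this slide can be reproduced with a few lines of Python; this is only an illustrative sketch, with <s> and </s> handled as on the slide.

```python
from collections import Counter

corpus = [
    "John read her book",
    "I read a different book",
    "John read a book by Mulan",
]

# Collect bigram and history counts, padding each sentence with <s> and </s>.
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])               # history counts C(w_{i-1})
    bigrams.update(zip(words[:-1], words[1:]))

def p_ml(w, prev):
    # Maximum-likelihood bigram estimate P(w | prev) = C(prev, w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, w in zip(words[:-1], words[1:]):
        prob *= p_ml(w, prev)
    return prob

print(p_ml("John", "<s>"))                # 2/3
print(sentence_prob("John read a book"))  # ~0.148
print(p_ml("Mulan", "<s>"))               # 0.0, so the Mulan sentence gets probability 0
```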

N-gram Smoothing Problem: many perfectly possible word sequences may have been observed zero times or only a few times in the training data. This leads to extremely low probabilities, effectively disabling the word sequence no matter how strong the acoustic evidence is. Solution: smoothing produces more robust probabilities for unseen data, at the cost of modelling the training data slightly worse 33 / 56

Simplest Smoothing technique Instead of the ML estimate
P(w_i | w_{i-N+1}, ..., w_{i-1}) = C(w_{i-N+1}, ..., w_{i-1}, w_i) / Σ_{w_i} C(w_{i-N+1}, ..., w_{i-1}, w_i)
use
P(w_i | w_{i-N+1}, ..., w_{i-1}) = (1 + C(w_{i-N+1}, ..., w_{i-1}, w_i)) / Σ_{w_i} (1 + C(w_{i-N+1}, ..., w_{i-1}, w_i))
This prevents zero probabilities, but still gives very low probabilities
34 / 56

N-gram simple smoothing example
Corpus: 1: John read her book 2: I read a different book 3: John read a book by Mulan
P(John | <s>) = (1 + C(<s>, John)) / (11 + C(<s>)) = 3/14
P(read | John) = (1 + C(John, read)) / (11 + C(John)) = 3/13
...
P(Mulan | <s>) = (1 + C(<s>, Mulan)) / (11 + C(<s>)) = 1/14
P(John, read, a, book) = P(John | <s>) P(read | John) P(a | read) P(book | a) P(</s> | book) = 0.00035 (was 0.148)
P(Mulan, read, a, book) = P(Mulan | <s>) P(read | Mulan) P(a | read) P(book | a) P(</s> | book) = 0.000084 (was 0)
35 / 56
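
A corresponding sketch for add-one smoothing on the same toy corpus; the vocabulary size of 11 includes <s> and </s>, as on the slide.

```python
from collections import Counter

corpus = [
    "John read her book",
    "I read a different book",
    "John read a book by Mulan",
]

bigrams, unigrams, vocab = Counter(), Counter(), set()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    vocab.update(words)
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

V = len(vocab)                                # 11 word types, incl. <s> and </s>

def p_add_one(w, prev):
    # Add-one (Laplace) smoothing: (1 + C(prev, w)) / (V + C(prev))
    return (1 + bigrams[(prev, w)]) / (V + unigrams[prev])

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, w in zip(words[:-1], words[1:]):
        prob *= p_add_one(w, prev)
    return prob

print(p_add_one("John", "<s>"))                      # 3/14
print(p_add_one("Mulan", "<s>"))                     # 1/14: nonzero, unlike the ML estimate
print(round(sentence_prob("John read a book"), 5))   # ~0.00035, down from 0.148
```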

Interpolation vs Backoff smoothing Interpolation models: Linear combination with lower order n-grams Modifies the probabilities of both nonzero and zero count n-grams Backoff models: Use lower order n-grams when the requested n-gram has zero or very low count in the training data Nonzero count n-grams are unchanged Discounting: Reduce the probability of seen n-grams and distribute among unseen ones 36 / 56

Interpolation vs Backoff smoothing
Interpolation models: P_smooth(w_i | w_{i-N+1}, ..., w_{i-1}) = λ P_ML(w_i | w_{i-N+1}, ..., w_{i-1}) + (1 - λ) P_smooth(w_i | w_{i-N+2}, ..., w_{i-1}) (the first term has order N, the second order N-1)
Backoff models: P_smooth(w_i | w_{i-N+1}, ..., w_{i-1}) =
  α P(w_i | w_{i-N+1}, ..., w_{i-1}) if C(w_{i-N+1}, ..., w_{i-1}, w_i) > 0
  γ P_smooth(w_i | w_{i-N+2}, ..., w_{i-1}) if C(w_{i-N+1}, ..., w_{i-1}, w_i) = 0
37 / 56
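
An illustrative sketch of the two families on a toy bigram model; the interpolation weight lam and the backoff weights alpha and gamma are arbitrary placeholders here, whereas real backoff models choose them so that the distribution still normalises.

```python
from collections import Counter

# Toy corpus and counts; everything below is a hypothetical example.
corpus = "john read a book . john read a different book .".split()
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
unigrams = Counter(corpus)
N = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / N

def p_ml_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(w, prev, lam=0.7):
    # Interpolation: always mix the bigram ML estimate with the unigram.
    return lam * p_ml_bigram(w, prev) + (1 - lam) * p_unigram(w)

def p_backoff(w, prev, alpha=0.9, gamma=0.1):
    # Backoff: use the (scaled) bigram only if it was seen, otherwise
    # fall back to the (scaled) unigram.
    if bigrams[(prev, w)] > 0:
        return alpha * p_ml_bigram(w, prev)
    return gamma * p_unigram(w)

print(p_interpolated("book", "a"), p_backoff("book", "a"))   # seen bigram
print(p_interpolated("john", "a"), p_backoff("john", "a"))   # unseen bigram
```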

Deleted interpolation smoothing Recursively interpolate with n-grams of lower order: if history_n = w_{i-n+1}, ..., w_{i-1}
P_I(w_i | history_n) = λ_{history_n} P(w_i | history_n) + (1 - λ_{history_n}) P_I(w_i | history_{n-1})
It is hard to estimate λ_{history_n} for every history: cluster the histories into a moderate number of weights 38 / 56

Backoff smoothing Use P(w_i | history_{n-1}) only if you lack data for P(w_i | history_n) 39 / 56

Good-Turing estimate Partition n-grams into groups depending on their frequency in the training data Change the number of occurrences of an n-gram according to r* = (r + 1) n_{r+1} / n_r where r is the occurrence count and n_r is the number of n-grams that occur r times 40 / 56
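
A minimal sketch of the Good-Turing adjustment, assuming a hypothetical table of bigram counts; real implementations additionally smooth the count-of-counts n_r (e.g. Simple Good-Turing) before applying the formula.

```python
from collections import Counter

# Good-Turing adjusted counts r* = (r + 1) * n_{r+1} / n_r, computed from
# made-up bigram counts. n_r is the number of distinct n-grams seen exactly
# r times.
ngram_counts = Counter({
    ("a", "book"): 3, ("the", "book"): 3,
    ("a", "cat"): 2, ("a", "dog"): 2, ("the", "dog"): 2,
    ("old", "dog"): 1, ("red", "cat"): 1, ("big", "book"): 1, ("my", "cat"): 1,
})

n = Counter(ngram_counts.values())   # count of counts: n[r]

def good_turing(r):
    # Undefined when n_r or n_{r+1} is zero; that is what the smoothing of
    # n_r mentioned above is for.
    return (r + 1) * n[r + 1] / n[r]

for r in (1, 2):
    print(r, "->", round(good_turing(r), 3))
# r=1 -> 2 * n_2 / n_1 = 2 * 3 / 4 = 1.5
# r=2 -> 3 * n_3 / n_2 = 3 * 2 / 3 = 2.0
```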

Katz smoothing based on Good-Turing: combine higher and lower order n-grams For every N-gram: 1. if count r is large (> 5 or 8), do not change it 2. if count r is small but non-zero, discount with r* 3. if count r = 0, reassign the discounted counts with the lower order N-gram: C*(w_{i-1}, w_i) = α(w_{i-1}) P(w_i) 41 / 56

Kneser-Ney smoothing: motivation Background Problem: Lower order n-grams are often used as a backoff model if the count of a higher-order n-gram is too low (e.g. unigram instead of bigram). Some words with relatively high unigram probability only occur in a few bigrams, e.g. Francisco, which is mainly found in San Francisco. However, infrequent word pairs, such as New Francisco, will be given too high a probability if the unigram probabilities of New and Francisco are used. Maybe, instead, the Francisco unigram should have a lower value to prevent it from occurring in other contexts. I can't see without my reading... 42 / 56

Kneser-Ney intuition If a word has been seen in many contexts it is more likely to be seen in new contexts as well. Instead of backing off to the lower order n-gram, use the continuation probability. Example: instead of the unigram P(w_i), use
P_CONTINUATION(w_i) = |{w_{i-1} : C(w_{i-1} w_i) > 0}| / Σ_{w_i} |{w_{i-1} : C(w_{i-1} w_i) > 0}|
I can't see without my reading... glasses 43 / 56
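
A sketch of the continuation probability on a hypothetical toy corpus, constructed so that a word with a high raw count ("francisco") still gets a low continuation probability because it only ever occurs in one context.

```python
from collections import Counter, defaultdict

# Continuation probability: how many distinct left contexts each word has
# been seen with, normalised over all words.
corpus = ("san francisco san francisco san francisco "
          "new york old york a glass my glass her glass").split()

left_contexts = defaultdict(set)
for prev, w in zip(corpus[:-1], corpus[1:]):
    left_contexts[w].add(prev)

total_context_types = sum(len(ctx) for ctx in left_contexts.values())

def p_continuation(w):
    # |{w' : C(w' w) > 0}| / sum_w |{w' : C(w' w) > 0}|
    return len(left_contexts[w]) / total_context_types

unigrams = Counter(corpus)
print(unigrams["francisco"], unigrams["glass"])       # 3 3: same raw count
print(p_continuation("francisco"), p_continuation("glass"))
# "glass" gets a higher continuation probability: it follows 3 different words,
# while "francisco" only ever follows "san".
```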

Class N-grams 1. Group words into semantic or grammatical classes 2. Build n-grams for class sequences: P(w_i | c_{i-N+1} ... c_{i-1}) = P(w_i | c_i) P(c_i | c_{i-N+1} ... c_{i-1}) Rapid adaptation, small training sets, small models; works on limited domains; classes can be rule-based or data-driven 44 / 56
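
A minimal sketch of the class-bigram factorisation; the word-to-class mapping and all probabilities below are hypothetical.

```python
# Class bigram: P(w_i | c_{i-1}) = P(w_i | c_i) * P(c_i | c_{i-1}).
word_class = {"monday": "DAY", "tuesday": "DAY", "three": "NUM", "four": "NUM"}

p_word_given_class = {("monday", "DAY"): 0.5, ("tuesday", "DAY"): 0.5,
                      ("three", "NUM"): 0.5, ("four", "NUM"): 0.5}
# P(current class | previous class), keyed as (current, previous)
p_class_given_class = {("DAY", "NUM"): 0.6, ("NUM", "DAY"): 0.3,
                       ("DAY", "DAY"): 0.1, ("NUM", "NUM"): 0.1}

def class_bigram_prob(w, prev_w):
    c, prev_c = word_class[w], word_class[prev_w]
    return p_word_given_class[(w, c)] * p_class_given_class[(c, prev_c)]

# "monday" after "three": P(monday | DAY) * P(DAY | NUM)
print(class_bigram_prob("monday", "three"))   # 0.5 * 0.6 = 0.3
```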

Combining PCFGs and N-grams Only N-grams: Meeting at three with Zhou Li / Meeting at four PM with Derek: P(Zhou | three, with) and P(Derek | PM, with) N-grams + CFGs: Meeting {at three: TIME} with {Zhou Li: NAME} / Meeting {at four PM: TIME} with {Derek: NAME}: P(NAME | TIME, with) 45 / 56

Adaptive Language Models The conversational topic is not stationary, but it is stationary over some period of time: build more specialised models that can adapt in time. Techniques: Cache Language Models, Topic-Adaptive Models, Maximum Entropy Models 46 / 56

Cache Language Models 1. build a full static n-gram model 2. during conversation accumulate low order n-grams 3. interpolate between 1 and 2 47 / 56
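
A possible sketch of steps 1-3, with a static unigram table standing in for the full static n-gram model; the cache size, the interpolation weight and all probabilities are made-up choices.

```python
from collections import Counter, deque

# A minimal cache LM sketch: a fixed "static" unigram distribution is
# interpolated with a unigram estimated from the most recent words of the
# conversation.
static_unigram = {"the": 0.05, "network": 0.001, "cat": 0.002}
DEFAULT = 1e-6                      # floor for words missing from the table

class CacheLM:
    def __init__(self, cache_size=100, lam=0.8):
        self.cache = deque(maxlen=cache_size)
        self.lam = lam

    def observe(self, word):
        # Called with each recognised word as the conversation proceeds.
        self.cache.append(word)

    def prob(self, word):
        p_static = static_unigram.get(word, DEFAULT)
        counts = Counter(self.cache)
        p_cache = counts[word] / len(self.cache) if self.cache else 0.0
        return self.lam * p_static + (1 - self.lam) * p_cache

lm = CacheLM()
print(lm.prob("network"))           # before any context: just the static model
for w in "the neural network converged on the network data".split():
    lm.observe(w)
print(lm.prob("network"))           # boosted: "network" is frequent in the cache
```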

Topic-Adaptive Models 1. cluster documents into topics (manually or data-driven) 2. use information retrieval techniques with current recognition output to select the right cluster 3. if off-line run recognition in several passes 48 / 56

Maximum Entropy Models Instead of linear combination: 1. reformulate information sources into constraints 2. choose maximum entropy distribution that satisfies the constraints 49 / 56

Maximum Entropy Models Instead of linear combination: 1. reformulate information sources into constraints 2. choose the maximum entropy distribution that satisfies the constraints Constraints, general form: Σ_X P(X) f_i(X) = E_i Example: unigram feature f_{w_i} = 1 if w = w_i, 0 otherwise 49 / 56

Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 50 / 56

Language Model Evaluation Evaluation in combination with Speech Recogniser hard to separate contribution of the two Evaluation based on probabilities assigned to text in the training and test set 51 / 56

Information, Entropy, Perplexity Information: I(x_i) = log(1 / P(x_i)) Entropy: H(X) = E[I(X)] = - Σ_i P(x_i) log P(x_i) Perplexity: PP(X) = 2^{H(X)} 52 / 56

Perplexity of a model We do not know the true distribution p(w_1, ..., w_n), but we have a model m(w_1, ..., w_n). The cross-entropy is: H(p, m) = - Σ_{w_1,...,w_n} p(w_1, ..., w_n) log m(w_1, ..., w_n) Cross-entropy is an upper bound on the entropy: H ≤ H(p, m) The better the model, the lower the cross-entropy and the lower the perplexity (on the same data) 53 / 56

Test-set Perplexity Estimate the distribution p(w_1, ..., w_n) on the training data, then evaluate it on the test data: H = - Σ_{w_1,...,w_n ∈ test set} p(w_1, ..., w_n) log p(w_1, ..., w_n), PP = 2^H 54 / 56
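
A minimal sketch of test-set perplexity as it is usually computed in practice, via the per-word cross-entropy of the model on held-out text and then PP = 2^H; the toy corpora, the unigram model and the add-one smoothing are assumptions of this sketch, not part of the slides.

```python
import math
from collections import Counter

# Train a smoothed unigram model on one toy text, evaluate it on another.
train = "the cat sat on the mat the dog sat on the rug".split()
test = "the cat sat on the rug".split()

counts = Counter(train)
V = len(counts)
N_train = len(train)

def p(w):
    # add-one smoothed unigram so unseen test words keep nonzero probability
    return (counts[w] + 1) / (N_train + V)

# per-word cross-entropy (base 2) of the model on the test data
H = -sum(math.log2(p(w)) for w in test) / len(test)
perplexity = 2 ** H
print(round(H, 3), round(perplexity, 3))
```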

Perplexity and branching factor Perplexity is roughly the geometric mean of the branching factor [diagram: a word branching into word1, word2, ..., wordN] Shannon: 2.39 for English letters and 130 for English words Digit strings: 10 N-gram English: 50-1000 Wall Street Journal test set: 180 (bigram), 91 (trigram) 55 / 56

Performance of N-grams
Model | Perplexity | Word Error Rate
Unigram Katz | 1196.45 | 14.85%
Unigram Kneser-Ney | 1199.59 | 14.86%
Bigram Katz | 176.31 | 11.38%
Bigram Kneser-Ney | 176.11 | 11.34%
Trigram Katz | 95.19 | 9.69%
Trigram Kneser-Ney | 91.47 | 9.60%
Wall Street Journal database; Dictionary: 60 000 words; Training set: 260 000 000 words
56 / 56