Statistical Machine Translation


Marcello Federico, FBK-irst, Trento, Italy
Galileo Galilei PhD School, University of Pisa, Pisa, 7-19 May 2008

Part V: Language Modeling

Outline:
- Comparing ASR and statistical MT
- N-gram LMs
- Perplexity
- Frequency smoothing
- LM representation
- Efficient handling of huge LMs

Fundamental Equation of ASR

Let x be a sequence of acoustic observations. The most probable transcription w is found through the following statistical decision criterion:

    w^* = \arg\max_w \Pr(x \mid w) \, \Pr(w)    (1)

The computational problems of ASR:
- language modeling: estimating the language model probability Pr(w)
- acoustic modeling: estimating the acoustic model probability Pr(x | w)
- search problem: carrying out the optimization criterion (1)

The acoustic model is defined by introducing a hidden alignment variable s:

    \Pr(x \mid w) = \sum_s \Pr(x, s \mid w)

corresponding to state sequences of a Markov model generating x and s.

Fundamental Equation of SMT

Let f be any text in a foreign source language. The most probable translation into English is searched among texts e in the target language by:

    e^* = \arg\max_e \Pr(f \mid e) \, \Pr(e)    (2)

The computational problems of SMT:
- language modeling: estimating the language model probability Pr(e)
- translation modeling: estimating the translation model probability Pr(f | e)
- search problem: carrying out the optimization criterion (2)

The translation model is defined in terms of a hidden alignment variable a:

    \Pr(f \mid e) = \sum_a \Pr(f, a \mid e)

which maps source to target positions, with a multinomial process for f and a.
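As a concrete illustration of decision criterion (2), here is a minimal sketch (not an actual decoder) that scores a handful of candidate translations with a toy language-model score and a toy translation-model score and keeps the arg max. The candidate sentences, the lexicon, and both scoring functions are invented for illustration only.

```python
# Toy illustration of Eq. (2): e* = arg max_e Pr(f | e) Pr(e), in log space.
def log_lm_score(e):
    # stand-in for log Pr(e): a crude preference for shorter sentences
    return -1.5 * len(e.split())

def log_tm_score(f, e):
    # stand-in for log Pr(f | e): word coverage according to a toy lexicon
    lexicon = {"casa": "house", "blanca": "white"}
    e_words = set(e.split())
    return sum(0.0 if lexicon.get(w) in e_words else -5.0 for w in f.split())

def decode(f, candidates):
    # arg max over e of log Pr(f | e) + log Pr(e)
    return max(candidates, key=lambda e: log_tm_score(f, e) + log_lm_score(e))

print(decode("casa blanca", ["white house", "the white house", "house"]))
# -> "white house"
```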

ASR/MT Architecture

[Architecture diagram: source input, preprocessing, search driven by the LM (estimated from target texts) and the AM/TM (estimated from parallel data), postprocessing, target output.]

Parallel data are weakly aligned observations of source and target symbols. In other words, we do not need to observe the hidden variables s or a.

N-gram LMs

The purpose of LMs is to compute the probability Pr(w_1^T) of any sequence of words w_1^T = w_1, ..., w_t, ..., w_T. This probability can be expressed as:

    \Pr(w_1^T) = \Pr(w_1) \prod_{t=2}^{T} \Pr(w_t \mid h_t)    (3)

where h_t = w_1, ..., w_{t-1} denotes the history of word w_t. Pr(w_t | h_t) becomes difficult to estimate as the history h_t grows, so we approximate by defining equivalence classes on histories h_t. The n-gram approximation lets each word depend only on the most recent n-1 words:

    h_t \approx w_{t-n+1} \ldots w_{t-1}    (4)
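A minimal sketch of Eqs. (3)-(4) for n = 2: probabilities are estimated by relative frequency from an invented two-sentence corpus, with <s> and </s> standing in for sentence boundaries, and no smoothing is applied.

```python
# Pr(w_1^T) ~= prod_t Pr(w_t | w_{t-1}) with relative-frequency estimates.
from collections import Counter

corpus = [["<s>", "hello", "world", "</s>"],
          ["<s>", "hello", "pisa", "</s>"]]          # invented toy data

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((u, v) for sent in corpus for u, v in zip(sent, sent[1:]))

def prob(sentence):
    # start from <s>, so the first factor plays the role of Pr(w_1)
    p = 1.0
    for u, v in zip(sentence, sentence[1:]):
        p *= bigrams[(u, v)] / unigrams[u]           # Pr(w_t | h_t), h_t = w_{t-1}
    return p

print(prob(["<s>", "hello", "world", "</s>"]))       # 0.5
```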

Normalization Requirement

    \sum_{T=1}^{\infty} \Pr(T) \sum_{w_1 \ldots w_T} \Pr(w_1, \ldots, w_T \mid T) = 1

N-gram LMs guarantee that probabilities sum to one for a given length T:

    \sum_{w_1 \ldots w_T} \prod_{t=1}^{T} \Pr(w_t \mid h_t)
      = \sum_{w_1} \Pr(w_1) \sum_{w_2} \Pr(w_2 \mid h_2) \cdots \sum_{w_{T-1}} \Pr(w_{T-1} \mid h_{T-1}) \underbrace{\sum_{w_T} \Pr(w_T \mid h_T)}_{=1}
      = \sum_{w_1} \Pr(w_1) \sum_{w_2} \Pr(w_2 \mid h_2) \cdots \underbrace{\sum_{w_{T-1}} \Pr(w_{T-1} \mid h_{T-1})}_{=1} \cdot 1
      = \ldots
      = \underbrace{\sum_{w_1} \Pr(w_1)}_{=1} \cdot 1 \cdots 1 = 1    (5)

String Length Model

Hence we just need a length model Pr(T). Some examples:
- Exponential model: p(T) = (a-1) a^{-T} for any a > 1. Bad: it favours short strings, which are already rewarded by the n-gram product.
- Uniform model: p(T) = 1/K(f) for lengths up to K(f), and 0 otherwise. Good: for a given input f, K is constant and can be disregarded. Shorter sentences are favoured anyway.
- A word insertion model is also added to reward each added word, via the feature function h(e, f, a) = exp(|e|).
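A small numerical check of Eq. (5): for an invented bigram model whose conditional distributions each sum to one, the probabilities of all strings of a fixed length T also sum to one.

```python
# Enumerate all length-T strings over a toy vocabulary and verify Eq. (5).
from itertools import product

V = ["a", "b"]
p_first = {"a": 0.3, "b": 0.7}                        # Pr(w_1)
p_cond = {"a": {"a": 0.6, "b": 0.4},                  # Pr(w_t | w_{t-1})
          "b": {"a": 0.1, "b": 0.9}}

def prob(ws):
    p = p_first[ws[0]]
    for u, v in zip(ws, ws[1:]):
        p *= p_cond[u][v]
    return p

T = 4
total = sum(prob(ws) for ws in product(V, repeat=T))
print(total)                                          # 1.0 up to floating point
```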

How to Measure LM Quality

LMs for ASR are evaluated with respect to:
- their impact on recognition accuracy (word error rate)
- their capability of predicting words in a text (perplexity)

The perplexity measure (PP) is defined as follows:

    PP = 2^{LP}    where    LP = -\frac{1}{M} \log_2 \hat{\Pr}(w_1^M)    (6)

where w_1^M = w_1 ... w_M is a sufficiently long test sample and \hat{\Pr}(w_1^M) is its probability computed with a given stochastic LM. According to basic information theory, perplexity indicates that the prediction task of the LM is as difficult as guessing a word among PP equally likely words. Example: guessing random digits has PP = 10.

N-gram LMs and Data Sparseness

Even estimating n-gram probabilities is not a trivial task:
- high number of parameters: e.g. a 3-gram LM with a vocabulary of 1,000 words requires, in principle, estimating 10^9 probabilities!
- data sparseness of real texts: most correct n-grams are rare events

Test samples will contain many never-observed n-grams: assigning any of them probability zero drives PP to infinity! Experimentally, in the 1.2M-word (million word) Lancaster-Oslo-Bergen corpus:
- more than 20% of bigrams and 60% of trigrams occur only once
- 85% of trigrams occur fewer than five times
- the expected chance of finding a new 2-gram is 22%, and of finding a new 3-gram is 65%

We need frequency smoothing, or discounting!
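A minimal sketch of Eq. (6): given the per-word probabilities an LM assigns to a test sample, compute LP and PP. The probabilities below are placeholders; using 1/10 for every word reproduces the random-digit example with PP = 10.

```python
# PP = 2^LP with LP = -(1/M) log2 Pr(w_1^M), using per-word probabilities.
import math

def perplexity(word_probs):
    M = len(word_probs)
    LP = -sum(math.log2(p) for p in word_probs) / M
    return 2 ** LP

# Guessing random digits: every word has probability 1/10, so PP = 10.
print(perplexity([0.1] * 50))        # ~10.0
```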

Frequency Discounting

Discount the relative frequency so that some positive probability can be assigned to every possible n-gram:

    0 \le f^*(w \mid x\,y) \le f(w \mid x\,y) \quad \forall\, x\,y\,w \in V^3

The zero-frequency probability λ(x y), defined by

    \lambda(x\,y) = 1.0 - \sum_{w \in V} f^*(w \mid x\,y),

is redistributed over the set of words never observed after the history x y. Redistribution is proportional to the less specific (n-1)-gram model p(w | y). Notice: c(x y) = 0 implies that λ(x y) = 1.

Smoothing Schemes

Discounting of f(w | x y) and redistribution of λ(x y) can be combined by:

Back-off, i.e. select the most significant approximation available:

    p(w \mid x\,y) = \begin{cases} f^*(w \mid x\,y) & \text{if } f^*(w \mid x\,y) > 0 \\ \alpha_{x\,y}\, \lambda(x\,y)\, p(w \mid y) & \text{otherwise} \end{cases}    (7)

where α_{xy} is an appropriate normalization term.

Interpolation, i.e. sum up the two approximations:

    p(w \mid x\,y) = f^*(w \mid x\,y) + \lambda(x\,y)\, p(w \mid y)    (8)
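A sketch contrasting the back-off scheme (7) with interpolation (8). The discounted frequencies f*, the zero-frequency mass λ, and the lower-order model are hand-set toy dictionaries, and the normalization term α is left at 1 for simplicity; a real LM would derive all of them from counts.

```python
# Toy ingredients: f*(w | x y), lambda(x y) and lower-order p(w | y).
f_star = {("hello", "pisa", "!"): 0.0, ("hello", "pisa", "again"): 0.4}
lam = {("hello", "pisa"): 0.6}
p_low = {("pisa", "!"): 0.2, ("pisa", "again"): 0.5}

def interpolate(w, x, y):
    # Eq. (8): always mix the two approximations
    return f_star.get((x, y, w), 0.0) + lam[(x, y)] * p_low[(y, w)]

def back_off(w, x, y, alpha=1.0):
    # Eq. (7): use f* when positive, otherwise fall back to the lower order;
    # alpha is the normalization term (left at 1.0 here for simplicity)
    fs = f_star.get((x, y, w), 0.0)
    return fs if fs > 0 else alpha * lam[(x, y)] * p_low[(y, w)]

print(interpolate("!", "hello", "pisa"))   # 0.12
print(back_off("!", "hello", "pisa"))      # 0.12 here, since f*(! | hello pisa) = 0
print(back_off("again", "hello", "pisa"))  # 0.4, the discounted frequency itself
```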

Smoothing Methods

Witten-Bell estimate [Witten & Bell, 1991]: λ(x y) ∝ n(x y), the number of different words observed after x y in the training data:

    \lambda(x\,y) \stackrel{\text{def}}{=} \frac{n(x\,y)}{c(x\,y) + n(x\,y)}    which gives    f^*(w \mid x\,y) = \frac{c(x\,y\,w)}{c(x\,y) + n(x\,y)}

Absolute discounting [Ney & Essen, 1991]: subtract a constant β (0 < β ≤ 1) from all observed n-gram counts:

    f^*(w \mid x\,y) = \max\left\{ \frac{c(x\,y\,w) - \beta}{c(x\,y)},\, 0 \right\}    which gives    \lambda(x\,y) = \beta\, \frac{|\{w : c(x\,y\,w) \ge 1\}|}{c(x\,y)}

The discount can be set to β = n_1 / (n_1 + 2 n_2) < 1, where n_c is the number of different n-grams which occur c times in the training data.

Improved Absolute Discounting

Kneser-Ney smoothing [Kneser & Ney, 1995]: absolute discounting with corrected counts for the lower-order n-grams. Rationale: the lower-order count c(y, w) is made proportional to the number of different words that y w follows. Example: let c(los, angeles) = 1000 and c(angeles) = 1000; the corrected count is c*(angeles) = 1, hence the unigram probability p(angeles) will be small.

Improved Kneser-Ney [Chen & Goodman, 1998]: in addition, use specific discounting coefficients for rare n-grams:

    f^*(w \mid x\,y) = \frac{c(x\,y\,w) - \beta(c(x\,y\,w))}{c(x\,y)}

where β(0) = 0, β(1) = D_1, β(2) = D_2, and β(c) = D_{3+} if c ≥ 3.
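A sketch of the Witten-Bell formulas for a single history x y, using invented trigram counts; it also checks that the discounted masses plus λ(x y) sum to one.

```python
# Witten-Bell discounting for one history "x y": c(x y w) counts,
# c(x y) their sum, n(x y) the number of distinct successor words.
from collections import Counter

counts = Counter({("hello", "pisa", "again"): 3,      # invented c(x y w)
                  ("hello", "pisa", "today"): 1})
history = ("hello", "pisa")

c_xy = sum(counts.values())                  # c(x y) = 4
n_xy = len(counts)                           # n(x y) = 2

def f_star(w):
    # f*(w | x y) = c(x y w) / (c(x y) + n(x y))
    return counts.get(history + (w,), 0) / (c_xy + n_xy)

lam = n_xy / (c_xy + n_xy)                   # lambda(x y) = n(x y) / (c(x y) + n(x y))

print(f_star("again"), f_star("today"), lam)            # 0.5 0.1666... 0.3333...
print(f_star("again") + f_star("today") + lam)          # 1.0: discounted mass + zero-frequency mass
```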

LM Representation: ARPA File Format

The ARPA file contains all the ingredients needed to compute LM probabilities:

    \data\
    ngram 1=86700
    ngram 2=1948935
    ngram 3=2070512

    \1-grams:
    -2.88382    !        -2.38764
    -2.94351    world    -0.514311
    -6.09691    pisa     -0.15553
    ...
    \2-grams:
    -3.91009    world !        -0.351469
    -3.91257    hello world    -0.24
    -3.87582    hello pisa     -0.0312
    ...
    \3-grams:
    -0.00108858     hello world !
    -0.000271867    , hi hello !
    ...
    \end\

Each line gives the log-probability of an n-gram, followed by its back-off weight (if any). An unseen n-gram is scored by backing off:

    logPr(! | hello pisa) = -0.0312 + logPr(! | pisa)
    logPr(! | pisa)       = -0.15553 + (-2.88382)

Large Scale Language Models

- The availability of large-scale corpora has pushed research toward huge LMs.
- At the 2006 NIST workshop the best systems used LMs trained on at least 1.6G words.
- Google presented results using a 5-gram LM trained on 1.3T words.
- Handling such huge LMs with available tools (e.g. SRILM) is prohibitive on standard computer equipment (4 to 8 GB of RAM).
- The technology trend is towards distributed processing using PC farms.
- We developed IRSTLM, an LM library addressing these needs:
  - open-source LGPL library hosted on sourceforge.net
  - integrated into the Moses SMT toolkit and FBK-irst's speech decoder
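A sketch of the back-off computation on the ARPA fragment above. The entries are stored in plain dictionaries (this is not a full ARPA parser), and an unseen n-gram is scored by adding the back-off weight of its history to the score of the shorter n-gram, reproducing logPr(! | hello pisa).

```python
# log10 probabilities and back-off weights taken from the fragment above.
logprob = {("!",): -2.88382, ("world",): -2.94351, ("pisa",): -6.09691,
           ("world", "!"): -3.91009, ("hello", "world"): -3.91257,
           ("hello", "pisa"): -3.87582, ("hello", "world", "!"): -0.00108858}
backoff = {("!",): -2.38764, ("world",): -0.514311, ("pisa",): -0.15553,
           ("world", "!"): -0.351469, ("hello", "world"): -0.24,
           ("hello", "pisa"): -0.0312}

def logpr(ngram):
    # return log10 Pr(w | history) for ngram = history + (w,)
    if ngram in logprob:
        return logprob[ngram]
    if len(ngram) == 1:
        return float("-inf")                 # out-of-vocabulary word
    history, shorter = ngram[:-1], ngram[1:]
    # back off: add the history's back-off weight (0 if the history is unseen)
    return backoff.get(history, 0.0) + logpr(shorter)

print(logpr(("hello", "pisa", "!")))   # -0.0312 + (-0.15553) + (-2.88382) = -3.07055
```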

IRSTLM Library (open source)

Important features:
- Distributed training on a single machine or an SGE queue:
  - split the dictionary into balanced n-gram prefix lists
  - collect n-grams for each prefix list
  - estimate a single LM for each prefix list
  - quickly merge the single LMs into one ARPA file
- Space optimization:
  - n-gram collection uses dynamic storage to encode counters
  - LM estimation just requires reading disk files
  - probabilities and back-off weights are quantized
  - the LM data structure is loaded on demand
- LM caching of probability computations, accesses to internal lists, LM states, ...

Data Structure to Collect N-grams

[Diagram: per-level arrays for 1-grams, 2-grams, and 3-grams; each entry stores the word (w) and its frequency (fr), plus, for non-final levels, a successor pointer (succ ptr) and flags.]

- Dynamic prefix-tree data structure
- Successor lists are allocated on demand through memory pools
- Counts are stored in 1 to 6 bytes, according to their maximum value
- This permits managing a few huge counts, such as those in the Google n-grams
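A much-simplified sketch of the dynamic prefix-tree idea: successor tables are allocated only when a prefix is first seen, and every prefix of an inserted n-gram accumulates a count. IRSTLM's real C++ structure additionally packs counts into 1 to 6 bytes and uses memory pools, which is not reproduced here.

```python
# Dynamic prefix tree for collecting n-gram counts (simplified sketch).
class TrieNode:
    __slots__ = ("count", "succ")
    def __init__(self):
        self.count = 0
        self.succ = None                    # successor list allocated only when needed

def add_ngram(root, ngram):
    node = root
    for w in ngram:
        if node.succ is None:
            node.succ = {}                  # allocate successors on demand
        node = node.succ.setdefault(w, TrieNode())
        node.count += 1                     # every prefix of the n-gram is counted

def count(root, ngram):
    node = root
    for w in ngram:
        if node.succ is None or w not in node.succ:
            return 0
        node = node.succ[w]
    return node.count

root = TrieNode()
sentence = "hello world hello pisa".split()
for i in range(len(sentence) - 1):          # collect all bigrams of a toy sentence
    add_ngram(root, sentence[i:i + 2])
print(count(root, ["hello"]), count(root, ["hello", "pisa"]))   # 2 1
```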

Distributed Training on English Gigaword

    list index   dictionary size   5-grams observed   distinct   non-singletons
    0            4                 217M               44.9M      16.2M
    1            11                164M               65.4M      20.7M
    2            8                 208M               85.1M      27.0M
    3            44                191M               83.0M      26.0M
    4            64                143M               56.6M      17.8M
    5            137               142M               62.3M      19.1M
    6            190               142M               64.0M      19.5M
    7            548               142M               66.0M      20.1M
    8            783               142M               63.3M      19.2M
    9            1.3K              141M               67.4M      20.2M
    10           2.5K              141M               69.7M      20.5M
    11           6.1K              141M               71.8M      20.8M
    12           25.4K             141M               74.5M      20.9M
    13           4.51M             141M               77.4M      20.6M
    total        4.55M             2.2G               951M       289M

Data Structure to Compute LM Probs

[Diagram: per-level arrays for 1-grams, 2-grams, and 3-grams; entries of non-final levels store the word (w), back-off weight (bo), probability (pr), and an index (idx) into the next level, while final-level entries store just the word and probability.]

- First used in the CMU-Cambridge LM Toolkit [Clarkson & Rosenfeld, 1997]
- Slower access but less memory than the structure used by the SRILM Toolkit
- In addition, IRSTLM compresses probabilities and back-off weights into 1 byte!
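A sketch of the level-array idea behind this structure: each n-gram level is a sorted array, each entry points to the slice holding its successors in the next level, and a lookup is one binary search per level. The field layout and the toy entries are invented; byte packing and quantization are omitted.

```python
# Level-array lookup sketch: binary search per level instead of hashing.
from bisect import bisect_left

# level 1: (word, logprob, backoff, start, end); successors live in level2[start:end]
level1 = [("hello", -1.0, -0.2, 0, 2),
          ("pisa",  -2.0, -0.1, 2, 2),       # empty successor slice
          ("world", -1.5, -0.3, 2, 3)]
# level 2: (word, logprob), grouped by predecessor and sorted within each group
level2 = [("pisa", -0.9), ("world", -0.7),   # successors of "hello"
          ("!",    -0.4)]                    # successors of "world"

def bigram_logprob(u, v):
    words1 = [e[0] for e in level1]
    i = bisect_left(words1, u)
    if i == len(level1) or level1[i][0] != u:
        return None                          # history not found
    start, end = level1[i][3], level1[i][4]
    words2 = [e[0] for e in level2[start:end]]
    j = bisect_left(words2, v)
    if j == len(words2) or words2[j] != v:
        return None                          # bigram not found
    return level2[start + j][1]

print(bigram_logprob("hello", "world"))      # -0.7
print(bigram_logprob("world", "!"))          # -0.4
```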

Compression Through Quantization

How does quantization work?
1. Partition observed probabilities into regions (clusters).
2. Assign a code and a probability value to each region (codebook).
3. Encode the probabilities of all observations (quantization).

We investigate two quantization methods:
- Lloyd's K-means algorithm, first applied to LMs for ASR by [Whittaker & Raj, 2000]: computes clusters minimizing the average distance between data and centroids.
- Binning algorithm, first applied to term frequencies for IR by [Franz & McCarley, 2002]: computes clusters that partition the data into uniformly populated intervals.

Notice: a codebook of n centers means a quantization level of log_2 n bits.

LM Quantization

Codebooks
- One codebook for each word-probability and back-off-probability level; for instance, a 5-gram LM needs 9 codebooks in total.
- Use a codebook of at least 256 entries for 1-gram distributions.

Motivation
- The distributions of these probabilities can be quite different.
- 1-gram distributions contain relatively few probabilities.
- The memory cost of a few codebooks is irrelevant.

Composition of codebooks
- LM probabilities are computed by multiplying entries of different codebooks,
- so the actual resolution of lower-order n-grams is higher than that of their codebook!

Very little loss in performance with 8-bit quantization [Federico & Bertoldi, 2006].
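A sketch of the binning quantizer: sort the observed probabilities, split them into uniformly populated intervals, and use each interval's mean as its codebook value; with 256 centers each probability is then encoded in 8 bits. The input probabilities are random placeholders.

```python
# Binning quantization: uniformly populated intervals over the sorted values.
import random

def binning_codebook(probs, n_centers=4):
    ordered = sorted(probs)
    size = len(ordered) / n_centers
    # contiguous, equally populated intervals of the sorted values
    bins = [ordered[round(i * size):round((i + 1) * size)] for i in range(n_centers)]
    centers = [sum(b) / len(b) for b in bins]                 # codebook value = bin mean
    bounds = [b[-1] for b in bins[:-1]]                       # upper bound of each bin
    return centers, bounds

def quantize(p, centers, bounds):
    # code = index of the first bin whose upper bound covers p
    code = next((i for i, ub in enumerate(bounds) if p <= ub), len(bounds))
    return code, centers[code]

random.seed(0)
probs = [random.random() for _ in range(20)]                  # invented probabilities
centers, bounds = binning_codebook(probs, n_centers=4)
print([round(c, 3) for c in centers])                         # the codebook
print(quantize(probs[0], centers, bounds))                    # (code, decoded value)
```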

LM Accesses by the Search Algorithm

Calls made by the SMT decoder to a 3-gram LM while decoding the following text from German to English:

    ich bin kein christdemokrat und glaube daher nicht an wunder. doch ich möchte dem europäischen parlament, so wie es gegenwärtig beschaffen ist, für seinen grossen beitrag zu diesen arbeiten danken.

    ("I am not a Christian Democrat and therefore do not believe in miracles. Yet I would like to thank the European Parliament, as it is currently constituted, for its great contribution to this work.")

LM Accesses by the SMT Search Algorithm

- 1.7M calls, involving only 120K different 3-grams.
- The decoder tends to access LM n-grams in non-uniform, highly localized patterns.
- The first call of an n-gram is typically followed soon after by other calls of the same n-gram.
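Because the calls are so repetitive, a cache in front of the (possibly disk-backed) LM pays off. This sketch wraps a stand-in scoring function with a cache and counts how many requests actually reach the underlying LM; the request stream and the scoring function are invented.

```python
# Caching LM lookups: 1500 decoder requests, but only 2 distinct n-grams here.
from functools import lru_cache

raw_calls = 0

def slow_logpr(ngram):
    # stand-in for an expensive LM probability computation
    return -0.1 * len(ngram)

@lru_cache(maxsize=100_000)
def cached_logpr(ngram):
    global raw_calls
    raw_calls += 1                           # count lookups that reach the real LM
    return slow_logpr(ngram)

requests = [("hello", "world", "!")] * 1000 + [("hello", "pisa", "!")] * 500
for ng in requests:
    cached_logpr(ng)
print(len(requests), "requests ->", raw_calls, "raw LM lookups")   # 1500 -> 2
```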

Memory Mapping of the LM on Disk

[Diagram: the 1-gram, 2-gram, and 3-gram tables of the LM file on disk are mapped into memory; only the accessed pages (grey blocks) are resident.]

- Our LM structure permits exploiting so-called memory-mapped file access.
- Memory mapping includes a file in the address space of a process, and access to it is managed as virtual memory.
- Only the memory pages that are actually accessed during decoding are loaded.
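A minimal illustration of memory-mapped access with Python's mmap module: a placeholder "LM" file is mapped into the address space, and reading a few scattered bytes faults in only the touched pages rather than the whole file. A binary LM would be mapped the same way, with the decoder touching only the n-gram entries it needs.

```python
# Memory-mapped file access: the OS loads only the pages we actually read.
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "toy.lm")
with open(path, "wb") as f:
    f.write(b"\0" * (64 * 1024 * 1024))      # a 64 MB placeholder "LM" file

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # touching three scattered offsets brings in only a handful of pages
    print(mm[0], mm[32 * 1024 * 1024], mm[len(mm) - 1])
    mm.close()
```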