Language Model Rest Costs and Space-Efficient Storage


Language Model Rest Costs and Space-Efficient Storage Kenneth Heafield Philipp Koehn Alon Lavie Carnegie Mellon, University of Edinburgh July 14, 2012

Complaint About Language Models
Make Search Expensive: p5(is one of the) ≠ p5(is one) p5(of the)
1. Better fragment scores

Complaints About Language Models
Make Search Expensive: p5(is one of the) ≠ p5(is one) p5(of the)
Use Too Much Memory: log p5(the | is one of) = -0.5 and log b5(is one of the) = -1.2 (two values per n-gram)
1. Better fragment scores
2. Collapse probability and backoff

Language Model Probability of Sentence Fragments
log p5(is one of the few) = -6.62
Why does it matter? Decoders prune hypotheses based on score.

Baseline: How to Score a Fragment
  log p5(is)                   = -2.63
  log p5(one | is)             = -2.03
  log p5(of | is one)          = -0.24
  log p5(the | is one of)      = -0.47
+ log p5(few | is one of the)  = -1.26
= log p5(is one of the few)    = -6.62
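A minimal Python sketch of this baseline scheme, assuming a lookup table of conditional log10 probabilities (LOGP5 and score_fragment are hypothetical names; the numbers are the slide's):

    # Baseline fragment scoring: sum conditional log10 probabilities from the full
    # 5-gram model, using whatever left context exists inside the fragment.
    # LOGP5 is a hypothetical lookup table; the values are taken from the slide.
    LOGP5 = {
        ("is", ()): -2.63,
        ("one", ("is",)): -2.03,
        ("of", ("is", "one")): -0.24,
        ("the", ("is", "one", "of")): -0.47,
        ("few", ("is", "one", "of", "the")): -1.26,
    }

    def score_fragment(words, order=5):
        total = 0.0
        for i, word in enumerate(words):
            context = tuple(words[max(0, i - order + 1):i])
            total += LOGP5[(word, context)]
        return total

    print(score_fragment(["is", "one", "of", "the", "few"]))  # about -6.63 (the slide rounds to -6.62)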

The Problem: Lower Order Entries
5-Gram Model:   log p5(is) = -2.63
Unigram Model:  log p1(is) = -2.30
Same training data.

Backoff Smoothing
The unigram p5(is) is used when a bigram is not found.
In the language model:      log p5(is | australia) = -2.21
Not in the language model:  log p5(is | periwinkle) = log b5(periwinkle) + log p5(is) = -2.95

Backoff Smoothing
The unigram p5(is) is used when a bigram is not found.
In the language model:      log p5(is | australia) = -2.21
Not in the language model:  log p5(is | periwinkle) = log b5(periwinkle) + log p5(is) = -2.95
In Kneser-Ney smoothing, the lower order probabilities are estimated assuming that backoff has occurred.
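A sketch of a standard backoff lookup, not KenLM's code: if the full n-gram is absent, add the context's backoff weight and retry with a shorter context. The tables PROB and BACKOFF are hypothetical, and log b5(periwinkle) = -0.32 is an assumed value implied by the slide's arithmetic.

    # Standard backoff lookup (illustrative, not KenLM's implementation).
    # PROB maps (context..., word) tuples to log10 p; BACKOFF maps context tuples to log10 b.
    PROB = {("australia", "is"): -2.21, ("is",): -2.63}
    BACKOFF = {("periwinkle",): -0.32}  # assumed value: -0.32 + -2.63 = -2.95 as on the slide

    def backoff_logprob(word, context):
        ngram = context + (word,)
        if ngram in PROB or not context:
            return PROB[ngram]  # found, or nothing left to back off to
        # n-gram missing: charge the context's backoff weight and drop the leftmost word
        return BACKOFF.get(context, 0.0) + backoff_logprob(word, context[1:])

    print(backoff_logprob("is", ("australia",)))   # -2.21 (bigram found)
    print(backoff_logprob("is", ("periwinkle",)))  # about -2.95 (backed off to the unigram)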

Use Lower Order Models for the First Few Words
                                  Baseline   Lower
  log p5(is)                   =   -2.63     -2.30  = log p1
  log p5(one | is)             =   -2.03     -1.92  = log p2
  log p5(of | is one)          =   -0.24     -0.08  = log p3
  log p5(the | is one of)      =   -0.47     -0.21  = log p4
+ log p5(few | is one of the)  =   -1.26     -1.26  = log p5
= log p5(is one of the few)    =   -6.62     -5.77  = log pLow
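A sketch of the lower-order variant, assuming the order-k probabilities are available in a table (LOGP_BY_ORDER is a hypothetical name filled with the slide's numbers): word i of a fragment has only i words of left context, so it is scored with the largest order that fits.

    # Lower-order rest costs: the i-th word of a fragment (0-indexed) has only i words
    # of left context, so score it with the model of order i+1 (capped at the full order).
    LOGP_BY_ORDER = {
        1: {("is", ()): -2.30},
        2: {("one", ("is",)): -1.92},
        3: {("of", ("is", "one")): -0.08},
        4: {("the", ("is", "one", "of")): -0.21},
        5: {("few", ("is", "one", "of", "the")): -1.26},
    }

    def score_fragment_lower(words, max_order=5):
        total = 0.0
        for i, word in enumerate(words):
            order = min(i + 1, max_order)
            context = tuple(words[i - order + 1:i])
            total += LOGP_BY_ORDER[order][(word, context)]
        return total

    print(score_fragment_lower(["is", "one", "of", "the", "few"]))  # about -5.77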

Which is Better?
Baseline:     log p5(is one of the few)   = -6.62
Lower Order:  log pLow(is one of the few) = -5.77

Which is Better: Prediction Task
                                                         Error
Baseline:     log p5(is one of the few)    = -6.62       2.52
Lower Order:  log pLow(is one of the few)  = -5.77       1.67
Actual:       log p5(is one of the few | <s> australia) = -4.10
(Error is the absolute difference from the actual score, e.g. |-6.62 - (-4.10)| = 2.52.)

The Lower Order Estimate is Better
Run the decoder and log the error every time context is revealed.
Length        1    2    3    4
Baseline     .87  .24  .10  .09
Lower Order  .84  .18  .07  .04
Table: Mean squared error in predicting log probability.

Storing Lower Order Models
One extra float per entry, except for the longest order.
Unigrams
Words      log p5  log b5  log p1
australia   -3.9    -0.6    -3.6
is          -2.6    -1.5    -2.3
one         -3.4    -1.0    -2.9
of          -2.5    -1.1    -1.7
No need for backoff b1: if backoff occurs, the Kneser-Ney assumption holds and p5 is used.
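A sketch of the augmented entry layout, not KenLM's actual structs: every entry except those of the longest order carries one extra float, the lower-order rest cost (field names are ours; values from the slide's table).

    from dataclasses import dataclass

    @dataclass
    class UnigramEntry:
        log_p5: float  # regularly smoothed probability, used on backoff paths
        log_b5: float  # backoff weight
        log_p1: float  # extra float: unigram rest cost (no b1 is stored)

    unigrams = {
        "australia": UnigramEntry(-3.9, -0.6, -3.6),
        "is":        UnigramEntry(-2.6, -1.5, -2.3),
        "one":       UnigramEntry(-3.4, -1.0, -2.9),
        "of":        UnigramEntry(-2.5, -1.1, -1.7),
    }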

Lower Order Summary
Fragment scores are more accurate, but require more memory.

Related Work
Score with and without sentence boundaries. [Sankaran et al., 2012]
Peek at future phrases. [Zens and Ney, 2008] [Wuebker et al., Wed.]
A coarse pass predicts scores for a finer pass. [Vilar and Ney, 2011]

Related Work
Score with and without sentence boundaries. [Sankaran et al., 2012]
Peek at future phrases. [Zens and Ney, 2008] [Wuebker et al., Wed.]
A coarse pass predicts scores for a finer pass. [Vilar and Ney, 2011]
All of these use fragment scores as a subroutine.

Related Work II: Carter et al., Yesterday
This Work:   p(is one of the) ≠ p(is one) p(of the)
Their Work:  p(is one of the) ≤ p(is one) p(of the)
Implementing Upper Bounds Within This Work
Store upper bound probabilities instead of averages.
Account for positive backoff with the context.
Three values per n-gram instead of their four.

Lower Order Summary
Previously: Fragment scores are more accurate, but require more memory.
Next: Save memory, but make fragment scores less accurate.

Saving Memory
Unigrams
Words      log p5  log b5
australia   -3.9    -0.6
is          -2.6    -1.5
one         -3.4    -1.0
of          -2.5    -1.1
Compress:
Words      log q
australia   -4.5
is          -4.1
one         -4.4
of          -3.6
One less float per entry, except for the longest order.
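For unigrams the collapsed value is simply probability times backoff, a sum in log space; a minimal check of the slide's table, with names of our choosing:

    # Collapse probability and backoff into one value: for unigrams, log q = log p + log b.
    unigrams = {"australia": (-3.9, -0.6), "is": (-2.6, -1.5), "one": (-3.4, -1.0), "of": (-2.5, -1.1)}
    log_q = {w: log_p + log_b for w, (log_p, log_b) in unigrams.items()}
    print(log_q)  # australia -4.5, is -4.1, one -4.4, of -3.6 (up to float rounding)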

Related Work
Store counts instead of probability and backoff. [Brants et al., 2007] RandLM, ShefLM, BerkeleyLM
This Work
Memory comparable to storing counts.
Higher quality Kneser-Ney smoothing.

How Backoff Works
p(periwinkle | is one of) = p(periwinkle | of) b(is one of) b(one of)
because "of periwinkle" appears but "one of periwinkle" does not.

Pessimism
Assume backoff all the way to unigrams:
q(is one of) = p(is one of) b(is one of) b(one of) b(of)

Pessimism
Assume backoff all the way to unigrams:
q(is one of) = p(is one of) b(is one of) b(one of) b(of)
Sentence Scores Are Unchanged
q(<s> … </s>) = p(<s> … </s>) because b(… </s>) = 1

Incremental Pessimism
q(is) = p(is) b(is)
q(one | is) = p(one | is) b(is one) b(one) / b(is)
These are terms in a telescoping series:
q(is one) = q(is) q(one | is) = p(is one) b(is one) b(one), since b(is) cancels.
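One way to write the general form behind these examples, a sketch using the substring shorthand w_i^n for the words w_i ... w_n (notation ours, chosen to be consistent with the slide's instances):

    q(w_1^n) = p(w_1^n) \prod_{i=1}^{n} b(w_i^n),
    \qquad
    q(w_n \mid w_1^{n-1}) = \frac{q(w_1^n)}{q(w_1^{n-1})}
      = p(w_n \mid w_1^{n-1}) \, \frac{\prod_{i=1}^{n} b(w_i^n)}{\prod_{i=1}^{n-1} b(w_i^{n-1})}

Multiplying the stored per-word terms over a fragment telescopes: each denominator cancels the previous numerator, recovering the pessimistic fragment score q(w_1^n).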

Using q
  log q(is)                   = -4.10
  log q(one | is)             = -2.51
  log q(of | is one)          = -0.94
  log q(the | is one of)      = -1.61
+ log q(few | is one of the)  = +1.03
= log q(is one of the few)    = -8.13
Store q; forget probability and backoff.

Using q
  log q(is)                   = -4.10
  log q(one | is)             = -2.51
  log q(of | is one)          = -0.94
  log q(the | is one of)      = -1.61
+ log q(few | is one of the)  = +1.03
= log q(is one of the few)    = -8.13
Store q; forget probability and backoff.
q is not a proper probability distribution.
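A sketch of scoring with collapsed values: a plain sum of the stored log q terms, with no backoff handling at query time (LOGQ is a hypothetical table holding the slide's numbers).

    # Scoring with collapsed values: sum the stored log q terms; the last one is
    # positive, which is fine because q is not a proper probability distribution.
    LOGQ = {
        ("is", ()): -4.10,
        ("one", ("is",)): -2.51,
        ("of", ("is", "one")): -0.94,
        ("the", ("is", "one", "of")): -1.61,
        ("few", ("is", "one", "of", "the")): 1.03,
    }

    def score_fragment_q(words, order=5):
        return sum(LOGQ[(w, tuple(words[max(0, i - order + 1):i]))] for i, w in enumerate(words))

    print(score_fragment_q(["is", "one", "of", "the", "few"]))  # about -8.13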

Summary
Collapse probability and backoff from two values to one value.

Stacking Lower Order and Pessimistic
Combined: same memory (one extra float, one less float); better estimates on the left of a fragment, worse on the right.

Cube Pruning: Approximate Search
For each constituent, going bottom-up:
1. Make a priority queue over possible rule applications.
2. Pop a fixed number of hypotheses: the pop limit.
Larger pop limit = more accurate search.

Cube Pruning: Approximate Search
For each constituent, going bottom-up:
1. Make a priority queue over possible rule applications.
2. Pop a fixed number of hypotheses: the pop limit.
Larger pop limit = more accurate search.
Accurate fragment scores matter: they determine which hypotheses get popped.
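A toy sketch of pop-limited combination, not Moses's implementation: two best-first hypothesis lists are combined through a priority queue and at most pop_limit candidates are popped; in a real decoder each popped hypothesis would then be rescored with the language model, which is where accurate fragment scores pay off. All names and scores below are illustrative.

    import heapq

    def cube_prune(left, right, rule_score, pop_limit):
        """Toy cube pruning: combine two best-first hypothesis lists of (score, text),
        popping at most pop_limit candidates from the (left index, right index) grid."""
        heap = [(-(left[0][0] + right[0][0] + rule_score), 0, 0)]  # max-heap via negation
        seen, popped = {(0, 0)}, []
        while heap and len(popped) < pop_limit:
            neg_score, i, j = heapq.heappop(heap)
            popped.append((-neg_score, left[i][1] + " " + right[j][1]))
            for di, dj in ((1, 0), (0, 1)):  # push the two grid neighbours
                ni, nj = i + di, j + dj
                if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                    seen.add((ni, nj))
                    heapq.heappush(heap, (-(left[ni][0] + right[nj][0] + rule_score), ni, nj))
        return popped

    left = [(-1.0, "one of"), (-1.5, "one among")]
    right = [(-0.5, "the few"), (-2.0, "few")]
    print(cube_prune(left, right, rule_score=-0.1, pop_limit=3))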

Experiments
Task:     WMT 2011 German-English
Decoder:  Moses
LM:       5-gram from Europarl, news commentary, and news
Grammar:  Hierarchical and target-syntax systems
Parser:   Collins

Hierarchical Model Score and BLEU
[Plots: average model score and uncased BLEU versus CPU seconds/sentence for the Lower, Combined, Baseline, and Pessimistic configurations.]

Target-Syntax Model Score and BLEU
[Plots: average model score and uncased BLEU versus CPU seconds/sentence for the Lower, Combined, Baseline, and Pessimistic configurations.]

Memory
Cost to add, or savings from removing, a float per entry.
Structure                   Baseline (MB)  Change (MB)    %
Probing                         4,072          517       13%
Trie                            2,647          506       19%
8-bit quantized trie            1,236          140       11%
8-bit minimal perfect hash        540          140       26%

Summary
Lower Order Models:        21-63% less CPU, 13-26% more memory
Pessimistic Compression:   27% more CPU, 13-26% less memory
Lower Order + Pessimistic: 3% less CPU, same memory as baseline

Code
kheafield.com/code/kenlm
Also distributed with Moses and cdec.
Lower Order: build_binary -r 1.arpa 2.arpa 3.arpa 4.arpa 5.arpa 5.binary
Release planned