A fast and simple algorithm for training neural probabilistic language models


A fast and simple algorithm for training neural probabilistic language models
Andriy Mnih
Joint work with Yee Whye Teh
Gatsby Computational Neuroscience Unit, University College London
25 January 2013

Statistical language modelling

Goal: model the joint distribution of words in a sentence.

Applications:
- speech recognition
- machine translation
- information retrieval

Markov assumption: the distribution of the next word depends only on a fixed number of words that immediately precede it. Though false, it makes the task much more tractable without making it trivial.

n-gram models

Task: predict the next word w_n from the n-1 preceding words h = w_1, ..., w_{n-1}, called the context.

n-gram models are conditional probability tables for P(w_n | h):
- Estimated by counting the number of occurrences of each word n-tuple and normalizing.
- Smoothing is essential for good performance.

n-gram models are the most widely used statistical language models due to their simplicity and good performance.

Curse of dimensionality: the number of model parameters is exponential in the context size, so n-gram models cannot take advantage of large contexts.
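To make the count-and-normalize estimation concrete, here is a minimal bigram (n = 2) sketch in Python. The corpus, function names, and add-alpha smoothing are illustrative assumptions; real systems use more sophisticated smoothing (e.g. Kneser-Ney), but the principle is the same.

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus, alpha=0.1):
    """Estimate a bigram model P(w_n | w_{n-1}) by counting and normalizing,
    with simple add-alpha smoothing."""
    vocab = sorted(set(w for sent in corpus for w in sent))
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(["<s>"] + sent, sent + ["</s>"]):
            counts[prev][nxt] += 1

    def prob(next_word, context_word):
        total = sum(counts[context_word].values())
        v = len(vocab) + 1  # +1 for the </s> symbol
        return (counts[context_word][next_word] + alpha) / (total + alpha * v)

    return prob

# Toy usage
corpus = [["the", "dog", "barks"], ["the", "cat", "meows"]]
p = train_bigram_lm(corpus)
print(p("dog", "the"))  # smoothed estimate of P(dog | the)
```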

Neural probabilistic language modelling

Neural probabilistic language models (NPLMs) use distributed representations of words to deal with the curse of dimensionality:
- Words are represented with real-valued feature vectors learned from data.
- A neural network maps a context (a sequence of word feature vectors) to a distribution over the next word.
- Word feature vectors and neural net parameters are learned jointly.

NPLMs generalize well because smooth functions map nearby inputs to nearby outputs, and similar representations are learned for words with similar usage patterns.

Main drawback: very long training times.
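As a sketch of the general architecture (a Bengio-style feed-forward NPLM, distinct from the log-bilinear model defined later), the forward pass below looks up the context words' feature vectors, passes them through a hidden layer, and produces a softmax distribution over the next word. All sizes and initializations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the slides): vocabulary V, embedding dim d,
# context size n-1, hidden layer size h.
V, d, n_ctx, h = 10_000, 50, 2, 100

R = rng.normal(0, 0.01, (V, d))           # word feature vectors, learned jointly
W1 = rng.normal(0, 0.01, (n_ctx * d, h))  # context -> hidden
W2 = rng.normal(0, 0.01, (h, V))          # hidden -> per-word scores
b = np.zeros(V)

def next_word_distribution(context_ids):
    """Map a context (word ids) to a distribution over the next word."""
    x = R[context_ids].reshape(-1)         # look up and concatenate feature vectors
    hidden = np.tanh(x @ W1)
    scores = hidden @ W2 + b               # one score per vocabulary word
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = next_word_distribution(np.array([12, 7]))   # P(w | w_{n-2}=12, w_{n-1}=7)
print(p.shape, p.sum())                          # (10000,) 1.0
```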

t-SNE embedding of learned word representations

[Figure: 2D t-SNE visualization of the learned word feature vectors. Words with similar usage patterns cluster together, e.g. prepositions ("of", "in", "for", "on", "with", "by", "at", "about") and verbs with related meanings ("make", "get", "do", "take", "give", "keep").]

Defining the next-word distribution

An NPLM quantifies the compatibility between a context h and a candidate next word w using a scoring function s_\theta(w, h). The distribution for the next word is defined in terms of scores:

P_\theta^h(w) = \frac{1}{Z_\theta(h)} \exp(s_\theta(w, h)),

where Z_\theta(h) = \sum_{w'} \exp(s_\theta(w', h)) is the normalizer for context h.

Example: the log-bilinear model (LBL) performs linear prediction in the space of word representations. The predicted representation for the next word, \hat{r}(h), is obtained by linearly combining the representations of the context words:

\hat{r}(h) = \sum_{i=1}^{n-1} C_i r_{w_i}.

The scoring function is s_\theta(w, h) = \hat{r}(h)^\top r_w.
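A direct translation of these two equations into NumPy (sizes and random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocabulary V, representation dim d, context size n-1 = 2.
V, d, n_ctx = 10_000, 100, 2

R = rng.normal(0, 0.01, (V, d))         # word representations r_w
C = rng.normal(0, 0.01, (n_ctx, d, d))  # context matrices C_i

def lbl_next_word_distribution(context_ids):
    """LBL: predict a representation for the next word, then score every word."""
    # \hat{r}(h) = sum_i C_i r_{w_i}
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
    scores = R @ r_hat                   # s(w, h) = \hat{r}(h)^T r_w for all w at once
    scores -= scores.max()               # numerical stability
    p = np.exp(scores)
    return p / p.sum()                   # softmax: P_theta^h(w)

p = lbl_next_word_distribution([42, 7])
print(p.argmax(), p.sum())               # most likely next-word id, 1.0
```

Note that the expensive step is the score computation `R @ r_hat`, which touches every word in the vocabulary; this is exactly the cost that NCE avoids during training.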

Maximum-likelihood learning

For a single context, the gradient of the log-likelihood is

\nabla_\theta \log P_\theta^h(w) = \nabla_\theta s_\theta(w, h) - \nabla_\theta \log Z_\theta(h)
                                 = \nabla_\theta s_\theta(w, h) - \sum_{w'} P_\theta^h(w') \nabla_\theta s_\theta(w', h).

Computing \nabla_\theta \log Z_\theta(h) is expensive: the time complexity is linear in the vocabulary size (typically tens of thousands of words).

Importance sampling approximation (Bengio and Senécal, 2003): sample k words x_1, ..., x_k from a proposal distribution Q_h and reweight the gradients:

\nabla_\theta \log Z_\theta(h) \approx \sum_{j=1}^{k} \frac{v(x_j)}{V} \nabla_\theta s_\theta(x_j, h),
where v(x) = \frac{\exp(s_\theta(x, h))}{Q_h(x)} and V = \sum_{j=1}^{k} v(x_j).

Stability issues: need either a lot of samples or an adaptive proposal distribution.
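The sketch below illustrates the importance-sampling estimator on a deliberately simple toy model (an assumption made for clarity, not the LBL): one free score per word, s_\theta(x) = \theta[x], so that \nabla_\theta s(x) is a one-hot vector and the exact \nabla_\theta \log Z is the softmax of \theta. This makes the estimate easy to compare against the exact value.

```python
import numpy as np

rng = np.random.default_rng(0)

V, k = 1_000, 50
theta = rng.normal(0, 1, V)
softmax = np.exp(theta - theta.max()); softmax /= softmax.sum()   # exact grad of log Z

Q = np.full(V, 1.0 / V)                  # proposal: uniform over the vocabulary
x = rng.choice(V, size=k, p=Q)           # k proposal samples

v = np.exp(theta[x]) / Q[x]              # importance weights v(x_j) = exp(s)/Q
w = v / v.sum()                          # normalize by V = sum_j v(x_j)

grad_est = np.zeros(V)
np.add.at(grad_est, x, w)                # sum_j (v(x_j)/V) * grad s(x_j), one-hot grads

print(np.abs(grad_est - softmax).sum())  # estimation error; shrinks as k grows
```

The instability mentioned on the slide shows up here as high variance of the weights v(x_j) when the proposal Q is a poor match for the model distribution.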

Noise-contrastive estimation

NCE idea: fit a density model by learning to discriminate between samples from the data distribution and samples from a known noise distribution (Gutmann and Hyvärinen, 2010).

If noise samples are k times more frequent than data samples, the posterior probability that a sample came from the data distribution is

P(D = 1 | x) = \frac{P_d(x)}{P_d(x) + k P_n(x)}.

To fit a model P_\theta(x) to the data, use P_\theta(x) in place of P_d(x) and maximize

J(\theta) = E_{P_d}[\log P(D = 1 | x, \theta)] + k E_{P_n}[\log P(D = 0 | x, \theta)]
          = E_{P_d}\left[\log \frac{P_\theta(x)}{P_\theta(x) + k P_n(x)}\right] + k E_{P_n}\left[\log \frac{k P_n(x)}{P_\theta(x) + k P_n(x)}\right].
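A Monte Carlo estimate of J(\theta) from samples, written in log space for numerical stability; the function name and interface are illustrative, and the model log-probabilities may be unnormalized (see the following slides).

```python
import numpy as np

def nce_objective(log_p_model_data, log_p_noise_data,
                  log_p_model_noise, log_p_noise_noise, k):
    """Monte Carlo estimate of the NCE objective J(theta).

    *_data: arrays over data samples; *_noise: arrays over the noise samples.
    log_p_model_* are model log-probabilities (possibly unnormalized),
    log_p_noise_* are log-probabilities under the noise distribution P_n.
    """
    # log P(D=1|x) = log P_model(x) - log(P_model(x) + k P_noise(x))
    d1 = log_p_model_data - np.logaddexp(log_p_model_data,
                                         np.log(k) + log_p_noise_data)
    # log P(D=0|x) = log(k P_noise(x)) - log(P_model(x) + k P_noise(x))
    d0 = (np.log(k) + log_p_noise_noise) - np.logaddexp(log_p_model_noise,
                                                        np.log(k) + log_p_noise_noise)
    return d1.mean() + k * d0.mean()

# Shape-only illustration with random log-probabilities.
rng = np.random.default_rng(0)
N, k = 4, 5
print(nce_objective(rng.normal(size=N), rng.normal(size=N),
                    rng.normal(size=k * N), rng.normal(size=k * N), k))
```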

The advantages of NCE

NCE allows working with unnormalized distributions P_\theta^u(x): set P_\theta(x) = P_\theta^u(x)/Z and learn Z (or \log Z).

The gradient of the objective is

\nabla_\theta J(\theta) = E_{P_d}\left[\frac{k P_n(x)}{P_\theta(x) + k P_n(x)} \nabla_\theta \log P_\theta(x)\right] - k E_{P_n}\left[\frac{P_\theta(x)}{P_\theta(x) + k P_n(x)} \nabla_\theta \log P_\theta(x)\right].

This is much easier to estimate than the importance sampling gradient because the weights on \nabla_\theta \log P_\theta(x) are always between 0 and 1. As a result, far fewer noise samples are needed.

NCE properties

The NCE gradient can be written as

\nabla_\theta J(\theta) = \sum_x \frac{k P_n(x)}{P_\theta(x) + k P_n(x)} \left(P_d(x) - P_\theta(x)\right) \nabla_\theta \log P_\theta(x).

This is a pointwise reweighting of the ML gradient. In fact, as k \to \infty, the NCE gradient converges to the ML gradient.

If the noise distribution is non-zero everywhere and P_\theta(x) is unconstrained, P_\theta(x) = P_d(x) is the only optimum. If the model class does not contain P_d(x), the location of the optimum depends on P_n.

NCE for training neural language models

A neural language model specifies a large collection of distributions: one distribution per context, with parameters shared across contexts.

We train the model by optimizing the sum of per-context NCE objectives weighted by the empirical context probabilities. If P_\theta^h(w) is the probability of word w in context h under the model, the NCE objective for context h is

J^h(\theta) = E_{P_d^h}\left[\log \frac{P_\theta^h(w)}{P_\theta^h(w) + k P_n(w)}\right] + k E_{P_n}\left[\log \frac{k P_n(w)}{P_\theta^h(w) + k P_n(w)}\right].

The overall objective is J(\theta) = \sum_h P(h) J^h(\theta), where P(h) is the empirical probability of context h.
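In practice the expectations are replaced by the observed (context, word) pairs and k noise samples per pair, which weights each context by its empirical frequency automatically. The sketch below assumes this Monte Carlo form, treats exp(s_\theta(w, h)) as the (unnormalized) model probability with the normalizer fixed to 1 (see the Practicalities slide), and returns a negated, averaged loss suitable for gradient descent; the interface and the toy score function are illustrative assumptions.

```python
import numpy as np

def nce_loss_for_batch(score_fn, contexts, targets, noise_logprob, k, rng):
    """Per-context NCE loss for a batch of (context, word) pairs.

    score_fn(context, words) returns unnormalized log-scores s_theta(w, h);
    noise_logprob[w] is log P_n(w). exp(score) stands in for P_theta^h(w),
    i.e. the per-context normalizers are fixed to 1.
    """
    V = noise_logprob.shape[0]
    total = 0.0
    for h, w in zip(contexts, targets):
        noise = rng.choice(V, size=k, p=np.exp(noise_logprob))
        s_data = score_fn(h, np.array([w]))[0]
        s_noise = score_fn(h, noise)
        # log P(D=1 | w, h) for the observed word
        total += s_data - np.logaddexp(s_data, np.log(k) + noise_logprob[w])
        # log P(D=0 | x, h) summed over the k noise words
        total += np.sum((np.log(k) + noise_logprob[noise])
                        - np.logaddexp(s_noise, np.log(k) + noise_logprob[noise]))
    return -total / len(targets)   # negative objective: minimize this

# Toy usage: a score table that ignores the context, and a uniform noise model.
rng = np.random.default_rng(0)
V, k = 50, 5
theta = rng.normal(size=V)
score = lambda h, ws: theta[ws]
print(nce_loss_for_batch(score, [0, 1], [3, 7], np.log(np.full(V, 1.0 / V)), k, rng))
```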

The speedup due to using NCE

The NCE parameter update is \frac{cd + v}{cd + k} times faster than the ML update, where
- c is the context size,
- d is the representation dimensionality,
- v is the vocabulary size,
- k is the number of noise samples.

Using diagonal context matrices increases the speedup to \frac{c + v}{c + k}.
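Plugging in values of the order used in the Penn Treebank experiments (assumed here for illustration: c = 2, d = 100, v = 10,000, k = 25) gives a feel for these theoretical ratios; the observed wall-clock speedup reported on the results slide (about 14x) is smaller because other per-update costs are shared between ML and NCE.

```python
def nce_speedup(c, d, v, k, diagonal=False):
    """Theoretical per-update speedup of NCE over ML from the slide's formula."""
    return (c + v) / (c + k) if diagonal else (c * d + v) / (c * d + k)

# Illustrative values roughly matching the Penn Treebank setup:
# 2-word context, 100-D representations, 10K vocabulary, 25 noise samples.
print(nce_speedup(c=2, d=100, v=10_000, k=25))                 # ~45x
print(nce_speedup(c=2, d=100, v=10_000, k=25, diagonal=True))  # ~370x
```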

Practicalities

Normalization: NCE learns a different normalizing parameter for each context present in the training set. For large context sizes and datasets the number of such parameters can get very large. Fortunately, learning works just as well if the normalizing parameters are fixed to 1. When evaluating the model, the model distributions are normalized explicitly.

Noise distribution: a unigram model estimated from the training data.
- Use several noise samples per datapoint.
- Generate new noise samples before each parameter update.
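A minimal sketch of the noise-distribution side of this recipe, assuming the training data is already a sequence of integer word ids (the corpus and sizes below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def unigram_noise(corpus_ids, vocab_size):
    """Empirical unigram distribution over the vocabulary, used as P_n."""
    counts = np.bincount(corpus_ids, minlength=vocab_size).astype(float)
    return counts / counts.sum()

# Toy corpus of word ids (illustrative only).
corpus_ids = rng.integers(0, 100, size=10_000)
P_n = unigram_noise(corpus_ids, vocab_size=100)

k = 25
# Fresh noise samples are drawn for each datapoint before every parameter update.
noise_samples = rng.choice(100, size=k, p=P_n)
print(noise_samples[:10])
```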

Penn Treebank results

Model: LBL with 100-D feature vectors and a 2-word context.

Dataset: Penn Treebank (news stories from the Wall Street Journal).
- Training set: 930K words
- Validation set: 74K words
- Test set: 82K words
- Vocabulary: 10K words

Models are evaluated by their test set perplexity. Perplexity is the geometric average of 1/P(w | h); the perplexity of a uniform distribution over N values is N.
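The perplexity definition in code, computed in log space as it usually is in practice (the function name is illustrative):

```python
import numpy as np

def perplexity(probs):
    """Perplexity of a model on a test set, given P(w_i | h_i) for each test word.

    Equals exp of the mean negative log-likelihood, i.e. the geometric average
    of 1/P(w | h).
    """
    probs = np.asarray(probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

# Sanity check: a uniform distribution over N values has perplexity N.
N = 10_000
print(perplexity(np.full(100, 1.0 / N)))   # 10000.0
```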

Results: varying the number of noise samples

TRAINING    NUMBER OF    TEST    TRAINING
ALGORITHM   SAMPLES      PPL     TIME (H)
ML          -            163.5   21
NCE         1            192.5   1.5
NCE         5            172.6   1.5
NCE         25           163.1   1.5
NCE         100          159.1   1.5

NCE training is 14 times faster than ML training in this setup. The number of samples has little effect on the training time because the cost of computing the predicted representation dominates the cost of the NCE-specific computations.

Results: the effect of the noise distribution

NUMBER OF    PPL USING        PPL USING
SAMPLES      UNIGRAM NOISE    UNIFORM NOISE
1            192.5            291.0
5            172.6            233.7
25           163.1            195.1
100          159.1            173.2

The empirical unigram distribution works much better than the uniform distribution for generating noise samples. As the number of noise samples increases, the choice of the noise distribution becomes less important.

Application: MSR Sentence Completion Challenge

Task: given a sentence with a missing word, find the correct completion from a list of candidate words.
- Test set: 1,040 sentences from five Sherlock Holmes novels.
- Training data: 522 19th-century novels from Project Gutenberg (48M words).
- Five candidate completions per sentence, so random guessing gives 20% accuracy.

Sample questions

The stage lost a fine ____, even as science lost an acute reasoner, when he became a specialist in crime.
a) linguist  b) hunter  c) actor  d) estate  e) horseman

During two years I have had three ____ and one small job, and that is absolutely all that my profession has brought me.
a) cheers  b) jackets  c) crackers  d) fishes  e) consultations

Question generation process (MSR)

Automatic candidate generation:
1. Pick a sentence containing an infrequent target word (frequency < 10^{-4}).
2. Sample 150 unique infrequent candidates for replacing the target word from a language model with a context of size 2.
3. If the correct completion scores lower than any of the candidates, discard the sentence.
4. Compute the probability of the word following the candidate using the LM and keep the 30 highest-scoring completions.

Human judges pick the top 4 completions using the following guidelines:
1. Discard grammatically incorrect sentences.
2. The correct completion should be clearly better than the alternatives.
3. Prefer alternatives that require some thought to answer correctly.
4. Prefer alternatives that require understanding properties of entities mentioned in the sentence.

LBL for sentence completion

We used LBL models with two extensions:
- Diagonal context matrices, for better scalability w.r.t. word representation dimensionality.
- Separate representation tables for context words and the next word.

Handling sentence boundaries: use a special out-of-sentence token for context positions outside of the sentence containing the word being predicted.

Word representation dimensionality: 100, 200, or 300.
Context size: 2-10.
Training time (48M words, 80K vocabulary): 1-2 days on a single core. Estimated ML training time: 1-2 months.
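The slides do not spell out how a trained LM picks a completion; a natural approach, assumed in the sketch below, is to fill each candidate into the blank, score the resulting sentence by summing per-word log-probabilities (padding out-of-sentence context positions with the special token), and return the highest-scoring candidate. The `log_prob(word, context)` callable and all names are hypothetical placeholders for a trained model.

```python
import numpy as np

def complete_sentence(words, blank_index, candidates, log_prob, context_size):
    """Pick the candidate that gives the filled-in sentence the highest LM score.

    `log_prob(word, context)` is assumed to return log P(word | context) under a
    trained language model; the whole-sentence scoring rule is an assumption,
    not taken from the slides.
    """
    def sentence_logprob(sent):
        total = 0.0
        for i, w in enumerate(sent):
            # Pad context positions outside the sentence with an out-of-sentence token.
            context = ["<oos>"] * max(0, context_size - i) + sent[max(0, i - context_size):i]
            total += log_prob(w, context)
        return total

    scores = []
    for cand in candidates:
        filled = words[:blank_index] + [cand] + words[blank_index + 1:]
        scores.append(sentence_logprob(filled))
    return candidates[int(np.argmax(scores))]
```

Scoring the whole sentence rather than only the windows containing the blank is a design choice made here for simplicity; only the positions whose contexts include the candidate actually change between completions.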

Sentence completion results

METHOD    CONTEXT SIZE    LATENT DIM    TEST PPL    PERCENT CORRECT
CHANCE    0               -             -           20.0
3-GRAM    2               -             130.8       36.0
4-GRAM    3               -             122.1       39.1
5-GRAM    4               -             121.5       38.7
6-GRAM    5               -             121.7       38.4
LSA       SENTENCE        300           -           49
RNN       SENTENCE        ?             ?           45
LBL       2               100           145.5       41.5
LBL       3               100           135.6       45.1
LBL       5               100           129.8       49.3
LBL       10              100           124.0       50.0
LBL       10              200           117.7       52.8
LBL       10              300           116.4       54.7
LBL       10 2            100           38.6        44.5

LBL with a 10-word context and 300-D word feature vectors sets a new accuracy record for the dataset.

Conclusions

Noise-contrastive estimation provides a fast and simple way of training neural language models:
- Over an order of magnitude faster than maximum-likelihood estimation.
- Very stable, even when using one noise sample per datapoint.
- Models trained using NCE with 25 noise samples per datapoint perform as well as the ML-trained ones.

Large LBL models trained with NCE achieve state-of-the-art performance on the MSR Sentence Completion Challenge dataset.