Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. Language Models. Tobias Scheffer

Stochastic Language Models. A stochastic language model is a probability distribution over sequences of words. Given a string of words w_1, ..., w_m, a language model assigns a probability p(w_1, ..., w_m). The units w_1, ..., w_m can be words, letters, or keystrokes. Useful for many (most) NLP tasks: speech recognition, spell checking, auto-correct, auto-complete, machine translation, text classification, and many non-standard NLP problems.

Language Models: Why? Speech recognition: acoustic model + language model. The acoustic model proposes candidate word sequences; the language model scores them, e.g. P(I saw a tree) vs. P(Eyes awe entry).

Language Models: Why? Hand-written text recognition: pattern-recognition model + language model. The pattern-recognition model proposes candidate readings; the language model scores them, e.g. P(I saw a tree) vs. P(1 saw a free).

Language Models: Why? Machine translation: translation model + language model. The translation model proposes candidate translations; the language model scores them, e.g. P(I saw a tree) vs. P(I saw one tree).

Language Models: Why? Auto-correct, auto-complete: keyboard model + language model. The keyboard model accounts for likely typing errors; the language model scores candidates, e.g. P(I saw a tree) vs. P(O saq a trew).

The n-gram Model. Basic tool for language modeling. Based on the Markov assumption of order n−1: p(X_t | X_{t−1}, ..., X_1) = p(X_t | X_{t−1}, ..., X_{t−n+1}). n-gram model: p(X_1, ..., X_T) = p(X_1) p(X_2 | X_1) ⋯ p(X_T | X_{T−1}, ..., X_{T−n+1}) = p(X_1) ⋯ p(X_{n−1} | X_{n−2}, ..., X_1) ∏_{t=n}^{T} p(X_t | X_{t−1}, ..., X_{t−n+1}). The conditional distributions are categorical.
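
To make the factorization concrete, here is a minimal Python sketch (not from the slides) that scores a sentence under a bigram model (n = 2); the probability tables are toy values chosen for illustration.

def sentence_probability(words, unigram_prob, cond_prob):
    # p(X_1, ..., X_T) = p(X_1) * prod_{t=2}^{T} p(X_t | X_{t-1}) under a first-order Markov assumption
    p = unigram_prob[words[0]]
    for prev, w in zip(words, words[1:]):
        p *= cond_prob[(prev, w)]
    return p

unigram_prob = {"i": 0.2, "saw": 0.1, "a": 0.3, "tree": 0.05}            # toy values
cond_prob = {("i", "saw"): 0.3, ("saw", "a"): 0.4, ("a", "tree"): 0.01}  # toy values
print(sentence_probability(["i", "saw", "a", "tree"], unigram_prob, cond_prob))  # 0.00024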

The n-gram Model. See lecture on basic models. Inference: determine p(w_1, ..., w_T). Parameter estimation: determine θ_{x_1, ..., x_m} by counting occurrences. Laplace smoothing for regularization: Laplace smoothing assigns positive probability to unseen n-grams.
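
As an illustration of counting-based estimation with Laplace smoothing, here is a minimal bigram sketch (the toy corpus and names are illustrative, not from the slides):

from collections import Counter

corpus = ["i saw a tree", "i saw a car"]
tokens = [sentence.split() for sentence in corpus]
vocabulary = sorted({w for sentence in tokens for w in sentence})
V = len(vocabulary)

unigram_counts = Counter(w for sentence in tokens for w in sentence)
bigram_counts = Counter(pair for sentence in tokens for pair in zip(sentence, sentence[1:]))

def p_laplace(w, prev):
    # (count(prev, w) + 1) / (count(prev) + V): unseen bigrams receive positive probability
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("tree", "a"))    # seen bigram
print(p_laplace("tree", "saw"))  # unseen bigram, still > 0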

Implementing the n-gram Model. Probabilities of word sequences decrease exponentially in T and quickly fall below floating-point precision. Instead of θ_{x_1, ..., x_m}, use parameters θ_{x_m | x_1, ..., x_{m−1}}. Do all calculations using logarithmic values: log ∏_t p(w_t | w_{t−1}, ..., w_{t−n+1}) = Σ_t log p(w_t | w_{t−1}, ..., w_{t−n+1}).
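
A minimal sketch of the log-space computation (toy values, not from the slides):

import math

def log_score(words, cond_log_prob):
    # sum of log probabilities instead of a product of probabilities, so long sequences do not underflow
    return sum(cond_log_prob[(prev, w)] for prev, w in zip(words, words[1:]))

cond_log_prob = {("i", "saw"): math.log(0.3), ("saw", "a"): math.log(0.4), ("a", "tree"): math.log(0.01)}
print(log_score(["i", "saw", "a", "tree"], cond_log_prob))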

n-gram Model: Long-Term Dependencies. "The computer that I just installed the new operating system on crashed." Small values of n: better estimates of n-gram probabilities, but lack of context. Increasing n to 4, 5, 6, ...: there will be increasingly many combinations of word n-grams that have never been observed.

Linear Interpolation. Simple interpolation: P(w_n | w_{n−1}, ..., w_1) = λ_n P_n(w_n | w_{n−1}, ..., w_1) + ⋯ + λ_2 P_2(w_n | w_{n−1}) + λ_1 P_1(w_n). Context-dependent interpolation weights: P(w_n | w_{n−1}, ..., w_1) = λ_n(w_{n−1}, w_{n−2}) P_n(w_n | w_{n−1}, ..., w_1) + ⋯ + λ_2(w_{n−1}, w_{n−2}) P_2(w_n | w_{n−1}) + λ_1(w_{n−1}, w_{n−2}) P_1(w_n).
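
A minimal sketch of simple linear interpolation with fixed weights (the probability tables and weights are illustrative):

def p_interpolated(w, context, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    # weighted combination of trigram, bigram and unigram estimates; the weights sum to 1
    l3, l2, l1 = lambdas
    return (l3 * p3.get((context[-2], context[-1], w), 0.0)
            + l2 * p2.get((context[-1], w), 0.0)
            + l1 * p1.get(w, 0.0))

p3 = {("i", "saw", "a"): 0.5}   # toy trigram estimates
p2 = {("saw", "a"): 0.4}        # toy bigram estimates
p1 = {"a": 0.1}                 # toy unigram estimates
print(p_interpolated("a", ["i", "saw"], p3, p2, p1))  # 0.6*0.5 + 0.3*0.4 + 0.1*0.1 = 0.43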

Linear Interpolation. Setting the interpolation coefficients: split the training data into 80% training and 20% tuning data. Estimate the parameters θ_{x_1, ..., x_m} on the training part. Then tune the parameters λ_i to maximize the likelihood of the tuning data.
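
One simple way to tune the weights, sketched below, is a grid search that maximizes the log-likelihood of the tuning data. It assumes the p_interpolated function from the previous sketch and a hypothetical list tuning_events of (word, context) pairs taken from the tuning split; the slide does not prescribe a particular optimization method, so grid search is just one option.

import math
from itertools import product

def tune_lambdas(tuning_events, p3, p2, p1, step=0.1):
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l3, l2 in product(grid, grid):
        l1 = 1.0 - l3 - l2
        if l1 < 0:
            continue
        # log-likelihood of the tuning data under the interpolated model
        ll = sum(math.log(max(p_interpolated(w, c, p3, p2, p1, (l3, l2, l1)), 1e-12))
                 for w, c in tuning_events)
        if ll > best_ll:
            best, best_ll = (l3, l2, l1), ll
    return best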

Resources. Google has published a large n-gram corpus: http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html

Resources. Google has published data on the evolution of n-gram counts over time from Google Books.

Limitations of the n-gram Model. The capability of reflecting context is limited to n words. There is an independent parameter for each n-word combination: semantically similar terms have independent parameters. Idea: improve generalization by treating semantically similar words in a similar way. Linear interpolation is one attempt at improving generalization; n-gram class models are another. Continuous-space language models.

Continuous-Space Language Models. Predict w_t based on features extracted from w_{t−1}, ..., w_{t−n+1}. Maximize ∏_t P(w_t | w_{t−1}, ..., w_{t−n+1}). [Figure: the language model reads inputs x_{t−n+1}, ..., x_{t−1} and outputs a prediction for x_t, for each position t.]

Continuous-Space Language Models. Predict w_t based on features extracted from w_{t−n}, ..., w_{t+n}. Maximize ∏_t ∏_{j=−n..+n, j≠0} P(w_t | w_{t+j}). The model also looks into the future. [Figure: the language model reads inputs x_{t−n}, ..., x_{t+n} and outputs a prediction for x_t, for each position t.]

Continuous-Space Language Models. Skip-gram model: predict w_{t−n}, ..., w_{t−1}, w_{t+1}, ..., w_{t+n} based on features extracted from w_t. Maximize ∏_t ∏_{j=−n..+n, j≠0} P(w_{t+j} | w_t). [Figure: the language model reads the input x_t and outputs predictions for the surrounding words.]
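
The following sketch generates the (center word, context word) training pairs that a skip-gram model is trained on; the window size and the toy sentence are illustrative.

def skipgram_pairs(tokens, n=2):
    # for every position t, pair w_t with each w_{t+j}, j = -n..+n, j != 0
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("i saw a tree".split(), n=1))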

Continuous-Space Language Models. For a word-level language model, words are usually represented by a one-hot coded feature vector: x_t contains a single 1 at the dictionary position of the word, e.g., at the position of "Hippopotamus" in the dictionary Aardvark, ..., Hippie, Hippopotamus, Hipster, ..., Zyzzogeton.

Continuous-Space Language Models. For a letter-level language model, letters are usually represented by a one-hot coded feature vector: x_t contains a single 1 at the dictionary position of the character, with a dictionary such as A, ..., Z, punctuation and space characters (" ", "!", ...), and digits 0, ..., 9.
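
A minimal sketch of one-hot coding against a fixed dictionary (the dictionary is illustrative):

import numpy as np

def one_hot(token, dictionary):
    # feature vector with a single 1 at the dictionary position of the token
    x = np.zeros(len(dictionary))
    x[dictionary.index(token)] = 1.0
    return x

dictionary = ["aardvark", "hippie", "hippopotamus", "hipster", "zyzzogeton"]
print(one_hot("hippopotamus", dictionary))  # [0. 0. 1. 0. 0.]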

Continuous-Space Language Models. Continuous-space language models are usually implemented by neural networks. Forward propagation leads to activation of the hidden units. The activation of the hidden units creates the embedding (feature representation) φ(x_t) of each word. This feature representation φ(x_t) is useful for many tasks: words that occur in similar contexts have similar feature representations.

Neural Language Model. [Figure: one-hot input vectors x_{t−n+1}, ..., x_{t−1} feed into a layer of rectified linear units, whose activation is the embedding φ(x_t); a softmax output layer predicts x_t, e.g., with probabilities (0.1, 0.6, 0.1, 0.2, ...).] Also uses the Markov assumption of order n−1.
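
A minimal PyTorch sketch of such an architecture (layer sizes and hyperparameters are illustrative, not taken from the slides): the n−1 previous words are embedded, passed through a rectified linear hidden layer, and a softmax output layer predicts the next word.

import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    def __init__(self, vocab_size, context=2, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # replaces explicit one-hot inputs
        self.hidden = nn.Linear(context * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # softmax is applied by the loss

    def forward(self, context_ids):                             # shape (batch, context)
        e = self.embed(context_ids).flatten(1)                  # concatenated embeddings
        phi = torch.relu(self.hidden(e))                        # rectified linear hidden units
        return self.out(phi)                                    # unnormalized next-word scores

model = NeuralLM(vocab_size=1000)
logits = model(torch.tensor([[3, 17]]))                         # IDs of the two previous words
loss = nn.functional.cross_entropy(logits, torch.tensor([42])) # next-word target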

Neural Skip-Gram Language Model. [Figure: the one-hot input vector x_t is mapped to its embedding φ(x_t), e.g., (0.1, 0.1, 0.2, 0.8); softmax output layers predict the surrounding words, e.g., with probabilities (0.1, 0.6, 0.1, 0.2, ...).]

Language Model with Context Memory. Additionally provide the paragraph ID as input. This allows the model to learn an embedding for paragraphs in addition to the embedding for words. Paragraph vector: one-hot encoding of the paragraph ID, e.g., p_t contains a single 1 at the position of paragraph 17.

Neural Language Model with Context. [Figure: each input is a pair (x_t, p_t) of a one-hot word vector and a one-hot paragraph vector; rectified linear units compute the embedding φ(x_t), and a softmax output layer predicts x_t, e.g., with probabilities (0.1, 0.6, 0.1, 0.2, ...).]

Using Word Embeddings. Feature vectors φ(x_t) can replace the one-hot coding of words in many applications: text classification (aggregate over the text), sentiment analysis, information extraction, ...

Using Language Models. Auto-correct: find the most likely intended sentence w_1, ..., w_T given the word entries x_1, ..., x_T. (w_1, ..., w_T)* = argmax_{w_1, ..., w_T} P(w_1, ..., w_T | x_1, ..., x_T) = argmax_{w_1, ..., w_T} P(x_1, ..., x_T | w_1, ..., w_T) P(w_1, ..., w_T) = argmax_{w_1, ..., w_T} [∏_t P(x_t | w_t)] P(w_1, ..., w_T). The term ∏_t P(x_t | w_t) is learned from a corpus of spelling mistakes; P(w_1, ..., w_T) is the language model. Naive maximization is exponential in T.

Using Language Models. Auto-correct: find the most likely intended sentence w_1, ..., w_T given the word entries x_1, ..., x_T. (w_1, ..., w_T)* = argmax_{w_1, ..., w_T} P(w_1, ..., w_T | x_1, ..., x_T) = argmax_{w_1, ..., w_T} P(x_1, ..., x_T | w_1, ..., w_T) P(w_1, ..., w_T) = argmax_{w_1, ..., w_T} [∏_t P(x_t | w_t)] P(w_1, ..., w_T). When the language model is based on an n-th order Markov assumption, maximization by the Viterbi algorithm is exponential in n (not in T).

Using Language Models. Auto-correct: find the most likely intended sentence w_1, ..., w_T given the word entries x_1, ..., x_T. (w_1, ..., w_T)* = argmax_{w_1, ..., w_T} P(w_1, ..., w_T | x_1, ..., x_T) = argmax_{w_1, ..., w_T} P(x_1, ..., x_T | w_1, ..., w_T) P(w_1, ..., w_T) = argmax_{w_1, ..., w_T} [∏_t P(x_t | w_t)] P(w_1, ..., w_T). If O(k^n) is still too slow, use beam search instead of maximizing over all sequences.
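
A minimal beam-search sketch for this auto-correct setting (names are illustrative): it assumes a per-word channel model P(x_t | w_t) and a bigram language model P(w_t | w_{t−1}), both given as log-probability lookup tables, and a candidates function that proposes plausible intended words for each observed word.

def beam_search(observed, candidates, log_channel, log_bigram, beam_width=3):
    beams = [([], 0.0)]                                  # (partial sentence, log score)
    for x in observed:
        scored = []
        for words, score in beams:
            prev = words[-1] if words else "<s>"
            for w in candidates(x):
                s = (score
                     + log_channel.get((x, w), -20.0)    # log P(x | w), with a floor for unseen pairs
                     + log_bigram.get((prev, w), -20.0)) # log P(w | prev), with a floor
                scored.append((words + [w], s))
        beams = sorted(scored, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

The beam width trades accuracy against speed: it bounds how many partial sentences are kept per position instead of enumerating all combinations.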

Evaluating Language Models. Extrinsic evaluation: how well does the model perform at the task that it is intended for? Machine translation: apply the translation system with different language models, then have human judges score the output sentences. Auto-completion: install two models on smartphones and measure how frequently users accept the proposed completions.

Evaluating Language Models. Intrinsic evaluation: does the language model assign a high likelihood to actual sentences? Cannot evaluate on the training corpus: it has been used to train the model, and a high likelihood there does not imply that the model will assign a high likelihood to out-of-corpus sentences. Training-test split: estimate the parameters on 80% of the corpus; evaluate the model on the remaining 20%.

Evaluating Language Models. What is a good evaluation measure? Log-likelihood of the test corpus? Possible to compare multiple language models, but not possible to compare sentences, because short sentences tend to have a higher likelihood than long ones (fewer factors). Perplexity of the test corpus: inverse likelihood, normalized by the number of words. PP(w_1, ..., w_T) = P(w_1, ..., w_T)^{−1/T} = (∏_t 1/P(w_t | w_{t−1}, ...))^{1/T}.

Evaluating Language Models. Perplexity of the test corpus: inverse likelihood, normalized by the number of words. PP(w_1, ..., w_T) = P(w_1, ..., w_T)^{−1/T} = (∏_t 1/P(w_t | w_{t−1}, ...))^{1/T}. Example: if p(w_t) = 1/10 for every word, then PP(w_1, ..., w_T) = ((1/10)^T)^{−1/T} = 10. Perplexity: average branching factor.
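
A minimal sketch of computing perplexity from the per-word conditional probabilities of a test corpus, done in log space for numerical stability:

import math

def perplexity(cond_probs):
    # cond_probs: P(w_t | w_{t-1}, ...) for every word of the test corpus
    T = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / T)

print(perplexity([0.1] * 50))  # every word has probability 1/10 -> perplexity 10.0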

Evaluating Language Models. Different corpora reflect different distributions: a model that has been trained on the Wall Street Journal corpus may assign a low likelihood to sentences from a belletristic corpus. If a language model infers P(w_1, ..., w_T) = 0, then the perplexity is undefined. This is a sign of overfitting to the training data and a lack of regularization.

Summary. Stochastic language models quantify the likelihood of a sentence. Markov assumption of order n−1: word w_t depends only on w_{t−1}, ..., w_{t−n+1}. The n-gram model has a discrete parameter for each n-word combination. Continuous-space (neural) language models learn an embedding of words into a feature space; this gives better generalization. Language models are useful tools for many applications.