Language Models. Data Science: Jordan Boyd-Graber, University of Maryland. SLIDES ADAPTED FROM PHILIPP KOEHN

Language models

Language models answer the question: how likely is it that a string of English words is good English?
- Autocomplete on phones and in web search
- Creating English-looking documents
- Very common in machine translation systems
  - Help with reordering / style: p_LM(the house is small) > p_LM(small the is house)
  - Help with word choice: p_LM(I am going home) > p_LM(I am going house)
- Use conditional probabilities

N-Gram Language Models

- Given: a string of English words W = w_1, w_2, w_3, ..., w_n
- Question: what is p(W)?
- Sparse data: many good English sentences will not have been seen before
- Decomposing p(W) using the chain rule:

  p(w_1, w_2, w_3, \dots, w_n) = p(w_1) \, p(w_2 \mid w_1) \, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1})

- Not much gained yet: p(w_n \mid w_1, w_2, \dots, w_{n-1}) is equally sparse
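
As a concrete illustration (a worked expansion I am adding, not from the original slide), the chain rule applied to the four-word example sentence "the house is small" from the previous slide reads:

  p(\text{the house is small}) = p(\text{the}) \cdot p(\text{house} \mid \text{the}) \cdot p(\text{is} \mid \text{the house}) \cdot p(\text{small} \mid \text{the house is})

Each factor conditions on the entire preceding context, which is why the final factors are just as sparse as the full sentence probability.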

Markov Chain

- Markov independence assumption: only the previous history matters
- Limited memory: only the last k words are included in the history (older words are less relevant)
- This gives a kth order Markov model
- For instance, a 2-gram (bigram) language model:

  p(w_1, w_2, w_3, \dots, w_n) \approx p(w_1) \, p(w_2 \mid w_1) \, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})

- What is conditioned on (here w_{i-1}) is called the history
- How do we estimate these probabilities?
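
Under the bigram assumption, the same example sentence factorizes into far less sparse terms (again a worked expansion I am adding, not from the slide):

  p(\text{the house is small}) \approx p(\text{the}) \cdot p(\text{house} \mid \text{the}) \cdot p(\text{is} \mid \text{house}) \cdot p(\text{small} \mid \text{is})

Every factor now conditions on only one previous word, so each can be estimated from bigram counts.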

How do we estimate a probability?

Suppose we want to estimate P(w_n = home | h = go).

Observed words following "go": home, home, big, with, to, big, with, to, and, money, and, home, big, and, home, money, home, and, big, to

The maximum likelihood (ML) estimate of the probability is:

  \hat{\theta}_i = \frac{n_i}{\sum_k n_k}    (1)

where n_i is the number of times word i follows the history.
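
A minimal sketch of this ML estimate in Python, computed over the observed-word list above (the variable names are mine, not from the slides):

    from collections import Counter

    # Words observed after the history "go" (the list from the slide above)
    observed = ("home home big with to big with to and money "
                "and home big and home money home and big to").split()

    counts = Counter(observed)            # n_i for each word i
    total = sum(counts.values())          # sum_k n_k = 20

    # Maximum likelihood estimate: theta_i = n_i / sum_k n_k
    mle = {word: n / total for word, n in counts.items()}

    print(mle["home"])                    # 5 / 20 = 0.25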

Example: 3-Gram

Counts for trigrams and estimated word probabilities for the history "the red" (total: 225):

  word    count   prob.
  cross     123   0.547
  tape       31   0.138
  army        9   0.040
  card        7   0.031
  ,           5   0.022

- 225 trigrams in the Europarl corpus start with "the red"
- 123 of them end with "cross"
- The maximum likelihood probability is 123/225 ≈ 0.547
- Is this reasonable?
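
The same kind of estimate computed from the trigram count table (a sketch; the counts and the total of 225 are the slide's numbers, which is why the total is not simply the sum of the five listed continuations):

    # Five most frequent continuations of the history "the red" (Europarl counts from the slide)
    counts = {"cross": 123, "tape": 31, "army": 9, "card": 7, ",": 5}
    total = 225   # all trigrams starting with "the red"

    mle = {word: n / total for word, n in counts.items()}
    print(round(mle["cross"], 3))   # 123 / 225 ≈ 0.547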

The problem with maximum likelihood estimates: Zeros

If there were no occurrences of "bageling" after the history "go", we'd get a zero estimate:

  \hat{P}(\text{bageling} \mid \text{go}) = \frac{T_{\text{go,bageling}}}{\sum_{w \in V} T_{\text{go},w}} = 0

We will then get p(d) = 0 for any sentence d that contains "go bageling"! Zero probabilities cannot be conditioned away.
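
To spell out why a single zero is fatal (an illustrative expansion with a made-up sentence, not from the slide): under a bigram decomposition the sentence probability is a product, so one zero factor zeroes everything:

  p(\text{I go bageling every day}) = p(\text{I}) \cdot p(\text{go} \mid \text{I}) \cdot \underbrace{p(\text{bageling} \mid \text{go})}_{=\,0} \cdot p(\text{every} \mid \text{bageling}) \cdot p(\text{day} \mid \text{every}) = 0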

Add-One Smoothing

- Equivalent to assuming a uniform prior over all possible distributions over the next word (you'll learn why later)
- But there are many more unseen n-grams than seen n-grams
- Example: Europarl bigrams
  - 86,700 distinct words
  - 86,700^2 = 7,516,890,000 possible bigrams
  - but only about 30,000,000 words (and bigrams) in the corpus
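
A minimal sketch of add-one (Laplace) smoothing for this kind of estimate, contrasted with the unsmoothed ML estimate that produced the zeros above; the counts reuse the earlier "go" example, and the extra vocabulary items are toy values I added for illustration:

    from collections import Counter

    # Counts of words observed after the history "go" (from the earlier slide)
    counts = Counter({"home": 5, "big": 4, "and": 4, "to": 3, "with": 2, "money": 2})
    vocab = set(counts) | {"bageling", "house", "small"}   # toy vocabulary V
    V = len(vocab)
    total = sum(counts.values())

    def p_mle(word):
        # Unsmoothed maximum likelihood: zero for unseen words
        return counts[word] / total

    def p_add_one(word):
        # Add-one smoothing: add 1 to every count, add |V| to the denominator
        return (counts[word] + 1) / (total + V)

    print(p_mle("bageling"))       # 0.0 -> zeros out any sentence containing "go bageling"
    print(p_add_one("bageling"))   # 1 / 29, small but nonzero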